From keld@dkuug.dk Tue Aug 30 23:53:47 1994
Received: by dkuug.dk id AA24863 (5.65c8/IDA-1.4.4j for wg15rin); Tue, 30 Aug 1994 21:53:48 +0200
Message-Id: <199408301953.AA24863@dkuug.dk>
From: keld@dkuug.dk (Keld J|rn Simonsen)
Date: Tue, 30 Aug 1994 21:53:47 +0200
X-Charset: ASCII
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: wg15rin@dkuug.dk
Subject: X/Open paper on 10646

Hello RINners!

Here is the X/Open paper I talked about distributing to you. I will use most of this, together with SC22 resolutions in the paper I am going to draft.

Keld

----

The author is Sandra Martin O'Donnel, OSF.

Subject: (XoJIG 944) Draft ISO 10646 Paper
Date: Tue, 04 May 93 09:55:42 -0400
From: martin@osf.org

Folks --

Attached is the draft of the ISO 10646 strategy paper that I got conned into agreeing to write. This completes action item JIG 930203. What? You say you remember that I was supposed to complete this in early *April* rather than early *May*? Details, details. :-)

Those of you who are OSF members will probably be able to tell that this document is a descendant of a similar one I did for OSF. However, there are major revisions, and the recommendations definitely are different (given the differences in the kinds of organizations OSF and X/Open are, and the beliefs of XoJIG). I also have incorporated bits and pieces of many of the documents that were circulated during the March, 1993 email deluge.

Note that for several topics, XoJIG does not have a consensus position. In those cases, I wrote my own recommendation, but I definitely took into account our group's opinions. Some of the recommendations may change once the whole group sees them.

I'm sure you will have suggestions for changes to this draft, but at this point, I'm particularly interested in hearing whether you think any major topics need to be added.
One that I deliberately left out was the issue of code set conversion. This took up a lot of the email volume in March. However, in re-reading the messages, it appeared that, while there are numerous issues to face in doing conversions, every code set brings with it its own set of issues. Therefore, I didn't see what I needed to bring out that was ISO 10646-specific. If you disagree, I'm sure you will let me know. :-)

FYI, I will be away on business May 10-16 and will not be reading email or responding to comments during that time.

Okay, enjoy.

-- Sandra

---------------------------------------------------------------------
Sandra Martin                     email: martin@osf.org
Open Software Foundation          phone: +1 (617) 621-8707
11 Cambridge Center               fax:   +1 (617) 225-2782
Cambridge, MA 02142 USA
---------------------------------------------------------------------

*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT

X/OPEN STRATEGY FOR THE UNIVERSAL CODE SET ISO/IEC 10646
--------------------------------------------------------
X/Open-UniForum Joint Internationalization Group (XoJIG)
Version 1.0
May, 1993

With the approval of ISO/IEC 10646 universal code set as an international standard, and with activities underway at some companies to implement systems that use forms of the code set, there is increasing interest in having X/Open define its strategy toward ISO 10646. This paper covers some of the possibilities, their benefits and costs, and recommends a course of action.

The paper is organized as follows:

   Definitions
   Overview of ISO 10646
   Possible Uses for ISO 10646
      As a Multi-byte Code
      As a Wide Character (wchar_t) Process Code
      As an Interchange Code
   Other Ways to Use ISO 10646
      ISO 10646-Specific Data Types and Interfaces
      ISO 10646 as a Well-Known Process Code
   Implementations of ISO 10646 or Unicode Systems
   Recommendations

[Note: In this paper, a leading "0x" denotes a hexadecimal number.]
DEFINITIONS
-----------

Byte -- (from ISO C) The unit of data storage large enough to hold any member of the basic character set of the execution environment.... A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. ...

Character -- a member of a set of elements used for the organization, control, or representation of data.

Combining character -- a member of a subset of the coded character set which is intended for combination with a preceding non-combining graphic character, or with a sequence of combining characters preceded by a non-combining character.

Composite character sequence -- a sequence of graphic characters consisting of a non-combining character followed by one or more combining characters.

Graphic symbol -- the visual representation of a graphic character or a composite sequence.

Octet -- an ordered sequence of eight bits considered to be a unit.

Pre-composed character (or non-combining character) -- a character which has an independent graphic symbol.

Unicode -- defined by Unicode Consortium. A profile of ISO 10646, same encoding as ISO 10646 BMP (Basic Multilingual Plane where group=0 and plane=0).

OVERVIEW OF ISO 10646
---------------------

Commonly used code sets/encoding methods such as ASCII, ISO 8859-1, and Japanese EUC include characters for a single language or small group of languages. Because of this, users are limited to the languages their current code set supports. If they use ISO 8859-1, which supports Western European languages only, it is not possible to include, say, Japanese, Greek, or Arabic characters in their text.

Some applications and users need mixtures of languages that current code sets do not support. Therefore, the goal in creating ISO 10646 was to include all characters from all significant languages; to be what the standard calls a "Universal Coded Character Set" (UCS).
The initial version of 10646 contains approximately 33,000 characters covering a long list of languages including European, Asian ideographic, Middle Eastern, Indian, and others. It also reserves 6,000 code spaces for private use.

ISO 10646 is based heavily on a code set called Unicode. Unicode was developed primarily by Xerox and Apple, although other companies contributed to its design. People often use "10646" and "Unicode" interchangeably, although there are differences between the two sets. This paper uses each term as appropriate.

ISO 10646 differs in some ways from code sets currently used on XPG-compliant systems. Many currently supported code sets include portable characters as single-octet entities and with code values matching either ISO 646 IRV:1991 or a form of EBCDIC. The ISO 646 IRV values are in the range 0x00-0x7f (0-127 decimal). It is common for existing software to depend on one or more ISO 646 IRV values (particularly control characters), and on the fact that such characters are always one octet (the de facto standard size of a byte) each.

Characters in ISO 10646, in contrast, are encoded in multiple octets. Code space is divided into four units like this:

   +--------------+---------------+-------------+--------------+
   | Group-octet  |  Plane-octet  |  Row-octet  |  Cell-octet  |
   +--------------+---------------+-------------+--------------+

10646 allows two basic forms for characters:

1. UCS-2 (Universal Coded Character Set-2). Also known as the Basic Multilingual Plane (BMP). Characters are encoded in the lower two octets (row and cell). Predictions are that this will be the most commonly used form of 10646.

2. UCS-4 (Universal Coded Character Set-4). Characters are encoded in the full four octets.
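The four-octet layout above can be sketched in C. This is only an illustration: the struct and function names (ucs4_octets, ucs4_split, ucs4_in_bmp, ucs4_to_ucs2) are invented for this example and are not part of ISO 10646 or any X/Open interface. It assumes a UCS-4 value is held in a 32-bit unsigned integer with the group octet most significant.

```c
#include <assert.h>
#include <stdint.h>

/* The four octets of a UCS-4 code value, most significant first. */
struct ucs4_octets {
    uint8_t group;
    uint8_t plane;
    uint8_t row;
    uint8_t cell;
};

/* Split a 32-bit UCS-4 value into its group, plane, row, and cell octets. */
struct ucs4_octets ucs4_split(uint32_t ucs4)
{
    struct ucs4_octets o;
    o.group = (uint8_t)(ucs4 >> 24);
    o.plane = (uint8_t)(ucs4 >> 16);
    o.row   = (uint8_t)(ucs4 >> 8);
    o.cell  = (uint8_t)ucs4;
    return o;
}

/* A character lies in the BMP, and so fits in UCS-2, when both the
 * group and plane octets are zero. */
int ucs4_in_bmp(uint32_t ucs4)
{
    struct ucs4_octets o = ucs4_split(ucs4);
    return o.group == 0 && o.plane == 0;
}

/* Narrow a BMP character to its two-octet UCS-2 value (row, cell). */
uint16_t ucs4_to_ucs2(uint32_t ucs4)
{
    assert(ucs4_in_bmp(ucs4));   /* caller is expected to check first */
    return (uint16_t)ucs4;
}
```

For uppercase "A" (0x00000041), ucs4_split() yields group 0, plane 0, row 0, and cell 0x41, so the character is in the BMP and its UCS-2 value is 0x0041.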
These are the encodings of uppercase "A" in ISO 646 IRV, UCS-2, and UCS-4:

          ISO 646          UCS-2                        UCS-4
        +----------+-------------------+-------------------------------------+
Binary  | 01000001 | 00000000 01000001 | 00000000 00000000 00000000 01000001 |
Hex     |   0x41   |     0x00 0x41     |         0x00 0x00 0x00 0x41         |
        +----------+-------------------+-------------------------------------+

UCS-2 and UCS-4 presently encode exactly the same set of characters, but that is expected to change over time. Unlike the ISO 646-based code sets and encoding methods that many implementations currently support, ISO 10646 encodes portable characters in two or four octets each. In addition, many code sets prohibit any octet of any printable character from being in the control character range (0x00-0x1f and 0x7f), but ISO 10646 makes no such restriction. Notice that in the example above, "A" includes one NULL octet in UCS-2 and three NULL octets in UCS-4.

In addition to the UCS-2 and UCS-4 forms, ISO 10646 also includes an encoding technique in which multiple characters can be combined to form "composite character sequences." The Unicode developers wanted to fit all characters in 16 bits, but they also wanted the code set design to be flexible enough to allow a nearly infinite variety of character combinations. They therefore added the concept of "combining characters."

Suppose you want to encode the letter á (lowercase a with acute accent). This letter-with-diacritic exists in ISO 10646 (code value 0x00 0xe1), but it also is possible to encode it as the plain "a" followed by an acute accent; that is:

   +--------+--------+
   |   a    |   '    |  =  á
   +--------+--------+

In this case, the code value of á is:

   Character:          a            '
   UCS-2 Code Value:   0x00 0x61    0x03 0x01

The resulting composite character consumes four octets -- two for the "a" and two for the acute accent. In ISO 10646, certain characters are defined as "combining diacritical marks", and it is permissible to combine these marks with any non-combining character.
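As a small illustration of the two spellings just described, the following C sketch writes UCS-2 code values out as octets and shows that the pre-composed form takes two octets while the composite sequence takes four. The function name ucs2_to_octets and the two array names are made up for this example. The row-then-cell octet order simply follows the way the code values are written in this paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write UCS-2 code values as octets, row octet first and then cell
 * octet, matching the layout used in the examples in the text.
 * Returns the number of octets written. */
size_t ucs2_to_octets(const uint16_t *chars, size_t count,
                      uint8_t *out, size_t out_size)
{
    assert(out_size >= 2 * count);   /* two octets per code value */
    for (size_t i = 0; i < count; i++) {
        out[2 * i]     = (uint8_t)(chars[i] >> 8);    /* row  */
        out[2 * i + 1] = (uint8_t)(chars[i] & 0xFF);  /* cell */
    }
    return 2 * count;
}

/* The two spellings of "a with acute accent" from the text. */
const uint16_t precomposed_a_acute[] = { 0x00E1 };          /* one code value   */
const uint16_t composite_a_acute[]   = { 0x0061, 0x0301 };  /* base + combining */
```

Both arrays denote the same letter to a reader, yet a byte-for-byte comparison of the two octet sequences reports them as different, which is one source of the programming challenges discussed in this section.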
Any number of combining marks can follow a base character. For example, although this "character" does not exist in any language, this is a permissible encoding in ISO 10646:

   +--------+--------+--------+--------+-------+
   |   p    |   '    |   ~    |   ^    |   `   |  =  (p with acute, tilde, circumflex, and grave)
   +--------+--------+--------+--------+-------+

Some languages are only fully supportable in ISO 10646 through the use of combining characters. Examples include Korean, Arabic, and Thai.

Although combining characters give ISO 10646 great flexibility, they also create programming challenges that do not exist in many commonly used code sets. Because not all implementors want to revise software to handle composite character sequences, ISO 10646 has three conformance levels:

   Level 1: Combining characters are not allowed
   Level 2: Combining characters are allowed for these scripts only: Arabic, Hebrew, Indic, and Thai
   Level 3: Combining characters are allowed, no restrictions

Thus, with ISO 10646, it is possible for an implementation to support one or more of the following:

   UCS-2, Level 1: Two-octet form, no combining characters
   UCS-2, Level 2: Two-octet form, combining characters allowed with restrictions
   UCS-2, Level 3: Two-octet form, combining characters allowed, no restrictions
   UCS-4, Level 1: Four-octet form, no combining characters
   UCS-4, Level 2: Four-octet form, combining characters allowed with restrictions
   UCS-4, Level 3: Four-octet form, combining characters allowed, no restrictions

Unicode R1.1 only allows two-octet code elements and does not support levels. It is equivalent to UCS-2, Level 3.

In addition to the official forms of 10646, one of the standard's informative annexes defines an unofficial form called UTF (UCS Transformation Format). Commonly known as UTF-1, this form provides some compatibility between UCS-2 or UCS-4 and ISO 646 IRV. In UTF-1 form, portable ISO 646 IRV characters shrink back from being two or four octets to being a single octet (that is, they are encoded exactly the same as ISO 646 IRV).
In addition, no octets of any UTF-1 characters can be in the range 0x00-0x20 or 0x7f-0x9f, so no UTF-1 octets have the same values as control characters. Consider these examples:

   Code values for "A"
   -------------------
   ISO 646 IRV:   0x41
   UTF-1:         0x41
   UCS-2:         0x00 0x41

   Code values for "á"
   -------------------
   ISO 8859-1:    0xe1
   UTF-1:         0xa0 0xe1
   UCS-2:         0x00 0xe1

   Code values for Asian ideograph "一"
   ------------------------------------
   JIS X0208:     0x30 0x6c         /* Japanese national std code set */
   UTF-1:         0xf6 0x21 0xd0
   UCS-2:         0x4e 0x00

ISO 646 IRV and UTF-1 values are identical, but other UTF-1 characters consume more octets than do those same characters in other existing standards -- two octets in UTF-1 versus one in ISO 8859-1; three octets in UTF-1 versus two in JIS X0208.

Also note that UTF-1 does not restrict octets from having the same value as ISO 646 IRV slash (/, code value 0x2f). This means UTF-1 is not directly usable as an encoding for file names on most UNIX-based systems because most implementations search one eight-bit byte at a time for slashes.

Because of this and other limitations with UTF-1, a group from XoJIG created a second transformation format. Called FSS-UTF (File System Safe UCS Transformation Format) or UTF-2, this format sets the MSB (Most Significant Bit) of every ISO 646 IRV character to 0 and the MSB of all octets of all other characters to 1.

The following shows how to map UCS-* characters in a given hex range to a UTF-2 value. In the binary values, a "0" or "1" indicates that bit must have the listed value; an "x" indicates the bit can be either 0 or 1.
   Hex Min    Hex Max    UTF-2 Binary Encoding
   --------   --------   -----------------------------------------------------
   00000000   0000007F   0xxxxxxx
   00000080   000007FF   110xxxxx 10xxxxxx
   00000800   0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   00010000   001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   00200000   03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   04000000   7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Although this mapping shows UTF-2 characters being up to six octets long, the current version of ISO 10646 only has UCS characters up to a hex value of FFFF. UTF-2 characters thus would be a maximum of three octets.

UTF-2 provides full ISO 646 IRV (ASCII) transparency (any octet that looks like ISO 646 IRV is ISO 646 IRV), and also is compatible with UNIX-based and other file systems.

Another aspect of ISO 10646 that merits mention is that the standard allows subsetting. An implementation can choose to support a subset of the character code positions within ISO 10646, and be ISO 10646 conformant. Such an implementation must identify the characters in its repertoire.

POSSIBLE USES FOR ISO 10646
---------------------------

There are three major potential uses for any code set -- as a multi-byte code, as a wide character process code, or for data interchange. Here is a discussion of how ISO 10646 fits into each using existing standards and interfaces.

ISO 10646 as a Multi-byte Encoding
----------------------------------

ISO C (ISO 9899:1990) defines the form in which source files are written and specifies that they contain either single-byte or multi-byte characters. It also specifies that a NULL byte (all bits set to zero) terminates character strings, and that the second or subsequent bytes of a multi-byte character may not equal NULL. The common char-based string functions (e.g., strcat(), strcpy() ) all are defined to terminate when they encounter a NULL byte. This definition, together with common practice, makes it nearly impossible to use UCS-2 or UCS-4 as a multi-byte code.
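The effect of the NULL-byte rule can be seen in a few lines of C. The sketch below assumes an eight-bit byte and stores the text "AB" as UCS-2 octets in a char array. The array and function names are invented for the example. strlen() is the standard ISO C function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "AB" stored as UCS-2 octets (row, cell), followed by a two-octet
 * terminator. Every other octet is zero. */
const char ucs2_ab[] = { 0x00, 0x41, 0x00, 0x42, 0x00, 0x00 };

/* The same text in ISO 646 IRV, one octet per character. */
const char iso646_ab[] = "AB";

/* Length as the char-based string functions see it: they stop at the
 * first zero byte, whatever encoding the caller had in mind. */
size_t byte_string_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

On an eight-bit byte system, byte_string_length(iso646_ab) is 2, while byte_string_length(ucs2_ab) is 0 because the very first octet of the UCS-2 data is zero. strcpy() or strcat() applied to the same data would stop at the same point.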
As earlier examples show, it is common for one or more UCS-* octets to have all bits set to zero. While ISO C does not define the size of a byte, it says that an implementation does. In implementations that define byte to be eight bits -- all common implementations -- UCS-2 or UCS-4 are not permissible as multi-byte encodings. That's because on an eight-bit byte system, the UCS-* data contains octets that are interpreted as NULL bytes.

It is theoretically possible to fix this problem, but it requires changing an implementation so that it defines the size of byte as 16 or 32 bits. In addition, since a lot of existing software assumes byte equals eight bits, those assumptions also would have to be removed. Note, too, that an implementation which changed the size of a byte might have trouble interoperating with the existing installed base of eight-bit byte systems.

Although it would be very difficult to use UCS-* as a multi-byte encoding, the UTF forms can be stored as multi-byte characters. Several current or planned implementations are using a UTF form (usually, UTF-2) as the multi-byte encoding of the repertoire of ISO 10646 characters. Some implementations require that UTF-2 be the only multi-byte encoding, while others support UTF-2 in addition to other common encodings (ISO 8859-n, Japanese AJEC, Korean EUC, etc.).

ISO 10646 as a Wide Character (wchar_t) Process Code
----------------------------------------------------

Process code is the form data takes when a program is processing it. The process code may be identical to the multi-byte code, but there are circumstances under which it is more efficient to convert multi-byte characters to a wide character process code. While multi-byte characters can be varying widths (some one byte, some two bytes, and so on), wide characters are all a single, fixed width.
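To make the multi-byte versus fixed-width distinction concrete, here is a minimal C sketch that uses UTF-2 as the variable-width multi-byte form and a 32-bit UCS value as the fixed-width form, following the UTF-2 mapping table given earlier in the paper. The names utf2_encode and utf2_decode are hypothetical and are not XPG4 or ISO C interfaces. The decoder checks only the bit patterns in the table, so it does not reject a short character that has been spelled with more octets than it needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS value (0 .. 0x7FFFFFFF) as UTF-2 multi-byte octets,
 * following the mapping table. Returns the number of octets written
 * (1 to 6); out must have room for 6. */
size_t utf2_encode(uint32_t ucs, uint8_t *out)
{
    /* Upper bound of each table row and the matching lead-octet prefix. */
    static const uint32_t max[6]  = { 0x7F, 0x7FF, 0xFFFF,
                                      0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF };
    static const uint8_t  lead[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n = 0;

    assert(ucs <= 0x7FFFFFFF);
    while (ucs > max[n])
        n++;                       /* n + 1 octets are needed */

    /* Trailing octets carry six payload bits each, lowest bits last. */
    for (size_t i = n; i > 0; i--) {
        out[i] = (uint8_t)(0x80 | (ucs & 0x3F));
        ucs >>= 6;
    }
    out[0] = (uint8_t)(lead[n] | ucs);
    return n + 1;
}

/* Decode one UTF-2 character starting at in[0] into a fixed-width
 * UCS value. Returns the number of octets consumed, or 0 if the
 * octets do not match any row of the table. */
size_t utf2_decode(const uint8_t *in, size_t avail, uint32_t *ucs)
{
    size_t len;
    uint32_t value;

    if (avail == 0)
        return 0;
    if (in[0] < 0x80)      { len = 1; value = in[0]; }
    else if (in[0] < 0xC0) { return 0; }            /* stray trailing octet */
    else if (in[0] < 0xE0) { len = 2; value = in[0] & 0x1F; }
    else if (in[0] < 0xF0) { len = 3; value = in[0] & 0x0F; }
    else if (in[0] < 0xF8) { len = 4; value = in[0] & 0x07; }
    else if (in[0] < 0xFC) { len = 5; value = in[0] & 0x03; }
    else if (in[0] < 0xFE) { len = 6; value = in[0] & 0x01; }
    else                   { return 0; }

    if (avail < len)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;                                /* not 10xxxxxx */
        value = (value << 6) | (in[i] & 0x3F);
    }
    *ucs = value;
    return len;
}
```

Because the conversion in both directions is pure bit manipulation, no lookup table is needed. The UCS value 0x00E1 becomes the two octets 0xC3 0xA1, and 0x4E00 becomes the three octets 0xE4 0xB8 0x80.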
The ISO C type wchar_t (usually implemented via a typedef) holds wide character data, and XPG4 defines a set of wchar_t-based functions (for example, wcscat() and iswalpha() ).

Given the current definition of the wchar_t interfaces, it is possible to use either UCS-2 or UCS-4 as a wide character process code as long as combining characters are not allowed. The interfaces implicitly assume that one wchar_t equals one complete character. However, with combining characters, one composite character sequence may span several wchar_t's. In that case, while the interfaces do not fail, they may not return the correct answer. Consider two possible UCS encodings for the French word "co^te'":

1. UCS-2, Level 1 (four wchar_t's)

   +--------+--------+--------+--------+
   |   c    |   ô    |   t    |   é    |
   +--------+--------+--------+--------+

2. UCS-2, Level 3 (six wchar_t's)

   +--------+--------+--------+--------+--------+--------+
   |   c    |   o    |   ^    |   t    |   e    |   '    |
   +--------+--------+--------+--------+--------+--------+

(Note that it is possible to encode "co^te'" exactly the same way in Level 3 as it is encoded in Level 1. However, the possibility exists that it could be encoded using combining characters as shown above.)

Now suppose these strings are the input to iswalpha() and that the function is run under a French locale. Assume the locale defines ô and é as alphabetic characters -- a realistic assumption, since both letters exist in French.

iswalpha() has no problem with any of the wchar_t's in Example 1; it correctly returns true for all four. However, the fact that the second example uses combining characters means that the results may be different from what a user expects. iswalpha() returns true for "c" and "o", but false for "^". Likewise, it returns true for "t" and "e" and false for "'". Since the input string contains only alphabetic characters, the fact that iswalpha() returns false twice is at best an anomaly; at worst, a mistake.
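The difference between checking each wchar_t on its own and checking each composite character sequence can be sketched as follows. Only iswalpha() is a real interface here. The helper names are invented for this example, and the sketch assumes that combining marks are exactly the values 0x0300 through 0x036F (the combining diacritical marks block), which covers the accents used in "co^te'" but not every combining character in ISO 10646.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Assumption for this sketch: only the combining diacritical marks
 * block, 0x0300 through 0x036F, counts as combining. */
int is_combining_mark(wchar_t wc)
{
    return wc >= 0x0300 && wc <= 0x036F;
}

/* Per-element check, as a program written against the current
 * interfaces would do it: every wchar_t must satisfy iswalpha(). */
int field_is_alpha_naive(const wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        if (!iswalpha((wint_t)*ws))
            return 0;
    return 1;
}

/* Composite-aware check: classify the base character with iswalpha()
 * and let any combining marks that follow it ride along. */
int field_is_alpha_composite(const wchar_t *ws)
{
    assert(ws != NULL);
    while (*ws != L'\0') {
        if (is_combining_mark(*ws) || !iswalpha((wint_t)*ws))
            return 0;          /* stray mark or non-alphabetic base */
        ws++;
        while (is_combining_mark(*ws))
            ws++;              /* skip marks attached to this base */
    }
    return 1;
}
```

With the six-element Level 3 encoding of "co^te'" as input, field_is_alpha_composite() accepts the string because each base letter is alphabetic. What field_is_alpha_naive() returns for the same input depends on how the current locale classifies the two combining marks, which is exactly the uncertainty the text describes.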
Suppose a program permits only alphabetic characters in a given field and uses iswalpha() to validate user input. iswalpha() might reject values not because of what they are, but because of the way they are encoded.

Most wchar_t interfaces are not designed to accommodate the possibility that a single complete character can span multiple wide characters. While it is fine to store UCS-*, Level 1 data in a wchar_t, the only way to ensure the expected results from the XPG4 interfaces when using UCS as a wide character process code is to prohibit the use of Level 2 or 3.

In recent months, there have been several proposals to add new wchar_t interfaces to process composite character sequences. Gary Miller of IBM-Austin has proposed interfaces including:

   wcstxcpy()    - Copy Composite Character Sequences
   wcstxcat()    - Concatenate Composite Character Sequences
   wcstxcnt()    - Determine Code Element Count
   wcstxwidth()  - Determine the Display Width of a String
   wcstxnext()   - Next Composite Character Sequence
   wcstxnorm()   - Normalize String
   wcstxattr()   - Retrieve Code Element Attributes

If ISO or X/Open adopts these or similar interfaces, it would be possible to use UCS-*, Level 2 or 3 as a process code to the extent that the new interfaces allow.

Since ISO 10646 has a larger repertoire than do other existing code sets, if it is the process code, it can represent the characters of more languages than other wide character codes.

ISO 10646 as an Interchange Code
--------------------------------

The last major use for a code set is as an interchange code. This might be the form data takes in any of these:

1. When it is travelling on the network
2. When going between processes (as in a cut-and-paste operation between two X windows)
3. When it is part of a Remote Procedure Call (RPC)

There are three major ways to implement interchange codes.

1. Single interchange code.
A system defines a single interchange code and either requires that all data be converted to the format, or blindly treats all data as being in that format. Many UNIX-based email systems assume all character data is encoded in ASCII only. Since ASCII fits in the lower seven bits of an eight-bit byte, these systems automatically strip off the high bit of all data bytes. 2. Multiple codes with identifying tags. An alternative implementation is to allow multiple interchange codes and add a tag to each data packet that identifies which code the packet contains. The X Consortium's Compound Text is an example of this implementation. 3. Multiple codes, no tags. A third alternative is for the system to support multiple interchange codes, but do nothing to identify them. Most modern networks work this way -- a user can send a Japanese EUC-encoded file followed by a French ISO 8859-1 file, and the network neither knows nor cares about the difference. It simply pushes the bytes along. It is possible to use ISO 10646 in any of these implementations. In fact, using a single form of ISO 10646 may be the only logical choice for a single interchange code implementation. After all, a system that only supports one interchange form should be capable of representing all characters. ISO 10646 is the only set that comes close to being able to do that. There is, however, a significant performance disadvantage to using a single interchange code. This model requires two conversions for all interchange tasks: into the interchange format at the beginning of a trip, and back out when the data reaches its destination. (Of course, if the data already is encoded in the form of ISO 10646 used for interchange, no conversions are necessary.) Since most interchanges involve data travelling between processes or systems using the same encoding, the conversion overhead may be an excessive price to pay. 
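The two conversions that a single-interchange model imposes can be sketched in C for the simplest case, ISO 8859-1 data travelling between two ISO 8859-1 systems through UCS-2. The function names are hypothetical. The sketch assumes UCS-2 octets travel most significant octet first, and it relies on the fact that ISO 8859-1 values equal the corresponding ISO 10646 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sender side: ISO 8859-1 octets to UCS-2 interchange octets.
 * The row octet is always zero for ISO 8859-1 characters.
 * Returns the number of octets written to wire. */
size_t latin1_to_interchange(const uint8_t *src, size_t n, uint8_t *wire)
{
    for (size_t i = 0; i < n; i++) {
        wire[2 * i]     = 0x00;      /* row  */
        wire[2 * i + 1] = src[i];    /* cell */
    }
    return 2 * n;
}

/* Receiver side: UCS-2 interchange octets back to ISO 8859-1.
 * Returns characters written, or (size_t)-1 if the data contains a
 * character that ISO 8859-1 cannot represent. */
size_t interchange_to_latin1(const uint8_t *wire, size_t wire_len, uint8_t *dst)
{
    assert(wire_len % 2 == 0);
    for (size_t i = 0; i < wire_len / 2; i++) {
        if (wire[2 * i] != 0x00)
            return (size_t)-1;       /* row other than zero */
        dst[i] = wire[2 * i + 1];
    }
    return wire_len / 2;
}
```

Even in this trivial case, every character is touched twice and the data doubles in size on the wire, although sender and receiver use the same code set. Conversions involving other code sets would add table lookups on both sides.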
That is, it is wasteful to convert into and out of an interchange code when data is just going from ASCII to ASCII, or from Japanese EUC to Japanese EUC.

Given the performance penalty associated with a single-interchange model, some systems are adopting ISO 10646 as a fallback universal interchange code. When the source and destination nodes use the same code set, there is no conversion. When they differ, any of several conversion models may be used, but such systems must support conversion into and out of ISO 10646. Some of these systems use tags to identify the data's format on the wire.

The use of ISO 10646 as a fallback universal interchange code simplifies code-set-knowledgeable interchange. When converting from one known encoding to another, there either need to be direct to-and-from converters between that pair of code sets, or else an acceptable intermediate form. Requiring direct converters leads to combinatorial explosion, but using ISO 10646 as a fallback removes the requirement for direct converters.

One issue to be resolved in interchanging data is the possibility of big endian and little endian mismatch. ISO 10646 defines big endian as the interchange format, but a heterogeneous distributed environment may contain both big and little endian nodes. Little endian nodes therefore must reverse the octets of UCS-* data going to and coming from the network.

OTHER WAYS TO USE ISO 10646
---------------------------

The previous sections describe ways that it is or isn't possible to deploy ISO 10646 using current interfaces. There are at least two additional ways to use the code set: with code-set-specific data types and interfaces, or as a well-known process code.

ISO 10646-Specific Data Types and Interfaces
--------------------------------------------

In XPG4, X/Open added the WPI interfaces as a way to support requests for code set independence. While the XPG3 interfaces were oriented toward single-byte code sets, XPG4 removes many such dependencies.
Since code set usage differs from country-to-country and user-to-user, code set independent software has been seen as a good way to provide a single system that works around the world. While the WPI interfaces can handle many encodings, they are not completely code set independent (CSI). For example, truly CSI interfaces would be able to support Level 2 and 3 combining characters. The XPG4 interfaces cannot. Although some still stress the importance of code set independence, there is growing sentiment for ISO 10646-only, or code set dependent (CSD) interfaces. Advocates of ISO 10646-only data types and interfaces often reason that it is easier to write programs that know, and take advantage of, a single character encoding. They note that prior to the advent of internationalization, most programs were written with the assumption that all data was ASCII-encoded. ASCII-only software was unacceptable because it basically only handles English, but since ISO 10646 is a universal set, it is supposed to support all characters in all languages. Some believe it will become "the new ASCII" and therefore remove the need for code set independence. Another reason for adding ISO 10646-only data types/interfaces is that several future OSes, including Microsoft's NT and a future Apple Computer OS, have such types and interfaces. Among the proposals for CSD support are: 1. That a new data type be created that can contain UCS-2, Level 3 data only. 2. That two new types be created -- say, ucs2-char (16 bits) and ucs4-char (32 bits). There are varying opinions on the level or levels these types should support. In addition to the data types, there also are proposals for ISO 10646-specific APIs. These would be in addition to char- and wchar_t-based interfaces. 
For example:

   Existing:
      char *strcat (char *s1, const char *s2);
      wchar_t *wcscat (wchar_t *ws1, const wchar_t *ws2);

   New [names are placeholders only]:
      ucs2-char *ucs2cat (ucs2-char *u2s1, const ucs2-char *u2s2);
      ucs4-char *ucs4cat (ucs4-char *u4s1, const ucs4-char *u4s2);

There are two scenarios under which code set specific interfaces could be added:

1. If/when ISO 10646 becomes "the new ASCII," or
2. As an implementation alternative to the existing code set independent APIs.

If a single form of ISO 10646 becomes "the new ASCII," code set specific data types and APIs may be widely used. XPG4's WPIs must be somewhat general in nature in order to support many different encodings. Such generality at times has an adverse impact on efficiency and performance, and means that underlying implementations often are more complex than they would be if the interfaces supported a single code set. Dedicated ISO 10646 interfaces could be tuned to a single form or group of forms. (Note that as the number of supported forms increases, implementations become more complex and the efficiency and performance gains may decrease.)

The question is, how likely is it that a form of ISO 10646 will become the only supported encoding on all or most computer systems? Complete uniformity seems unlikely. Even when ASCII dominated the U.S. market, it wasn't the only supported code set -- EBCDIC-based systems had a significant share of the market. Today, people around the world use many different code sets, and it often is difficult to move them from one to another.

Although users resist code set changes for numerous reasons, an important one is that a change can alter the amount of space their data consumes. Data stored in national or regional code sets usually consumes less space than does the equivalent UCS or UTF encoding.
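The storage difference can be estimated with a few lines of C. The sketch below compares ISO 8859-1 text (one octet per character) with the same text in UCS-2, UCS-4, and UTF-2. All function names are made up for the example, and the UTF-2 sizes follow the mapping table given earlier in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Octets needed to store n characters in each fixed-width form. */
size_t ucs2_size(size_t n) { return 2 * n; }
size_t ucs4_size(size_t n) { return 4 * n; }

/* Octets needed for one BMP character in UTF-2, per the mapping table. */
size_t utf2_char_size(uint16_t ucs)
{
    if (ucs <= 0x007F) return 1;
    if (ucs <= 0x07FF) return 2;
    return 3;
}

/* Octets needed to store ISO 8859-1 text re-encoded as UTF-2.
 * The ISO 8859-1 form itself needs exactly n octets. */
size_t latin1_as_utf2_size(const uint8_t *text, size_t n)
{
    size_t total = 0;

    assert(text != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += utf2_char_size(text[i]);
    return total;
}
```

For the four-character French word used earlier (0x63 0xF4 0x74 0xE9 in ISO 8859-1), the sizes are 4 octets in ISO 8859-1, 6 in UTF-2, 8 in UCS-2, and 16 in UCS-4.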
Given that most users work in a single language, and that they have terabytes of existing data, they may see no advantage to giving up national sets (particularly single-byte ones) in favor of ISO 10646.

Another obstacle to ISO 10646 becoming the new ASCII is its support of multiple forms. There is only one encoding for ASCII, so it was possible to write programs that depended on that one encoding. ISO 10646, however, includes the two- and four-octet forms, three conformance levels, and allows subsetting. There also are the semi-official UTF forms. Although XoJIG members seem to be unanimous in supporting UTF-2 as the multi-byte form of ISO 10646, there is no consensus in the group on the form or forms to support within a wchar_t or in ISO 10646-specific interfaces. It seems likely that other groups or computer vendors also will have varying opinions on the form(s) to support.

Because of legacy programs, storage size, user inertia, and other factors, it seems likely that ISO 10646 will be one of several supported code sets rather than being the only one. In addition, the trend in information technology is toward increasing interoperability and distributed computing, which implies that ISO 10646-based systems and those that support other code sets will have to interoperate. To enable interoperability between such systems, a number of issues must be resolved, based on both customer requirements and technology trends.

Despite the prediction that ISO 10646 will be one of many code sets rather than the only one, there still is an argument for adding ISO 10646-only data types or interfaces. As noted, the WPIs were not designed to handle ISO 10646's combining characters, and it is not feasible in existing implementations to use UCS forms as a multi-byte encoding. New data types or interfaces could be designed to take advantage of ISO 10646's specific attributes. They also would make it possible for those who want to write CSD programs to do so.
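As an illustration of what such a code set dependent interface might look like, here is a minimal sketch of the placeholder ucs2cat() function mentioned earlier. Everything in it is hypothetical: the type name ucs2_char stands in for the paper's placeholder "ucs2-char" (which is not a valid C identifier), the helper ucs2len() is invented, and the choice of a zero code element as terminator simply mirrors strcat() and wcscat().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit type standing in for the placeholder name
 * "ucs2-char" (a hyphen is not valid in a C identifier). */
typedef uint16_t ucs2_char;

/* Hypothetical length function: count code elements before the
 * terminating zero element. */
size_t ucs2len(const ucs2_char *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical counterpart to strcat() and wcscat(): append u2s2 to
 * u2s1, which must have room for the result, and return u2s1. */
ucs2_char *ucs2cat(ucs2_char *u2s1, const ucs2_char *u2s2)
{
    ucs2_char *end = u2s1 + ucs2len(u2s1);

    assert(u2s2 != NULL);
    while ((*end++ = *u2s2++) != 0)
        ;                       /* copies the terminator too */
    return u2s1;
}
```

Because the element size and encoding are fixed, the function needs no locale lookups. Like wcscat(), however, it treats each 16-bit element independently, so by itself it does nothing special for composite character sequences.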
The advantage of CSD programs is that they can include logic that is geared toward the single supported code set. Instead of having generalized routines that are capable of handling many code sets, such programs can streamline the logic and hard-code in character-handling information. Such programs may need a way, however, to get data into the single form they support. If an application processes a single form of UCS only, it either can assume all data is in that form, or it needs the appropriate converter modules. If the application makes assumptions, they may be incorrect, and if it requires the use of converter modules and such modules are unavailable, it will not be able to process the data. This is analogous to the current locale model -- data is assumed (sometimes incorrectly) to match the current locale, and a requested locale may not always be available. A disadvantage of CSD data types and interfaces is that they cannot metamorphose over time. They therefore may not be able to keep up with changing requirements. In early 1992, if a vendor had chosen to write programs that depended on UTF-1 as the multi-byte form of ISO 10646, it would have had difficulty moving to the technically superior UTF-2 version that was approved later in the year. Similarly, ISO 10646 had two levels in early 1992, but an extra one was added late in the year and there are proposals to add another. If CSD interfaces and data types had been created to support one of the early levels, it might be difficult or impossible to change to the other levels. Another disadvantage to CSD types and interfaces is that they may not always support the full range of user-required characters. Although ISO 10646 has a very large repertoire, it does not currently include all the ideographs in the 1992 version of CNS 11643 (the Taiwanese code set). If these ideographs are not added to ISO 10646, CSD interfaces would not be able to process such characters. 
As requirements, level definitions, and character repertoires change over time, CSD data types and interfaces that are geared toward today's realities may become less useful.

It also is important to consider how CSD types and interfaces might fit in with existing I18N interfaces. Developers often find the current dual set of char- and wchar_t-based interfaces confusing, and XoJIG is proposing a third o_* set in the Distributed Internationalization Services Snapshot (DISS). A fourth set would add to that confusion.

ISO 10646 as the Well-Known Process Code
----------------------------------------

wchar_t is a semi-opaque type in that ISO C does not define its size or contents. Two sizes are common in current implementations: 16 bits and 32 bits. As implemented in several OSes, the contents of wchar_t differ depending on locale. Because multi-byte encodings have different characteristics, these implementations have separate algorithms for converting the multi-byte characters to wchar_t representations. Thus, an AJEC (Japanese EUC) wchar_t encoding differs from the Japanese SJIS version, and both differ from the ISO 8859-1 version.

Maintaining flexibility in wchar_t's size and its contents comes with a cost. It means wide characters are not exchangeable between processes and that programs must be written such that they do not depend on or take advantage of any wchar_t encoding.

There have been proposals to handle process code another way: to designate a single size and a single encoding for wchar_t so applications can exchange process code and developers can write programs that take advantage of the single representation. The most common suggestions are that either UCS-2, Level 1, or UCS-2, Level 3 be designated as the single, "well-known" process code. Under this proposal, whenever converting a multi-byte character to wide character form (that is, whenever using mbtowc() or mbstowcs() ), the only valid wide character encoding would be the selected UCS form.
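As a rough sketch of the conversion step discussed above, the following uses only ISO C calls (setlocale() and mbstowcs()) to turn a multi-byte string into process code and to report the size of wchar_t on the implementation at hand. The helper names to_process_code and process_code_bits are hypothetical. Note that the helper changes the program's LC_CTYPE locale as a side effect, which a real program might not want.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multi-byte string to process code under
   the named locale. Returns the number of wide characters stored, or
   (size_t)-1 if the locale is unavailable or the input is invalid.
   Side effect: changes the LC_CTYPE category of the program's locale. */
size_t to_process_code(const char *locale_name, const char *mb,
                       wchar_t *wide, size_t max)
{
    assert(mb != NULL && wide != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;          /* requested locale not available */
    return mbstowcs(wide, mb, max); /* encoding of wide[] is up to the
                                       implementation and the locale */
}

/* Hypothetical helper: width of the process code in bits. ISO C does not
   fix this value; 16 and 32 are the common answers. */
size_t process_code_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

A portable caller can rely on the count returned and on comparisons with wide character constants such as L'a', but not on the numeric values stored in the wide array. Under the well-known process code proposal, those values would instead be defined to be the selected UCS form.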
If there is to be a well-known process code, the only viable candidates are Unicode or a form of ISO 10646. They are the only code sets with a large enough repertoire to satisfy most users' requirements.

On an implementation level, there are both similarities and differences between the current multiple wchar_t representation model and an ISO 10646-only model. For systems that support more than one multi-byte encoding, both models require multiple sets of converters to the wide character form. For example, if a system supports ISO 8859-1..n, Japanese AJEC and SJIS, and Taiwanese EUC, it needs the appropriate mbtowc converters for each of these, regardless of the destination process code.

Note that in a multiple wchar_t representation model, some code sets share the same converter, but that isn't the case in an ISO 10646-only model. Consider the ISO 8859 code sets. In a multiple wchar_t model, all these sets usually use the same mbtowc() logic (typically, just a zero-padding of the single-byte characters out to the system's wchar_t size). In an ISO 10646-only model, each set needs its own converter because the ISO 8859 values do not match the ISO 10646 values (except for ISO 8859-1).

A difference between the multiple and ISO 10646-only models is that the former often relies on algorithms for conversion while the latter often needs table lookup. The latter may therefore be slower than the former. Note that UTF to UCS conversions are done via an algorithm. Therefore, a way to remove the performance disadvantage of an ISO 10646-only model is to allow only UTF form(s) as multi-byte encodings.

In addition to implementation considerations around a single well-known process code, there also is the question of which one (if any) to designate as that code. As noted earlier, the XPG4 wchar_t interfaces do not currently support combining characters, so it isn't feasible now to designate UCS-*, Level 2 or 3 as the well-known process code.
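The contrast drawn above between a shared zero-padding converter and per-code-set table lookup can be sketched as follows. This is an illustration only: the type ucs2_t and both function names are hypothetical, and the ISO 8859-2 table holds just four sample entries rather than the full upper half of the code set.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short ucs2_t;  /* hypothetical 16-bit type for this sketch */

/* Multiple-representation model: every ISO 8859 part can share this one
   converter, which only zero-pads the single-byte value. */
unsigned long pad_to_wide(unsigned char c)
{
    return (unsigned long)c;
}

/* ISO 10646-only model: ISO 8859-2 needs its own table. Only a few sample
   entries are listed; a real converter would cover 0xA1 through 0xFF. */
struct map_entry {
    unsigned char code;
    ucs2_t ucs;
};

static const struct map_entry iso8859_2_sample[] = {
    { 0xA1, 0x0104 },  /* LATIN CAPITAL LETTER A WITH OGONEK */
    { 0xA3, 0x0141 },  /* LATIN CAPITAL LETTER L WITH STROKE */
    { 0xB1, 0x0105 },  /* LATIN SMALL LETTER A WITH OGONEK */
    { 0xB3, 0x0142 },  /* LATIN SMALL LETTER L WITH STROKE */
};

/* Returns 1 and stores the UCS-2 value if the character is covered by this
   sketch, 0 if it falls outside the sample table. */
int iso8859_2_to_ucs2(unsigned char c, ucs2_t *out)
{
    size_t i;

    assert(out != NULL);
    if (c <= 0xA0) {            /* ASCII, C1 controls, and NBSP map 1:1 */
        *out = c;
        return 1;
    }
    for (i = 0; i < sizeof iso8859_2_sample / sizeof iso8859_2_sample[0]; i++) {
        if (iso8859_2_sample[i].code == c) {
            *out = iso8859_2_sample[i].ucs;
            return 1;
        }
    }
    return 0;
}
```

The same octet 0xA1 comes out as 0xA1 from the shared converter and as 0x0104 from the ISO 8859-2 table. An ISO 8859-5 or ISO 8859-7 converter would need a different table again, which is why the ISO 10646-only model cannot share one routine across the ISO 8859 family.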
UCS-*, Level 1 is less than ideal for other reasons. With combining characters, ISO 10646 has a nearly infinite repertoire of composite character sequences, but UCS-*, Level 1 necessarily has a smaller repertoire. If UCS-*, Level 1 is designated as the well-known process code, technologies will not be able to process languages that are only fully supportable in ISO 10646 through the use of combining characters, and won't be able to support characters (like the ideographs in the revised Taiwanese set) that have not been added to UCS-*.

Even assuming one form of ISO 10646 (or any other encoding) is ideal as a process code, there are reasons not to specify a well-known process code. Since ISO C does not define the size or contents of wchar_t, OS suppliers implement it differently. While X/Open might select one size and one encoding, other suppliers are free to choose something different. This means any code that X/Open members write to take advantage of an X/Open definition may not be portable to other ISO C-compliant OSes. Naturally, this is not a disadvantage for companies for whom application portability is not a goal.

IMPLEMENTATIONS OF ISO 10646 OR UNICODE SYSTEMS
-----------------------------------------------

In considering a strategy for handling ISO 10646, X/Open needs to be aware of what some computer vendors are implementing. For example, there is a general belief that Microsoft's NT will be a major force in the industry, and that X/Open-compliant technologies must be capable of interacting with that force.

As mentioned earlier, Microsoft is providing Unicode support in its upcoming NT release. It does so using a single source, dual object model, where one object supports Unicode only (currently without combining characters) and the other handles ASCII-based encodings. Support for combining characters is planned for a future release.

The NT implementation also includes these features:

1.
NT uses wchar_t, but only allows it to be 16 bits and to contain only Unicode. NT hard-codes dependencies on the size and contents of wchar_t. In essence, NT has chosen Unicode to be its single, well-known process code.

2. NT uses Unicode as a multi-byte file code. As noted above, Unicode and UCS-* are not permissible as multi-byte file codes on ISO C-compliant eight-bit byte systems.

3. NT uses wide character functions that have names similar to the XPG4 WPI functions, but are different because they can only process Unicode data. In some cases, the Microsoft names and syntax match the XPG4 versions; in others, they don't.

Microsoft can make some of these design decisions because application portability is not a high priority in NT. X/Open faces different constraints, but there are things X/Open can do to ease interoperability with NT or other ISO 10646-based implementations. See the next section for specific recommendations.

RECOMMENDATIONS
---------------

X/Open should:

1. Endorse UTF-2 as the preferred multi-byte representation of the repertoire of characters in ISO 10646.

2. Endorse the continuing support of other multi-byte encodings. UTF-2 is an addition to other code sets, not a replacement for them.

3. Develop additional wchar_t APIs for handling composite character sequences. This would make it possible to use Level 2 or 3 data in a wide character process code. Gary Miller of IBM-Austin has made an initial proposal; XoJIG should work from this proposal.

4. Continue to allow multiple wchar_t representations. That is, X/Open should not designate a single, well-known process code. This means existing implementations of XPG4 interfaces continue to be viable, and also allows support for characters which may not exist in the chosen process code.

5. Avoid creating ISO 10646-specific data types or interfaces. For now, such interfaces add too much confusion to the existing (char- and wchar_t-based) and proposed (DISS) sets of I18N interfaces.

6.
Do nothing with respect to the use of ISO 10646 as an interchange code. This is in line with current XPG functionality. The current WPIs have no knowledge of interchange codes, so there is no need to change the interfaces. However, if X/Open adds specific support for interchange codes in the future, it must ensure that such support accommodates ISO 10646's needs.

*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT*****DRAFT