ISO/IEC JTC1/SC22/WG15RIN N133

Title:  Use of ISO 10646 in POSIX interfaces
Source: WG15 RIN Rapporteur
Date:   1994-10-23
Action: for WG15 RIN to discuss

ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set (UCS), provides the capability to encode multi-script text within a single coded character set. However, because UCS is fully encoded, null bytes and the code values of the other ASCII characters, including the code value of the ASCII slash ("/") character, are not protected: these octet values can occur within the coded representation of an unrelated character. This makes the UCS character encoding incompatible with many existing ASCII-based POSIX operating system implementations.

UCS presents several challenges which must be dealt with by present implementations of the POSIX operating system. The most significant of these is the encoding scheme used by UCS. More precisely, the challenge is to marry the UCS standard with existing programming languages and existing operating systems. Prominent among the operating system concerns in handling UCS is the representation of the data within the file system.

An underlying assumption is that there is an absolute requirement to maintain the existing operating system software investments while at the same time taking advantage of the large number of characters provided by UCS.

The UTF-8 transformation format was originally conceived as a file-system-safe transformation format of UCS, to allow historically ASCII-based POSIX operating systems to cope with the representation and handling of the large number of characters that UCS can encode. As UTF-8 can represent the full encoding of IS 10646, there is no immediate need for support of other representations of IS 10646.

The objective of UTF-8 is to provide a UCS transformation format which also meets the requirement of being usable on historical POSIX operating system file systems in a non-disruptive manner.
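The file system concern described above can be illustrated with a short Python sketch. It is not part of any POSIX or ISO interface: the helper names `ucs2_octets`, `ucs4_octets` and `unsafe_positions` are hypothetical, and big-endian octet order is assumed for the raw UCS forms. It shows that the raw two-octet and four-octet forms of ordinary characters contain octets equal to NUL (0x00) or to the ASCII slash (0x2F).

```python
import struct

# Octet values that historical POSIX file systems treat specially:
# NUL terminates a pathname string and "/" separates its components.
UNSAFE_OCTETS = {0x00, 0x2F}


def ucs2_octets(code: int) -> bytes:
    """Serialize one UCS value as two octets (big-endian order assumed)."""
    return struct.pack(">H", code)


def ucs4_octets(code: int) -> bytes:
    """Serialize one UCS value as four octets (big-endian order assumed)."""
    return struct.pack(">I", code)


def unsafe_positions(octets: bytes) -> list:
    """Return the indexes of octets that collide with NUL or "/"."""
    return [i for i, b in enumerate(octets) if b in UNSAFE_OCTETS]


if __name__ == "__main__":
    # U+0041 is "A"; U+012F is not a slash, yet its low octet is 0x2F.
    for code in (0x0041, 0x012F):
        for name, octets in (("UCS-2", ucs2_octets(code)),
                             ("UCS-4", ucs4_octets(code))):
            print(f"U+{code:04X} {name} {octets.hex()} "
                  f"unsafe at {unsafe_positions(octets)}")
```

A pathname routine that scans octet by octet for NUL or "/" would misread every position reported here, which is the incompatibility that UTF-8 is intended to avoid.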
The UTF-8 transformation format encodes both UCS-2 and UCS-4 in a compatible format using multibyte characters of lengths 1, 2, 3, 4, 5, and 6 bytes:

Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
1       7    00000000  0000007F  0vvvvvvv
2      11    00000080  000007FF  110vvvvv 10vvvvvv
3      16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
4      21    00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5      26    00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6      31    04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is the concatenation, most significant bit first, of the v-bits in the multibyte encoding. Every octet of a multibyte sequence has its high bit set, so UTF-8 can handle existing ASCII files without change, and every octet in the ASCII range (0-127) can safely be assumed to represent the corresponding ASCII character.

Other forms of IS 10646:

IS 10646 has two forms, UCS-2 and UCS-4, which are a 16-bit and a 31-bit encoding of the character set, respectively. It is clear from work in JTC1/SC2/WG2 that IS 10646 will have more characters than can be represented in 64K code positions, so a 16-bit representation like UCS-2 is not acceptable for general use.

ISO/IEC 10646-1:1993 has a transformation format UTF-1, which is informative. It is proposed that this format be removed, as UTF-8 is aimed at the same purpose and has more capability. UTF-8 is currently under vote as a PDAM in JTC1/SC2.

A new transformation format of IS 10646, UTF-16, is being introduced in another PDAM currently under vote in JTC1/SC2. It cannot accommodate all of IS 10646 (it accommodates only about 1 million characters). Like UTF-8, it will use ranges of code values to indicate how many octets are required to form one character, but without the added functionality of being backwards compatible with ASCII and ISO 2022 encodings, which UTF-8 has.
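The UTF-8 byte-sequence table earlier in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the names `utf8_encode`, `utf8_decode` and `_RANGES` are hypothetical, the encoder always picks the shortest sequence whose Hex Min/Hex Max range contains the value, and the decoder checks only the sequence length and the 10 prefix of trailing octets.

```python
# (Hex Max, lead-octet marker bits) for sequence lengths 1..6, taken from
# the Hex Max and Byte Sequence columns of the table.
_RANGES = [
    (0x0000007F, 0x00),
    (0x000007FF, 0xC0),
    (0x0000FFFF, 0xE0),
    (0x001FFFFF, 0xF0),
    (0x03FFFFFF, 0xF8),
    (0x7FFFFFFF, 0xFC),
]


def utf8_encode(code: int) -> bytes:
    """Encode one UCS value (0..0x7FFFFFFF) as a 1- to 6-octet sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS value out of range")
    # Pick the first (shortest) row whose Hex Max covers the value.
    for length, (last, marker) in enumerate(_RANGES, start=1):
        if code <= last:
            break
    out = bytearray(length)
    # Fill trailing octets right to left: six v-bits each under a 10 prefix.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code & 0x3F)
        code >>= 6
    # The remaining v-bits go into the lead octet below its marker bits.
    out[0] = marker | code
    return bytes(out)


def utf8_decode(octets: bytes) -> int:
    """Decode exactly one sequence back to its UCS value."""
    lead = octets[0]
    if lead < 0x80:
        if len(octets) != 1:
            raise ValueError("unexpected trailing octets")
        return lead
    # The number of leading 1 bits in the lead octet is the sequence length.
    length, mask = 0, 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if not 2 <= length <= 6 or len(octets) != length:
        raise ValueError("malformed lead octet or wrong sequence length")
    code = lead & (mask - 1)          # v-bits of the lead octet
    for b in octets[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("trailing octet lacks the 10 prefix")
        code = (code << 6) | (b & 0x3F)
    return code


if __name__ == "__main__":
    for value in (0x41, 0x012F, 0x20AC, 0x10000, 0x7FFFFFFF):
        print(f"{value:08X} -> {utf8_encode(value).hex()}")
```

In this sketch a value above 0x7F never produces an octet below 0x80, so octet-oriented code that searches for NUL or "/" continues to find only genuine ASCII characters.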
The only candidate for general use among the above encodings of IS 10646 is then UCS-4. It has the property of being constant-width, which may be easier to handle than the multi-octet UTF-8. As a file code and as an interchange code it has the problematic properties of using codes in conflict with ISO 646, ISO 2022 and ISO 6429, and of using 4 octets per character. Here UTF-8 is clearly superior. UCS-4 may have advantages as an internal processing code and as an inter-process encoding, for C widechar-like encodings, but with the current ISO C amendment providing full support for multibyte coded character sets, that advantage may be diminishing. UTF-8 is equally well defined here and is capable of representing all IS 10646 characters, and given its strengths in other areas it may well be chosen also for internal processing and inter-process communication. In any case, internal processing is outside the scope of POSIX interfaces.

IS 10646 has 3 levels of support: level 1 without combining characters, level 2 with combining characters in some scripts, and level 3 with unrestricted use of combining characters. SC22 has, by resolutions from the 1993 Paris plenary, recommended that all SC22 standards be enabled for level 3 data, but that the semantics of combining characters not be addressed currently. Thus there is no specific SC22 request for further support of levels 2 and 3, but eventually there will be a need for support of these levels.

SC22 also recommended use of IS 10646 terminology throughout SC22 standards. This may require an alignment of current POSIX work, though it is believed that current POSIX work is already well aligned with IS 10646 with respect to terminology.

JTC1/SC22/WG20 is revising TR 10176 with guidelines for support of IS 10646, and there may be further recommendations in this work of relevance to POSIX. The work is only at the initial working draft stage, so nothing firmer can be expected before 1996.