ISO/IEC JTC1/SC22/WG15RIN N146

Title:  Use of IS 10646 in POSIX interfaces
Source: WG15 RIN
Date:   1995-05-11
Action: for WG15 RIN to vote by RIN email ballot

1. Introduction and scope

To serve the widest possible audience, POSIX standards should be able to
handle the most encompassing character set, and the best candidate for
this is the new ISO/IEC 10646-1:1993 standard. ISO/IEC 10646-1:1993, the
Universal Multiple-Octet Coded Character Set (UCS), provides the
capability to encode multi-script text within a single coded character
set. WG15RIN was asked by WG15 to give guidance on how to utilize UCS in
POSIX standards, as is also requested by SC22 policies. RIN believes this
to be of use in many areas, such as global organisations interested in
just one character set organisation-wide, European government
institutions, eastern Asia and many other places.

However, because UCS is designed to use all code points available, null
bytes and the code values of the other ISO/IEC 646:1991 IRV (also known
as ASCII) characters, including the code value of the ISO 646 solidus
("/") character, are not protected. This makes the UCS character encoding
incompatible with many existing ISO 646 based POSIX operating system
implementations. The fact that UCS also uses code values that are used
for ISO 6429 control characters introduces further problems for
communication and application software. From these problems it was clear
that a POSIX internal encoding was required.

This paper first surveys the possible coded representation forms of UCS
and the UCS transformation formats, with their respective
characteristics. Then each of the handling areas (data storage, file
names, internal processing, communications, interprocess communication)
of POSIX operation is analyzed. Finally a recommendation is given for
POSIX standards.

JTC1/SC22/WG20 is revising TR 10176 with guidelines for support of
IS 10646, and there may be further recommendations in this work of
relevance to POSIX.
The work is only at the initial working draft stage, so nothing firmer
can be expected before 1996.

2. UCS coded representation forms and UCS transformation formats

2.1. POSIX internal encoding

For the POSIX internal encoding UTF-8 was considered suitable. The
objective of UTF-8 is to provide a UCS transformation format which also
meets the requirement of being usable on historical POSIX operating
system file systems in a non-disruptive manner.

The UTF-8 transformation format represents both UCS-2 and UCS-4 in a
compatible format using multiple-octet coded characters of lengths 1, 2,
3, 4, 5, and 6 octets:

    Octets  Bits  Hex Min   Hex Max   Byte Sequence in Binary
      1       7   00000000  0000007F  0vvvvvvv
      2      11   00000080  000007FF  110vvvvv 10vvvvvv
      3      16   00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
      4      21   00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
      5      26   00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
      6      31   04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The v positions carry the bits of the UCS value: concatenating the v-bits
of the octets, first octet first, yields the UCS value.

Thus UTF-8 has the capability of handling existing ISO 646 files without
change. Since every octet of a multiple-octet sequence has its high bit
set, every octet in the ISO 646 range (octet values 0-127) can safely be
assumed to represent the normal ISO 646 character.

2.2. Other forms of IS 10646

IS 10646 has two forms: UCS-2 and UCS-4, a 16-bit and a 31-bit coded
representation of the character set, respectively. It is clear from work
in JTC1/SC2/WG2 that IS 10646 may have more characters than can be
represented in 64K code positions, so we are here considering the general
case of UCS-4.

ISO/IEC 10646-1:1993 has a transformation format UTF-1, which is
informative. It is proposed that this format be removed, as UTF-8 is
aimed at the same purpose and has more capability. UTF-8 is currently
scheduled for ballot as a DAM in JTC1/SC2.
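
The bit layout tabulated in 2.1 can be illustrated with a short sketch.
This is not a normative algorithm; the function names ucs4_to_utf8 and
utf8_to_ucs4 and the table _FORMS are assumptions made for this example
only, and the decoder handles a single character and does not reject
over-long forms.

```python
# Illustrative sketch of the 1-to-6 octet UTF-8 layout from the table in 2.1.
# All names here are hypothetical; they are not part of any standard API.

# (largest UCS value, lead-octet marker bits) for sequences of 1..6 octets
_FORMS = [
    (0x0000007F, 0x00),  # 0vvvvvvv
    (0x000007FF, 0xC0),  # 110vvvvv
    (0x0000FFFF, 0xE0),  # 1110vvvv
    (0x001FFFFF, 0xF0),  # 11110vvv
    (0x03FFFFFF, 0xF8),  # 111110vv
    (0x7FFFFFFF, 0xFC),  # 1111110v
]


def ucs4_to_utf8(value):
    """Encode one UCS-4 value (0..0x7FFFFFFF) following the table's layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    for index, (limit, marker) in enumerate(_FORMS):
        if value <= limit:
            trailing = index  # number of 10vvvvvv octets that follow
            break
    octets = []
    for _ in range(trailing):
        octets.append(0x80 | (value & 0x3F))  # low six v-bits, last octet first
        value >>= 6
    octets.append(marker | value)  # the remaining v-bits go in the lead octet
    return bytes(reversed(octets))


def utf8_to_ucs4(octets):
    """Decode exactly one character; over-long forms are not rejected."""
    lead = octets[0]
    if lead < 0x80:
        length, value = 1, lead
    else:
        # The number of leading 1 bits in the lead octet is the sequence length.
        length = 0
        while lead & (0x80 >> length):
            length += 1
        if not 2 <= length <= 6:
            raise ValueError("invalid lead octet")
        value = lead & (0x7F >> length)
    if len(octets) != length:
        raise ValueError("wrong number of octets")
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("invalid trailing octet")
        value = (value << 6) | (octet & 0x3F)  # concatenate the v-bits
    return value
```

For example, ucs4_to_utf8(0x41) returns the single octet 0x41, unchanged
from ISO 646, while ucs4_to_utf8(0x4E2D) returns the three octets
E4 B8 AD, none of which falls in the ISO 646 range.
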
A new transformation format of IS 10646, UTF-16, is being introduced in
another DAM, but it cannot accommodate all of IS 10646 (it only
accommodates about 1 million characters). Like UTF-8, it will use ranges
of code values to indicate how many octets are required to form one
character, but without the added functionality of being backwards
compatible with ISO 646 and ISO 2022 encodings (which is a functionality
of UTF-8).

The most general of the above encodings of IS 10646 is then UCS-4. It has
the property of being constant-width, which may be easier to handle than
the multiple-octet UTF-8. As a file code and as an interchange code it
has three problematic properties: it uses codes in conflict with ISO 646,
ISO 2022 and ISO 6429, it depends on the byte ordering (little-endian vs
big-endian) of the hosting machine architecture, and it uses 4 octets per
character. Here UTF-8 is clearly superior for POSIX internal encoding.

UCS-4 may have advantages as an internal processing code, and as an
inter-process encoding, for C language widechar-like encodings, but with
the new ISO C language amendment with full support for multibyte coded
character sets, that advantage may be diminishing. Here too, UTF-8 is
well defined and capable of representing all IS 10646 characters, and
given its strengths in other areas it may well be chosen also for
internal processing and inter-process communication. Internal processing
is not in the scope of POSIX interfaces, anyway.

2.3. UCS levelling

IS 10646 has 3 levels of support: level 1 without combining characters,
level 2 with combining characters in some scripts, and level 3 with
unrestricted use of combining characters. SC22 has, by resolutions from
the 1993 Paris plenary, recommended that all SC22 standards be enabled
for level 3 data, but that the semantics of combining characters not be
addressed currently.
Thus there is no specific SC22 request for further support of levels 2
and 3, but eventually there could be a need for support of these levels.

SC22 also recommended use of IS 10646 terminology throughout SC22
standards. This may require an alignment of current POSIX work, though it
is believed that current POSIX work is already well aligned with IS 10646
with respect to terminology.

3. Problems in POSIX handling of UCS

There are several challenges presented by UCS which must be dealt with by
present implementations of the POSIX operating system.

3.1. Data storage

The most significant of these challenges is the encoding scheme used by
UCS. More precisely, the challenge is the marrying of the UCS standard
with existing programming languages and existing operating systems.
Prominent among the operating system UCS handling concerns is the
representation of the contents of data in files. An underlying assumption
is that there is an absolute requirement to maintain the existing
operating system software investments while at the same time taking
advantage of the large number of characters provided by UCS.

For UTF-8 the representation of ISO 646 data is exactly the same, and the
right-hand-side characters of the ISO/IEC 8859 parts will need two octets
for representation. For ideographic characters in the Basic Multilingual
Plane (BMP) the representation will be three octets. This does not
dramatically change the requirement for what is currently consumed for
data storage.

3.2. File names and internal processing

The UTF-8 transformation format was originally conceived as a file system
safe transformation format of UCS, to allow historically ISO 646 based
POSIX operating systems to cope with the representation and handling in
file names of the large number of characters that can be encoded by UCS.
In addition, from an internal operating system (kernel) viewpoint the
handling of a large character set is only a problem for file names, which
are only analyzed for the solidus ("/") delimiter to parse a name into
filename components. As UTF-8 can represent the full encoding of IS 10646
and is backwards compatible with ISO 646, UTF-8 handling is sufficient
for POSIX internal encoding.

3.3. Communications

Current ISO POSIX standards do not address communication. However,
ISO 6429 control characters are often used in communication, and the
UTF-1 transformation format was originally created to avoid control
character problems in communication. As UTF-1 is being removed from UCS
and UTF-8 is being introduced with the same capabilities with respect to
solving control character problems, UTF-8 should be the recommended
choice in POSIX communication interfaces.

3.4. Interprocess communication

Communication between POSIX processes would probably use internal data
formats; for example, integers should be transferred in binary form. As
it could be recommended that programs internally use a C language
widechar style encoding of characters, a UCS-2 or UCS-4 format could be
recommended. On the other hand, interprocess communication is often
across networks and between heterogeneous systems. Since UCS-2 and UCS-4
are dependent on machine architecture, UTF-8 may therefore be the
preferred candidate. UTF-8 would in many cases also be less
space-consuming, which may be a significant plus when using low-capacity
network lines.

4. Recommendation

According to the above analysis, UTF-8 is the best candidate for POSIX
internal encoding of UCS in the areas of data storage, file names and
internal operating system (kernel) processing, and communication, where
otherwise UCS-2 or UCS-4 would have been used for coded data. Furthermore
UTF-8 is a good candidate for UCS representation in interprocess
communication.
It is thus the recommendation of this paper to use the UTF-8
transformation format whenever UCS is used in POSIX interfaces. As POSIX
interfaces in principle should be coded character set independent, there
is no general need to require the use of UTF-8 in POSIX standards, but
guidance could be given in rationales.

A specific recommendation is that the portable archive exchange utility
"pax" be revised to be able to specifically use UTF-8 for file names, and
that the use of UTF-8 be clearly identified.
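
The file name concern behind this recommendation (see 3.2) can be
illustrated with a small sketch. The character U+012F and the helper name
unsafe_octets are arbitrary choices made for this example only. Python's
"utf-16-be" and "utf-32-be" codecs are used as stand-ins for big-endian
UCS-2 and UCS-4, which they match for the characters used here.

```python
# Illustrative sketch: which octets would a historical ISO 646 based kernel
# misread as the solidus delimiter or as a null byte? Names are hypothetical.

SOLIDUS = 0x2F
NUL = 0x00


def unsafe_octets(encoded):
    """Count octets that an octet-oriented kernel would read as "/" or NUL."""
    return sum(1 for octet in encoded if octet in (SOLIDUS, NUL))


name = "\u012f"  # arbitrary example character whose UCS-2 form is 01 2F

forms = {
    "UCS-2": name.encode("utf-16-be"),  # 01 2F
    "UCS-4": name.encode("utf-32-be"),  # 00 00 01 2F
    "UTF-8": name.encode("utf-8"),      # C4 AF
}

# Splitting a two-component path on the solidus octet, as a kernel would.
path = "dir/" + name
components_utf8 = path.encode("utf-8").split(b"/")
components_ucs2 = path.encode("utf-16-be").split(b"/")
```

Under these assumptions the UTF-8 form contains no octet that collides
with the solidus or with a null byte, and the UTF-8 path splits into the
two intended components, whereas the UCS-2 path is split at an octet that
belongs to the example character itself.
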