Impact of ISO/IEC 10646 on POSIX

Prerequisites to this contribution :


Reference [1] gives a solution for implementing ISO/IEC 10646 with minimal impact on current implementations of POSIX. Provided that 8-bit-clean systems are used, which is increasingly the case with modern implementations, UTF-8 allows the full repertoire of the UCS to be used without breaking system conventions. A good example of an implementation in which this scheme works remarkably well is the current version of IBM AIX.

Reference [2] surveys various problems. The main ones, which UTF-8 alone does not solve, come down to a single fact: the complex "text elements" needed to process textual data adequately in many natural languages do not correspond to single characters. A minor problem, generally not mentioned, is that although UTF-8 can be processed transparently as a series of 8-bit characters, doing so has a variable impact on field lengths, since one character of the UCS may be represented by up to 6 octets in UTF-8. Provided that this caveat is understood, UTF-8 nevertheless remains the scheme that requires minimal changes in POSIX. Many systems will use UTF-8 for interchange (for data portability reasons) but will prefer to upgrade actual storage to native UCS-2 or UCS-4.
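The field-length caveat can be illustrated with a minimal sketch, written in Python only for brevity. The function name "utf8_octets" and the sample string are hypothetical. The sketch counts the octets each UCS character occupies once encoded in UTF-8. Python's codec follows the later restriction of UTF-8 to at most 4 octets per character, so the 5- and 6-octet forms cannot be produced here.

```python
def utf8_octets(text):
    """Return a list of (character, number of octets in UTF-8) pairs."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

# LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE,
# DEVANAGARI LETTER KA, and one character outside the 16-bit range.
sample = "A\u00e9\u0915\U00010000"

per_character = utf8_octets(sample)
character_count = len(sample)                         # length in UCS characters
octet_count = sum(size for _, size in per_character)  # length in UTF-8 octets
```

In this sample a field that is 4 characters wide needs 10 octets of storage, so a buffer sized in characters is not sufficient once the data is held in UTF-8.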

Even with native UCS mode, it should be pointed out (which reference [2] does not do) that ISO/IEC 10646 says nothing about the following worst case: a collating element in Indic scripts, for example, may require a series of several UCS characters for what would be perceived in Europe as a single character on paper. An Indic ligature, which may be an entity in its own right for dictionary searching, may have to be formed from up to 5 or 6 UCS characters, and this without any combining sequence [as defined in the ISO/IEC 10646 standard] being used.

SC22/WG20 WD4 of ISO/IEC 14651 (Ordering Standard Project) defines the term "text element" to address this general problem, which also covers combining sequences. Furthermore, ISO/IEC 14651 defines an API for character comparison that could help solve this general problem, with binding considerations for the "strxfrm()", "strcmp()" and "strncmp()" functions as used in the C language.
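The role of a "strxfrm()"-style transformation can be sketched as follows, again in Python for brevity. This is neither the ISO/IEC 14651 API nor its ordering table: the weight table "WEIGHTS" and the function "toy_strxfrm" are hypothetical, and traditional Spanish ordering, in which "ch" is one text element sorting after "c", is used only as a familiar illustration. The idea shown is that each text element, possibly several characters long, receives a weight, and that the resulting keys can then be compared octet by octet, as "strcmp()" would do.

```python
# Hypothetical weight table: one weight per text element.
# "ch" is a single text element ordered between "c" and "d".
WEIGHTS = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5, "e": 6, "o": 7}
MAX_ELEMENT_LEN = max(len(element) for element in WEIGHTS)

def toy_strxfrm(word):
    """Turn a word into a key whose plain comparison gives the tailored order."""
    key = []
    i = 0
    while i < len(word):
        # Try the longest text element first, so "ch" wins over "c" + "h".
        for size in range(MAX_ELEMENT_LEN, 0, -1):
            element = word[i:i + size]
            if element in WEIGHTS:
                key.append(WEIGHTS[element])
                i += len(element)
                break
        else:
            raise ValueError("no weight defined for %r" % word[i])
    return bytes(key)

words = ["choca", "coda", "cebo", "dado"]
by_code_value = sorted(words)                # order of the coded characters
by_text_element = sorted(words, key=toy_strxfrm)
```

Sorting by coded value places "choca" before "coda", whereas sorting by the transformed keys places it after, because "ch" is weighed as one element. A real table would also need several levels of weights, which this sketch leaves out.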

This draft builds on POSIX syntax and introduces very few innovations, which could possibly [if one can live with a non-ideal world for some time] be ignored for early compatible support in POSIX (e.g. multiple "order_start" statements for the global tailorable "LOCALE" defined in this standard, to deal with the different properties of each writing system defined in a "global" locale that caters for most characters [around 40000] used commercially on this blue planet).

RECOMMENDATION 1: to look carefully at this WD, available on the DKUUG site, as a means of solving most of the processing problems introduced by the complexity of UCS coding. POSIX could possibly just point to this standard rather than extend its own LOCALE mechanism for character comparisons. This standard project is intended to be programming-language-independent.

This is the most important recommendation for supporting the multiscript nature of the UCS. All character processing revolves around character comparison. Even character classification can be dealt with in this new generalizable API. POSIX should build on it as much as this new work has carefully built on the fertile POSIX ground.

If all goes as planned, this WD will be sent for CD ballot after the SC22/WG20 Kyoto meeting, held in the second week of April 1996.

On the question of equivalences between a combining sequence and a precomposed character in ISO/IEC 10646, it is generally recognized that this matter is application-dependent. JTC1/SC2 does not deal with such issues. In SC22/WG20, we would generally advise people to prefer precomposed characters when they exist, rather than "equivalent" combining sequences. For those who would still have a need, the UNICODE consortium has defined, on its own initiative, tables of equivalences (publicly available on its web site). These could be used in tailored definitions of collations using the current LC_COLLATE specification, without even the need to change POSIX.
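As an illustration of such an equivalence, the sketch below uses Python's standard "unicodedata" module, which carries Unicode character data in a later form than the tables referred to here (the normalization forms NFC and NFD used in it postdate this contribution, so this is an assumption about how the equivalences would be applied). The helper name "equivalent" is hypothetical. The sketch shows that a combining sequence and its precomposed counterpart differ as coded strings but compare as equal once both are brought to the same form.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

def equivalent(a, b):
    """Compare two strings after composing both to precomposed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == combining                  # coded sequences differ
tailored_equal = equivalent(precomposed, combining)   # equal after composition
```

An application that needs this behaviour can obtain it by applying such a table before or during comparison, which is the kind of tailoring referred to above.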

RECOMMENDATION 2: to ignore the issue of composite sequence equivalences, other than by noting that tailoring is already possible without any change in POSIX, should the need arise.

Two other issues should be known in the POSIX community. SC18/WG9 is working on a project for standard methods to enter any UCS character from any keyboard, independently of coding (these methods are not intended to bypass any national input method, but rather to complement existing input methods and keyboard layouts for entering "foreign" characters in any given country). The project, ISO/IEC 14755, is now at its second CD stage and should normally be processed for DIS ballot in 1996 (in April or October, depending on the result of the current ballot). Filtering of this input process at the system level should be avoided.

RECOMMENDATION 3: provided that coding does not constitute an obstacle per se (and UTF-8 ensures this for 8-bit-clean systems), implementation of ISO/IEC 14755 is recommended without any filtering at the system level.

As for presentation, it is an application-level task that belongs to the realm of display and printing support. No major impact is to be expected at the system level.

written by

Alain LaBonté
Québec (city of),
on behalf of the SC22
Canadian national body