Date: Tue, 17 Oct 1995 18:59:32 +0100
From: Martin Kirk
To: d.cannon@xopen.co.uk
Subject: Impact of ISO 10646-1 on POSIX

The following lists several issues relating to the impact of ISO 10646
on POSIX and X/Open specifications. It is abstracted from the X/Open
Study "Universal Multiple-Octet Coded Character Set Coexistence and
Migration" (E401), which goes into more detail on the ramifications.

o  ISO 10646 characters are defined in 128 groups of 256 'planes'.
   Broadly, 10646 defines UCS-2 (two-octet characters) and UCS-4
   (four-octet characters).

o  ISO 10646 set out to 'solve all the character set problems' by
   inventing a new set with enough scope to embrace all the character
   sets in the world. (The initial definition contains 34,168
   characters, plus 6,400 for 'private use'.)

o  If it succeeded in this ambition then it could also provide a new
   'interchange code' between systems.

o  However, if it became 'the new ASCII', systems would need to be
   redesigned with code-specific (ie 10646-specific) APIs.

o  Some problems (for POSIX and X/Open specifications) arising from
   the 10646 definition:-

   o  Many current codesets include within them either ASCII or the
      Portable Character Set (PCS) as single-octet entities; 10646
      does not.

   o  Many codesets reserve the single-octet range from zero to 7f
      for Control Characters; 10646 does not.

   o  Zero value octets and octets equating to the '/' character can
      appear anywhere in a 10646 character stream. Clearly this
      presents a problem in recognising 'end of string' and
      file-names.

   o  Certain natural languages (Thai, Arabic) can only be fully
      supported by using 'combining characters': a base character
      modified by one or more diacriticals.

   o  10646 does not restrict either the number of diacriticals
      associated with a base character, or the sequence in which they
      occur. This means that there can be multiple forms of a final
      'character' which all need to have the same weight from a
      user's standpoint (eg for collation).
      (Within 10646 a fully formed character might have its own code.
      The 'same character' composed of a base plus one (or more)
      diacriticals needs to be recognised as being the same. A simple
      example is the equivalence of codings for and plus . With
      multiple diacriticals and unconstrained sequence this can be a
      nightmare.)

   o  XPG4 systems have no interfaces suitable for processing
      composite sequences.
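
The point about zero-value and '/' octets can be made concrete with a small sketch. It is illustrative only and not taken from the E401 study. It assumes that Python's "utf-16-be" and "utf-32-be" codecs produce the same octets as big-endian UCS-2 and UCS-4 for the two characters shown; the helper names `octets` and `has_stray_octet` are hypothetical.

```python
# Assumption: for these characters, Python's "utf-16-be" and "utf-32-be"
# codecs yield the same octets as big-endian UCS-2 and UCS-4.

NUL = 0x00    # octet that C-style string handling treats as 'end of string'
SLASH = 0x2F  # octet that pathname handling treats as '/'


def octets(text, codec):
    """Return the encoded form of text as a list of integer octets."""
    return list(text.encode(codec))


def has_stray_octet(text, codec, value):
    """True if the octet `value` occurs anywhere in the encoding of text."""
    return value in octets(text, codec)


if __name__ == "__main__":
    # U+0041 is the letter 'A'; U+012F is a Latin letter that is not '/'.
    for ch in ("\u0041", "\u012f"):
        for codec in ("utf-16-be", "utf-32-be"):
            hex_form = " ".join(f"{b:02X}" for b in octets(ch, codec))
            print(f"U+{ord(ch):04X} as {codec}: {hex_form}")
```

In the two-octet form the letter 'A' carries a zero octet, and U+012F carries the octet 2F although the text contains no '/' character, so octet-oriented code that scans for either value would misread the stream.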
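
The equivalence problem for composite sequences can be sketched as follows. This is a hedged illustration, not an XPG4 interface and not part of the 10646 text discussed above: it relies on the normalization support in Python's standard `unicodedata` module, which implements the later Unicode normalization forms. The characters are chosen for this example only (they are not necessarily the ones in the original message), and `same_character` is a hypothetical helper name.

```python
import unicodedata

# Example characters chosen for illustration (an assumption, not the memo's).
PRECOMPOSED = "\u00e4"               # a with diaeresis as a single code
COMPOSITE = "a\u0308"                # base 'a' plus combining diaeresis
DOT_THEN_DIAERESIS = "a\u0323\u0308"  # base + dot below + diaeresis
DIAERESIS_THEN_DOT = "a\u0308\u0323"  # same marks in the other order


def same_character(first, second):
    """Compare two codings after canonical decomposition (NFD).

    NFD splits precomposed characters into base plus combining marks and
    puts the marks into a canonical order, so equivalent codings compare
    equal even when their code sequences differ.
    """
    return (unicodedata.normalize("NFD", first)
            == unicodedata.normalize("NFD", second))


if __name__ == "__main__":
    print(PRECOMPOSED == COMPOSITE, same_character(PRECOMPOSED, COMPOSITE))
    print(DOT_THEN_DIAERESIS == DIAERESIS_THEN_DOT,
          same_character(DOT_THEN_DIAERESIS, DIAERESIS_THEN_DOT))
```

A plain code-by-code comparison treats each pair as different, while the normalized comparison treats them as the same character. Collation by user-perceived weight would need further rules on top of this.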