From daemon@dkuug.dk Fri Nov 16 20:34:32 1990
Date: Fri, 16 Nov 90 20:34:32 +0100
From: The devil himself
X-Sequence: wg15rin@dkuug.dk 18
Errors-To: wg15rin-request@dkuug.dk
Message-Id: <9011161936.AA24545@dkuug.dk>
Apparently-To: i18n-list@dkuug.dk
X-Charset: ASCII
X-Char-Esc: 29
From: Dominic Dunlop
X-Sequence: i18n@dkuug.dk 6
Date: Fri, 16 Nov 90 15:23:35 GMT
Message-Id: <1633.9011161523@tsa.co.uk>
In-Reply-To: Mark E Davis "10646 Advantages" (Nov 15, 22:33)
X-Fax: +44 491 651751
X-Phone: +44 491 652590
X-Address: 9 The Forty, Cholsey, OXON OX10 9LH, U.K.
X-Organization: The Standard Answer Ltd.
X-Mailer: Mail User's Shell (7.1.2 7/11/90)
To: Mark_E_Davis.PINKTEAM@gateway.qm.apple.com,
    unicode@noddy.eng.sun.com,
    Internet_UniCore.PINKLINK@gateway.qm.apple.com
Subject: Re: 10646 Advantages
Cc: i18n@dkuug.dk

[From "10646 Advantages" dated Nov 15]
> [Much cogent and coherent stuff deleted]
> ...
> 4. Correct collation order
> 10646 maintains the nationally mandated collation orders of Korean, Japanese
> and Chinese. No tables are needed for collation.

Collation?! Don't even think about collation! It's not your problem --
or 10646's.

Two more measured comments here (and I did read the A: section before
rushing in with them):

1. If there is a nationally mandated collation order for Japanese, my
colleagues and I on the ISO/IEC JTC1/SC22/WG15 rapporteur group on
internationalization (see below for explanation) are not aware of it.
Our information is that, where collation orders exist at all, they tend
to be proprietary and/or specific to particular application areas
(telephone books, dictionaries, directories...). To define a single
national collating order for Japan would seem to be as much a political
problem as it is a technical one -- and it's one hell of a technical
problem.
If this perception, gained by working with technical experts from Japan,
is misinformed or incomplete, we would appreciate being put right.

2. ``[A complaint against ASCII is] that you cannot order a file very
well by using the binary sequences for character representation. Of
course you can't! The New York Telephone Company, if asked, might send
you its multipage set of rules for ordering the names in telephone
directories. To think that the characters in a set should be grouped in
a set by their usage (e.g. all arithmetic operators) is as futile as
thinking that all vowels should lie next to each other on a keyboard, or
that all keys should be laid out in alphabetical order. No way!''

Who said that? R. W. Bemer, credited as ``the father of ASCII'' in a
letter published on pages 36-37 in Byte, volume 15, number 6, June 1990.

In other words, correct collation was not even a goal in the choice of
ASCII character encodings. It should not, and cannot, be a goal in the
design of more complex encodings, particularly those which, like Unicode
and most ISO coded character sets, can trace their ancestry to ASCII.
This consideration makes Mark's 10646 advantage 4 a no-op -- unless DIS
10646 claims to provide correct collation for Korean, Japanese and
Chinese through simple arithmetic comparison of character encodings.
Such a claim would almost certainly be unfounded, and would be a mark
against 10646.

Looking at the September 1990 working draft of DIS 10646, I see no such
claim -- or, indeed, any reference whatever to collation. This is not
surprising, as SC2, the JTC1 subcommittee on character sets and
information coding, regards collation as Somebody Else's Problem. This
is clearly the correct attitude to take if you are on a working group
allocating encodings, but is unhelpful: every other part of JTC1 seems
also to regard collation as Somebody Else's Problem. The net effect of
this is that the ISO POSIX working group (!)
is currently running with the issue because it needs a solution: the
UNIX shell and tools embody collation and related concepts (filename
expansion and listing, the sort command, regular expressions), and a
corresponding international standard must be internationally applicable.

Work in progress suggests that, by making up to four passes backwards
and forwards through text, assigning different weights (including
``ignore'', ``high'' and ``low'') to each encoded character encountered
on each pass, you can achieve useful real-world collation -- although
you probably can't do a telephone book sort even in New York, never mind
Tokyo.

Our work has been based primarily on encodings without the non-spacing
diacritics (accents) of Unicode. If it turns out that we can't
accommodate these, we'll think again: the ability to handle Unicode is
at the very least an important proof of concept for us. (My feeling is
that, compared to the handling of stateful encodings with locking shifts
-- something else that we intend to accommodate -- non-spacing
diacritics should be a piece of cake.)

Where am I coming from? Clearly the ISO camp: I'm a delegate to
JTC1/SC22/WG15, the ISO POSIX working group, and the UK's designated
expert on internationalization -- a topic which starts out with coded
character sets and then gets worse. The rapporteur group on
internationalization, where we ``experts'' hang around, works mainly on
the definition of ``national profiles'' -- sets of preferences which
mould a POSIX system to the needs of particular territories.

Much of the groundwork on internationalization has been done by the
UniForum Technical Committee Subcommittee on Internationalization. In
particular, it was UniForum which came up with the current collation and
regular expression handling. (X/Open was heavily involved as well.)
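
To make the multi-pass idea above concrete, here is a minimal sketch in
Python. It is not the UniForum or POSIX algorithm: the weight table, the
names char_weights, sort_key and collate, and the choice of three passes
(base letter, accent, case) are all assumptions made for illustration.
It only shows how per-pass weights, an ``ignore'' weight and a backward
pass can give an order that plain comparison of encodings cannot.

```python
# Illustrative sketch only: the weights below are invented for this example
# and are not taken from any POSIX, UniForum or ISO draft.

# Accented letter -> (base letter, accent rank); rank 0 means "no accent".
ACCENTS = {"é": ("e", 1), "è": ("e", 2), "ê": ("e", 3), "ô": ("o", 3)}

# Characters given the weight "ignore": they take no part in any pass.
IGNORED = set("-' ")


def char_weights(ch):
    """Return (primary, secondary, tertiary) weights, or None for "ignore"."""
    if ch in IGNORED:
        return None
    lower = ch.lower()
    base, accent = ACCENTS.get(lower, (lower, 0))
    case = 0 if ch == lower else 1  # lower case sorts before upper case
    return (ord(base), accent, case)


def sort_key(text, backward_accents=False):
    """Build one tuple per pass: base letters, then accents, then case."""
    ws = [w for w in map(char_weights, text) if w is not None]
    primary = tuple(w[0] for w in ws)
    secondary = tuple(w[1] for w in ws)
    if backward_accents:
        # The accent pass runs from the end of the string towards the start.
        secondary = secondary[::-1]
    tertiary = tuple(w[2] for w in ws)
    return (primary, secondary, tertiary)


def collate(words, backward_accents=False):
    """Sort words by comparing pass 1 first, then pass 2, then pass 3."""
    return sorted(words, key=lambda w: sort_key(w, backward_accents))


words = ["Zebra", "apple", "co-op", "coop", "Apple"]
print(sorted(words))   # plain comparison of encodings
print(collate(words))  # weighted, multi-pass comparison
```

With these made-up weights, collate(["côté", "coté", "côte", "cote"])
gives cote, coté, côte, côté, while passing backward_accents=True gives
cote, côte, coté, côté, the order usually quoted for French dictionaries.
A real locale would need far larger tables, and probably more passes.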

Clearly, ISO POSIX has to accommodate existing ISO-sanctioned encodings
-- although ISO 646 is giving us some grief because of the use by the
UNIX shell of characters in the national variant positions. 10646 is
coming down the pike, so we're looking at that too.

If you want to help us look at Unicode, please keep us informed of
relevant developments -- perhaps by cross-posting where appropriate to
our public mail-list, i18n@dkuug.dk. (Mail to i18n-request@dkuug.dk if
you want to join.) (I18n is short for internationalization, a 20-letter
word.)

Thanks.
--
Dominic Dunlop