From daemon@dkuug.dk Sat Nov 17 20:26:57 1990
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA26769; Sat, 17 Nov 90 20:27:59 +0100
Received: from alpha.Xerox.COM by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA26742; Sat, 17 Nov 90 20:26:57 +0100
Received: from Mirassou.osbu_north.xerox.xns by alpha.xerox.com via XNS id <16279>; Sat, 17 Nov 1990 11:30:05 PST
X-Ns-Transport-Id: 0000AA0013829FE82B06
Date: Sat, 17 Nov 1990 11:29:38 PST
Sender: Joseph_D._Becker.OSBU_North@xerox.com
From: Becker.OSBU_North@xerox.com
Subject: Re: Han Character Code Ordering
In-Reply-To: "Dominic Dunlop's message of 16 Nov 90 07:23:35 PST (Friday)"
To: domo@tsa.co.uk
Cc: unicode@sun.com, i18n@dkuug.dk, arnet@hpda.cup.hp.com,
	arnet@hpcupt1.cup.hp.com, Becker.OSBU_North@xerox.com
Message-Id: <"17-Nov-90 11:29:38 PST".*.Joseph_D._Becker.OSBU_North@Xerox.com>
X-Charset: ASCII
X-Char-Esc: 29

It is true that there are as many different CJK collation orders as there are specific application areas, dictionaries, etc. It is also true that "collation" can and should be separated from character code numeric ordering. However, it is also true that many computer applications will cop out and present their users with Han character data arranged in character code order, so the question is worth looking into.

The Taiwan Big5 and CNS standards are in stroke/radical order at each level; CCCII is by radical/stroke.

The layout of the other Chinese, Japanese, and Korean national code standards is:

	Level 1: phonetic
	Level 2: radical/stroke
	Level 3: radical/stroke
	....

The layout of the Unicode "UniHan" collection is:

	Level 1: radical/stroke
	Level 2: radical/stroke
	Level 3: radical/stroke
	....

Here, "phonetic" just means SOME phonetic-like scheme, since at the detailed level a unique phonetic order cannot be defined; similarly "radical/stroke" just means SOME radical/stroke scheme, since at the detailed level a unique radical/stroke order cannot be defined.

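To make the distinction between code order and collation concrete, here is a minimal Python sketch. Everything in it is an assumption chosen for illustration: the six sample characters, the table layout, the function names, and the use of Kangxi radical numbers and Mandarin pinyin as stand-ins for "SOME radical/stroke scheme" and "SOME phonetic-like scheme". It is not taken from any of the standards discussed here.

```python
# Hypothetical sample table, for illustration only.
# Each entry: (character, Kangxi radical number, residual strokes, pinyin)
SAMPLE = [
    ("\u68ee", 75, 8, "sen"),  # U+68EE "forest"
    ("\u4e00", 1, 0, "yi"),    # U+4E00 "one"
    ("\u6797", 75, 4, "lin"),  # U+6797 "grove"
    ("\u4eba", 9, 0, "ren"),   # U+4EBA "person"
    ("\u6728", 75, 0, "mu"),   # U+6728 "tree"
    ("\u53e3", 30, 0, "kou"),  # U+53E3 "mouth"
]


def code_order(entries):
    """Sort by numeric character code (the 'cop out' ordering)."""
    return sorted(entries, key=lambda e: ord(e[0]))


def radical_stroke_order(entries):
    """Sort by (radical number, residual stroke count)."""
    return sorted(entries, key=lambda e: (e[1], e[2]))


def phonetic_order(entries):
    """Sort by one assumed reading; real readings are not unique."""
    return sorted(entries, key=lambda e: e[3])


def codepoints(entries):
    """Render entries as U+XXXX strings for ASCII-safe display."""
    return ["U+%04X" % ord(e[0]) for e in entries]


if __name__ == "__main__":
    print("code order:    ", codepoints(code_order(SAMPLE)))
    print("radical/stroke:", codepoints(radical_stroke_order(SAMPLE)))
    print("phonetic:      ", codepoints(phonetic_order(SAMPLE)))
```

For this particular sample the code order and the radical/stroke order happen to coincide, because the sample's code values were assigned in radical/stroke order; the phonetic key gives a different arrangement. An application that wants a specific collation supplies its own key function and never depends on the code values.
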
The advantage of the national code standards' phonetic schemes is that common characters can indeed often be found via their most common pronunciation. However, standards like JIS X0208 and GB2312 also contain radical/stroke indexes in the back, since the "most common pronunciation" is often unguessable or inapplicable.

The advantage of the UniHan scheme is that it consistently uses one ordering method rather than two different ones. In addition, UniHan Level 1 includes BOTH Levels 1 & 2 of standards JIS X0208 and GB 2312. So, for example, Kanji data containing characters from both JIS Level 1 & JIS Level 2 appears somewhat incoherent if sorted in JIS order, whereas its presentation in UniHan order is entirely uniform.

It might be mentioned that nearly all book-form Han character dictionaries in Taiwan, Japan, and Korea use a radical/stroke order; and ordering via radicals and stroke counts is in fact a part of every national encoding standard except KS C5601. So any statement that this scheme is "foreign to Japanese eyes" is obviously false and must have resulted from some kind of misunderstanding. Phonetic-ordered dictionaries are more common in China, since the simplified characters are harder to accommodate in the radical/stroke system.

So, what does this all mean? Very little.

	The "most common pronunciation" order is nice and familiar when it works, which it sometimes does.
	No users have memorized the national code standard orders, so they are not looking for them specifically.
	No users are surprised to see radical/stroke ordering.
	The UniHan scheme offers some improvement in consistency/predictability.
	Each application should apply its own preferred collation anyhow.

Bottom line: None of the above-listed Han code ordering schemes has a significant superiority over any of the others. This is indeed a non-issue.

Joe