From daemon@dkuug.dk Sun Nov 18 04:55:32 1990
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA09785; Sun, 18 Nov 90 04:56:24 +0100
Received: from mcsun.EU.net by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA09767; Sun, 18 Nov 90 04:55:32 +0100
Received: by mcsun.EU.net with SMTP; Sun, 18 Nov 90 04:58:50 +0100
Received: from srava.sra.co.jp by srawgw.sra.co.jp (5.64WH/1.4)
	id AA21533; Sun, 18 Nov 90 12:57:42 +0900
Received: from sran8.sra.co.jp by srava.sra.co.jp (5.64b/6.4J.6-BJW)
	id AA10873; Sun, 18 Nov 90 12:57:43 +0900
Received: from localhost by sran8.sra.co.jp (4.0/6.4J.6-SJ)
	id AA24323; Sun, 18 Nov 90 12:56:19 JST
Return-Path:
Message-Id: <9011180356.AA24323@sran8.sra.co.jp>
Reply-To: erik@sra.co.jp
From: Erik M. van der Poel
To: Becker.OSBU_North@xerox.com
Cc: unicode@sun.com, i18n@dkuug.dk, arnet@hpda.cup.hp.com,
	arnet@hpcupt1.cup.hp.com
Subject: Re: Han Character Code Ordering
Date: Sun, 18 Nov 90 12:56:17 +0900
Sender: erik@sran8.sra.co.jp
X-Charset: ASCII
X-Char-Esc: 29

> It might be mentioned that nearly all book-form Han character
> dictionaries in Taiwan, Japan, and Korea use a radical/stroke order;
> and ordering via radicals and stroke counts is in fact a part of every
> national encoding standard except KS C5601. So any statement that
> this scheme is "foreign to Japanese eyes" is obviously false and must
> have resulted from some kind of misunderstanding.

Yes, it is true that Han character dictionaries in Japan (Kanji
dictionaries) are in some kind of radical and stroke order. But it is
also true that the Japanese rarely use these dictionaries, since they
usually know how to pronounce the word they are looking up, and they
look up these words in dictionaries that are sorted in Kana (phonetic)
order.

Radical/stroke order dictionaries are a pain in the ass. You probably
won't hear a Japanese saying this, so I'll say it for them.
:-)

> The "most common
> pronunciation" order is nice and familiar when it works, which it
> sometimes does.
>
> Joe

Yes, the "most common pronunciation" order is nice when you are
ordering single characters. But to completely satisfy the ordinary
Japanese user, collation will have to be string-based, rather than
character-based. (Of course, as you say, most applications will cop
out and just do character-based sorting, in which case I think the
UniHan scheme is great.)

String-based sorting is desirable because of the change in
pronunciation of a character when it is combined with other
characters. Example:

	KAZE	(1 character)	means "wind"
	TAI FUU	(2 characters)	means "typhoon"

Here, the KAZE and FUU are the same character.

The implications of this are staggering. Not only do we need a large
dictionary with all the different pronunciations, but we may in some
cases also need to parse sentences. But this should probably be left
to sophisticated applications.

So what's the conclusion? As far as Unicode and collation are
concerned, UniHan is probably the way to go. ISO 10646 is somewhat at
a disadvantage in this respect. But 10646 has many other advantages
that far outweigh its disadvantages.

Erik M. van der Poel
erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan
TEL +81-3-234-2692
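
The difference between character-based and string-based collation
described in the message can be illustrated with a minimal Python
sketch. Everything in it is an assumption made for illustration: the
two reading tables are tiny hypothetical stand-ins for the "large
dictionary" the message mentions, the single reading assigned to each
character is not claimed to be its most common one, and the names
CHAR_READING, WORD_READING, char_key and string_key are invented here.

```python
# Hypothetical per-character readings (one kana reading per character).
# Illustrative values only, not a verified "most common" reading.
CHAR_READING = {
    "風": "かぜ",   # KAZE, "wind"
    "台": "たい",   # TAI
    "太": "たい",
    "鼓": "こ",
}

# Hypothetical per-word readings, standing in for a full dictionary.
WORD_READING = {
    "風": "かぜ",       # KAZE
    "台風": "たいふう",  # TAI FUU: the second character is read FUU here
    "太鼓": "たいこ",
}


def char_key(word):
    """Character-based key: join the fixed reading of each character."""
    return "".join(CHAR_READING.get(ch, ch) for ch in word)


def string_key(word):
    """String-based key: look up the whole word, else fall back."""
    return WORD_READING.get(word, char_key(word))


words = ["台風", "太鼓", "風"]

# Character-based sorting treats 台風 as "たいかぜ", which is wrong.
by_char = sorted(words, key=char_key)

# String-based sorting uses the real reading "たいふう".
by_string = sorted(words, key=string_key)

print(by_char)
print(by_string)
```

With these assumed tables the two keys place "typhoon" differently
relative to the third word, because the character-based key reads its
second character as KAZE instead of FUU. Note that Python compares
strings by code point, which only approximates Kana order; a real
application would presumably use a proper collation library.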