From domo@tsa.co.uk Fri Nov 16 15:44:08 1990
16 Nov 90 14:42 GMT
id aa04673; Fri, 16 Nov 90 14:03:41 GMT
From: Dominic Dunlop
X-Sequence: wg15rin@dkuug.dk 14
Errors-To: wg15rin-request@dkuug.dk
Date: Fri, 16 Nov 90 13:45:42 GMT
Message-Id: <1306.9011161345@tsa.co.uk>
X-Fax: +44 491 651751
X-Phone: +44 491 652590
X-Address: 9 The Forty, Cholsey, OXON OX10 9LH, U.K.
X-Organization: The Standard Answer Ltd.
X-Mailer: Mail User's Shell (7.1.2 7/11/90)
To: wg15rin@dkuug.dk
Subject: 10646 Advantages
X-Charset: ASCII
X-Char-Esc: 29

RIN people,

What follows brings a breath of fresh and rational air into the prolonged discussion on the Unicode mail list about the relative merits and demerits (mostly the latter, given the forum) of DIS 10646. It seems relevant to our work. Enjoy(?)

-- Dominic Dunlop

--- Forwarded mail from Mark E Davis

>From mark_e_davis.pinkteam@gateway.qm.apple.com Fri Nov 16 13:36 GMT 1990
Date: 15 Nov 90 22:33:19
From: Mark E Davis
Subject: 10646 Advantages
To: unicode@noddy.eng.sun.com, Internet_UniCore.PINKLINK@gateway.qm.apple.com

REGARDING 10646 Advantages

Ken,

This has been an interesting exchange between you and Arne Thormodsen. I certainly agree with your reasoning. However, I do think it is worth the effort to make an explicit list of areas where 10646 has an advantage, whether real or apparent. The purpose of this is both to be fair to 10646 (although clearly I am convinced that Unicode is far better), and to be prepared with responses when those areas come up in discussion about the relative merits.

This is a very preliminary pass; I would appreciate any additions or changes that people can think of.

================

First off, there are broad areas where 10646 and Unicode are both superior to many current systems: single byte approaches (a la 2022) or 7-bit standards. They are both intended to be complete world encodings with enough characters for all significant letters and technical symbols.

However, there are significant differences between them that can be separated into repertoire and structure.

The main differences in repertoire are:

1. Han Unification
   10646 separates ideographs by language, while Unicode unifies them.

2. Control Code bytes
   10646 excludes any character with C0, C1 bytes, while Unicode does not.

3. Presentation Forms
   Although it does include some compromises for compatibility, Unicode generally excludes character forms and precomposed characters that 10646 includes.

The main differences in structure are:

1. Maximum Width
   Unicode is 16 bit, while the canonical form of 10646 is 32 bit.

2. Fixed vs. Variable
   Unicode has fixed width (leaving compression methods to higher-level protocols). 10646 has built-in compression methods (determined with control-code sequences).

3. Announcement of subsets
   Unicode has none (leaving announcement of subsets to higher-level protocols). 10646 has specified subsets (using control-code sequences for announcement).

Since 10646 is a portmanteau standard (those unkinder than I would use 'kitchen sink standard'), different compactions of 10646 may have different advantages or disadvantages (but not simultaneously). For example, 10646 can have the advantage of being fixed width, or it can have the advantage of built-in compression, but not both simultaneously.

It is convenient to refer to the different compression schemes as follows (in each case, the XX bytes are specified by the compaction method, and all character numbers are in hex):

10646/1  Single byte compaction    YY
   95 characters of the form 000000YY where YY is in (20..7F), and
   96 characters of the form XXXXXXYY where YY is in (A0..FF).

10646/2  Double byte compaction    YYZZ
   6,112 characters of the form 0000YYZZ where YY is in (20..2F, A0..AF), and ZZ is in (20..7F, A0..FF), and
   30,369 characters of the form XXXXYYZZ where YY is in (20..2F, A0..AF), and ZZ is in (20..7F, A0..FF).

10646/3  Triple byte compaction    YYZZWW
   6.9M characters of the form XXYYZZWW where YY, ZZ and WW are in (20..7F, A0..FF).

10646/4  Quadruple byte compaction
   1.3G characters of the form YYZZWWVV where YY, ZZ, WW, and VV are in (20..7F, A0..FF).

10646/5  Dynamic byte-width compaction    YY, YYZZ, YYZZWW, or YYZZWWVV
   Any of the above forms, with mixed shift characters to select between them. (Note: Control codes are always 1-byte in this method, since PAD is disallowed.)

In listing the advantages of 10646 (real or apparent), I come up with the following:

1. Transmissibility

For lack of a better term, I have been using this word to refer to the purpose of disallowing control code bytes. The advantage of 10646 is that current programs that use control code bytes will continue to function with 10646.

A: An interesting point here is that this is restricted to programs that do not look at the contents of the text; if a program scans for content (e.g. looks for 'a' as 61) then it will fail or get the wrong results with both Unicode and 10646/2, 10646/3, 10646/4. If a program scans the content for a-umlaut, then it will also fail with 10646/1 if XXXXXX is not zero. 10646/5 may or may not succeed, depending on the compaction.

However, if a system component or utility does not depend on the content of the text, but is sensitive to control codes, then it will correctly handle 10646/n whereas it would not handle Unicode.

2. Built-in compression

10646/5 incorporates the use of compression methods to allow text from a small range of characters to be compactly stored. Many of the world's alphabets can be transmitted or stored in half the space that Unicode requires.

A: Using more sophisticated compression techniques, both Unicode and 10646 can be compressed to much less than 1 byte/character for small alphabets. 10646's compression technique sacrifices optimal compression for transmissibility.

3. Fixed width form

The most common form of multi-byte set for some time to come will be 2-byte.
10646/2 allows the use of this with a set that covers all non-ideographic characters, and which can be combined with one of a number of ideographic sets to also include one of the major ideographic sets.

A: 10646/2 does not currently allow for all languages simultaneously, since Chinese, Japanese & Korean are coded separately. 10646/5 does allow for them in 2 bytes, but is not fixed width. Even if a later unified ideographic set is defined, 10646/2 only allows a total of 36,481 characters (56% of the Unicode set).

4. Correct collation order

10646 maintains the nationally mandated collation orders of Korean, Japanese and Chinese. No tables are needed for collation.

A: JIS, GB and KSC each provide A mandated collation order, but none of them provides all of the collation orders appropriate to sorting Han characters in any given language. And each produces erroneous results if applied to characters from the others. Not only that, but the JIS sorting order is internally inconsistent between levels 1 and 2, and thus does not match any dictionary order. The others also separate different levels that should be intermixed in any collation order. Raw character coding does not provide accurate collation order (unless you prefer 'a' > 'Z').

5. Expandability

Since 10646 allows up to 1.3G characters, there is always room for future expansion.

A: It is not necessary to have that many characters. Given that, the extra storage (10646/3, 10646/4) or complications of mixed widths (10646/5) are not worth it.

6. Uniqueness

Because 10646 does not include non-spacing diacritics, it provides a unique encoding for accented Roman letters such as a-umlaut.

A: ...to be continued....

If you send me any additions/corrections, I will add them to this list.

--MED

--- End of forwarded message from Mark E Davis
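
The character counts quoted for the compaction forms (95 + 96, 6,112 + 30,369, 6.9M, 1.3G, and the 56% figure) can be checked with a little arithmetic. The sketch below is only a sanity check, and it rests on two assumptions that are inferred from the quoted totals rather than stated in the message: that each byte position has 191 usable values (95 in the low range plus 96 in A0..FF), and that the XXXXYYZZ group uses the lead bytes left over after the 32 named for the 0000YYZZ group. The names used (PER_BYTE, capacity, and so on) are illustrative only.

```python
# Sanity check of the character counts quoted for the 10646 compaction forms.
# Assumption (inferred from the quoted totals, not stated in the message):
# each byte position has 95 usable values in the low range and 96 in the
# high range A0..FF, for 191 usable values per byte.
LOW_VALUES = 95
HIGH_VALUES = 0xFF - 0xA0 + 1            # 96
PER_BYTE = LOW_VALUES + HIGH_VALUES      # 191

# Lead bytes named for the 0000YYZZ group: 20..2F and A0..AF.
ZERO_PREFIX_LEADS = (0x2F - 0x20 + 1) + (0xAF - 0xA0 + 1)   # 32


def capacity(width: int) -> int:
    """Number of codes available in a fixed-width form of `width` bytes."""
    return PER_BYTE ** width


# Two-byte form, split the way the message splits it.
two_byte_zero_prefix = ZERO_PREFIX_LEADS * PER_BYTE          # 6,112
# Hypothetical reading: the XXXXYYZZ group takes the remaining lead bytes.
two_byte_other = (PER_BYTE - ZERO_PREFIX_LEADS) * PER_BYTE   # 30,369

UNICODE_16BIT = 2 ** 16                                      # 65,536

for width in range(1, 5):
    print(f"10646/{width}: {capacity(width):,} codes")

share = 100 * capacity(2) / UNICODE_16BIT
print(f"10646/2 covers about {share:.0f}% of a 16-bit code space")
```

Under these assumptions the totals come out as 191, 36,481, 6,967,871 and 1,330,863,361, which match the figures quoted in the message, and 36,481 out of 65,536 is roughly 56%.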