From erik@sran8.sra.co.jp Wed Jul 24 12:15:12 1991
Received: from mcsun.EU.net by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
    id AA00976; Wed, 24 Jul 91 12:15:12 +0200
Received: from srawgw.sra.co.jp by mcsun.EU.net with SMTP;
    id AA23068 (5.65a/CWI-2.100); Wed, 24 Jul 91 12:14:59 +0200
Received: from srava.sra.co.jp by srawgw.sra.co.jp (5.64WH/1.4)
    id AA10679; Wed, 24 Jul 91 19:14:01 +0900
Received: from sran8.sra.co.jp by srava.sra.co.jp (5.64b/6.4J.6-BJW)
    id AA16204; Wed, 24 Jul 91 19:14:22 +0900
Received: from localhost by sran8.sra.co.jp (5.65/6.4J.6-SJ)
    id AA12721; Wed, 24 Jul 91 19:12:08 +0900
Return-Path:
Message-Id: <9107241012.AA12721@sran8.sra.co.jp>
Reply-To: erik@sra.co.jp
From: erik@sra.co.jp (Erik M. van der Poel)
To: karels@okeeffe.Berkeley.EDU
Cc: wg15rin@dkuug.dk
Subject: Re: 1003.2 D11.1 resolution on internationalization
Date: Wed, 24 Jul 91 19:11:49 +0900
Sender: erik@sran8.sra.co.jp
X-Charset: ASCII
X-Char-Esc: 29

> To: xojig@xopen.co.uk, wg15rin@dkuug.dk
> From: Donn Terry
>
> ------- Forwarded Message
>
> From: karels@okeeffe.Berkeley.EDU (Mike Karels)
> To: ballot2@okeeffe.Berkeley.EDU
> Cc: pc@hillside.co.uk, posix.2@mks.com, rabin@osf.org, kuro@corp.sun.com,
>     seth@attunix.att.com
> ...
> Because of the newness and complexity of the locale issues, I strongly
> suspect that most of the balloting group is not looking at these sections
> very closely.

I sympathize with this. Previously, I made a related comment that the
new i18n stuff (not the old i18n stuff) should probably be put somewhere
else (not in a draft standard), so that i18n'ers could play with it for a
while. Nobody responded to my comment, however.

> 1. The presence of full regular-expression-based substitution within
> the collation rules has not been justified by anything other than claims
> that various groups consider it a requirement.
> The only examples that
> have been provided (mapping Mc to Mac) don't work correctly, and no
> technical requirement based on internationalization has been given.
> (Objection 068-12)

I have an Action Item from the ISO POSIX i18n group (wg15rin) to
investigate whether or not such substitutions are necessary.

Some time ago, I wrote a couple of collation tables for Japanese. Kanji
collation is impossible (since there are multiple pronunciations and the
user must indicate which pronunciation is intended), so my tables
attempted to sort Kana, which are phonetic. (I hope you understand this;
if not, I would be happy to elaborate in private email.)

The Kana include characters that can be placed after certain other
characters, and have a special meaning. For example, the "prolonged
sound mark" prolongs the sound of the previous character, e.g. ka-
(that's followed by the mark <->) would be "pronounced" as kaa, and
sorted that way, too.

As far as I can see, there are two ways to deal with this in collation
tables. One way is to use substitutions:

    substitute "\([\ ]\)<-6>" with "\1"

The <-6> is the prolonged sound mark.

Another way is to use collating elements. For *every* combination of a
normal character with the prolonged sound mark, we declare two collating
elements, e.g.:

    collating-element <*-A6-6> from <-6>
    collating-element <*-A6A6> from

I'm not sure which method is better. I don't have any implementation to
test these. Maybe one is faster than the other, but takes up more space,
or whatever. The collating element method uses up far more space in the
collation table itself (i.e. source to feed to localedef), but perhaps
it's quite fast, I simply don't know.

These remarks are not very helpful, I know, but I thought I should
contribute my thoughts anyway.

> 2. The regular expression matching and non-matching lists (bracket
> expressions) can match multicharacter collating elements as well as
> (single- or multi-byte) characters.
> The examples all list character
> combinations which should be treated as a single character such as
> or . I claim that these examples are all solved more correctly
> by treating those two-byte sequences as multibyte characters rather
> than two-character collating elements, and then everything works
> as expected without this modification to bracket expressions.
> Otherwise, this will lead to surprises for all; even the Germans won't
> expect [p-t] to include , just because is defined as a collating
> element in an equivalence class with . Worse yet, the regular
> expression [^s] would match ! (Objection 068-19)

No comment. (I am not able to comment.)

> 3. Although ranges in matching lists (such as [p-t]) are noted as
> inherently non-portable and are prohibited to Strictly Conforming
> applications, they have been extended to allow equivalence classes
> as endpoints. This isn't even well-defined if the members of the
> equivalence class are not adjacent in the collation sequence.
> (There is no requirement in this draft that members of an equivalence
> class be contiguous in the collation sequence; a locale collation
> could contain, for example,
>
>     ;
>     ;
>     ...
>     ;
>     ;
>     ;
>     ...
>     ;
>
> in which case [[=a=]-d] isn't well-defined.) (Objection 068-21)

I haven't thought about this for very long, but I would tend to agree
with you here.

Regards,
EvdP
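
A rough sketch of the substitution idea for the prolonged sound mark,
written in Python rather than localedef syntax. It assumes a romanized
stand-in for Kana in which "-" plays the role of the mark and the vowel
letter before it is the sound being prolonged; the names PROLONGED and
collation_key are hypothetical and do not come from any draft or
implementation.

```python
import re

# Assumption: romanized stand-in for Kana. "-" is the prolonged sound
# mark, and the vowel letter before it is the sound it prolongs.
PROLONGED = re.compile(r"([aiueo])-")


def collation_key(word: str) -> str:
    """Return the string to compare on: vowel + mark becomes a doubled vowel."""
    return PROLONGED.sub(r"\1\1", word)


words = ["ko-ri", "kobe"]

# Plain byte order puts "-" before every letter.
plain = sorted(words)                     # ['ko-ri', 'kobe']

# With the substitution, "ko-ri" is compared as "koori".
by_sound = sorted(words, key=collation_key)   # ['kobe', 'ko-ri']
```

The collating-element alternative would instead treat every
"character plus mark" pair as one unit with its own position in the
table, which is why it needs an entry per combination.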