ISO/IEC SC22/WG15 N664

Date: Wed, 17 Apr 1996 22:11:04 -0700
From: Don.Cragun@eng.sun.com (Don Cragun)
To: posix-dot2@pasc.org
Cc: sc22wg15@dkuug.dk, wg15rin@dkuug.dk
Subject: (wg15-uk 982) (SC22WG15.793) (POSIX.2 96) Proposal for culturally dependent fallback

Keld,

It looks like you sent the enclosed proposal to the sc22wg15@dkuug.dk
email reflector (message sequence number 773) and wg15rin@dkuug.dk
email reflector (message sequence number 323) on March 12th as well as
to the posix-dot2@pasc.org email reflector (message sequence number 96)
on March 28th. This is a response from the IEEE PASC Shell and
Utilities Working Group to the mail sent to posix-dot2. It is not a
United States response to the mail sent to sc22wg15 or wg15rin.

We discussed this proposal during the Shell and Utilities meeting in
Jackson yesterday and today.

During our discussions with you at the St. Petersburg, Florida Shell
and Utilities meeting in October, 1995, we remember you describing a
need to have a way for the iconv utility in P1003.2b Draft 11 to be
able to transform characters found in the conversion stream that do
not appear in the "tocode" codeset into codeset specific strings in a
standard manner. (Draft 11, as well as the X/Open specification on
which it is based, provide for implementation specific actions when
this condition arises.) This proposal, however, seems to go much
farther than this by specifying ways to translate groups of legal
characters into alternative groups of characters depending not only on
codeset, but also on locale, culture, and repertoiremap. We also note
that the iconv utility expects codesets to be specified by the -f and
-t options; not locales. Therefore, we are not sure what problem this
proposal is trying to solve. And, we do not understand how
applications and standard utilities would make use of this information
if it were part of a locale.

As we mentioned in St.
Petersburg, we have found that standards based on existing practice
produce better standards and better implementations than standards
based on specifications "designed by committee". Is there an existing
implementation on which this proposal is based?

Please see further comments and questions below.

Respectfully,
Don Cragun
Chair IEEE Shell and Utilities Working Group

>From owner-posix-dot2@themacs.com Thu Mar 28 14:24 PST 1996
>From: keld@dkuug.dk (Keld J|rn Simonsen)
>X-Charset: ISO-8859-1
>X-Char-Esc: 29
>Mnemonic-Intro: 29
>To: posix-dot2@pasc.org
>Subject: (POSIX.2 96) proposal for culturally dependent fallback
>
>Here is a text I promised at the St Petes meeting. I would like it
>to be discussed at the forthcoming .2b meeting for inclusion in .2b.
>/Keld
>
>Data specification format for transliteration and transcription
>
>In the following a format for describing transliteration and transcription
>is given. The format is intended to be included in POSIX-like locales.
>The format allows for culturally dependent transliteration and transcription
>both dependent on the culture and language it transforms from and the
>culture it transforms into.

There is no indication of how the culture and language for these
transformations would be determined by a standard utility or an
application. For the iconv utility, if a -f option is specified, no
locale is identified for the "from" or "to" codesets. (The iconv
utility does use the current locale if no -f option is specified, but
only to identify the codeset to convert from.)

>
>It was considered whether a more elaborate transcription could be
>specified, but it was recognized that beyond the facilities described
>here the transcription specification should be based on a database.
>
>
>Transformation of characters, suitable for fallback in coded
>character set conversion, transliteration and simple transliteration
>can be specified with the following syntax in the LC_TRANS section
>of the locale:

This adds a new LC_TRANS section to locales as well as adding the new
keywords. This also implies a change in environment variables to some
of the standard utilities and a change to the way the setlocale()
function behaves beyond what is specified in the ISO C and POSIX.1
standards. Since there are no programmatic interfaces to make use of
this new section of a locale, it could not be used by portable
applications.

Based on our discussions in St. Petersburg, we expected this proposal
to provide both updates to POSIX.2 and corresponding functions to be
added to POSIX.1. Now that we understand the scope of this proposal,
we believe it would be more appropriate to sponsor a new PAR for the
combined POSIX.1 and POSIX.2 work than to try to fit this small part
of it into P1003.2b.

>
>The following keywords shall be recognized in the transformation
>definition. They are described in detail in the following subclauses.
>
>transform_start The name of the culture to transform from, if no culture is
>                specified
>                the transformation is the default transformation. The "transform"
>                keyword is followed by one or more transformation statements assigning
>                character transformation values to transforming elements, and
>                include statements copying transformation specifications from
>                other locales.
>
>transform_end   The end of the transformation statements.
>
>include         The name of the locale in text form and culture to transform from
>                and the repertoiremap for the locale to be used for the definition
>                of this category. Other specifications may follow to replace
>                specification of the copied locale. This keyword is optional.
>
>Transform_start keyword
>
>The "transform_start" keyword shall precede transformation statements and
>"include" statements.
>It defines the culture to be transformed from.
>
>The syntax of the "transform_start" keyword shall be:
>
> "transform_start %s\n",
>
>If no operand is given, this is the default transformation.

This means that a single locale definition file can be used to define
transformations for several cultures. There is no indication of how
utilities would determine which cultural transformation should be
used.

>
>Transform_end keyword
>
>The transformation entries shall be terminated by the "transform_end"
>keyword.

If, indeed, there are multiple "transform_start" keywords in a locale
definition file, do you propose that each be followed by a
"transform_end" keyword; or is there one "transform_end" at the end?

>
>Transformation statements
>
>The "transform_start" keyword may be followed by transformation identifier
>entries. The syntax for the transformation identifier entries is:
>
> "%s %s;%s;...;%s\n",,,
> ,...
>
>Each shall consist of one or more characters (in
>any of the forms defined in
>POSIX-2 2.5.5
>).

It is unclear what is being proposed here. First, there is no section
2.5.5 in POSIX.2 or in P1003.2b. Is the just an identifier, or is it a
string of characters to be transformed? If it is an identifier, when
is the identifier used to identify characters to be transformed? If it
is a string of characters to transform, what context determines when
the string should be transformed? No quoting is shown for s in this
syntax, but you use quoting in some of your examples. Please clarify.

>
>The order the transformation-strings is defined in defines the precedence
>of transformations, the first transformation-string that satisfies the
>transformation by for example having characters that are all in the coded
>character set that is transformed into and having the desired string length,
>is chosen.

How does an implementation determine the "desired string length"?
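
For illustration only, the precedence rule quoted above can be read as
a first-match search over the listed alternatives. The Python sketch
below is one possible reading, not part of the proposal and not taken
from any existing implementation. The function names, the sample
fallback list, and the use of Python codec names to stand in for the
"tocode" codeset are all assumptions made for this example. The
"desired string length" test is left out because the proposal does not
say how it would be determined.

```python
from typing import Iterable, Optional


def encodable(text: str, codeset: str) -> bool:
    """Return True if every character of text can be encoded in codeset."""
    try:
        text.encode(codeset)
        return True
    except UnicodeEncodeError:
        return False


def pick_fallback(candidates: Iterable[str], codeset: str) -> Optional[str]:
    """Return the first candidate fully representable in codeset.

    Candidates are tried in the order given, which models the
    "order defines the precedence" wording of the proposal.
    Returns None when no candidate fits; the proposal does not say
    what should happen in that case.
    """
    for candidate in candidates:
        if encodable(candidate, codeset):
            return candidate
    return None


# Hypothetical fallback list for LATIN SMALL LETTER AE, in precedence order:
# a with diaeresis, Greek epsilon, the two letters "ae", and the string "<e>".
AE_FALLBACKS = ["\u00e4", "\u03b5", "ae", "<e>"]
```

Under these assumptions, pick_fallback(AE_FALLBACKS, "latin_1") selects
a with diaeresis, "iso8859_7" selects Greek epsilon, and "ascii"
selects the two letters "ae". The final "<e>" entry is never reached
for these codesets, which is the concern raised about the example
further below.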
>
>If more than one transformation statement is given for a given
> this is an error, unless the C-option
>is given - then a warning is given and the last transformation statement
>is assumed.

The "C-option" to what? (The localedef utility does not take a -C
option and the iconv utility does not take a -C option.)

>
>A transformation statement may be terminated by a trailing
> followed by a number of characters and a
>character.

Literally, this paragraph requires that a transformation statement
with a comment be followed by an empty line. We understand that if we
add this feature, we need to allow comments at the end of
transformation statements as well as comment lines.

>
>Example:
>
> ;;;""
> ;
>
>
>The first line defines a number of transformations for the LATIN LETTER AE,
>including into LATIN LETTER A WITH DIAERESIS, GREEK LETTER EPSILON,
>the two Latin letters A and E, and finally the LATIN LETTER E.

We find this example confusing. We believe that the intent is that if
the Latin Letter ae is found in the input and the destination codeset
does not contain ae, it is to be converted to the first one of the
following that exist in the destination codeset: Latin Letter a with
diaeresis; Greek Letter epsilon; the two characters Latin Letter a and
Latin Letter e; or the three characters Less Than Sign, Latin Letter
e, and Greater Than Sign. Are there any codesets that contain the
Latin Letter e that do not also contain the Latin Letter a? If not,
why would the last in this example ever be used? Furthermore, what is
supposed to happen if none of the conversions can be performed with
the destination codeset?

>
>The second line defines transformation of the LATIN LETTER S into
>GREEK LETTER SIGMA, and CYRILLIC LETTER ES.

From the above discussions about culture, how would an application
determine whether Greek was appropriate in the output if the
destination codeset supported both Greek and Cyrillic?
Would an application use the first one supported by the target
codeset, or would a "target culture" prevent some transformations from
taking place?

>
>The 3rd line transforms the two Latin letters K and O into the
>Japanese Hiragana character KO

How would a standard utility or application determine the context in
which these two characters should be transformed into KO? This seems
to have gone from "transformation of characters that do not appear in
a destination codeset" to "translation of character sequences" or even
"language translation".

>
>
>
>Include keyword
>
>The "include" keyword specifies a set of transformation statements in
>text form to be included in the current transformation.
>
>The syntax of the "include" statement is:
>
> "include %s;%s;%s\n",,,
>
> is a string identifying the locale to be included from.
>
> is a string identifying the repertoiremap used
>in the locale being included, and is used to map character specifications
>from the locale into the current locale.
>
> specifies the transformation specification
>in the transformation section of the included locale, where the
>transformation specification with the same is included.
>This operand is optional, and if omitted, the default transformation
>is included.

This requires an implementation to derive the source that was used to
define a compiled locale and base that source on a particular
repertoiremap and culture. There are no plans to provide this reverse
engineering in the standard. Finally, even if there were, it is not
clear how an implementation would determine which culture or
repertoiremap to use.