From suehiro@jrd.dec-j.co.jp Fri Sep 22 17:41:11 1995
Received: from gatekeeper.dec-j.co.jp (gatekeeper.dec-j.co.jp [202.34.226.2]) by dkuug.dk (8.6.12/8.6.12) with ESMTP id RAA16680; Fri, 22 Sep 1995 17:40:46 +0200
Received: by gatekeeper.dec-j.co.jp (8.6.12+usagi/JNET-GW-940327.1); id AAA28456; Sat, 23 Sep 1995 00:40:30 +0900
Received: from cobra.jrd.dec.com by garfield.jrd.dec.com (8.6.12+usagi/JULT-4.4-gar) id AAA14955; Sat, 23 Sep 1995 00:40:21 +0900
Received: from localhost by cobra.jrd.dec.com (5.65v3.0/JOSF-3.0-cobra) id AA32604; Sat, 23 Sep 1995 00:42:06 +0900
Message-Id: <9509221542.AA32604@cobra.jrd.dec.com>
To: sc22wg15rin@dkuug.dk
Cc: sc22wg15@dkuug.dk
Subject: Input from Japan to POSIX.2b (LC_COLLATE collation weight name)
Date: Sat, 23 Sep 1995 00:42:06 +0900
From: Yoichi Suehiro

Attached is input from Japan to POSIX.2b concerning an LC_COLLATE extension. This topic is present in the RIN issues list (SC22WG15RIN.308).

| 2. localedef user-specified collation weight names [Open]

The proposal is not complete yet. Japan would like to have a preliminary discussion among RIN members before including this item in the ballot comments.

regards,
Yoichi Suehiro

===========================================================================
Source: Japan
Title: Japanese proposal to POSIX.2b on LC_COLLATE extension for user-specified names of collation weights
Status: Japanese position

Short description:
Japan proposes to extend the LC_COLLATE locale definition in POSIX.2b so that names can be assigned to collation weights. This proposal responds to item (4) of ISO/IEC 9945-2:1993 Annex H.1, in which a proposal from Japan is expected.

Text of contribution:
----------------------------------------------------------------------------
[Note: The page numbers refer to those of P1003.2/D10.]

Sect 2.5.2.2.3 (LC_COLLATE) PROPOSAL. page 10:

Problem:
========

1. General Requirements

For ideographic characters, it is in most cases a requirement that a user be able to specify whatever combination of collation weights he/she wants.

Japanese kanji characters, for example, have five (or more) typical collation weights to support Japanese SORT. The five weights are On-yomi (pseudo-Chinese pronunciation), Kun-yomi (Japanese pronunciation), Number of strokes, Radical (components of Kanji), and Kanji character code.

There are many possible combinations of these weights, and the requirements for them (number and order of weights) may change according to the type of data sorted, the purpose of sorting, the user's preference, etc. Users (or applications) want to specify the method of sorting by specifying the primary weight, the secondary weight, and so on.

Because no names are available for combinations of multiple weights, it is a reasonable requirement that users be able to use the name of each individual collation weight to specify the method of collation. That is the way in which most sorting utilities existing in Japan are implemented.

The concept of each weight for kanji characters mentioned above is common knowledge among Japanese users. However, there are no standards for the weights of Japanese kanji characters, so the details of assigning weights can differ slightly among implementations, depending on which information source (dictionary, etc.) is used to produce the weights. It is difficult to handle such differences with a pre-defined sorting method. If each weight can be handled independently, the differences will be easier to manage.

ISO 10646 (UCS) is now a standard. UCS can be used as a codeset for any locale whose character sets are included in it. Even though UCS can be used in many different countries, the requirements for sorting characters differ country by country. The size of locale databases is a concern when UCS is used.
It is a requirement that the above kanji sorting requirements can be met without difficulty when UCS is used as a codeset.

2. Problem in using current POSIX.2 standards specification

The current locale model seems to assume a single well-defined collation definition for each locale. However, this does not match the requirements for sorting ideographic characters.

There is an opinion that it is not totally impossible to implement most (not all) of the above requirements under the current .2 specification. Producing a locale for every possible combination of weights, and naming each such locale, is the possible solution based on the existing standards specification. Besides not being a complete solution, this approach seems impractical on the following points.

a. Size of locale databases

There are about 12,000 kanji characters defined in JIS standards (JIS X0208 + JIS X0212). Because each possible combination of available weights needs its own database, the total size of locale databases containing such a large number of characters cannot be ignored (for example, 12,000 characters x 20 databases). When a locale for the ISO 10646 code set is defined, the problem will be more serious.

b. Identification of each collation method

"Onyomi", "Kunyomi", etc. are well-known names for methods of sorting kanji characters. However, no names are available for combinations of these primitive methods. Implementors would need to invent new names for the combined methods (for example, onyomi_strokes_radical, kanji0102, etc.). The possibility that standard or de facto standard names for these combinations will emerge is very low. Hence, this approach will not be portable.

Considering these problems, without extending the current specification of LC_COLLATE, a standard collation API such as wcscoll can support only limited ways of collating kanji data, for example by JIS code values.
In this situation, applications which handle character orderings (for example, database applications) cannot rely on locale databases to sort kanji data. Some applications will support several collating methods by having their own ordering databases. Some applications will simply neglect the various sorting requirements for Kanji.

3. Overview of LC_COLLATE proposal

By extending the LC_COLLATE specification, a single locale database can hold multiple weight definitions for kanji, each with its own name. It is envisioned that the order of the weights can be specified at run time, and that this order may differ from the order of the operands to the order_start keyword. To make the different order effective, an extension of another part of the POSIX standards may be necessary. The weight names specified in the database should be referenced by a user or an application, and the behavior of the collation API needs to be modified according to the specified sorting method.

The proposal for allowing users to specify collation methods is expected to work as follows.

a. Define collation weights with names in LC_COLLATE

Define collation weights with names in the locale database.

EXAMPLE
    order_start forward,name="kunyomi";forward,name="radical"
     ;
     ;
    : :
    order_end

b. Specify sorting methods

There are two possible extensions to specify the preferred collation. One is to introduce a new environment variable (b.1), and the other is to use LC_COLLATE (b.2).

b.1 Set the environment variable COLLWEIGHTS to the preferred collation combination, using names defined in the locale database.

EXAMPLE
    COLLWEIGHTS=radical,kunyomi
    (Primary weight=radical, Secondary weight=kunyomi)

b.2 Alternatively, the existing LC_COLLATE environment variable can be used to specify the user's preference. The weight names are specified after the "@weights=" modifier string.

EXAMPLE
    LC_COLLATE=ja_JP.eucJP@weights=radical,kunyomi

c. Initialize collation data

There are two possible extensions to set collation methods at run time.
One is to introduce a new API (c.1), and the other is to use setlocale() (c.2).

c.1 A call to setweights() initializes the collation method from the setting of the COLLWEIGHTS environment variable. The setweights function can be used to change the method of collation at run time.

c.2 A call to setlocale(LC_ALL, "") initializes the collation method from the setting of the COLLWEIGHTS (or LC_COLLATE) environment variable. The setlocale function can be used to change the method of collation at run time.

d. API behavior

Collation APIs such as wcscoll operate according to the current setting of the collation method.

The details of the proposal for the extended use of environment variables and for initialization by API are not decided yet. The proposed extension to the locale definition file is described below. The detailed proposals for the other parts are not ready yet.

4. Proposal for POSIX.2b LC_COLLATE locale definition file

Proposal: [LC_COLLATE extension for specifying weight name]
===========================================================

The LC_COLLATE part of the localedef specification should allow a user to give names to the weights.

=> 2.5.2.2.3 order_start Keyword.

Add the following directive description and the Example.

    It is implementation defined whether the following optional directive shall be recognized. If it is not supported, but present in a localedef source, it shall be ignored.

    name    specifies the name of a collation weight by a string. An order of weights may be specified by using the name at run time.

    The syntax for the name directive shall be:

        "name = \"%s\"",

    Example:

        order_start forward,name="kunyomi";forward,name="radical"

    If an operand has a name directive, the definition of the primary, secondary, or subsequent weights for the collation element may be different from the order of operands to the order_start keyword.

=> 2.5.3.2 Locale Grammar.
Modify the opt_word description as follows:

    opt_word    : 'forward' | 'backward' | 'position'
                | 'name' '=' weight_name
                ;
    weight_name : '"' char_list '"'

[END]
----------------------------------------------------------------------------

[Attachment : Example]
===================================

Possible LC_COLLATE definition
==============================

# Stroke
collating-symbol <3stroke>
collating-symbol <4stroke>
collating-symbol <6stroke>
collating-symbol <7stroke>
collating-symbol <10stroke>
# Onyomi
collating-symbol
collating-symbol
collating-symbol
collating-symbol
# Radical
collating-symbol
collating-symbol
collating-symbol

order_start forward,name="stroke";forward,name="onyomi";\
            forward,name="radical";forward,name="JISnumber"
 <10stroke>;;;
 <6stroke>;;;
 <7stroke>;;;
 <4stroke>;;;
 <3stroke>;;;

Changing the order by assigning values to LC_COLLATE (b.2 method)
====================================================

LC_COLLATE=ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber

Behavior of collation functions
===============================

Output from weights=stroke,onyomi,radical,JISnumber (default)
    < < < <

Output from weights=radical,onyomi,stroke,JISnumber
    < < < <

[END]
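
The run-time reordering described in section 3 and in the attachment can be sketched roughly as follows. This is only a minimal illustration in Python, not a proposed interface. The element names K1..K5, the table layout and all function names are hypothetical. The stroke counts follow the attachment; the onyomi, radical and JISnumber values are invented for illustration. The sketch also compares single collating elements only, whereas a real implementation would compare whole strings level by level.

```python
# Names of the weight levels, in the order of the order_start operands
# (taken from the attachment example).
WEIGHT_NAMES = ("stroke", "onyomi", "radical", "JISnumber")

# Hypothetical weight table: one weight per named level for each element.
# Only the stroke counts come from the attachment; the rest is invented.
TABLE = {
    "K1": {"stroke": 10, "onyomi": 2, "radical": 3, "JISnumber": 1},
    "K2": {"stroke": 6, "onyomi": 1, "radical": 1, "JISnumber": 2},
    "K3": {"stroke": 7, "onyomi": 3, "radical": 2, "JISnumber": 3},
    "K4": {"stroke": 4, "onyomi": 4, "radical": 1, "JISnumber": 4},
    "K5": {"stroke": 3, "onyomi": 2, "radical": 2, "JISnumber": 5},
}


def weights_from_locale(value):
    """Return the text after '@weights=' in an LC_COLLATE-style value (b.2).

    Without the modifier, fall back to the order_start operand order.
    """
    _, sep, tail = value.partition("@weights=")
    return tail if sep else ",".join(WEIGHT_NAMES)


def parse_weights(value):
    """Split a COLLWEIGHTS-style value (b.1) into a list of weight names."""
    names = [n.strip() for n in value.split(",") if n.strip()]
    for name in names:
        if name not in WEIGHT_NAMES:
            raise ValueError("unknown weight name: " + name)
    return names


def collate(elements, weights_value):
    """Sort elements with the named weights as primary, secondary, ... keys."""
    names = parse_weights(weights_value)
    return sorted(elements, key=lambda e: tuple(TABLE[e][n] for n in names))


if __name__ == "__main__":
    default = "ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber"
    changed = "ja_JP.eucJP@weights=radical,onyomi,stroke,JISnumber"
    print(collate(list(TABLE), weights_from_locale(default)))
    print(collate(list(TABLE), weights_from_locale(changed)))
```

The point of the sketch is that one table holds every named weight once, and only the list of names changes between the two sort orders, so no separate database is needed per combination.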