From ns@slab.sfc.keio.ac.jp Thu Oct 19 12:04:33 1995
Received: from slab.sfc.keio.ac.jp (klein.slab.sfc.keio.ac.jp [133.27.68.16]) by dkuug.dk (8.6.12/8.6.12) with ESMTP id MAA25214 for ; Thu, 19 Oct 1995 12:03:44 +0100
From: ns@slab.sfc.keio.ac.jp
Received: from meteor (meteor.slab.sfc.keio.ac.jp [133.27.68.37]) by slab.sfc.keio.ac.jp (8.6.10+2.5Wb1/3.4Wbeta2-95033021) with SMTP id UAA08271 for ; Thu, 19 Oct 1995 20:03:16 +0900
Received: from localhost by meteor (4.1/6.4J.6-slab-slave1.0) id AA16248; Thu, 19 Oct 95 20:03:15 JST
Message-Id: <9510191103.AA16248@meteor>
To: sc22wg15@dkuug.dk
Subject: Japanese National Body Report for WG15
Date: Thu, 19 Oct 95 20:03:14 +0900

To whom it may concern:

The following is the Japanese National Report Draft for the Action Items.

-----------------------------------------------------------------------
SC22/WG15 N___

Japan Action Item Report to WG15, October 1995

9505-01 All member bodies - Report progress on national profiles at
        the October meeting.

        Japan has not yet drafted any work on the JNP.

9505-27 Japan - Identify input Japan has provided on P1003.2b for
        pp 8,10 and provide to the US prior to July 1, 1995.

        The following is the Japanese input to the above items listed
        in P1003.2b.

===============================================================================
(1) State-dependent encoding for characters

Source: Japan
Title: Japanese input to POSIX.2b on state-dependent character encodings
Status: Japanese position

Short description:
    Japan was expected to propose an extension of the charmap syntax
    to support state-dependent character encodings for POSIX.2b.
    However, given the change in requirements for using
    state-dependent character encodings on POSIX systems, Japan has
    decided not to pursue this extension of the POSIX.2 standard.

Text of contribution:
----------------------------------------------------------------------------
Japan was assigned an action item to propose an extension of the
charmap syntax for supporting state-dependent encodings.

When Japan raised the issue of state-dependent encoding support in
ISO/IEC 9945-2, ISO/IEC 2022, a typical state-dependent encoding, was
the only internationally standardized code extension technique for
including multiple scripts (multi-lingual text) in a character stream
or character string.

However, since ISO/IEC 10646-1 became available in 1993 and UTF-8 is
now being standardized, users of POSIX standards have an alternative
way to represent multi-script/multi-lingual text without using
state-dependent encodings. And that must be the approach that POSIX
standards will endorse.

Therefore, Japan believes that the need to support state-dependent
encodings on POSIX systems is now very small. It would be difficult to
get support from vendors and users for any proposed extension on this
topic.

Considering the above situation, Japan proposes not to pursue the
extension for supporting state-dependent encodings.

(2) LC_COLLATE extension for user-specified names of collation weights

Source: Japan
Title: Japanese proposal to POSIX.2b on LC_COLLATE extension for
       user-specified names of collation weights
Status: Japanese position

Short description:
    Japan proposes to extend the LC_COLLATE locale definition in
    POSIX.2b so that names can be assigned to collation weights.
    This proposal responds to item (4) of ISO/IEC 9945-2:1993
    Annex H.1, in which a proposal from Japan is expected.

Text of contribution:
----------------------------------------------------------------------------
[Note: The page numbers refer to the ones of P1003.2/D10.]

Sect 2.5.2.2.3 (LC_COLLATE) PROPOSAL. page 10:

Problem:
========

1.
General Requirements

In most cases involving ideographic characters, a user needs to be
able to specify the combination of collation weights as he/she wants.
Japanese kanji characters, for example, have five (or more) typical
collation weights to support Japanese SORT. The five weights are
On-yomi (pseudo-Chinese pronunciation), Kun-yomi (Japanese
pronunciation), Number of strokes, Radical (components of Kanji), and
Kanji character code.

There are many possible combinations of these weights, and the
requirements for them (number and order of weights) may change
according to the type of data sorted, the purpose of sorting, the
user's preference, etc. Users (or applications) want to specify the
method of sorting by specifying the primary weight, the secondary
weight, and so on. Because no names are available for combinations of
multiple weights, it is a reasonable requirement that users be able to
use the name of each collation weight to specify the method of
collation. That is how most sorting utilities existing in Japan are
implemented.

The concept of each weight for kanji characters mentioned above is
common knowledge in Japan. However, there are no standards for the
weights of Japanese kanji characters, so the details of assigning
weights can differ slightly among implementations, depending on which
information source (dictionary, etc.) is used to make the weights. It
is difficult to handle such differences with a pre-defined sorting
method. If each weight can be handled independently, they will be
easier to manage.

ISO 10646 (UCS) is now a standard. UCS can be used as a codeset for
any locale whose character sets are included in it. Even though UCS
can be used in many different countries, the requirements for sorting
characters differ from country to country. The size of locale
databases is a concern when UCS is used.
It is a requirement that the above kanji sorting requirements can
still be met without problems when UCS is used as a codeset.

2. Problem in using current POSIX.2 standards specification

The current locale model seems to assume a single well-defined
collation definition for each locale. However, this does not match the
requirements for sorting ideographic characters.

There is an opinion that the current .2 specification does not make it
totally impossible to implement most (not all) of the above
requirements. Producing locales for all possible combinations of
weights, and naming each locale, is the possible solution based on the
existing standards specification. Besides not being a complete
solution, this approach does not seem practical on the following
points.

a. Size of locale databases

   There are about 12,000 kanji characters defined in JIS standards
   (JIS X0208 + JIS X0212). Because each possible combination of
   available weights needs its own database, the total size of locale
   databases containing such a large number of characters cannot be
   ignored (for example, 12,000 characters x 20 databases). When a
   locale for the ISO 10646 code set is defined, the problem will be
   more serious.

b. Identification of each collation method

   "Onyomi", "Kunyomi", etc. are well-known names for methods of
   sorting kanji characters. However, no names are available for
   combinations of these primitive methods. Implementors need to
   invent new names for the combined methods (for example,
   onyomi_strokes_radical, kanji0102, etc.). The possibility of a
   standard or de facto standard emerging for the names of these
   combinations is very low. Hence, this approach will not be
   portable.

Considering these problems, without extending the current
specification of LC_COLLATE, a standard collation API such as wcscoll
can support only limited ways of collating kanji data, for example by
JIS code values.
In this situation, applications which handle character orderings (for
example, database applications) cannot rely on locale databases to
sort kanji data. Some applications will support several collating
methods by having their own ordering databases. Some applications will
simply neglect the various sorting requirements for Kanji.

3. Overview of LC_COLLATE proposal

By extending the LC_COLLATE specification, a single locale database
can define multiple named definitions of weights for kanji. It is
envisioned that the order of multiple weights can be specified at run
time in an order different from the order of operands to the
order_start keyword. To make the different order effective, an
extension of another part of the POSIX standards may be necessary. The
weight names specified in the database should be referenced by a user
or an application, and the behavior of the collation API needs to be
modified according to the specified sorting method.

The proposal for allowing users to specify collation methods is
expected to work as follows.

a. Define collation weights with names in LC_COLLATE

   Define collation weights with names in the locale database.

   EXAMPLE
      order_start forward,name="kunyomi";forward,name="radical"
      ; ; : :
      order_end

b. Specify sorting methods

   There are two possible extensions to specify the preferred
   collation. One is to introduce a new environment variable (b.1),
   and the other is to use LC_COLLATE (b.2).

   b.1 Set the environment variable COLLWEIGHTS to the preferred
       collation combination using names defined in the locale
       database.

       EXAMPLE
          COLLWEIGHTS=radical,kunyomi
          (Primary weight=radical, Secondary weight=kunyomi)

   b.2 Alternatively, the existing LC_COLLATE environment variable
       can be used to specify the user's preference. The weight names
       are specified after the "@weights=" modifier.

       EXAMPLE
          LC_COLLATE=ja_JP.eucJP@weights=radical, kunyomi

c. Initialize collation data

   There are two possible extensions to set collation methods at run
   time.
   One is to introduce a new API (c.1), and the other is to use
   setlocale() (c.2).

   c.1 A call to setweights() initializes the collation method from
       the setting of the COLLWEIGHTS environment variable. The
       setweights function can be used to change the method of
       collation at run time.

   c.2 A call to setlocale(LC_ALL, "") initializes the collation
       method from the setting of the COLLWEIGHTS (or LC_COLLATE)
       environment variable. The setlocale function can be used to
       change the method of collation at run time.

d. API behavior

   Collation APIs such as wcscoll work according to the current
   setting of the collation method.

The details of the proposal for the extended use of environment
variables and for initialization by API have not been decided yet.
The proposed extension to the locale definition file is described
below. Detailed proposals for the other parts are not ready yet.

4. Proposal for POSIX.2b LC_COLLATE locale definition file

Proposal: [LC_COLLATE extension for specifying weight name]
===========================================================

The LC_COLLATE part of the localedef specification should allow a
user to give names to the weights.

=> 2.5.2.2.3 order_start Keyword. Add the following directive
   description and the Example.

   It is implementation defined whether the following optional
   directive shall be recognized. If it is not supported, but present
   in a localedef source, it shall be ignored.

   name      specifies the name of a collation weight by a string.
             An order of weights may be specified by using the name
             at run time. The syntax for the name directive shall be:

                "name = \"%s\"",

   Example:
      order_start forward,name="kunyomi";forward,name="radical"

   If an operand has a name directive, the definition of the primary,
   secondary, or subsequent weights for the collation element may be
   different from the order of operands to the order_start keyword.

=> 2.5.3.2 Locale Grammar.
   Modify the opt_word description as follows:

   opt_word    : 'forward'
               | 'backward'
               | 'position'
               | 'name' '=' weight_name
               ;

   weight_name : '"' char_list '"'

[END]
----------------------------------------------------------------------------
[Attachment : Example]
===================================

Possible LC_COLLATE definition
==============================

# Stroke
collating-symbol <3stroke>
collating-symbol <4stroke>
collating-symbol <6stroke>
collating-symbol <7stroke>
collating-symbol <10stroke>
# Onyomi
collating-symbol
collating-symbol
collating-symbol
collating-symbol
# Radical
collating-symbol
collating-symbol
collating-symbol

order_start forward,name="stroke";forward,name="onyomi";\
            forward,name="radical";forward,name="JISnumber"
<10stroke>;;;
<6stroke>;;;
<7stroke>;;;
<4stroke>;;;
<3stroke>;;;

Changing the order by assigning values to LC_COLLATE (b.2 method)
====================================================

LC_COLLATE=ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber

Behavior of collation functions
===============================

Output from weights=stroke,onyomi,radical,JISnumber (default)
   < < < <

Output from weights=radical,onyomi,stroke,JISnumber
   < < < <

[END]
=============================================================================

9505-40 All Member Bodies - If any serious defects are discovered in
        the document in 9505-39, notify Lowell Johnson directly prior
        to June 5. The US may use these comments as a basis for not
        taking the document forward as a US standard.

        Japan has no comment on this document.

9505-47 Member bodies - Review documents listed in 9410-26.
        (open action item 9410-27)

        These are . P1003.2b Shell and Utilities - Extensions:
        Draft 10, and we have already received Draft 11. The items
        listed in Annex H of 9945-2:1993 were removed from Draft 11,
        but the input from Japan on them is given in the response to
        Action Item 9505-27 as shown above. In addition to these
        items, Japan proposes the following comment.
===========================================================================
Source: Japan
Title: Japanese proposal to POSIX.2b on LC_CTYPE extension for
       locale-specific character mapping
Status: Japanese position

Short description:
    Japan proposes that the LC_CTYPE locale definition be extended to
    allow locale-specific character mappings to be specified. This
    extension is necessary to implement the wctrans() and towctrans()
    functions of the ISO C amendment on a POSIX conforming system.

Text of contribution:
----------------------------------------------------------------------------
[Note: The page numbers refer to the ones of P1003.2/D10.]

Sect 2.5 (Locale) PROPOSAL. Page 8-9,12:

Problem:

The LC_CTYPE (2.5.2.1) locale definition should be enhanced to allow
user-specified additional character mappings, similar in concept to
the user-specified additional character classes. The Amendment to the
ISO C standard specifies extended character mapping functions
(wctrans/towctrans). The proposed extension below provides the
machinery to define the locale-specific character mappings used by
these functions. Without this extension, POSIX conforming systems need
their own extensions to implement the ISO C Amendment specification.

Proposal: [LC_CTYPE extension for specifying character mapping]

The proposed extension for character mapping is similar to the
extension for character class, which is already specified in the .2b
draft. A new keyword 'charconv' is introduced to define
locale-specific character mappings, in the same way that the
'charclass' keyword does for character classes. This proposal does not
extend the way of defining a character mapping. The same specification
as for the toupper/tolower mappings can be used for locale-specific
character mappings.

EXAMPLE:
   LC_CTYPE
   # define the names of locale-specific character mappings
   charconv tojkata;tojhira
   # tojkata: hiragana => katakana mapping
   tojkata (,);(,);\
           .....definition.....
   # tojhira: katakana => hiragana mapping
   tojhira (,);(,);\
           .....definition.....
   END LC_CTYPE

[Proposed extension to .2b text]

[Page 8]
=> 2.5.2.1 LC_CTYPE. Add the following keyword items after the item
   labeled tolower:

   charconv        Define one or more locale-specific character
                   mapping names as strings separated by semicolons.
                   Each named character mapping can then be defined
                   subsequently in the LC_CTYPE definition. A
                   character mapping name shall consist of at least
                   one and at most fourteen bytes of alphanumeric
                   characters from the portable filename character
                   set. The first character of a character mapping
                   name cannot be a digit. The name cannot match any
                   of the LC_CTYPE keywords defined in this standard.

   charconv-name   Define the named locale-specific character
                   mapping. In the POSIX Locale, the locale-specific
                   named character mapping need not exist.

                   If a mapping name is defined by a charconv
                   keyword, but no character mappings are
                   subsequently assigned to it, this is not an error;
                   it shall represent a mapping without any character
                   pairs belonging to it.

[Page 12]
=> 2.5.3.1 Locale Lexical Conventions. Add the following token
   description:

   CHARCONV        A string of alphanumeric characters from the
                   portable character set, the first of which shall
                   not be a digit, consisting of at least one and at
                   most fourteen bytes, and optionally surrounded by
                   double-quotes.

[Page 12]
=> 2.5.3.2 Locale Grammar.
   Modify the ctype_keyword and charconv_keyword descriptions as
   follows:

   ctype_keyword     : charclass_keyword charclass_list EOL
                     | charwidth_keyword charclass_list EOL
                     | defwidth_keyword defwidth_value EOL
                     | charconv_keyword charconv_list EOL
                     | 'charclass' charclass_namelist EOL
                     | 'charconv' charconv_namelist EOL
                     ;

   charconv_namelist : charconv_namelist ';' CHARCONV
                     | CHARCONV
                     ;

   charconv_keyword  : 'toupper'
                     | 'tolower'
                     | CHARCONV
                     ;

----------------------------------------------------------------------------
[END]
=============================================================================

Nobuo Saito
POSIX WG, ITSCJ
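
Illustration of the LC_CTYPE charconv proposal above: the following is
a minimal Python sketch, not part of the Japanese contribution, of how
named locale-specific mappings such as tojkata and tojhira might behave
once declared and defined. The class CharConvTable, the functions
apply_mapping and convert, and the reserved-keyword list are all
hypothetical names and assumptions made for this sketch; lookup() and
apply_mapping() only loosely mirror wctrans() and towctrans(). The
mapping data relies on the Unicode layout, where each katakana letter
sits 0x60 above the corresponding hiragana letter.

```python
import re

# Names a mapping may not reuse. This exact list is an assumption made for
# the sketch; the draft only says "LC_CTYPE keywords defined in this standard".
LC_CTYPE_KEYWORDS = {
    "upper", "lower", "alpha", "digit", "space", "cntrl", "punct", "graph",
    "print", "xdigit", "blank", "toupper", "tolower", "charclass", "charconv",
}

# 1 to 14 alphanumeric bytes, the first of which is not a digit.
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,13}")


class CharConvTable:
    """Toy stand-in for the charconv part of an LC_CTYPE definition."""

    def __init__(self):
        self._maps = {}

    def declare(self, names):
        """Handle 'charconv a;b': register each name with an empty mapping."""
        for name in names.split(";"):
            if not _NAME_RE.fullmatch(name):
                raise ValueError("invalid mapping name: %r" % name)
            if name in LC_CTYPE_KEYWORDS:
                raise ValueError("name clashes with a keyword: %r" % name)
            self._maps.setdefault(name, {})

    def define(self, name, pairs):
        """Handle a 'charconv-name' line: add (from, to) pairs to a mapping."""
        if name not in self._maps:
            raise KeyError("mapping not declared: %r" % name)
        self._maps[name].update(pairs)

    def lookup(self, name):
        """Rough analogue of wctrans(): a mapping handle, or None if unknown."""
        return self._maps.get(name)


def apply_mapping(ch, mapping):
    """Rough analogue of towctrans(): unmapped characters pass through."""
    return mapping.get(ch, ch)


# Hiragana U+3041..U+3096 correspond to katakana U+30A1..U+30F6.
HIRAGANA = range(0x3041, 0x3097)

table = CharConvTable()
table.declare("tojkata;tojhira")
table.define("tojkata", {chr(c): chr(c + 0x60) for c in HIRAGANA})
table.define("tojhira", {chr(c + 0x60): chr(c) for c in HIRAGANA})


def convert(text, name):
    """Apply the named mapping to every character of text."""
    mapping = table.lookup(name)
    if mapping is None:
        raise KeyError("no such mapping: %r" % name)
    return "".join(apply_mapping(ch, mapping) for ch in text)
```

With this table, convert("ひらがな", "tojkata") maps each hiragana letter
to its katakana counterpart, and a mapping that is declared but never
defined leaves its input unchanged, matching the "not an error" rule in
the proposed charconv-name text.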