JTC1/SC22/WG15 N675 WG15 RIN SD-3 1996-Jun-25
ISO/IEC JTC1/SC22/WG15 RIN Issues List -:- FINAL
Source: WG15 and RIN
Status: Approved by the joint WG15 and RIN meeting of 20/23 May 1996
Rationale:
At the WG15 RIN meeting in Twente, 11-12 May 1995, it was decided to move the Agenda Items traditionally listed under 3.1 into entries in this document, the WG15 RIN Issues List. This was because the status and raison d'etre of these items had been obscured over time, and the debate on each item was being revisited at each meeting.
A triumvirate of David Cannon (UK), Keld Simonsen (DK) and George Kriger (Ca) was charged with exhuming the argument and status of each item from past RIN and WG15 papers and minutes, and encapsulating them here.
Strangely, RIN has been here before:
From WG15 RIN Stockholm, November 1991:
Keld Simonsen suggested that the group should have an issues log. There was some discussion of the function of such a log, where it should appear, and of whether the group has any issues suitable for such a log.
From WG15 RIN Annapolis, October 1993:
Canada proposes to remove a swathe of items under RIN Agenda item 3.1, and focus the agenda more closely on the papers submitted. The UK, US agreed. It was intended that items which were still relevant but had no immediate input, should be moved to an issues list. The issues list to be visited and reviewed at each meeting.
...we made it, eventually...
WG15, at its May 1996 meeting, debated the then current version of this document, and resolved that it be updated in line with that discussion. The updated document would be preserved as a WG15 paper, and any outstanding issues would be copied into the WG15 Issues List to ensure they remained under review.
Executive Summary:
Closed: The Issue is closed in RIN - not necessarily everywhere else. MBs or WG15 may still regard the Issue as active. This is the RIN Issues list - no-one else's. Closed in RIN means that RIN has no further legitimate interest in the Issue. WG15, at its discretion, may request RIN re-open it.
Open: The Issue is open in RIN - RIN regards the Issue as receiving its active attention. WG15 has asked RIN to consider the issue, and RIN has not yet reached conclusion on the Issue. Upon conclusion RIN shall advise WG15 of its recommendations.
From WG15 RIN Orlando, October 1995:
RIN adopts the following process with regard to this document, SD-3:
. Upon Closure of an Issue, RIN shall advise WG15 of the status of the Issue by transferring to it the complete section of this document, RIN SD-3, which concerns the newly Closed Issue.
. In order to advise WG15 of RIN's Open Issues which are under active consideration, RIN shall advise WG15's May 1996 meeting of those Issues by copying to WG15 a summary of those Open Issues. The summary shall consist of the 'Title' of the Issue, together with the 'Keywords', 'Description', 'Originator', 'Alternatives', 'Documents', 'Solution' and 'Status' sections from this document.
0. Extended Identifiers in 1003.2b [Closed]
1. localedef iswctype() [Closed]
2. localedef user-specified collation weight names [Open]
3. localedef "substitute" [Closed]
4. localedef "reorder-after" [Closed]
5. removal of NUL special handling [Closed]
6. full support for state-dependent charsets [Closed]
7. charmap-based charset conversion [Closed]
8. "file" user-specified recognition algorithm [Closed]
9. "pax" extended character set support [Closed]
10. C MSE widechar support [Closed]
11. Invariant ISO 646 support [Closed]
12. charsymb/CHARIDS [Closed]
13. regexps [Closed]
14. Canadian Collation Weight minimum levels [Closed]
15. Japanese proposal for LC_CTYPE extension [Open]
16. Character concepts in POSIX [Closed]
17. Range expression [Open]
Keywords:
characterset, lex, awk, shell, scripts, small, language
Description:
A proposal to permit a more extensive set of characters in the small languages supported by the POSIX Shell and Utilities standards.
Originator:
WG20, DK
Alternatives:
To remain with the status quo.
Documents:
RIN N047 A representation for the shell in ISO 646
N264 SC22/WG20 N085: Extended identifiers
N283 SC22/WG15 liaison statement to WG20
N294 P1003.2b D4 (Shell & Utilities Amd)
N417 WG20 liaison report to WG15
N420 Extended characterset in Posix identifiers
N515 US Action Item Report
N532 WG15 minutes and resolutions, Oct 1994
AN12 WG20 current and intended work (WG20 N223)
Solution:
WG15 and the US development body have accepted the proposal contained in N420.
Status:
Issue in RIN is closed, the proposal in N420 having been accepted. WG15 is requested to (re-)endorse N420.
History:
N264 was the first relevant identifiable WG15 paper input on this subject.
From WG15 Hamilton, May 1992:
The plenary considered N264 and prepared the following liaison statement to WG20 as WG15 N283:
WG15 has reviewed WG20 document N085 entitled "Extended Identifiers", which encouraged discussion of its proposal, and offers the following comments:
1) The POSIX Shell and Utilities standard (DIS 9945-2) provides facilities for locale-dependent specifications of character attributes that optionally are adjustable by the user or application. WG15 recognises that allowing characters outside the POSIX portable character set is a feature that directly impacts portability, but it is a desirable localisation facility in some environments.
2) WG15 believes that any extensions to programming language identifier requirements should be accomplished within the framework described in 1) above.
3) 9945-2 contains several "small languages", such as shell and awk, that WG15 intends to enhance in this area. It believes that the proper approach would be to allow characters in classification "alpha" in the current locale wherever the current specifications allow alphabetics from the portable character set (equivalent to the ISO 646 repertoire). (The "alpha" classification may include syllabic and ideographic characters, and is named "alpha" for historic reasons.) Because of differing requirements in the various languages, WG15 considers any additional degree of flexibility to be infeasible across all languages.
WG15 plenary resolved to pass the above statement through its liaison to WG20:
RESOLUTION 201. LIAISON STATEMENT TO WG20
WG15 instructs its liaison to WG20 to transmit WG15 N283 as a WG15 liaison statement to WG20.
From WG15 Annapolis, October 1993:
4.4 Liaison statements & actions related thereto [N417, AN12, N420, N421, N422]
N420 is intended to be an amendment to the Posix 'small' languages. It proposes an extended characterset for lex, awk, shell scripts, and as such might break them as they are currently specified.
Re N417 point 7: Keld maintains that N420 is implied by areas of work defined in AN12 (WG20 N223). This is the one which may break things. KS suggests that this is solved via the locales mechanism. No action is required... ???
22.41 additional utilities {2b} CD reg: [N416, N420]
Proposed action on the US to take these on board. NL accepts N420 proposal, but regards the N416 document as representing old technology superseded by ISO 10646.
The original action was on DK to provide these papers as additional information to the US. N416 and N420 will be passed to the US for comment.
[The action item was carried forward to the May 1994 meeting]
From WG15 RIN Annapolis, October 1993:
Resolution RIN 9310-04: Internationalisation Concerns in 1003.2b
WG15 RIN notes that the new Annex H to 9945-2 addresses the concerns of the international community, specifically of Japan and of Denmark. 9945-2 Annex H indicates that input is required from WG15 MBs on a number of specific issues and therefore WG15 RIN requests an indication of the latest dates by which such input is required by the US development body, in order to maintain synchronisation of the ISO/IEC and IEEE work.
...after input from Arnie Powell it was decided to convert the Resolution on Annex H to an action item on the US RIN Rapporteur in order to achieve it in a more timely fashion.
From WG15 Tokyo, May 1994:
9405-52 United States: Review N416 and N420 and forward them to PASC for consideration.
From WG15 Vancouver, October 1994:
The 9405-52 action was noted as Complete, the response being included in N515, the US Action Item Report:
re N420...The languages specified by POSIX.2 specify behaviour when identifier names are chosen from the portable character set. We have not found anything to preclude an implementation from recognising extended characters as part of an identifier. However, an application making use of those extensions would be non-portable.
From 9945-2:1993 Annex H.1:
7: 2.5 Locale (1) Provisions should be made to allow characters beyond those in the portable character set in user-supplied identifiers for the shell, awk, bc, lex, make, and yacc. A proposal has been made by Denmark to extend the locale definition to specify the set of identifier characters for all programming languages.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 Copenhagen, May 1996:
| Extended identifiers work in real compilers, but not for the
| small languages of lex, awk, etc. WG15 does not support the use
| of extended identifiers in these POSIX small languages. The .2b
| WG cannot see how to do this in a way which allows locales to
| drive the lexical analysers of these utilities 'on the fly'.
| Action on MBs to bring forward any technical means to solve this
| implementation problem. The Issue remains closed.
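The approach set out in point 3 of N283, quoted earlier in this Issue, amounts to widening the alphabetic test used when an identifier is recognised. The following is a minimal Python sketch of that idea only; it is not taken from any WG15 or WG20 paper. The names is_identifier, portable_alpha and locale_alpha are hypothetical, and Python's str.isalpha() merely stands in for the "alpha" class of the current locale.

```python
import string

# Alphabetics and digits of the portable character set (ISO 646 repertoire).
PORTABLE_ALPHA = set(string.ascii_letters)
PORTABLE_DIGITS = set(string.digits)


def portable_alpha(ch):
    """Alphabetic test limited to the portable character set."""
    return ch in PORTABLE_ALPHA


def locale_alpha(ch):
    """Assumption: str.isalpha() stands in for the locale's "alpha" class."""
    return ch.isalpha()


def is_identifier(name, is_alpha):
    """Return True if name is an identifier under the supplied alpha test.

    The rule modelled here: first character is alphabetic or '_', the
    remaining characters are alphabetic, portable digits or '_'.
    """
    if not name:
        return False
    first, rest = name[0], name[1:]
    if not (first == "_" or is_alpha(first)):
        return False
    return all(
        ch == "_" or ch in PORTABLE_DIGITS or is_alpha(ch) for ch in rest
    )
```

With portable_alpha the name "tæller" is rejected; with locale_alpha it is accepted. As N515 observes, a script relying on the wider test would be non-portable.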
Keywords:
locale, localedef, iswctype()
Description:
iswctype() determines whether the wide character c has the property p. For example: iswctype(c, wctype("lower")); where wctype("lower") returns the value p identifying the property.
Originator:
J
Alternatives:
None
Documents:
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N294 P1003.2b D4 (Shell & Utilities Amd)
N531 IEEE P1003.2b D10: Shell & Utility Extensions
N602 RIN N158: Japanese Action Item report to WG15, October '95
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
The issue is Closed. The 1003.2b document includes appropriate support for iswctype().
Status:
Closed
History:
N245 was the first relevant identifiable paper on this subject:
From WG15 Stockholm, November 1991:
The Japanese MB comments on CD 9945-2, quoted from N245, raise an objection [@ O o 4 <ITSCJ.4>] relating to "...additional character classes suitable for classes beyond the current ANSI/C and/or Latin based character classes. The current draft says that such additional character classes may be supported by implementation, but which is implementation defined.
"Action: As the ISO/C Multibyte Support Extension (MSE) is going to provide a new function iswctype(), some corresponding enhancement of LC_CTYPE description file should be considered so that 'user/implementation definable character classes' can be supported in the POSIX environments in the standard manner.
"Japan will probably be able to cooperate with the POSIX.2 developing member body (US - IEEE) on how to solve these issues."
N281 contained the following disposition:
We also believe that this functionality should be studied for inclusion in the POSIX.2b revision and the full international standard. We are aware of efforts within X/Open to address this area and would like to take advantage of their developments.
An action 9111-23 was devised to reformat the Japanese comments on 9945-2 to items in the WG15 Issues list.
From WG15 Hamilton, May 1992:
At WG15 Hamilton, this was transformed into:
9205-32: Japan to provide to the US Member Body proposals for areas identified in their 9945-2.2 comments #s 2, 3, 4, 10, 11, 54, and 57 addressing resolution comments in N281.
From WG15 Reading, October 1992:
Action 9205-32 was noted as Complete. No document is cited, no action recommended.
WG15 plenary considered N294, the P1003.2b Draft 4 document. This contained on Page 5 the following:
2.5.2.1 LC_CTYPE
Add the following keyword items between the items labeled blank and toupper:
charclass: Define one or more locale-specific character class names as strings separated by semicolons. Each named character class can then be defined subsequently in the LC_CTYPE definition. ...
charclass-name: Define characters to be classified as belonging to the named locale-specific character class. In the POSIX Locale, the locale-specific named character classes need not exist. ...
This addition was adopted from XPG4 to satisfy the following requirement from ISO/IEC DIS 9945-2:1992 Annex H:
(3) The LC_CTYPE (2.5.2.1) locale definition should be enhanced to allow user-specified additional character classes, similar in concept to the proposed C Standard {7} Multi-byte Support Extension (MSE) iswctype() function.
From WG15 RIN Reading, October 1992:
RIN considered N088, a proposal for an LC_CTYPE extension to support additional character mappings. There is no record of further action on this document.
From WG15 Vancouver, October 1994:
N531, Draft 10 of P1003.2b, was made available and contains only minor changes to references in the above section.
From WG15 Copenhagen, May 1996:
| Closed, 1003.2b already includes support for this.
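The two-step lookup discussed in this Issue (obtain a class handle by name, then test a character against it) and the charclass idea of locale-specific named classes can be modelled in a few lines. This is a hedged Python sketch, not the C interface itself: the functions below only borrow the names wctype and iswctype, define_charclass is a hypothetical helper, and the class name "hiragana" is an illustrative assumption rather than anything taken from P1003.2b.

```python
# Named character classes known to this toy "locale". The predicates are
# Python stand-ins for whatever the locale definition would supply.
CHARCLASSES = {
    "alpha": str.isalpha,
    "lower": str.islower,
    "upper": str.isupper,
}


def define_charclass(name, members):
    """Hypothetical helper: register a locale-specific named class."""
    members = frozenset(members)
    CHARCLASSES[name] = lambda ch: ch in members


def wctype(name):
    """Return a handle for the named class, or None if it is unknown.

    None plays the role of the zero value that C's wctype() returns for
    a class name that is not valid in the current locale.
    """
    return CHARCLASSES.get(name)


def iswctype(ch, desc):
    """Test character ch against a handle obtained from wctype()."""
    return desc is not None and bool(desc(ch))


# Illustrative user-defined class: code points U+3041 to U+3096.
define_charclass("hiragana", (chr(cp) for cp in range(0x3041, 0x3097)))
```

After the last line, iswctype("\u3042", wctype("hiragana")) is true, while the same test against an undefined class name is false because the handle is None.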
Keywords:
localedef, collation, weight, LC_COLLATE
Description:
A mechanism for the specification of named collation weights in the LC_COLLATE section of locales, particularly so that non-Latin character scripts can support a number of sorting algorithms.
Originator:
J
Alternatives:
None
Documents:
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N330 Japanese comments on Posix .2b/D4
RIN N106 Japanese Proposal to POSIX 1003.2b
N602 Japanese Action Item Report to WG15, October 1995
N640r US TAG N573, N587: AI 9510-14, Report on POSIX.2b Issues
Solution:
None as yet. The proposal has been accepted in principle. The US development body has asked for specific wording to be supplied by Japan for inclusion in a revision to the standard.
Status:
Open. Awaiting input from the Japanese MB to 9945-2Amd2b.
History:
From WG15 Hamilton, May 1992:
N245, the comments on CD 9945-2, and N281, the disposition of those comments, contained the Japanese MB objection <ITSCJ.30> relating to collation weight names; a similar later version (below) was recorded at the WG15 Reading meeting. The proposed disposition of <ITSCJ.30> is contained in N281 as:
We believe that this change, or something similar to accomplish the same objective, should be studied for inclusion in the POSIX.2b revision and the full international standard.
From WG15 Reading, October 1992:
N330 contained the Japanese MB comments on POSIX.2b D4; they included:
<ITSCJ.2b.9> Sect 2.5.2.2.3 (LC_COLLATE) PROPOSAL
Problem: In most cases of ideographic characters, it is a requirement that a user be able to specify collation weights as he/she wants. In case of Japanese characters (Kanji), for example, there are five possible collation weights for supporting Japanese SORT. The five weights are On-yomi (pseudo-Chinese pronunciation), Kun-yomi (Japanese pronunciation), number of strokes, radical (components of Kanji), and Kanji character code. There could be more weights. The LC_COLLATE part of localedef specifications should allow a user to describe these weights and give names to the weights. Any combinations of the defined weights should be able to be specified by the user at run-time.
Proposal:
LC_COLLATE extension for specifying weight name
=> 2.5.2.2.3 order_start Keyword. Add the following directive description and the Example.
It is implementation defined whether the following optional directive shall be recognised. If they are not supported, but present in a localedef source, they shall be ignored.
name specifies the name of a collation weight by a string. An order of weights may be specified by using the name at run time. The syntax for the name directive shall be:
"name =
Example:
order_start forward,name="kunyomi";forward,name="radical"
If an operand has a name directive, the definition of the primary, secondary, or subsequent weights for the collation element may be different from the order of operands to the order_start keyword.
=> 2.5.3.2 Locale Grammar. Modify the opt_word description as follows:
opt_word : 'forward' | 'backward' | 'position' | 'name' '=' weight_name
weight_name : '"' char_list '"'
Rationale: User's requirements for character collation in Asia are diverse. Ideographic characters have several rules to sort such as by pronunciations, strokes, etc. and the combination of the rules are used for their sorting. Those properties for a character such as pronunciation can be assigned as weights for a character element. However, no standard primary weight, secondary weight and so on exists for the weights (properties). The weight name extension for LC_COLLATE allows the order of multiple weights to be defined at run time in the different order than the order of operands to order_start keyword. To make the different order effective, the weight names can be specified in the setting of LC_COLLATE category.
order_start forward,name="kunyomi";forward,name="radical"
When a ja_JP.eucJP locale has the above definition in the LC_COLLATE part, the order of sorting rules can be specified as follows by using the weight names:
LC_COLLATE = ja_JP.eucJP@weights=radical,kunyomi
This means that the sort-rule "radical" is used as the primary weight and "kunyomi" is used as the secondary weight.
From WG15 RIN Heidelberg, May 1993:
3.1.3 user-specified collation weight names based upon phonetic, character based (radical), or code based. Dynamic based control of collation based upon sort key. The ability to switch pointer dynamically to bring collation tables into correct sequence. Japanese delegation has submitted two written requests without supporting material.[?] Next version would be submitted by June 18, 1993.
From WG15 RIN Annapolis, October 1993:
Action Item reports: The action list was lost. The minutes of the previous meeting were scanned to recover as many action items as possible; these were determined to be as follows:
9305-01 Requirement for user-specified collation weights. MDR-02 contains the Japanese proposal on collation weights. (Closed)
MDR-02 -> RIN N106: Japanese Proposal to POSIX 1003.2b
3.1 I18N in POSIX.2b
Specific actions were taken in Annex H to address Denmark and Japanese concerns for May 93 Heidelberg meeting. Japan needs feedback for timeline to produce material for coordination with 1003.2b. Resolution to be produced asking for timeline for national body contributions. The rest of 3.1 [including N106] was postponed to the next meeting, due to lack of knowledge of the current status of .2b and lack of input papers received in time.
9310-09 Lead Rapporteur: distribute documents N105, N106, N109 and N113 to the RIN mailing list together with a cover note indicating that these documents will be discussed at the next WG15 RIN meeting, May 1994, and also indicating which agenda items will be touched by the documents.
From WG15 RIN Vancouver, October 1994:
9405-05 Member Bodies to review N105 (Japanese comments on .1a), N106 (Japanese comments on .2b), N109 (SC22/WG20 guidelines for the use of extended identifiers in programming languages), N113 (CEN standard for string ordering) for determination of appropriate action prior to Oct. Meeting 10/94: OPEN: Prof. Saito noted they are preparing a Japanese standard for character ordering.
The above action item was carried through from May 1994 to the May 1995 meeting.
From WG15 RIN Twente, May 1995:
3.1.3 localedef user-specified collation weight names--Japan making proposal for Annex H--removed to issues list
From 9945-2:1993 Annex H.1:
(4) The LC_COLLATE (2.5.2.2) locale definition should be enhanced to allow user-specified names for collation weights. A proposal from Japan is expected in this area.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 RIN Orlando, October 1995:
N158 [WG15 N602] includes new input to this item; Japan is still working on this item; solutions to some of the problems are not yet obvious. Japan needs discussion of their paper to help them go forward.
[N602 includes the following:] LC_COLLATE extension for user-specific names of collation weights
Title: Japanese proposal to POSIX.2b on LC_COLLATE extension for user-specified names of collation weights
Status: Japanese position
Short description: Japan proposes to extend LC_COLLATE locale definition in POSIX.2b so that names can be assigned to collation weights. This proposal is the response to the item (4) of ISO/IEC 9945-2:1993 Annex H.1 in which a proposal from Japan is expected.
Text of contribution: [Note: The page numbers refer to the ones of P1003.2/D10.]
Sect 2.5.2.2.3 (LC_COLLATE) PROPOSAL. page 10:
Problem: 1. General Requirements
In most cases of ideographic characters, it is a requirement that a user be able to specify the combination of collation weights as he/she wants. Japanese kanji characters, for example, have five (or more) typical collation weights to support Japanese SORT. The five weights are On-yomi (pseudo-Chinese pronunciation), Kun-yomi (Japanese pronunciation), Number of strokes, Radical (components of Kanji), and Kanji character code. There are many possible combinations of these weights and the requirements for them (number and order of weights) may change according to the type of data sorted, the purpose of sorting, user's preference, etc. Users (or applications) want to specify the method of sorting by specifying the primary weight and the secondary weight, and so on. Because no names are available for the combination of multiple weights, it is reasonable requirement that users can use the name of each collation weight for specifying the method of collation. That is the way in which most sorting utilities existing in Japan are implemented.
The concept of each weight for kanji characters mentioned above are common knowledge for Japanese. However, there are no standards for the weights of Japanese kanji characters. So the detail of assigning weights can be slightly different among implementations depending on which information source (dictionary, etc.) is used for making the weights. It is difficult to handle such difference by using pre-defined sorting method. If each weight can be handled independently, it will be easier to manage.
ISO 10646 (UCS) is now a standard. UCS can be used as a codeset for any locale whose character sets are included in it. Even if UCS can be used for many different countries, the requirements for sorting characters are different country by country. The size of locale databases is a concern when using UCS. It is a requirement that there should be no problem for providing solutions to the above kanji sorting requirements when UCS is used as a codeset.
2. Problem in using current POSIX.2 standards specification
Current locale model seems to assume having a well-defined collation definition for each locale. However, it does not match with the requirements for sorting ideographic characters. There is an opinion that it's not totally impossible for the current .2 specification to allow implementation of satisfying most of (not all) the above requirements. Producing locales for all possible combinations of weights as well as naming each locale is the possible solution based on the existing standards specification. In addition to that it is not a complete solution, the approach seems not practical in the following points.
a. Size of locale databases
There are about 12,000 kanji characters defined in JIS standards (JIS X0208 + JIS X0212). Because each possible combination of available weights needs to have a database, the total size of locale databases containing such big number of characters cannot be ignored (for example, 12,000 characters x 20 databases). When a locale for ISO 10646 code set is defined, the problem must be more serious.
b. Identification of each collation method
"Onyomi", "Kunyomi", etc. are well-known names as methods of sorting kanji characters. However, the problem is that no names are available for the combinations of the primitive methods. Implementors need to invent new names for the methods. (for example, onyomi_strokes_radical, kanji0102, etc.) The possibility of making standard or de facto standard for the names of these combinations are very low. Hence, this approach will not be portable.
Considering these problems, without extending current specification of LC_COLLATE, standard collation API such as wcscoll can support only limited ways of collation for kanji data, for example JIS code values. In this situation, applications which handle character orderings (for example, database applications) cannot rely on locale databases to sort kanji data. Some applications will support several collating methods by having their own ordering databases. Some applications will simply neglect the various sorting requirements for Kanji.
3. Overview of LC_COLLATE proposal
By extending LC_COLLATE specification, single locale database can define multiple definitions of weights for kanji with their names. It is envisioned that the order of multiple weights can be specified at run time in the different order than the order of operands to order_start keyword. To make the different order effective, extension of another part of POSIX standards may be necessary. The weight names specified in the database should be referenced by a user or an application and the behavior of collation API needs to be modified according to the specified sorting method.
The proposal for allowing users to specify collation methods is expected to work as follows.
a. Define collation weights with names in LC_COLLATE
Define collation weights with names in the locale database.
EXAMPLE
order_start forward,name="kunyomi";forward,name="radical"
<char-1> <kunyomi weight for char-1>;<radical weight for char-1>
<char-2> <kunyomi weight for char-2>;<radical weight for char-2>
:
:
order_end
b. Specify sorting methods
There are two possible extensions to specify preferred collation. One is to introduce new environment variable (b.1), and the other is to use LC_COLLATE (b.2).
b.1 Set the environment variable COLLWEIGHTS to preferred collation combination using names defined in the locale database.
EXAMPLE COLLWEIGHTS=radical,kunyomi
(Primary weight=radical, Secondary weight=kunyomi)
b.2 Alternatively, existing LC_COLLATE environment variable can be used to specify user's preference. The weight names are specified after the string "@weights=" modifier.
EXAMPLE LC_COLLATE=ja_JP.eucJP@weights=radical,kunyomi
c. Initialize collation data
There are two possible extensions to set collation methods at run time. One is to introduce new API (c.1), and the other is to use setlocale() (c.2).
c.1 The call to setweights() initializes the collation method from the setting of COLLWEIGHTS environment variable. The setweights function can be used to change the method of collation at run time.
c.2 The call to setlocale(LC_ALL, "") initializes the collation method from the setting of COLLWEIGHTS (or LC_COLLATE) environment variable. The setlocale function can be used to change the method of collation at run time.
d. API behavior
Collation APIs such as wcscoll work depending on the current setting of collation method.
The details of the proposal for extended use of environment variables and the initialization by API are not decided yet. The proposed extension to locale definition file is described below. The detail proposals for other parts are not ready yet.
4. Proposal for POSIX.2b LC_COLLATE locale definition file
Proposal: [LC_COLLATE extension for specifying weight name]
The LC_COLLATE part of localedef specifications should allow a user to give names to the weights.
=> 2.5.2.2.3 order_start Keyword. Add the following directive description and the Example.
It is implementation defined whether the following optional directive shall be recognized. If they are not supported, but present in a localedef source, they shall be ignored.
name specifies the name of a collation weight by a string. An order of weights may be specified by using the name at run time. The syntax for the name directive shall be:
"name = \"%s\"", <weight-name>
Example:
order_start forward,name="kunyomi";forward,name="radical"
If an operand has a name directive, the definition of the primary, secondary, or subsequent weights for the collation element may be different from the order of operands to the order_start keyword.
=> 2.5.3.2 Locale Grammar. Modify the opt_word description as follows:
opt_word : 'forward' | 'backward' | 'position' | 'name' '=' weight_name ;
weight_name : '"' char_list '"'
[Attachment : Example]
Possible LC_COLLATE definition
==============================
# Stroke
collating-symbol <3stroke>
collating-symbol <4stroke>
collating-symbol <6stroke>
collating-symbol <7stroke>
collating-symbol <10stroke>
# Onyomi
collating-symbol <a>
collating-symbol <i>
collating-symbol <ka>
collating-symbol <san>
# Radical
collating-symbol <ninben>
collating-symbol <kuchi>
collating-symbol <yama>
order_start forward,name="stroke";forward,name="onyomi";\
forward,name="radical";forward,name="JISnumber"
<j1602> <10stroke>;<a>;<kuchi>;<j1602>
<j1643> <6stroke>;<i>;<ninben>;<j1643>
<j1644> <7stroke>;<i>;<ninben>;<j1644>
<j1829> <4stroke>;<ka>;<ninben>;<j1829>
<j2719> <3stroke>;<san>;<yama>;<j2719>
Changing the order by assigning values to LC_COLLATE (b.2 method)
====================================================
LC_COLLATE=ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber
Behavior of collation functions
===============================
Output from weights=stroke,onyomi,radical,JISnumber (default)
<j2719> < <j1829> < <j1643> < <j1644> < <j1602>
Output from weights=radical,onyomi,stroke,JISnumber
<j1643> < <j1644> < <j1829> < <j1602> < <j2719>
From WG15 Copenhagen, May 1996:
| PASC WG has captured this issue and has emailed an awk script
| (in N640r) which solves the problem. Japan would like to take
| the proposed solution back to Technical Experts to ensure it
| answers their concerns. The US DB would like comments ASAP to
| ensure it hits the .2b ballot window. Action on Denmark and
| Japan to ensure the script works for them. The issue remains
| open - the US DB believes their solution will not be changed.
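The run-time reordering of named weights proposed in N602 can be checked against the data in its attachment with a small model. The sketch below is Python written for this purpose; it is not the awk script referred to in N640r, and it does not implement localedef or setlocale(). The names DEFINED_ORDER, RANK, TABLE, parse_weights, sort_key and collate are hypothetical, and ranking each collating symbol by its position of declaration is an assumption about how the attachment is meant to be read.

```python
# Weight names in the order given to order_start in the N602 attachment.
DEFINED_ORDER = ["stroke", "onyomi", "radical", "JISnumber"]

# Assumption: a symbol's rank is its position in the declaration order.
RANK = {
    "stroke": ["3stroke", "4stroke", "6stroke", "7stroke", "10stroke"],
    "onyomi": ["a", "i", "ka", "san"],
    "radical": ["ninben", "kuchi", "yama"],
    "JISnumber": ["j1602", "j1643", "j1644", "j1829", "j2719"],
}

# Collation elements and their weights, one column per name above.
TABLE = {
    "j1602": ("10stroke", "a", "kuchi", "j1602"),
    "j1643": ("6stroke", "i", "ninben", "j1643"),
    "j1644": ("7stroke", "i", "ninben", "j1644"),
    "j1829": ("4stroke", "ka", "ninben", "j1829"),
    "j2719": ("3stroke", "san", "yama", "j2719"),
}


def parse_weights(lc_collate):
    """Extract the names after '@weights='; default to the defined order."""
    _, sep, names = lc_collate.partition("@weights=")
    return names.split(",") if sep else list(DEFINED_ORDER)


def sort_key(element, weights):
    """Build a comparison key using the weights in the requested order."""
    row = dict(zip(DEFINED_ORDER, TABLE[element]))
    return tuple(RANK[name].index(row[name]) for name in weights)


def collate(elements, lc_collate):
    """Sort elements according to an LC_COLLATE value (b.2 method)."""
    weights = parse_weights(lc_collate)
    return sorted(elements, key=lambda e: sort_key(e, weights))


default = collate(TABLE, "ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber")
by_radical = collate(TABLE, "ja_JP.eucJP@weights=radical,onyomi,stroke,JISnumber")
```

Under these assumptions, default and by_radical reproduce the two output sequences listed in the attachment, which suggests the example there is internally consistent. A single table serves both orderings, which is the point of the proposal's argument about locale database size.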
Keywords:
locale, localedef, substitute, LC_COLLATE
Description:
The "substitute" statement in LC_COLLATE is needed for describing higher levels of Danish Standard DS 377 sorting, and should be re-introduced.
Originator:
DK
Alternatives:
None identified.
Documents:
(WG15RIN.136) substitute in LC_COLLATE
(WG15RIN.246) substitute
N170r WG15 RIN N036: Minutes & resolutions, Rotterdam, May 1991
N213 WG15 RIN N046: Japanese national profile for POSIX: Vn 1.2
N215 WG15 RIN N051, N052: RIN Minutes and resolutions, November 1991
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N323r WG15 RIN N096: Minutes & resolutions, Reading, October 1992
N370 RIN N103: RIN Minutes from Heidelberg, 10-11 May 1993
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
Substitute is requested only by Denmark; other potentially interested MBs - Canada, Japan, US, UK and the Netherlands - have indicated that they do not require the substitute feature.
The consensus is that this support can best be provided at application level - Denmark disagrees.
Status:
The Issue in RIN has been revisited many times without consensus being reached.
WG15 at its Copenhagen meeting resolved that 'substitute' is not required, and that the Issue is closed.
History:
From WG15 RIN Rotterdam, May 1991:
N170r noted a debate on substitute:
3.2.2. localedef ... A particular problem is the substitute command, and its use of regular expressions. It has been suggested that string-for-string substitution would be adequate; however, the CSA -- and, by implication, most western -- collation standards cannot be met without regular expressions. Given rationale that regexps are not necessary for practical national collation sequences, Greger Leijonhufvud would be happy to drop them.
[Actions 9105-08 and 9105-20 were devised to check if Japan and Canada needed 'substitute']
From WG15 Stockholm, November 1991:
RIN9105-8 Erik van der Poel: Determine whether substitute is necessary to implement Japanese collation.
Closed. The substitute operation is not required -- see RIN N046.
RIN9105-20 Patric Dempster: Clarify, through discussion with Alain LaBonte, whether the CSA ordering standard requires the substitute operation.
Closed. The substitute operation is not required.
From WG15 Hamilton, May 1992:
N245 included a number of Danish MB comments on the 2nd CD of 9945-2. Item 3 of the Danish comments was the request to re-introduce the "substitute" facility.
N281, the Disposition of Comments, proposed the following:
We believe that this change, or something similar to accomplish the same objective, should be studied for inclusion in the POSIX.2b revision and the full international standard. It should be deferred because there currently exists no firm consensus on its necessity within the US or international communities. An informative statement concerning future directions for 'substitute' will be included.
From WG15 RIN Reading, October 1992:
(WG15RIN.246) substitute: From: keld@dkuug.dk Substitute specification in the LC_COLLATE section of localedef DS proposes to use the wording contained in ISO/IEC 9945-2 DIS annex G.
3.1.4 12. The use of 'substitute' in collation was suggested. A review of the history of this shows that this gives recursive definitions between the locale and regular expressions - which cannot in general be shown to be finite. DIN 5007 and the Canadian standard on sorting do not use this, but the highest level of the Danish sorting standard (DS377) does.
13. The Danish national body is to produce a paper before the next meeting on its perceived need for the use of substitution in the collating order category of a locale vis-a-vis DS377 and in particular the level at which that appears to be necessary (RIN AI 9210-01).
From WG15 RIN Heidelberg, May 1993:
2.0 Action Item Reports: 9210-01 Defer discussion [to 3.1.4]
[The minutes do not record a paper responding to 9210-01]
3.1.4 Canada has trouble with nested substitute routines, which allow no character control within the application.
From WG15 Twente, May 1995:
Denmark: One thing has not been provided - text for "substitute" facility, from an old draft of .2. Denmark believes that US has text in its archives.
From 9945-2:1993 Annex H.1:
10: 2.5.2.2 LC_COLLATE (5) The collation substitute facility, removed from 2.5.2.2 in an early draft, should be restored.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 RIN Orlando, October 1995:
Denmark indicated that the problem was not a simple one, and that various other MBs would need it, if only they thought about it for a while. HW said he would go back to check what was required by the Netherlands.
9510-02 HW to check on requirement for 'substitute' by the Netherlands.
DB said that while this was required in telephone book sorting in Canadian English, this was an application issue, not an API one. KS disagreed; there should be API support at this level to prevent repetition of this functionality within multiple applications, with the possibility of them differing. DC indicated that the UK felt this could be supported by other means than the API.
RIN has identified no widespread need for the functionality. UK, US, Canada and Japan do not need it. Netherlands are checking. The Issue is closed.
From WG15 Copenhagen, May 1996:
| The Netherlands reported that they saw no requirement for
| 'substitute'. WG15 maintained that the Issue remain closed.
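As an illustration of the application-level approach argued for by Canada and the UK at Orlando, the sketch below applies string-for-string substitution to build a sort key before comparison. It is a minimal sketch only: the substitution table and the names `SUBSTITUTIONS`, `collation_key` and `directory_sort` are hypothetical, chosen for this example, and are not taken from DS 377, the CSA ordering standard or any POSIX draft.

```python
# Hypothetical telephone-book style substitutions (illustrative values only;
# not taken from any national collation standard).
SUBSTITUTIONS = [
    ("St. ", "Saint "),
    ("Mc", "Mac"),
]


def collation_key(name: str) -> str:
    """Apply each string-for-string substitution, then fold case."""
    key = name
    for old, new in SUBSTITUTIONS:
        key = key.replace(old, new)
    return key.casefold()


def directory_sort(names):
    """Sort names by their substituted key rather than by their raw spelling."""
    return sorted(names, key=collation_key)


names = ["Smith", "St. Clair", "Sanders", "McAdam", "Mabel"]
print(directory_sort(names))
```

Because the table lives inside the application, two applications may carry different tables and so order the same data differently, which is the repetition Denmark's objection was concerned with.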
locale, reorder-after, replace_after
Description:
A mechanism for building on the collation sequence constructed for one locale by allowing the specification of a set or sets of differences in the construction of other, similar collation sequences for other locales.
Originator:
DK
Alternatives:
reorder_after was substituted for replace_after in mid-1992.
Documents:
RIN N035 Proposal for building on other locales (replace_after)
RIN N092 Danish note on reorder_after and replace_after
RIN N127 Procedures for European Registration of Cultural Elements, CEN draft 5
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N391 DIS 9945-2 Disposition of Comments ballot
RIN N154 RIN Minutes, Orlando, 26/27 October
N640r US TAG N573, N587: AI 9510-14, Report on POSIX.2b Issues
Solution:
WG15 RIN resolved at its October 1992 Reading meeting NOT to proceed with either replace_after or reorder_after.
Status:
The issue is closed, having been reopened since the Reading meeting. RIN believes that there is no requirement for this functionality.
WG15 is advised by RIN that 'reorder-after' is not required.
NB: The above Resolution and Status are disputed by Denmark, which believes that the functionality is required as specified in CEN ENV 1205.
History:
From WG15 Stockholm, November 1991:
d. RIN N035, Proposal for building on other locales (replace after)
>> Consensus in RIN was that functionality of "replace after" should be explored (Canada volunteered to do some prototyping)
>> Denmark should include proposal as part of their ballot comments.
COPY statement exists in .2.2 but may work on binary data only (e.g. contents of locale after compilation)
Canada had no technical objections to exploring functionality but was concerned about effect on existing consensus if a change is made at a late point in balloting, and potential effect on portability.
Denmark position not final but is seeking consensus on issue; if consensus is to explore inclusion in later extension of standard, that would be OK. [This is in relation to the original 9945-2 standard]
From WG15 Hamilton, May 1992:
N245 included Danish MB comments on CD 9945-2:
9. ...collating sequences vary a bit from country to country, but generally much of the collating sequence is the same. For instance the Danish sequence is quite equal to the German, English or French, but for about a dozen letters it differs. The same can be said for Swedish or Spanish; generally the collating sequence is the same, but a few characters are collated differently.
With the advent of the quite general coded character set independent locales like the example Danish in POSIX.2 Draft 11 annex F, it would be convenient if the few differences could be specified just as changes to an existing one. This would also improve the overview of what the changes really are. Therefore DS propose the following.
For the LC_COLLATE definition, a new command is allowed:
replace_after <collating element>
<collating-el1> ...
<collating-el2> ...
...
replace_after ...
...
replace_end
This construct is allowed also when a "copy" statement has been given. More than one replace_after / replace_end construct can be given.
The <collating-el1> ... are removed from the current collating sequence and inserted after <collating-element> in the collating sequence.
For this to work the "copy" statement should be allowed to be used together with other statements in the LC_COLLATE section ... The replace-after proposal can be included in the Annex F, where its use is demonstrated. Then the specification can be moved to the normative part of 9945-2 in a later issue.
N281 contained the response to this proposal:
We believe that this change, or something similar to accomplish the same objective, should be studied for inclusion in the POSIX.2b revision and the full international standard. It should be deferred because there currently exists no firm consensus on its necessity within the US or international communities.
The response goes on to indicate that the original concept of the "copy" statement was to duplicate an actual object description - the source text may not exist on the current system - and therefore replace-after would require the locale be 'de-compiled'.
From WG15 RIN Reading, October 1992:
RIN N092 renamed 'replace_...' to 'reorder_...' and proposed:
The following section is inserted in the description of LC_COLLATE keywords in POSIX.2 D11.3 section 2.5.5.2.
2.5.2.2.6 'reorder_after' keyword
The 'reorder_after' keyword specifies a starting point for reordering collating elements. It is followed by one or more collation reorder statements, reassigning character collation weights to collating elements. The syntax is:
"reorder_after %s\n",<collating-symbol>
2.5.2.2.6 Collation Reordering
Each 'reorder_after' statement shall be followed by one or more collation element reordering entries. The definition of collation element reordering entries is equivalent to the collating element entries in 2.5.2.2.4, specifying collation elements and associated weights. The collating element reordering entries are terminated by a 'reorder_after' keyword or a 'reorder_end' keyword.
Each collation element specified via a collation element reordering entry is removed from the current collating sequence, if present, and inserted in the collating sequence after the previous reordering collation elements. The collating element specified on the previous 'reorder_after' statement specifies the first reordering collation element. The last reordering collation element is followed by the follower to the collation element specified on the 'replace-after' statement.
Example:
order_start
<collating-el1>
<collating-el2>
<collating-el3>
<collating-el4>
<collating-el5>
order_end
reorder_after <collating-el4>
<collating-el1>
<collating-el2>
reorder_after <collating-el2>
<collating-el6>
reorder_end
The resulting order is then:
<collating-el3>
<collating-el4>
<collating-el1>
<collating-el2>
<collating-el6>
<collating-el5>
2.5.2.2.8 'reorder_end' keyword
The collating reorder entries shall be terminated with a 'reorder_end' keyword.
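The reordering rule in the RIN N092 text above can be sketched as a list operation: each listed element is removed from the current sequence, if present, and the group is re-inserted immediately after the named anchor. This is a minimal sketch that models ordering only and ignores collation weights; the function name `reorder_after` and the use of plain strings for collating elements are assumptions made for illustration, not part of the proposal.

```python
def reorder_after(order, anchor, moved):
    """Return a new sequence with the `moved` elements placed directly after `anchor`.

    Elements already present are removed from their old position first;
    elements not yet present (such as a new collating element) are simply added.
    """
    remaining = [element for element in order if element not in moved]
    position = remaining.index(anchor) + 1
    return remaining[:position] + list(moved) + remaining[position:]


# The example from the proposal, with strings standing in for <collating-elN>.
order = ["el1", "el2", "el3", "el4", "el5"]
order = reorder_after(order, "el4", ["el1", "el2"])
order = reorder_after(order, "el2", ["el6"])
print(order)  # ['el3', 'el4', 'el1', 'el2', 'el6', 'el5']
```

The sketch works on the source-level sequence, which is the point at issue in the later COPY discussion: the same operation on a compiled locale would first require recovering that sequence.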
WG15 RIN minuted the following:
3.1.5 18. Discussion of RTN014 [RIN N092] resulted in a decision not to proceed with either 'reorder_after' or 'replace_after' mechanism in locale ordering.
...the debate was however pursued through both the Heidelberg and Annapolis meetings through a series of WG15 action items: 9205-31, 9210-10, 9305-06 - RIN needs to advise WG15 of its decision at Reading.
From WG15 Heidelberg, May 1993:
5.2.1 (JTC1 22.21.02.01) Shell and Utilities base {2} DIS
The DIS ballot on 9945-2 closes June 6, 1993. Comments and negative ballots are expected. Member Bodies are requested to send electronic copies of ballot comments to the Project Editor (hlj@posix.com). The Project Editor will prepare a preliminary Disposition of comments and circulate this to WG15 in July, 1993. The US will host an Editor's Meeting in conjunction with the October, 1993 WG15 meeting (see open action items 9305-41 and 9305-42).
N391 presented the Disposition of Comments on DIS 9945-2: they included -
5. Other. The following comments will result in no changes to the IS, for the reasons indicated: ...
Denmark 4: The concept of "binary" or "compiled" locales has been quite popular among implementors of the standard and no attempt has been made to mandate interfaces that would make such implementations non-conforming. The "localedef copy" and "replace-after" modifications proposed here would make binary locales extremely difficult to support. Furthermore, they are merely alternatives to existing, standard UNIX (tm) text-file manipulation tools. Since these modifications have received little support in WG15/RIN after repeated discussions, and none from the US development body or any known implementors, they should not be required.
From 9945-2:1993 Annex H.1:
(6) A facility should be added to allow simple modifications to existing locale collation definitions. A proposal for such a replace_after keyword in LC_COLLATE is being developed by Denmark.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 RIN Orlando, October 1995:
Canada indicated that this functionality is not required. KS indicated that this is a major building-block for WG20 work: he went on to outline the mechanism for the proposal.
HW proposed that the Issue be recorded as Closed. DC pointed out that RIN (at Reading) had already closed the Issue. The consensus was that the Issue is Closed.
9510-03 Canada to check its view of the status of the 'reorder-after' Issue at the request of Denmark, and to report back to the next WG15 meeting.
Add Issue .. re Dk concerns with the COPY statement - is this source or binary? (Ref Pp 18 of the [then] existing Issues list).
Debate on revisiting the Issues list decided to remove the above as an Issue for the time being, reinstating it if the response from 9510-12 fails to resolve the problem:
9510-12 US to request clarification on the COPY issue from the US development body dealing with .2b and report back to RIN.
[DS currently (27-Oct-95) believes that COPY works at source level - the IEEE development group believes it works at binary level. The COPY functionality may become the focus of a separate RIN Issue if the response is inconclusive.]
From WG15 Copenhagen, May 1996:
| WG15 N640r responds to this. An awk script to give this
| functionality will be added to the rationale of 1003.2b. The
| Issue remains closed.
NUL, character, byte
Description:
Clarification of the form of NUL, to address the problems of null bytes (an eight-bit sequence with all the bits set to zero) appearing in multibyte character strings and appearing to be string terminators to C language library routines.
Originator:
DK, J
Alternatives:
None
Documents:
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N294 P1003.2b D4 (Shell & Utilities Amd)
Solution:
NUL: A character with all bits set to zero, which is defined as <NUL> in the character set description file.
Status:
Closed. The resolution was reached in 1992.
History:
From WG15 Hamilton, May 1992:
N245 contained the Danish MB comments on 9945-2, including:
11. Page 78 line 2212-2213, 2215, page 55 line 1249-1250: We see no need for a specific encoding and collating order for a character NUL, and we request that this be removed. The current specifications make the POSIX specification character-encoding dependent, and make unnecessary constraints on this character when collating.
N281 contained the following disposition:
This will be considered as part of the P1003.2b revision. NUL is the only special character, and that is because it has a special meaning in POSIX: it cannot be included in text files, and it is used to delimit strings in C. Its value is required by ISO/IEC 9899, on which most POSIX.2 implementations will be based. Consequently, it IS special (see also regular expressions). Most of the utilities using the collation definition are processing text strings; certainly neither strxfrm() or strcoll() can handle nulls except as string terminators. Making NUL the lowest character makes the end-of-string processing simpler and in line with the standards POSIX sorting rules (shorter string sorts before longer). Also leading ellipsis doesn't work if NUL isn't first.
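The string-terminator problem behind this Issue can be illustrated with a short sketch: a routine that stops at the first zero byte, as C string functions do, truncates any encoding whose characters contain zero bytes. The helper `c_strlen` is a hypothetical stand-in for such a routine, and UTF-16 is used purely as a convenient example of an encoding with embedded zero bytes; neither is named in the papers cited here.

```python
def c_strlen(buffer: bytes) -> int:
    """Count bytes before the first zero byte, as a C string routine would."""
    end = buffer.find(b"\x00")
    return len(buffer) if end == -1 else end


text = "AB"

# Single-byte encoding with an explicit terminator: the whole text is seen.
single_byte = text.encode("ascii") + b"\x00"

# An encoding in which ordinary characters contain zero bytes (assumed example).
wide = text.encode("utf-16-be")  # b'\x00A\x00B'

print(c_strlen(single_byte))  # 2
print(c_strlen(wide))         # 0: the first byte already looks like a terminator
```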
N245 contained the Japanese MB comments on 9945-2, including:
<ITSCJ.6> Sect 2.2.2.91 (NUL) OBJECTION. page 37, line 647:
Problem: "NUL: A character with all bits set to zero" is ambiguous, since by the POSIX definition "a character" means "a multibyte character" in general.
It is unclear whether, with the phrase "with all bits .. zero", this definition specifies a single byte null character, a multibyte null character (in general), or both/neither (regardless of number of bits).
Action: If it implies a single byte null character, change to:
"NUL: a single byte character with all CHAR_BIT set to zero."
If it specifies a unique null character regardless of number of bits in the POSIX environment, change to:
"NUL: A character with all bits set to zero, which is defined as <NUL> in the character set description file."
N281 contained the response to this proposal:
It is the second choice. We added a forward pointer to 2.4 in 2.2.2.91, where the requirements for NUL are already listed.
From WG15 Reading, October 1992:
N294, the Shell & Utilities Amendment, Draft 4 contained the following entry:
=> 2.5.2.2.4 Collation Sequence. Remove the following sentence from the second paragraph:
The NUL character shall compare lower than any other character.
Rationale: This change partially satisfies the following requirement from ISO/IEC DIS 9945-2:1992 Annex H: (7) The specific encoding and collation requirements for the character NUL should be removed.
The specific encoding was retained because the C Standard {7} requires it.
From WG15 RIN Reading, October 1992:
3.1.6 19. It was reported that the requirement for NUL to be handled separately had been dropped. It was suggested that NUL would be defined as in ISO 6429:1988 for all possible character sets. This is to be checked.
9210-03 The Danish national body is to provide a proposal for a definition of NUL to this group and to the US development body for consideration at its January meeting (Minute 20).
From WG15 RIN Heidelberg, May 1993:
The RIN Lead Rapporteur was unable to attend. There was no input on the above action item.
From WG15 RIN Annapolis, October 1993:
The action list was lost. The minutes of the previous meeting [Heidelberg] were scanned to recover as many action items as possible. The action item on NUL was not amongst them.
From WG15 Copenhagen, May 1996:
| NUL special handling was dropped in .2b, however, NUL was
| not dropped because it had to be kept to allow POSIX locale to
| be a superset of the C locale. This is acceptable to Denmark.
| The Issue is closed.
charmap, character, encoding, shift-state, state-dependent, stateful
Description:
A mechanism to allow otherwise-identical byte values to be interpreted as different characters by preceding them by implementation-defined escape sequences. The escape sequence forces a change of state, and thus a different interpretation of:
. a subsequent byte (single-shift encoding) or
. subsequent bytes (locking-shift encoding).
In the latter case, a further escape sequence is necessary to force further state-changes.
Originator:
J
Alternatives:
None
Documents:
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N330 Japanese comments on Posix .2b/D4
N362 Japan action item report
N365 US Action Item Report
N436 Japanese action item response for October 1993
N602 RIN N158: Japanese Action Item report to WG15, October 1995
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
Japan believes that in view of more recent developments - the adoption of ISO/IEC 10646-1 and the imminent standardisation of UTF-8 - POSIX has an alternative way to represent multi-script/multi-lingual text without using state-dependent encodings.
Japan therefore proposes not to pursue support for state-dependent encodings; however, Denmark asked for the opportunity to offer new input to this Issue before May 1996.
Status:
Closed. While Japan has decided not to pursue this approach, their decision was reached only a few days before the October '95 RIN meeting. At that meeting Denmark requested the Issue be held open until the May meeting of WG15/RIN to allow time for additional input.
RIN determined that the Issue will be closed in May 1996 if no further input is received.
No further input was received. The Issue is closed.
History:
From WG15 Hamilton, May 1992:
N245 and N281 (Disposition of comments on CD 9945-2 in N245) were considered by WG15 Hamilton. They contained:
<ITSCJ.57> Sect B.5 (regcomp() family) OBJECTION. page 788, line 618: Problem:
The functions regcomp() and regexec() should have wchar_t version interface because of the following reasons:
(1) To use regcomp() and regexec() functions in a program which handles its internal character data in wchar_t data type, for example a text editor, it should do the following process:
1. convert internal text data from wchar_t array to char array.
2. search pattern using regexec().
The conversion should be done every time the program searches a pattern, for each line. It is too heavy overhead to such programs and it will make wchar_t based programming too hard. If wchar_t version of regcomp()/regexec() functions are provided, no wchar_t-to-char conversion is needed.
(2) If regexec() is used on a system which uses state-dependent encoding, the following problem should occur.
When the function regexec() is called on a pattern compiled without the REG_NOSUB flag set in the cflags argument, and when a match is found, the function returns the matched position in the pmatch argument.
If state-dependent encoding is used, this pmatch information may be useless because it does not carry any state information.
For example, suppose we are using a state-dependent encoding, which has two shift state and switches initial shift state to another shift state by SO (Shift Out) code and return from another shift state to initial shift state by SI (Shift In) code.
If searched pattern is:
#define SO 0x0e
#define SI 0x0f
char *pattern = { SO, 'X', 'Y', 'Z', SI, ' ' };
and the string is:
char *string = { SO, 'A', 'B', 'C', 'X', 'Y', 'Z', 'U', SI, ' ' };
the regexec() function will return pmatch information which says:
pmatch[0].rm_so = 4 (start of matched string)
pmatch[0].rm_eo = 7 (end of matched string)
pmatch[1].rm_so = -1
pmatch[1].rm_eo = -1
But in this case, a naive program will treat the matched string as
{ 'X', 'Y', 'Z' }
in INITIAL SHIFT STATE, not in ANOTHER SHIFT STATE, because the returned string position information does not contain any state information.
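The loss of state described in this example can be sketched as follows. The offsets 4 and 7 are those reported in pmatch[0] above; the helpers `shift_state_at` and `extract_with_state` are hypothetical names introduced for illustration, and the sketch assumes a two-state locking-shift encoding switched by SO and SI, as in the Japanese comment.

```python
SO, SI = 0x0E, 0x0F  # Shift Out / Shift In, as defined in the example above


def shift_state_at(data: bytes, offset: int) -> bool:
    """Scan from the start and report whether the shifted state is active at offset."""
    shifted = False
    for byte in data[:offset]:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
    return shifted


def extract_with_state(data: bytes, start: int, end: int) -> bytes:
    """Slice data[start:end], adding the locking shifts the slice depends on."""
    prefix = bytes([SO]) if shift_state_at(data, start) else b""
    suffix = bytes([SI]) if shift_state_at(data, end) else b""
    return prefix + data[start:end] + suffix


text = bytes([SO]) + b"ABCXYZU" + bytes([SI]) + b" "

naive = text[4:7]                       # b'XYZ', read back in the initial state
aware = extract_with_state(text, 4, 7)  # SO 'X' 'Y' 'Z' SI
print(naive, aware)
```

The state-aware version has to rescan from the start of the string, which is information the byte offsets alone do not provide.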
Action:
Define wchar_t version of regcomp(), regexec() functions, which takes (wchar_t *) type string argument, not (char *) type. Because wchar_t string has no state dependent information, this problem does not happen.
It is also useful for programs which treats all character/string information in wchar_t type, instead of char type.
_______________________________________________________________
RESOLUTION: We believe that this subject should be studied for inclusion in the POSIX.2b revision and the full international standard. See resolution ITSCJ.3.
From WG15 RIN Reading, October 1992:
3.1 7. H Jesperson reported on the WG15 9945-2 ad hoc meetings in Utrecht as follows:-
a. State-dependent encoding was discussed and it was agreed that individual utility options should not handle the problem.
3.1.7 21. A review of Uniforum and X/Open documents on state-dependent text encodings has led the Japanese C-language group to develop a minimal set of functions for their manipulation. The whole matter of state-dependent encoding is agreed to be necessary, but the question of exactly what needs to be included is left for later consideration and further discussion.
From WG15 Reading, October 1992:
SC22/WG14 is working on an amendment for C; Derek Jones is the Project Editor. It is also looking at locale specifications. Japan pointed out that concern has been voiced in RIN about "stateful" encoding. The SC22/WG14 Multibyte Support Extension will introduce this into the standard. The issue should be reviewed carefully. The Japanese proposed MSE does not support stateful encoding; however, it is being changed to introduce 6 new functions to support this. It is possible that there could be a mis-match between POSIX and WG14 directions on stateful encoding.
N330, Japanese MB comments on POSIX.2b Draft 4, contained three references to state-dependent encoding problems:
<ITSCJ.2b.1> Sect 2.4.x (State-dependent encoding) DISCUSSION.
Discussion:
[Background] ISO CD POSIX.2/D11.2 Ballot resolution on shift (state-dependent) encoding issues raised by ITSCJ (Japan) chose the option (c) among the following candidates:
(a) State-dependent encoding is out of scope.
(b) State-dependent encoding is allowed, but it is a feature of implementation defined.
(c) To support state-dependent encoding is one of the issues, and it would be considered in the future draft.
[Goal of POSIX.2b] ISO DIS POSIX.2/D12 Annex H says:
(8) The support of state-dependent character encoding (*) should be addressed fully.
[*: Original text of POSIX.2/D11 Annex H uses "state-dependent character sets". However, it is not an appropriate expression.]
[Current status of POSIX.2b/D4] As the first cut, it keeps space holders for
(a) 2.4 Character Set section
(b) 2.5 Locale section
(c) 2.8 Regular Expression Notation section
(d) 4-5 several utilities sections
[What are must]
(1) give a definition of "state-dependent encoding" or "state-dependent encoded character set"
(2) give a clear scope of POSIX(.2) on what kind of state-dependent encodings shall/should/may be supported.
(3) give specification on how to define a state-dependent encoding in charmap file and/or locale
(4) give specification on how to handle state-dependent encodings (by what utilities/functions)
<ITSCJ.2b.2> Sect Global (State-dependent encoding) OBJECTION.
Problem:
State-dependent encoding features are generic over almost all the string/character handling functions and utilities. For example, the following operations are very sensitive. They have to keep track of "state" transition.
- string/character search
- substring/character manipulations (add/delete/modify/insert/...)
However, the current POSIX.2b/D4 picked up several utilities for enhancement of state-dependent encoding support. Since the Japanese Ballot Comments on POSIX.2/D11.2 in terms of state-dependent encoding issues may not cover all the utilities that would be affected by state-dependent support, the POSIX.2b/D4 may give the misleading impression that other utilities have no problems with state-dependent encoding support.
Action:
Instead of addressing state-dependent encoding support in each potential utility section (except specific requirements for a specific utility), create a new subsection in Section 2 to describe global issues and generic requirements regarding state-dependent encoding support.
In particular, list up all the possible character/string processing operations which shall be carefully done in state-dependent encoding environments and specify desirable/requested result of such operations.
<ITSCJ.2b.3> Sect 2.4.x (state-dependent encoding) DISCUSSION.
Discussion:
[ Support of State-dependent Encoding ]
Charmap cannot describe character sets encoded by stateful encoding schemes well because, in a stateful encoding, there is no one-to-one correspondence between octet values and characters, and the same sequence of bytes represents different characters according to the state that is changed by locking shift escape sequences.
It is possible to write a charmap for such characters by placing a locking shift on both sides of each character,
<locking shift><character><locking shift>
where the second locking shift specifies the default state. Although this virtually makes a state-dependent coding stateless, it is not the common practice as it uses a lot of extra bytes.
Single shift is an exception. This form of shift is used to change the state temporarily for interpreting a character that immediately follows it. In other words, every character in a character set invoked by a single shift has that single shift preceding it. Therefore, in charmap, it can be treated as a part of multibyte characters. Unfortunately, single shifts are far less used than the locking shifts.
Besides their description in charmap, the support of state-dependent character sets poses the following problems:
(1) In searching or comparing statefully encoded strings, byte-by-byte comparison does not always yield valid results. It is allowed to insert locking shifts at arbitrary character boundaries even if they are redundant.
(2) In dividing, truncating or making substrings of statefully encoded strings, simply returning part of them can produce strange results because they do not contain preceding and/or following locking shifts.
(3) Concatenated strings may have redundant locking shifts which cause the comparison problem mentioned above.
In order to alleviate these difficulties, an implementation that supports state-dependent character sets shall:
(1) process the statefully encoded strings as a concatenation of state-independent characters.
(2) insert (if necessary) locking shifts at the beginning and at the end of substring to retain correct state information when extracting substrings of a string.
(3) eliminate redundant locking shifts whenever possible.
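Requirements (1) and (3) above can be sketched by decoding a byte string into state-independent (state, byte) pairs and re-encoding it with the fewest locking shifts. This is a minimal sketch assuming a two-state encoding switched by SO and SI, as in the earlier regexec() example; `decode`, `encode` and `normalise` are hypothetical names, not functions from any POSIX or WG14 draft.

```python
SO, SI = 0x0E, 0x0F  # assumed locking shifts: Shift Out / Shift In


def decode(data: bytes):
    """Return a list of (shifted, byte) pairs, one per character, with shifts removed."""
    characters = []
    shifted = False
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        else:
            characters.append((shifted, byte))
    return characters


def encode(characters) -> bytes:
    """Re-encode with a shift only where the state changes, ending in the initial state."""
    out = bytearray()
    shifted = False
    for wanted, byte in characters:
        if wanted != shifted:
            out.append(SO if wanted else SI)
            shifted = wanted
        out.append(byte)
    if shifted:
        out.append(SI)
    return bytes(out)


def normalise(data: bytes) -> bytes:
    """Eliminate redundant locking shifts."""
    return encode(decode(data))


redundant = bytes([SO]) + b"A" + bytes([SI, SO]) + b"B" + bytes([SI])
compact = bytes([SO]) + b"AB" + bytes([SI])

print(redundant == compact)                  # False: byte-by-byte comparison fails
print(decode(redundant) == decode(compact))  # True: same characters
print(normalise(redundant) == compact)       # True
```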
WG15 Plenary produced the following action items:
9210-22: Member Bodies: Review WG15/N330 and provide feedback through their RIN rapporteurs.
9210-23: Member Bodies: Bring the issues of stateful encoding within the new WG14 activities to the attention of their national experts, with special care given to issues that may conflict with 9945-2.
From WG15 Heidelberg, May 1993:
The 9210-22 action was noted as CLOSED: the referenced documents [N362, N365] (US and Japanese AI reports) contain no substantive argument.
The 9210-23 action item was noted as Open and redesignated 9305-10: the assignee was changed to Japan: see [N362, N365]
From WG15 Annapolis, October 1993:
9305-10 was flagged as Complete at Annapolis. N436, the Japanese MB report to WG15, included an attachment on State-Dependent Encoding Support in POSIX.2:
RATIONALE: State-dependent encoding is widely used in Japan and other countries for data communication and data processing. There are several examples:
- When using terminals with a terminal server that do not allow 8-bit non-parity transmission, Japanese characters are transmitted to/from terminal with 7-bit stateful encoding. If the host is using 8-bit non-stateful encoding, which is very common situation, code conversion is done within the terminal driver.
- For the Internet mail and news message transmission, 7-bit stateful encodings are used in Japan, Korea and Taiwan, because the underlying message transmission protocol, SMTP, does not allow 8-bit transmission (See RFC 821 and RFC 822). For detailed description of the encoding used in Japan, see RFC 1468.
- On IBM-compatible mainframes using EBCDIC-based encodings, stateful encodings are used to process multibyte characters. This is true not only in Japan, but in Taiwan, Korea and mainland China.
But the current description in the POSIX standards does not fully address the support of state-dependent encodings, as written in the "2.4 Character Set" section of POSIX.2 (Page 61 in DIS 9945-2).
Not to prohibit implementing POSIX interfaces on the systems that use state-dependent encodings, some description for state-dependent encoding is necessary. Please note that our intention is not to mandate the support of state-dependent encodings on all POSIX-conforming systems, but just to allow state-dependent encodings as an optional feature.
THE CURRENT DISCUSSIONS IN JAPAN:
(charmap syntax extension) Currently one proposal to extend charmap syntax to allow definition of state-dependent encodings is proposed. It is very raw idea and not fully agreed one, so some feasibility study is needed to complete the proposal.
The idea is to introduce "shift state declaration" syntax in the charmap file. A shift state declaration declares the "shift sequence" (one or more bytes which indicate the change of shift states) to switch into the shift state. If a shift state declaration appears, the character set mapping definitions following the declaration define characters in that shift state.
The proposed syntax for shift state declaration is as follows:
"<shift_state_%d> %s %s\n", <shift_num>, <shift_seq>, <comments>
where:
<shift_num> Indicates shift state number (0, 1, 2...). <shift_state_0> shall be the initial shift state.
<shift_seq> Indicates shift sequence. The syntax of shift sequence is the same as that of <encoding> part of character set mapping definition.
<comments> Indicates comments.
From WG15 RIN Orlando, October 1995:
N602, the Japanese Action Item Report, offered the following: Input from Japan to POSIX.2b:
It has been an action item assigned to Japan that Japan propose an extension of charmap syntax for supporting state-dependent encodings. When Japan raised the issue of state-dependent encoding support by ISO/IEC 9945-2, the ISO/IEC 2022, which is a typical state-dependent encoding, is the only one international standard code extension technique to include multiple scripts (multi-lingual text) in a character stream or character string. However, since the ISO/IEC 10646-1 became available in 1993 and UTF-8 is now being standardized, the user of POSIX standards got alternative way to represent multi-script/multi-lingual text without using state-dependent encodings. And that must be the way which POSIX standards will endorse.
Therefore, Japan believes that the requirements for supporting state-dependent encodings with POSIX systems are very small now. It is difficult to get support from vendors and users for any proposed extension on this topic.
Considering the above situation, Japan would propose not to pursue the extension for supporting state-dependent encodings.
[Discussions in the RIN minutes [RIN N154] indicated:]
Japan have indicated that they do not wish to pursue this Issue in very recent email. Denmark indicated that it wishes to pick up the problem and attempt to resolve it. UK has no objection. Canada proposed applying a time limit on holding the issue open - if no input is forthcoming within 6 months RIN will close the Issue; this was agreed.
9510-04 Dk to supply an input paper to Issue 6 within 6 months or the Issue will be closed.
From WG15 Copenhagen, May 1996:
| This issue is now Closed in RIN: no additional input has been received by RIN. WG15 resolved to regard this issue as closed.
charmap, iconv, code-set, locale, character
Description:
A coded character-set conversion technique based on the charmap mechanism, with a charmap- or locale-based fallback.
Originator:
WG20, DK
Alternatives:
Documents:
RIN N111 WG20 NP on Cultural Convention-Set Registry
RIN N112 WG20: Subdivision for cultural convention specification standard
RIN N113 CEN: Information Technology-European Multilingual Ordering
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
N284 WG15 minutes, Hamilton, May 1992
N294 P1003.2b D4 (Shell & Utilities Amd)
N330 Japanese comments on Posix .2b/D4
N444 CEN cultural elements registry
N462 Ca: Proposal for inclusion of CHARIDS in next amd 9945-2
N515 US Action Item Report.
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
The proposal has been accepted in principle by RIN and the development body. Charmap-based conversion appears in 1003.2b Draft 11 but is not yet fully developed.
Status:
Closed.
History:
From WG15 Stockholm, November 1991:
New DS Issues:
3. Want command to convert between code sets based on charmaps
Keld has indicated that DS has done this. The DS solution, however, is not known to the other members of the small group.
>> KS to submit proposal. That proposal should be reviewed by RIN, with coordination with the IEEE working group, with the potential of being included in P1003.2b. Ultimate solution should align, where possible, with technology of XPG4 iconv.
[I could find no record of an appropriately-titled document to either RIN or WG15 in response to this]
From WG15 RIN Stockholm, November 1991:
4.11. Interface routines for locale and charmap
Keld Simonsen introduced Danish suggestions for interface routines for locales and charmaps, adding that it was related to work in progress within X/Open. Donn Terry pointed out that, when a corresponding well-finished proposal was forthcoming, it should be accompanied by a statement justifying the requirement for such a facility. Given such justification, the facility appeared to him to be suitable as a component of a revision to 9945-1.
From WG15 Hamilton, May 1992:
N245 included Danish member body comments on 9945-2: 5. We miss a utility that can convert files based on charmaps or locales. The charmaps are the formal place to specify the character sets, and this information should be used also to convert files. As heterogeneous environments become more commonplace, viz. world-wide networking, and some frequent Danish letters occur in different positions in various character sets, there is much need for a specification for scripts and for user extensibility. We intend to have a proposal ready for a later issue of 9945-2, and we see a place for this in a revised "tr" utility. We would like a statement in 9945-2 that this is an area where work is to be done.
N281 contained the response to this proposal:
We have added a statement to the tr rationale. Such a statement of future intentions is limited by ISO rules to a footnote or informative annex.
From WG15 Reading, October 1992:
WG15 Plenary considered the responses to the following action item from Stockholm:
9205-09 Danish Member Body to prepare and submit a specific proposal regarding conversions between code sets (based on charmaps, or otherwise; proposal should give appropriate consideration to XPG iconv). (open action item 9111-20) Status: Done - proposal is included in P1003.2b.
[I could find no appropriately-titled document to either RIN or WG15 describing the proposal]
N294, the 1003.2b (Shell & Utilities Amd) Draft 4, was available at the meeting. The draft included a new iconv utility to convert codesets.
N330, the Japanese MB comments on N294, included a number of objections to the iconv section:
<ITSCJ.2b.11> Sect 4.73.3 (iconv) OBJECTION. page 72, line 2022:
Problem:
[iconv command option]
The description of the "-f fromcode" option says that "If the option-argument is the pathname of a readable file, iconv shall attempt to use it as a charmap file, as defined in 2.4.1." This semantics may cause unexpected results depending on the current working directory, because if a file or a directory in the current directory happens to have the same name as "fromcode" (or "tocode"), iconv will treat the file as a charmap file. This behavior prevents users from using a file name that is the same as a codeset name. Because there are no standards for charmap file names, it will be impossible to use the iconv command in a portable manner. I think there should be a means for users to specify explicitly that the "fromcode" and "tocode" arguments are to be used as charmap files.
Action:
There are three proposals for the modification of iconv specification.
(1) The first proposal is to add a new option, "-c", to specify the "fromcode" and "tocode" option-arguments are charmap file names. If "-c" option is not specified, iconv will treat "fromcode" and "tocode" option-arguments as implementation-defined codeset names.
Change the description of "-f fromcode" option (lines 2021-2028) to:
-f fromcode Identify the codeset of the input file. Valid values for fromcode are specified in the system documentation. If this option is omitted, the codeset of the current locale shall be used.
and add the following option description after the line 2030:
-c Treat the fromcode and tocode option-arguments as the names of charmap files. If the option-arguments are the pathnames of readable files, iconv shall attempt to use them as charmap files, as defined in 2.4.1. If the readable file is not a valid charmap file, the results are undefined. If the option-argument is not the pathname of a readable file, the results are implementation defined.
(2) The second proposal is to add new set of options which specify charmap file names. In this proposal, "-f fromcode" option is always used to specify codeset name. To specify charmap file, you must use "-F fromcharmap" option.
Change the description of "-f fromcode" option (lines 2021-2028) to:
-f fromcode Identify the codeset of the input file. Valid values for fromcode are specified in the system documentation. If this option is omitted, the codeset of the current locale shall be used.
and add the following option description after the line 2030:
-F fromcharmap Identify the codeset of the input file. If the option-argument is the pathname of a readable file, iconv shall attempt to use it as a charmap file, as defined in 2.4.1. If the readable file is not a valid charmap file, the results are undefined. If the option-argument is not the pathname of a readable file, the results are implementation defined. If this option is omitted and -f fromcode option is not specified, the codeset of the current locale shall be used. If both of the -F fromcharmap and the -f fromcode options are specified, the results are undefined.
-T tocharmap Identify the codeset of the output file. The semantics are equivalent to the -F fromcharmap option.
(3) The third proposal is to add a mechanism to identify whether the fromcode (or tocode) option-argument is a charmap filename or not. In the following description, if the fromcode or tocode option-argument has a <slash> character in it, it will be used as a charmap file.
Change the description of "-f fromcode" option (lines 2021-2028) to:
-f fromcode Identify the codeset of the input file. If the option-argument contains a <slash> character and is the pathname of a readable file, iconv shall attempt to use it as a charmap file, as defined in 2.4.1. If the readable file is not a valid charmap file, the results are unspecified. If the option-argument does not contain a <slash> character, the results are implementation defined. If this option is omitted, the codeset of the current locale shall be used.
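The third proposal amounts to a purely syntactic test on the option-argument. A minimal Python sketch of that decision follows. The function name `classify_code_argument` and the three returned labels are hypothetical, chosen for illustration only; the case of an argument that contains a slash but does not name a readable file is not addressed by the proposal text, so the sketch merely flags it.

```python
import os

def classify_code_argument(arg):
    """Classify a -f/-t option-argument under proposal (3).

    Returns one of three illustrative labels (not part of any standard):
      "charmap"      - contains a slash and names a readable file
      "undetermined" - contains a slash but is not a readable file
                       (the proposal text does not cover this case)
      "codeset-name" - no slash; the meaning is implementation defined
    """
    if "/" in arg:
        if os.path.isfile(arg) and os.access(arg, os.R_OK):
            return "charmap"
        return "undetermined"
    return "codeset-name"
```

Under this rule a file in the current directory is only picked up as a charmap when the user writes it with an explicit path such as `./ISO_8859-1`; the bare word `ISO_8859-1` is always a codeset name, which removes the dependence on the working directory described in the Problem statement.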
<ITSCJ.2b.12> Sect 4.73.5.3 (iconv) OBJECTION. page 73, line 2058:
Problem:
[LC_CTYPE environment variable description of iconv command]
In the description of "-t tocode" option of iconv command, it says that "The semantics are equivalent to the -f fromcode option." and the last sentence of "-f fromcode" says "If this option is omitted, the codeset of the current locale shall be used." It means that if the "-f fromcode" option is specified and the "-t tocode" option is omitted, the codeset of the current locale is used as the output file's codeset. This behavior should also be noted in the LC_CTYPE description.
Action:
Add the following sentence after the line 2058:
If -t tocode option is omitted, this variable shall determine the codeset of the output file.
From WG15 RIN Annapolis, October 1993:
Mapping locales on to the underlying character set is problematic. There is the charmap approach, but there are misgivings that this is inelegant at best and inefficient in the case of large character sets, such as used by the Japanese.
9310-07 MBs are asked to consider the impact and problems associated with the support of locales by the charmap mechanism, and to consider the need for the establishment of a charmap registry. Responses to RIN Lead Rapporteur prior to the WG15 meeting, May 1994.
9310-08 Lead Rapporteur to report to WG15 that RIN is considering the need and possible alternatives for charmaps. RIN is looking for technical input on whether charmaps provide the best solution to the problem. RIN notes that CEN is currently constructing a charmap registry, <MDR-12>, and that WG20 are also taking this approach - <MDR-10> and <MDR-11> refer.
MDR-10 -> RIN N111
MDR-11 -> RIN N112
MDR-12 -> RIN N113
From WG15 Annapolis, October 1993:
Plenary considered N515, the US action item report, which responded to AI 9405-56:
9405-56 United States: Forward N444 to PASC for possible inclusion 1003.2b and report back to WG15 on actions taken; reference WG15 resolution 94-283. (Closed)
CLOSED...The US has identified two proposals for change to 9945-2 presented in N444. The first of these is the Charsymbmap proposal described in section 6.9. We believe this proposal to be essentially the same as the Canadian CHARIDS proposal contained in N462. See the response to action item 9405-55. The second proposal is the "replace-after" proposal described in Annex A. The US believes this extension to be unnecessary as demonstrated in Annex A.4 of the same document.
Denmark had problems with the US responses here. This was discussed in WG15 Plenary as follows:
4.9.2 Charsymbmap (US report back on [N444]) [N515]
Denmark believes they have consensus on this proposal now. Canada disagree. The US response in N515 to Action item 9405-56 states that they believe the proposed extension to be unnecessary, the functionality being provided by the CHARID proposal - see above. Germany noted that if CEN adopts the charsymbmap proposal then Europe would have two incompatible standards - Posix and charsymbmap. Denmark suggested that the WG15 review of 1003.2b D10 should resolve any outstanding issues. The Canadian (CHARID) solution addresses a smaller set of problems than the Danish (charsymbmap) proposal. It may be possible to resolve any shortfall in CHARIDs by suitable proposals to enhance it from the European members.
From WG15 RIN Orlando, October 1995:
KS reported from his discussions with the .2 group that this work was in process of being added to the draft. The Issue is Closed.
From WG15 Copenhagen, May 1996:
| Accepted in principle. WG15 awaits the 1003.2b group, which is working on an appropriate mechanism in iconv.
file, utility, locale, file-types, LC_CTYPE
Description:
A proposal to extend the set of file types recognised by the "file" utility by adding a command-line parameter specifying a file containing descriptions of file types.
Originator:
DK
Alternatives:
None
Documents:
N271 DK: Danish comments on 9945-2 Amd 1
N282 Disposition of comments on CD 9945-2 Amd 1
Solution:
This proposal was accepted and will be added in the final standard.
Status:
Accepted and closed.
History:
From WG15 Stockholm, November 1991:
N271 is the first relevant document on the subject of the "file" utility:
Danish comments on 9945-2 Amd 1 Sect 5.14 OBJECTION, page 163
Problem: The specification of the FILE-utility is too small a subset of implementations normally seen.
a. It should as a minimum be possible to extend the number of file-types recognised in a reliable (or unreliable way). We need something like the /etc/magic-filetype-specification.
b. It should be possible to test, if a file is of type text according to the LC_CTYPE class printable.
Action:
1. Add a fileformat-specifications. Use /etc/magic if nothing better is available. Could be an option like [-m file]
2. Add the ability to recognise (printable) text according to the locale. This may also be done with an option like -t or with a separate utility.
_______________________________________________________________
RESOLUTION:
1. This will be considered for inclusion in POSIX.2b.
2. This will be added in the final standard.
From WG15 Copenhagen, May 1996:
| Accepted and closed in RIN and WG15.
file, exchange, portable, format, character, character-set, transport
Description:
A mechanism whereby the exchange format may accommodate the full set of characters in a portable way.
Originator:
Ca
Alternatives:
Status quo.
Documents:
(WG15RIN.185) pax -e comments
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N266 SC22/WG14 N197: Support for symbolic character names
N281 Disposition of comments on CD 9945-2.2
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
The revision to pax in 1003.2b Draft 11 satisfies the requirement. 'pax' extended headers include support for ISO 10646.
Status:
Closed
History:
From WG15 Stockholm, November 1991:
The Danish MB's comments on CD 9945-2, quoted from N245 include: "7. We want the text for 'pax -e' (in previous drafts) to be included, as we need a better quasi-portable way of transporting such files. It may be included in Annex F.
"It could be included in the normative part of the standard at a later stage, and we would like indications in the standard that an extended exchange format is being planned."
The response, in N281, was:
RESOLUTION: The text has been added to Annex G (the previous F). Statements about future plans are already in the draft (See D11.2 page 551 lines 9965-68 and page 558 lines 10251-65).
From the plenary:
New DS Issues: 1. Want pax -e in -2.2 Canada needs to have a meeting of their TAG to determine position. Will meet in December and advise Hal.
(WG15RIN.185) pax -e comments: Keld, here are the pax -e objection texts. Hal
The -e stuff is very complicated and there is a lack of standardized C language support to implement this feature. Trying to standardize this at this point is a mistake. Why not place an optional record in one of the archive headers that states "this archive was created in the foobar locale" and leave it up to the recipient to handle the foobar locale. Even with -e the way it is stated, there is no guarantee that any locale but the portable one will be properly handled by recipients.
----------------------
Problem: I stated this once before -- it deserves repeating: The creeping proliferation of charmap is getting out of control.
The charmap started out to be a simple and straightforward device to allow code set independent specifications of locale definitions. It is trying to generate a life of its own. It is this type of thing that causes those who do not have an appreciation for internationalization to oppose any and everything having to do with internationalization and characters and character sets beyond ASCII.
I am strongly opposed to the -e option of the pax utility and the introduction of charmap where it should not be.
The introduction of the -e option and charmap to the pax utility only serves to reduce consensus on POSIX.2.
Action: Delete "[-e charmap]" from lines 9614, 9615, and 9617. Delete lines 9694-9713. Delete lines 10140-10170.
----------------------
Drop this whole mess. It's too new, I don't think that it's well thought out in the context of the full problem. The time to address this class of issue is when the new file format is addressed. When the full file format is addressed, this can be done in concert with controlling the format and having the ability to represent both very long file names and to indicate the character set in use. (The use of -e could cause distinct filenames to be truncated to the same name.)
Asking for warnings when a name might not translate is OK with me.
From WG15 Hamilton, May 1992:
Keld's Proposal (N266): Danish proposal adding two functions was discussed. One function takes a code point and returns the symbolic character name. The other function takes a symbolic character name and returns the code point.
DK explained these would be used in the implementation of things like pax -e, and iconv().
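The two functions under discussion can be pictured as a pair of lookups over a charmap-style table. The sketch below is in Python rather than the C binding the proposal targeted, and it does not reproduce the N266 text: the function names `name_of` and `code_of`, the symbol spellings, and the three-entry table (using ISO 8859-1 positions) are all hypothetical.

```python
# Hypothetical charmap-style table: symbolic name -> code point in one
# particular coded character set (the values here are ISO 8859-1 positions).
SYMBOL_TO_CODE = {
    "<A>": 0x41,    # LATIN CAPITAL LETTER A
    "<o/>": 0xF8,   # LATIN SMALL LETTER O WITH STROKE
    "<ae>": 0xE6,   # LATIN SMALL LETTER AE
}
# Reverse index built once so both directions are simple dictionary lookups.
CODE_TO_SYMBOL = {code: name for name, code in SYMBOL_TO_CODE.items()}

def name_of(code_point):
    """Return the symbolic name for a code point, or None if unknown."""
    return CODE_TO_SYMBOL.get(code_point)

def code_of(symbolic_name):
    """Return the code point for a symbolic name, or None if unknown."""
    return SYMBOL_TO_CODE.get(symbolic_name)
```

Both lookups are tied to a single coded character set, so a table of this kind would be needed for each character set involved.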
Some discussions about the first record of the new pax format containing a character set name.
There still needs to be a translation between code pages, that the symbolic name routines do not help with. Keld is concerned that industry groups are leaning towards the use of symbolic character names. Additionally, there are a number of Danish proposals in the pipeline which depend on this particular proposal.
Donn Terry is still concerned with general portable applicability.
Because the timing of iconv() and pax -e are still indeterminate and these routines are being proposed solely because of these, it was felt it is too soon.
9205-40 US Member Body: Forward the Danish proposal, N266 to IEEE POSIX.1 for their review.
From WG15 Reading, October 1992:
This action was noted as Complete at the start of the WG15 Reading meeting.
From WG15 RIN Orlando, October 1995:
Canada indicated that it was satisfied with the text as presented in 1003.2b D11. The meeting agreed that the Issue is Closed.
From WG15 Copenhagen, May 1996:
| Accepted and closed in RIN and WG15.
wide, char, character, MSE, multibyte, encoding
Description:
POSIX interfaces should normatively reference the C MSE wide character support APIs.
Originator:
J
Alternatives:
Documents:
RIN N105 Japanese Comments on POSIX.1a (MSE)
RIN N106 Japanese Proposal to POSIX 1003.2b
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
To normatively reference the amended 9899:1995 C standard in POSIX standards is a necessary but insufficient resolution of the problem. WG15 at its Copenhagen meeting acknowledged the requirement and requested the US development body to include an acceptable solution in the next draft (12) of the 1003.2b document, following expert advice.
Status:
Closed. WG15 and the IEEE development body accept the requirement.
History:
From WG15 Stockholm, November 1991:
a. SRTN8, Japanese concerns re CD 9945-2
- Japan would like to make this document visible to other countries - need to assign number although Japan plans to expand the document and deliver more detailed response before the end of the year.
Japan needs to handle multiple char sets simultaneously, per ISO 2022; data files often contain various escape sequences which indicate which char set data follows; discussion of these requirements in relation to nature of LC_CTYPE:
- Hal indicated that he did not feel that LC_CTYPE would prevent interpretation of command line args consistent with Japanese needs
- item 3 on Pg2 of comments really deal with 9945-1 features? Japan has difficulty dealing with wide char data with traditional Lib C; would like to see wide char handling capabilities in .2 utilities, both for functionality and as an example of wide char handling for programmers. Japan is not sure whether it would be more appropriate to include wide char (ISO C/MSE) features in .1 or .2; .1a might be the appropriate place to include these extensions. (Although it might be feasible to include in the LIS spec, WG15 has told US body that LIS MUST be the same as the 1990 standard, thus no extensions could be included).
Hal suggested that these comments be included in the Japanese ballot, so that they would be on record officially, and the US could deal with them as work on .1 AND .2 proceed.
A Japanese "Yes" vote with this comment, creating a WG15 issue, would allow Hal to insist that extensions be included in .2b (and .1a)
[N245 includes the Japanese MB comments on 9945-2, and details the Japanese MSE proposal].
From WG15 Reading, October 1992:
SC22/WG14 working on an amendment for C, Derek Jones is the Project Editor. It is also looking at locale specifications. Japan pointed out that concern has been voiced in RIN about "stateful" encoding. The SC22/WG14 Multibyte Support Extension will introduce this into standard. The issue should be reviewed carefully. The Japanese proposed MSE does not support stateful encoding, however is being changed to introduce 6 new functions to support this. It is possible that there could be a mis-match between POSIX and WG14 directions on stateful encoding.
WG15 Reading produced the following resolution:
RESOLUTION 92-223 9945 multibyte/wide character handling
Whereas the current ISO/IEC 9945-1 (POSIX.1) does not support any APIs for multibyte/wide character handling that are defined by ISO/IEC 9899 (C Language), and
Whereas the DIS 9945-2 (POSIX.2) does specify generic character handling features based upon a character definition that "a character means a sequence of one or more bytes representing a single symbol", and
Whereas an amendment to ISO/IEC 9899 is scheduled in 1993, in which Multibyte character Support Extensions (MSE) are proposed to provide a set of functions for multibyte/wide character handling, aiming at improvement of worldwide portability of C programs that need generic character handling capabilities, and
Whereas the CD 9945-2 Ballot Dispositions and the POSIX.2b Draft has indicated that certain extensions will be needed in conjunction with the proposed ISO C MSE and its derivatives in the POSIX environment, and an API part of which should be included in a future amendment to 9945-1,
Therefore, SC22/WG15 requests that the US:
1. Consider the LIS and language-binding interface changes necessary to handle character-oriented features as symbol and not storage patterns for a future revision of 9945-1.
2. Inform SC22/WG15 of any plans for supporting such features in future revisions of all parts of the 9945 Standard.
From WG15 RIN Annapolis, October 1993:
RIN considered two papers submitted by Japan, touching on the MSE issue - N105, N106.
From WG15 Annapolis, October 1993:
22.39 Extensions to base {1a} na Japan will be proposing the inclusion of the 'C' MSE amendment in the Posix series of standards. This is still under discussion in WG14. Flags have been raised within RIN that this will happen.
From WG15 RIN Twente, May 1995:
3.1.11 C MSE widechar support --Japan will make a proposal--open
From WG15 RIN Orlando, October 1995:
This Issue was originated by Japan. The C MSE amendment is now a full international standard; it should be supported by 9945-1.
9510-05 Japan to check if the reference to IS 9899:1995 would satisfy their requirements for MSE support in 9945-1, and to report their findings back to WG15.
9510-06 KS to investigate the possibility of having the latest versions of the 9945-1 standard reference the 9899:1995 C standard, including the MSE addendum.
From WG15 Copenhagen, May 1996:
| Reference to the MSE C standard is not sufficient to resolve the problem.
|
| Debate diverted to what the real problem was here, and whether it was better solved in the locale or the charmap regimes. The US offered to take the problem back to the IEEE development body, and proposed closing the issue based on the understanding that Draft 12 of 1003.2b would include a resolution of the issue. The requirement for the functionality is accepted by WG15. The Issue is closed.
ISO 646inv, shell, awk, 9945-2, ISO 10646
Description:
A proposal to permit the character set defined by ISO 646 inv in the shell and the small languages supported by the POSIX Shell and Utilities standards.
Originator:
DK
Alternatives:
a) No change
b) Support ISO 10646
Documents:
RIN N047 A representation for the shell in ISO 646
N323r WG15 RIN N096: Minutes & resolutions, Reading, October 1992
N416 Invariant ISO 646 support in Posix 9945-2
N640r US TAG N573, N587: AI 9510-14, Report on POSIX.2b Issues
Solution:
RIN regards the issue as closed. WG15 and the US development body also regard the proposal as being rejected.
Status:
Issue in RIN is Closed: the issue is now between DK and WG15 who have invited DK to supply further documentation to support their proposal.
DK has been in contact with the development body and has submitted a proposal in their ballot comments to the CD registration of 1003.2b D11, October 1995.
WG15, following advice and debate, rejected the proposal at its Copenhagen meeting, May 1996. The Issue is closed.
History:
From WG15 Stockholm, November 1991:
c. RIN SRTN7/N047, A representation for the shell in ISO 646
Proposal from Denmark relates to a long identified problem and an inconsistency with the recommendations of ISO TR10176 (programming languages should not use certain characters; note that TR10176 states that it may not be globally applicable, and seeks further input; 9945-2 may be a case in point), but the Danish proposal should be expanded and clarified so that it:
1) addresses all aspects of proposed standard, rather than JUST the shell, (e.g. it should work with not only shell, but also regular expressions, awk, etc)
2) should allow use of all features of the proposed standard, maintaining conformance, (e.g. currently proposed use of "--" would conflict with existing use)
3) should provide a general solution for similar requirements of other countries
4) should be sensitive to the cost/benefit ratio of imposing the solution in relation to existing implementations.
Issue that proposal addresses is the ability of using national characters within file names etc, without impact on shell interpretation (e.g. Danish "slashed-O" occupies the same space as the POSIX pipe symbol, thus file names cannot include a slashed-O without the shell interpreting that character as a pipe).
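The collision can be shown concretely. In the Danish national variant of ISO 646, position 0x7C, which the shell reads as the pipe symbol, holds the small slashed o. The Python sketch below decodes the same bytes both ways. The table `DK_646_NATIONAL` and the function `decode_dk646` are illustrative names, and only the six Danish letter positions are listed; all other positions are assumed to match ASCII for the purpose of the example.

```python
# National positions of the Danish ISO 646 variant that carry letters.
# Every other byte is treated as ASCII in this sketch (an assumption).
DK_646_NATIONAL = {
    0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å",
    0x7B: "æ", 0x7C: "ø", 0x7D: "å",
}

def decode_dk646(data):
    """Decode bytes using the Danish national positions listed above."""
    return "".join(DK_646_NATIONAL.get(b, chr(b)) for b in data)

name = b"r|d.txt"
as_danish = decode_dk646(name)    # what a Danish terminal displays
as_ascii = name.decode("ascii")   # what the shell grammar sees
```

The byte string is identical in both readings; only the interpretation of 0x7C differs, which is why an unquoted file name of this kind is split by the shell at what it takes to be a pipe.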
Presentation of national characters on displays and printers is a separate issue.
From WG15 RIN Reading, October 1992:
3.1.15 28. The Danish draft on invariant ISO 646 is seen as a rehash of the original trigraph proposals to digraphs. This should be approved by WG14 [!] before this issue may be re-opened in this group. Closed pending such approval.
From WG15 Heidelberg, May 1993:
9305-04 Denmark: Expand and clarify proposal contained in RIN N047 regarding usage of national characters (as defined in ISO 646 national positions), giving consideration that such proposal:
1) addresses all aspects of proposed standard, rather than JUST the shell, (e.g. it should work with not only shell, but also regular expressions, awk, etc.)
2) should allow use of all features of the proposed standard maintaining conformance, (e.g. currently proposed use of "--" would conflict with existing use)
3) should provide a general solution for similar requirements of other countries
4) should be sensitive to the cost/benefit ratio of imposing the solution in relation to existing implementations (open action item 9111-25, 9205-11, 9210-4)
From WG15 Annapolis, October 1993:
The above action was noted as closed.
From WG15 RIN Annapolis, October 1993:
RIN AI 9305-05 Invariant ISO 646: Input required from Denmark.
This action was noted as (Open) going into the RIN meeting - but was not present in the list of actions at the end of the meeting, possibly due to the appearance in WG15 of:
N416 Invariant ISO 646 support in Posix 9945-2
22.41 additional utilities {2b} CD reg: [N416, N420] Proposed action on the US to take these on board. Nl accepts N420 proposal, but regards the N416 document as representing old technology superseded by ISO 10646.
The original action was on DK to provide these papers as additional information to the US. Done deal. N416 and N420 will be passed to the US for comment.
From WG15 Tokyo, May 1994:
9405-52 United States: Review N416 and N420 and forward them to PASC for consideration.
From WG15 Vancouver, October 1994:
This action was flagged as (Closed) in the review of action items going into the WG15 Vancouver meeting; debate on the item was summarised as:
5.2.3 22.41 additional utilities {2b} CD reg: [N416,N420] Denmark is not happy with the response (not going to include extended characterset support because it would reduce consensus) to its request and would like to enter into a dialogue with the IEEE group responsible. Denmark is invited to offer further supportive argument.
From 9945-2:1993 Annex H.1:
(2) The shell, awk, other small languages, and regular expressions should be supported by national variants of ISO/IEC 646 {1}. A proposal from Denmark is expected in this area.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 Copenhagen, May 1996:
| N640r responds to this at length.
|
| The IEEE development body does not believe this proposal is useful - its incorporation would reduce consensus. Adding this to RegExp support would comprehensively break it. The extension in its effect on meta-characters in the small languages would introduce grammar inconsistencies which would be difficult to gain approval for.
|
| WG15 regards the issue as closed. Technical experts view the problem as insoluble in the POSIX small languages. WG15 invites technical contributions which would indicate the problem is soluble, or has been solved.
CHARIDS, charmap, locale, localedef, UCS, code-point, code-set
Description:
A mechanism to enable the automated production of a charmap file through the addition of a reference to a code-point in ISO 10646 for each symbol in the CHARID file.
Originator:
Ca, DK
Alternatives:
charmap
Documents:
RIN N127 Procedures for European Registration of Cultural Elements, CEN draft 5
N316 Canadian contribution to SC22/WG20 - Short character names
N462 Ca: Proposal for inclusion of CHARIDS in next amd 9945-2
N515 US Action Item Report
N554 Ca Action Item Report
N555 US Action Item Report
N558 RIN N150: DK Action Item Report
N566 CEN/TC 304 N437: Procedures for the registration of cultural elements: Draft 9
N605 RIN N160: DS Additional comments on P1003.2b/D11
(SC22WG15.498) Comments on WG15 Action Item 9410-24 (Canadien questions)
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
The Issue is Closed. Canadian and Danish inputs have been accepted into 1003.2b Draft 11 or later.
Status:
Closed. The proposal to extend the charmap file to accommodate references to code points has been accepted. The US development body is developing text in 1003.2b draft 12 to address the requirement.
History:
From WG15 Tokyo, May 1994:
Plenary considered N462, the Canadian MB contribution on CHARIDS:
Introduction: As defined in the current text of iso/iec 9945-2 a locale definition file that uses mnemonic character naming cannot stand alone, but must be associated with a Charmap file that maps the mnemonic names to code points. This mapping is necessarily dependent on the character set in use.
Therefore any locale definition requires:
- the locale definition file;
- at least one CHARMAP file;
- for each CHARMAP file, a statement of what character set it corresponds to;
Further there is no standardized machine-readable way of specifying the second and third items. As a result it is not possible to write a locale definition that is independent of implementation. ...
Proposal: We are in the process of defining a Canadian Locale and we need to make this definition both unambiguous and implementation independent. We propose a "CHARIDS" file to address this deficiency. We feel that this is an international requirement and should be included as a normative amendment to ISO/IEC 9945-2.
The "CHARIDS" file would be very similar to CHARMAPS. The only differences are that the file/header name is CHARIDS and that the character value operand is a reference to a code point in ISO 10646. This permits any implementation, given a way of mapping ISO 10646 to the desired character set, to produce a corresponding CHARMAP file, without human intervention. Note that the existence of a CHARIDS mechanism does not preclude the use of CHARMAP files as currently specified. Document ISO/IEC JTC1 SC22/WG15 N316 outlines an approach based on ISO 10646 that we feel satisfies the CHARID requirement.
The header and trailer would be as follows:
CHARIDS
END CHARIDS
Between these two statements the symbol definitions would look like
<symbol> <Uxxxx> "optional comment"
where:
<symbol> is a symbol representing a character and used in the LOCALE definition;
<Uxxxx> would be U (standing for UCS) followed by the hexadecimal coding value attributed to that character in iso/iec 10646 (4 hexadecimal digits); mapping of UCS coding to the actual code used by an environment would be implemented by this particular environment's designers/implementors/providers, based on this standard reference.
It should be noted that X/Open already uses this approach although it is not standardized. Canada plans to use this syntax in its LOCALE definition. ...
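As an illustration only, the file layout described in N462 could be read along the following lines. This is a minimal Python sketch, not part of any proposal: the function name parse_charids, the sample symbols <a> and <e-acute>, and the decision to reject malformed lines are assumptions made for the example.

```python
import re

# One symbol definition: <symbol> <Uxxxx> "optional comment"
# (four hexadecimal digits, as described in the N462 text above).
_DEF = re.compile(r'^(<[^>\s]+>)\s+<U([0-9A-Fa-f]{4})>(?:\s+(.*))?$')


def parse_charids(text):
    """Return {symbol: code point} for the lines between CHARIDS and END CHARIDS."""
    symbols = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "CHARIDS":
            inside = True
        elif line == "END CHARIDS":
            break
        elif inside:
            m = _DEF.match(line)
            if m is None:
                # Assumption: a malformed definition is treated as an error.
                raise ValueError(f"not a CHARIDS definition: {line!r}")
            symbols[m.group(1)] = int(m.group(2), 16)
    return symbols


# Hypothetical sample file; the symbol names are illustrative only.
SAMPLE = """\
CHARIDS
<a>        <U0061> "LATIN SMALL LETTER A"
<e-acute>  <U00E9> "LATIN SMALL LETTER E WITH ACUTE"
END CHARIDS
"""

table = parse_charids(SAMPLE)
```

The resulting table holds ISO 10646 code points only; producing a CHARMAP for a particular character set would still need the implementation-supplied mapping that the proposal refers to.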
The discussion took place at agenda point 6.6:
6.6) CHARID (Canada) Reference N462
This is a better way of doing charmaps based on Canada's experience in this area. This document has been presented to WG20 who has accepted it. The CEN registry and X/Open is aligned with this proposal. Canada would like to give this to PASC for inclusion in 1003.2b. Resolution forwarded to the drafting committee to forward this to PASC. Action item 9405-55 on the United States to forward N462 to PASC for inclusion in 1003.2b and report back to WG15 on actions taken.
From WG15 RIN Vancouver, October 1994:
3.1.13 Charsymb/CHARIDs (N119, N127)
There was discussion over conflicting proposals (conflicting to a minor extent) presented by Mr. Kriger and Mr. Simonsen. Mr. Kriger noted he believes the US-proposed changes will not be upwardly compatible. Mr. Simonsen explained why they would. Mr. Hill noted the US noted its response to SC22/WG15 action item 9405-55 is relevant. Mr. Hill noted the US expects substantive discussion of this item to take place in SC22/WG15.
From WG15 Vancouver, October 1994:
Action item 9405-55 was noted as complete. The US AI report, N515 refers:
CLOSED...The US believes the proposal is not complete since it does not provide any way to transform CHARIDS files into charmap files. Therefore there still isn't a way to create portable locale definitions. A couple of straightforward extensions to the localedef utility and the charmap files in 9945-2 will provide a portable way to define locales. We believe this is the intent of the Canadian proposal.
The following list summarises changes the US proposes as an alternative solution to this problem:
1. Expand the legal values for the RHS of the charmap file to include UCS2 and UCS4 values. These values would be of the form <Uxxxx> and <Uxxxxxxxx>, respectively.
2. Add a -u <code-set-name> option to localedef to indicate the target code-set to be used by the compiled locale. If the -u option is given then all the values of the forms <Uxxxx> and <Uxxxxxxxx> will be translated from those UCS2 and UCS4 values to corresponding code-points in the code-set specified by the -u option.
3. That implementations have localedef predefine mappings for the standard symbolic names for characters in the character set defined by 9945-2 Section 2.4.
The US believes that these changes would allow application writers to build portable charmap and locale source definition files that could be used on any implementation providing the 9945-2 option that includes the localedef utility as long as the implementation recognised the target code-set for the compiled locale.
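The translation step implied by items 1 and 2 can be sketched roughly as follows. This is a hedged Python illustration, not the behaviour of any real localedef: the function names are hypothetical, Python codec names stand in for the code-set name given to the proposed -u option, and the output uses the \xHH hexadecimal constant form of existing charmap files.

```python
import re

# <Uxxxx> (UCS-2, 4 hex digits) or <Uxxxxxxxx> (UCS-4, 8 hex digits).
_UCS = re.compile(r'^<U([0-9A-Fa-f]{4}|[0-9A-Fa-f]{8})>$')


def encode_ucs_value(value, codec):
    """Translate one <Uxxxx>/<Uxxxxxxxx> value into charmap-style \\xHH bytes."""
    m = _UCS.match(value)
    if m is None:
        raise ValueError(f"not a UCS value: {value!r}")
    # Raises UnicodeEncodeError when the target code set lacks the character.
    encoded = chr(int(m.group(1), 16)).encode(codec)
    return "".join(f"\\x{byte:02X}" for byte in encoded)


def translate_charmap_line(line, codec):
    """Rewrite '<symbol> <Uxxxx> comments' with the encoding for `codec`."""
    symbol, value, *comment = line.split(None, 2)
    return " ".join([symbol, encode_ucs_value(value, codec), *comment])


# Hypothetical usage: the same portable line compiled for two code sets.
portable = "<e-acute> <U00E9> LATIN SMALL LETTER E WITH ACUTE"
for_latin1 = translate_charmap_line(portable, "latin-1")
for_utf8 = translate_charmap_line(portable, "utf-8")
```

A character with no representation in the target code set surfaces here as an exception; how a conforming localedef would report that case is not stated in the text above.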
The US intends to flesh out this proposal for inclusion in the next distributed draft for IEEE ballot of P1003.2b. The proposal was not received by the US in time for distribution to SC22/WG15 in Draft 10. If you have any comments, the US would appreciate receiving them in time for discussion at our January IEEE PASC meetings.
The WG15 Plenary discussion on this was as follows:
4.9.1 Charid (US report back on [N462]) [N515]
Canada raised a query on why the US response to 9405-55 in N515 offered the changes it did, and what the rationale for them was. The US could offer no immediate explanation, and offered to get a more detailed response, to be distributed by email. Canada to consider whether the changes have the effect required. The US had brought a number of copies of Draft 10 of 1003.2b, currently being distributed through the SC22 secretariat, which they invited comments on from WG15 MBs, preferably direct to the IEEE group.
WG15 AI 9410-24 was created to require the US to provide Canada with the rationale.
From WG15 Twente, May 1995:
N555, the US report, included the following:
9410-24 United States: Distribute to the WG15 Email list the details on its proposal on CHARIDS, (see action item 9405-55) and US Response (SC22/WG15 N515)
Response: CLOSED The resulting changes to P1003.2b will appear in Draft 11 of that document. Draft 11 was being prepared at the 4/95 PASC Meeting and is already approved for distribution as CD/PDAM Registration and Ballot. This was mailed to cpwg-mail@revcan.ca and SC22WG15 mailing list on 4/27/95. [As (SC22WG15.498)]:
IEEE P1003.2 N269 April 26, 1995 SC22/WG15 US TAG N520
Topic: Response to SC22/WG15 Action Item A9410-24 From: Donald W. Cragun
The questions submitted by Canada with our responses are below:
1) a) Could the US present the precise format of the proposed new charmap file?
Draft 10.9 will be available from the US delegation at the Enschede meeting. Draft 11 will be distributed for concurrent registration and ballot soon.
b) Specifically, could the US explain the relationship of the new proposed field to the portion of each line that is now considered "comment" or explanatory material?
The proposal does not include a new field. It just allows two additional forms for specifying the <encoding> part of the existing forms.
The <comments> portion of the lines between CHARMAP and END CHARMAP are not changed.
c) Has the <comment_char> been used to delimit RHS comments (i.e. those comments that do not start at the beginning of the line)?
Empty lines and lines starting with the <comment_char> are comments. The <comments> field can contain any characters (within the context of a line in a text file). Comments are separated from the <encoding> by one or more <blank> characters. A <comment_char> could be used after the required <blank> as a convention to make the charmap files easier to read by humans, but is not required by the current standard or the proposed changes.
2) a) Could the US explain the need for the addition of a new parameter to the localedef utility?
The new option (-u code_set_name), specifies the name of a code set to be used as the target mapping of character symbols and collating element symbols whose encodings are defined in terms of ISO 10646 position constant values.
b) Would not a similar effect be achieved by manipulating the charmap with the standard text utilities and then using the existing localedef utility?
None of the other standard utilities specified in 9945-2 (even the iconv utility in P1003.2b) is designed to translate from ISO 10646 16- or 32-bit values encoded as strings of the form <Uxxxx> or <Uxxxxxxxx> to octal, decimal, or hexadecimal encodings of the forms expected in charmap files by localedef. Scripts could be created using awk or sed to perform these translations manually, but the P1003.2 working group believes that implementations should be able to translate from 10646 to codesets supported by the implementation without manual assistance.
3) a) Could the US explain what is meant by "... have localedef predefine mappings for the standard symbolic names for characters in the character set defined by 9945-2 Section 2.4"? Canada is aware that 9945-2 specifies standard symbolic names for the characters referenced in Section 2.4. Canada's question relates to the "... localedef predefine mappings ...".
Since the 10646 encodings for all of the characters in Table 2-4 in section 2.4 of 9945-2 are always the same, they need not be specified in charmap files that are encoded using the new formats; localedef will be required to supply the encoding information using the <symbolic-name> values specified in Table 2-4 implicitly.
N558, the Danish report, responded to 9410-35 as follows:
9410-35 Member bodies: Look at the technical aspects of SC22/WG15 N444 and the applicable portion of SC22/WG15 N515, [the US AI report] in time for the May 1995 SC22/WG15 meeting.
DS: ...
1. specify a repertoire format, as earlier decided in WG15 and WG20
3. specify repertoiremap files for locale and charmap with localedef, this is a further enhancement of the US recommendation 2 of N515 9405-55 response,
4. There is no need for the proposal 1. in the US contribution, if 1. and 3. above is specified. This is also in line with current X/Open work. The <Uxxxxx> information can still be provided, as a form of comments.
From WG15 RIN Orlando, October 1995:
Canada expects that 1003.2b Draft 11 will resolve the Issue. Denmark expects that their concerns will be addressed in Draft 12, following their discussions with the IEEE group.
From WG15 Copenhagen, May 1996:
| Closed. WG15 accepts the proposal; the US development group is | working on it in .2b draft 12.
regular, expression, small language, NUL, special character
Description:
Internationalisation of regular expressions.
Originator:
DK
Alternatives:
Documents:
N170r WG15 RIN N036: Minutes & resolutions, Rotterdam, May 1991
N245 Summary of voting & comments on 2nd CD 9945-2: Shell & Utilities
N281 Disposition of comments on CD 9945-2.2
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
None.
Status:
Closed. Insufficient expertise currently exists to solve the problem within the IEEE, RIN, and possibly the known universe.
History:
From WG15 RIN Rotterdam, May 1991:
3.2.1.3. Regular expressions
There was a serious error in the definition of longest leftmost match for regular expressions in the last draft of 1003.2. This will be fixed.
The issue of when '$' and '^' are special in regular expressions is contentious. Some want 'ab$cd' to be allowed ('$' not special); others want it to be illegal (as it is in extended regular expressions). Traditionalists counter by saying that this would break too many existing scripts, and will probably win the day. RIN is happy with this situation.
The result of the application of a regexp to a sequence of characters containing an embedded null is currently permitted; there has been an objection to this, as current practice in the C language and utilities written therein is that null is special. This suggests that the issue is language-dependent: RIN is in favour of putting language in the LIS which does not require that null is special, but allowing bindings to make it (or perhaps some other character) special if they wish.
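The language dependence noted above can be illustrated with a short Python sketch. It assumes nothing about POSIX itself: Python's re module happens to treat NUL as ordinary data, and the C-style view is simulated here by cutting the data at the first NUL.

```python
import re

data = b"ab\x00cd"          # a byte sequence with an embedded NUL

# Binding where NUL is ordinary data: the whole sequence is searched,
# and '.' is free to match the NUL itself.
whole = re.search(rb"c", data)
dot = re.search(rb"b.c", data)

# Binding where NUL terminates the string, as in C: only the prefix
# before the first NUL is visible to the matcher.
c_view = data.split(b"\x00", 1)[0]
prefix_only = re.search(rb"c", c_view)
```

The same expression and the same data therefore give different answers depending on whether the binding makes NUL special, which is the situation the proposed LIS wording is meant to allow.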
tr no longer knows about multi-character collating sequences, or, indeed, anything much relating to regular expressions.
From WG15 Hamilton, May 1992:
N245 included a number of Danish MB comments on the 2nd CD of 9945-2, including:
13. We are still not satisfied with the current regular expression syntax, but we have no better solution at present.
N281, the Disposition of Comments, responded:
No action proposed.
From 9945-2:1993 Annex H.1:
(2) The shell, awk, other small languages, and regular expressions should be supported by national variants of ISO/IEC 646 {1}. A proposal from Denmark is expected in this area.
This text has been removed from P1003.2b Draft 11, May 1995.
From WG15 RIN Orlando, October 1995:
KS's discussions at IEEE last week indicate that a new PAR will be forthcoming to address internationalisation issues in regular expressions. It is accepted by the development body that there are problems with the existing specification. BN said that the US TAG looked at this in some detail. It is the area which receives most interpretation requests. The .2 group does not currently have sufficient expertise to handle the existing problems, together with known internationalisation problems. It is not anticipated that these problems can be solved in the current work on .2b
9510-07 Lead Rapporteur to investigate the availability of expertise to apply to the problem of regular expressions within 1003.2x and to report back to WG15 at its October 1996 meeting.
From WG15 Copenhagen, May 1996:
| Closed. WG15 believes this request cannot be accommodated.
collation weight, LC_COLLATE, locale, natural language
Description:
The minimum number of weights for the LC_COLLATE feature is too small for the requirements of certain National Bodies. IS 9945-2 specifies 2, Canada requires at least 7, other NBs require 4 or more.
Originator:
Ca
Alternatives:
Documents:
N388 Minutes of WG15 meeting, Heidelberg, May 1993
N577 WG15 Minutes, Enschede, 8-10 May 1995
RIN N154 RIN Minutes, Orlando, 26/27 October
Solution:
The Canadian requirement for extra collation weights was accepted by the development body. Draft 12 of 1003.2b is anticipated to include the appropriate changes.
Status:
Closed.
History:
From WG15 Heidelberg, May 1993:
RESOLUTION 93-230 Collation Weights
Whereas ISO/IEC DIS 9945-2, Utility Limit Minimum Value, Table 2-17, specifies that the maximum number of weights that can be assigned to an entry of the LC_COLLATE order keyword in the locale definition file is 2, and
Whereas the value of 2 is insufficient to process natural language collation sequences,
Therefore SC22/WG15 instructs the Project Editor to notify its development body that the collation weight is dependent on the language of the country and that Canada requires a minimum weight of 7.
From WG15 Twente, May 1995:
9410-03 Project Editor: Notify the development body of collation weight requirements (resolution 93-230, open action item 9305-60, 9310-23, 9405-12) (Closed: has become 9505-02)
9505-02 Canada - Provide collation weight question to the US again.
From WG15 RIN Orlando, October 1995:
KS solved this one at the IEEE meeting!! The IEEE accepted the Canadian proposal for 7 collation weights.
From WG15 Copenhagen, May 1996:
| Closed. 1003.2b draft 12 will support 7 levels.
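Why the number of weights matters can be shown with a toy example. The Python sketch below is an assumption-laden illustration, not the Canadian ordering: it derives three made-up levels (base letter, accent, case) from Unicode decomposition and compares the result of sorting with two levels against three.

```python
import unicodedata


def collation_key(word, levels):
    """Build a sort key from up to three illustrative weight levels."""
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1: base letters, ignoring accents and case.
    base = [c.lower() for c in decomposed if not unicodedata.combining(c)]
    # Level 2: the accents (combining marks) only.
    accents = [c for c in decomposed if unicodedata.combining(c)]
    # Level 3: case of each base letter (lower sorts before upper here).
    case = [c.isupper() for c in decomposed if not unicodedata.combining(c)]
    return tuple(tuple(level) for level in (base, accents, case)[:levels])


words = ["Cote", "c\u00f4te", "cote"]
two_levels = sorted(words, key=lambda w: collation_key(w, 2))
three_levels = sorted(words, key=lambda w: collation_key(w, 3))
```

With two levels "Cote" and "cote" receive identical keys and keep their input order; only the third level separates them. Orderings that distinguish further properties need correspondingly more weights per entry.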
locale, char, character, character map, LC_CTYPE, wctrans(), towctrans(), charconv, charclass
Description:
Japan proposes that LC_CTYPE locale definition should be extended to allow locale-specific character mappings to be specified. This extension is necessary to implement wctrans() and towctrans() functions in ISO C amendment on a POSIX conforming system.
Originator:
J
Alternatives:
Documents:
N602 RIN N158: Japanese Action Item report to WG15
N657 Data specification format for transliteration and transcription
N664 Proposal for culturally dependent fallback: Response
Solution:
Status:
Open.
History:
From WG15 RIN Orlando, October 1995:
N602 proposed the following extension to 1003.2b:
[Note: The page numbers refer to the ones of P1003.2/D10.]
Sect 2.5 (Locale) PROPOSAL. Page 8-9,12:
Problem: The LC_CTYPE (2.5.2.1) locale definition should be enhanced to allow user-specified additional character mapping, similar in the concept to the user-specified additional character class. In the Amendment of ISO C standard, extended character mapping functions (wctrans/towctrans) are specified. The following proposed extension will serve for the machinery to define locale specific character mappings used by the functions. Without having this extension, POSIX conforming systems need to have their own extensions to implement ISO C Amendment specifications.
Proposal:[LC_CTYPE extension for specifying character mapping]
The proposed extension for character mapping is similar to the extension of character class, which is already specified in .2b draft. New keyword 'charconv' is introduced to define locale-specific character mappings instead of 'charclass' keyword for character class. The way of defining character mapping is not extended with this proposal. The same specification for toupper/tolower mapping can be used for locale-specific character mappings.
EXAMPLE:
LC_CTYPE
# define the names of locale-specific character mappings
charconv tojkata;tojhira
# tojkata: hiragana => katakana mapping
tojkata (<j0401>,<j0501>);(<j0402>,<j0502>);\
.....definition.....
# tojhira: katakana => hiragana mapping
tojhira (<j0501>,<j0401>);(<j0502>,<j0402>);\
.....definition.....
END LC_CTYPE
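How a named mapping from the example above might be consumed can be sketched as follows. This Python fragment is purely illustrative: parse_mapping and apply_mapping are hypothetical names, only the two pairs shown in N602 are used, and leaving unmapped symbols unchanged mirrors the way toupper()/tolower() behave.

```python
import re

# One '(<from>,<to>)' pair of the mapping definition.
_PAIR = re.compile(r'\(\s*(<[^>]+>)\s*,\s*(<[^>]+>)\s*\)')


def parse_mapping(definition):
    """Parse '(<from>,<to>);(<from>,<to>)' into a {from: to} dictionary."""
    return dict(_PAIR.findall(definition))


def apply_mapping(symbols, mapping):
    """Map each symbol, leaving symbols without an entry unchanged."""
    return [mapping.get(s, s) for s in symbols]


# The two pairs quoted in the example; the elided remainder is not guessed.
tojkata = parse_mapping("(<j0401>,<j0501>);(<j0402>,<j0502>)")
tojhira = parse_mapping("(<j0501>,<j0401>);(<j0502>,<j0402>)")

converted = apply_mapping(["<j0401>", "<a>", "<j0402>"], tojkata)
restored = apply_mapping(converted, tojhira)
```

Each mapping here is strictly character-to-character; a character-to-string mapping would need a different data structure.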
[Proposed extension to .2b text]
[Page 8] => 2.5.2.1 LC_CTYPE. Add the following keyword items after the item labeled tolower:
charconv Define one or more locale-specific character mapping names as strings separated by semicolons. Each named character mapping can then be defined subsequently in the LC_CTYPE definition. A character mapping name shall consist of at least one and at most fourteen bytes of alphanumeric characters from the portable filename character set. The first character of a character mapping name cannot be a digit. The name cannot match any of the LC_CTYPE keywords defined in this standard.
charconv-name Define the named locale-specific character mapping. In the POSIX Locale, the locale-specific named character mapping need not exist.
If a mapping name is defined by a charconv keyword, but no character mappings are subsequently assigned to it, this is not an error; it shall represent a mapping without any character pairs belonging to it.
[Page 12] => 2.5.3.1 Locale Lexical Conventions. Add the following token description:
CHARCONV A string of alphanumeric characters from the portable character set, the first of which shall not be a digit, consisting of at least one and at most fourteen bytes, and optionally surrounded by double-quotes.
[Page 12] => 2.5.3.2 Locale Grammar. Modify the ctype_keyword and charconv_keyword descriptions as follows:
ctype_keyword     : charclass_keyword charclass_list EOL
                  | charwidth_keyword charclass_list EOL
                  | defwidth_keyword defwidth_value EOL
                  | charconv_keyword charconv_list EOL
                  | 'charclass' charclass_namelist EOL
                  | 'charconv' charconv_namelist EOL
                  ;
charconv_namelist : charconv_namelist ';' CHARCONV
                  | CHARCONV
                  ;
charconv_keyword  : 'toupper'
                  | 'tolower'
                  | CHARCONV
                  ;
From WG15 Copenhagen, May 1996:
| N657 and N664 refer. N657 is an expert contribution from | Denmark, N664 is not an official US response - it comes direct | from the .2b group. | | The US development body asked for clarification of the Japanese | proposal: does it require just character-to-character translation, | or character-to-string, which is a much larger problem. | | WG15 actioned KS to provide details of existing implementations | of the proposal in N657 by 15-June. | | WG15 further actioned KS to respond to the queries raised in N664 | by 1-July for consideration by the IEEE 1003.2b DB.
wchar_t, character, byte, internationalisation, localisation
Description:
Japan expressed a concern that POSIX standards blurred the terms byte and character.
Originator:
J, DK
Alternatives:
None.
Documents:
N372 I18N Guidelines
N388 Minutes of WG15 meeting, Heidelberg, May 1993
N434 WG15 minutes and resolutions, October 1993
N441 Character concepts in Posix standards
N482 US TAG N472: US Action Item Report
N499 WG15 minutes and resolutions, May 1994
N515 US Action Item Report.
N532 WG15 minutes and resolutions, Oct 1994
Solution:
RIN was actioned to produce guidelines to assist the Development Body and Project Editor to write interface definitions which clearly differentiated between the two, allowing better support of international character sets.
Guidelines were offered in N372, and further comments in N441.
Status:
Closed. The standards take sufficient care to distinguish 'character' and 'byte'.
History:
From WG15 Heidelberg, May 1993:
The Plenary minutes, N388, contained the following Action Item and response:
Action 9210-32: RIN Lead Rapporteur: Investigate the production of guidelines for standards developers for the usage of the terms character and byte in the definition of interfaces, with especial attention to the internationalisation issues arising from character-based interfaces.
CLOSED: see [N372]
No specific action was assigned to N372 in WG15's minutes. N372, authored by Yasushi Nakahara, included the following:
For your good understanding of this action item, some background information may be required. If I remember correctly, this action was derived from my comments at the plenary session. So, I'm adding some explanations. See an excerpt from the Reading minutes and my comments below.
> 2.8 Rapporteur Group report/status
> 2.8.1 Security
>
> ...
>
> Japan further identified problems in the usage of the terms
> "character" and "byte" in the P1003.6 document. RIN should be
> requested to provide guidance to standards developers in order
> to avoid such problems in the future. The specification of
> character-oriented interfaces require careful consideration of
> internationalisation issues that do not affect interfaces
> specified in terms of bytes.
The last paragraph was an actual (partial) log of such discussion, although at that time in conjunction with Jon's comment on I18N issues I added that not only P1003.6, but also almost all the P1003.x documents may have I18N issues wherever "character" interfaces are being specified. More specifically, I explained that the recent P1003.4 and P1003.7(.x) drafts have the similar I18N issues to what Japanese POSIX WG has been actively commenting on POSIX.1 and POSIX.2 specifications since 1989 in terms of I18N/L10N features and "character vs. byte" issues, and that Japan has to repeatedly send the similar comments again and again on each POSIX.n draft, which may be neither effective nor productive. So, I suggested, rather than such patch works, that concerned National Bodies and/or RIN should develop certain designing/reviewing guidelines (or appropriate template) for I18N/L10N specifications, in order to make each ballot/disposition process of POSIX.n draft more productive and consistent (in terms of I18N/L10N specifications).
Actually, the Japanese ballot comments on CD 9945-2 pointed out such cross functional aspects of I18N/L10N issues and introduced some proposed designing/reviewing guidelines for I18N/L10N specifications.
With these things in mind, I'm enclosing draft proposed reviewing/designing guidelines for I18N/L10N specifications. _______________________________________________________________
Draft Proposed I18N/L10N Guidelines for (POSIX) Standard Interface Design and Review
1. Take into account of the following aspects:
- Character counts != byte counts
- Character counts != display width
- Byte counts != display width
- Only the "wchar_t" type in C language (known as a "wide character") corresponds to the concept of a character.
2. Do not use a term "character" neither in the meaning of "byte" nor in the meaning of "display width" or "column position".
3. Determine which interfaces are character-oriented (arguments or operands, input data, output data, I/O format and etc.)
If the interface in question is byte-oriented, carefully use a term "byte" or an appropriate wording so that interpretation of the specification should not be mixed up with the concept (definition) of a character. And, skip the following guidelines (which are fully character-oriented).
4. Carefully study the features of character-oriented interfaces and give appropriate specifications (or review the proposed specifications in reviewing process) in terms of the following aspects:
- Character boundary recognition [This shall be generic "character" based.]
- Limit check & truncation in various units, in particular, make clear what units (byte, character, column, width, and etc.) shall be applied.
- Character/string width recognition [This shall be generic "character" based.]
- Character/string parsing & manipulation [This shall be generic "character" based.]
Also, locale dependency such as LC_CTYPE and LC_COLLATE shall be well defined.
- Language dependency of text data including message data [Make clear what natural language dependencies are (explicitly/implicitly) included in the target text.]
- Culture dependency of representations [Make clear what (other) locale dependencies are covered by the specification via suitable LC_XXX (such as LC_TIME, LC_NUMERIC, LC_MONETARY, LC_MESSAGE ) and LANG variables.]
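The three inequalities in guideline 1 above can be made concrete with a small Python sketch. It rests on stated assumptions: Python counts code points rather than user-perceived characters, the encodings are chosen only as examples, and the column rule (two columns for East Asian wide and fullwidth characters, one otherwise) is a simplification.

```python
import unicodedata


def measures(text, encoding="utf-8"):
    """Return (character count, byte count, display columns) for `text`."""
    chars = len(text)
    octets = len(text.encode(encoding))
    # Assumption: wide ("W") and fullwidth ("F") characters occupy two columns,
    # everything else one. Real terminals have further special cases.
    columns = sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
                  for c in text)
    return chars, octets, columns


ascii_text = measures("abc")                        # all three agree
hiragana_utf8 = measures("\u3042\u3044")            # two hiragana letters
hiragana_eucjp = measures("\u3042\u3044", "euc_jp")
halfwidth_kana = measures("\uff71")                 # halfwidth katakana A
```

For plain ASCII the three figures coincide, which is how the terms came to be used interchangeably; for the Japanese samples each figure can differ from the other two, so a limit or truncation rule has to say which unit it means.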
No specific action was assigned for N372 in WG15's minutes.
From WG15 Tokyo, May 1994:
N441, submitted by Keld Simonsen, contained the following:
Action: for US NB consideration on WG15 action item 9305-24
A comment on Nakahara-san's paper, his statement in the draft guideline, clause 1, that only "wchar_t" type in C corresponds to the concept of a character.
I would say that it is the multibyte character type of POSIX which corresponds to the concept of a character.
The "wchar_t" type of C gives restrictions on the representation of characters, as they all must be represented by the same number of bits, and there are restrictions on the values which must be harmonized with the "char" type, this is not the case with the POSIX multibyte characters. C multibyte characters cannot have a null byte in them, but allowing null bytes is needed for a general representation of a character. I believe that the POSIX multibyte character concept does not have this limitation. If not, the limitation should be removed.
As POSIX standards currently use the multibyte character as its "character" concept, there is no need to change this. But there is a great need to use the character terms consistently across POSIX standards.
N499, the WG15 Plenary minutes, perpetuated an action on the US as follows:
9310-10 United States: 1) Consider the LIS and language-binding interface changes necessary to handle character-oriented features as a symbol and not storage patterns for a future revision of 9945-1.
2) Inform SC22/WG15 of any plans for supporting such features in future revisions of all parts of the 9945 standard (resolution 226, open action items 9210-71, 9305-24)
Open. Status in N482. Pending response from PASC. New action item 9405-03
The Danish comments in N441 were dealt with under Agenda Item 6.4, as:
6.4) Character concept. Reference: WG15 N441. No action
From WG15 Vancouver, October 1994:
N532, the WG15 Plenary minutes, noted Action 9405-03 on the US as 'Closed', with no comment. N515, the US Action Item Report, contained the following:
9405-03 United States: ... Status: CLOSED...Character interfaces defined in ISO/IEC 9945-1 use containers for representations of character strings, with size in bytes, since this is existing behaviour. These interfaces support multi-byte character encodings (with some restrictions), as defined in the C standard.
Support for abstract characters is being considered in the 9945-1 LIS. There are no plans to add support for abstract characters in the C binding for 9945-1 or in the C-Language Bindings Option (Annex B) in 9945-2.
From WG15 Copenhagen, May 1996:
| The US DB is not aware of any blurring of the term byte and | character in the current standards. | | Japan and Denmark believe the current drafts are clean. The US | DB has done the right thing and adopted wherever possible the | correct usage. The Issue is closed.
| collation, element, regular, expression, pattern, LC_COLLATE, localedef
Description:
| The user-defined ordering of collation elements in an | LC_COLLATE table is inadequately specified. Different | but equally valid tables can produce differing results | when used as the basis of regular expressions, pattern | matching, etc.
Originator:
DK
Alternatives:
None.
Documents:
N605 RIN N160: DS Additional comments on P1003.2b/D11
Solution:
Status:
Open.
History:
From WG15 RIN Orlando, October 1995:
@ 2.8 o 5
line 379: The range expression should not be dependent on the collation element order, but rather the result of the comparison using the relevant collation. Using the collating element order is not proper, and confusing to users that only have expectations as defined by the collation rules.
From WG15 Copenhagen, May 1996:
| 1003.2 is ambiguous on this point and 1003.2b will not be able | to fix the problem. There are two fairly simple solutions, | but they are mutually exclusive, and the proponents of each | solution do not readily admit to the possibility that the | alternative solution may be valid.
| This issue remains open.
Additional historical notes:
| This request was forwarded to IEEE from X/Open end 1993 for | interpretation.
| (Section 2.5.2.2, LC_COLLATE, | "User-defined ordering of collating elements. Each collating | element shall be assigned a collation value defining its order | in the character (or basic) collation sequence. This ordering | is used by regular expressions and pattern matching and, unless | collation weights are explicitly specified, also as the collation | weight to be used in sorting."
| Given this passage, assume there are two similar LC_COLLATE | fragments. The fragments include lowercase letters only to | simplify the examples. Here is the first fragment:
| <a>        <a>;<a>;<a>
| <a-grave>  <a>;<a-grave>;<a-grave>
| <a-acute>  <a>;<a-acute>;<a-acute>
| <b>        <b>;<b>;<b>
| <c>        <c>;<c>;<c>
| <d>        <d>;<d>;<d>
| . . .
| <z>        <z>;<z>;<z>
| . . .
| Here is the second fragment:
| <a>        <a>;<a>;<a>
| <b>        <b>;<b>;<b>
| <c>        <c>;<c>;<c>
| <d>        <d>;<d>;<d>
| . . .
| <z>        <z>;<z>;<z>
| <a-grave>  <a>;<a-grave>;<a-grave>
| <a-acute>  <a>;<a-acute>;<a-acute>
| . . .
| Suppose a user wanted to find all words that begin with a letter | in the range a-c. An XoJIG meeting agreed that a locale | built using the first fragment returns words that begin with <a>, | <a-grave>, <a-acute>, <b>, and <c>. However, there were varying | opinions about whether the second fragment would return the same | results, or would exclude <a-grave> and <a-acute>. So the | question is this:
| Should an RE run against a locale built using the second fragment | include the accented a's in the range because they are defined as | being in the same equivalence class as <a>, or should it exclude | the accented a's because they are listed outside the range of a-c?
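The two readings of the question can be set side by side in a short Python sketch. It is an illustration under assumptions, not either party's implementation: each fragment is reduced to (collating element, first weight) pairs in table order, the elided rows are omitted, and the function names are hypothetical.

```python
# Each entry is (collating element, first weight), in table order.
FIRST = [("a", "a"), ("a-grave", "a"), ("a-acute", "a"),
         ("b", "b"), ("c", "c"), ("d", "d"), ("z", "z")]
SECOND = [("a", "a"), ("b", "b"), ("c", "c"), ("d", "d"), ("z", "z"),
          ("a-grave", "a"), ("a-acute", "a")]


def range_by_table_order(table, low, high):
    """Elements whose position in the table lies between low and high."""
    names = [name for name, _ in table]
    return set(names[names.index(low):names.index(high) + 1])


def range_by_weight(table, low, high):
    """Elements whose first weight compares between those of low and high."""
    rank = {name: i for i, (name, _) in enumerate(table)}
    primary = {name: rank[weight] for name, weight in table}
    return {n for n in primary if primary[low] <= primary[n] <= primary[high]}
```

For the first fragment both functions return the same five elements for the range a-c. For the second fragment range_by_table_order returns only a, b and c, while range_by_weight still includes the accented letters, which is exactly the disagreement described above.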
| A preliminary response was obtained from IEEE in Feb 1994:
| The standard is unclear on this issue, and as such no conformance | distinction can be made between alternative implementations based | on this. This is being referred to the Sponsors of the standard | for clarifying wording in the next amendment.
| This response will be incorporated in an IEEE interpretations | publication, and will be also made available on-line on the IEEE | SPAsystem.
| IEEE Interpretation for 1003.2-1992 | ----------------------------------- | The standard is ambiguous in this area, since it is not clear | what the phrase "collation sequence order" means or is. The two | possibilities are "the order in locale file", or "the order | determined by the weights in the locale file". The standard | allows either behavior. Concern over the wording of this area | has been forwarded to the Sponsors of the standard.
| Rationale for Interpretation: | ----------------------------- | None. | ________________________________________________________________ | (c) 1994 The Institute of Electrical and Electronic Engineers, Inc. | Not to be published without prior written permission of the IEEE.
| Andrew Josey | PASC Vice-Chair Interpretations
| ------
| DS finds it unnecessarily complex to introduce two levels for | comparisons, one that is related to the comparison functions, | and then one that is related to the order the weights appear in | a localedef definition file. The latter is normally not part of | the definition of the collation order, but becomes significant | if this interpretation is favoured. The first interpretation | should be favoured, as the algorithm is already known by the user, | and gives the less unexpected result.
Description:
Originator:
Alternatives:
Documents:
Solution:
Status:
Open/Closed.
History: