JTC1/SC22/WG15 N640
WG15 TAG N587

Subject: AI 9510-14, Report on POSIX.2b Issues

Members of the US TAG and the IEEE POSIX.2 working group met during the
January and April IEEE POSIX working group meetings to review the
following documents and ensure there was an appropriate response to all
issues raised in the WG15 community with respect to the current POSIX.2b
work: WG15 N583 N586 N598 N602 N604 N605 N606. Below, each document is
identified (for reference's sake) and the response captured.

N583 - Use of IS 10646 in POSIX Interfaces

A long discussion of the technical details of IS 10646 with respect to
POSIX, prepared by RIN, that essentially claims in its closing
recommendations that POSIX.2 has done the right thing. A specific
recommendation with respect to the pax utility was incorporated into D11
of POSIX.2b.

N586 - Overview of DK Input to .2b

A summary document that pointed to several other documents (WG15 N283,
N420, N566, N416, N441, N558, N561).

    N283, N420
        Deal with issues surrounding the use of characters beyond the
        portable character set in identifiers in the little languages.
        Don Cragun (Chair POSIX.2) has prepared a detailed report
        commenting on the commercial and technical infeasibility of this
        request. Please see WG15 TAG N573 (WG15 N_____).

    N566
        Discusses the "replace-after" proposal. This can be accomplished
        by an awk script, which was prepared by POSIX.2 working group
        members, is published, and will further be added to the
        rationale for POSIX.2b. Adding this to the specification would
        reduce ballot consensus. It is not being added.

        An issue is raised over the collation substitution facility, and
        this is being worked on through ballot resolution.

    N416
        Discusses support for ISO 646 Invariant characters.

        (i) With respect to regular expression support, this was
        personally discussed with the Head of Delegation for Denmark at
        the October 1994 PASC meeting, and it was demonstrated why this
        destroys an already ambiguous RE grammar.
        It was agreed by all that this should be put off to a future
        amendment of POSIX.2 where a group of experts in
        internationalized regular expressions could address this
        properly.

        (ii) With respect to meta-characters in other utilities, the
        proposals handle things differently for different utilities, and
        introduce grammar inconsistencies that can confuse readers of
        the document, and certainly will reduce ballot consensus.

        (iii) As this proposal is designed to address what appears to be
        a singular historical problem, and all other issues where
        character set does impact POSIX.2 have been addressed in the
        forward-looking ISO 10646 direction, and the proposal is not
        grounded in historical practice, the proposal is not being
        adopted into POSIX.2b.

    N441
        Discusses character concepts in POSIX standards. POSIX.2 is in
        complete agreement with the discussion and believes that
        POSIX.2b is consistent with the discussion.

    N558
        Discusses repertoiremap issues and localedef collating weight
        comments. POSIX.2 believes that POSIX.2b Draft 12 addresses all
        the concerns raised by Denmark according to Danish proposals.

    N561
        Proposes new utilities for inclusion in POSIX.2b. See responses
        in N614, and WG15TAG N571 (WG15 N______).

N598 UK Action Item Report (October 1995)

The UK identifies where the place-holders in Draft 10 of POSIX.2b were
removed in Draft 11. These include:

    -- Characters beyond the portable character set in identifiers. See
       above discussion.
    -- Collation substitute facility. POSIX.2 is investigating. Same as
       above.
    -- replace_after. See above discussion.
    -- A Japanese Locale concern with respect to the use of Kana
       characters in identifiers, where the proposal in WG15 N416 was
       rejected in favour of WG15 N420, i.e. it is completed.
    -- The allowance of user-specified names for collation weights. See
       discussion below in N602.

N601 France Action Item Report (October 1995)

No additional comments on POSIX.2b have been received from the French
member body.

N602 Japan Action Item Report (October 1995)

Identifies two issues, one which Japan has withdrawn, and the second
which is a proposal for user-specified names for collation weights.
Adding this to the grammar will reduce ballot consensus. However, the
functionality can be accomplished by an awk script; POSIX.2 has
demonstrated the awk script, and it will be included in the rationale in
the next revision of the document. (Additional discussion and the awk
script are attached to this report below.)

N604 Danish National Body Report (October 1995)

Does not identify concerns but points to N586 -- addressed above.

N605 Additional comments on POSIX.2b from Denmark

This appears to be in response to a request from POSIX.2 to the Danish
HoD at the October 1995 PASC meeting to turn in a summary of comments to
Draft 11 of POSIX.2b. The issues identified and addressed include:

    2.2 c 1
        Editorial issue is accepted.

    2.4 c 2
        Will be addressed with 4.35 o 6.

    2.4 o 3
        Requires clarification: For a given code set, POSIX.2 does not
        understand how different locales would assign different widths
        to a given codeset point. Therefore, POSIX.2 believes it would
        be appropriate to specify the widths in the charmap file once
        for each codeset, rather than repeatedly in each locale that
        refers to the codeset.

    2.5 o 4
        If this objection were accepted, the POSIX locale defined in
        POSIX.1 and POSIX.2 would no longer be a superset of, or
        compatible with, the C locale defined by the ISO C standard.
        Since that is the basis for all locale work in POSIX.1 and
        POSIX.2, the POSIX.2 working group will not be able to make the
        suggested change.

    2.8 o 5
        This is an issue concerning range expressions within regular
        expressions. This is an open issue under discussion and will not
        be closed during the POSIX.2b ballot period. See above
        discussion under N416.

    4.35 o 6
        POSIX.2 understands the issue of repertoire maps as opposed to
        the alternative presented in Draft 11.
        The committee is discussing this issue with members of the
        balloting group and other i18n experts, and a decision will be
        made and presented in Draft 12. The committee would like to
        thank the Danish HoD for his explanation of the repertoire map
        issue at the St. Pete meeting.

    4.48 o 7
        This issue is editorial and will be fixed in Draft 12.

    D.1 c 8
        This issue is editorial and will be fixed in Draft 12.

N606 RIN report to WG15 October 1995

Does not identify concerns but points to N583 -- addressed above.


POSIX.2 Working Group Discussion of Japanese Collation Weights Proposal
-----------------------------------------------------------------------

The IEEE P1003.2 committee considered the Japanese proposal for an
extension to the LC_COLLATE section of locale definition files to
support user specified collation weight names. We believe that the
attached awk script can be used to provide the same capabilities on
implementations that currently support the POSIX2_LOCALEDEF option of
ISO/IEC 9945-2. This means that users needing these capabilities will
not have to wait until the amendment to POSIX.2 becomes an approved
standard and implementations conforming to the amended standard are
developed. We believe that IEEE P1003.2b will also go through balloting
with fewer objections if this new feature is not added to the standard.
(Some members of the IEEE balloting group have expressed the concern
that a lot of vendors would have to make significant changes to
implementations for this when an alternative, such as the awk script
provided here, performs the same functions with facilities already
provided by the standard.)

This awk script would be used to generate as many locale definition
files as are needed for the various LC_COLLATE variants. Then the
current localedef utility could be used to generate locales with the
different collation requirements.
New functions would not be needed to switch collation orders because the
LC_COLLATE environment variable and the POSIX.1 setlocale() function
would be all that would be needed to choose between the various
collation rules. (Existing applications would not have to be changed to
take advantage of the alternatives.)

To use the awk script below, the locale definition file would need to
contain a comment line of the form "%cweight-names", where the %c is the
current comment character in use in the locale definition file, followed
by one or more weight names of the form "%s%c", where the %s fields
correspond to the "name=" entries that were added to the "order_start"
keyword in the Japanese proposal, and the %c is a semicolon character
after all weight names except the last, and a newline character after
the last weight name.

The only loss of generality is that the Japanese proposal allows quoting
of characters in weight names. The attached awk script doesn't handle
quoting and, therefore, doesn't allow semicolon characters in weight
names. Blank characters are allowed in weight names, except as the first
character in the first weight name.
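
The comment-line convention described above can be illustrated with a minimal sketch. It is not part of the committee's script: the function name list_weight_names is hypothetical, "#" is assumed as the default comment character, and the parsing simply mirrors the rules stated above (no quoting, semicolons always separate names).

```shell
# Hypothetical helper: print "position name" for each weight name declared
# in a weight-names comment line read from standard input.
list_weight_names() {
    awk '
    BEGIN { comment_char = "#" }            # assumed default comment character
    /^comment_char/ { comment_char = $2; next }
    index($0, comment_char) == 1 {
        line = substr($0, length(comment_char) + 1)
        if (index(line, "weight-names") != 1)
            next                            # an ordinary comment line
        sub(/^[^ \t]*[ \t]*/, "", line)     # drop the keyword and blanks
        n = split(line, names, ";")         # no quoting: ";" always separates
        for (i = 1; i <= n; i++)
            printf("%d %s\n", i, names[i])
    }'
}

printf '%s\n' '#weight-names stroke;onyomi;radical;JISnumber' | list_weight_names
```

Run as shown, the sketch should list stroke, onyomi, radical, and JISnumber with positions 1 through 4; those positions are what a reordering step would use to pick weight fields from each collation line.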

An example in the Japanese proposal was:

*** Start Example ***
order_start forward,name="kunyomi";forward,name="radical"
; ; : :
order_end
*** End Example ***

Using the awk script below, when # is the comment character, this
example would be rewritten as:

*** Start Example ***
#weight-names kunyomi;radical
order_start forward;forward
; ; : :
order_end
*** End Example ***

Another example given in the Japanese proposal was:

*** Start Example ***
Possible LC_COLLATE definition
==============================
# Stroke
collating-symbol <3stroke>
collating-symbol <4stroke>
collating-symbol <6stroke>
collating-symbol <7stroke>
collating-symbol <10stroke>
# Onyomi
collating-symbol
collating-symbol
collating-symbol
collating-symbol
# Radical
collating-symbol
collating-symbol
collating-symbol
order_start forward,name="stroke";forward,name="onyomi";\
    forward,name="radical";forward,name="JISnumber"
<10stroke>;;;
<6stroke>;;;
<7stroke>;;;
<4stroke>;;;
<3stroke>;;;

Changing the order by assigning values to LC_COLLATE
====================================================
LC_COLLATE=ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber

Behavior of collation functions
===============================
Output from weights=stroke,onyomi,radical,JISnumber (default)
    < < < <
Output from weights=radical,onyomi,stroke,JISnumber
    < < < <
*** End Example ***

Again, the only change needed to the locale definition file to make this
example work is to change the order_start keyword lines to:

    #weight-names stroke;onyomi;radical;JISnumber
    order_start forward;forward;forward;forward

Assuming the locale definition source file in the example above was
named ja_JP.eucJP.ls, the default locale would be created just as it has
been in the past with a command like:

    localedef -i ja_JP.eucJP.ls ja_JP.eucJP

and the locale for the stroke,onyomi,radical,JISnumber variant would be
created with the commands (each weight name is passed to the awk script
as a separate argument):

    awk -f collation_weights.awk stroke onyomi radical JISnumber \
        < ja_JP.eucJP.ls > /tmp/localetmp.ls
    localedef -i \
        /tmp/localetmp.ls \
        ja_JP.eucJP@weights=stroke,onyomi,radical,JISnumber
    rm /tmp/localetmp.ls

The awk script needed to do this is:

*** Start "collation_weights.awk" ***
# Initialize local variables and process command line arguments.
BEGIN {
    exit_code = 0
    # Set default comment character.
    comment_char = "#"
    # When in_order != 0, we are processing lines between order_start
    # and order_end directives.
    in_order = 0
    # weight_names_cnt is the number of weight names found in a
    # weight-names comment line.
    weight_names_cnt = 0
    # Verify that at least one weight name is specified on the command
    # line.
    if (ARGC < 2) {
        printf("Usage: %s \\\n\t%s \\\n\t%s\n",
            "awk -f collation_weights.awk weight-name...",
            "< localedef_file_including_weight-names_comment",
            "> modified_localedef_file") >> "/dev/tty"
        # Record the status first; the END action exits with exit_code.
        exit_code = 1
        exit exit_code
    }
    arg_cnt = 1
    while (arg_cnt < ARGC) {
        requested_name[arg_cnt] = ARGV[arg_cnt]
        arg_cnt++
    }
    arg_cnt--
    # Throw away weight names from ARGV array by clearing ARGC so the
    # weight-names aren't treated as files to be processed by awk.
    ARGC = 1
}

# Exit with preserved exit code in case a non-fatal error was detected.
END {
    exit exit_code
}

# Look for a change in the comment character in use in the localedef file.
/^comment_char/ {
    comment_char = $2
    print $0
    next
}

# Look for comment line defining collation weight names.
index($0, comment_char) == 1 {
    print $0
    sub(".", "")
    if (index($0, "weight-names") != 1) {
        # Not a weight-names comment; go on to next line.
        next
    }
    if (weight_names_cnt) {
        printf("Error on line %d: Only one weight-names %s\n", NR,
            "directive allowed.") >> "/dev/tty"
        exit_code = 2
    }
    # Throw away the comment part and split out the weight names.
    sub("^[^ \t]*[ \t]*", "")
    weight_names_cnt = split($0, names, ";")
    # Step through the names; no per-name processing is done here.
    loop_cnt = 1
    while (loop_cnt <= weight_names_cnt) {
        loop_cnt++
    }
    if (weight_names_cnt < arg_cnt) {
        printf("Error on line %d: More weight names given %s %s", NR,
            "on command line than are specified in weight-names",
            "directive.\n") >> "/dev/tty"
        exit_code = 3
        weight_names_cnt = 0
        next
    }
    # Map each requested name to its position in the weight-names list.
    loop_cnt = 1
    while (loop_cnt <= arg_cnt) {
        loop2_cnt = 1
        while (loop2_cnt <= weight_names_cnt) {
            if (names[loop2_cnt] == requested_name[loop_cnt]) {
                name_index[loop_cnt] = loop2_cnt
                break
            }
            loop2_cnt++
        }
        if (loop2_cnt > weight_names_cnt) {
            printf("weight-name \"%s\" %s %d\n", requested_name[loop_cnt],
                "not found in weight-names directive on line",
                NR) >> "/dev/tty"
            exit_code = 4
        }
        loop_cnt++
    }
    # if (exit_code) {
    #     weight_names_cnt = 0
    # }
    next
}

# Look for order_end directive.
/^order_end/ {
    in_order = 0
}

# Look for order_start directive.
/^order_start/ {
    in_order = 1
    if (weight_names_cnt == 0) {
        printf("Error on line %d: order_start found %s\n", NR,
            "before weight-names directive.") >> "/dev/tty"
        exit_code = 5
        in_order = 0
    }
}

# If we're not between order_start and order_end, don't have any weights, or
# haven't found any weight names, just copy out the line.
in_order == 0 || NF < 2 || exit_code != 0 {
    print $0
    next
}

# Fix lines from order_start up to order_end line as requested.
{
    # Find, print, and throw away part of line before weights.
    match($0, "^[^ \t]*[ \t]*")
    printf("%s", substr($0, 1, RLENGTH))
    # Split the weights into separate fields.
    $0 = substr($0, RLENGTH + 1)
    split($0, weights, ";")
    # Put out weight fields in the order requested on the command line.
    loop_cnt = 1
    while (loop_cnt <= arg_cnt) {
        printf("%s%c", weights[name_index[loop_cnt]],
            loop_cnt == arg_cnt ? "\n" : ";")
        loop_cnt++
    }
}
*** End "collation_weights.awk" ***


WG15 TAG/N573
IEEE P1003.2/N9603                                          Page 1 of 1
IEEE PASC/N584

Date: 1996-01-17
Supersedes: 0
Document Title: Comments on "Recommendation on ext.
    use of characters in identifiers (WG15 N227)" presented in
    JTC1/SC22/WG15 N420 in response to WG15 Action Item 9510-14
Source: IEEE PASC Shell and Utilities Working Group
Category: ???
Status: ???
Action Requested: Forward to WG15, WG20, and SC22???

Details:

WG15 Document N420 (dated 1993-10-22) states that WG20 and a number of
SC22 ad hoc groups recommend that programming languages be extended to
allow identifiers to be extended based on a list of characters specified
as their code points in UCS-2, in addition to the Latin non-accented
upper and lower case letters and digits.

This recommendation makes a lot of sense for compilers and interpreters
running on implementations using ISO 10646 as the underlying codeset. We
believe it would also make sense in cases where a single subset of those
UCS-2 characters could be identified as common code points in all of the
codesets used in all of the locales supported by an implementation.

However, in the case of the various languages specified in ISO/IEC
9945-2 (POSIX Shell and Utilities), there are problems with supporting
these extensions when ISO 10646 is not the underlying codeset. (Note
that 9945-2 is codeset neutral in most respects and does not mandate
support for any particular codeset except in cases where the data is
intended to be used for interchange between implementations.) When
users are allowed to define their own locales (using charmap files,
locale definition files, and the localedef utility), there is no way for
an implementation to efficiently implement the grammars for languages
like awk, lex, sh, and yacc. In the worst case, the utilities would have
to rebuild themselves from source when they are invoked with a different
locale to be able to build a parser to recognise appropriate identifiers
in the codeset used by the then current locale.

Therefore, we believe that it is inappropriate for the WG20 N227
recommendations to be applied to the interpretive languages specified in
9945-2.