From ajosey@rdg.opengroup.org Fri Feb 9 21:04:45 2001
Received: from mailgate.rdg.opengroup.org (mailgate.rdg.opengroup.org [192.153.166.4])
    by dkuug.dk (8.9.2/8.9.2) with SMTP id VAA43175
    for ; Fri, 9 Feb 2001 21:04:44 +0100 (CET)
    (envelope-from ajosey@rdg.opengroup.org)
Received: by mailgate.rdg.opengroup.org; id AA26672; Fri, 9 Feb 2001 20:03:42 GMT
Received: from unknown [216.218.247.8] by smtp.opengroup.org
    via smtpd V1.38 (00/07/25 13:18:13) for ; Fri Feb 09 20:03 GMT 2001
Received: (from ajosey@localhost)
    by skye.rdg.opengroup.org (8.9.3/8.8.7) id UAA02220
    for sc22wg15@dkuug.dk; Fri, 9 Feb 2001 20:02:07 GMT
Date: Fri, 9 Feb 2001 20:02:07 GMT
From: Andrew Josey
Message-Id: <1010209200206.ZM2219@skye.rdg.opengroup.org>
Reply-To: ajosey@rdg.opengroup.org (Andrew Josey)
X-Mailer: Z-Mail (5.0.0 30July97)
To: sc22wg15@dkuug.dk
Subject: WG15 Issue 10
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Dear All,

Enclosed is the response to WG15 Issue 10 from the Austin Group.

The following changes have been made to draft 4. The notation
[+ +] surrounds additions from d4, and [- -] surrounds deletions.

p.58 s.3 Definitions

1848 [-3.103 Collating Element Order
1849 The relative order of collating elements as determined by the setting of the LC_COLLATE
1850 category in the current locale.
1851 The collating element order is used in range expressions in REs and is determined by the order in
1852 which collating elements are specified between order_start and order_end keywords in the
1853 LC_COLLATE category.-]

p.59

1858 3.105 Collation Sequence
1859 The relative order of collating elements as determined by the setting of the LC_COLLATE
1860 category in the current locale. The collation sequence is used for sorting and is determined from
1861 the collating weights assigned to each collating element.
In the absence of weights, the collation
1862 sequence is [-also the collating element order.-] [+the order in which collating elements are specified between order_start and order_end keywords in the LC_COLLATE category.+]

(note the following text was moved to rationale, as per .2 when
chapter 7 was reworked for another ERN)

p.158 s.7.3.2.3 The order_start Keyword

4662 The character[- (and collating element)-] order is defined by the order in which characters and
4663 elements are specified between the order_start and order_end keywords. [-This character order is
4664 used in range expressions in regular expressions (see Chapter 9).-] Weights assigned to the

p.159 s.7.3.2.4 Collation Order

4746 The collation order as defined in this section [-defines-][+affects+] the interpretation of bracket expressions in
4747 regular expressions (see Section 9.3.5 (on page 199)).

p.160

4766 1. The UNDEFINED means that all characters not specified in this definition (explicitly or
4767 via the ellipsis) shall be ignored for collation purposes[-; for regular expression purposes
4768 they are ordered first-].

p.195 s.9.2 Regular Expression General Requirements

6300 character, but also its case counterpart (if any), shall be matched. This definition of case-
6301 insensitive processing is intended to allow matching of multi-character collating elements as
6302 well as characters[-. For example-], as each character in the string is matched using both its cases[+. For example, in a locale where "Ch" is a multi-character collating element and where a matching list expression matches such elements+],
6303 the RE "[[.Ch.]]" when matched against the string "char", is in reality matched against
6304 "ch", "Ch", "cH", and "CH".

p.199 s.9.3.5 RE Bracket Expression

6347 classes, character classes, or range expressions. [-Portable applications shall not use range
6348 expressions, even though all implementations shall support them. -]The right-bracket (']')

6360 2.
A matching list expression specifies a list that shall match any [+character in any+][-one-] of the expressions
6361 represented in the list. The first character in the list shall not be the circumflex; for
6362 example, "[abc]" is an RE that matches any of the characters 'a', 'b', or 'c'. [+It is unspecified whether a matching list expression matches a multi-character collating element that is matched by one of the expressions.+]
6363 3. A non-matching list expression begins with a circumflex ('^'), and specifies a list that shall
6364 match any character [-or collating element -]except for the expressions represented in the list
6365 after the leading circumflex. For example, "[^abc]" is an RE that matches any character
6366 [-or collating element -]except the characters 'a', 'b', or 'c'. [+It is unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the expressions.+] The circumflex shall have this

6372 that make up the multi-character collating element. For example, if the string "ch" is a
6373 collating element in the current collation sequence with the associated collating symbol
6374 <ch>, the expression "[[.ch.]]" shall be treated as an RE [+containing the collating symbol+][-matching the character
6375 sequence-] 'ch', while "[ch]" shall be treated as an RE matching 'c' or 'h'. Collating
6376 symbols are recognized only inside bracket expressions.[- This implies that the RE
6377 "[[.ch.]]*c" shall match the first to fifth character in the string "chchch".-] If the string
6378 is not a collating element in the current [-collating sequence definition, or if the collating
6379 element has no characters associated with it (for example, see the symbol in the
6380 example collation definition shown in Section 7.3.2.2 (on page 157)), the symbol shall be
6381 treated as an invalid expression-][+locale, the expression is invalid+].

p.200

6402 7.
[+In the POSIX locale, a+][-A-] range expression represents the set of collating elements that fall between two elements
6403 in the [+collation sequence+][-collating element order of the current locale-], inclusive.[+ In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched.+] A range expression shall be
6404 expressed as the starting point and the ending point separated by a hyphen ('-').
6405 [-Range expressions shall not be used in portable applications because their behavior is
6406 dependent on the collating sequence.-]
6407 In the following, all examples assume[- the collation sequence specified for-] the POSIX locale[-,
6408 unless another collation sequence is specifically defined-].

6412 within a bracket expression, but only outside the range.[- For example, the unspecified
6413 expression "[[=e=]-f]" should be given as "[[=e=]e-f]". The ending range point
6414 shall collate equal to or higher than the starting range point; otherwise, the expression is
6415 treated as invalid. The order used is the order in which the collating elements are specified
6416 in the current collation definition. One-to-many mappings (see the description of
6417 LC_COLLATE in Section 7.3.2 (on page 155)) are not performed. For example, assuming
6418 that the character eszet ('ß') is placed in the collation sequence after 'r' and 's', but
6419 before 't' and that it maps to the sequence "ss" for collation purposes, then the
6420 expression "[r-s]" matches only 'r' and 's', but the expression "[s-t]" matches
6421 's', 'ß', or 't'.-] [+If the represented set of collating elements is empty, it is unspecified whether the expression matches nothing, or is treated as invalid.+]

6430 '@' inclusive; and the expression "[a--@]" is[+ either+] invalid[+ or equivalent to "@"+], because the letter 'a' follows the
6431 symbol '-' in the POSIX locale. To use a hyphen as the starting range point, it shall either
To use a hyphen as the starting range point, it shall either XCUd4 changes: p.3137 tr Utility // This change is independent of the other changes in this proposal. // It allows the implementation to define tr ranges to have the same // meaning as range expressions, and it clarifies some ambiguous // wording about collation sequence versus collating element order. 35710 c-c [+In the POSIX locale, represents+][-Represents-] the range of collating elements between the range endpoints (as long as 35711 neither endpoint is an octal sequence of the form \octal), inclusive, as defined by 35712 the [+collation sequence.+][-current setting of the LC_COLLATE locale category. The application shall 35713 ensure that the starting endpoint precedes the second endpoint in the current 35714 collation order.-] The characters or collating elements in the range shall be placed in 35715 the array in ascending collation sequence. [+If the the the second endpoint precedes the starting endpoint in the collation sequence, it is unspecified whether the range of collating elements is empty, or this construct is treated as invalid. In locales other than the POSIX locale, this construct has unspecified behavior.+] XRATd4 changes: p.3357 s.A.7.3.2 // Remove text that merely repeats XBDd4 p.156 s.7.3.2.3 lines // 4662-4666, which is changed above. 1812 [-The character (and collating element) order is defined by the order in which characters and 1813 elements are specified between the order_start and order_end keywords. This character order is 1814 used in range expressions in regular expressions (see the Base Definitions volume of IEEE Std. 1003.1-200x, 1815 Chapter 9, Regular Expressions). Weights assigned to the characters and elements define the 1816 collation sequence; in the absence of weights, the character order is also the collation sequence.-] p.3362 s.A.9.1 2016 bracket expression only matched a single character. 
[-If, however, the bracket expression defines,
2017 for example, a range that includes ij, then this particular bracket expression also matches a
2018 sequence of the two characters 'i' and 'j' in the string.-] [+POSIX.2-1992 required bracket expressions like [^[:lower:]] to match multi-character collating elements such as "ij". However, this requirement led to behavior that many users did not expect and that could not feasibly be mimicked in user code, and it was rarely if ever implemented correctly. The current standard leaves it unspecified whether a bracket expression matches a multi-character collating element, allowing both historical and POSIX.2-1992 implementations to conform.+]

p.3363 s.A.9.2

// Remove text that merely repeats XBDd4 p.195 s.9.2 lines 6300-6304,
// which is changed above.

2036 [-The definition of case-insensitive processing is intended to allow matching of multi-character
2037 collating elements as well as characters. For instance, as each character in the string is matched
2038 using both its cases, the RE "[[.Ch.]]", when matched against "char", is in reality matched
2039 against "ch", "Ch", "cH", and "CH".-]

p.3364 s.A.9.3.5 RE Bracket Expressions

2073 Range expressions are, historically, an integral part of REs. However, the requirements of
2074 ``natural language behavior'' and portability do conflict[-:-][+. In the POSIX locale,+] ranges must be treated according to the
2075 [-current -]collating sequence and include such characters that fall within the range based on that
2076 collating sequence, regardless of character values. [-This means, however, that the interpretation
2077 will differ depending on collating sequence. If, for instance, one collating sequence defines 'ä' as
2078 a variant of 'a', while another defines it as a letter following 'z', then the expression "[ä-z]"
2079 is valid in the first language and invalid in the second.
This kind of ambiguity should be avoided
2080 in portable applications, and therefore the standard developers elected to state that ranges must
2081 not be used in strictly conforming applications; however, implementations must support them.-] [+In other locales, ranges have unspecified behavior.+]

2093 As noted previously, the new syntax and rules have been added to accommodate other
2094 languages than English. The remainder of this section describes the rationale for these
2095 modifications.

[+In the POSIX locale, a regular expression that starts with a range
expression matches a set of strings that are contiguously sorted, but
this is not necessarily true in other locales. For example, a French
locale might have the following behavior:

    $ ls
    alpha   Alpha   estimé   ESTIMÉ   été   eurêka
    $ ls [a-e]*
    alpha   Alpha   estimé   eurêka

Such disagreements between matching and contiguous sorting are
unavoidable because POSIX sorting cannot be implemented by a
deterministic finite-state automaton.

Historical implementations used native character order to interpret
range expressions. POSIX.2-1992 instead required collating element
order (CEO): the order that collating elements were specified between
the order_start and order_end keywords in the LC_COLLATE category of
the current locale. CEO had some advantages in portability over the
native character order, but it also had some disadvantages:

* CEO could not feasibly be mimicked in user code, leading to
  inconsistencies between POSIX matchers and matchers in popular user
  programs like Emacs, ksh and Perl.

* CEO caused range expressions to match accented and capitalized
  letters contrary to many users' expectations. For example, [a-e]
  typically matched both "E" and "á" but neither "A" nor "é".

* CEO was not consistent across implementations. In practice CEO was
  often less portable than native character order.
  For example, it was common for the CEOs of two
  implementation-supplied locales to disagree, even if both locales
  were named "da_DK".

Because of these problems, some implementations of regular expressions
continued to use native character order. Others used the collation
sequence, which is more consistent with sorting than either CEO or
native order, but which departs further from the traditional POSIX
semantics because it generally requires [a-e] to match either "A" or
"E" but not both. As a result of this kind of implementation
variation, programmers who wanted to write portable regular
expressions could not rely on POSIX.2-1992's guarantees in practice.

While revising the standard, lengthy consideration was given to
proposals to attack this problem by adding an API for querying the CEO
to allow user-mode matchers, but none of these proposals had
implementation experience and none achieved consensus. Leaving the
standard alone was also considered, but rejected due to the problems
described above.

The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. This makes it clearer that
portable applications should avoid range expressions outside the POSIX
locale, and it allows implementations and compatible user-mode
matchers to interpret range expressions using native order, CEO,
collation sequence, or other, more advanced techniques.

POSIX.2-1992 required [b-a] to be an invalid expression in the POSIX
locale, but this requirement has been relaxed in this version of the
standard so that [b-a] can instead be treated as a valid expression
that does not match any string.+]

-----
Andrew Josey                                The Open Group
Director, Server Platforms                  Apex Plaza,Forbury Road,
Email: a.josey@opengroup.org                Reading,Berks.RG1 1AX,England
Tel:   +44 118 9508311 ext 2250             Fax: +44 118 9500110
Mobile: +44 774 015 5794
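
Illustrative sketch (not part of the Austin Group response): the
rationale for range expressions can be made concrete with a toy model
in which the same range [a-e] is expanded under native character order,
a collation-sequence-like order, and a CEO-like order. The two orderings
below are hypothetical and are not taken from any real locale
definition; the names range_members, matching and empty_is_error are
likewise made up for this example. Python is used only as a convenient
notation.

```python
# Toy model: one range expression, three possible interpretations.
a_acute, A_acute = "\u00e1", "\u00c1"  # a and A with acute accent
e_acute, E_acute = "\u00e9", "\u00c9"  # e and E with acute accent

# Hypothetical orderings, invented for illustration only.
# Collation-sequence-like: lower case sorts just before upper case.
COLLATION = list("aA" + a_acute + A_acute + "bBcCdDeE" + e_acute + E_acute)
# CEO-like: upper case listed before lower case for each letter.
CEO = list("Aa" + A_acute + a_acute + "BbCcDdEe" + E_acute + e_acute)


def range_members(start, end, order=None, empty_is_error=False):
    """Return the set of characters from start to end, inclusive.

    order=None selects native (code point) order; otherwise `order` is a
    list that defines the ordering. An empty range either raises an error
    or yields an empty set, the two outcomes the revised text permits.
    """
    if order is None:
        members = {chr(c) for c in range(ord(start), ord(end) + 1)}
    else:
        lo, hi = order.index(start), order.index(end)
        members = set(order[lo:hi + 1])
    if not members and empty_is_error:
        raise ValueError("invalid range %s-%s" % (start, end))
    return members


def matching(names, start, end, order=None):
    """Names whose first character is in the range, like `ls [a-e]*`."""
    members = range_members(start, end, order)
    return [n for n in names if n[0] in members]


names = ["alpha", "Alpha", "estim" + e_acute, "ESTIM" + E_acute,
         e_acute + "t" + e_acute, "eur\u00eaka"]

native = matching(names, "a", "e")               # code point order
collated = matching(names, "a", "e", COLLATION)  # toy collation sequence
ceo = matching(names, "a", "e", CEO)             # toy CEO
```

Under these assumed orderings, the native interpretation selects only
the names starting with unaccented lower-case a or e, the
collation-like ordering also selects "Alpha" (the same result as the
French ls example in the rationale), and the CEO-like ordering selects
the capitalized "ESTIM..." name but not "Alpha". Calling
range_members("b", "a") returns an empty set, while passing
empty_is_error=True raises an error instead, which corresponds to the
two permitted treatments of [b-a].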