From keld@dkuug.dk Mon Jul 17 01:23:14 1995
Received: by dkuug.dk id AA23936
  (5.65c8/IDA-1.4.4j for sc22wg15); Sun, 16 Jul 1995 23:23:15 +0200
Message-Id: <199507162123.AA23936@dkuug.dk>
From: keld@dkuug.dk (Keld J|rn Simonsen)
Date: Sun, 16 Jul 1995 23:23:14 +0200
X-Charset: ASCII
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: sc22wg15@dkuug.dk
Subject: Danish proposals to .2b

Additional Danish proposals to .2b can be found in the standard itself,
ISO/IEC 9945-2:1993, in annex G. There were proposals on pax,
reorder-after (was replace-after) and symbolic ellipses.

I have here text for symbolic ellipses that is more or less technically
equivalent to the text in G.2.2.

A proposal on an extension to locale definitions to gain portability
=====================================================================

* RATIONALE:

  To make a locale portable by using ellipsis in the locale definition.

* PROBLEM:

  The current semantics of the POSIX ellipsis (hereafter called
  "code-value ellipsis", which is represented by three adjacent
  periods: "...") is code-value dependent and only valid within a
  single encoded character set. A code-value ellipsis is interpreted as
  including in the list all characters with an encoded value higher
  than the encoded value of the character preceding the ellipsis and
  lower than the encoded value of the character following the ellipsis.
  When the code-value ellipsis is used within a locale definition, it
  limits the portability of that locale definition.

* PROPOSED SOLUTION:

  To introduce a new codeset-independent ellipsis specification,
  "symbolic ellipsis", which defines a sequence of symbolic names.

Syntax
------

The "symbolic ellipsis" is represented by SIX adjacent periods:
"......".

    ......

where each symbolic name consists of zero or more nonnumeric, visible
characters from the portable character set, followed by an integer
formed by one or more decimal digits. The characters preceding the
integer shall be identical in the two symbolic names, and the integer
formed by the digits in the second symbolic name shall be equal to or
greater than the integer formed by the digits in the first name.

Semantics
---------

The symbolic ellipsis is interpreted as a series of symbolic names. It
is based on the symbolic name sequence and has no dependency on the
underlying code values. For example,

    ; ...... ;

is interpreted as:

    , , , ,

where the underlying code values for , , , and are arbitrary.

* SUMMARY:

  - This proposal is not intended to replace the existing code-value
    dependent POSIX ellipsis scheme. The "symbolic ellipsis" provides
    an additional feature for writing a code-set independent locale.

  - In the charmap file, only "code-value ellipsis" is allowed.

  - In the locale definition file, both "code-value ellipsis" and
    "symbolic ellipsis" can be used.

  - When the "code-value ellipsis" is used, the locale becomes code-set
    dependent. One needs to use "symbolic ellipsis" to write a portable
    locale.

  - The pros and cons of using "symbolic ellipsis" are:

    * It makes a LOCALE definition shorter and portable, but the
      charmap definition might become longer and more complicated.

    * It requires the symbolic names to have the specific form
      described above under "Syntax".

  - The "symbolic ellipsis" especially benefits locale definitions with
    large character sets. For example, there are about 6,000 Kanji
    characters in JIS X0208 and about 20,000 ideographic characters (in
    a different order) in ISO 10646 (Unicode). With "code-value
    ellipsis", creating a Japanese locale that supports both the JIS
    X0208 and ISO 10646 code sets requires two separate charmaps and
    two separate locale definitions. Using "symbolic ellipsis", only
    ONE locale definition need be created, together with two charmap
    files.
    The only alternative to the use of symbolic ellipsis is to create a
    very large locale definition that explicitly names all 6,000
    characters.

* EXAMPLE:

To define to be upper case characters in the LC_CTYPE. Each code value
of is defined in its charmap file. For example, here are two different
charmap files:

(1) code set 1 charmap:

    CHARMAP
    ... \x41
    END CHARMAP

(2) code set 2 charmap:

    CHARMAP
    ... \xC1
    ... \xD1
    ... \xE2
    END CHARMAP

If the locale definition is using code-value ellipsis:

    LC_CTYPE
    upper ...

The resulting behavior will differ depending on which of the charmap
files above is used:

(1) when compiled with the "code set 1" charmap, all the characters
    from code value \x41 to \x5A (26 characters) are "upper".

(2) when compiled with the "code set 2" charmap, all the characters
    from code value \xC1 to \xE9 (41 characters) are "upper". This is
    probably not what was intended.

If the locale definition is using symbolic ellipsis:

    LC_CTYPE
    upper ......

The resulting behavior will be the same no matter which charmap is
used:

(1) when compiled with the "code set 1" charmap, all the symbolic names
    between and (code values \x41 to \x5A, 26 characters) are "upper".

(2) when compiled with the "code set 2" charmap, all the symbolic names
    between and (code values \xC1 to \xC9, \xD1 to \xD9, \xE2 to \xE9,
    26 characters) are "upper".
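
As an illustration of the rules given under "Syntax" and "Semantics",
here is a minimal Python sketch of how a symbolic ellipsis might be
expanded. It is not taken from the standard or from any localedef
implementation. The function name expand_symbolic_ellipsis, the names
<x1> to <x3> and the two small charmap tables are hypothetical. The
sketch also assumes that names are written in angle brackets and that
generated names keep the digit width of the first name, which the
proposal text does not state.

```python
import re

# A symbolic name: zero or more nonnumeric visible characters,
# followed by an integer of one or more decimal digits.
# Assumption: the name is written in angle brackets.
_NAME = re.compile(r"^<([^0-9<>\s]*)([0-9]+)>$")


def expand_symbolic_ellipsis(first, last):
    """Return the list of symbolic names from first to last, inclusive."""
    m_first = _NAME.match(first)
    m_last = _NAME.match(last)
    if m_first is None or m_last is None:
        raise ValueError("name must be a nonnumeric prefix plus digits")
    prefix_first, digits_first = m_first.groups()
    prefix_last, digits_last = m_last.groups()
    # The characters preceding the integer must be identical.
    if prefix_first != prefix_last:
        raise ValueError("prefixes of the two names differ")
    start, end = int(digits_first), int(digits_last)
    # The second integer must be equal to or greater than the first.
    if end < start:
        raise ValueError("second integer is smaller than the first")
    # Assumption: keep the zero-padded width of the first name.
    width = len(digits_first)
    return [f"<{prefix_first}{n:0{width}d}>" for n in range(start, end + 1)]


# Two hypothetical charmaps giving the same names different code values.
charmap_1 = {"<x1>": 0x41, "<x2>": 0x42, "<x3>": 0x43}
charmap_2 = {"<x1>": 0xC1, "<x2>": 0xD1, "<x3>": 0xE2}

names = expand_symbolic_ellipsis("<x1>", "<x3>")
for charmap in (charmap_1, charmap_2):
    # The same names are selected; only the code values differ.
    print(names, [hex(charmap[name]) for name in names])
```

The expansion works on the names alone, so the same list of names is
produced whichever charmap later supplies the code values.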