SC22/WG15 N316
WG15RIN N091
WG20 N103


Canadian Contribution to the ISO/IEC SC22 WG20

 

October 16, 1992

Milos Lalovic


 

Short Character Names for defining Standard Locales

To guarantee the same cultural behaviour of applications in distributed computing environments, locales across the network must have the same definition.  A reasonable way to satisfy this requirement is to standardize locale definitions.  Standard locale definitions would normally be provided by relevant national standards groups, so a standard locale definition syntax would have to be used.

Short symbolic character names are an important element of the locale definition syntax.  Many national standards groups have already defined their national locales using their own choice of short symbolic character names.  It is therefore impossible to define standard short symbolic character names that satisfy everyone's taste and preserve the investment in already defined locales.  The solution is to allow national standards groups to define their national standard locales using the short symbolic character names of their choice, provided that the locale definitions are accompanied by a reference table that uniquely and unambiguously describes the short symbolic character names in terms of ISO 10646 hexadecimal identifiers.  An ISO 10646 hexadecimal identifier has the following form:  <Uxxxxxxxx>, where "xxxxxxxx" represents eight hexadecimal digits expressing the code point value of the corresponding ISO 10646 character in canonical form.
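As an illustration of the identifier form, the following minimal sketch builds a <Uxxxxxxxx> identifier from a code point value.  The function name ucs_identifier and the choice of uppercase hexadecimal digits are assumptions made for this example; the proposal itself does not specify letter case.

```python
def ucs_identifier(code_point: int) -> str:
    """Return the eight-digit identifier for an ISO 10646 code point.

    Hypothetical helper: uppercase hex digits are an assumption, since
    the proposal only says "eight hexadecimal digits".
    """
    # The value must fit in eight hexadecimal digits.
    if not 0 <= code_point <= 0xFFFFFFFF:
        raise ValueError(f"code point out of range: {code_point!r}")
    # Zero-pad to exactly eight digits and wrap in <U...>.
    return f"<U{code_point:08X}>"


if __name__ == "__main__":
    print(ucs_identifier(0x41))    # LATIN CAPITAL LETTER A
    print(ucs_identifier(0x3B1))   # GREEK SMALL LETTER ALPHA
```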

Most standard character sets are already defined in ISO 10646, so most short symbolic character names will have a corresponding ISO 10646 hexadecimal identifier.  Where a standard character set has not yet been defined in ISO 10646, the ISO 10646 hexadecimal identifier will be replaced by the name of the standard character set, followed by a string of hexadecimal digits representing the code point value in that character set (e.g. <ISO6429_xx>).

This method does not require the use of the ISO 10646 character encoding scheme; only the ISO 10646 hexadecimal identifiers are required.

The following is the syntax for the reference table:

<Uxxxxxxxx> <short-name>

A blank (or possibly a horizontal tab) would separate the two fields, and each entry would be terminated by a newline character.

If a given ISO 10646 hexadecimal identifier has more than one short name, there would be one entry for each short name, e.g.

<Uxxxxxxxx> <short-name1>
<Uxxxxxxxx> <short-name2>
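
The reference table syntax above can be read with a few lines of code.  The following is a minimal sketch, not a definitive implementation: the function name parse_reference_table, the sample short names, and the decisions to accept runs of blanks or tabs as the separator and to skip empty lines are all assumptions made for this example.  It handles only the eight-digit <Uxxxxxxxx> form.  Because the table must describe each short name uniquely, the sketch rejects a short name that is mapped to two different identifiers, while allowing several short names for one identifier.

```python
import re

# One entry: <Uxxxxxxxx>, a separator of blanks or tabs, then <short-name>.
# Accepting a run of separators (not just one) is an assumption.
ENTRY = re.compile(r"(<U[0-9A-Fa-f]{8}>)[ \t]+(<[^<>\s]+>)")


def parse_reference_table(text: str) -> dict[str, str]:
    """Map each short name to its ISO 10646 hexadecimal identifier."""
    names: dict[str, str] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skipping empty lines is an assumption
        match = ENTRY.fullmatch(line)
        if match is None:
            raise ValueError(f"line {number}: malformed entry {line!r}")
        identifier, short_name = match.groups()
        # A short name must resolve to exactly one identifier.
        if names.get(short_name, identifier) != identifier:
            raise ValueError(f"line {number}: {short_name} is ambiguous")
        names[short_name] = identifier
    return names


if __name__ == "__main__":
    # Hypothetical short names, chosen only for illustration.
    sample = (
        "<U00000041> <A>\n"
        "<U00000061> <a>\n"
        "<U00000061> <small-a>\n"
    )
    for short_name, identifier in parse_reference_table(sample).items():
        print(short_name, "->", identifier)
```

Keying the result by short name reflects how a locale definition would use the table: it meets a short name and needs the one identifier behind it.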

Currently only the UCS-2 form of ISO 10646 has been assigned, so all ISO 10646 hexadecimal identifiers will have four leading zeros.  This may be avoided if the syntax is extended to allow an alternate form for ISO 10646 hexadecimal identifiers that has only four hexadecimal digits following the letter U (e.g. <Uxxxx>).  There is no danger of ambiguity, since the identifiers with four hexadecimal digits are synonyms for the eight-digit identifiers with four leading zeros.
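
The synonym rule can be sketched as a small normalization step that expands the four-digit form to the eight-digit form before identifiers are compared.  The function name canonical_identifier is an assumption made for this example, and the sketch leaves the case of the hexadecimal digits unchanged.

```python
import re

SHORT_FORM = re.compile(r"<U([0-9A-Fa-f]{4})>")   # proposed alternate form
LONG_FORM = re.compile(r"<U([0-9A-Fa-f]{8})>")    # eight-digit form


def canonical_identifier(identifier: str) -> str:
    """Expand <Uxxxx> to <U0000xxxx>; return <Uxxxxxxxx> unchanged."""
    short = SHORT_FORM.fullmatch(identifier)
    if short is not None:
        # The four-digit form is a synonym for four leading zeros.
        return f"<U0000{short.group(1)}>"
    if LONG_FORM.fullmatch(identifier) is not None:
        return identifier
    raise ValueError(f"not an ISO 10646 hexadecimal identifier: {identifier!r}")


if __name__ == "__main__":
    # Both spellings name the same character after normalization.
    print(canonical_identifier("<U0041>") == canonical_identifier("<U00000041>"))
```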