SC22/WG15 N217R, 5 pp

Summary of Some Internationalization Mechanisms in POSIX

Donn Terry

The purpose of this paper is to summarize some of the issues dealing with POSIX internationalization that have proven to be confusing.

I. The locale environment variables:

In the simplest case, a given computer model supports only a single natural language, worldwide. (This is usually English, but might be the language of the country where the system was designed.) The work on internationalization is a recognition that this is unacceptable.

The next simplest solution is that a given system (installation) is used in a single language. It is installed in some country, so it can use the language and cultural conventions of that country. Upon a bit of reflection, this does not work either; it may be necessary to use a computer in one country to process data appropriate for another country. Dedicating a single multi-user machine to each language and culture is uneconomical, particularly if the need to process data for that language/culture is not frequent.

Thus it is clear that a computer system must be able to adapt to various language and cultural conventions reasonably dynamically.

All this is obvious. However, implementation of this brings out another aspect that is a bit less obvious (until it is pointed out): the language in which the programmer operates may be different from the language of the data he is using. It can also be that the language does not track the cultural conventions: many languages are used in more than one country, and some countries use more than one language. However, other cultural conventions (usually) follow national boundaries. Currency is an example. Thus, the language conventions and the other cultural conventions are somewhat independent.

To accommodate this variation, POSIX provides several layers of specification of language and cultural convention that are user selectable.

At the bottom is the vendor default language.
The system must understand some default set of conventions and some language, and if for no other reason than accident, the vendor chooses one.

The next level is the installation default. This is the same as the vendor default in many systems, but it may also be possible to override this with a default value that is specific to the installation. This is probably either the vendor default or the conventions of the region in which the system is physically installed. The means by which this is done is not specified by POSIX.

Each user of a system may choose to have a different language or set of cultural conventions as his or her personal default, which may be different from the installation default. A visitor to the system or remote access to the system are examples where this need is obvious.

The user's personal default is given by the LANG environment variable. The value of LANG is an implementation-defined string which gives a user the language and cultural conventions he desires. Within POSIX, it is also possible to create user-defined values for LANG, and to use them.

However, the default value is not always right. The situation where a person speaks one language, but is working on data in another language, is not uncommon. Thus there are several environment variables beginning with LC_ which allow fine-tuning of the operation of the system. Each accesses some part of the language and cultural information, and overrides only that part to reflect a different language or set of conventions. These currently are:

LC_CTYPE: The characteristics of the characters making up the language: which are alphabetic, and which are not even part of that language.
LC_COLLATE: The order in which character data is "sorted".
LC_TIME: Processing of time and date related information.
LC_MONETARY: How monetary amounts are processed. (The currency symbol and the like.)
LC_NUMERIC: How non-monetary numeric values are processed. (Is the radix point character a dot or a comma?)
LC_MESSAGES: The language in which error messages appear. (Being considered for standardization.)

One special case is needed: LC_ALL. If a user is using the LC_ variables, then it's likely that occasionally he will want to use an application that reverts back to a single set of language and cultural conventions. LC_ALL overrides the other LC_ variables, and LANG, and puts a single language and set of conventions into control.

LANG: Used whenever the installation default is not as desired, or on general principles.
LC_: Used selectively, when needed, for special purposes. Would not typically be used.
LC_ALL: Used whenever a single, known, set of language and cultural conventions is needed. Normally only used during the execution of a single command, and not set except to revert temporarily to a known state.

POSIX also provides a known, default, environment. This is called the POSIX (or historically C) locale. If any of the language and cultural environment variables are set to "POSIX" (or "C"), the POSIX locale is in force, and that locale is defined by the POSIX standard. The POSIX locale is the same as the C locale where the C locale specifies anything; however the POSIX locale specifies things not addressed by the C locale, such as regular expressions.

II. National Profiles:

From the above, it is obvious that for each linguistic region of a country, a locale must be defined. In a homogeneous country, there might naturally be only one. In some inhomogeneous countries, compromises may be reached to have only one. However, it is expected that the general case will be that more than one locale will be needed in a country. It is also very clear that each country, not the international standards community, is the best source of the information about that country. Thus the concept of national profiles is born.
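
Before going further into profiles, the layering of Section I (installation default, then LANG, then the individual LC_ variables, with LC_ALL overriding everything) can be sketched as a small lookup. This is an illustrative Python sketch, not text from POSIX: the function name effective_locale, the installation_default parameter, the locale names, and the rule that an empty value counts as "not set" are all assumptions made for the example.

```python
import os

# Categories named in this paper; LC_MESSAGES was still under consideration.
CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_TIME",
              "LC_MONETARY", "LC_NUMERIC", "LC_MESSAGES")


def effective_locale(category, env=None, installation_default="POSIX"):
    """Return the locale name that governs one LC_ category.

    Hypothetical helper: checks LC_ALL first, then the category's own
    variable, then LANG, and finally falls back to installation_default.
    Treating an empty string as "not set" is an assumption of this sketch.
    """
    if category not in CATEGORIES:
        raise ValueError("unknown category: " + category)
    if env is None:
        env = os.environ
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return installation_default


if __name__ == "__main__":
    # Locale names are implementation-defined; these are made-up examples.
    user_env = {"LANG": "en_US", "LC_TIME": "de_DE"}
    for cat in CATEGORIES:
        print(cat, "->", effective_locale(cat, user_env))
```

With the made-up environment above, only LC_TIME resolves to the second locale; every other category falls through to LANG. Adding LC_ALL to the dictionary would force all six categories to a single value, which is the "revert to a known state" use described in Section I.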
The concept of a profile goes beyond simply a locale: it may also include specification of character sets or other system characteristics that it makes sense to standardize for a single country. However, the key element is the locale: the collation order, character set conventions, monetary conventions, timezones and the like.

Thus, POSIX is encouraging the various countries to create national profiles to reflect their national needs. Ultimately, these will have to be "harmonized": the linguistic conventions for two countries sharing the same language should not be unnecessarily different, and the conventions of countries sharing some common cultural convention should also not be unnecessarily different.

One issue of concern is that there are situations where there is a single "common" baseline for a given country, and then variations. Most commonly, conventions such as monetary units are the same, but the languages vary. In theory, it is possible to have a single locale for that country using the dominant (or an arbitrarily chosen) language. Then, locales for each additional language specify just the differences for that language. This overlaying operation can be done either by hand, or theoretically it could be done automatically.

However, creating national locales is not done very often, and once done, the result is reasonably static. Thus the current standard does not implement automatic layering, but rather expects it to be done once, by hand, when it needs to be done. Given the rather fine control provided by the LC_ environment variables, there should be little reason for the user to create his own custom set of conventions, which is where automatic layering would be of much value.

III. Character Sets:

All of the above tacitly assumed that the characters required for each language and cultural environment were available. With the historically common ASCII character set, and its variants, this is not even nearly true.
Simple ASCII is useful only in English-speaking countries, and the few other countries that use an alphabet no larger than that of English.

ASCII is a seven-bit character code. By replacing some of the less frequently used (in worldwide usage) ASCII characters with other characters, a character set quite similar to it can be used for one specific additional language. A different substitution would work for a second language. However, not all languages can be handled in this way, because only a handful of characters are available for such substitution. (Going further would make the character set simply unusable because critical punctuation would have to be removed. As it is, removing specifically [ ] { and } presents many problems, particularly for programming languages.) Worse yet, mixing two or more languages is impossible because they require the same cell of the 7-bit code to represent different characters.

A single 8-bit code is sufficient to add all the characters used in Western Europe in a single set. However, this is not sufficient to add the Latin-based alphabets of Eastern Europe, let alone Cyrillic (and its variants), Greek, Hebrew, and Arabic. Multiple character codes (and switching between them) are required to mix such characters.

Japanese and Chinese present an even larger problem. Depending on which dictionary is used, they have between 2,000 and 80,000 or so distinct characters. In addition, Japanese tends to use the character set native to the subject language for inclusions from languages other than Japanese. In effect, for proper Japanese usage they also require at least the full Western European and Cyrillic character sets in addition to their own.

Adding the other character sets used in the world, even if limiting to populations of a million or more speakers, introduces many more characters. Further, specialized symbol sets, such as mathematics and chemistry, add to the problem.
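
The collision described above, where one code value must stand for different characters depending on the character set in force, can be seen directly with 8-bit codes. The sketch below uses Python's built-in codecs for three ISO 8859 sets (Western European, Cyrillic, Greek). These particular sets are not named in this paper and are chosen only as an illustration; the helper name interpretations is hypothetical.

```python
# One byte value, three 8-bit character sets:
# Western European (Latin-1), Cyrillic, and Greek.
CHARSETS = ("latin_1", "iso8859_5", "iso8859_7")


def interpretations(byte_value, charsets=CHARSETS):
    """Decode a single byte under each character set (hypothetical helper)."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in charsets}


if __name__ == "__main__":
    for name, char in interpretations(0xE4).items():
        # Print the code point rather than the glyph so the output does not
        # depend on the terminal's own character set.
        print(name, "U+%04X" % ord(char))
```

All three sets agree on the 7-bit ASCII range, so the ambiguity arises only for values above 0x7F. Text that mixes, say, Greek and Cyrillic therefore needs either an explicit switch between sets or a larger code.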
However, just going to a very large character set is not practical. A lot of additional storage would be required to represent data that is successfully represented today.

Depending on the country, the way of handling that may vary: ignoring the problem works in countries where it is not an issue. In others, the relatively limited Western European set is sufficient. A limited Japanese or Chinese character set is also quite effective, where the total number of characters is limited to somewhere between about 14,000 and 65,000, depending on the technology (with somewhere between 3,000 and 7,000 in typical use).

Often, the solution is to have several different character sets, and to explicitly switch between them. However, this makes programming more difficult.

In the case of Japanese and Chinese, even the minimal sets of characters require more bits to be represented than are needed for Western Europe. An application must be able to adapt to the character set in use. Current technology makes it very difficult for an application to deal with characters where the size of the basic character varies dynamically (as opposed to where two or more basic characters are used to represent a single element of the character set). However, the underlying systems do in fact vary the size of the basic character. At the source language level, it is possible to translate to different instruction sequences depending on the size of the basic character, but occasionally a program must "know" what this size is. Thus, the programming tool CHAR_BIT is introduced. This represents the number of bits in a basic character. Typically it might be either 8 or 16. (7-bit codes use 8, and specify the remaining bit in some way.)

IV. Applications:

Use of all these capabilities cannot be automatic; some data might be subject to internationalization, and other data not. There is no automatic way to know which is which.
Thus applications must be coded to take advantage of the services the internationalization environment provides. There are library functions that an application can call to do things such as "convert this date to a printable form", "compare these two strings", or "get the text for a message". By using the versions of these functions that take the internationalization environment into account, the application can be moved from one country or language to another without modification. It may be necessary to do such things as translate messages, but this is a much simpler task than modifying the program itself to do the same thing.

The payback for the costs of internationalization comes when it is possible for software to run in several cultural or linguistic environments. Purchasers gain access to a larger range of software, and vendors to a larger range of markets, both at a lower incremental cost.
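
As an illustration of an application coded against such library functions, the following Python sketch sorts strings, formats a date, and looks up the radix character through locale-aware calls rather than hard-coded rules. It is only a sketch under stated assumptions: Python's locale and time modules stand in for the underlying C library services, the helper name report is hypothetical, and falling back to the "C" locale when the requested one is not installed is a choice made for the example.

```python
import locale
import time


def report(words, when, locale_name=""):
    """Sort words and format a date under the named locale.

    An empty locale_name asks the library to use the LANG / LC_ variables
    from the environment. Hypothetical helper for illustration only.
    """
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        # The requested locale is not installed; fall back to the POSIX locale.
        locale.setlocale(locale.LC_ALL, "C")
    return {
        "sorted": sorted(words, key=locale.strxfrm),    # governed by LC_COLLATE
        "date": time.strftime("%x", when),              # governed by LC_TIME
        "radix": locale.localeconv()["decimal_point"],  # governed by LC_NUMERIC
    }


if __name__ == "__main__":
    print(report(["zebra", "Apple", "apple"], time.gmtime(0)))
```

The program text contains no language-specific rules; running it with different values of LANG or the LC_ variables changes the collation, date format, and radix character without any modification to the code. Note that setlocale changes process-wide state, so a real application would normally call it once at startup.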