From isaak@csac.zko.dec.com Fri Aug 20 12:22:11 1993
Received: from crl.dec.com by dkuug.dk with SMTP id AA17968 (5.65c8/IDA-1.4.4j for ); Fri, 20 Aug 1993 22:19:54 +0200
Received: by crl.dec.com; id AA13975; Fri, 20 Aug 93 16:20:29 -0400
Received: by csac.zko.dec.com (5.65/fma-100391/BobG-15-Feb-93);id AA18575; Fri, 20 Aug 1993 16:22:11 -0400
Date: Fri, 20 Aug 1993 16:22:11 -0400
From: isaak@csac.zko.dec.com (Jim Isaak-respond via isaak@decvax.dec.com)
Message-Id: <9308202022.AA18575@csac.zko.dec.com>
To: sc22wg15@dkuug.dk
Subject: I18n Framework, all 7 parts together
X-Charset: ASCII
X-Char-Esc: 29

-----------------------------------------------------------------------------
THIS IS A PRELIMINARY VERSION FOR UNOFFICIAL FIRST PASS REVIEW
by members outside SC22WG20
Suggestions and comments to be posted to SC22WG20@dkuug.dk
[For SC22/WG15 please copy comments to RIN]
-----------------------------------------------------------------------------

Foreword

To be filled in later. The key message here is "This is a type 3 technical report".

Introduction

To be filled in later

ISO/IEC DPTR - Framework, requirements and models of Internationalization (WD3A)

Contents

1. Scope
2. Vision
3. Internationalization and Localization
4. Culture-dependent requirements for internationalization
5. Models of Internationalization
6. Expectations and obligations
Annex A Related activities
Annex B Bibliography
Annex C Terminology (Glossary)
Annex D Examples of legal requirements
Annex E Example of solutions

1. Scope

This Technical Report presents the framework and reference model(s) for internationalization, and identifies the services required for the internationalization of information technologies.

Historically, the internationalization of information technologies has been provided on a demand/requirement basis, and thus, solutions have been based on the "Best available technology" approach.
As a result, internationalization solutions for each technology do not necessarily share common directions and goals. In many cases, different technologies have different goals altogether. This technical report presents the relationship between the different requirements, approaches and solutions to the internationalization of information technologies.

This report provides:
- A list and brief discussion of internationalization-related requirements,
- Models and taxonomy for internationalization,
- Methods to provide internationalization features for information technologies, and
/* Editor's note */
/* Are we going to cover following ? */
/* Aren't recommendations in TR 10176 revision ? */
- Recommendations for each standard to be internationalized.

Internationalization services consist of two parts: internationally generic services and nationally specific services. This document presents the generic aspect of these services for software, but not the hardware- or ergonomics-related requirements.

This technical report is to be used by standard providers as a planning reference for the specification of internationalization services for those responsible for different technologies, as well as a basis for all ISO/IEC JTC 1/SC22/WG20 activities. In addition, this technical report is to be used as a communication vehicle between those who provide standards and those who request them.

Information technologies are changing rapidly, and dealing with technologies which are constantly evolving is very difficult. Therefore, the solutions and services available in the mid-80's have been selected as the technological departure point for discussion.

2. Vision

2.1 The need for and importance of internationalization

There is no doubt that information technology is shifting its role from that of a specialized tool for certain people sitting in a room with glass windows, to that of a daily tool for the average person.
Its prevalence in everyday life is approaching that of such social infrastructures as water/power supplies, public communications links, or road/public transportation systems.

Requirement of national language and conventions

The key to making this situation a reality is ensuring the user-friendliness of applications for the average person. One of the most important elements of this ease-of-use is interface user-friendliness, especially in terms of interaction with the system. In addition to requiring meaningful output, application users need to be able to provide input in a way that is friendly to them. For many users this means messages that are displayed in an appropriate natural language, and the use of conventions that are familiar to them for the output and the input of such elements as dates, times, numbers and currency. Incorporating this functionality in an application increases user acceptance and decreases error rates.

Multicultural requirements

An application which is to be deployed in several different countries/cultures will clearly need to be implemented in such a way that it can provide suitable output for (and accept input from) its different users. Even applications produced for a single country or a single location may still need to accommodate users with different cultural backgrounds within that environment (e.g. Canada, Belgium, CEC Secretariat, CERN, airports).

Global uniformity /*Uniform to all*/

The above explanations make the case for user friendliness in terms of different cultural requirements. But applying this concept of user friendliness within the context of each cultural milieu is also paramount. With this in mind, the "Global uniformity" solution aims to minimize or eliminate ambiguities and is another aspect of internationalization that must be taken into consideration.
For example, the date format 01-02-03 could mean February 3rd, 2001, January 2nd, 2003 or February 1st, 2003, all of which dates will become reality within a decade.

Historically, ISO has pursued this "Global uniformity" strategy with respect to other technologies. In the field of information technology, however, a multiple view of internationalization and international standards will also be necessary.

The following criteria evolve from this "Global uniformity" principle:

1) The culture-specific presentation formats which may be interpreted differently from one culture to another should be noted.
   ex.1) date 01/02/03 -- 2001-03-02 or 2001-02-03
   ex.2) number 1,234
   ex.3) 1 pound -- 453 grams or 373 grams depending on the object

2) The existing internationally accepted format should be encouraged, even if it is not currently used in the human interface:
   ex.1) the SI system of measurement
   ex.2) one decimal digit for each day of the week, i.e. 1 for Monday and 7 for Sunday (ISO 8601)

Cross-cultural "Friendliness" /*Friendly to stranger*/

Cross-cultural communication is becoming more and more an integral part of daily life, not only in terms of the multicultural support described earlier. As a result, people may be confronted with, and have to process, data based on concepts unfamiliar to their culture(s). In those cases, friendliness of unfamiliar cultural data is not always the same as friendliness for the daily user of that cultural data. Sometimes, what is friendly to a stranger may even be unfriendly to a native user.

For example, the ordering of CH after CZ (as a section in a dictionary, for example) is natural in Spanish, but not for other people. CZ at the end and CH between CG and CI might be far more friendly, or natural, to most of the world's population.

In this context, user friendliness can be defined as "Friendliness of unfamiliar data".
Therefore, with respect to this "Cross-cultural Friendliness" aspect of establishing international standards, there is a need for data support with the aim of tailorable standards. This, therefore, is another type of new requirement.

In summary, a "friendly" system could mean one of the following:
- a system which supports one of many different cultures (that is, "friendly" to one culture);
- a system which supports several cultures simultaneously;
- a system which supports the "Global uniformity" principle;
- a system which supports "Cross-cultural Friendliness".

2.2 Old assumptions and new environments

New requirements have been driven not only by the need for friendly user interfaces, but also by changes in the information technology environment. Some sample environment changes are listed below.

Traditional assumptions were, as indicated in Table 1:

Item  Assumptions
- The character set is ASCII
- One character is 7 bit
- One character is one byte
- Display width of a character is one column on character cell terminal
- Character count = byte count = display width
- Maximum number of characters is 256 in most cases
- Printable character is ASCII
- The collating sequence is ASCII
- Directly input from a keyboard
- The message is in English
- The writing direction is left to right and top to bottom
- Source code character set is the same as execution character set
- File code = process code
- and so on ...

Table 1 Old Assumptions

These assumptions have, however, been changing. For example, users in Japan use Kanji (Japanese ideographic characters), whose representation requires more than 14 bits. So, if an application assumes that the character set is 7-bit ASCII, then the application does not meet the needs of Japanese users.
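The old assumption "Character count = byte count = display width" from Table 1 can be checked with a minimal sketch. The function name measure is hypothetical, the encodings are only examples chosen for illustration, and the two-column rule for wide characters is a simplifying assumption that ignores combining marks and terminal-specific behavior.

```python
import unicodedata


def measure(text, encoding="utf-8"):
    """Return (character count, byte count, display width) for text.

    Assumption: a character cell terminal shows East Asian wide or
    fullwidth characters in two columns and everything else in one.
    """
    char_count = len(text)
    byte_count = len(text.encode(encoding))
    display_width = sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )
    return char_count, byte_count, display_width


if __name__ == "__main__":
    for sample in ("ABC", "漢字"):
        print(sample, measure(sample), measure(sample, "euc_jp"))
```

For ASCII text the three numbers coincide, which is why the old assumption went unnoticed; for the two Kanji they all differ, and the byte count also changes with the encoding chosen.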
Applications should be adapted to the new environment shown in Table 2:

Item  Requirements
- Character set is anything
- One character is at least 8 bit
- One character is >= one byte
- Display width of a character is a variable number of columns on character cell terminal
- Character count <> byte count <> display width
- Maximum number of characters is unknown
- No assumption can be made on printable characters
- No assumption can be made on collating sequences
- Input from a specially designed device or indirectly from a keyboard using some (interactive) interpreting methods
- Message is in any language, e.g. the national language
- Writing direction might vary among execution environments
- An execution character set might be different from the source character set
- File code <> process code
- and so on ...

Table 2 New Environments

2.3 Vision for future

The needs of internationalization expressed in Section 2.1 are based on current information technology. Most current at-a-distance communication between people is confined to applications within machine(s) in a single cultural area. Also, communication between people and machine(s) is mostly done through text and keyboard interactions.

Growth towards the age of general communication between humans via machines has changed the key topics of the internationalization of software applications. This change includes communication between humans who do not share the same cultural background. The media and modes of communication in the future will extend beyond text to include data for any of the senses.

This scenario may take two steps to make those services available for daily use. First, multiple media/modes will take the place of the text/keyboard oriented interaction between person and machine. Then real person-machine(s)-person communication will take place.
At the first step, from the internationalization viewpoint, a person's motions for input/output are very culture-dependent, so new human interface internationalization methods should be considered. Then real multi-cultural communication must start.

The simplest near-term example is input by handwritten characters. Writing is very culture-dependent even for the same character: a handwritten number 7 may look like a 1 to people from other cultures, and a U.S. 4 is recognized as a 6 by Japanese readers.

After the first step, gradual progress should be made toward the future vision described below.

The fundamental principle that must be taken into account before communication can take place is that all of the components required for message passing must be provided at both ends of the person-machine(s)-person communication link. This requirement applies in any case where people are communicating using a machine through one or more media. The reasons for this requirement relate to technical aspects of consensus or agreement in distributed communication.

Viewing the communication process as a conversation between two mutually understood cultures ensures that technical solutions take into account the human aspects of communication. In the sense used here a 'culture' is a 'world of experience', a 'cultural milieu' or a 'cultural background'.

This constraint implies that before a conversation can occur each intended participant must be able to describe both the topic and the nature of the two cultures. The topic must be stated so that both participants may appreciate the boundaries of the conversation. The nature of their own world must be described so that it may be identified, and the culture with which they wish to communicate must be described, so that the target culture can affirm the acceptability of the description that is offered. The criterion for a successful description is that it (the description) must be agreed to by the target 'culture' as adequate for dealing with the topic of conversation.
This is a reciprocal requirement on each participant.

The human consequence of these requirements is that each person or culture can assert their human right to a cultural identity. This identity includes the right to expect to be addressed in the mode and in the medium that is acceptable to them.

3. Internationalization and Localization

The requirements described in section 2 are all external from the point of view of the system. If the system is sufficiently "user friendly", then the methodology used to ensure this "friendliness" is not of primary concern to the user(s).

On the other hand, the methodology used is of considerable concern to suppliers. Since many different approaches to providing "friendliness" are possible, this section describes the internationalization/localization method as the approach recommended by SC22WG20.

3.1 Current approach

Most application developers incorporate all the code necessary to support different cultural environments into the product design. Code with similar functionality is therefore often developed repeatedly. This is a waste of effort on the part of designers and programmers, and carries with it the risk of inadequate or inconsistent implementation.

Also, the cost of developing applications for multiple cultural environments is high. This means that many applications are developed for a single cultural environment, which automatically limits the potential market for the application.

The weaknesses of the current approach, described above, are:
- High cost due to the reinvention of the same functionalities for different culture(s)
- This reinvention is not only costly, but also results in a timing gap in the introduction of the application in the marketplace. This gap may cause serious problems when systems are components of a worldwide network.
- The possibility of inconsistencies in applications in different cultures arises, even though the external functionality is the same at the beginning
  * Again, this can cause problems within a worldwide network, and
  * lead to up-dating/maintenance problems.

These differences could prohibit consistent next-generation system up-grading. The ownership of reinvented applications might be unclear, which could hinder the original inventor's maintenance support capabilities.

For these reasons, different systematic approaches should be considered.

3.2 Internationalization/Localization approaches

If support services were provided which simplified the development of applications for multiple cultural environments, the effort required to produce applications that were usable in a number of cultural environments would be substantially reduced. Also, the implementation of different cultural features would be more uniform among applications, which means that users would become familiar with what to expect in different circumstances.

The Internationalization

To permit the design and implementation of an application which can accommodate users with a variety of cultural backgrounds, services are required which insulate the application from a variety of cultural differences that are not relevant to its functionality. A system which provides this service is called the internationalized system in this document.

The Localization

The internationalized application must then adapt to the specific cultural interfaces required by users with shared cultural needs. This adaptation process is called localization in this document. Localization can be provided for specific single cultures, multiple cultures, or for the "Global uniformity" or "Cross-cultural friendliness" principles.

Since localization is not limited to a single culture, the ideal internationalized system would be the basis for any worldwide (internationalized) system.
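The split between an internationalized core and its localizations can be illustrated with a minimal sketch. Everything named here (LOCALIZATIONS, present_date, the culture keys) is hypothetical and not part of any standard; the point is only that the application logic contains no cultural convention, and that each culture, including the American one and the "Global uniformity" form, is supplied as data.

```python
from datetime import date

# Hypothetical localization data: each entry is one "localization".
# The American convention is just another entry, not a built-in default.
LOCALIZATIONS = {
    "en_US": {"order": ("month", "day", "year"), "sep": "/"},
    "de_DE": {"order": ("day", "month", "year"), "sep": "."},
    # "Global uniformity": the ISO 8601 year-month-day form.
    "global": {"order": ("year", "month", "day"), "sep": "-"},
}


def present_date(d, culture):
    """Internationalized core: knows nothing about any single culture."""
    conv = LOCALIZATIONS[culture]
    parts = {
        "year": f"{d.year:04d}",
        "month": f"{d.month:02d}",
        "day": f"{d.day:02d}",
    }
    return conv["sep"].join(parts[name] for name in conv["order"])


if __name__ == "__main__":
    d = date(2003, 2, 1)
    for culture in LOCALIZATIONS:
        print(culture, present_date(d, culture))
```

Adding a further culture in this sketch means adding one more data entry; present_date itself is not modified.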
Note: Once a real internationalized system is in place, even localization to USASCII and the American culture will be necessary, so that the American user can use the system. It is not necessary to start with the traditional ASCII system in order to internationalize.

3.3 Relationship to Application Portability

Internationalization as described above is, in other words, very similar to Application Portability.

In principle, applications can be considered to exist in an environment made up of human users and the application platform. Most of the aspects of Application Portability deal with moving an application from one Application Platform to another (i.e. changing the application platform) while keeping the user requirements the same. Internationalization, however, involves changes to the external interface of the application so as to adapt it to different user requirements, while keeping the application platform the same (or within the same family) and the application functionality the same as well. In principle, therefore, internationalization does not need to consider portability across different platforms.

It should be noted here that most applications only communicate with the user via the application platform and thus Figure 1 provides a realistic view where an application's portability is concerned.

+-------------------------------+
|     Application Platform      |   User
|                               |
|   +---------------------+     |    ___
|   |                     |     |   (^_^)
|   |     Application     |     |     |
|   |                     |     | ----+----
|   |                     |     |     |
|   |                     |     |     |
|   +---------------------+     |    / \
|                               |   /   \
|                               |  /     \
+-------------------------------+

Figure 1 Applications Portability View of the Relationship between User, Application and Application Platform

Because of this relationship between application, platform and user, Internationalization should be considered not only for applications, but also for the platforms themselves, together, as a paired set.
In view of this relationship, internationalization can be defined as a "High level of adaptability to different user interface requirements" or "High localizability".

4. Culture-dependent requirements

This section describes the difference in requirements from one culture to another. It itemizes the various specific requirements of cultures, but does not discuss taxonomy or the relationship between the requirements. Also, the methodologies for finding solutions to these requirements are described in section XXXX of this document.

There is not necessarily a direct link between a given requirement and a given solution, but secondary requirements which are derived from original requirements may have a close relationship with the implementation methodologies. For example, an environmental switching mechanism is NOT a user requirement, but it is a secondary requirement and one of the choices necessary in order to fulfill the original requirements.

4.1 Requirements for Cultural Dependencies

It is necessary to adapt the system surface to handle culture-dependent representation and description. Cultural dependencies can be divided into two categories: the first is SCRIPTS, used to present natural language in native form, and the other relates to culture-dependent items such as national conventions. Section 4.2 of this document is concerned with scripts, and culture-dependent items are listed in section 4.3.

In general, the installation of local requirements on to internationalized systems ensures the desired behavior of the localized system. This installation process is called localization.

There are many other customer requirements which can be categorized as cultural requirements. However, those requirements which do NOT stem from cultures, geographically and socially speaking, are NOT included in this document. Such requirements would be categorized as Application field culture or similar.
NOTE: This is only a list of the differences or requirements; it is not always necessary to support all items under internationalization. Trying to accommodate all (or any) requirements in order to make systems friendly to all (or any) users does mean DIVERSIFICATION, which is somewhat contradictory to STANDARDIZATION (which is what ISO is aiming for).

4.2 Script

At the present time, more than 3000 languages are spoken throughout the world. Just over a hundred or so of these languages are actually written. About one half of the world's population uses some version of the Latin script. The other half uses different significant/minor scripts.

Information systems represent these non-ASCII scripts within four writing schemes: alphabetic, (diacritical), syllabic and ideographic:

a. In alphabetic scripts, vowels and consonants have equal importance. Vowels are distributed within the alphabet, rather than grouped at the beginning. Moreover, alphabetic scripts are the only ones in which letters may have uppercase (i.e. capital) and lowercase (i.e. small) forms. Typical non-Latin alphabetic scripts are Cyrillic, Greek, Arabic, Hebrew and Japanese Katakana/Hiragana.

---- Samples of alphabetic character to be here ------

Figure 2 Sample of Alphabetic character

Some of the alphabetic scripts are used with diacritical marks and some are not. From the information technology viewpoint, support of diacritical marks as combining marks (so-called non-spacing characters, for example) requires significantly different technology; therefore, if necessary, alphabetic scripts may be categorized into two groups (with and without diacritical marks).

b. In syllabic scripts, a vowel can appear above, below, within or beside its associated consonant, or a vowel and its associated consonant are combined as a single independent symbol. Most South-East Asian scripts are examples of the former case, and Korean Hangul is an example of the latter. In some of these scripts, vowels are not separate characters.
----- Samples of Syllabic and Ideographic character to be here -----

Figure 3 Sample of Syllabic and Ideographic character

c. In ideographic scripts, e.g. Chinese, each character symbolizes a concept and sound(s). Moreover, ideographic scripts have an open-ended nature in terms of the number of characters within the script.

Alphabetic and syllabic scripts are all phonetic, i.e., without specific meaning attached to the individual character.

A specific script is not systematically attached to a given linguistic family. For example, Persian is an Indo-European language written with Arabic characters which were designed for a Semitic language.

For the writing schemes discussed above, present-day computer systems and their data entry, processing and display facilities must be rendered capable of supporting any operation that can be supported in English.

4.3 List of Cultural Dependent Items

Culture-dependent items recognized as relevant to internationalization are listed below. The more widely used an information system is, the more culture-dependent items there are to be identified. Addition of such items to the list will be done on a demand/registration basis.

It is necessary to note that these cultural elements carry different weight depending on the culture of the user. For example, tolerance of input methods that are difficult to use varies with the number and frequency of appearance of the affected characters in the data: American users may accept the use of the ALT key to enter accented characters more easily than French users, because French users often encounter these characters in their native language text.

/* Editor's note */
/* Questionnaire response should be reflected in this section, if needed */

4.3.1.
Character encoding and handling

Character sets used in data, literals, source code, search functions, and identifiers vary in terms of contents (e.g., national characters), the container size (e.g., multi-octet), and encoding (different codings of the same set of characters in containers of the same or different size).

4.3.2. Text/String comparison/ordering process (Collating sequence)

Collating sequence depends upon the natural languages used. For example, the German sharp-s sorts as ss, and Spanish ch sorts after cz.

4.3.3. Conversion mapping of characters/Case conversion

Mapping of characters for conversion (including case conversion) is required to handle character data in some character sets, while it is not allowed in other character sets. Samples of the conversions are Normalized-Character, Uppercase/Lowercase, Free-Standing/Initial-Form/Medial-Form/Final-Form, Subscript/Superscript, Simplified-Form/Variation-of-Form/Traditional-Form (CJK) and so on.

4.3.4. Character property classification

Character property classification (e.g., alphabetic characters, numeric characters, and special characters in Latin alphabetic character sets; Hanja and Hangul in Korean character set) differs.

4.3.5. Hyphenation of words, Spacing/Punctuation in text

Hyphenation of words is applicable to some natural languages (e.g., English), while it is not applicable to other natural languages (e.g., Chinese). The ways of hyphenating words differ from one natural language to another. Also, rules for spacing the words (no word spacing is needed for Japanese) and punctuation rules/marks are different.

4.3.6. Word representation of numbers

Word representation of numbers may be different even though the number formatting is the same.

4.3.7. Messages and dialogs

Natural languages may be used for computer-human dialog. The ways of presenting headings, prompts, error messages, and warnings differ among the national languages used.

4.3.8.
Documentation

Documentation (e.g., user manuals) should be provided in the users' natural language.

4.3.9. Character(Glyph) size, line size, and line spacing

Printed/displayed character size, line size, and line spacing differ among cultures and scripts (e.g., the Han script is normally wider than the Latin script).

4.3.10 Preferred Font style

Preferred font styles differ among cultures, even for the same glyph. For example, Chinese has a more "brush writing" flavor than does Japanese. An unfamiliar style may give the reader a strong foreign impression.

4.3.11 Writing directions

Writing direction (e.g., embedded left-to-right numbers in right-to-left Arabic text) is language- and culture-dependent. Writing direction differences can also have an impact on the usage of mirrored characters (e.g. open/close parenthesis vs. left/right parenthesis).

4.3.12 Voice message

Some applications, such as television news programs, may need to have voice messages translated to the user's natural language. Others, such as music, should not be translated.

4.3.13 Date and time calendar

Presentations of date, time, and calendar are culture-dependent (e.g., the sequence of presenting the day, month, and year), and different presentations can be used in a single culture (e.g., 09/18/90 and September 18, 1990, 2:00pm and 14:00). Some cultures need an era name for the year and some do not. Also, some cultures still use the lunar calendar.

4.3.14 Currency

The currency symbol can be presented at the beginning (e.g., $15.23 in the US), in the middle (e.g., 15$23 in Portugal) or at the end (e.g., 15,23F in France). Also, currency signs, monetary field size, formatting, etc., are different.

4.3.15 Price expression

On top of the currency presentation, price expressions are different in some cases (e.g. $ 123.45++ means tax and service charge not included in some places).

4.3.16 Number formatting

The presentation of numbers is culture-dependent (e.g., 99,999.99 in one place and 99.999,99 in another).
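The two presentations quoted in 4.3.16 can be produced from one value by treating the separators as culture data rather than as fixed characters. The following is a minimal sketch; the function name format_number and its parameters are hypothetical, and a real system would take the separators from its locale facilities instead of passing them by hand.

```python
def format_number(value, group_sep=",", decimal_sep=".", places=2):
    """Format value with caller-supplied grouping and decimal separators."""
    # Start from a neutral form that always uses ',' and '.' ...
    neutral = f"{value:,.{places}f}"
    # ... then substitute both separators in a single pass, so that
    # swapping ',' and '.' does not overwrite one with the other.
    table = str.maketrans({",": group_sep, ".": decimal_sep})
    return neutral.translate(table)


if __name__ == "__main__":
    print(format_number(99999.99))                               # 99,999.99
    print(format_number(99999.99, group_sep=".", decimal_sep=","))  # 99.999,99
```

The same number of digits is shown in both cases; only the meaning of the comma and the full stop is exchanged, which is why an unlabeled value such as 1,234 is ambiguous between cultures.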
4.3.17 Number Rounding

The way in which numbers are expected to be rounded in format conversion for presentation, when reducing the number of places after the decimal point, is culture-dependent: some cultures expect truncation, others rounding, and sometimes the action differs depending on whether the value is positive or not.

4.3.18 Mathematical symbols

Mathematical symbols (in everyday use) are different in some cultures (e.g. a horizontal bar with a dot above and a dot below is the division symbol in most cultures, but it means minus in Denmark).

4.3.19 Telephone number formatting

The presentation of telephone numbers varies from country to country. Also, the same telephone number has different formats depending on whether it is international, domestic, long-distance or local. For example, (5432)-9876 is the local number in Tokyo, 03-(5432)-9876 is the number from within Japan, but outside of Tokyo, and +81 3 5432 9876 is the international number.

4.3.20 Postal address formatting

Presentations of postal addresses vary across countries (e.g. the state-town-street sequence in China and the street-town-county sequence in the UK).

4.3.21 Measurement systems

Measurement systems (e.g., distance, weight, speed) used are culture-dependent. Moreover, most cultures have both a modern measurement system and traditional units.

4.3.22 Icons and symbols

Icons and standard symbols are different depending on the country and culture (e.g., the U.S. icon for a trash can does not look like a trash can to Japanese users).

4.3.23 Use of color

The use of color differs depending on the culture (e.g., white dress is for the dead in some Asian cultures).

4.3.24 Paper size

There are several standard paper sizes depending upon the culture (e.g., ISO standard A4 size and North-American letter size).

4.3.25. Input mechanism

In some natural languages, two or more methods for entering characters (e.g., Kana to Kanji conversion in Japanese) are available, and are selected by users based on their preferences.
Preferred input methods differ from culture to culture as well.

4.3.26 Message length

The space required to store messages differs by language and character set; the structure of sentences and the order of words also differ (e.g. the equivalent of "open file" in some languages is "file open").

4.3.27 Spelling

The spelling of the same word may differ from culture to culture (e.g. Center vs. Centre, Color vs. Colour).

4.3.28 Function names

There are cases in which the words of a natural language are used as function names. The names may be required to carry the meaning within each culture, and must be appropriately translated.

4.3.29 Page Layout

Special page layouts for documents (mainly for legal use) are required in some cultures; center-folded, double-sided binding (Fukurotoji) in Japan is an example of such a layout requirement. Business letter format is also in this category.

4.3.30 Legal/Regulatory requirements

Each country has its own regulatory/legal requirements - they are not necessarily the same from country to country.

4.3.31 Taboo words

Each culture has its own taboo words which are of no significance in other cultures.

4.3.32 Person's title

Methods of addressing a person differ from culture to culture.

5. Models of internationalization

Today, there exist different, inconsistent and/or incompatible internationalized implementations of information systems. A given implementation may be appropriate for a given user, but not all solutions are necessarily useful for those who are outside that particular environment. In other words, each internationalization method has some merit, and it is not possible to select one, and only one, solution which meets all of the requirements of every user and application.

While these different solutions may all have good reasons for existing, the goal of internationalization is to minimize diversity and unify the different approaches as much as possible.
So, these useful but different solutions carry with them the risk of an inadequate solution being selected for a target application. It is very difficult to select the right solution from functional specifications. Furthermore, it is necessary to understand the relationships between different models in order to select the correct internationalization solution. More importantly, standard providers need to understand the differences between models so that each standard specifies the most adequate internationalization functionalities, and/or selects functionalities under the uniformity requirement. It is therefore very important that the standardization body be organized in such a way as to take into account each of the different approaches/models for internationalization.

This section describes the different internationalization models. As is the case with internationalization itself, it is very difficult to explain the models from a single point of view. Therefore, this technical report describes the models from three aspects as follows:

- Surface Functionality Models
- Application Models
- Architecture Models

The Surface Functionality Models present a conceptual positioning of internationalization within information technology, and describe all possible requirement combinations from the surface of information systems (which drives different models). The Application Models group internationalization solutions (mostly still functionality viewed from the surface). The Architecture Models present a model of the structure(s) of computer systems to accommodate internationalization requirements.

This section also describes historical differences in incorporating internationalization requirements into information systems. This historical classification may provide readers of this document with a bird's eye view of existing solutions.
5.1 The Functional aspect Models of Internationalization
At the beginning stages of planning the conceptual model for internationalization, we need to consider several independent aspects of information processing systems. The following aspects are important for internationalization:

- Coded character set independence
- Language independence
- Culture independence

The following aspects are also raised by internationalization experts:

- Implementation Approach Differences
- Culture & Custom and Language Coverage degree differences
- Necessary Cultural Elements Differences
- System Development/Maintenance process Differences
- Multilingual and Multicultural Environment

In most cases, the requirements for the above "Differences" are independent of each other. Thus, many combinations of these "Differences" are possible.

5.1.1 Ternary Tree for Surface Functionality Models of Internationalization
Consolidation of all the discussions concerning internationalization has led to the ternary tree model shown in Figure 4 for the conceptualization and aspects of the surface functionality requirements of internationalization. Note that most classifications and explanations can be made on the basis of a three-dimensional or three-axis world.
---- The following ASCII picture is for the flat file version; the hard copy version has cubic pictures ----

Functionalities of information technology -+
                                           |
+-----------<-------------<----------------+
|
|            +---- Generic functionality aspect
+------------+---- Social aspect
             +---- Cultural aspect ---------+
                                            |
+----------<---------------<----------------+
|
|      +---- Application-oriented culture
+-->---+---- Organization-oriented culture
       +---- Region-oriented culture (internationalization)--+
                                                             |
+-----------<-----------<------------<-----------------------+
|
|       +--- Implementation approaches ----------------> to more detail
+---->--+--- System structure and Maintenance approach-> to more detail
        +--- Support Degree (of elements)------------+
                                                     |
+-------------<-----------<-------------<------------+
|
|               +--- Cultural & Custom elements
+----->---------+--- Target cultures ----------------> to more detail
                +--- Scripts

Figure 4 Ternary Tree for Surface Functionality Model

5.1.1.1 Cultural aspect (and Generic functionality/Social aspect)
The functionalities of information technologies can be divided into three categories as described in Figure 5:

Functionality of information technology----+
                                           |
+-----------<-------------<----------------+
|
|            +---- Generic functionality aspect
+------------+---- Social aspect
             +---- Cultural aspect --------->

Figure 5 Functionality cubic model of information technology

1) "Generic functionalities",
2) "Social-oriented functionalities" and,
3) "Cultural-oriented functionalities"

The functionalities of any information system in practical use can be categorized as above, regardless of whether the provider(s) of the system(s) intended this or not. The generic-type functionality is a basic and common functionality that provides the primary contribution to users. Generic functionalities on their own cannot operate with maximum efficiency once systems are installed for real-life use.
They need to work as part of a whole with functionalities such as security and maintenance, which are social-type functionalities. But in order to achieve "user friendliness" so that people can use the system, a third category of independent functionality needs to be recognized, that is, functionalities which are cultural in nature. Internationalization mostly deals with a part of the cultural aspect functionalities.

5.1.1.2 Region-Oriented culture (and Application/Organization-oriented culture)
There are several kinds of categories that can be considered within the cultural aspect. Among them, Application, Organization and Region-oriented cultures are the most important categories (Figure 6).

                   Cultural aspect ---------+
                                            |
+----------<---------------<----------------+
|
|      +---- Application-oriented culture
+-->---+---- Organization-oriented culture
       +---- Region-oriented culture (internationalization)-->

Figure 6 Cultural aspect cubic model

Each application field needs a user interface whose efficiency is optimized for maximum usability in that target field. For example, generic graphic software cannot satisfy the mechanical drawing application user even if its base technologies are the same or similar. This fact demonstrates the need for customization for an application-oriented culture. Normally, the adaptation of software to specific application fields is accomplished by designing software for the application field, and/or achieved through customization of generic functions on the part of each customer.

The application field is not the only catalyst for customization. Even if the application field is the same, each organization (such as military, manufacturing companies and trade firms) also has its own customs and terms. In order to achieve user friendliness in each organization, it is necessary to adapt to those organizational differences.
Historically, the work necessitated by this type of adaptation has been done by in-house software specialists within each organization.

The geographical culture environment involved also needs to be taken into consideration. Computer systems need to adapt to these types of cultural requirements in order to maximize user friendliness in each geographical area. If the user encounters elements indicating geographical references that are not his/her own, it is clear that the product has not been internationalized and does not meet his/her needs. On the other hand, for information system provider(s), the functions that depend on their own geographical culture seem so natural that the provider(s) pay far less attention to other geographical cultures than to the other categories of culture (and thus, providers have hard-coded their own culture). Special attention should be paid to supporting geographical cultures in order to make the system usable internationally.

For our purposes, internationalization will be considered to be the provision of a service that accommodates regional/country/geographical cultural differences. This technical report deals with this aspect of product adaptation only, and will not address application/organization-oriented culture adaptation.

5.1.1.3 Degree of coverage (and implementation approach, Cultural Elements)
Support for region-oriented culture (internationalization) involves three separate aspects: 1) differences in implementation approaches, 2) system structure and maintenance approaches, and 3) the degree of support for internationalization elements, as shown in Figure 7. Different requirements for each aspect lead to different internationalization models.
             Region-oriented culture (internationalization)--+
                                                             |
+-----------<-----------<------------<-----------------------+
|
|       +--- Implementation approaches
+---->--+--- System structure and Maintenance approaches
        +--- Support Degree (for cultural elements) --------->

Figure 7 Region oriented culture cubic model

Even if surface functionalities for users are the same, there are a variety of ways to provide the functionalities for real use. The internal differences may harbor potential problems for the future. Implementation differences are not only the result of the involvement of different developers, but are also due to historical reasons (starting from local customization as explained before). Internationalization started as the modification of hard-coded cultural dependencies, and now the internationalization/localization model is recommended. The differences in implementation approaches are therefore discussed in the historical differences section.

Similar to the difference in implementation approaches, there are differences in the system structure used to support international requirements. In most cases, the requirements can be organized into a layered structure, so that changing the data contents in one of the layers can fulfill the needs. But one can freely invent one's own layered structure. In addition, given that a system provider belongs to only one culture, and never to multiple cultures, the maintenance of a localized system by the original system provider is another concern. The maintainability (and also re-localizability) of the system depends highly on the structure of the system. The structure of the system is discussed in detail in a different section.

Even if the implementation approaches, system structures and support mechanisms for cultural elements are the same or similar, the requirements for the degree of support for cultural elements differ. The result may be a totally different definition of internationalization.
5.1.1.4 Target Culture, Script and Cultural elements

             Support Degree -------------------------+
                                                     |
+-------------<-----------<-------------<------------+
|
|               +--- Cultural & Custom elements
+----->---------+--- Target cultures
                +--- Scripts

Figure 8 Support Degree cubic model

The degree of support for each internationalization element is a key to the differences between internationalization models, as are the implementation approaches and system structures described earlier. The degree of support has three independent aspects: the degree of support for cultural and custom elements, target cultures and scripts, as shown in Figure 8. Even if the target culture for internationalization is the same, support requirements for scripts might differ from user to user. Details of the degree of support are described in a separate sub-section.

5.2 Three-dimensional aspects of degree of support requirements
The three aspects of the degree of support described in section 5.1.1.4 are used to make up the three-dimensional picture of internationalization requirements. A cube illustrating the three aspects provides a useful tool for explaining the different internationalization models up to the present day (in surface functionality models).

5.2.1 Scripts aspect
As described in section 4, the use of different scripts is one of the key differences between cultures. At one time, there was an assumption that each language had its own script to provide its expression. This is not necessarily so - in many cases, several languages share one script. In the modern world, cross-cultural communication is on the rise and as a result, multiple scripts are sometimes necessary to represent an idea even in one language. The existence of multiple scripts forms an integral part of internationalization, and it is not sufficient to simply list them (see section 5.1.1.3); the question is "How many scripts need to be supported?". This is one important difference determining different models for internationalization.
5.2.2 Cultural element aspect
In addition to the question of different scripts, there are many cultural elements that must be supported in order to achieve optimum user friendliness. Each of these cultural variables plays a role when developing an internationalization model. "Which cultural elements are to be supported?" is another aspect of the degree of support question.

5.2.3 Target culture aspect
Once the necessary scripts and cultural elements are defined, the question becomes "To which cultures do these apply?". Even if the selected scripts and cultural elements are identical, they behave differently according to the culture in question. Therefore, it is necessary to define the target culture(s) independently of the scripts and cultural elements selected.

5.2.3.1 Country and Culture
"Once the country name is defined, the necessary script and cultural elements are automatically defined". This old-fashioned model is no longer valid. Multiple cultures and languages can co-exist within one country. Thus, identifying the culture(s) is far more important than identifying the country, and this principle is also to be applied to "language": "Even if the language name is defined, it does NOT necessarily mean that the script and cultural elements are automatically defined". However, some countries (and languages) have (and will have) national "de facto" behavior standards for script and cultural elements. In this case, the country name would still be an important parameter.

Note: the separation of the scripts and cultural elements from countries and/or languages is one of the key messages of this technical report.

5.2.3.2 Mono- and Multi-culture
The discussion provided in section 5.2.3.1 gives an idea of monocultural requirements and multicultural requirements. Some requirements apply to a single culture and fall under the category of monocultural requirements, while others may request a solution for multiple cultures, which is called a multicultural solution.
Discussions relating to questions of mono- (or bi-) lingual solutions vs. multilingual solutions used to deal with scripts and target cultures at the same time. This technical report, however, makes a clear distinction between these two elements. Under this heading, multilingual requirements can be separated into mono-culture/multi-script, multi-culture/multi-script and multi-culture/mono-script solutions.

5.2.3.3 Global uniformity
Requirements for multiple solutions for the same cultural element invite the opposite idea of "global uniformity". From a target culture point of view, this "global uniformity" is to be classified as one of many target cultures (one which uses international standards).

5.2.3.4 Cross-cultural friendliness
In a sense, the "Cross-cultural friendliness" concept is one special kind of the "Global uniformity" concept, and so it is also categorized as one of the target cultures.

5.2.4 Three-dimensional definition of internationalization models
By using the support degree cube, it is possible to explain most internationalization models. All kinds of scripts are plotted on one axis of the cube, cultural elements on the second axis, and target cultures, including the "Global uniformity" and "Cross-cultural friendliness" principles, on the third.

Thus, a traditional monolingual model can be defined as one with a selected culture, script and necessary cultural elements. Therefore, the internationalization of monolingualism can be defined as "High mobility of moving cross-sections" of selected scripts and target cultures, and multilingualism can be defined as a plane or sub-cube within the cube. For example, if several scripts are needed for some culture, the model can be defined on the cube by picking the necessary scripts, the culture, and the cultural elements, which together make one plane within the cube.
If multiple cultures are necessary, the target cultures, scripts and cultural elements would make small sub-cubes within the cube.

5.2.5 Implementation history
--- To be filled in later ---

5.3 Physical application models
The surface functionality model covers almost all possible requirement combinations for internationalization. However, because it tries to cover such a wide range of requirements, it may be easier for the reader to have a separate section explaining the typical internationalization requirements. This section is the "reader friendly explanation of typical internationalization models".

In order to organize the services in support of typical models of internationalization, five typical application models are defined:

- Culture specific
- Mono-lingual
- Dual-script
- Consecutive multilingual
- Concurrent multilingual

When dealing with an application which can migrate to one of these models, it will provide a range of services to assist in the internationalization of the application. Note that this is not a classification of all applications.

Culture specific
Assume that the first target market is a market in which some functionalities are not necessarily embedded into the original application, functionalities which are actually missing but are required by another cultural environment, and functionalities which inherently have to do with one or more particular cultures. These are some of the factors that result in reinvented/redundant type implementations when adapting to other cultural requirements, and with some work the application can be evolved to eliminate them. Other types of cultural dependencies might be grammar checks for a specific language or software which translates one language into another. An application which has inherent cultural dependencies is called culture-specific. It would not make sense to localize a culture-specific application to a different culture.
Instead, one would provide a different application with perhaps an analogous functionality.

Mono-lingual
Some applications are inherently monolingual: they can only process text data consisting of a single language. However, the generic applicability of the application might make it desirable to localize it to process data across a wide number of languages.

Dual-script
Sometimes there might be a bilingual need, for example to process ASCII and Catalan simultaneously. For historical reasons, many users want applications, and by extension, platforms to support ASCII and one other arbitrary language. These bilingual applications and platforms are usually implemented by combining the local language's character repertoire with an appropriate subset of a coded character set. This creates, from an electronic information processing perspective, a new and potentially more complex language to process. Bilingual applications are therefore simply a special case of monolingual applications from an architectural perspective, but occasionally it is useful to discuss them separately.

Consecutive multi-lingual
When the implementation of an application treats two or more languages separately and switches between them as it processes its data, this implementation is said to be consecutively multilingual. To facilitate its visualization, think of the data as being separate blocks of text, with each block being in only one language. For example, these applications could use the "setlocale()" function specified by ISO 9899 (C Language) and ISO 9945 (POSIX) to switch between language processing facilities. In a consecutive multilingual application, an operation such as collation could operate on a list of strings within a block of text in only one language.

Concurrent multi-lingual
Concurrently multilingual applications process data which inherently contain several languages.
These data could be intertwined to arbitrary complexity, and they typically require the use of a compound document data structure. The fullest extent of our internationalization vision would have operations such as collation operating on lists of multilingual strings, but the results of such operations are not well defined today.

Specific applications can fall loosely into one of these categories either because of some inherent functionality reason, or for historical implementation reasons. An example of the former might be an application whose essence is to process a particular data structure which is not easily changed to deal with more than one language at a time. Of course any program which deals, explicitly or implicitly, with either the structure of a language or with a particular coded character set will tend to be culture-specific, and considerable re-designing effort may be necessary in order to attempt to rectify this.

5.4 Architectural model
Internationalization can be analyzed not only from the perspective of its surface functional components. System implementations differ depending upon the programming language, other tools and the system environment, the software architecture of the underlying operating system (for example, the relative performance of managing large areas of memory versus processing language dependent data on the spot), the model of the application (such as the decision of where the language dependencies are kept), and the hardware facilities that enable efficient system implementation. This is the reason why architectural modeling for internationalization is necessary along with the surface functionality models.

5.4.1 Interface model
The simplest information system model is made up of the following elements:

- User (Application User and Operator)
- Application Platform
- Application Software

It is necessary to describe the relations and interfaces between these elements.
Usually, there are fixed interfaces between the above elements as follows:

- API (Application Program Interface) - between Application Software and Application Platform
- User Interface - between User and Application Platform/Application Software

Each of the above three elements plays a necessary role in internationalization. And since the user sees user interfaces not only through application software, but also through application platforms, both interfaces must have the necessary services to support the surface functional requirements for internationalization described earlier.

The generic structure of one element is shown in Figure 9.

+-------------------------------------+
| culture dependent specification     |
|                                     |
+-------------------------------------+
| universal specification data        |
|                                     |
 =====================================
|                                     |
| data representation(code etc.)      |
+-------------------------------------+
| culture independent layer           |
|                                     |
+-------------------------------------+
                   |
                   |
=================================================
interface for the other elements or external environments

Figure 9 Generic structure of one element

The specifications of culture dependency should be described by universal data (culture independent form), and the data should be represented in a well-defined form, such as code, font, graphic icon and so on. Once represented, the data become data of a uniform type in culture-independent form, like binary coded data. Once the data are culture independent, it is possible to send them to other elements of the information system or to the external environment outside the information system.
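
As an illustration of the layering in Figure 9, the following minimal Python sketch describes a culture-dependent specification as plain data and then encodes it into bytes that can cross the interface to another element. The field names, the sample values and the choice of JSON in UTF-8 as the data representation are assumptions made for this illustration only; they are not part of the model.

```python
import json

# Hypothetical culture-dependent specification, written as universal
# (culture-independent) data. All field names and values are illustrative.
culture_spec = {
    "culture": "example-culture",
    "decimal_separator": ",",
    "date_order": ["day", "month", "year"],
    "paper_size": "A4",
}


def to_interface_form(spec: dict) -> bytes:
    """Represent the specification in a well-defined coded form (bytes)."""
    return json.dumps(spec, sort_keys=True, ensure_ascii=False).encode("utf-8")


def from_interface_form(payload: bytes) -> dict:
    """Recover the specification on the other side of the interface."""
    return json.loads(payload.decode("utf-8"))


payload = to_interface_form(culture_spec)   # culture-independent byte string
received = from_interface_form(payload)     # as seen by another element
```

The receiving element only needs to know the agreed data representation, not the culture that the specification describes.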
5.4.2 Layered Architecture model
Software internationalization technologies can be divided into four major categories:

- Input services
- Character handling services
- Internal design methodology
- Presentation services

5.4.2.1 Input services
----- To be filled in later -------

5.4.2.2 Character handling services
Character handling services range from the physical representation of data to the interpretation and processing of text.

5.4.2.2.1 Physical layer
The physical layer describes the encoding of data in the computer. Simple encoding techniques such as fixed-length characters allow very low-cost support of textual data when the number of symbols to be represented is small. This is the case with most of the European languages, where each symbol can be encoded in a single byte.

5.4.2.2.2 Logical representation layers
In some cases, the logical representation of symbols can be further encoded in a more efficient scheme to reduce the amount of space required to store the data. For example, compaction algorithms in data communications may be assigned to this level of the internationalization model.

SYNTACTIC LAYER
The syntactic layer describes the logical representation of the characters, allowing the identification of encoded characters in an array of both data and meta-data (information related to the data itself). The syntactic layer is the subject of many standardization efforts currently underway. Known examples of recent developments recommend either a simple scheme where the identification of symbols is done by using a rule external to the data, or the introduction of data tags that differentiate the data elements. For example, both UNICODE and ISO 10646 allow for fixed-length (in bits) encoding of symbols. Character identification tools can identify the symbols in an array by very simple algorithms.
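
The "very simple algorithms" mentioned above can be illustrated with a short Python sketch. It assumes a four-octet fixed-length form (exposed in Python as the "utf-32-be" codec) and a hypothetical helper named nth_symbol. With a fixed length, the n-th symbol is found by offset arithmetic alone, without scanning the preceding data.

```python
SYMBOL_WIDTH = 4  # octets per symbol in the assumed fixed-length form


def nth_symbol(data: bytes, n: int) -> str:
    """Return the n-th symbol (0-based) of a fixed-length encoded array."""
    start = n * SYMBOL_WIDTH
    chunk = data[start:start + SYMBOL_WIDTH]
    if len(chunk) != SYMBOL_WIDTH:
        raise IndexError("symbol index out of range")
    return chunk.decode("utf-32-be")


# Four symbols: LATIN A, A WITH DIAERESIS, and two Kanji.
text = "A\u00e4\u65e5\u672c"
encoded = text.encode("utf-32-be")

symbol_count = len(encoded) // SYMBOL_WIDTH   # no scanning needed
third = nth_symbol(encoded, 2)                # direct access by offset
```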
Encoding schemes based on the ISO 2022 standard (such as Compound Text) may allow a much more compact representation of data when the set of symbols is very large but the set of frequently used symbols is relatively small. The cost of the efficient representation is the complexity and processing time required to handle individual symbols.

SEMANTIC LAYER
The semantic layer deals with mappings between single symbols (identified in the previous layer) and actual characters. "Semantics" is used here to mean the set of rules that deal with the characters. This layer is void when characters are represented by a sequence of a constant number of symbols. For example, ASCII characters are represented by a single symbol, and for any encoding scheme where only ASCII data is represented, the semantic layer is void. In other cases (e.g., coded character sets that support floating diacritics), the character identification layer includes the rules for identifying characters from a set of symbols (basic character shape and diacritics).

5.4.2.3 Internal design methodology
The internal design methodology includes the specification of the language programming interfaces for software designers and developers. Software internationalization specifications such as those proposed by POSIX describe the tools that decouple the application from the specific behavior required by international users.

5.4.2.3.1 System service layer
System services make source code (macros), libraries, commands, and shell programs available to the users of specific operating systems. Examples of these services include the tools to interact with the computer in the user's preferred language, the presentation of numeric and date information according to the customs of the place where the machine operates, and the definition or customization of the behavior of native languages.
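
A minimal sketch of how an application might stay decoupled from such conventions: the application calls one formatting service, and the conventions are looked up at run time from a table of cultural data. The culture names, the table contents and the function format_number are hypothetical and only illustrate the idea; a real platform would supply this data through its own services, such as the POSIX locale facilities.

```python
# Hypothetical cultural data; a real platform would provide this.
CONVENTIONS = {
    "culture-a": {"decimal": ".", "group": ","},
    "culture-b": {"decimal": ",", "group": "."},
}


def format_number(value: float, culture: str, places: int = 2) -> str:
    """Format a number using the conventions selected at run time."""
    conv = CONVENTIONS[culture]
    # Format with neutral separators first, then substitute the
    # separators of the selected culture.
    neutral = f"{value:,.{places}f}"
    integer_part, _, fraction = neutral.partition(".")
    integer_part = integer_part.replace(",", conv["group"])
    if places == 0:
        return integer_part
    return integer_part + conv["decimal"] + fraction


a = format_number(1234567.891, "culture-a")
b = format_number(1234567.891, "culture-b")
```

The application code is identical for both cultures; only the data selected at run time differs.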
5.4.2.3.2 Programming language layer
An alternative approach to software internationalization is to modify the semantics of the programming language constructs to introduce the native language processing variants in the object program. For example, a programming language such as COBOL could support a sentence such as "CULTURE XXXXXX" to indicate run-time selection of language: dates and numbers have to be formatted according to the definition of the specified culture supported by the platform, and comparisons, sub-string determination, indexing of character arrays, and the issuing of messages have to follow the conventions of the language. Note that the support of internationalization at this layer can be implemented by using the system services of the previous layer, but the actual implementation in each case may differ.

5.4.2.4 Presentation services
The presentation services in the architecture include the specific language variants of the product that communicate with the user. These services are related to the software product but they are not subsets or modules of the application. The reason for their inclusion in the architectural model is to recognize the importance assigned to them by the user (the first internationalization aspect required by users is to have localized messages) and to show the relationship between character handling, internal design, and presentation services. There are two distinct layers in the presentation services: the Application functionality layer and the Localization tools layer.

5.4.2.4.1 Application functionality layer
Application functionality includes the definition of the behavior of the language-dependent features for a particular native language. For example, in both the system services and the programming language layers, the internal design components of the architecture define how the programmer can invoke services to perform language-dependent functions such as displaying the date and time.
At the application functionality level, the standards specify that dates are formatted using full month names and the Emperor year when the application runs in a Japanese environment.

5.4.2.4.2 Localization tools layer
Localization is the process of setting the appropriate parameters and translating the messages and related documentation of the product to conform to the native language requirements of the users. Standardization at this level has not been fully developed. Some early attempts in the standardization of the system services include the normalization of the formats of messages (such as indicating that text and parameters in a message could be re-ordered according to the language), but there are still many components open to standardization:

- Program source code analysis tools, in order to determine the areas where the program has to be internationalized. Some of these areas are obvious, such as issuing messages or invoking formatting tools. Others may require more sophisticated techniques, such as identifying the use of bytes either as characters in textual information or just as storage units for non-textual data in the program.
- Computer-aided translation for internationalized documentation and user messages.
- Testing tools for localized products, both for the product itself (to check for erroneous assumptions about the data that the program may encounter) and the localized components.

Since many of the localization tools in the industry are of a proprietary nature and the number of users of these tools is very small (usually, localization is done at centers staffed by professional localizers), there is no critical mass to start standardization activities at this level of the architecture.

Figure 10 represents the software internationalization layered model and shows the dependencies between the individual components of the architecture.
+---------------------------+--------------------------------+
| Localization tools        |                                |
+---------------------------+ Presentation services          |
| Application functionality |                                |
+---------------------------+--------------------------------+
| Programming language      |                                |
+---------------------------+ Internal Design methodology    |
| System services           |                                |
+---------------------------+--------------------------------+
| Character identification  |                                |
+---------------------------+                                |
| Logical representation    | Character handling services    |
+---------------------------+                                |
| Physical                  |                                |
+---------------------------+--------------------------------+
| Data conversion           |                                |
| (Normalization)           | Input services                 |
+---------------------------+                                |
| Physical                  |                                |
+---------------------------+--------------------------------+

Figure 10 Software internationalization layered model

6. Expectations and Obligations
This chapter introduces service requirements that satisfy the needs of programmers in creating applications geared towards the international community of users. The examples included in this chapter show the functionality expected by the programmer using the programming language or operating system environment, and describe in greater detail the internationalization services that need to be supported.

It is anticipated that most programming languages will provide similar services in their native syntax or by accessing platform-provided services, and that the services will have an equivalent behavior for every specific cultural element supported by the programming language. For example, it is anticipated that programming languages able to format numeric values will be able to do so in a manner satisfactory to the users in the supported cultural environments. A proposed extension is the data model for textual data, which has to accommodate character repertoires other than those of the single-byte character model.

Services are described here as enabling technologies.
Programming languages will need to incorporate these techniques to facilitate the communication between the user and the computer by using the user's native language, for example in computer-generated messages, source code literals, comments embedded in the program, or by providing a wider range of characters to name the program identifiers. The diversity of cultures to be supported recommends an implementation strategy based on minimizing the number of versions required. For example, current internationalization standards recommend the dynamic selection of cultural elements at run-time and the support of multiple character encodings. This chapter will also refer to some standards such as ISO/IEC 10646 and POSIX, and to industry proposals such as the X/OPEN Company object-oriented internationalization specifications. These examples should be treated as a base for further discussion and not as an endorsement or mandatory requirement for supporting the described services. 6.1 Service Requirements The service requirements for internationalization are identified in this section. In this context, programming languages can be considered to be applications running on a (HW/SW) platform. The services can be provided by the platform (such as through POSIX interfaces for internationalization to the operating system) or they can be provided as part of the application. Clearly, a solution where these services are provided by the platform for all applications is preferable. Examples of standards which are related to the required internationalization services are identified. 6.2 Character Set and Data Representation Service 6.2.1 User Requirements Dialogues of international users with system platforms or applications in local language require the support of language-specific character sets. For example, German text contains "umlauts" and the "sharp S" - these are characters which do not exist in the English language, but are essential in German. 
Most languages have similar requirements for characters which are not included in the basic ASCII set (American Standard Code for Information Interchange). The written languages can be classified into various groups based on fundamental characteristics as described in xxxx.

6.2.2 Character Set Repository Service

The Character Set Repository Service provides a central character set repository that contains coded character sets and relevant information about them. It supports character set and data representation related services. Entries in the repository may include:
- Code format
- Escapement rules: some languages such as Hebrew or Arabic are written from right to left, while numbers within the text of these languages are written from left to right. It is necessary to maintain this information with the character set information.
- Character set identifier
- Data Classes
- Mapping rules
- Code extension techniques

Some standards related to Character Set Repository Service:
- ISO 2375, 3ed, 1985 : Procedure for Registration of Escape Sequences
- ISO 7350, 2ed, 1990 : Registration of Graphic Character Subrepertoires
- ISO 2022 : Code extension techniques

6.2.3 Character Set Handling Service

The Character Set Handling Service provides the capability to recognize, process, store, retrieve, communicate, and present different character sets.

Some standards related to Character Set Handling Service:
- ISO 8859-1, 1ed, 1987 : Latin Alphabet No. 1
- ISO/IEC 10646-1 : Universal Multiple Octet Coded Character Set
- JIS X0208:1990 : Code for the Japanese graphic character set
- JIS X0201:1976 : Code for information interchange in Japan

6.2.4 Character Set Identification Service

The Character Set Identification Service provides unique identification of character sets. This service allows different character sets to be used concurrently on a system or in an application without the danger of data corruption.
It also provides information for the exchange of data between systems or networks, and the possibility to identify appropriate translation tables between different character encodings. 6.2.5 Data Class Definition Service Characters have different character classes. Since processing often depends on the decision whether or not characters are considered space, numeric, alphabetic, or special characters, a service is provided to identify the class of a character . 6.2.6 Case Mapping Service The Case Mapping Service provides upper case to lower case and lower case to upper case mapping. 6.2.7 Data Presentation Service The Data Presentation Service provides the capability to present data on different display units, printers, or other output devices. According to rules in a repository, the service includes escapement of characters and selection of different shapes. Preparing data for presentation may involve extensive translation and/or transliteration due to hardware selections or limitations. The service also provides default presentation forms for coded characters that have no associated graphic shape. 6.2.8 Data Announcement Service The Data Announcement Service provides the capability to recognize the coded character set of data entities (files, messages, etc.). This capability allows the processing and storage of data in different encoding schemes on the same system without the danger of data corruption. International standards bodies are presently addressing the announcement mechanism. 6.2.9 Data Communication Service The Data Communication Service provides the capability to transmit and receive data to and from communication systems while maintaining the integrity of the data. In international communication environments, this may include data translation due to different coded character sets being used in different service categories. 
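
As a minimal sketch of how the Data Announcement and Data Communication Services described above might cooperate, the following Python fragment attaches a character set label to each data entity and uses that label to translate the data for a receiver that uses a different coded character set. The `AnnouncedData` type, the `transcode` function, and the use of Python codec names (`latin-1` for ISO 8859-1, `utf-8` as one encoding form of ISO/IEC 10646) are assumptions made for illustration only; this report does not define such an interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnouncedData:
    """A data entity that carries the name of its coded character set."""
    charset: str    # the announcement: a Python codec name, e.g. "latin-1"
    payload: bytes  # the encoded data itself


def transcode(data: AnnouncedData, target_charset: str) -> AnnouncedData:
    """Translate the payload to target_charset without changing the text.

    Decoding uses the announced character set, so the bytes are never
    interpreted under a wrong assumption. Encoding raises
    UnicodeEncodeError if the target character set cannot represent a
    character, rather than silently corrupting the data.
    """
    if data.charset == target_charset:
        return data
    text = data.payload.decode(data.charset)
    return AnnouncedData(target_charset, text.encode(target_charset))


# German text stored in ISO 8859-1, sent to a system that expects UTF-8.
original = AnnouncedData("latin-1", "Straße".encode("latin-1"))
converted = transcode(original, "utf-8")
```

The design point is that the announcement travels with the data, so the receiving side never has to guess the encoding from the bytes.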
6.2.10 Data Input Service

The Data Input Service provides support for keyboards with local characters and other complex input methods, especially for Far East ideographic character sets. Potentially, input data can carry character set identification information.

Some standards related to Data Input Service:
- ISO/IEC 9995-1,-2,-3,-4,-5,-6,-7 Keyboard standard

6.2.11 Character Set Invocation Service

The Character Set Invocation Service provides the capability to specify the character set to be used for input, processing, and output of data. This functionality can be invoked through:
- user selection
- default specification
- data announcement techniques
- information about the presentation capabilities of specific output devices

The service will potentially also allow the user to dynamically switch from one character set to another, if required.

6.3 Cultural Elements Service

6.3.1 See chapter xxxx

6.3.2 Cultural Elements Repository Service

The Cultural Elements Repository Service provides the capability to maintain and access rules and conventions for cultural entities. These might be areas with a common language, geographic areas, or areas with a common cultural or historic background. The repository contains information that supports other cultural elements services.

A standard related to Cultural Elements Repository Service:
- ISO/IEC DIS 9945-2 : POSIX shell and utilities

6.3.3 Date Format Service

The presentation of day, month, and year varies in different countries, as do habits of using long or short names for days and months and prefixes in long date formats. For example, in the US, the date is mostly presented as mm/dd/yy, while in Europe the forms dd/mm/yy or yy-mm-dd are commonly used. The 5th of October 2001 can therefore appear in the following easily confused forms: 10/05/01 in the US, and 05/10/01 or 01/10/05 in Europe. Japan counts years by the era of the reigning emperor. The Date Format Service provides the capability to use these formats.
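
The ambiguity described for the Date Format Service can be illustrated with a short Python sketch. The `DATE_FORMATS` table stands in for entries of a cultural elements repository; its keys and the `format_date` function are hypothetical names chosen for this example, and the last entry shows the four-digit-year form of ISO 8601 for comparison.

```python
from datetime import date

# Hypothetical repository entries: convention name -> strftime pattern.
# Only numeric directives are used, so the output does not depend on the
# locale of the system running the example.
DATE_FORMATS = {
    "US": "%m/%d/%y",         # mm/dd/yy
    "EU-dmy": "%d/%m/%y",     # dd/mm/yy
    "EU-ymd": "%y/%m/%d",     # year first, written 01/10/05 in the text
    "ISO-8601": "%Y-%m-%d",   # four-digit year, month, day
}


def format_date(d: date, convention: str) -> str:
    """Present a date according to the named convention."""
    return d.strftime(DATE_FORMATS[convention])


fifth_of_october = date(2001, 10, 5)
for name in DATE_FORMATS:
    print(f"{name:9} {format_date(fifth_of_october, name)}")
```

The same calendar day yields three different two-digit-year strings, each of which another convention would read as a different date; only the form with a four-digit year is unambiguous.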
Some standards related to Date Format Service - ISO 8601, 1ed, 1988 : Representation of Dates and Time - JIS X0301:1977 : Identification Code of Dates in Japan - ISO/IEC DIS 9945-2 : POSIX shell and utilities 6.3.4 Time Format Service While some countries prefer the 12-hour cycle with a.m. or p.m. others use the 24-hour clock. The Time Format Service has the capability to handle these formats as well as world time zones and their offset values relative to UTC. 6.3.5 Day Numbering Service Weeks begin on Monday in certain countries, on Sunday in some other countries, on Saturday in Islamic countries. The day numbering service provides the number of the day. 6.3.6 Week Numbering Service In some applications it is often more convenient to use week numbers for calculations than months and days. The first week in a year is defined differently in various countries. The Week Numbering Service supports these conventions and provides conversion routines. 6.3.7 Numeric Formatting Service Interpretation of numeric fields in unfamiliar formats is one of the major contributors to human errors in data processing. The Numeric Formatting Service provides the capability to handle the different cultural conventions: the point as the decimal delimiter is most commonly used in America; most of Europe uses a comma instead. Spaces or periods are used to separate groups of normally 3 digits. Negative numbers are identified with leading or trailing minus signs and also by surrounding an unsigned value by parenthesis. 6.3.8 Currency Formatting Service The Currency Formatting Service describes the handling of currency fields and symbols: not only the symbols for currencies vary from country to country, but also their placement before, after, or between the integer and the fractional part of the amount. The field lengths and the number of digits after the decimal point depend on the monetary system. Negative amounts are indicated according to local rules and regulation. 
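
The conventions listed for the Numeric Formatting Service (decimal delimiter, digit grouping, and the marking of negative values) can be sketched in Python as follows. The `NumericConvention` type, the four sample conventions, and `format_number` are hypothetical and merely stand in for entries of the cultural convention repository; real repositories hold more detail, such as group sizes other than three digits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericConvention:
    decimal_mark: str      # "." or ","
    group_separator: str   # ",", ".", " " or "" for no grouping
    negative_style: str    # "leading", "trailing" or "parentheses"


def format_number(value: float, conv: NumericConvention, decimals: int = 2) -> str:
    """Format value with groups of three digits under the given convention."""
    # Format the magnitude with "," and "." as neutral placeholders first.
    magnitude = f"{abs(value):,.{decimals}f}"
    integer_part, _, fraction = magnitude.partition(".")
    text = integer_part.replace(",", conv.group_separator)
    if fraction:
        text += conv.decimal_mark + fraction
    if value >= 0:
        return text
    if conv.negative_style == "trailing":
        return text + "-"
    if conv.negative_style == "parentheses":
        return "(" + text + ")"
    return "-" + text


# Sample conventions (assumptions for illustration, not repository data).
POINT_DECIMAL = NumericConvention(".", ",", "leading")
COMMA_DECIMAL = NumericConvention(",", ".", "leading")
SPACE_GROUPS = NumericConvention(",", " ", "trailing")
ACCOUNTING = NumericConvention(".", ",", "parentheses")

print(format_number(1234567.891, POINT_DECIMAL))
print(format_number(1234567.891, COMMA_DECIMAL))
print(format_number(-1234.5, SPACE_GROUPS))
print(format_number(-1234.5, ACCOUNTING))
```

A currency formatter could reuse the same routine and add the symbol placement and number of fractional digits required by the monetary system.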
6.3.9 Measuring System Service Presentation of dimensions in inches, feet, yards, and miles are different from millimeters, centimeters, meters, and kilometers; ounces and pounds convert into grams and kilograms, cups and gallons into liters, and degrees into centigrade. Conversion facilities and country specific presentation are provided by the Measuring System Service, based on the cultural convention repository. 6.3.10 Compare, Sort, and Search Service See annex 1 6.3.11 Paper Format Service This service provides the capability to select various paper formats as defined in the cultural elements repository. 6.3.12 Cultural Elements Invocation Service This service provides the capability to invoke the cultural elements requested by the user, by default from the repository, by the application, or as defined in the user profile. 6.4 Natural Language Support Services 6.4.1 User Requirements and Background Information The use of computers today is no longer limited to the domain of highly trained specialists; they are now a commodity in homes, schools, and businesses, where they must be used by people who do not have significant "data processing" skills. It is impractical to expect that everybody working on a computer will understand English. Instead, the computer must learn to "speak" the local language of the individual user. A service must be provided with the capability to present messages, menus, help information, and online documentation in the language selected by the user, even when more than one language is required in a single document. This service must enable dialogs with applications and operating systems in local languages. Finally, for text processing, the service must include hyphenation, spell checking, and a thesaurus for each language. Only when these facilities are provided can the computer be considered "useful" on a worldwide basis. 
6.4.2 Multi-Lingual Support Services

The Multi-Lingual Support Service provides the capability to support more than one natural language simultaneously. For example, a text processor may work with text in Japanese and French on the same page with synchronized paragraphs.

6.4.3 Message Service

The Message Service provides the capability to present (display, print, etc.) messages, menus, forms, help information, and online documentation in the language selected by the user. Different languages can be used simultaneously. The service maintains independence of the messages from the applications, allows for variable message length (German translations of English messages tend to be 30% longer), has a delivery service to insert parameters into translated messages, and uses the cultural convention repository for format definitions. It also allows users to interact with applications and operating systems in the language of their choice. It allows input in the local language, using local characters, and the parsing of local formats as defined in the cultural elements repository.

6.4.5 Language Selection Service

This service provides the capability for the user to specify the language of interaction with the application. If none is chosen, the default language is selected.

6.5 Internationalization in Fortran - A Possible Approach

The following description indicates one way in which the functionality and services required for support of internationalization might be provided in Fortran. It is not intended to imply that this is a recommended approach; it is simply presented as a form of existence proof to indicate that it would not be difficult to add the necessary functionality.
This example shows one way in which two key issues might be approached, namely the identification and/or specification of an appropriate cultural environment, and the use of that cultural environment to automatically invoke the culturally appropriate form of string comparison to enable an array, or other collection, of textual items to be sorted in the correct order for the environment.

In this approach three new intrinsic procedures provide all the necessary control, and these are briefly described first in a simplified version of the style used in the Fortran Standard (ISO/IEC 1539 : 1991).

(i) CULTURAL_ENVIRONMENT (ELEMENT)

    Description. Returns the processor-dependent code for the cultural environment specified.
    Class. Inquiry function.
    Argument. ELEMENT is optional, but must be scalar and of type default character if present.
    Result Type. Default integer.
    Result Value. If ELEMENT is present then it must represent the name of a cultural environment supported by the processor; the result will be a processor-dependent integer which will identify this cultural environment within this processing system. If ELEMENT is not present the result will be the processor-dependent integer which identifies the cultural environment in which the processor is currently operating.

(ii) REPERTOIRE_KIND (ENVIRONMENT)

    Description. Returns the kind type of the character repertoire associated with the specified cultural environment, or with the current cultural environment if none is specified.
    Class. Inquiry function.
    Argument. ENVIRONMENT is optional, but must be scalar and of type default integer if present.
    Result Type. Default integer.
    Result Value. If ENVIRONMENT is present then it must represent the integer code for a cultural environment supported by the processor; the result will be the kind type of the character repertoire which is associated by default with this cultural environment.
    If ENVIRONMENT is not present the result will be the kind type of the character repertoire which is associated by default with the cultural environment in which the processor is currently operating.

(iii) SET_ENVIRONMENT (ENVIRONMENT,CHAR_KIND)

    Description. Changes the current cultural environment.
    Class. Subroutine.
    Arguments. ENVIRONMENT must be scalar and of type default integer. It specifies the cultural environment to be used for subsequent processing. CHAR_KIND (optional) must be scalar and of type default integer. If present, it specifies the kind type to be used by default for any subsequent character declarations; if absent the kind type associated by default with the environment specified will be used.

An example of the use of these procedures to automatically localise a program to the environment which is current when the program commences execution might be as follows:

      PROGRAM Automatic_Localization
      IMPLICIT NONE

      ! Establish current cultural environment
      INTEGER, PARAMETER :: environment=CULTURAL_ENVIRONMENT(), &
                            ch_kind=REPERTOIRE_KIND()

      ! Character variable declarations
      CHARACTER(KIND=ch_kind,LEN=20) :: string1,string2
      ! etc.
         .
         .
      ! Start of execution
      CALL SET_ENVIRONMENT(environment)   ! This is not strictly
         .                                ! necessary, but is
         .                                ! probably good practice

One of the effects of setting an environment could be to overload the comparison operators between character strings of the character kind specified or implied so as to use the culturally correct comparison algorithm. Thus the statement

      IF (string1 < string2) THEN
         .
         .

which would normally compare two strings character-by-character using the rules specified in the Fortran Standard, would compare them using the correct culturally sensitive algorithm once a SET_ENVIRONMENT statement had been obeyed. A call to a standard sorting routine would then carry out the sorting using this same algorithm without the need to make any adjustment to the sorting procedure at all.
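
The effect described above, an unchanged sorting routine whose result follows the current cultural environment, can be imitated in Python as a rough analogue of the Fortran proposal. Everything in this sketch is an assumption made for illustration: the environment names `"de"` and `"sv"`, the `set_environment` function, the `CulturalString` class, and the two deliberately simplified collation rules (German dictionary order with ä, ö, ü sorted with a, o, u; Swedish order with å, ä, ö placed after z). It is not a complete collation algorithm.

```python
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word):
    # Position of each letter in the Swedish alphabet; letters outside the
    # alphabet raise ValueError in this simplified sketch.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word):
    # Compare on the folded form first; the original breaks ties.
    lowered = word.lower()
    return (lowered.translate(GERMAN_FOLD), lowered)


COLLATION_KEYS = {"de": german_key, "sv": swedish_key}
_current_environment = "de"


def set_environment(name):
    """Select the cultural environment used by later comparisons."""
    global _current_environment
    if name not in COLLATION_KEYS:
        raise ValueError(f"unsupported cultural environment: {name}")
    _current_environment = name


class CulturalString(str):
    """A string whose '<' follows the current cultural environment."""

    def __lt__(self, other):
        key = COLLATION_KEYS[_current_environment]
        return key(self) < key(other)


words = [CulturalString(w) for w in ["zebra", "äpple", "apple", "öl"]]

set_environment("de")
german_order = sorted(words)    # the standard routine, unchanged

set_environment("sv")
swedish_order = sorted(words)   # same call, different environment
```

Because `sorted` relies only on `<`, overloading that single operator is enough to redirect the standard sorting routine, which mirrors the argument made for Fortran above.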
There are many other internationalization functionalities which could be incorporated into Fortran in a similar way with relatively little effort, and with only minimal extensions to the language. It is believed, moreover, that this functionality could initially be added by means of a module, as has been suggested for Fortran's varying length string datatype in CD 1539-2.

Annex A Related activities

---- The contents of this Annex are incomplete; any information to improve this Annex is welcome ----

A.1 International activities

Internationalization is a subject of considerable current interest in information processing, and there are several standardization committees in different organizations which are making proposals for internationalization support within their work programme. The following is a list of some of these activities:

POSIX RIN (Rapporteur Group of Internationalization) : POSIX (SC22 WG15) organized a rapporteur group of internationalization to investigate POSIX related internationalization matters such as the locale mechanism, national profiles, relevant kernel functions and so on.

Character Coding (SC2) : The recent proposal for DIS 10646-1.2 shows the importance of character coding in the internationalization problem.

National Character Support Functions in Programming Languages (SC22) : Programming language specifications need to include functions to support national character sets, and Japan proposed a guideline for introducing national character handling mechanisms in programming languages. For each programming language, the definition of support functions to handle national characters is being investigated.

Database (SC21 WG3) : National character handling mechanisms in database languages like SQL2 are being investigated.

Text and Office System (SC18) : Text related software, especially document processing, is quite national language dependent, and effort is being given to handling national characters in document generation.
Activities in UNIX related industrial standard organizations : UNIX based industrial standard organizations, X/OPEN, UI and OSF also have their own working groups for internationalization to investigate locale switching mechanisms, message mechanisms and so on. In such a category, there are also UNIFORUM, X-Consortium and EurOpen. A.2 Japanese activities (International Standard ) JISC | +-------------------------------------+ | | IPSJ/ITSCJ INTAP (Standard committee) | | | | | +------+-----+-------+-------+------------+ +----+----+ | | | | | | | | SC2 | SC21 | I18N Kanji Std. OSE OSI (Char code) | (Data Base) | (I18N Overall) (CJK JRG) | | (SC22/WG20) | | SC18 SC22 (Text) (Prog Lang) | POSIX (SC22/WG15) (Japanese National Standard) JISC | JISA | INSTAC | +----+----+-----+----+ | EDPS-WG3 ( Vendor consortia ) <- International level | | | | UI J L10N SIG OSF I18N J local SIG | | +---------+--------+ | UI-OSF J L10N Group ( Governmental project ) SIGMA CICC ( User's group ) JUS | | X11 Study Group (Contact point) JISC: (National Standard body & National Body of ISO) Japanese Industrial Standards Committee c/o Standard Department (Secretariat of JISC) Agency of Industrial Science and Technology, Ministry of International Trade and Industry IPSJ/ITSCJ: (Standard committee corresponding to ISO) Information Technology Standards Commission of Japan Information Processing Society of Japan Kikai Shinko Building 3-5-8 Shiba-kouen Minato-ku, Tokyo 105, Japan IPSJ/ITSCJ/SC2: (Character set and information Coding) IPSJ/ITSCJ/SC18: (Text and Office System) IPSJ/ITSCJ/SC21: (Information Retrieval, Transfer and Management for Open Systems Interconnection) IPSJ/ITSCJ/SC22: (Programming Languages and Systems Software Interfaces) IPSJ/ITSCJ/SC22/POSIX: (POSIX) IPSJ/ITSCJ/SWG on Internationalization (Internationalization) IPSJ/ITSCJ/SWG on Kanji standardization (Kanji) JSA: Japanese Standard Association 4-1-24 Akasaka, Minato-ku, Tokyo 107 Japan JSA/INSTAC/EDPS-WG3: Japanese Collation 
and Kana-Kanji Conversion Work Group Information Technology and Standardization Center Japan Standard Association 4-1-24 Akasaka, Minato-ku, Tokyo 107 Japan INTAP: (Secretariate of Asia Oceania Workshop) Interoperability Technology Association for Information Processing, Japan Sumitomo Gaien Bldg. 3F 24 Daikyo-cho, Shinjuku-ku, Tokyo 160 Japan CICC: (Facilitator of Asian Forum of Standardization for Information Technology AFSIT) Center of the International Cooperation for Computerisation Mita 43 Mori Bldg. 3-13-16 Mita, Minato-ku, Tokyo 108 Japan SIGMA: SIGMA system Inc. Akihabara Sanwa Toyo Bldg. 6F 3-16-8, Sotokanda, Chiyoda-ku Tokyo 101, Japan OSF Internationalization Japan Local SIG: c/o Seita Iida IBM Japan, Ltd. 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242 Japan UNIX International Japanese Localization SIG: UNIX International Asia/Pacific Office Shinei Bldg. 1F 2-35 Kameido, Koto-ku, Tokyo 136 Japan UI-OSF Japanese Localization Group c/o Toshinori Numata Fujitsu Ltd. 1015, Kamikodanaka Nakaharaku, Kawasaki-shi, Kanagawa-ken 211 Japan c/o Akio Kido IBM Japan, Ltd. 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242 Japan JUS: Japan UNIX Society Marusyo Bldg. 
5F 3-12 Yotsuya Shinjuku-ku, Tokyo 160 Japan

JUS/X11 Study group: (A study group on X11 Window system)

ANNEX B: RELATED STANDARDS

-------- The following standards are listed as related standards --------
-------- Additions and deletions to be done --------
-------- Full formal and up-to-date names to be used --------

Standards are published by the following organizations:

AFNOR : Association francaise de normalisation
ANSI : American National Standards Institute
ASMO : Arab Standards and Metrology Organization
CAS : China Association for Standards
CCITT : Consultative Committee for International Telegraph and Telephone
CSA : Canadian Standards Association
DIN : Deutsches Institut fuer Normung
ECMA : European Computer Manufacturers Association
FIPS : Federal Information Processing Standards
IEC : International Electrotechnical Commission
ISO : International Organization for Standardization
JIS : Japanese Industrial Standards Committee
NNI : Nederlands Normalisatie-instituut
SFS : Suomen Standardisoimisliitto
SCC : Standards Council of Canada

ANSI BSR X3.134.1 : 8-Bit ASCII Structure and Rules
ANSI BSR X3.134.2 : 8-Bit ASCII Supplemental Multilingual Graphic
ANSI X3.32 : Graphic Representations of the Control Characters
ANSI X3.4 : American National Standard Code for Information
ANSI X3.41 : Code Extension Techniques for Use with ASCII
ANSI X3.64 : Additional Controls for Use with ASCII
ANSI X3.110-1983--CSA T500 : Videotex/Teletex Presentation Level Protocol Syntax
ANSI X4.16 : American National Standard Magnetic Stripe Encoding
ANSI X4.22 : For Office Machines and Supplies-Alphanumeric
ANSI X4.23 : For Office Machines and Supplies-Alphanumeric
ASMO 445 : Bilingual Arabic Latin telex 5-bit code and keyboard
ASMO 449 : 7-Bit coded Arabic character set for information interchange
ASMO 584 : Conversions between ASMO 445 and ASMO 449
ASMO 662 : 8-Bit coded Arabic character set for information interchange
ASMO 663 : Arabic keyboard terminal layout
ASMO 708
: 8-Bit coded Arabic/English character set for imformation interchange CNS 11643 : contains 13.051 most commonly used Chinese characters, which can be CSA Z243.200-1988 : Canadian Keyboard Standard for the English and CSA Z243.4-1985 : 7-Bit and 8-Bit Coded Character Sets for DIN 2137 : German Standard: Office Machines, Alphanumeric Arrangement DIN 66008 : German Standard: Paper Sizes ENV 41 501 : Graphic character repertoire for Videotex systems ENV 41 502 : Graphic character repertoire for Teletex ENV 41 503 Information Systems Interconnection : European graphic character ENV 41 504 Information Systems Interconnection : Character repertoire and ENV 41 505 Information Systems Interconnection : Character repertoire and ENV 41 506 Information Systems Interconnection : Data stream formats compatible ENV 41 507 Information Systems Interconnection : Data stream formats compatible ENV 41 508 Information Systems Interconnection : East European graphic ENV 41 509 : Formatted documents - Basic character content ENV 41 510 : Formatted documents - Extended mixed mode ENV 41 511 : Layout independent documents - Simple messaging profile Processable GB 1988-80 : 7-Bit Coded Character Sets for Imformation Processing and GB 2311-80 : Information Processing-7Bit Coded Character Set-Code Extention GB 2312:1980 : Basic Chinese character set GB 2312-80 : Code for Chinese Graphic Character Set for Informa tion GB 7589:1987 : 7327 additional Chinese characters GB 7590:1987 : 7039 additional Chinese characters GB 8565:1989 : Chinese coded character set for text communication GB 12345:1990 : Complex Chinese characters GB 13000:1990 : Proposal for a unified Han Character Set (HCS) IEC-417 : Graphic Symbols for Use on Equipment IEEE P1003.2 POSIX shell and utilities ISO 639, Names of languages ISO 646, 2ed, 1983 : ISO 7-Bit Coded Character Set ISO 843, Transliteration of Greek to Latin ISO 1000 SI ISO R1090 : Functions Key Symbols for Typrwriters ISO 1091 *@ : Typewriters-Layout of 
Printing and Function Keys ISO 1092 : ISO R1093 : Keytop and Printed or Displayed Symbols for Adding ISO 2014, 1ed, 1976 : Writing of Calendar Dates in All-numeric Form ISO 2022, 3ed, 1986 : 4ed 1993? ISO 7-bit and 8-bit coded character sets - Code ISO 2047:1975, Graphical representations for the control characters of ISO 2126 : Basic Arrangement for the Alphanumeric Section of ISO 2375, 3ed, 1985 : Procedure for Registration of Escape Sequences ISO 2530 : Keyboard for international information processing ISO 3166, Names of countries ISO 3243 : Keyboards for Countries Whose Languages Have Alpha-betic ISO 3244, Principles governing the positioning of control keys on ISO 3307, 1ed, 1975 : Representation of Time of the Day ISO 3461 : Graphic Symbols, General Principles for Presentation ISO 3791, Keyboard layout for numeric applications ISO 4031, 1ed, 1987 : Representation of Local Time Differentials ISO 4062 : Dictation Equipment symbols ISO 4169, Key numbering systems and layout charts ISO 4197 : Office MAchines-Keyboards-Key Numbering System and Layout ISO 4217, 3ed, 1987 : Codes for the Representation of Currencies and funds ISO 4873, 2ed, 1986 : (3ed, 1991?) 8-bit code for information interchange - ISO DIS 4882 : Office Machines and Data Processing Equipment, Line ISO 5426:1983 Extension of the Latin alphabet coded character set ISO 5427:1983 Extension of the Cyrillic alphabet coded character set ISO 5428:1984 Greek alphabet coded character set for bibliographic ISO 6093, 1ed, 1985 : Presentation of Numerical Values ISO DIS 6329 : Symbols for Duplicating and Document Copying Machines ISO 6429, 1ed, 1988 : (1ed 1992?) Control Functions for Coded Character Sets ISO 6438:1984 African coded character set for bibliographic ISO 6630 ? Bibliographic control functions ISO 6861 DIS Cyrillic alphabet coded character sets for Slavonic ISO 6862 DIS Mathematical coded character set ISO 6936, 2ed, 1988 : Conversion between ISO 646 and CCITT ITA 2 ISO 6937, 1992? 
Coded character sets for text communication ISO 6937-1, 1ed, 1983 : Coded Character Sets for Text Communication ISO 6937-2, 1ed, 1983 : Latin alphabetic and non-alphabetic graphic characters ISO 6937-2, 1ed, 1983 : CCS for Text Communication - Latin Characters ISO 7154, Bibliographic filing principles ISO 7350, 2ed, 1990 : (3ed 1991?) Text communication - ISO 8601, 1ed, 1988 [ 2014, 3307 is replaced by 8601 ] --Keld ISO 8613, 1989 : Office Document Architecture and Interchange Format (ODA) -1.2 : Introduction and General Principles -2.2 : Document Structures -3 : Document Processing Reference Model -4.2 : Document Profile -5.2 : Office Document Interchange Format -6.2 : Character Content Architectures -7 : Raster Graphics Content Architectures -8 : Geometric Graphics Content Architectures ISO 8824, 1ed,1987 : Specification of Abstract Syntax Notation One ASN.1 ISO 8825, 1ed, 1987: Specification of Basic Encoding Rules for Abstract Syntax ISO 8859-1:1987 Latin alphabet no. 1 ISO 8859-2:1987 Latin alphabet no. 2 ISO 8859-3:1988 Latin alphabet no. 3 ISO 8859-4:1988 Latin alphabet no. 4 ISO 8859-5:1988 Latin/Cyrillic alphabet ISO 8859-6:1987 Latin/Arabic alphabet ISO 8859-7:1987 Latin/Greek alphabet ISO 8859-8:1988 Latin/Hebrew alphabet ISO 8859-9:1989 Latin alphabet no. 5 ISO 8859-10:1992? ISO 8879 SGML ISO 8884, 1988 : Keyboard Layout for Multiple Latin-alphabet Languages ISO 8957 CD Hebrew coded character set for bibliographic information ISO 9036 :1987, Arabic 7-bit coded character set for information interchange ISO 9069, SGML support facilities ISO 9241 : Ergonomic aspects of hardware and software products. 
ISO 9541 Font Information Interchange ISO 9541-1 DIS Architecture ISO 9541-2 DIS Interchange Format ISO 9945-1, 1ed, 1990 : POSIX - System Application Program Interface ISO CD 9945-2 : POSIX shell and utilities ISO 9995 Keyboard layout for Text and Office Systems ISO 9995-1 CD General principles governing keyboard layouts ISO 9995-2 CD Alphanumeric section ISO 9995-3 CD Common secondary layout of the alphanumeric zone of the ISO 9995-4 DIS Numeric section ISO 9995-5 CD Editing section ISO 9995-6 CD Function section ISO 9995-7 CD Symbols used on keyboards to represent functions ISO 9995-8 CD Allocation of letters to keys of a numeric keypad ISO/IEC DIS 10036 Procedure for registration of glyph and glyph collection ISO/IEC TR 10176 ISO/IEC DTR 10182 Programming language, their Environments and System ISO/IEC DIS 10367 : Repertoire of standardized coded graphic charactersets ISO/IEC 10538 Control functions for text communication ISO/IEC 10646-1 : Universal Multiple-octet coded character set ISO-IR : International Register of Coded Character Sets to be Used with Escape Sequences-Registration JIS X0201:1976 : Code for information interchange in Japan JIS X0202 :.see ISO 2022 JIS X0208:1990 : Code for the Japanese graphic character set information JIS X0211:1991 : Control functions for coded character sets in Japan JIS X0212:1990 : Code of the supplement Japanese graphic character set for JIS X0301:1979 : Identification Code of Dates JIS X0302:1977 : Identification Code of Times JIS X6002:.... 
: Keyboard Layout for Information Processing Using the
KS C 5601:1987 : Korean national character set standard
KS C 5657:1991 : Korean national character set standard
NBR 9612, Rules for key numbering system and keyboard layouts
NBR 9613, Principles for control key positioning
NBR 10346, Minimum configuration for the Brazilian Coding for Information
NBR 10347, Subsets for numeric applications
NI NEN 2294 : Netherlands Standard Keyboard Layout
SFS 3548 : Alphanumeric Keyboard

------ As well as the sequential list above, a categorization under the following items is to be included in this annex ------

Standards related to Character Set Identification Service
Standards related to Font and Glyph
Standards related to Bibliography
Standards related to Data Communication Service
Standards related to Data Input Service
WANTED: Does anyone know the names of the above two ISs? (Editor likes easy life)
Standards related to Character Set Invocation Service
Standards related to Cultural Conventions Repository Service
Standards related to Date Format Service
Standards related to Time Format Service
Standards related to Numeric Formatting Service
Standards related to Currency Formatting Service
Cultural Convention Set Invocation Service
Others

ANNEX C: GLOSSARY

------ The following terms are collected as a glossary ------
------ This is NOT a complete list yet; additions and deletions to be done ------
------ The same definitions as within JTC1 (incl. SC1) to be used.
------------

Accelerator (shortcut key)
Alphabet
Alphabet script
Alphabetic
Application
Application environment profile
Application platform
Application program
Application software
Application specific environment
Architecture (of an information system)
Argument
Argument list
Array
Base character
Base language (or software, or product)
Base message
Bi-directional data
Bi-lingual (system)
Bit combination
Buffer
Build environment
Building
Byte
Cancel
Case sensitive
Character
Character boundary
Character class
Character path
Character position
Character progression
Character set
Character-imaging device
CJK-Unification
Code Element
Code extension
Code point
Code set
Code set independent
Code table
Coded character
Coded character set; code
Coded-character-data-element (CC-data-element)
Collate (see Order)
Collating element
Collating sequence
Collation
Collation table
Combined Character
Command
Composite graphic symbol
Concatenate (to)
Context
Context search
Control Character
Control function
Control sequence
Control string
Cultural elements
Cursor
Cursor movement keys
Data
Data communications
Data processing
Data transparent
Decimal mark
Default
Delete character
Delimiter
Device
Diacritic (Character)
Dialog
Digraph
Display
Downshifting
Editor
Editor function
Environment (of information system)
Equivalence class
Escape sequence
Field
File code
Fixed-length coding
FONT
Formator function
Form-of-use
Full-screen editor
Glossary
Glyph
Glyph image
Grammar file
Grammar rules
Graphic character
Graphic rendition
Graphic symbol
Hangul
Hanja
Han unification
Help
Icon
Ideograph
Implementation:
Information:
Information processing
Information processing system
Information system
Integer
Interchange
Interface
International Standards Organization
International standardized profile (ISP)
Internationalization
Interoperability
Invoke (to)
Isolated character
Kana
Kanji
Keyboard
Language
Latin (Character)
Layer
Lexical analyzer
Lexicon
Life cycle
Ligature
Line
Line editor
Line home position
Line limit position
Literal string
Local customs
Local language
Locale
Locale independent
Localization
Mail
Mail box
Menu
Message
Message catalogue
Metacharacter
Morphology
Multi-lingual (system)
Multiple-octet coded character set
Multiuser mode
National custom
National language
National Language Support (NLS)
Native language
Network
Non-Spacing characters
Octal
Octet
Open system environment
Operation system [To be operating system]
Operator --Rafik
Order
Page
Page home position
Page limit position
Parameter
Parameter Byte
Parser
Pictogram
Portability (inter culture)
Portability (software)
Portability (application)
Portable application (information processing)
Position
POSIX Profile
POSIX National Profile
POSIX National Body Conformance
Present (to) ; presentation
Presentation component
Presentation variant (or presentation form)
Profile (for ISO Standardization)
Program
Prompt
Protocol
Punctuation mark
Re-Engineering
Regular expression
Rendering
Repertoire
Script
Scroll
Semantics
Service
Syllabic (Character)
Software
Standardization
Standards (ISO)
Syllable
Syntax
System software
Tabulation
Territory
Transcribe
Translate
Transliterate
Transparent
Unilingual
Upshifting (downshifting)
User (of an information system)
User
User Interface
User requirements
Variable
Variable-length coding
Variant form
Vowel sign
Window
Workstation

ANNEX D: Examples of Legal Requirements

Most countries have legal requirements for information products. This annex gives examples of such requirements.

------ Some typical legal requirements to be added ------

D.1 National Laws

Canada: Quebec Law 101, Charter of the French Language, August 26, 1977; Canadian Official Language Act of 1969.
Sweden: The Work Environment Act, 1st July 1978.
Venezuela: Consumer Protection Law, Article 10.
ANNEX E: From a requirement to its implementation - Compare, Sort, Search

E.1.0 The case of the Canadian requirement for ordering English and French

One of the most important specifications of cultural elements is the specification of ordering for text data strings. The first normative, comprehensive, fully predictable and culturally valid specification of ordering is Canadian Standard CSA Z243.4.1-1992. It was adopted as a preliminary standard (to be confirmed as a national standard of Canada at the beginning of 1994) after 6 years of work, with input from 7 countries (Canada, France, USA, Belgium, The Netherlands, Germany, Switzerland), to fine-tune a Canadian proposal that grew out of a Quebec government proposal dated 1986.

The Canadian Standard describes collating weights usable at once for dictionary ordering of English, French, German, Dutch, Portuguese and Italian, without excluding other languages that can be handled with slight modifications. The technique assigns four levels of weights that can be used to fine-tune the ordering function and to make the results absolutely predictable while remaining culturally acceptable to a majority of users of those languages. It is based mainly on the ordering rules used in the main dictionaries of the French, English and German languages. These are the primary rules learned at school by all young children and, unlike more sophisticated classification techniques, they are understood by ordinary people and not only by scholars.

E.1.1 Example of cultural requirement

Those rules are essentially the following:

1. The 3 languages agree on a single alphabetic order, from A to Z. Diacritical signs are normally not considered for the purpose of ordering unless there is a tie due to quasi-homography, i.e. words that look the same if diacritics are removed.

2. Ligatures are expanded as if they were written as 2 separate letters (ae, oe, ss [and ij in Dutch, for that matter]).

3. English and German dictionaries state that unaccented words precede accented ones in case of quasi-homography. French dictionaries need more precise rules, as lists of 3 or 4 quasi-homographs (identical letters, accented differently) are quite frequent. In case of quasi-homography, the French rule is that "the last difference in the word determines the order" (which means scanning the words to be compared backward from the end until a difference in accentuation is encountered). So as not to make French a special case, many sources recommended using the French rule to break such ties for all languages, as it is generally recognized that it brings no extra overhead (a stack is as easy to use as a list) and that it is culturally neutral for the other languages.

4. In case of homography on alphabetic characters and diacritics, case becomes significant in determining a difference. English and German dictionaries agree that small letters should precede capital letters. French dictionaries do not make a difference, as they generally use only capital letters for general language words (including accented capital letters, contrary to a widespread belief that accented capital letters are not used in French!). French encyclopedias and proper-name dictionaries tend to put capital letters first, but as there are numerous exceptions and dictionaries are mute on the subject, it was decided to use the English and German rule in the Canadian Standard. This harmonizes the rules without really making the rule culturally incorrect for French (note that Danish dictionaries specifically state that capitals precede small letters).

5. Characters that are not part of the alphabet (spaces, hyphens, apostrophes, asterisks and so on, orthographic or not) are not significant for dictionary ordering. So that ordering is predictable, the Canadian Standard specifies an order for these, but it applies only when the other 3 levels of significance of text data (alphabetic data, diacritics and case) are absolutely identical. This is how the Canadian Standard currently specifies it, even though the normative benchmark contains a list of English words that could be ordered in a more refined way; no native English speaker objected, so it is assumed to be culturally acceptable. "coop, co-op, COOP, CO-OP" is a list sorted correctly according to the Canadian Standard. A refinement could have made "coop, COOP, co-op, CO-OP" the preferred order, but at this point this seems to be a matter of preference and of very fine tuning that could require extra overhead in some environments (5 levels instead of 4). It would, however, seem more consistent for specifically orthographic characters such as hyphens, apostrophes and spaces, and for all languages involved (these characters would have to be processed after the diacritics have been considered but before case, the other specials being processed after case).

For example, the following records are ordered correctly per the Canadian Standard specification. Each record is annotated with the way it differs from the record that precedes it:

   Record        Difference from the preceding record

   COTE          (first record)
   C[o>]te       last difference on 2nd letter
   cot[e']       last difference on 4th letter
   c*o*t*[e']    equal except for specials
   C[o>]t[e']    last difference on 2nd character
   coter         last difference on last characters
   Coter         equal except for case

where [o>] represents the SMALL LATIN LETTER O WITH CIRCUMFLEX ACCENT and [e'] represents the SMALL LATIN LETTER E WITH ACUTE ACCENT.

E.1.2 Example of a specification technique

The Canadian Standard describes this behavior using words (in English and in French), diagrams, and tables.
To simplify the understanding of such tables, consider the following table of relative numbers:

           ALPHA       ACCENTS     CASE        SPECIALS
           1st level   2nd level   3rd level   4th level
   INPUT   token       token       token       token
   CHAR    (serial)    (stacked)   (serial)    (serial)

   c        6           3           1          N/A
   C        6           3           2          N/A
   e        7           3           1          N/A
   [e']     7           4           1          N/A
   E        7           3           2          N/A
   o        8           3           1          N/A
   [o>]     8           5           1          N/A
   O        8           3           2          N/A
   r        9           3           1          N/A
   t       10           3           1          N/A
   T       10           3           2          N/A
   *       N/A         N/A         N/A          1

The Canadian Standard then suggests a conformance algorithm that builds a series of subkeys, composed numerically as shown below for our examples (the numbers are only indicative, and express a relation only among the subset of characters chosen for this example). To avoid composing a fourth subkey with placeholders for all non-special characters, comparison is done on the positions of the special characters and, in case of equality, on the weight assigned to the special character itself. For the algorithm to work when all 4 subkeys are concatenated for a multi-level one-pass sort, a special logical zero delimiter (which could be 1 for C if all other relative numbers are offset by 1!) is coded between the 3rd and 4th subkeys. If certain conditions are not met in the choice of relative numbers, such a logical zero delimiter would be advisable between each pair of subkeys.

   Original     Subkey 1     Subkey 2     Subkey 3     Logical  Subkey 4
   string                                              delim.

   COTE         6,8,10,7     3,3,3,3      2,2,2,2      0
   C[o>]te      6,8,10,7     3,3,5,3      2,1,1,1      0
   cot[e']      6,8,10,7     4,3,3,3      1,1,1,1      0
   c*o*t*[e']   6,8,10,7     4,3,3,3      1,1,1,1      0        2,1,4,1,6,1
   C[o>]t[e']   6,8,10,7     4,3,5,3      2,1,1,1      0
   coter        6,8,10,7,9   3,3,3,3,3    1,1,1,1,1    0
   Coter        6,8,10,7,9   3,3,3,3,3    2,1,1,1,1    0

The Canadian Standard presents a reduction technique to shorten these subkeys without affecting the comparison process. The technique is non-normative; it originated with the Quebec government, which implemented it [designer: Alain LaBonte'], and it is now in the public domain. It is intended for keys that an application stores for later comparison by a dumb process, such as a hardware device able to search on binary sequences or an old unmodifiable "indexed sequential" access method of any kind that orders keys numerically. The net effect is that for most French words and more than 99% of English words no storage is required for the second subkey, as no accent is present, and in certain conditions storage is greatly reduced for the third subkey. The same subkeys, reduced and concatenated into a single key that can be compared numerically in one pass, would be:

   COTE         6,8,10,7,    2,2,2,2, 0
   C[o>]te      6,8,10,7,    3,3,5, 2, 0
   cot[e']      6,8,10,7,    4, 0
   c*o*t*[e']   6,8,10,7,    4, 0, 2,1,4,1,6,1
   C[o>]t[e']   6,8,10,7,    4,3,5, 2, 0
   coter        6,8,10,7,9,  0
   Coter        6,8,10,7,9,  2, 0

One should not implement this reduction technique without carefully looking at the Canadian Standard and its references for caveats in designing other tables. Such an optimizing technique can be used only under certain conditions. However, even without reduction, the principle of forming a single key (out of a multilevel specification) to be passed to old applications that "know" how to sort numerical data strings is highly valid. It is also economically very important, because it supports past applications that can be "internationalized" with little or no modification.
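As a rough illustration of the key-composition principle described in E.1.2, the following Python sketch builds the four subkeys from the indicative table of relative numbers and concatenates them into a single numerically comparable key. The names WEIGHTS, SPECIALS, generate_subkeys and single_key are assumptions made for this illustration only; they come neither from the Canadian Standard nor from any library. The table covers only the characters of the example, and no reduction is attempted.

```python
# Illustrative weights copied from the table of relative numbers in E.1.2.
NONE, ACUTE, CIRCUMFLEX = 3, 4, 5      # 2nd level (accents)
SMALL, CAPITAL = 1, 2                  # 3rd level (case)

# character -> (alphabetic weight, accent weight, case weight)
WEIGHTS = {
    "c": (6, NONE, SMALL),   "C": (6, NONE, CAPITAL),
    "e": (7, NONE, SMALL),   "E": (7, NONE, CAPITAL),
    "\u00e9": (7, ACUTE, SMALL),          # [e']
    "o": (8, NONE, SMALL),   "O": (8, NONE, CAPITAL),
    "\u00f4": (8, CIRCUMFLEX, SMALL),     # [o>]
    "r": (9, NONE, SMALL),
    "t": (10, NONE, SMALL),  "T": (10, NONE, CAPITAL),
}
SPECIALS = {"*": 1}                    # 4th level (specials)
DELIMITER = 0                          # logical zero, below every weight


def generate_subkeys(text):
    """Return the four subkeys (alpha, accents, case, specials) as tuples."""
    alpha, accents, case, specials = [], [], [], []
    for position, char in enumerate(text, start=1):
        if char in SPECIALS:
            # Level 4 records the position of each special, then its weight.
            specials += [position, SPECIALS[char]]
        else:
            a, d, c = WEIGHTS[char]
            alpha.append(a)
            accents.append(d)
            case.append(c)
    accents.reverse()                  # "stacked": the last difference decides
    return tuple(alpha), tuple(accents), tuple(case), tuple(specials)


def single_key(text):
    """Concatenate the subkeys, with the delimiter before subkey 4."""
    alpha, accents, case, specials = generate_subkeys(text)
    return alpha + accents + case + (DELIMITER,) + specials


records = ["Coter", "C\u00f4t\u00e9", "cot\u00e9", "COTE",
           "coter", "c*o*t*\u00e9", "C\u00f4te"]
for record in sorted(records, key=single_key):
    # ascii() keeps the output printable on ASCII-only terminals.
    print(ascii(record), single_key(record))
```

Sorting the seven records this way reproduces the order of the example in E.1.1. The flat key compares correctly here only because the illustrative weights of each level lie below those of the preceding level; with other tables, a delimiter between every pair of subkeys would be the safer choice.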
E.2 Other specification techniques

After the Canadians released their specifications, POSIX defined a model that can handle them in a general way. It is only an abstract specification technique, but it nevertheless does the job adequately. A good recommendation would be not to reinvent the wheel and to use it, as it is so far the only international specification technique for describing collation tables. It does not use relative numbers but rather a clever sequential ordering system that allows multilevel weights to be described without specifying any numerical data. The order can be changed just by inserting lines for additional characters or by swapping lines. Moreover it is, like the Canadian Standard, a codeset-independent specification technique, so it does not require as many specifications as there are equivalent character sets. Hence it is a very flexible technique that allows the handling of a general specification.

Refinements may be made in the future, but it represents, like the Canadian Standard, the state of the art in this domain, and future work is expected to build on this specification technique. To the knowledge of different experts, it can handle most of the languages and scripts of the world without major difficulties. Extensions are conceivable to handle combining sequences, as present in ISO 10646 level 3, for those languages that absolutely require those combinations to be handled to give a culturally valid ordering.

The minimum table describing the previous example according to the POSIX ordering specifications would be (simplified):

   ...
   collating-symbol <SMALL>
   collating-symbol <CAPITAL>       Definitions of symbols that are not
   collating-symbol <NONE>          known as characters but which are needed
   collating-symbol <ACUTE>         to describe relative weights
   collating-symbol <CIRCUMFLEX>
   ...
                                                          |This statement
   order_start forward;backward;forward;forward,position  |describes the
                                                          |scanning direction
                                                          |for each level
                                                          |(even allows
                                                          | position tokens
                                                          | if desired)
   <SMALL>                             |will result in SMALL=1
   <CAPITAL>                           |will result in CAPITAL=2
   <NONE>                              |will result in NONE=3
   <ACUTE>                             |will result in ACUTE=4
   <CIRCUMFLEX>                        |will result in CIRCUMFLEX=5
   <c>    <c>;<NONE>;<SMALL>           |... c=6 and reused for itself
   <e>    <e>;<NONE>;<SMALL>           |... e=7 and reused for itself
   <o>    <o>;<NONE>;<SMALL>           |... o=8 and reused for itself
   <r>    <r>;<NONE>;<SMALL>           |... r=9 and reused for itself
   <t>    <t>;<NONE>;<SMALL>           |... t=10 and reused for itself
   <C>    <c>;<NONE>;<CAPITAL>         | from now on,
   <E>    <e>;<NONE>;<CAPITAL>         | no new value that needs to
   <O>    <o>;<NONE>;<CAPITAL>         | be resolved; numeric weights
   <T>    <t>;<NONE>;<CAPITAL>         | are all already known
   <e'>   <e>;<ACUTE>;<SMALL>
   <o>>   <o>;<CIRCUMFLEX>;<SMALL>
   <*>    IGNORE;IGNORE;IGNORE;<SMALL> |SMALL=1, why not?
                                       |IGNORE means no value assigned
                                       |at each level that specifies it
   ...

E.3 User group requirements and functionality

Interestingly enough, SHARE Europe, having looked at the Canadian specifications, described a series of programming requirements in a White Paper published in Geneva in 1990 and titled "National Language Architecture". The POSIX standards (ISO/IEC 9945-1 and ISO/IEC 9945-2), surprisingly, do not define the functions that could be associated with the specifications of ordering: they refer to the C standard, which is explicit about them, but other languages may or may not implement equivalents of the C functions, in addition to defining new ones. By contrast, SHARE Europe requires the support of a series of functions at the operating system level in order to exploit the specification of ordering to its full potential. Note that if an operating system does not provide those functions, they can be implemented in a common set of library routines available to different programming environments. That is exactly what the Quebec government has done for its data centres, and is about to do for small machines (PCs, Macintoshes, minis, and so on), without any modification to the compilers it uses.
For economic reasons, surprising as it may seem, COBOL has been used for that, in spite of the recommendation to use a more portable programming language. For environments other than the mainframes, other decisions have been taken (C language routines are being developed). Obviously, if some of these functions were implemented in programming language syntax, development of applications would be easier for programmers, possibly with some gains in performance. So far, though, performance has not been significantly affected. Before deciding on such a project, it is reasonable to consider that it is more productive to do things right, with potential productivity gains for the end-users, than to produce lightning-fast garbage that leads to numerous operational mistakes by end-users who cannot retrieve the information they are searching for with "traditional" methods. (It took centuries, if not millennia, to develop universally accepted traditions of ordering for each given script, but only a few years of technology usage to scramble them and create a so-called "collating tradition" that fools programmers themselves, even if they often do not want to admit it.)

E.3.1 SHARE Europe requirements for functions

The following requirements have been addressed by SHARE Europe in the above-mentioned White Paper on National Language Architecture:

E.3.1.1 Extended key generation

Given a string and the identification of its coding, a function should exist to return the 4 subkeys of the Canadian specification. (Note: this could be generalized to N levels instead of 4, possibly returning the number of levels and a table of dimension N holding the N subkeys.)

E.3.1.2 Original key regeneration

Given the N subkeys generated by the previous function, regenerate the original string in the coding specified. (This coding could be different from the original one, but the regenerated character string would be functionally equivalent to the original from the user's point of view.)
This supposes, of course, that the system of tables used to generate the extended subkeys is known to the underlying process. Since all the information is contained in the extended subkeys (principle of absolute predictability), this has been shown to be possible; it was later implemented by the Quebec government, for the needs of the Canadian specification only.

E.3.1.3 Comparison operation

Given 2 character strings on input and their coding, or the N subkeys, return the following information:

   Case 1:   The 2 strings are absolutely equal (ex. "ABC"="ABC");
   Case 2:   The 2 strings are equivalent up to the level N of comparison;
   Case 2.1: Canadian spec (ex. "COTE"=="Cot[e']" at level 1 only);
   Case 2.2: Canadian spec (ex. "cote"=="Cote" up to level 2);
   Case 2.3: Canadian spec (ex. "c*o*t*e"=="cote" up to level 3);
   Case 3:   String 1 comes before string 2 in order (ex. "Cote"<"cot[e']");
   Case 4:   String 1 comes after string 2 in order (ex. "cot[e']">"COTE");
   Case 5:   Fuzzy match: "Phydeault" ~ "FIDO" for French (snobbish dogs obviously write their names using the first spelling!).

Notes:

Case 2 is a rephrasing of the SHARE Europe requirement. The original requirement specifies on input what kind of equivalence is accepted (strings differing only because of specials, or because of case, or because of diacritics), and the answer indicates equality only for that kind of equivalence when absolute equality is not returned. This is better generalized to N levels with this respecification.

Case 5 requires algorithmic fuzzy pattern matching functions that go beyond economical development in most environments, because they require expert system technology which, for example, "knows" the exact phonetic environment in which it is applied: phonetic equivalents in Cockney English for certain populations of London, which are not valid elsewhere, foreign accent biases on the language, and so on. Such a function was commercially available at the time of development (for pedagogic applications teaching French to young children of different origins), covering the Quebec accent in French and the different foreign accents commonly encountered in Montreal [including what Quebecers call the "French accent"]. For the Quebec government, however, buying it would have more than tripled the cost of implementing the basic functions mentioned above. In this particular case it was decided not to implement this last requirement, which is useful for a police department but generally not for most commercial applications.

E.3.1.4 Coding conversion

Given a string and its coding, and the identification of a target coding, return an equivalent string in the target coding.

E.3.1.5 Sort

Given a list of strings, perform an internal sort using the comparison operation described above to obtain consistent results. For an external sort, the same function should be used to obtain the same consistency of operation.

E.3.1.6 Merge

Given two lists assumed to be sorted according to the previous function, merge the two lists into one using the same comparison operation described above.
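The comparison, sort and merge requirements above can be sketched in Python as follows. This is only an illustration under stated assumptions: the weight table repeats the indicative numbers of E.1.2, and the names subkeys, compare, cultural_sort and cultural_merge are hypothetical, not functions defined by SHARE Europe, POSIX or the Canadian Standard. Here compare returns the order of the two strings together with the number of leading levels on which they are equivalent; the fuzzy match of Case 5 is not attempted.

```python
from functools import cmp_to_key
from heapq import merge

NONE, ACUTE, CIRCUMFLEX = 3, 4, 5      # accent weights (level 2)
SMALL, CAPITAL = 1, 2                  # case weights (level 3)
WEIGHTS = {                            # char -> (alpha, accent, case)
    "c": (6, NONE, SMALL),   "C": (6, NONE, CAPITAL),
    "e": (7, NONE, SMALL),   "E": (7, NONE, CAPITAL),
    "\u00e9": (7, ACUTE, SMALL),          # [e']
    "o": (8, NONE, SMALL),   "O": (8, NONE, CAPITAL),
    "\u00f4": (8, CIRCUMFLEX, SMALL),     # [o>]
    "r": (9, NONE, SMALL),
    "t": (10, NONE, SMALL),  "T": (10, NONE, CAPITAL),
}
SPECIALS = {"*": 1}                    # special-character weights (level 4)


def subkeys(text):
    """Hypothetical key generation: one tuple per level, accents reversed."""
    letters = [WEIGHTS[ch] for ch in text if ch not in SPECIALS]
    alpha = tuple(w[0] for w in letters)
    accents = tuple(w[1] for w in reversed(letters))
    case = tuple(w[2] for w in letters)
    # Each special contributes its position, then its weight.
    specials = tuple(n for pos, ch in enumerate(text, start=1)
                     if ch in SPECIALS for n in (pos, SPECIALS[ch]))
    return alpha, accents, case, specials


def compare(s1, s2):
    """Return (order, levels): order is -1, 0 or 1; levels counts the
    leading levels on which the two strings are equivalent (0 to 4)."""
    for level, (k1, k2) in enumerate(zip(subkeys(s1), subkeys(s2))):
        if k1 != k2:
            return (-1 if k1 < k2 else 1), level
    return 0, 4


def order_only(s1, s2):
    """Three-way comparison usable by sort and merge."""
    return compare(s1, s2)[0]


def cultural_sort(strings):
    """Internal sort driven by the comparison operation."""
    return sorted(strings, key=cmp_to_key(order_only))


def cultural_merge(sorted_a, sorted_b):
    """Merge two lists already sorted with the same comparison."""
    return list(merge(sorted_a, sorted_b, key=cmp_to_key(order_only)))


print(compare("COTE", "Cot\u00e9"))    # (-1, 1): equivalent at level 1 only
print(compare("cote", "Cote"))         # (-1, 2): equivalent up to level 2
print(compare("c*o*t*e", "cote"))      # (1, 3): equivalent up to level 3
```

Because sort and merge both go through the same comparison function, they cannot disagree on the order, which is the consistency that E.3.1.5 and E.3.1.6 ask for.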
E.3.1.7 Substring Search

Given two strings, search for an occurrence of the second one in the first one, with parameters indicating what level of equivalence is acceptable. Return the offset of the retrieved string and its length (which can differ from the length of the string searched for if equivalences are encountered, for example if ligatures are equivalent to separate letters in a given specification).

E.3.1.8 Conversion to upper case unaccented data

Given rich text data including accented/unaccented lower/upper-case data, return the "traditional" unaccented upper-case equivalent (see section E.4 below on how the reciprocal function, unrealistic at first glance, has been implemented, even though it is not so far a requirement in the international community).

E.4 Complementary functions outside the scope of programming languages

For the information of readers, implementing these functions from scratch will be very useful in all new end-user environments (including American sites, some of which are said to have also implemented the Canadian specifications, as this was considered a requirement for solving problems of character data processing in a unilingual English environment). However, it may be interesting to know that old data bases in Quebec, as in Europe in general, have long used unaccented-capital-letters-only data to avoid many of the problems solved by these specifications (not all, though, as the presence of special characters, even if less visible, is an existing problem of varying seriousness). To implement these new functions, mixing upper-case, lower-case, accented and unaccented data is necessary for talking with external sites. The requirements of SHARE Europe allow such mixing.
Furthermore, the Quebec government also implemented automatic functions to add accents and lower case to existing person data and geographical name data that were previously unaccented, to avoid having to retype that information in huge data bases. The accuracy is 99.7%; the remaining cases require human intervention because they cannot be resolved by automatic means (homographs like "Masse" and "Mass[e']", 2 existing family names). These are problems for which no requirement is likely to be addressed to designers of programming language standards, but they are nevertheless real and solvable in existing environments. This shows that it is also possible to deal with the past without having to start from scratch, a prospect that would otherwise rule out any action.

E.5 Consequences of embedding these functions in languages

The basic functions mentioned previously can be implemented using present tools or with extensions of languages. The latter would be highly desirable: the resulting programs can be designed to be portable across cultures, with the behavior of the functions provided parametrically outside of the language, and with the functionality fully provided by the language so that it interfaces directly with those external specifications while meeting performance goals.

E.6 Specific language implementation examples

E.6.1 Fortran

The following example indicates how a Fortran module might be used to implement culturally sensitive string comparison, using the approach outlined above. This module assumes the existence of the intrinsic functions suggested in section 6.5, together with two additional intrinsic functions GENERATE_KEYS and COMPARE_STRINGS which provide the functions described in E.3.1.1 and E.3.1.3, respectively.

   MODULE Cultural_Strings
      IMPLICIT NONE
      PRIVATE

      ! This module provides the necessary services to support the
      ! character handling requirements in a particular model of
      ! internationalization and localization.
      ! As written here it operates automatically in the cultural
      ! environment which is current when the program begins
      ! execution. It could be extended to allow for a user-specified
      ! cultural environment.

      ! Establish current cultural environment and character kind
      INTEGER, PARAMETER :: environment=CULTURAL_ENVIRONMENT(), &
                            ch_kind=REPERTOIRE_KIND()

      ! Specify overloaded comparison operators. The functions are
      ! module procedures defined after CONTAINS, so they are named
      ! here rather than redeclared in interface bodies.
      INTERFACE OPERATOR ( < )
         MODULE PROCEDURE cultural_lt
      END INTERFACE

      INTERFACE OPERATOR ( <= )
         MODULE PROCEDURE cultural_le
      END INTERFACE
      .
      .

      ! Specify those entities to be exported from the module
      PUBLIC environment,ch_kind,OPERATOR(<),OPERATOR(<=), ...

   CONTAINS

      LOGICAL FUNCTION cultural_lt (s1,s2)
         CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
         INTEGER :: key1(LEN(s1),4),key2(LEN(s2),4)
         LOGICAL :: comp(5)

         ! The intrinsic function GENERATE_KEYS takes a character
         ! string, and returns the four subkeys of the Canadian
         ! Standard for each character as a rank two integer array
         ! with one row per character and four columns.
         key1 = GENERATE_KEYS(s1)
         key2 = GENERATE_KEYS(s2)

         ! The intrinsic function COMPARE_STRINGS takes two arrays
         ! of integer subkeys and returns a rank one logical array
         ! of dimension 5. Each element of the array value of the
         ! function specifies the truth or otherwise of the
         ! corresponding case in the specification of the comparison
         ! operation.
         comp = COMPARE_STRINGS(key1,key2)

         ! Return result of comparison as true if Case 3 is true (s1
         ! before s2) and Cases 1 and 2 are false (s1 not equal and
         ! not equivalent to s2)
         cultural_lt = comp(3) .AND. .NOT.(comp(1) .OR. comp(2))
      END FUNCTION cultural_lt

      LOGICAL FUNCTION cultural_le (s1,s2)
         CHARACTER(KIND=ch_kind,LEN=*), INTENT(IN) :: s1,s2
         INTEGER :: key1(LEN(s1),4),key2(LEN(s2),4)
         LOGICAL :: comp(5)

         key1 = GENERATE_KEYS(s1)
         key2 = GENERATE_KEYS(s2)
         comp = COMPARE_STRINGS(key1,key2)

         ! Return result of comparison as true if any of Case 1
         ! (s1 equals s2), Case 2 (s1 equivalent to s2) or Case 3
         ! (s1 before s2) is true
         cultural_le = comp(1) .OR. comp(2) .OR. comp(3)
      END FUNCTION cultural_le
      .
      .
   END MODULE Cultural_Strings

A program which wished to use this module to provide culturally correct character handling could do so as follows:

   PROGRAM Culturally_correct
      ! Obtain access to all public elements in Cultural_Strings
      ! (the USE statement must precede IMPLICIT NONE)
      USE Cultural_Strings
      IMPLICIT NONE

      ! Declare two 50-character strings of the default type for the
      ! current environment
      CHARACTER(KIND=ch_kind,LEN=50) :: string_1,string_2
      .
      .
      ! Read data into these strings
      READ *,string_1,string_2

      ! Print the two strings in their correct order, using the
      ! overloaded <= operator to ensure culturally correct ordering,
      ! with the first input coming first if they are equal, or at
      ! least equivalent
      IF (string_1 <= string_2) THEN
         PRINT *,string_1,string_2
      ELSE
         PRINT *,string_2,string_1
      ENDIF
      .
      .
   END PROGRAM Culturally_correct