From kido@vnet.IBM.COM Mon Mar 15 02:52:35 1993
Received: from vnet.IBM.COM ([192.239.48.4]) by dkuug.dk with SMTP id AA28787
  (5.65c8/IDA-1.4.4j for ); Mon, 15 Mar 1993 02:52:35 +0100
Message-Id: <199303150152.AA28787@dkuug.dk>
Received: from YMTVM8 by vnet.IBM.COM (IBM VM SMTP V2R2) with BSMTP id 3990;
  Sun, 14 Mar 93 20:51:54 EST
Date: Mon, 15 Mar 93 10:51:22 JST
From: "Akio Kido"
To: sc22wg15@dkuug.dk, sc22wg20@dkuug.dk, XoJIG@xopen.co.uk,
  sig-international@osf.org, uojlg-bse@uiap.ui.org,
  efischer@donald.aix.kingston.ibm.com
Subject: MSE 1.mm
X-Charset: ASCII
X-Char-Esc: 29

.H 1 Scope
.P
This amendment defines extensions to \*(AC
that provide a more complete set of
multibyte and wide character utilities,
as well as alternative spellings for certain tokens.
Use of these features can help promote
international portability of C programs.
.P
This amendment specifies extensions that affect various clauses of \*(AC:
.DL
.LI
To the compliance clause (clause 4),
the additional header
.Cf
is provided by both freestanding and hosted implementations.
.LI
To the language clause (clause 6),
six additional tokens are accepted.
.LI
To the library clause (clause 7),
new capabilities are specified for the existing
formatted input/output functions (7.9.6),
and additional types, macros, and many functions are defined:
.BL
.LI
wide character testing functions,
.Cf iswalnum
for example.
.LI
extensible wide character classification functions,
.Cf wctype
and
.Cf iswctype .
.LI
wide character case mapping functions,
.Cf towlower
and
.Cf towupper .
.LI
formatted wide character input/output functions,
.Cf fwprintf
for example.
.LI
wide character input/output functions,
.Cf fgetwc
for example.
.LI
wide string numeric conversion functions,
.Cf wcstod
for example.
.LI
wide string general utility functions,
.Cf wcscpy
for example.
.LI
a wide string time conversion function,
.Cf wcsftime .
.LI
restartable multibyte/wide character conversion functions,
.Cf mbrtowc
for example.
.LI
restartable multibyte/wide string conversion functions,
.Cf mbsrtowcs
and
.Cf wcsrtombs .
.LE
.LE
.HU Background
.P
Most traditional computer systems and computer languages,
including traditional C,
assume (sometimes without documenting it)
that a ``character'' can be handled as an atomic quantity
associated with a single memory storage unit,
a ``byte'' or something similar.
This is not true in general.
For example, a Japanese, Chinese, or Korean character
usually requires two or three bytes to represent;
this is a
.I "multibyte character"
as defined by \*(AC subclause 3.13.
Even in the Latin world,
a multibyte coded character set may appear in the near future.
This conflict is called a
.IR "byte and character problem" .
.P
A related concern is that string length
has at least two different meanings:
the number of bytes and the number of characters.
.P
To cope with these problems,
many technical experts, particularly in Japan,
have developed their own sets of additional multibyte character functions,
sometimes independently and sometimes cooperatively.
Fortunately, the developed extensions are actually quite similar.
It can be said that in the process
they have found common features for multibyte character support.
Moreover, the industry currently has many good implementations
of such extensions.
.P
The above in no way denigrates the important groundwork
in multibyte and wide character programming provided by \*(AC:
.DL
.LI
Both the source and execution character sets
can contain multibyte characters
(with possibly different encodings),
even in the
.Cf \&"C"
locale.
.LI
Multibyte characters are permitted in comments,
string literals, character constants, and header names.
.LI
The language supports wide character constants and strings.
.LI
The library has five basic functions
that convert between multibyte and wide characters.
.LE
.P
However, these five functions are often too restrictive
and too primitive for developing portable international programs
that manage characters.
Consider a simple program that wants to count
the number of characters, not bytes, in its input.
The prototypical program,
.Cb
#include <stdio.h>
.sp .5v
int main(void)
{
	int c, n = 0;
.sp .5v
	/* getchar() returns one byte per call, so n counts bytes */
	while ((c = getchar()) != EOF)
		n++;
	printf("Count = %d\en", n);
	return 0;
}
.Ce
does not work as expected if the input contains multibyte characters;
it always counts the number of bytes.
It is certainly possible to rewrite this program
using just some of the five basic conversion functions,
but the simplicity and elegance of the above are lost.
.P
The \*(AC standard deliberately chose not to invent
a more complete multibyte and wide character library,
choosing instead to await their natural development,
as the C community acquired more experience with wide characters.
The task of the \*(WG committee was to study
the various existing implementations and, with care,
develop this first amendment to \*(AC.
.H 1 Compliance
.eX "clause 4"
.P
The description is adjusted so that the standard header
.Cf
is included in the list of headers that must be provided
by both freestanding and hosted implementations.
.rF
Alternate spellings
.Cf
(4.4).