From gwyn@BRL.MIL Mon Apr 12 19:23:10 1993
Received: from vgr.brl.mil ([138.18.1.6]) by dkuug.dk with SMTP id AA19590
  (5.65c8/IDA-1.4.4j for ); Mon, 12 Apr 1993 21:27:19 +0200
Date: Mon, 12 Apr 93 19:23:10 GMT
From: Doug Gwyn (ACISD/MCSB)
To: Keld J|rn Simonsen
Cc: John C Klensin , sc22@dkuug.dk, sc22wg15@dkuug.dk, sc22wg14@dkuug.dk
Subject: Re: (SC22WG14.354) Re: (SC22.316) characters in C - forwarded
Message-Id: <9304121923.aa06967@VGR.BRL.MIL>
X-Charset: ASCII
X-Char-Esc: 29

>From: Keld J|rn Simonsen

>For WG14 the problem is not recognized. My personal belief is
>that it was wrong to introduce multibyte and widechar concepts
>in the ISO C standard, and there were other proposals on character
>set handling which were closer to the SC2 terms when the standard was
>written. The Japanese have addressed this problem in WG14 many times.
>Maybe the problem can be solved by the revision of 9899.
>It is not in the scope of the addendum to solve it.

I agree with that assessment. Largely this may be resolvable by
a "revisionist" interpretation of the C standard's terminology:

    type "char":            really just a storage unit
    mbc:                    really a user-oriented character
    wchar_t:                just a way to handle a character as a unit
    "basic" character set:  characters usable in portable source code

Then portable text-processing programs have to use the mbc-based set
of I/O functions, and to do anything with a character as a unit it
must be converted to a wchar_t; fortunately we provided L"..." and
L'...' in the language to make this less of a hassle to program.

I think the confusion comes when people persist in thinking that
type "char" should be used to represent a character. That usage
is now obsolete, and it is a pity that we had to support existing
practice in that regard. (That's why the C standard insists that
the basic, i.e. portable, programming characters must fit into a byte.)

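To make the model above concrete, here is a minimal sketch in C, offered as an illustration rather than as anything from the original message. It treats "char" purely as storage, walks a multibyte string with the standard mbtowc function, and handles each character as a wchar_t unit, comparing against a wide character constant. The helper names mb_char_count and mb_count_wchar are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Count characters (not bytes) in the multibyte string s.
   Returns (size_t)-1 if s contains an invalid multibyte sequence. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    wchar_t wc;
    int n;

    assert(s != NULL);
    (void)mbtowc(NULL, NULL, 0);          /* reset to the initial shift state */
    while ((n = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
        s += n;                           /* advance by bytes consumed */
        count++;                          /* but count one character */
    }
    return n < 0 ? (size_t)-1 : count;
}

/* Count how many characters of the multibyte string s equal target.
   The comparison is done on wchar_t units, never on individual bytes.
   Returns (size_t)-1 if s contains an invalid multibyte sequence. */
size_t mb_count_wchar(const char *s, wchar_t target)
{
    size_t hits = 0;
    wchar_t wc;
    int n;

    assert(s != NULL);
    (void)mbtowc(NULL, NULL, 0);
    while ((n = mbtowc(&wc, s, MB_CUR_MAX)) > 0) {
        if (wc == target)
            hits++;
        s += n;
    }
    return n < 0 ? (size_t)-1 : hits;
}
```

A caller would write something like mb_count_wchar(text, L'l'). In the default "C" locale every byte is one character, so the byte count and the character count coincide; the two only diverge after a call such as setlocale(LC_CTYPE, "") selects a native multibyte encoding.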
From this point of view, it would have been better to reserve ALL
use of the term "character" for use in conjunction with (external) mbc
or (internal mapped equivalent) wchar_t, and use "byte" uniformly
every time the basic storage unit was referred to. However,
accommodating the existing basic-character-fits-into-a-byte practice
forced the use of the word "character" in connection with some
fundamental uses of bytes.

The Japanese "long char" proposal didn't avoid there being two distinct
kinds of "character" in C programs, while my "short char" proposal did
unify "character" into a single data type. As noted in my JCLT article,
with the proposed normative addendum to the C standard we have
essentially completed reinstating the "long char" proposal, although we
call the long-char type "wchar_t" instead, and also we provide explicit
access to the external<->internal mapping functions.

So I think it is easy enough to understand what the C character model
amounts to, if one keeps in mind that the use of "char" and
(non-multibyte) "character" is associated with what is really a storage
unit that supports a very special set of standardized character codes
with their own rules distinct from the general native-environment
character facilities. General character sets are to be handled as
(external) mbc and (internal) wchar_t.

This isn't as clean as I wanted it to be, but with the proposed
normative addendum it is not as bad as it was without the full set of
general-character handling functions. Too bad we now have both the
old-style special-"char" functions PLUS the new-style general-character
functions; it causes confusion to have two ways to do something only one
of which is really the "right" way to do the job.
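As a rough illustration of the external<->internal mapping described above, the following sketch converts a multibyte string to upper case by decoding each character to wchar_t, transforming it, and encoding it back. It assumes the restartable conversion functions mbrtowc and wcrtomb and the towupper function as they appear in <wchar.h> and <wctype.h> of the amended standard; the draft addendum under discussion at the time may have differed. The helper name mb_toupper is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-case the multibyte string src into dst, which holds dstsize bytes
   (terminator included). Returns 0 on success, -1 on an invalid or
   incomplete input sequence or if dst is too small. */
int mb_toupper(char *dst, size_t dstsize, const char *src)
{
    mbstate_t in, out;
    char buf[MB_LEN_MAX];
    size_t used = 0;
    size_t left;

    assert(dst != NULL && src != NULL);
    memset(&in, 0, sizeof in);            /* initial conversion states */
    memset(&out, 0, sizeof out);
    left = strlen(src) + 1;               /* bytes available, with terminator */

    for (;;) {
        wchar_t wc;
        size_t n, m;

        /* external (mbc) -> internal (wchar_t) */
        n = mbrtowc(&wc, src, left, &in);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;

        /* operate on the character as a unit */
        wc = (wchar_t)towupper((wint_t)wc);

        /* internal (wchar_t) -> external (mbc); for the null character
           this also emits any sequence needed to restore the initial
           shift state, followed by the terminating byte */
        m = wcrtomb(buf, wc, &out);
        if (m == (size_t)-1 || m > dstsize - used)
            return -1;
        memcpy(dst + used, buf, m);
        used += m;

        if (n == 0)                       /* the terminator was just copied */
            return 0;
        src += n;
        left -= n;
    }
}
```

The old-style equivalent would call toupper on each byte, which only works when every character fits in one storage unit. The version above pays for generality with two conversions per character, which is the trade-off the two parallel function families force on the programmer.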