From gwyn@BRL.MIL Thu Apr 8 13:52:51 1993
Received: from vgr.brl.mil ([138.18.1.6]) by dkuug.dk with SMTP id AA13721
  (5.65c8/IDA-1.4.4j for ); Thu, 8 Apr 1993 23:54:31 +0200
Date: Thu, 8 Apr 93 17:52:51 EDT
From: Doug Gwyn (ACISD/MCSB)
To: Keld J|rn Simonsen
Cc: wg15rin@dkuug.dk, sc22wg14@dkuug.dk
Subject: Re: (SC22WG14.352) Johan van Wingen on POSIX and C characters
Message-Id: <9304081752.aa21557@VGR.BRL.MIL>
X-Charset: ASCII
X-Char-Esc: 29

> --- Forwarded mail from "Johan van Wingen"

> Obviously, it escaped the "C" and Posix people at that moment.
> ...
> It is apparent that any binding that does not take notice of this kind
> of mismatch will result in a disaster, as will any discussion or meeting
> with people who had never become aware of these essential differences.

Obviously? I think he implies that his views are so clearly correct
that anyone who disagrees with him would have to be ignorant. The
people involved in the C standard specification for character set
issues included many who have had a long history of dealing with them
in practical implementations, for example in operating systems for sale
on the international market. I know of at least one other whose
understanding of character sets was developed via practical experience
with international cryptography.

I'm sure we're sorry if you don't like our terminology, but largely it
was constrained by existing practice (for example, the C Reference
Manual) and the choice of terms was thoroughly debated in committee, at
which time your input on the matter would have been far more useful
than it is now.

The equating of a glyph to a character to an encoding thereof is
widespread. This is analogous to the mathematical treatment of objects
as identical if there is an "obvious" isomorphism between them. It only
becomes confusing when there are alternatives for the isomorphism.

In the C multibyte world, indeed there are three distinct mappings of
"characters" (roughly meaning standardized glyphs) into encodings.
One of these applies to the source code set and is interesting only
when L"..." and L'...' are involved. The other two occur at execution
time and are the code set used with the basic-storage-unit
representation, for example each member of arrays written as ordinary
string literals, and the potentially large code set represented with
so-called "multibyte" encoding, which in fact need not involve more
than 1 byte if not more than 1 byte is required to represent the entire
multibyte character set.

> This is what I would call a C-character. It is a way to identify a byte.

In the C standard, the term (plain) "character" is MORE than just a
synonym for "byte"; the basic addressable storage unit for the C
implementation is CONSTRAINED by this definition to be able to serve as
a container for any of the basic code set representations. There is
another constraint elsewhere in the C standard that requires the basic
addressable storage unit to be able to represent all integral values
from 0 through AT LEAST 255. Since there are fewer than 255 characters
in the basic code set required of a C implementation, the latter
constraint is more stringent than the former if a reasonable
representation is chosen for the basic code set; otherwise the code set
representation may well pose a more stringent constraint. For example,
a standard-conforming C implementation is permitted to represent the
required character 'a' by the code value 34567 (decimal), but if it
does then it will have to provide at least 16 bits for a basic storage
unit.

An important point is that details of the choice are left up to the
implementation; this freedom is deemed important to accommodate the
widest possible variety of C environments. The implementation is
constrained in its choice in a few ways that are explicitly stated in
the C standard, and is unconstrained in all other ways.
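
As a rough illustration of the byte-size constraints described above,
the following C sketch reports how wide a byte is on a given
implementation and how many bits a given code value would need. The
function names (bits_per_byte, check_byte_constraints, bits_needed) are
hypothetical helpers introduced here for illustration; only CHAR_BIT
and UCHAR_MAX come from the standard header limits.h.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in this implementation's byte (basic storage unit). */
int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Checks the guarantee discussed above: a byte must be able to
   represent every integral value from 0 through at least 255. */
void check_byte_constraints(void)
{
    assert(CHAR_BIT >= 8);
    assert(UCHAR_MAX >= 255);
}

/* Minimum number of bits needed to represent the given code value
   (hypothetical helper; returns 0 for the value 0). */
int bits_needed(unsigned long value)
{
    int n = 0;
    while (value != 0) {
        n++;
        value >>= 1;
    }
    return n;
}
```

Calling bits_needed(34567UL) yields 16, so an implementation that chose
34567 as the code value for 'a' would need bits_per_byte() to be at
least 16; an implementation with 8-bit bytes could not make that
choice.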

Anyone who thinks "byte" necessarily implies exactly 8 bits has not
been involved in a wide enough variety of computing for a long enough
period of time. Bytes come in many sizes, and CDC mainframes allowed
operations on bytes of variable length. The idea of 8-bit bytes
originated with the IBM 360 community and spread to the microcomputer
world from which the "appliance" small computers arose. There are
technical reasons why a power of two is a convenient size to implement
fixed-sized bytes, but no particular reason why that power has to be
the third power. 8-bit byte size is a common convention but is not
implied by the standard DP dictionary referenced by the ANSI C standard
and is certainly not implied by the terms "byte" and "character" in the
C standard. In the Internet world the term "octet" was introduced to
refer to 8-bit data.

> C-character:
> Because the number of octets is limited to 256, there cannot be more
> different characters than that.

Wrong. A standard-conforming C implementation can choose to represent
as large a character set as is deemed appropriate as plain characters.
"Byte" does not mean "octet". If an implementation has decided to
employ an octet to hold a byte, then indeed the BASIC execution
character set cannot encode more than 255 distinct characters. If more
are needed, then such an implementation must insist that multibyte
encodings be used for the larger code sets. An alternative would be for
the implementation to use more than an octet to hold a byte, in which
case it could support the entire large character set as a "basic
execution character set", and its users would not have to program using
the multibyte-character facilities. (But they might want to do that
anyway, to ensure portability of the applications to other C
implementations.)

> If this is not enough, a new concept is needed: "multibyte character".
> Instead of a single concept of character we have now two, ...

Yes, that is a design problem with explicit multibyte encodings. The C
standard does not require that more than one "byte" be required to
represent any character from the desired large code set; however, if
the implementors choose small (typically 8-bit) byte sizes then they
will have to rely on the multibyte machinery to support large
(external) code sets. In my article in the latest issue of the Journal
of C Language Translation, there is a discourse on the technical
alternatives for this.

> It should be noted that these definitions contain several unexplained
> terms. It is unbelievable how these ever passed a competent editor.

Most of the "unexplained" words can be found in any English language
dictionary, and the data-processing specific terms were (in the ANSI C
standard at least) deliberately inherited from ANSI X3/TR-1-82 which
was cited in the References section of the C standard. (If the ISO C
standard cites some other DP dictionary, that was a mistake since the
drafters of the C standard did not in fact refer to that other
dictionary; we did, however, often look up terms in the ANSI
dictionary.)

> The terms "basic" and "extended" character set are in C nowhere defined
> explicitly.

The basic character set is, first of all, a character set, which we
didn't think needed to be explained, and secondly it is constrained
(explicitly) by section 2.2.1 (X3.159 numbering) to include certain
characters. Section 1.6 imposes a constraint that the encodings of
members of the basic character set fit within a basic storage unit. Two
further constraints on the encoding used to represent the basic
character set are mentioned below. And that is, insofar as I can
recall, all that we wanted to specify about the basic character set.

The extended character set is representable (2.2.1.2) using the
multibyte encoding scheme, and (1.6) it is a superset of the basic
character set (obviously, in the glyph sense rather than the encoding
sense).
That is about all we wanted to say about that.

In the proposed normative addendum to the C standard, another property
or two of the relationship between the basic and multibyte encodings
are specified. In mathematical terms it is essentially an insistence
that a specified subset of the (not necessarily one-to-one) mappings
commute. This cleans up some issues involving the functions and permits
portable implementations of the multibyte library functions.

> Thus, is what is coded with a 16-bit byte a character, but what is coded
> with two octets a multibyte character?

Multibyte character encodings are explained in the C standard. Except
for L"..." and L'...' syntax, they arise only as external
representations, and are converted to internal "wide characters"
(wchar_t) through various (explicitly described) mapping functions.

The idea that all 16-bit patterns have to encode some character, even
in a C implementation that chose 16 bits to represent a basic storage
unit, is nowhere implied by the C standard. Further, the C standard
imposes only a couple of constraints on the encoding apart from the
requirement that the characters listed in 2.2.1 be encodable within the
basic storage unit of the implementation:

  (1) The code value 0 is reserved for use as a string terminator and
      thus cannot be assigned as the code value for a member of the
      basic character set.

  (2) The encodings for '0' through '9' must have contiguous ascending
      values.

> If a letter A is coded with two octets (like in ISO 10646), is it then
> a multibyte character, but with a single octet a character?

This is asking the question the wrong way. Whether a bit pattern
represents a multibyte or plain character encoding depends entirely on
how it is being used, and if used incorrectly it may not represent
either. In many ISO 10646 C implementations, a multibyte encoding of
the character whose glyph is "A" will require 16 bits, while characters
in C source code (execution character set) will require 8 bits.
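
The two encoding constraints numbered (1) and (2) above are what make
the following kind of code portable. This is only a minimal sketch; the
function names (digit_value, count_chars, parse_decimal) are
hypothetical and introduced here purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Relies on constraint (2): '0' through '9' have contiguous ascending
   code values, so subtraction yields the digit's numeric value.
   Returns -1 if c is not a decimal digit. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Relies on constraint (1): code value 0 terminates a string and is
   never the code of one of the listed basic characters. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/* Parses a leading run of decimal digits; stops at the first
   non-digit, which includes the terminating code value 0.
   Overflow is not checked in this sketch. */
unsigned long parse_decimal(const char *s)
{
    unsigned long value = 0;
    assert(s != NULL);
    while (digit_value(*s) >= 0) {
        value = value * 10 + (unsigned long)digit_value(*s);
        s++;
    }
    return value;
}
```

No comparable guarantee exists for letters: an expression such as
c - 'a' is not required to give the position of c in the alphabet,
because the standard leaves the encoding of letters to the
implementation.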

There are variable-length multibyte encodings for the ISO 10646
character set such as the one devised for Plan 9 from Bell Labs (which
has been proposed as an X/Open standard) that require only one octet
for the multibyte encoding corresponding to "A" while being able to
represent other atomic glyphs (which is essentially the ISO 10646
notion of linguistic "character") using two or three octets. The C
standard allows such multibyte encodings, as well as other schemes
involving "shift out" and "shift in" invisible embedded control codes.

In summary, I think most of us involved understand the actual character
set issues well enough. If there are reasonable suggestions for
improvement of the standard wording to clarify these various aspects of
character sets, fine, but suggesting changing all occurrences of
"character" to "C-character" is simplistic and would really not help
clarify anything.

The real technical problem is the use of two distinct execution-time
character encodings, because that feature means that there is not a
unique isomorphism permitting identification of all aspects of
"character" as referring to the same thing. That was a deliberate
committee decision, perhaps motivated by existing experience in
tackling the international character-set issue using similar
approaches. For further discussion as to the advantages and
disadvantages, I refer you to the article I previously mentioned.
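
To make the variable-length multibyte idea concrete, here is a minimal
sketch of an encoder in the style of the Plan 9 scheme mentioned above.
It assumes that scheme is the one now widely known as UTF-8 and handles
only 16-bit code values (one to three octets); the function name
encode_up_to_3 is hypothetical, and validity checks such as rejecting
surrogate values are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes a code value in the range 0..0xFFFF into 1 to 3 octets.
   Stores the octets in out and returns how many were written,
   or 0 if the code value is out of range for this sketch. */
size_t encode_up_to_3(unsigned long code, unsigned char out[3])
{
    assert(out != NULL);
    if (code < 0x80UL) {
        /* 7 bits: a single octet, identical to the plain code value */
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800UL) {
        /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0UL | (code >> 6));
        out[1] = (unsigned char)(0x80UL | (code & 0x3FUL));
        return 2;
    }
    if (code < 0x10000UL) {
        /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0UL | (code >> 12));
        out[1] = (unsigned char)(0x80UL | ((code >> 6) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | (code & 0x3FUL));
        return 3;
    }
    return 0;
}
```

With this sketch the code value 0x41 (the ISO 10646 value for "A")
occupies one octet, 0xE9 occupies two (0xC3 0xA9), and 0x20AC occupies
three (0xE2 0x82 0xAC), which matches the one, two, or three octet
behaviour described in the text.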