From keld@dkuug.dk Thu Apr 8 20:41:10 1993
Received: by dkuug.dk id AA06361 (5.65c8/IDA-1.4.4j for wg15rin); Thu, 8 Apr 1993 18:41:13 +0200
Message-Id: <199304081641.AA06361@dkuug.dk>
From: keld@dkuug.dk (Keld Jørn Simonsen)
Date: Thu, 8 Apr 1993 18:41:10 +0200
X-Charset: ASCII
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: wg15rin@dkuug.dk
Subject: Johan van Wingen on POSIX and C characters
Cc: sc22wg14@dkuug.dk

--- Forwarded mail from "Johan van Wingen"

>From SC22-request@dkuug.dk Thu Apr 8 16:44:15 1993
Date: Thu, 08 Apr 93 16:31 CET
From: "Johan van Wingen"
To: SC22 List
Subject: (SC22.313) C and Posix

My comments on the NL position seem to have created some consternation. This surprised me, because I said the same things not long ago. (For the really concerned, no vote has been mailed by NNI as yet.)

In mailing 197 of 1992-11-30 I replied to a question from John Klensin about the relation, or the mapping, between "characters" and "graphic symbols". This contained a discussion of the different concepts of "character" one finds in "C", Posix and SC2 standards. I reproduce it here, somewhat corrected, because it presents my views very well. Obviously, it escaped the "C" and Posix people at that moment.

.......

The problem is that you cannot base anything in a standard on a concept (e.g. "graphic symbol") that was never meant to have something based on it. The definitions (as of ISO 646, 4873, 10367:1991) are:

4.4 character : A member of a set of elements used for the organization, control or representation of data.
4.7 coded character set; code : A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their coded representation by one or more bit combinations.
4.14 graphic character : A character, other than a control function, that has a visual representation normally handwritten, printed or displayed, and that has a coded representation consisting of one or more bit combinations.
4.15 graphic symbol : A visual representation of a graphic character or of a control function.

That is all; no (one-to-one) mapping between characters and graphic symbols is assumed. Anything, on paper, on screen or otherwise, large or small, red or blue, wide or narrow, bold or italic, that is recognizable without causing confusion with the visual representation of a different character will do. But we see that these visual things are not all alike. How do we know that a blue or a red graphic symbol represents the same character? The only answer is: from context. We know, or better, we have learned what to ignore to arrive at an abstraction. But that is just the nature of a character. A graphic symbol is like one of these many shadows of a Platonic idea, something not too well defined, meant only for our mortal eyes ("handwritten, printed or displayed").

The real problem (for us in SC22) behind all this is that "C", and subsequently POSIX, use a different concept of character and graphic symbol. I gave much thought to how the two could be reconciled, but I had to conclude that it is impossible.

The definition of "character" in "C" is (p. 3):

    Character -- a bit representation that fits in a byte. The representation of each member of the basic character set in both the source and execution environments shall fit in a byte.

This is what I would call a C-character. It is a way to identify a byte. By contrast, several programming languages specify characters as abstractions, members of a set, and Enumerate them in the standard (SC2 uses the same idea). These things I would like to call E-characters.
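The distinction between the two concepts can be illustrated with a minimal sketch. It is not part of the original message. It is written in Python, all names in it are hypothetical, and it assumes that Python's "utf-16-be" codec is an acceptable stand-in for a two-octet coding such as that of ISO 10646. The sketch takes one abstract character (an E-character in the terminology above) and shows that its coded representation occupies a different number of bytes depending on the code chosen.

```python
# Illustrative sketch only; the names below are hypothetical and do not
# come from any of the standards discussed.

# One abstract ("E-") character: the capital letter A.
e_character = "A"

# Three coded representations of that same character.
# Assumption: Python's "utf-16-be" codec stands in for a two-octet coding.
codings = {
    "7-bit (ISO 646 / ASCII)": e_character.encode("ascii"),
    "8-bit (ISO 8859-1)": e_character.encode("latin-1"),
    "two octets": e_character.encode("utf-16-be"),
}


def describe(codings):
    """Return (name, number of bytes, octets in hex) for each coding."""
    return [(name, len(data), data.hex()) for name, data in codings.items()]


for name, size, octets in describe(codings):
    print(f"{name}: {size} byte(s), octets {octets}")

# The number of abstract characters is the same in every case.
print("abstract characters:", len(e_character))
```

Counting by bytes, the same letter is one unit in the first two codes and two units in the third. Counting by members of the enumerated set, it is one character throughout.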
If we now compare which concepts correspond to each other, then we see:

    C                 E
    C-character       bit combination
    graphic symbol    E-character
    glyph             graphic symbol

It is apparent that any binding that does not take notice of this kind of mismatch will result in a disaster, as will any discussion or meeting with people who have never become aware of these essential differences.

The way the term "character" is defined has important implications.

E-character: Because the members of the set can be specified by enumeration, there is no logical upper limit to the number of different characters. If that number is small, we may code them with 7-bit bytes; if it is larger, octets may do. But this is a matter of implementation, not of design of a programming language.

C-character: Because the number of distinct octet values is limited to 256, there cannot be more different characters than that. If this is not enough, a new concept is needed: "multibyte character". Instead of a single concept of character we now have two, and in bindings problems will arise over what is meant when transferring a datatype E-character to a program in "C".

Thus we find the following definitions in the C Standard:

    Multibyte character -- a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended set is a superset of the basic character set.

    Byte -- the unit of data storage large enough to hold any member of the basic character set of the execution environment. It shall be possible to express the address of each individual byte of an object uniquely. A byte is composed of a contiguous sequence of bits, the number of which is implementation defined.

It should be noted that these definitions contain several unexplained terms. It is unbelievable how these ever passed a competent editor. The terms "basic" and "extended" character set are nowhere defined explicitly in C. One has to guess at 2.2.1 (p. 11) what is meant.
The basic set need not be restricted to the characters of ISO 646 IRV, it is said. Thus, is what is coded with a 16-bit byte a character, but what is coded with two octets a multibyte character? If a letter A is coded with two octets (as in ISO 10646), is it then a multibyte character, but with a single octet a character? The matter is further complicated by the introduction of the term wide-character, which I leave out of the discussion for the moment.

The designers of C may specify their standards as they want, as long as they avoid conflict of terminology. If they call their constructs C-characters, or use some other convenient term, I'll approve all their standards.

This is enough on C (for the time being), but there is also Posix. Quoting from P1003.2/D12, POSIX (sent to me by Jim Isaak):

    2.2.2.29 character: A sequence of one or more bytes representing a single graphic symbol.
    NOTE: This term corresponds in the C Standard to the term "multibyte character", noting that a single-byte character is a special case of multibyte character. Unlike the usage in the C Standard, "character" here has no necessary relationship with storage space, and "byte" is used when storage space is discussed.

    2.2.2.27 byte: An individually addressable unit of data storage that is equal to or larger than an octet, used to store a character or a portion of a character.

Neither "graphic symbol" nor "glyph" is defined in 9945-2 (12). And that is strange, because the terms are used in a meaning different from that in the ISO standards where they were introduced (ISO 646 and ISO 9541-1).

It looks as if the Posix people have accepted the E-character as their basic concept, but hesitate to throw away their C-habits. In my opinion no harm will be done if the SC2 definitions are adopted for character and graphic symbol. As for byte, just put a full stop after octet, and remove the rest.

But at present we are faced with the problems of double terminology. What are we to do with all this?
Are we sure what we are speaking about, should we go to a meeting with other SC's on character issues? I hope that further exchange of ideas by this medium will contribute to unravelling the knot.

Has the dollar sign one or two strokes in POSIX?

Best regards from
J. W. van Wingen
PRECAL@HLERUL2.BITNET or PRECAL@rulmvs.LeidenUniv.nl

--- End of forwarded message from "Johan van Wingen"