From keld@dkuug.dk Thu Apr 9 00:16:33 1992
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA24603; Thu, 9 Apr 92 00:16:33 +0200
Date: Thu, 9 Apr 92 00:16:33 +0200
From: Keld J|rn Simonsen
Message-Id: <9204082216.AA24603@dkuug.dk>
To: wg15rin@dkuug.dk
Subject: plan 9 - FYI
X-Charset: ASCII
X-Char-Esc: 29

------- Forwarded Message

From: dmr@alice.att.com (Dennis Ritchie)
Newsgroups: comp.std.internat
Organization: AT&T Bell Laboratories, Murray Hill NJ

It's worth announcing that the Plan 9 operating system has
incorporated Unicode as its character set. In view of the
harmonization of Unicode with DIS 10646, and especially because of
certain vital details of representation, we should be prepared for
full ISO 10646 as well.

These considerations were fundamental to the design:

1) It is essential to maintain compatibility with the existing ASCII
   world. In particular, ASCII text files (and files containing ASCII,
   like directories and symbol tables) must retain their meaning.
   Communication between Plan 9 and other systems must be maintained,
   whether by remote login, or across remote file systems.

2) Nevertheless, Unicode must be fully integrated, and not confined
   to specially marked files, or special programs. It must work
   everywhere. Moreover, there must be no problems with endianness,
   since Plan 9 runs on processors of both byte sexes.

These principles led to the following implementation.

16-bit Unicode is represented using the scheme called UTF, which is
described in Annex F of DIS 10646. This takes 32-bit 10646 characters
(and, by restriction, 16-bit Unicode) into sequences of one or more
bytes. In UTF, ASCII characters turn into single bytes with the same
numeric code, thus achieving compatibility. Other characters turn
into multi-byte sequences.

Every representation of text external to programs uses UTF: files,
pipes, the system call interface. Internally, programs (if
convenient) convert UTF into Unicode.
We call the resulting characters `runes' and supply appropriate
library routines to convert between byte-based character streams and
runes. These routines and the corresponding C language support are
compatible with the `wide characters' of the C standard.

There were many engineering considerations in the change. Although
the UTF mapping is complicated, it yields only ASCII and Latin-1
graphic characters in a multi-byte sequence. (In fact, that is why it
is complicated.) Thus, programs may continue to interpret null,
new-line, and space characters without having to worry about UTF
input.

There seem to be three classes of program:

1) Those that do not interpret a byte stream. `cat' is the canonical
   example. Cat does not care that it is being given UTF; it just
   reads and writes bytes.

2) Those that need minimal care: for example, programs that parse
   file names (to find `/' characters) need to be aware that this
   character may appear in a UTF sequence. There is a `utfrune'
   routine that searches a UTF string for a character, by analogy to
   `strchr'.

3) Those that need to be fully aware of Unicode: all the interesting
   ones, really. For example, regular-expression routines need to
   understand the fact that there can be 65536 characters in a
   character set, among many other adaptations. The window system,
   editors, and the operating system have to deal with displaying
   lots of different characters (this means dynamic loading of
   bitmaps).

Adaptation of utilities is virtually complete, and the result is
impressive. It makes a smashing demo to grep for a Japanese string in
a file whose name is Cyrillic, and have it all work.

	Dennis Ritchie

------- End of Forwarded Message