From keld@dkuug.dk Sat Sep 21 21:56:06 1991
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA19578; Sat, 21 Sep 91 21:56:06 +0200
Date: Sat, 21 Sep 91 21:56:06 +0200
From: Keld J|rn Simonsen
Message-Id: <9109211956.AA19578@dkuug.dk>
To: donn@hpfcrn.fc.hp.com
Subject: Re: (wg15rin 145) Re: Ballot resolution
Cc: greger@ism.isc.com, hlj@posix, wg15rin@dkuug.dk
X-Charset: ASCII
X-Char-Esc: 29

> if the EBCDICs are shuffled, a translation of most files will
> be needed anyway, and I don't see why charmap and localedef files would
> be an exception.  (There is simply no reason to *expect* portability of
> files to machines running different character sets, no matter how
> similar they are.)  I presume that all EBCDICs contain #.

Well, not all EBCDICs contain #. Out of 27 listed EBCDICs, 13 did not
contain # and one had it in a different position from the rest
(IBM GA-27-2837-9 p 10-45). EBCDIC also has an "invariant" set, like
ISO 646, and the two are almost the same. So if we are careful we can
make locales and charmaps which are representable within all 646 and
EBCDIC character sets, and which furthermore are portable across all
646/EBCDIC conversions.

> There is a fundamentally different problem for EBCDIC than there is for
> 646: EBCDIC doesn't (to my knowledge) overload in the same way 646 does,
> so that a (relatively common) character such as # is replaced with another
> in some contexts.  Translation may be necessary, but it's 1:1, rather
> than 1:2 or 1:0 depending on how you look at it.

Oh, you don't know the EBCDICs! They are much worse than 646. They have
14 national use (NU) positions (where 646 has 12 - with 2 fairly bound),
and one generation of EBCDICs used these just like the national 646-es.
But then they went to 8859-1 compliant EBCDICs, and this was done by
making a lot of national 8-bit EBCDICs, which all had all the characters
of 8859-1, but with the national positions of the former generation of
EBCDICs retained!
So you have the same mess in the EBCDIC 8859-1-like world as in the
646 world.

> >Another thing is for portability, and where automatic conversion
> >happens between the ASCII/8859 and EBCDIC worlds - the "number sign"
> >and the other cause problems. You may very well risk a program sent
> >by email in this world to be screwed up when it is received (this would
> >happen in Denmark for instance) and it would then be more portable
> >to be able to specify a comment character that was (EBCDIC) invariant.
>
> How about being more concrete... are you saying that the (net) translation
> from 646/8859 to/from EBCDIC is wrong?  (Or is it that a translation from
> 646 (ASCII) to (US) EBCDIC is actually occurring when a translation from
> Danish 646 to Danish EBCDIC is what should be occurring?)

More concretely: the code for # in "standard" EBCDIC is the same as the
code for the letter AE in Danish EBCDIC. Then there is another code for #
in the current Danish 8859-1-like EBCDIC (which is not actually used as
much as 8859-1 is in the ASCII-like world). Then there are several
possibilities for character set conversion:

1. Receiving the program on a tape and having automatic conversion in
   the tape reading program - often these use only standard EBCDIC
   conversions, but they may also do the national conversion.
   Depending on this you may get either # or AE.

2. By email, and the same conversion schemes apply, with the same results.

3. By a program on the IBM machine. Same story.

Conclusion: there are many possibilities to have your program messed up.
If you restricted yourself to invariant EBCDIC (and 646), these problems
would not exist, and the programs/sources would be more portable.

> >> This has the advantages that:
> >
> >>	Users have a constant comment character (or at worst two).
> >
> >>	Translation between character sets is simplified (at least
> >>	in that case) because it often goes across as a bit pattern.
> >
> >Yes, but often it does not just go over as a bit pattern.
> >And in those cases we create a portability problem.
>
> ASCII won't go over to EBCDIC as bit patterns, period.  I don't see why
> translation shouldn't be expected in all cases where the character set
> changes.  (Sure, it's nice to be able to get away with being sloppy about
> translations, but supporting such sloppiness is not a goal of standardization
> that I can identify, particularly when that support has a future cost while
> addressing a current (and expected to be temporary) problem.)

I think you are contradicting yourself :-) But first: 13 national
EBCDICs have neither # nor the pound sign, so there you are not able
to have a comment sign like the one you propose. Second: you will
create portability problems in the real world we are living in today,
because ASCII/EBCDIC conversions are not well defined, and the same
problem may arise in national 646 conversions.

> >One way of getting away with 646 and all its national variants
> >is to provide good support for 8859 and better, and we should work
> >further in RIN and other places to facilitate this.
>
> I agree, however as long as we keep doing things to accommodate 646,
> there won't be much reason to go to 8859.  Many vendors already support
> 8859 (or something like it, e.g. EBCDIC or Roman 8), and it won't be
> long until (nearly) all new systems do.  The support for 646 seems to
> be addressed more to users of existing hardware than it is to any vendor
> changes.  As such, until users get rid of their existing hardware,
> the issue won't be resolved, and one way to encourage that is to stop
> encouraging 646!

Well, this is much in line with a vendor's dream: to sell more
equipment. But it really contrasts with the recent work of TSG1, which
emphasises the need to protect users' investments and to avoid creating
trouble for users when making new standards. I take the users' and
ISO standpoint here, but I am also representing users!
> >I do not find "comment-char" very expensive to implement, and the
> >specification is not a lot of lines either.
>
> It's technically not that expensive, I agree, but it has secondary costs
> (in terms of support and abuse) that seem to be expensive.  A standard
> shouldn't go overboard in doing things that aren't really necessary.

It facilitates portability, which is what POSIX is all about, and that
also means, in the long run, a lot of savings for everybody (this is
also what POSIX is all about :-). People can always make errors, and an
infinite loop caused by a wrong termination condition is not made
illegal by the specs either. On the other hand, the more portable a
source is, and thus the less human intervention is required in porting,
the less error-prone and the more rugged the source is.

> >I would be happier, tho, and everything would be simpler, if we chose
> >an invariant non-EBCDIC-problem character like percent-sign as
> >the comment character. The localedef/charmap syntax does not need
> >many metacharacters, and the ones used could be chosen with care
> >for good engineering results in portability. I do not think we have
> >a long historic tradition (for localedef/charmap) to take into
> >account, like we have for the shell.
>
> I agree that there isn't much history for these files.  However there are
> lots and lots of well-trained fingers that think that comments are spelled
> # or /* .. */.  Introducing a new convention seems to be very poor
> ergonomics.

Yes, there is some reason to this. Maybe we could use -- as in C++ ?
sh used the colon sign at some point. Fortran uses "C".
Or we could leave it as it was in d11, where the default is #
and it then can be redefined with the comment-char spec?

Keld
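
To make the d11 option concrete, here is a minimal sketch in Python of a
reader whose comment character defaults to # and can be redefined. The
"comment-char X" line syntax, the function names, and the rule that only
whole-line comments are recognised are assumptions made for illustration,
not the draft's actual grammar. The helper is_invariant checks a candidate
character against the 12 ISO 646 national use positions only; the EBCDIC
invariant set is not modelled.

```python
# The 12 ISO 646 positions open to national variation.
ISO646_VARIANT = set("#$@[\\]^`{|}~")


def is_invariant(ch):
    """True if ch is a printable character of the ISO 646 invariant set."""
    return ch.isascii() and ch.isprintable() and ch not in ISO646_VARIANT


def strip_comments(lines, default_comment_char="#"):
    """Yield non-blank, non-comment lines.

    A line of the (hypothetical) form 'comment-char X' replaces the
    current comment character with X for all following lines.
    """
    comment_char = default_comment_char
    for line in lines:
        fields = line.split()
        if (len(fields) == 2 and fields[0] == "comment-char"
                and len(fields[1]) == 1):
            comment_char = fields[1]  # redefine the marker, emit nothing
            continue
        if not fields or line.lstrip().startswith(comment_char):
            continue  # blank line or whole-line comment
        yield line.rstrip("\n")


sample = [
    "# default-style comment",
    "comment-char %",
    "% comment in the redefined style",
    "<A> \\x41",
]
print(list(strip_comments(sample)))
print(is_invariant("%"), is_invariant("#"))
```

With the percent-sign chosen, a file like the sample survives any 646
national variant unchanged, whereas a # marker would not.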