From kido@vnet.ibm.com Mon Nov 16 01:59:23 1992
Received: from vnet.ibm.com by dkuug.dk with SMTP id AA16321
  (5.65c8/IDA-1.4.4j for ); Mon, 16 Nov 1992 01:59:23 +0100
Message-Id: <199211160059.AA16321@dkuug.dk>
Received: from YMTVM8 by vnet.ibm.com (IBM VM SMTP V2R2) with BSMTP id 2867;
  Sun, 15 Nov 92 19:57:34 EST
Date: Mon, 16 Nov 92 09:55:35 JST
From: "Akio Kido"
To: sig-international@osf.org, sc22wg20@dkuug.dk, sc22wg15@dkuug.dk
Cc: noda@nec.co.jp
Subject: Redactor's Report on N205R(1992-10-06)(plain ASCII)
X-Charset: ASCII
X-Char-Esc: 29

Folks,

Today I received the attached redactor's report on the latest version of MSE from the Japanese C committee. Please refer to it in your review of MSE.

Best regards,
Akio Kido

------------------------------------------------------------------

Redactor's Report on N205R(1992-10-06)

IPSJ/ITSCJ/SC22/C WG
November 13, 1992

This document summarizes the changes that have been made in MSE between SC22/WG14/N205 and its latest draft, SC22/WG14/N205R(1992-10-06). ITSCJ/SC22/C WG thinks that N205R is almost stable. Any technical change in MSE shall be made according to the resolutions adopted by SC22/WG14.

1 Clarification of the behavior of encoding error

1.1 Definition of the term encoding error

A new paragraph is appended to the clause "3.4 input/output" in order to define a new term, encoding error.

1.2 Description of the behavior of the wide character I/O functions

In each of the two subclauses "3.4.1.1 the fgetwc function" and "3.4.1.3 the fputwc function", we append a new paragraph which describes the behavior of the function when it encounters an encoding error.

1.3 A new macro EILSEQ

The new clause "3.2 Errors" describes a new macro EILSEQ, whose value is stored in the variable errno when an encoding error occurs.

1.4 The encoding error included in the input failure in the scanf function

In the WG14 Salt Lake City meeting we confirmed that if the scanf function encounters an encoding error, the function treats it as an input failure.
So we change the description of the "extension" section in the subclause "3.4.3.2 the fscanf function".

2 Clarification of the description about wide character

We change the description in the clause "2.1 clarification" so as to specify explicitly that an invalid wchar_t value causes an encoding error.

3 The new name of the wcsstr function

We rename the function previously called wcswcs to wcsstr, according to the result of a straw vote in the WG14 Salt Lake City meeting.

4 "C" locale restriction for the iswxxx functions

The "C" locale restriction for the widechar testing functions has been replaced with a supersetting constraint. When an isxxx function returns true for a character c, the corresponding iswxxx function shall return true for the corresponding widechar wc.

    isxxx(c) != 0 ==> iswxxx(wc) != 0

5 Clarification of %C and %S conversions in the printf functions

The description for %C and %S of the printf function has been modified to clarify the behavior when a precision is specified. The previous draft specified that a precision is ignored; the behavior is now defined. In any case, only valid multibyte sequences shall be generated by the function. Truncation shall never occur in the middle of a multibyte sequence which corresponds to a character.

6 Clarification of %C and %S conversions in the fscanf function

The description for %C and %S of the fscanf function is modified to clarify the behavior in reading multibyte characters as the input. In the new description, the "as-if" rule is used to specify that the fscanf function always reads zero or more multibyte characters from the input and that it never stops reading at a mid-character byte (a position that is not proper for delimiting a multibyte character) unless an encoding error occurs.

7 Addition of an extra parameter to the wcstok function

A parameter is added to the wcstok function to eliminate the function's internal memory.
The new parameter points to a caller-provided wchar_t pointer into which the wcstok function stores the information necessary for it to continue scanning the same string.

8 Elimination of %[ in the wscanf function

The description and the example for %[ of the wscanf function are eliminated according to the result of a straw vote in the WG14 Salt Lake City meeting. The reason for this elimination is that %[ presents several difficulties in implementation.

9 Change in the behavior of wchar I/O applied to a binary stream

The last sentence in the clause "2.3.2 states of a stream", "if any of the wide character input/output functions is performed on a binary stream, the behavior is undefined", is eliminated according to the result of a straw vote in the WG14 Salt Lake City meeting. The reason for this elimination is that it seems to be an unnecessary constraint.

10 Elimination of redundant description about conversion specifier

In the description for %c of the fwprintf function in 3.4.2.1, the redundant description, "if the precision is specified, the behavior is undefined", is eliminated. In the description for %C of the fprintf function in 3.4.3.1, the redundant description, "if the precision is specified, it is ignored", is also eliminated, because the description of precision in ISO/IEC 9899:1990 is sufficient.

11 Magic Number 509

There was a magic number 509 at the environment limit of the fwprintf function in 3.4.2.1. The unit 'wide characters' was revised to 'bytes'.

12 The description, "implementation-defined", in the iswctype function

In the description of the iswctype function in 3.3.3.2, a call to the iswctype function with an invalid value in the second parameter, wc_prop, should not be allowed. So we revised the description of the iswctype function in 3.3.3.2 from "the behavior is implementation-defined" to "the behavior is undefined".
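As an illustration of item 7 above, the following is a minimal sketch of why a caller-provided state pointer matters: two tokenizations can be nested without disturbing each other. It assumes the three-argument wcstok signature that was later standardized in ISO C; the draft's exact parameter list may differ. The function name count_fields and the ";" and "," delimiters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts comma-separated fields inside
   semicolon-separated records.  The text buffer is modified in place. */
size_t count_fields(wchar_t *text)
{
    assert(text != NULL);

    wchar_t *outer_state;   /* scan position for the record loop */
    wchar_t *inner_state;   /* scan position for the field loop  */
    size_t fields = 0;

    for (wchar_t *rec = wcstok(text, L";", &outer_state);
         rec != NULL;
         rec = wcstok(NULL, L";", &outer_state)) {
        /* The inner scan keeps its own state, so the outer scan
           can resume where it left off. */
        for (wchar_t *fld = wcstok(rec, L",", &inner_state);
             fld != NULL;
             fld = wcstok(NULL, L",", &inner_state)) {
            fields++;
        }
    }
    return fields;
}
```

With a single hidden state inside the function, the inner loop would overwrite the position needed by the outer loop; keeping the state in the caller avoids that.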
13 New conversion functions

We decided to proceed with adding new multibyte conversion functions according to the result of the straw vote in the WG14 Salt Lake City meeting.

[quote from the minutes:
    11  add conversion functions with controllable shift state
     4  no
end of quote]

13.1 Current specification for the state information

The specifications of the current multibyte handling functions (mblen, mbtowc, wctomb, mbstowcs, wcstombs) of ISO/IEC 9899:1990 are as follows:

- The character handling functions hold the conversion state information of a string in internal memory, but the information is not associated with the string.

- Changing the LC_CTYPE category causes the shift state of these functions to be indeterminate.

- The multibyte string functions return -1 if an invalid multibyte character/wide character is encountered, though no information to determine which character is invalid is given to the application.

13.2 Possible problems

The above specifications may cause the following problems:

- The character handling functions cannot manage the conversion state information of two or more strings at one time, so they can be applied to neither a stateful encoding environment nor a multiple stream environment.

- Since changing the LC_CTYPE category is not allowed, these functions cannot manage several strings represented in different encodings at one time. (A different encoding may mean a different LC_CTYPE.)

- The multibyte string functions cannot be restarted, since there is no information to determine where the conversion stopped.

- Conversion between multibyte character/string and wide character/string may be performed implicitly in the I/O functions defined in MSE. Therefore, the problems will be inherited by these I/O functions as long as the current conversion functions are used.
13.3 Solutions

13.3.1 Overview

We have introduced five new multibyte conversion functions which correspond to the current conversion functions, one new function, and one new data type. The following gives brief descriptions of these functions and the data type.

- RESTARTABLE MULTIBYTE CHARACTER FUNCTIONS
  These functions differ from the corresponding functions in 9899:1990 (mblen, mbtowc, wctomb) in that they have an additional parameter to store the conversion state information independently from the function. They are called mbrlen, mbrtowc, and wcrtomb, respectively.

- RESTARTABLE MULTIBYTE STRING FUNCTIONS
  These functions differ from the corresponding functions in 9899:1990 (mbstowcs, wcstombs) in that they have two additional parameters to store the conversion state information and the restart position independently from the function. They are called mbsrtowcs and wcsrtombs, respectively.

- NEW DATA TYPE mbstate_t
  In order to make the above functions applicable to a multiple stream environment, a new data type mbstate_t has been introduced. It may be a nonarray object type that can hold the conversion state information needed to convert between sequences of multibyte characters and wide characters.

- NEW FUNCTION FOR mbstate_t OBJECT
  A new function, sisinit(), will compare any mbstate_t object to the initial shift state.

13.3.2 State object mbstate_t

mbstate_t may be a nonarray object type that can hold the conversion state information. The contents of an mbstate_t object can be changed by assignment. We did not restrict the data type to an integral type, in order to support multiple encoding environments in the future. In such environments, an mbstate_t object is required to contain both shift state information and encoding information.

There shall be at least one conversion state called 'initial', whose shift state is initial but whose encoding is not determined yet.

The values that represent the contents of mbstate_t objects are not defined in MSE.
The only exception is an mbstate_t object that is filled with zero, either implicitly or explicitly. Such an object represents an initial conversion state. The encoding will be determined by the LC_CTYPE of the locale at the first call of the mbrlen or mbrtowc function with the last argument pointing to the mbstate_t object. Changing LC_CTYPE will not affect the object once the encoding is determined.

The sisinit function is provided to test whether or not an mbstate_t object describes the initial shift state. We did not define a function to compare mbstate_t objects, since it is difficult to define the equality of shift states, which depend on encodings.

13.3.3 new parameter for conversion state information

The new conversion functions have a parameter of type pointer to mbstate_t. The conversion will be performed according to the contents of the mbstate_t object pointed to by the parameter. The contents of the mbstate_t object will be updated after the conversion. The state information can be associated with a string by preparing one mbstate_t object for each string.

13.3.4 new parameter for restart point information

The new multibyte string handling functions have an extra parameter of type pointer to char */wchar_t *. When an encoding error occurs, these functions will store the address just past the last multibyte/wide character converted. The mbstate_t object will hold the state information for the position to which that address points.

13.3.5 other facilities

The new multibyte character functions will return the number of bytes needed to return to the initial state if the pointer to the source multibyte/wide string is a null pointer and the pointer to mbstate_t is not a null pointer.

The new multibyte string functions will return the number of elements required for the multibyte/wide character array to store the converted result if the pointer to the destination buffer is a null pointer.
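
The size query with a null destination pointer can be sketched as follows. This is only an illustration: it assumes the mbsrtowcs signature that was later standardized in ISO C, which may differ from this draft, and the helper name mb_to_wide_alloc is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string to a newly
   allocated wide string.  Returns NULL on an encoding error or
   allocation failure; the caller frees the result. */
wchar_t *mb_to_wide_alloc(const char *mbs, size_t *out_len)
{
    assert(mbs != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* zero-filled = initial conversion state */

    /* First pass: null destination, so only the required number of
       wide characters (without the terminator) is returned. */
    const char *src = mbs;
    size_t need = mbsrtowcs(NULL, &src, 0, &st);
    if (need == (size_t)-1)
        return NULL;

    wchar_t *buf = malloc((need + 1) * sizeof *buf);
    if (buf == NULL)
        return NULL;

    /* Second pass: real conversion, restarted from the initial state. */
    memset(&st, 0, sizeof st);
    src = mbs;
    size_t got = mbsrtowcs(buf, &src, need + 1, &st);
    if (got == (size_t)-1) {
        free(buf);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = got;
    return buf;
}
```

The two-pass pattern lets the caller size the buffer exactly instead of guessing an upper bound.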
In environments which do not need to hold the conversion state information, the pointer to mbstate_t can be replaced by a null pointer. In this case, the new multibyte character functions shall work as if the corresponding current functions were called, and the new multibyte string functions shall work as if the corresponding current functions were called, except for the restart facility and the facility for determining the required number of elements.
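
The per-string state object described in 13.3.3 can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the mbrtowc signature and the mbsinit function from later ISO C, where mbsinit is assumed to play the role this report describes for sisinit. The helper name count_chars is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters in a multibyte string
   using a private conversion state.  Returns (size_t)-1 on an encoding
   error, an incomplete trailing character, or a string that does not
   end in the initial shift state. */
size_t count_chars(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* zero-filled = initial conversion state */

    size_t left = strlen(s);
    size_t count = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (n == 0)
            break;                  /* defensive: null character reached */
        s += n;
        left -= n;
        count++;
    }
    /* Assumption: mbsinit stands in for the report's sisinit. */
    return mbsinit(&st) ? count : (size_t)-1;
}
```

Because each call owns its mbstate_t object, several strings can be scanned in an interleaved fashion without sharing hidden state. The result depends on the LC_CTYPE in effect; without a call to setlocale the program runs in the "C" locale.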