From isaak@csac.zko.dec.com Mon May 24 09:24:21 1993
Received: from crl.dec.com by dkuug.dk with SMTP id AA05054 (5.65c8/IDA-1.4.4j for ); Tue, 25 May 1993 09:40:37 +0200
Received: by crl.dec.com; id AA25597; Tue, 25 May 93 03:40:41 -0400
Received: by easynet.crl.dec.com; id AA04788; Mon, 24 May 93 13:22:53 -0400
Received: by csac.zko.dec.com (5.65/fma-100391/BobG-15-Feb-93);id AA20736; Mon, 24 May 1993 13:24:21 -0400
Date: Mon, 24 May 1993 13:24:21 -0400
From: isaak@csac.zko.dec.com (Jim Isaak-respond via isaak@decvax.dec.com)
Message-Id: <9305241724.AA20736@csac.zko.dec.com>
To: sc22wg15@dkuug.dk
Subject: Minutes from SC22 Char Set ad hoc meeting
X-Charset: ASCII
X-Char-Esc: 29

Return-Path: codjig::wasted::SC22-request@dkuug.dk
Received: by csac.zko.dec.com (5.65/fma-100391/BobG-15-Feb-93); id AA11156; Sat, 22 May 1993 19:25:51 -0400
Date: Sat, 22 May 1993 19:25:50 -0400
From: codjig::wasted::SC22-request@dkuug.dk
To: SC22 List
Subject: (SC22.340) Minutes of Ad Hoc on character handling

Draft Minutes of SC2/SC18/SC22 Ad Hoc Meeting on Character Handling
Danish Standards Building
Hellerup, Denmark, 1993-04-21--23
----------------------------------------------------------------------
10:00h Wednesday

Phase 0: Administration

Fredie Vogelius, DS, welcomed the meeting attendees to DS. Leroy Dickey, convener of SC22/WG3, Canada, acted as the chairperson in place of Mr. Hopper, New Zealand, who could not attend.

.0 Roll Call: 21 experts from different countries, working groups of SC22, and subcommittees SC2 and SC18 attended the meeting. The list of attendees is in document Hellerup 18. The attendees can be broadly classified as those who are involved in developing programming language standards, those who are involved in the internationalization (I18N) related standards, and representatives from other subcommittees and organizations (SC2, SC18 and X/Open - a category C liaison member of SC22/WG20).
COBOL, APL, C, Prolog, Posix, I18N, WG5 and WG21 -- working groups of SC22 were represented. Other SC22 WGs were not represented. Also, SC21 was not invited to this meeting - it was felt that SC21 has issues similar to SC18 and SC22 related to character handling.

.0.1 Appointments for the meeting

.0.1.1 Secretary: V.S. Umamaheswaran, Canada, SC2, was appointed to record the highlights of the meeting. Lonnie Christensen, DS, helped with copying and typing matters.

.0.1.2 Document Register: All the documents distributed at this meeting and all the output documents will be given the numbers HELLERUP NN. See the list of documents in document register Hellerup xxx. Keld Simonsen, DS, will maintain the numbers for this meeting.

.0.1.3 Drafting Committee: Glenn Adams, USA, Mike Ksar, Convener, SC2/WG2, USA, Alan Griffee, SC18 rep, USA, Arnold Winkler, SC22/WG20 (chair of drafting committee), USA, and Keld Simonsen, SC22/WG15, Denmark, volunteered to draft the recommendations from the meeting.

.0.2 Approval of Agenda: Agenda (Hellerup 1) was adopted with the following modifications:

Phase 1: Input -- added
  .3.1 Implementation Issues of ISO/IEC 10646 / Level 3 - by Glenn Adams
  .3.2 Posix handling - by Gary Miller

Phase 2: Processing -- .1 SC22/N1239 List of Issues (Hellerup 16) was included as a key input document to be addressed.

Phase 3: Output -- .1 Resolutions was renamed .1 Recommendations.
----------------------------------------------------------------------
Phase 1: Input

.1.1 Comments from the Chair

WHAT IS THIS MEETING CONVENED FOR?

Mr. Hopper had circulated a document outlining the issues to be addressed by this meeting (refer to documents Hellerup 2, 3, 4, 5). In addition, earlier SC22 documents also list the concerns in the area of character handling (see documents Hellerup 12, 16, 17), and the chair had prepared a set of what were thought to be the concerns of this meeting (document Hellerup 15).
This ad hoc meeting will produce a set of recommendations primarily for SC22 and its working groups. If there are any items identified for other SCs of JTC1 these will be communicated to them -- Alan Griffee for SC18 and Sten Lindberg for SC2. Alan Griffee, representing SC18, will make a presentation on SC18 concerns and Mike Ksar, convener, SC2/WG2, will give a presentation on the ISO/IEC 10646 standard. A presentation by Glenn Adams on ISO/IEC 10646 Level 3 implementation issues and one by Gary Miller on how the XPG4 character handling model deals with UCS were identified.

WHAT IS THE GOAL FOR THIS MEETING?

SC22/WG20 has a work item on I18N and a Technical Report (10176) is being drafted. Character handling has been identified as an item of concern in view of the increased I18N needs and particularly in view of the new ISO/IEC 10646 standard. SC22 does not have a generic character handling model that can be used across all programming language standards within the scope of SC22. A model as to where SC22 standards fit in the overall character handling aspects of information processing is needed -- SC22 standards may be TOOLS in support of application development to satisfy I18N.

After much discussion and comments the following was set as the GOAL for this meeting:

IDENTIFY POLICIES AND PRINCIPLES REGARDING THE USE OF CHARACTER SETS, WITH EMPHASIS ON ISO/IEC 10646, AND TO IDENTIFY AREAS OF CONCERN INCLUDING AT LEAST THOSE PRESENTED AT THE SC22 PLENARY AT ELLEVOURI, FINLAND, 1992, AS DOCUMENTED IN SC22 N1239 (see Hellerup 16).

[Editor's note: From time to time, there may be a reference to simply 10646. This is to be understood as referring to ISO/IEC 10646.]
----------------------------------------------------------------------
.1.2 Collection of Written Submissions

The following documents were submitted for distribution.
Some of these were presented and others were for study at the end of the day by the attendees:

SC 2: Extract of selected pages from ISO/IEC 10646 (pre-publication copy) for use at this meeting, with a word of caution from Mike Ksar, convener, SC2/WG2 that 'the final document from ISO/IEC should be used instead of this extract' (see Hellerup 8 and 14). (Expected cost of the 754pp 10646-1 is around 350 SFr).

SC18: Concerns from SC18 (documents Hellerup 6, 9 and 10) submitted by Alan Griffee.

ISO/IEC 10646 Level 3 Implementation: Glenn Adams -- (document Hellerup 7).

Unicode in XPG4/POSIX Model (Hellerup 11) and set of associated overhead charts (Hellerup 21) from Gary Miller.

Multi-Script Ordering for Unicode, Alain LaBonte, Canada (Hellerup 13)
----------------------------------------------------------------------
.1.2: Presentation from SC2/WG2 on characters

Mike Ksar, convener, SC2/WG2, presented the latest status of ISO/IEC 10646 (see charts in Hellerup 14).

Some highlights:

The ISO/IEC 10646-1.2 ballot was approved 5/92, with 20 out of 24 P members approving, 6 out of 30 (P and O) members disapproving. Most of the comments were accommodated at the SEOUL WG2 meeting in 1992, with no 'substantive' changes. (Japan, Denmark, Turkey, Tunisia, Poland - ballots are still negative).

The document was finalized -- along with the necessary character glyphs (fonts) from the Association for Font Information Interchange (AFII), the ISO Font Registrar -- the fonts were donated by various national bodies to AFII for the explicit purpose of producing the Camera Ready Copy of ISO/IEC 10646. The glyphs in the printed ISO/IEC 10646 are not NORMATIVE.

Unicode 1.1 (which has the same set of code points and character names as in ISO/IEC 10646-1) is planned to be available on FTP (excluding glyphs).

There are 2 canonical forms -- two octet and four octet.
There are three levels of conformance:
  Level 1 - no composite strings are permitted;
  Level 2 - restricted composite strings are permitted; and
  Level 3 - unlimited composite strings are permitted, including possible duplicate representation of some characters using composed strings.
Level 2 was added during the comment resolution and subsequent editing of the DIS.

Approximately 38 percent of the Basic Multilingual Plane is currently unpopulated. Based on the known additional requirements for inclusion of characters, UCS-2 will be insufficient -- though UCS-2 satisfies most of the commonly used characters in the various scripts in the world.

Some items of concern that were noted:

a. Should SC22 recommend a specific level, canonical form etc. for consistency across all programming languages?

b. Should a recommendation be made as to a normalized (or preferred) form when there are potentially multiple representations for a given graphic symbol?

c. Should a recommendation be made to specify normalizing the ordering of combining characters when more than one is present and when any order is equivalent to the same graphic symbol?

d. How does one identify selected subsets (annex in 10646), for example in ASN.1? A proposal to use IGS (of ISO 6429) along with ISO/IEC 7350 as the sub-repertoire registry is in front of SC2. ISO-IR has escape sequences to invoke UCS-2, UTF-1, UCS-4 ... as complete code designation sequences.

For Info: CEN/CENELEC/TC 304 is considering specifying a subset of ISO/IEC 10646 to satisfy European script requirements covering Latin, Greek and Cyrillic scripts. It is currently contemplating Level 1 and has not decided between UCS-2 and UCS-4.
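
[Illustration, not part of the original minutes: concerns b and c above can be made concrete with a small sketch. It assumes Python and its standard unicodedata module, and it uses the Unicode normalization form NFD, which was defined after this meeting, as a stand-in for the "maximally decomposed" normalized form discussed in these minutes. The helper names maximally_decomposed and same_graphic_symbols are hypothetical.]

```python
import unicodedata


def maximally_decomposed(text: str) -> str:
    """Return the NFD form: fully decomposed, combining marks in canonical order."""
    return unicodedata.normalize("NFD", text)


def same_graphic_symbols(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalized form."""
    return maximally_decomposed(a) == maximally_decomposed(b)


# Concern b: two representations of the same graphic symbol.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE as one coded character
composite = "e\u0301"    # base letter e followed by COMBINING ACUTE ACCENT

# Concern c: the same two combining characters entered in different orders.
dot_then_acute = "a\u0323\u0301"  # COMBINING DOT BELOW, then COMBINING ACUTE ACCENT
acute_then_dot = "a\u0301\u0323"

print(precomposed == composite)                       # raw comparison of coded characters
print(same_graphic_symbols(precomposed, composite))   # comparison after normalization
print(same_graphic_symbols(dot_then_acute, acute_then_dot))
```

[The raw comparison treats the two encodings of the accented letter as different, while the normalized comparison treats them as the same. Whether such equivalence is appropriate for a given script or orthography is exactly the open question raised in these minutes.]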
----------------------------------------------------------------------
.1.3 Presentation from SC18 on glyphs

Alan Griffee, convener, SC18/WG8, made a presentation on the SC18 WG8, AFII, X3V1 ad hoc on character / glyphs view of the glyph mapping problems and concerns associated with handling ISO/IEC 10646 encoded data in text processing and presentation (see Hellerup 9 and 10).

The SC18 model consists of using the coded graphic character data in the revisable form for processing and for interchange. The formatting and layout process converts this to the presentation form using information on the fonts and a link between the coded character data and the font via a map between the character codes and the glyph identifiers. The font resources use the glyph identifiers as an index into the font metrics and shape information. The link between the coded characters and their corresponding glyphs is currently not standardized.

The ISO Font Registry for ISO 10036 is maintained by the Association for Font Information Interchange (AFII), USA. Whereas any short name can be used for the glyph identifiers to link the coded characters to the corresponding glyph in the selected font, the registered glyph identifiers from the Font Registry permit portability of font definitions across different systems.

Characters vs Glyphs: "A graphic character conveys meaning and has an associated shape and linguistic use" whereas "A glyph conveys shape and has an associated meaning and linguistic use". The set of GLYPHS is a superset of all GRAPHIC CHARACTERS.

While a font collection could contain the glyphs needed to present a number of 8-bit codes containing a maximum of 256 graphic characters each (with overlaps), the same is not true of fonts needed to support larger (16-bit or wider) codes -- such as ISO/IEC 10646. A font collection corresponds only to a sub-repertoire of such large codes. Some method of ensuring matching of the subsets selected and their corresponding fonts is needed.
With ISO/IEC 10646 permitting an unlimited number of composite sequences, the number of graphic symbols that may be coded using ISO/IEC 10646 is unbounded. This poses a specific concern from the presentation point of view in that a glyph corresponding to a composite sequence may not be available in a font selected.

Alan also proposed a set of recommendations to deal with the concerns expressed on page 7 of document Hellerup 9. These are to be studied by the attendees for the next day.

The following are some of the comments / questions and answers during and following Alan's presentation:

SC2 does NOT code meaning of characters - it codes only SHAPES.

If a coded font has 52 different shapes for the same glyph - say an 'ampersand sign' - there will be 52 different glyph identifiers. The current practice of issuing glyph identifiers does not permit easy identification of relationship or similarity etc. between different shapes using just glyph identifiers.

AFII is getting several font manufacturers / designers in the Far East to address the problem of the current multiple glyphs for the same CJK coded character of ISO/IEC 10646 (depending on the language / country) -- to try and arrive at unified CJK glyphs corresponding to the unified CJK coded characters of ISO/IEC 10646.

From the CEN/CENELEC effort point of view there is a missing standardized LINK between SC2 coded characters and SC18 glyph images. Currently no group in ISO is addressing this problem. It is possible to have more than one such LINK. At least a default link may be definable.

The common market has a requirement for about 2000 glyphs to support the presentation of all the symbols needed for all the European languages and their scripts.
There is a difference in viewpoint between "all glyphs must be predetermined" (the SC18 model) and "if you want to compose a character using any of the defined coded characters (using multiple accent marks for example) you can do it using ISO/IEC 10646 even though a glyph may not exist corresponding to the resultant graphic symbol" (the 10646 composite sequence model).

If a prudent application will have the ability to deal with the problem of encountering characters outside a subset, why should we bother about explicit subset identification? --- Answer: to permit predetermination of actions and deal with necessary resource handling.

Short Names for characters / glyphs / graphic symbols are important for programming languages. Also SC22/WG20 is working with 'attributes' associated with characters.

(During discussions on the following day, the following points were clarified: Font Glyph Image defined in the ISO/IEC 9541 Font Standard is a specific instance of a character image. A graphic symbol defined in 10646 can be considered to be a generic or nominal glyph image).
----------------------------------------------------------------------
.1.3.1 ISO/IEC 10646 Implementation Issues: Glenn Adams

Glenn Adams, USA, made a presentation on Implementation Issues related to ISO/IEC 10646 (particularly Levels 2 and 3). Document Hellerup 7 contains the text.

An example of an Arabic word illustrating the complexity of glyphs being put together to form a word of 7 letters was shown. The glyphs consisted of 'stroke components' from an HP stroke font (RUHA script from DECOTYPE Amsterdam?). For more complex scripts a single graphic symbol (or glyph) can be generated from several strokes. This example was to show that there need not be a one to one correspondence between a graphic symbol and a glyph. The SC18 model seems to break down at least for the calligraphic scripts.

Some highlights:

The difference between levels of 10646 was illustrated with some examples.
It was also pointed out that Level 2 rules out some composite symbols even though there is no equivalent single character coded in the standard. Some composite symbols are permitted in Level 2 though an equivalent single character has been coded (appears slightly differently) -- an Arabic character example was shown.

In Level 3, the complexity related to potentially multiple encoding of the same graphic symbol (composite sequence) is to be dealt with in the application. The equivalence between a composite sequence and a single graphic character may be valid for some scripts while it may not be for others. This raises the need for another layer of identification of 'script / orthography / language combination' ... with associated definitions of 'text elements and their equivalent composite sequences'.

Functions such as 'counting number of graphic symbols' in a UCS encoded string are dependent on the application. The beginning and ending of a composite sequence may not always be discernible from the 10646 character sequences alone -- though it is possible in MANY cases from the attributes defined in 10646. While UNICODE has the notion of 'text elements' and 'code elements', ISO/IEC 10646 does not have it.

Indexing operations on UCS strings (UCS-2 or UCS-4) based on a fixed number of octets per 'UCS character' are guaranteed always to fall on a 'coded character boundary' (unlike encodings such as SJIS or other ISO 2022 based encodings). However, to get the correct 'text element boundary' in UCS Levels 2 and 3 with composite sequences, additional linear scans will be necessary.

Composite sequences do pose a unique problem for interactive input processes -- especially if the device is hard copy based. What should be displayed for partially composed characters during the input stage of the component characters? Innovative techniques are needed. One suggestion is to show the components as single graphic symbols till all the elements of the composite sequence are entered.
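
[Illustration, not part of the original minutes: the distinction drawn above between coded characters, which fixed-width indexing always lands on, and text elements, which need a linear scan, can be sketched as follows. This is a simplified sketch in Python that assumes a text element is a base character followed by the combining characters with a non-zero canonical combining class. Real text element rules are script dependent, as these minutes stress, and the helper name text_elements is hypothetical.]

```python
import unicodedata


def text_elements(s: str) -> list:
    """Linear scan: attach each combining character to the preceding base character."""
    elements = []
    for ch in s:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch      # combining character continues the current element
        else:
            elements.append(ch)     # base character starts a new element
    return elements


# "resume" with two acute accents, each written as a composite sequence
sample = "re\u0301sume\u0301"

octets = len(sample.encode("utf-16-be"))   # length in octets, two per coded character here
code_elements = len(sample)                # length in coded characters
elements = text_elements(sample)           # length in text elements needs the scan

print(octets, code_elements, len(elements))

# Fixed-width indexing lands on a coded character boundary,
# but index 2 is in the middle of a text element (it is the combining accent).
print(unicodedata.name(sample[2]))
```

[The three lengths differ for the same string, which is why a "length" function has to state which unit it counts.]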
Special considerations also have to be given to editing processes: editing on graphic symbols (text elements) as units versus coded characters as units.

The maximally decomposed form of composite strings was suggested as one method of normalizing Level 3 (potentially multiple) encodings.

For dealing with Level 3 sequences in a Level 1 implementation model, the private use areas may be employed. For all self contained processing, and with exchange of associated equivalences to coded characters in the private use area between specific applications, Level 1 processing logic can deal with Level 3 data.

A method of default display of composite sequences on Level 1 supporting devices would be to show the characters as individual symbols, if possible with a different glyph to distinguish the combining character from a non-combining character. Most of the existing character oriented displays (as opposed to graphic displays) of today will have to be migrated.

-------------------- End of Day 1 (17:00h) -----------------

Phase 1 (Continued) DAY 2: 09:10h

.3.2 UNICODE in the XPG/POSIX Model: Gary Miller

Gary Miller, representing the X/Open I18N group, made a presentation on the XPG4 (X/Open Portability Guide Issue 4) model dealing with UCS (see Hellerup 11 and 21).

Some notes taken during the presentation (see the presentation foils for full information):

The term used in XPG for composite sequence -- "a base character plus ZERO or more combining characters" -- is not in accordance with ISO/IEC 10646 ("... ONE or more combining ..."). There is no defined term for entities consisting of more than one composite sequence as a single unit. (Indic Virama seems to have this inherent property).

The XPG model is based on the concept of LOCALEs - a locale is a repository of customs and conventions in support of I18N. A locale by itself is code-independent; however, a specific instance of the locale binds the locale information to a code.
wchar_t is the process code and any other code can be the file code. Use of x'00' as part of UCS-2 and UCS-4 encoding does pose a problem for 'string related functions' because most string handling functions treat x'00' as the string terminator. UCS-2 will be the process code. UTF-1 can be a file code.

XPG defines another UTF -- file system safe UTF (FSS-UTF). Many existing file systems (primarily Unix based) use 'NUL' and '/' characters in a hard coded (US ASCII) way. FSS-UTF (also known as UTF-2) encodes UCS-4 into a UTF consisting of 1 to 6 octets, using an efficient algorithm. The single octet UTF code is identical to the left half of ISO 8859-1. It assumes the file system is 8-bit transparent.

The term 'cluster' was proposed to represent 'one or more characters or composite sequences that are to be treated together as a single unit'. The notion of processing a cluster using specific functions is provided for in XPG4. One could use the 'private use' coding space to represent equivalents of clusters (if processing overhead can be tolerated).

The wchar_t width is initialized or fixed to be 8, 16, or 32 bits in the header of the unit to be compiled. At compilation time the appropriate 8, 16 or 32-bit width processing library is brought in. In most systems the wchar_t implementation is fixed at 16 or 32 bits.

Work to date in XPG assumes the UTF-2 form of UCS is ALWAYS the file code, including file names, parameter names etc. POSIX currently restricts file names to PC (Portable Character Set). POSIX 2B incorporates UTF-2.

A file that is to be treated as an ANSI-C text file cannot contain UCS data (because of NUL semantics). If the file is declared to be binary then it can contain UCS-2 or UCS-4. It will be up to the application code to deal with the data correctly.

MS Windows NT supports UCS-2 on the file system and deals with the 16-bit NULs correctly. AT&T plans to support UTF-2 in their file systems, with UCS-2 as the processing code.
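
[Illustration, not part of the original minutes: the FSS-UTF properties described above (1 to 6 octets per UCS-4 value, single-octet codes unchanged, no NUL or '/' octets inside multi-octet sequences) can be sketched as follows. The minutes do not give the bit layout. This sketch assumes the layout later published for UTF-8 in its original form of up to six octets, and the function name fss_utf_encode is hypothetical.]

```python
def fss_utf_encode(code_point: int) -> bytes:
    """Encode one UCS-4 value (0 .. 0x7FFFFFFF) as 1 to 6 octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if code_point < 0x80:
        return bytes([code_point])          # single octet, same value as ASCII

    # (exclusive upper limit, total octets, lead-octet marker bits)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, marker in layouts:
        if code_point < limit:
            break

    # Peel off 6 bits at a time into continuation octets 10xxxxxx.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    return bytes([marker | code_point] + tail[::-1])


# UCS-2 as a file name: the high octet of an ASCII letter is NUL.
ucs2_name = "a/b".encode("utf-16-be")
print(b"\x00" in ucs2_name)

# Multi-octet sequences use only octets >= 0x80, so NUL and '/' never appear in them.
encoded = fss_utf_encode(0x4E2D)
print(len(encoded), all(octet >= 0x80 for octet in encoded))
```

[Because every octet of a multi-octet sequence has its high bit set, a file system that treats NUL and '/' specially but is otherwise 8-bit transparent can store such names unchanged.]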
Approximately 30 percent of data is text data. The expansion factor due to UCS-2 and UCS-4 may not be a serious problem. One could employ further compression techniques if needed to store UCS-2 or UCS-4 on files.
----------------------------------------------------------------------
PHASE 2: PROCESSING

.0 Review of presentations
.1 Review of documents
.3 Other

Leroy Dickey presented a list of documents to be considered for Phase 2 of the meeting.

Document Number   Page     Contents / Items
Hellerup
---------------   ------   -------------------------------------
3                 1        Objectives of this meeting
                  4        Questions to be addressed
                  6        List of Concerns
9                 7        Recommendations
11                20       FSS UTF File System Safe UTF
16                         (SC22/N1239) Concerns from Ellevuori SC22 meeting
17                         Resolution 216
19                         Levels of Conformance (R. Weaver)
---------------   ------   -------------------------------------

Other concerns / discussions / proposals (some had dispositions):

Gary Miller proposed later that there seems to be a requirement to request that ISO/IEC 10646 should define a Level 2.5 conformance, to match up to the current misconception that Level 2 prohibits only those composite sequences for which there is an equivalent single character coded. The meeting did not act on this suggestion.

Keld Simonsen: The previous standard dealing with combining characters, ISO 6937, had a fixed repertoire, whereas the number of ISO/IEC 10646 composite sequences is unbounded. It is proposed that we 'ignore' the combination aspect or attribute and treat all characters as stand alone.

SC2 standards typically do NOT specify HOW TO USE the coded characters -- the use is left to the domain of application standards. Some guidelines are needed to deal with different interpretations of composite sequences to be equivalent to other characters or composite sequences in different orthographies.

Will I18N be satisfied if -- IF A=B -- cannot be properly dealt with when it comes to composite sequences? A 'string compare' is what is needed.
IF A=B should not be used for string compares. Issues of EQUIVALENCEs are expected to be dealt with in SC22/WG20 - I18N.

IDENTIFIERS are needed in programming languages to deal with subsets or repertoires and in support of equivalences for different orthographies.

Items of concern in Character String handling:
a. Equivalences
b. Collation / Ordering
c. Formatting (presentation) - some programming languages have the concept of 'columns' --- POSIX is working with the notion of a Generic Column.
d. Lengths of a string for different purposes -- one being for presentation. COBOL will have problems with LINES, SCREENS etc. with proportionally spaced fonts, Graphic Windows and the like.

Q: What should be the Unit of Data that SC22 has to deal with, so that one can define the requisite programming interfaces?
-- Applications are responsible for defining the processable entities.
-- Compiling of source code is an application by itself.
-- Supporting functions are needed for application use.

Q: CC-data-element (coded character data element) is defined in 10646. Can this be the unit of processing?

There seems to be a need for a model or framework for organizing the questions and concerns and focusing the discussions at this meeting -- specific to programming languages -- showing the various data flows, kinds of graphic character data that need to be handled etc.

After some exchanges of views and thoughts along the lines noted above, the chair called a recess for "QUIET TIME" so that all participants could study the various documents and come up with some imaginative thoughts re: processing them.
----------------------------------------------------------------------
The meeting resumed at 14:15.

It was suggested that we process the information in reverse order: proposals from the floor first, and then the previous documents.

Glenn Adams made the following proposals.

a. The Character Data handling domains of SC22 can be grouped into Language definitions and Run Time Support.

The issues in the Language definition domain are:
- Coded Character Representation for Programs, with possible restrictions in: Comments, Character Strings and literals, Identifiers (Keywords, Variable names, other specials?)
- Supported Data Types: Character (Code Element or Coded Character); String (CCDE) - includes: text element, text, others for different writing systems.

The issues in the Run Time Support domain are:
- Interchange: encoded form conversions, character SET conversions.
- Equivalence: Strings at code element level (strong support), weak support for text element and weaker support (for application defined string equivalences).
- Ordering: Simple (coded character binary values - for example, use in symbol tables within compilers); proper (using sort keys - to suit different criteria).
- Predications: Run Time functions: Length (octets, code elements, text elements, ....); classes (letter, digit,...), properties, memberships etc.
- Transforms: Change of class (Roman, Arabic..), change of property (case conversion) ...
- Formatting: Character to Glyph (at least some subset of SC18 model); Display width, height, shape ...; Cultural defaults
- Input: Keying methods (Kana to Kanji), Read code elements, Read Text Elements...

b. The expectation was to be able to identify the issues to be addressed, acknowledging that several solutions are possible. Some of the issues are:
- What is the processing unit: code element or text element?
- Language or Orthography bindings: granularity - Char or String?
- What is the code representation for source: UCS-2, -4, UTF-x?
- Limitations on Language Tokens
- Extended Run Time Support

Some proposals:
- Provide a writing system data type
- Provide binding mechanism for this new data type
- Global declaration of whether code element or text element is used
- Use UTF-2 for SOURCE code till full UCS-2 and UCS-4 support is made available
- Extend the lexical scanners to support LARGE Character Sets
- New data type 'text' or 'text element'
- Extend writing system support at run time

Some points captured from the discussions:

Writing System Tag: Language, Symbols and Orthography relationship. SC18 and SC22 should inform JTC1 of the requirement for this new type. SC2 is to define coding of symbols (within current scope).

10646 code elements can be used to describe text elements. Text elements are part of the 'contents' of documents.

Country / territory are usually independent of 'Language/orthography' collections (I18N work in WG20).

Programming Language standards are written in a Code Independent manner NOW. One could increase the repertoires to include all human languages. The keywords etc. can be limited to UCS Level 1 only; further extension by tagging source files with writing system tags is possible.
----------------------------------------------------------------------
Would it be useful to focus further discussions and the outcome of the meeting on:
a: Existing systems and Data (SHORT TERM)
b: Future Ideal situation (LONG TERM)

It was pointed out that most of the identified issues are applicable to character handling in general (to all programming languages), independent of the specific code used (i.e. 10646). SC22 has a tech report on designing programming languages, TR 10176. This report incorporates much of Glenn Adams's proposals.

Answers to the questions posed above from the different SC22 WG reps present were sought.
Ann Wallace, convener of WG4 - COBOL: Any changes required of existing ISO standards on programming languages are by default LONG TERM, simply because it takes about 7 years for each new version. A SHORT TERM can only be an agreement or a request to produce technical reports that indicate directions -- if a WG is planning an addendum the contents of the TR can be taken as directions -- towards providing consistency across different programming languages.

The COBOL group is preparing a CD for 1994. COBOL can support UCS Level 1 in the next release -- the group has a view of UCS Level 1 and a sense of what needs to be done for Levels 2 and 3. It has to be consistent with the existing single octet (byte) support; a new data type is needed for UCS Level 1.

COBOL knows today's records and how to describe the fields in the records. Japanese is handled using the 'National Data Type', which describes the code elements in number of octets. A variable number of octets per code element is not well defined. A new unit called POSITION (defined by implementations) is currently defined. Consistency and porting across implementations is up to implementations. Column could be a unit of measure of position. The link between octets and columns has been disconnected. However, the work has not gone far enough into all aspects of UCS Levels 2 or 3.

Another aspect is the storage requirements associated with variable length fields and records - currently dealt with using maximum record length. COBOL is based on record descriptors in a well-defined manner. Another possible model is to normalize 'text elements' also to internal fixed-width equivalents and process them.

Leroy Dickey, as APL WG convener, stated that his questions are: Which elements from 10646 should be taken into APL (if so chosen) -- what are they and what should be the considerations for taking them in?
Glenn Adams, through an example in Vietnamese -- using the word consisting of: ' t h a-caron-acute n g ' -- illustrated the need for processing specific definitions. The complexity was due to 't h' being processed as a digraph, while a-caron-acute can be represented in different ways on different displays (depending on their capabilities), and processed differently in different functions dealing with the string. There is a need to distinguish between code element processing and higher level collections.

The need for A Canonical Form of representing composite sequences when multiple representations are possible was emphasized. The Unicode consortium is looking at defining atomic Unicode consisting of Maximally Decomposed coding. Programming languages dealing with strings as a basic operand of functions need to deal with a new data type called 'text element'.

Mike Ksar pointed out that SC22 is undertaking defining new data types. The terminology used should NOT re-define already defined SC2 standards terminology.

The set of problems in character handling is common across several different encodings of characters. The problems unique to 10646 are: code width, the size of the character set and composite sequences.

Can UCS be used as a code for inter-language calls? Can a UCS Level 1 only implementation allow proceeding to Levels 2 and 3 support in the future?

Pawel Molenda, on behalf of the C language, said that another C standard is not expected for another 7 years.
----------------------------------------------------------------------
Antoni Drudis presented his 'Non Goals of this Meeting' chart:

a. Permanent ad-hoc group
The ad hoc meeting is a one-shot deal. No more ad hoc meetings - should be the recommendation.

b. Somebody else should solve my problems
Programming language work groups are the closest to the user. SC18, SC2 and SC22/WG20 are all further away from the user than work groups like WG4 and WG14 of SC22. The programming language work groups should get their act together.

c. Make the Ad Hoc a traveling NLS School
This ad hoc meeting was primarily a tutorial / selling of 10646 to SC22. Cross language solutions must be developed and brought to the attention of the SC22 plenary. There have been ad hoc groups on character handling at almost every SC22 plenary.

It was pointed out that 10646 has brought only two NEW issues to be dealt with -- the large size of the character set and the composite sequences associated with Levels 2 or 3.
----------------------------------------------------------------------
R. Weaver updated document Hellerup 19 and distributed it as Hellerup 23. This document tried to capture all the issues and concerns that were expressed during the meeting and from the various documents listed earlier, to provide a framework for further progress of the meeting.

A list of issues was drawn up on a foil and handed to the drafting committee. The drafting committee was given the following basic directions: it was to draw up a summary of issues and proposed recommendations for the main meeting to consider the next day.

A foil containing TWO primary issues was also created in an interactive q & a mode.

The report from this meeting is to SC22 with copies to other SCs including SC2 and SC18 (who were represented).

-------------------- End of Day 2 (17:00h) -----------------

DAY 3 - Phase 3: OUTPUT

The meeting resumed at 10:30h, giving the drafting committee the additional time it needed. The drafting committee's output was briefly presented by Arnold Winkler. There were considerable discussions, leading more or less to re-writing the recommendations during the meeting.

The meeting report was distributed as Hellerup 26, containing a summary of issues and a set of recommendations. These recommendations were further refined and Hellerup 26-1 was circulated by the closing time of 16:00h. Several attendees had to leave at this official closing time of the meeting.
Further discussions on Hellerup 26-1 resulted in some editorial changes and the final report was Hellerup 26-2.

The meeting chair Leroy Dickey was to review the wording of some of the recommendations with the Chair of SC22 and modify as necessary -- some phrases like "SC22 must...", and any grammatical corrections needed. Keld will distribute the latest version by E-Mail, with feedback of minor and editorial comments expected by 8th of May 1993.

The meeting report will be addressed to SC22, with copies to SC2 and SC18. Distribution to other SCs (such as SC21 and SC24) should be verified through Joe Cote (Chair of SC22). Leroy Dickey is to follow this up after the meeting.

The meeting also thanked Danish Standards for organizing the meeting, providing the copying and typing support and for all the refreshments and lunches provided.

(These draft minutes are being sent to Leroy Dickey via E-Mail for any trimming, editorials etc. before distribution by the recording secretary. The meeting report had captured the essence of the meeting and its recommendations, even though some disagreements on the exact wordings of issues and concerns were expressed.)

-------------------- END OF MINUTES -----------------------