====================================
Summary of ISO-2022
version 1.0.1 (2002-Oct-24-Thu)
Adam M. Costello
http://www.nicemice.net/amc/charset/
====================================


Purpose
=======

ISO-2022, also known as ECMA-35, defines the structure of character
sets in such a way that multiple character sets can be used in the same
byte stream, using escape sequences to switch between them. ISO
maintains an International Register of Coded Character Sets to be Used
with Escape Sequences (see ISO-2375). No single code is expected to
understand all the registered character sets or all the defined escape
sequences; ISO-2022 defines a large number of allowed mechanisms, and a
code may choose which of them to support.


Model
=====

A "coded character set" (or just "character set") is a mapping from
abstract characters to "code points", which are 7-bit values or tuples
of 7-bit values. A coded character set is either a "coded graphic
character set", in which case it contains only graphic characters, or a
"set of coded control functions", in which case it contains only control
characters.

There are six "code elements": G0..G3 and C0,C1. Each of G0..G3 can be
associated with a graphic character set by "designating" the character
set into the code element. Similarly, a set of control functions can be
designated into each of C0,C1.

There are four "code areas": GL,GR,CL,CR. GL can be associated with
one of G0..G3, and GR can be associated with one of G1..G3, by
"invoking" (or "shifting") the code element into the area. But C0 and
C1 are permanently invoked into areas CL and CR respectively (which is
pointless conceptual complexity).

Characters are accessed using bytes (or tuples of bytes) as indexes into
the code areas. In the xx/yy notation, xx is the high four bits of the
byte and yy is the low four bits, both written in decimal, so
xx/yy = 16*xx + yy.

           byte range           code area
    -----------------------     ---------
      0..31  = 00/00..01/15        CL
     32..127 = 02/00..07/15        GL
    128..159 = 08/00..09/15        CR
    160..255 = 10/00..15/15        GR

For 7-bit codes, areas CR and GR do not exist (or you can think of them
as being inaccessible without an 8th bit).


Graphic characters
==================

Every graphic character in a set corresponds to a c-tuple of 7-bit
values, where c is constant for any given set. A character set that
uses 1-tuples is called a "single-byte" set, while a character set that
uses longer tuples is called a "multi-byte" set. For some sets, each
entry in the tuple can take on any of the 94 values in the range
33..126 = 02/01..07/14. For all other sets, each entry in the tuple can
take on any of the 96 values in the range 32..127 = 02/00..07/15.

The characters SPACE and DELETE are special. DELETE is officially
neither a graphic character nor a control character, though it is
usually considered a control character (it is historical baggage).
SPACE is a graphic character, but neither it nor DELETE may be contained
in any graphic character set. SPACE and DELETE are available via the
bytes 32=02/00 and 127=07/15, respectively, whenever the character set
in GL uses 94-value tuple entries.

Characters in GL are accessed by tuples of bytes in the range
32..127 = 02/00..07/15, while characters in GR are accessed by tuples of
bytes in the range 160..255 = 10/00..15/15. In both cases the lowest
7 bits of the bytes select the character from the set.
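
The byte-range table and the "lowest 7 bits" rule above can be sketched
in a few lines of Python. This is only an illustration under simple
assumptions: the function names colrow, code_area and graphic_index are
made up for this example and do not come from ISO-2022.

```python
def colrow(byte):
    """Format a byte in xx/yy notation (high 4 bits / low 4 bits, decimal)."""
    if not 0 <= byte <= 255:
        raise ValueError("not a byte: %r" % (byte,))
    return "%02d/%02d" % (byte >> 4, byte & 0x0F)


def code_area(byte):
    """Return the code area (CL, GL, CR or GR) that an 8-bit byte indexes."""
    if not 0 <= byte <= 255:
        raise ValueError("not a byte: %r" % (byte,))
    if byte < 32:
        return "CL"
    if byte < 128:
        return "GL"
    if byte < 160:
        return "CR"
    return "GR"


def graphic_index(byte):
    """Return the 7-bit value that selects a tuple entry in GL or GR."""
    if code_area(byte) not in ("GL", "GR"):
        raise ValueError("byte %s is not in GL or GR" % colrow(byte))
    return byte & 0x7F  # the lowest 7 bits select the character
```

For instance, the bytes 02/04 (in GL) and 10/04 (in GR) both yield the
7-bit value 36, so they select the same position in whichever sets are
currently invoked into GL and GR.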

A graphic character set is identified by:

  * a tuple length flag: 0 means single-byte; 1 means multi-byte
  * a range flag: 0 means 94-value entries; 1 means 96-value entries
  * an intermediate sequence of zero or more bytes
      first is in the range 33..35 = 02/01..02/03 (span 3)
      rest are in the range 32..47 = 02/00..02/15 (span 16)
  * a final byte

The range of the final byte depends on the tuple length:

                        final byte
     tuple     -----------------------------
    length     span    range
    ------     ----    -----------------------
       1        62      64..125 = 04/00..07/13
       2        32      64..95  = 04/00..05/15
       3        16      96..111 = 06/00..06/15
    >= 4        14     112..125 = 07/00..07/13

(Final byte 127=07/15 is forbidden, and 126=07/14 means the empty
character set.)

To construct an escape sequence for designating a character set into
code element Gi (where i is in 0..3):

    ESCAPE     = 27 = 01/11 = %00011011
    multi_byte = 36 = 02/04 = %00100100
    designator = 40 + 4 * range_flag + i = %00101rii

    For single-byte sets:
        ESCAPE designator intermediate_sequence final_byte

    For multi-byte sets:
        ESCAPE multi_byte designator intermediate_sequence final_byte

    Exception for historical reasons:
        instead of:   ESCAPE multi_byte 02/08 04/00..02
        omit 02/08:   ESCAPE multi_byte 04/00..02
        (The omitted byte was added after these sets were registered.)

The sequences for invoking code element Gi into a code area are:

    G0 -> GL    00/15        = 15
    G1 -> GL    00/14        = 14
    G2 -> GL    ESCAPE 06/14 = 27 110
    G3 -> GL    ESCAPE 06/15 = 27 111
    G1 -> GR    ESCAPE 07/14 = 27 126
    G2 -> GR    ESCAPE 07/13 = 27 125
    G3 -> GR    ESCAPE 07/12 = 27 124

There are also sequences for accessing a single character from G2 or G3
without invoking them into GL or GR. For 7-bit codes:

    G2    ESCAPE 04/14 tuple = 27 78 tuple
    G3    ESCAPE 04/15 tuple = 27 79 tuple

For 8-bit codes, it must be specified whether the bytes in the tuple
should have bit 7 set, and whether C1 control functions should be
accessed via 08/00..09/15 rather than ESCAPE 04/00..05/15 (see the
section on control functions, below).
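
The designation and invocation rules above can be sketched in Python as
follows. This is a minimal illustration, not a complete implementation:
the names designate_graphic, MULTI_BYTE and INVOKE are invented for the
example, the designator byte is taken to be 40 + 4 * range_flag + i (so
that G0 with 94-value entries gives 02/08, the byte named in the
historical exception), and the final byte is checked only against the
overall range 64..126, not against the tuple length.

```python
ESC = 27         # 01/11
MULTI_BYTE = 36  # 02/04, marks a multi-byte set


def designate_graphic(i, final_byte, multi_byte=False, range_flag=0,
                      intermediate=b""):
    """Build the escape sequence designating a graphic set into Gi."""
    if i not in (0, 1, 2, 3):
        raise ValueError("i must be in 0..3")
    if range_flag not in (0, 1):
        raise ValueError("range_flag must be 0 or 1")
    if not 64 <= final_byte <= 126:
        raise ValueError("final byte must be in 64..126")
    inter = bytes(intermediate)
    if inter and not 33 <= inter[0] <= 35:       # 02/01..02/03
        raise ValueError("bad first intermediate byte")
    if any(not 32 <= b <= 47 for b in inter[1:]):  # 02/00..02/15
        raise ValueError("bad intermediate byte")

    designator = 40 + 4 * range_flag + i
    seq = [ESC]
    if multi_byte:
        seq.append(MULTI_BYTE)
    # Historical exception: multi-byte sets with final byte 04/00..04/02
    # designated into G0 omit the 02/08 designator byte.
    legacy = (multi_byte and designator == 40 and not inter
              and 64 <= final_byte <= 66)
    if not legacy:
        seq.append(designator)
    return bytes(seq) + inter + bytes([final_byte])


# Sequences that invoke a code element into a code area.
INVOKE = {
    ("G0", "GL"): bytes([15]),
    ("G1", "GL"): bytes([14]),
    ("G2", "GL"): bytes([ESC, 110]),
    ("G3", "GL"): bytes([ESC, 111]),
    ("G1", "GR"): bytes([ESC, 126]),
    ("G2", "GR"): bytes([ESC, 125]),
    ("G3", "GR"): bytes([ESC, 124]),
}
```

For example, designate_graphic(0, 0x42) gives the three bytes
ESCAPE 02/08 04/02, while designate_graphic(0, 0x42, multi_byte=True)
falls under the historical exception and gives ESCAPE 02/04 04/02.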

It may seem curious that iso-2022-jp uses designation rather than
invocation to select from among three character sets, but this is
presumably because the Japanese character sets were registered before
G1..G3 were allowed for multi-byte sets.


Control functions
=================

There are single control functions, and sets of control functions.

Single control functions are either standardized or registered. A
standardized single control function is identified by:

  * a final byte in the range 96..126 = 06/00..07/14 (span 31)

and accessed via an escape sequence of the form:

    ESCAPE final_byte

A registered single control function is identified by:

  * an intermediate sequence of zero or more bytes
      first is in the range 33..35 = 02/01..02/03 (span 3)
      rest are in the range 32..47 = 02/00..02/15 (span 16)
  * a final byte in the range 64..126 = 04/00..07/14 (span 63)

and accessed via an escape sequence of the form:

    ESCAPE 02/03 intermediate_sequence final_byte

A set of control functions is either "primary" or "supplementary". A
primary set can be designated only into C0, while a supplementary set
can be designated only into C1. A primary set must include the ESCAPE
character at position 27=01/11, and a supplementary set must not include
ESCAPE. Each function in a set corresponds to a 5-bit value in the
range 0..31 = 00/00..01/15.

Functions in the CL area are accessed by bytes in the range
0..31 = 00/00..01/15, while functions in the CR area are accessed by
bytes in the range 128..159 = 08/00..09/15. In both cases the lowest
5 bits of the byte select the function.

For 7-bit codes the CR area does not exist, but fortunately the C1
element can be accessed via escape sequences:

    ESCAPE 04/00..05/15 = 27 64..95

The lowest 5 bits of the last byte select the function. 8-bit codes
must use either single bytes to access CR, or escape sequences to access
C1, but not both.
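
The two ways of reaching a C1 function described above differ only in
their framing: the same 5-bit value is carried either by a single byte
in 128..159 or by the byte after ESCAPE in 64..95. A small Python sketch
of the conversion follows; the function names c1_to_escape and
escape_to_c1 are hypothetical and chosen just for this example.

```python
ESC = 27  # 01/11


def c1_to_escape(byte):
    """Rewrite a CR-area byte (128..159) as the 7-bit form ESCAPE 64..95."""
    if not 128 <= byte <= 159:
        raise ValueError("not a CR-area byte: %r" % (byte,))
    return bytes([ESC, 64 + (byte & 0x1F)])  # keep the lowest 5 bits


def escape_to_c1(seq):
    """Rewrite a two-byte ESCAPE 64..95 sequence as a CR-area byte."""
    seq = bytes(seq)
    if len(seq) != 2 or seq[0] != ESC or not 64 <= seq[1] <= 95:
        raise ValueError("not an ESCAPE 04/00..05/15 sequence")
    return 128 + (seq[1] & 0x1F)  # keep the lowest 5 bits
```

For example, the byte 08/14 = 142 corresponds to ESCAPE 04/14 = 27 78,
which is the 7-bit sequence listed earlier for accessing G2.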

A set of control functions is identified by:

  * a type flag: 0 means primary; 1 means supplementary
  * an intermediate sequence of zero or more bytes
      first is in the range 33..35 = 02/01..02/03 (span 3)
      rest are in the range 32..47 = 02/00..02/15 (span 16)
  * a final byte in the range 64..126 = 04/00..07/14 (span 63)

To construct an escape sequence for designating a set of control
functions into code element C0 or C1:

    designator = 33 + type_flag = %001000ut
                 (t is type_flag, u is its complement)

    ESCAPE designator intermediate_sequence final_byte
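
As a rough Python sketch of the construction above (the function name
designate_control is invented for this example, and the intermediate
byte ranges are taken from the xx/yy values 02/01..02/03 and
02/00..02/15):

```python
ESC = 27  # 01/11


def designate_control(type_flag, final_byte, intermediate=b""):
    """Build the escape sequence designating a control set into C0 or C1.

    type_flag 0 selects a primary set (C0), 1 a supplementary set (C1).
    """
    if type_flag not in (0, 1):
        raise ValueError("type_flag must be 0 or 1")
    if not 64 <= final_byte <= 126:
        raise ValueError("final byte must be in 64..126")
    inter = bytes(intermediate)
    if inter and not 33 <= inter[0] <= 35:         # 02/01..02/03
        raise ValueError("bad first intermediate byte")
    if any(not 32 <= b <= 47 for b in inter[1:]):  # 02/00..02/15
        raise ValueError("bad intermediate byte")
    designator = 33 + type_flag  # 02/01 for C0, 02/02 for C1
    return bytes([ESC, designator]) + inter + bytes([final_byte])
```

With no intermediate bytes, designate_control(0, 64) gives
ESCAPE 02/01 04/00 and designate_control(1, 64) gives
ESCAPE 02/02 04/00.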