=========================================================
Multiplexed character set Transformation Format 8 (MTF-8)
version 0.1.0 (2002-Oct-24-Thu)
Adam M. Costello
http://www.nicemice.net/amc/charset/
=========================================================


Motivation
==========

ISO-2022 (Character Code Structure and Extension Techniques) and
ISO-10646 (Universal Multiple-Octet Coded Character Set, UCS a.k.a.
Unicode) differ fundamentally in philosophy.

The basic philosophy of ISO-2022 is incorporation and peaceful
coexistence of many character sets.  So if there is a letter "A" in
eighty-four registered character sets, then there are eighty-four ways
to represent the letter "A".

The basic philosophy of Unicode is unification.  So there is only one
way to represent the letter "A".

The main disadvantage of the coexistence approach is the multiple
representations.  Two characters from different sets might be
effectively equal (e.g. both "A") but a decoder won't know that if it
doesn't support both character sets.  So an encoder needs to have some
idea of which character sets will be supported by the decoder.

The main advantage of the coexistence approach is compatibility with
existing software and standards.  For example, fonts almost always use
the ISO-2022-compatible character sets, so software that reads Unicode
text usually needs to translate the characters back into the ISO-2022
sets.  But it also needs additional hints, like language tags, to
decide which font to use when a given character can be found in many
fonts of different character sets.

Setting aside the unification vs. coexistence debate, notice that the
Unicode character set has an encoding called UTF-8 (UCS Transformation
Format 8) with several attractive properties:

From RFC 2279:

  - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
    correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
    consequence is that a plain ASCII string is also a valid UTF-8
    string.
  - US-ASCII values do not appear otherwise in a UTF-8 encoded
    character stream.  This provides compatibility with file systems
    or other software (e.g. the printf() function in C libraries) that
    parse based on US-ASCII values but are transparent to other
    values.

  - Round-trip conversion is easy between UTF-8 and either of UCS-4,
    UCS-2.

  - The first octet of a multi-octet sequence indicates the number of
    octets in the sequence.

  - The octet values FE and FF never appear.

  - Character boundaries are easily found from anywhere in an octet
    stream.

  - The lexicographic sorting order of UCS-4 strings is preserved.  Of
    course this is of limited interest since the sort order is not
    culturally valid in either case.

  - The Boyer-Moore fast search algorithm can be used with UTF-8 data.

  - UTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid UTF-8 is low, diminishing
    with increasing string length.

It might be interesting to define an MTF-8 encoding (Multiplexed
character set Transformation Format 8) with similar properties, but
using the coexistence approach rather than the unification approach.

The attractive properties for MTF-8 are:

  - US-ASCII characters are represented by themselves in MTF-8, so a
    plain ASCII string is also a valid MTF-8 string.

  - US-ASCII values do not appear otherwise in an MTF-8 encoded
    character stream.  This provides compatibility with file systems
    or other software (e.g. the printf() function in C libraries) that
    parse based on US-ASCII values but are transparent to other
    values.

  - Character codes from ISO-2022-compatible graphic character sets
    appear unaltered inside the MTF-8 sequences representing them
    (except that bit 7 is set).  These sequences use only octets in
    the range 160..255, and octets in this range are never used for
    any other purpose in MTF-8.  This makes round-trip conversion
    between MTF-8 and ISO-2022 very easy.
  - Although it is impossible to have the first octet of a multi-octet
    sequence indicate the number of octets in the sequence (because
    ISO-2022 places no bound on the length of the string used to
    identify a character set), it is at least true that the last octet
    of a sequence can be identified without having to look at the next
    octet in the stream.

  - Unfortunately, it's not practical to define MTF-8 in such a way
    that one or two octet values never appear.

  - Character boundaries are easily found from anywhere in an octet
    stream.

  - It makes no sense to speak of a lexicographic ordering of ISO-2022
    strings because of the nature of the encoding.  Even if you
    consider the letter "A" to be distinct in each different
    registered character set, there are multiple ways to encode the
    Latin-2 "A" (for example) inside a string.  In MTF-8 there is a
    unique encoding for each (character set, character) pair.

  - The Boyer-Moore fast search algorithm can be used with MTF-8 data.
    (In other words, a substring of an MTF-8 string that begins or
    ends in the middle of a multi-octet sequence can never itself be
    an MTF-8 string.)

  - MTF-8 strings can be fairly reliably recognized as such by a
    simple algorithm, i.e. the probability that a string of characters
    in any other encoding appears as valid MTF-8 is low, diminishing
    with increasing string length.


MTF-8 Specification
===================

An octet of the form 0xxxxxxx (0..127) always represents a US-ASCII
character.

A multi-octet sequence has the form SSI*G+ (two starting octets
followed by zero or more intermediate octets followed by one or more
graphic octets).

    S = 1000xxxx              (128..143)
    I = 1001xxxx              (144..159)
    G = 101xxxxx or 11xxxxxx  (160..255)

The starting and intermediate octets specify the graphic character set
from which the graphic octets will select a character.

The two starting octets have the following structure:

    1000mrff 1000ffff

m is 0 if there is one graphic octet, 1 if there is more than one
graphic octet.
This bit corresponds to the absence/presence of 36=02/04 (decimal 36,
written 02/04 in column/row notation) as the second byte of the
ISO-2022 designation escape sequence.

r is 0 if the graphic octets range from 161..254, 1 if the graphic
octets range from 160..255.  This bit corresponds to bit 2 of the
second byte (for single-byte character sets) or third byte (for
multi-byte character sets) of the ISO-2022 designation escape
sequence.  That byte has the form 00101ree, where ee selects which
code element (G0..G3) to associate with the character set.

ffffff is the lower six bits of the final byte of the ISO-2022
designation escape sequence (the upper two bits of that byte are 01).

If m is 1, the most significant two f bits (the lowest two bits of the
first starting octet) determine the number of graphic octets:

    0x => 2
    10 => 3
    11 => 4

MTF-8 currently has no way of representing characters from five-byte
(or more) character sets, but I know of no registered character sets
that use even three bytes, let alone five.  A four-byte set would have
room for about 78 million characters.  For comparison, Unicode has
room for only about one million characters.

The intermediate octets, if present, have the form:

    1001iiii

MTF-8 intermediate octets correspond to the ISO-2022 intermediate
bytes used by the registration authority in the character set
designation escape sequence, immediately preceding the final byte.  An
MTF-8 intermediate octet of 1001iiii corresponds to an ISO-2022
intermediate byte of 0010iiii.  Note that ISO-2022 places an
additional constraint on the first of these intermediate bytes, which
implies that the first MTF-8 intermediate octet must be 10010001,
10010010, or 10010011.  This allows for a future escape hatch, using
the other thirteen possible values for the first intermediate byte.


MTF-8 Examples
==============

Any character X from the upper half of the Latin-1 character set
(ISO-8859-1) is represented as 10000100 10000001 X (132 129 X).  The
ISO-2022 designation sequence is ESC 001011xx 01000001.
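
The starting octets in the Latin-1 example can be derived mechanically
from the m, r, and f fields.  The following Python sketch shows one way
to do that.  The function name mtf8_prefix and its parameters are
invented here for illustration and are not part of the MTF-8 proposal.
It assumes the caller already knows whether the set is multi-byte and
whether its graphic octets range over 160..255, and it does not check
the extra ISO-2022 constraint on the first intermediate byte.

```python
def mtf8_prefix(final_byte, intermediates=b"", multibyte=False, is_96_set=False):
    """Build the starting and intermediate octets for one character set.

    Hypothetical helper, not part of the MTF-8 text.
    final_byte    -- final byte of the ISO-2022 designation (upper bits 01)
    intermediates -- registration intermediate bytes (0010iiii), if any
    multibyte     -- True if a character uses more than one graphic octet
    is_96_set     -- True if graphic octets range over 160..255
    """
    if not 0x40 <= final_byte <= 0x7F:
        raise ValueError("final byte must have the form 01ffffff")
    if any(not 0x20 <= b <= 0x2F for b in intermediates):
        raise ValueError("intermediate bytes must have the form 0010iiii")
    f = final_byte & 0x3F                 # ffffff: lower six bits
    m = 1 if multibyte else 0
    r = 1 if is_96_set else 0
    first = 0x80 | (m << 3) | (r << 2) | (f >> 4)   # 1000mrff
    second = 0x80 | (f & 0x0F)                      # 1000ffff
    inter = bytes(0x90 | (b & 0x0F) for b in intermediates)  # 1001iiii
    return bytes([first, second]) + inter


# Upper half of Latin-1: final byte 01000001, single-byte, range 160..255.
print(list(mtf8_prefix(0x41, is_96_set=True)))  # -> [132, 129]
```

The returned prefix is followed by the graphic octets of the character,
each with bit 7 set.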

Any character X from the Cuban variant of ASCII (ISO-646-CU) is
represented as 10000000 10000001 10010001 X (128 129 145 X).  The
ISO-2022 designation sequence is ESC 001010xx 00100001 01000001.

Any character XY from the Japanese Kanji character set (JIS-X-0208) is
represented as 10001000 10000010 X Y (136 130 X Y).  The ISO-2022
designation sequence is ESC 36=02/04 001010xx 01000010.  (Except that
if the third byte would be 00101000, it is omitted for historical
reasons.)

A two-byte character requires four octets in MTF-8, plus an octet for
each extra intermediate byte, but I know of no registered two-byte
character sets that use extra intermediate bytes.

A one-byte character requires three octets in MTF-8, plus an octet for
each extra intermediate byte, but the most common sets use no extra
intermediate bytes, and the others use only one (as far as I know).

So MTF-8 tends to use one more byte than UTF-8 for non-ASCII
characters, but it does convey more information.
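
To make the SSI*G+ structure concrete, here is a minimal Python sketch
of a splitter that walks an octet string and separates it into
characters, using only the rules given in the specification above.
The names graphic_octet_count and split_mtf8 are hypothetical, chosen
for this illustration.  The sketch assumes that every character
carries its own starting octets, as in the examples, and it does not
enforce the restriction on the first intermediate octet.

```python
def graphic_octet_count(first_start):
    """Number of graphic octets implied by the first starting octet 1000mrff."""
    if not first_start & 0x08:        # m = 0: exactly one graphic octet
        return 1
    top_f = first_start & 0x03        # most significant two f bits
    if top_f == 0b10:
        return 3
    if top_f == 0b11:
        return 4
    return 2                          # 0x => 2


def split_mtf8(data):
    """Split MTF-8 bytes into (prefix, graphic) pairs, one per character.

    US-ASCII characters come back with an empty prefix.  Raises
    ValueError on input that does not match the SSI*G+ form.
    """
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:                            # 0xxxxxxx: US-ASCII
            chars.append((b"", data[i:i + 1]))
            i += 1
            continue
        start = i
        pair = data[i:i + 2]                          # two starting octets
        if len(pair) < 2 or not all(0x80 <= b <= 0x8F for b in pair):
            raise ValueError(f"bad starting octets at offset {i}")
        i += 2
        while i < len(data) and 0x90 <= data[i] <= 0x9F:
            i += 1                                    # intermediate octets
        prefix = data[start:i]
        count = graphic_octet_count(prefix[0])
        graphic = data[i:i + count]
        # r = 1 allows 160..255; r = 0 allows only 161..254.
        low, high = (0xA0, 0xFF) if prefix[0] & 0x04 else (0xA1, 0xFE)
        if len(graphic) != count or not all(low <= g <= high for g in graphic):
            raise ValueError(f"bad graphic octets at offset {i}")
        chars.append((prefix, graphic))
        i += count
    return chars


# "A", one Latin-1 character, and one JIS-X-0208 character.
sample = b"A" + bytes([132, 129, 0xE9]) + bytes([136, 130, 0xB0, 0xA1])
for prefix, graphic in split_mtf8(sample):
    print(list(prefix), list(graphic))
```

Because the starting octets alone determine how many graphic octets
follow, the splitter never needs to look past the end of a sequence to
know that it is complete.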