=========================================================
Multiplexed character set Transformation Format 8 (MTF-8)
version 0.1.0 (2002-Oct-24-Thu)
Adam M. Costello
http://www.nicemice.net/amc/charset/
=========================================================


Motivation
==========

ISO-2022 (Character Code Structure and Extension Techniques) and
ISO-10646 (Universal Multiple-Octet Coded Character Set, or UCS,
a.k.a. Unicode) differ fundamentally in philosophy.

The basic philosophy of ISO-2022 is incorporation and peaceful
coexistence of many character sets.  So if there is a letter "A" in
eighty-four registered character sets, then there are eighty-four ways
to represent the letter "A".

The basic philosophy of Unicode is unification.  So there is only one
way to represent the letter "A".

The main disadvantage of the coexistence approach is the multiple
representations.  Two characters from different sets might be
effectively equal (e.g. both "A") but a decoder won't know that if it
doesn't support both character sets.  So an encoder needs to have some
idea of which character sets will be supported by the decoder.

The main advantage of the coexistence approach is compatibility with
existing software and standards.  For example, fonts almost always use
the ISO-2022-compatible character sets, so software that reads Unicode
text usually needs to translate the characters back into the ISO-2022
sets.  But it also needs additional hints, like language tags, to decide
which font to use when a given character can be found in many fonts of
different character sets.

Setting aside the unification vs. coexistence debate, notice that the
Unicode character set has an encoding called UTF-8 (UCS Transformation
Format 8) with several attractive properties, listed in RFC 2279:

    Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
    correspond to octets 00 to 7F (7 bit US-ASCII values).  A direct
    consequence is that a plain ASCII string is also a valid UTF-8
    string.

    US-ASCII values do not appear otherwise in a UTF-8 encoded character
    stream.  This provides compatibility with file systems or other
    software (e.g. the printf() function in C libraries) that parse
    based on US-ASCII values but are transparent to other values.

    Round-trip conversion is easy between UTF-8 and either of UCS-4,
    UCS-2.

    The first octet of a multi-octet sequence indicates the number of
    octets in the sequence.

    The octet values FE and FF never appear.

    Character boundaries are easily found from anywhere in an octet
    stream.

    The lexicographic sorting order of UCS-4 strings is preserved.  Of
    course this is of limited interest since the sort order is not
    culturally valid in either case.

    The Boyer-Moore fast search algorithm can be used with UTF-8 data.

    UTF-8 strings can be fairly reliably recognized as such by a simple
    algorithm, i.e. the probability that a string of characters in any
    other encoding appears as valid UTF-8 is low, diminishing with
    increasing string length.

It might be interesting to define an MTF-8 encoding (Multiplexed
character set Transformation Format 8) with similar properties, but
using the coexistence approach rather than the unification approach.
The attractive properties for MTF-8 are:

    US-ASCII characters are represented by themselves in MTF-8, so a
    plain ASCII string is also a valid MTF-8 string.

    US-ASCII values do not appear otherwise in an MTF-8 encoded
    character stream.  This provides compatibility with file systems
    or other software (e.g. the printf() function in C libraries) that
    parse based on US-ASCII values but are transparent to other values.

    Character codes from ISO-2022-compatible graphic character sets
    appear unaltered inside the MTF-8 sequences representing them
    (except that bit 7 is set).  These sequences use only octets in the
    range 160..255, and octets in this range are never used for any
    other purpose in MTF-8.  This makes round-trip conversion between
    MTF-8 and ISO-2022 very easy.

    Although it is impossible to have the first octet of a multi-octet
    sequence indicate the number of octets in the sequence (because
    ISO-2022 places no bound on the length of the string used to
    identify a character set), it is at least true that the last octet
    of a sequence can be identified without having to look at the next
    octet in the stream.

    Unfortunately, it's not practical to define MTF-8 in such a way that
    one or two octet values never appear.

    Character boundaries are easily found from anywhere in an octet
    stream.

    It makes no sense to speak of a lexicographic ordering of ISO-2022
    strings because of the nature of the encoding.  Even if you consider
    the letter "A" to be distinct in each different registered character
    set, there are multiple ways to encode the Latin-2 "A" (for example)
    inside a string.  In MTF-8 there is a unique encoding for each
    <character set, code point> pair.

    The Boyer-Moore fast search algorithm can be used with MTF-8 data.
    (In other words, a substring of an MTF-8 string that begins or ends
    in the middle of a multi-octet sequence can never itself be an MTF-8
    string.)

    MTF-8 strings can be fairly reliably recognized as such by a simple
    algorithm, i.e. the probability that a string of characters in any
    other encoding appears as valid MTF-8 is low, diminishing with
    increasing string length.


MTF-8 Specification
===================

An octet of the form 0xxxxxxx (0..127) always represents a US-ASCII
character.

A multi-octet sequence has the form SSI*G+ (two starting octets followed
by zero or more intermediate octets followed by one or more graphic
octets).

    S = 1000xxxx              (128..143)
    I = 1001xxxx              (144..159)
    G = 101xxxxx or 11xxxxxx  (160..255)
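
As a rough illustration of the octet classes above, the following Python
sketch labels each octet and checks only the coarse SSI*G+ shape of a
stream.  The names octet_class and has_mtf8_shape are invented for this
example; the check deliberately ignores the number of graphic octets,
which depends on the contents of the starting octets.

```python
import re

def octet_class(b: int) -> str:
    """Return 'A' (US-ASCII), 'S' (starting), 'I' (intermediate) or 'G' (graphic)."""
    if not 0 <= b <= 255:
        raise ValueError("not an octet")
    if b < 128:
        return "A"
    if b < 144:
        return "S"
    if b < 160:
        return "I"
    return "G"

# A stream is a run of single ASCII octets and SSI*G+ sequences.
_SHAPE = re.compile(r"(?:A|SSI*G+)*")

def has_mtf8_shape(data: bytes) -> bool:
    """Check the coarse shape only; graphic-octet counts are not verified."""
    classes = "".join(octet_class(b) for b in data)
    return _SHAPE.fullmatch(classes) is not None
```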

The starting and intermediate octets specify the graphic character set
from which the graphic octets will select a character.

The two starting octets have the following structure:

    1000mrff 1000ffff

m is 0 if there is one graphic octet, 1 if there is more than one
graphic octet.  This bit corresponds to the absence/presence of byte
02/04 (decimal 36, "$") as the second byte of the ISO-2022 designation
escape sequence.

r is 0 if the graphic octets range from 161..254, 1 if the graphic
octets range from 160..255.  This bit corresponds to bit 2 of the second
byte (for single-byte character sets) or third byte (for multi-byte
character sets) of the ISO-2022 designation escape sequence.  That byte
has the form 00101ree, where ee selects which code element (G0..G3) to
associate with the character set.

ffffff is the lower six bits of the final byte of the ISO-2022
designation escape sequence (the upper two bits of that byte are 01).
If m is 1, the most significant two f bits (the lowest two bits of the
first starting octet) determine the number of graphic octets:

    0x => 2
    10 => 3
    11 => 4
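
The bit layout of the two starting octets can be unpacked as in the
following Python sketch.  The function name parse_start and the shape of
its return value are assumptions made for this example, not part of
MTF-8.

```python
def parse_start(s1: int, s2: int):
    """Unpack starting octets 1000mrff 1000ffff.

    Returns (m, r, final_byte, graphic_count), where final_byte is the
    final byte of the ISO-2022 designation escape sequence.
    """
    if s1 & 0xF0 != 0x80 or s2 & 0xF0 != 0x80:
        raise ValueError("starting octets must have the form 1000xxxx")
    m = (s1 >> 3) & 1
    r = (s1 >> 2) & 1
    top = s1 & 0x03                     # most significant two f bits
    f = (top << 4) | (s2 & 0x0F)        # ffffff
    final_byte = 0x40 | f               # upper two bits of that byte are 01
    if m == 0:
        graphic_count = 1
    elif top < 2:                       # 0x => 2
        graphic_count = 2
    else:                               # 10 => 3, 11 => 4
        graphic_count = top + 1
    return m, r, final_byte, graphic_count
```

For instance, parse_start(136, 130) gives m = 1, r = 0, a final byte of
01000010 and two graphic octets.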

MTF-8 currently has no way of representing characters from five-byte (or
more) character sets, but I know of no registered character sets that
use even three bytes, let alone five.  A four-byte set would have room
for about 78 million characters.  For comparison, Unicode has room for
only about one million characters.
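
The figures above can be checked with a little arithmetic.  This sketch
assumes 94 or 96 code positions per octet (matching the two graphic
octet ranges selected by the r bit) and the usual Unicode code space of
17 planes of 65536 code points; neither number is stated in this
document.

```python
# Code positions per octet: 94 when r = 0 (161..254), 96 when r = 1 (160..255).
four_byte_94 = 94 ** 4
four_byte_96 = 96 ** 4

# Assumption: Unicode code space of 17 planes, 65536 code points each.
unicode_space = 17 * 65536

print(four_byte_94)   # 78074896
print(four_byte_96)   # 84934656
print(unicode_space)  # 1114112
```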

The intermediate octets, if present, have the form:

    1001iiii

MTF-8 intermediate octets correspond to the ISO-2022 intermediate bytes
used by the registration authority in the character set designation
escape sequence, immediately preceding the final byte.  An MTF-8
intermediate octet of 1001iiii corresponds to an ISO-2022 intermediate
byte of 0010iiii.
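
The correspondence is a simple swap of the upper four bits, as in this
Python sketch.  The two function names are hypothetical.

```python
def intermediate_to_mtf8(iso_byte: int) -> int:
    """Map an ISO-2022 intermediate byte 0010iiii to the MTF-8 octet 1001iiii."""
    if iso_byte & 0xF0 != 0x20:
        raise ValueError("not an ISO-2022 intermediate byte")
    return 0x90 | (iso_byte & 0x0F)

def intermediate_from_mtf8(octet: int) -> int:
    """Map an MTF-8 intermediate octet 1001iiii back to the byte 0010iiii."""
    if octet & 0xF0 != 0x90:
        raise ValueError("not an MTF-8 intermediate octet")
    return 0x20 | (octet & 0x0F)
```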

Note that ISO-2022 places an additional constraint on the first of these
intermediate bytes, which implies that the first MTF-8 intermediate
octet must be 10010001, 10010010, or 10010011.  This allows for a future
escape hatch, using the other thirteen possible values for the first
intermediate byte.
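
Putting the rules of this section together, a stream can be split into
characters without ever looking past the last octet of a sequence.  The
following Python sketch is one possible reading of the specification;
the name split_mtf8 and its error handling are assumptions made for
this example.

```python
def split_mtf8(data: bytes) -> list:
    """Split an MTF-8 octet string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        s1 = data[i]
        if s1 < 0x80:                      # 0xxxxxxx: a US-ASCII character
            chars.append(data[i:i + 1])
            i += 1
            continue
        # Two starting octets, each of the form 1000xxxx.
        if s1 & 0xF0 != 0x80 or i + 1 >= len(data) or data[i + 1] & 0xF0 != 0x80:
            raise ValueError(f"expected two starting octets at offset {i}")
        j = i + 2
        # Zero or more intermediate octets of the form 1001xxxx.
        while j < len(data) and data[j] & 0xF0 == 0x90:
            if j == i + 2 and data[j] not in (0x91, 0x92, 0x93):
                raise ValueError(f"reserved first intermediate octet at offset {j}")
            j += 1
        m = (s1 >> 3) & 1
        r = (s1 >> 2) & 1
        top = s1 & 0x03                    # most significant two f bits
        count = 1 if m == 0 else (2 if top < 2 else top + 1)
        low, high = (160, 255) if r else (161, 254)
        graphic = data[j:j + count]
        if len(graphic) != count or not all(low <= g <= high for g in graphic):
            raise ValueError(f"bad graphic octets at offset {j}")
        chars.append(data[i:j + count])
        i = j + count
    return chars
```

The number of graphic octets is known as soon as the first starting
octet has been read, which is why no lookahead is needed to find the
end of a sequence.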


MTF-8 Examples
==============

Any character X from the upper half of the Latin-1 character set
(ISO-8859-1) is represented as 10000100 10000001 X (132 129 X).  The
ISO-2022 designation sequence is ESC 001011xx 01000001.

Any character X from the Cuban variant of ASCII (ISO-646-CU) is
represented as 10000000 10000001 10010001 X (128 129 145 X).  The
ISO-2022 designation sequence is ESC 001010xx 00100001 01000001.

Any character XY from the Japanese Kanji character set (JIS-X-0208)
is represented as 10001000 10000010 X Y (136 130 X Y).  The ISO-2022
designation sequence is ESC 00100100 001010xx 01000010, where 00100100
is the byte 02/04.  (Except that if the third byte would be 00101000, it
is omitted for historical reasons.)
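
The three examples above can be reproduced with a small encoder.  This
is a minimal Python sketch, assuming the caller already knows the final
byte, the r bit, and any intermediate bytes of the character set's
ISO-2022 designation; the function name and parameters are invented for
this example.

```python
def mtf8_encode_char(final_byte: int, code: bytes, r: int = 0,
                     intermediates: bytes = b"") -> bytes:
    """Encode one character from an ISO-2022-compatible graphic set.

    final_byte    -- final byte of the designation sequence (0x40..0x7E)
    code          -- the character's code bytes; bit 7 is set on output
    r             -- 0 for graphic octets 161..254, 1 for 160..255
    intermediates -- ISO-2022 intermediate bytes 0010iiii, if any
    """
    if not 0x40 <= final_byte <= 0x7E:
        raise ValueError("final byte must have upper bits 01")
    if r not in (0, 1) or len(code) == 0:
        raise ValueError("bad r bit or empty code")
    f = final_byte & 0x3F               # ffffff
    top = f >> 4                        # most significant two f bits
    m = 0 if len(code) == 1 else 1
    if m and len(code) != (2 if top < 2 else top + 1):
        raise ValueError("code length disagrees with the final byte")
    if any(b & 0xF0 != 0x20 for b in intermediates):
        raise ValueError("intermediate bytes must have the form 0010iiii")
    low, high = (160, 255) if r else (161, 254)
    graphic = bytes(c | 0x80 for c in code)
    if not all(low <= g <= high for g in graphic):
        raise ValueError("code byte outside the set's range")
    s1 = 0x80 | (m << 3) | (r << 2) | top
    s2 = 0x80 | (f & 0x0F)
    return bytes([s1, s2]) + bytes(0x90 | (b & 0x0F) for b in intermediates) + graphic

print(list(mtf8_encode_char(0x41, b"\x69", r=1)))                  # [132, 129, 233]
print(list(mtf8_encode_char(0x41, b"\x41", intermediates=b"\x21")))  # [128, 129, 145, 193]
print(list(mtf8_encode_char(0x42, b"\x30\x21")))                   # [136, 130, 176, 161]
```

The code bytes passed in (0x69, 0x41, and 0x30 0x21) are arbitrary
sample values chosen for illustration.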

A two-byte character requires four octets in MTF-8, plus an octet for
each extra intermediate byte, but I know of no registered two-byte
character sets that use extra intermediate bytes.  A one-byte
character requires three octets in MTF-8, plus an octet for each extra
intermediate byte, but the most common sets use no extra intermediate
bytes, and the others use only one (as far as I know).  So MTF-8 tends
to use one more byte than UTF-8 for non-ASCII characters, but it does
convey more information.
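
The size comparison can be made concrete with a short Python sketch.
The helper mtf8_length is hypothetical, and the two sample characters
(U+00E9 and U+4E9C) are chosen here only because their UTF-8 lengths
are two and three octets respectively.

```python
def mtf8_length(code_bytes: int, extra_intermediates: int = 0) -> int:
    """Octets per character: two starting octets, any intermediate
    octets, then one graphic octet per code byte."""
    return 2 + extra_intermediates + code_bytes

# One-byte set vs. a two-octet UTF-8 character.
print(mtf8_length(1), len("\u00e9".encode("utf-8")))   # 3 2

# Two-byte set vs. a three-octet UTF-8 character.
print(mtf8_length(2), len("\u4e9c".encode("utf-8")))   # 4 3

# One-byte set with one extra intermediate byte.
print(mtf8_length(1, 1))                               # 4
```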