Friendly ASCII-Compatible Encoding (FACE) version 0.2.1 (2000-Sep-04-Mon) Adam M. Costello Goals: 1) To achieve efficient encoding. 2) To require only the characters [A-Z0-9-] (like DNS labels). 3) To be simple to describe and implement. 4) To encode Unicode text as an ASCII string in such a way that substrings that were already ASCII to begin with remain visible, for the benefit of users whose software does not understand the Unicode text. Notation: Let the symbol # denote any of the characters from the set [2-9A-KMNP-Z], which represent quintet values in that order: "2" = 0 = 00000 "3" = 1 = 00001 "4" = 2 = 00010 "5" = 3 = 00011 "6" = 4 = 00100 "7" = 5 = 00101 "8" = 6 = 00110 "9" = 7 = 00111 "A" = 8 = 01000 "B" = 9 = 01001 "C" = 10 = 01010 "D" = 11 = 01011 "E" = 12 = 01100 "F" = 13 = 01101 "G" = 14 = 01110 "H" = 15 = 01111 "I" = 16 = 10000 "J" = 17 = 10001 "K" = 18 = 10010 "M" = 19 = 10011 "N" = 20 = 10100 "P" = 21 = 10101 "Q" = 22 = 10110 "R" = 23 = 10111 "S" = 24 = 11000 "T" = 25 = 11001 "U" = 26 = 11010 "V" = 27 = 11011 "W" = 28 = 11100 "X" = 29 = 11101 "Y" = 30 = 11110 "Z" = 31 = 11111 The digits "0" and "1" and the letters "O" and "L" ("l") are not used, to avoid transcription errors. To encode a sequence of Unicode characters as a sequence of ASCII characters: There are two modes, ASCII mode and base-32 mode. Begin in base-32 mode. There is also a "previous non-ASCII character". Initially it is character U+1A0. In base-32 mode non-ASCII characters are encoded as differences from the previous non-ASCII character. The difference is first represented as a two's complement binary integer. Then the most significant bits are discarded to bring the length down to one of the lengths in the table below. As many bits are discarded as possible without changing the two's complement value, except that bit 31 may always be discarded. The resulting bit string is then encoded as a sequence of two to seven characters, depending on its length, as follows: length encoding ------ --------------------------------------------------- 9 ## = 0xxxx xxxxx 13 ### = 10xxx xxxxx xxxxx 17 #### = 110xx xxxxx xxxxx xxxxx 21 ##### = 1110x xxxxx xxxxx xxxxx xxxxx 31 ####### = 1111x xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx Bit-order and quintet order are both big-endian (most significant first). In ASCII mode, characters in the range U+0..U+7F, except for U+2D ("-"), are encoded literally as themselves. In either mode, character U+2D ("-") is encoded as "--". If the current mode is incapable of encoding the next character, output "-" and switch modes. The previous non-ASCII character is not forgotten during excursions into ASCII mode. To decode a sequence of ASCII characters into a sequence of Unicode characters: Start in base-32 mode, with a previous non-ASCII character of U+1A0. In base-32 mode, decode the various sizes of base-32 numbers as indicated in the table above (the format is implied by the first character). Sign-extend the value to 32 bits and add it to the previous non-ASCII character. If bit 31 is set in the result, clear it. Output the result, and remember it for the next non-ASCII character. In ASCII mode, decode all characters as themselves, except for 2D ("-"). In either mode, decode "--" as "-". In either mode, "-" not followed by "-" switches modes. This requires one character lookahead. The previous non-ASCII character is not forgotten during excursions into ASCII mode. Examples: Suppose the string we wish to encode is "AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a particular sequence of five Japanese characters, whose iso-2022-jp encoding is: 安室奈美恵 The corresponding Unicode values are: U+5B89 U+5BA4 U+5948 U+7F8E U+6075. The 32-bit differences are: 0000 0000 0000 0000 0101 1001 1110 1001 0000 0000 0000 0000 0000 0000 0001 1011 1111 1111 1111 1111 1111 1101 1010 0100 0000 0000 0000 0000 0010 0110 0100 0110 1111 1111 1111 1111 1110 0000 1110 0111 The truncated bit strings are: 0 0101 1001 1110 1001 0 0001 1011 1 1101 1010 0100 0 0010 0110 0100 0110 1 1110 0000 1110 0111 The quintet encodings are: 11000 10110 01111 01001 00000 11011 10111 01101 00100 11000 01001 10010 00110 11011 11000 00111 00111 The final encoded string is: SQHB2VRF6SBK8VS99---with--super--monkeys The encoding of "champs-elysee", with an acute accent over the second-last "e", is: -champs--elys-CB-e Notice how the hyphens help humans pick out the readable ASCII parts and ignore the base-32 gibberish. Use with DNS: It is recommended that a standard prefix (such as "u--") be chosen for all domain labels that use this encoding, so that they can be distinguished from ASCII labels, and so that they never begin with a hyphen. A 3-character prefix leaves room for about thirty characters in a small script like Cyrillic or Hebrew, or about sixteen Han ideographs. Hostnames are case insensitive, and that goes for the base-32 parts as well as the ASCII parts. However, since existing ASCII domain names are usually stored in lower case, it is recommended that the base-32 portions of encoded names be stored in upper case, to help humans with old software distinguish the ASCII from the base-32. Humans with new software that interprets the encoding will, of course, see the Unicode characters rather than the base-32 encoding. Comparisons with other schemes: FACE probably has similar efficiency to SACE, but seems conceptually simpler. FACE is a little more human-friendly because gibberish and ASCII are always separated by a hyphen. FACE and SACE are both more efficient and much more human-friendly than UTF-5. RACE is quite efficient for strings that are limited to Latin-1 plus one other cluster of 256 characters, but one deviation from this set causes the compression to break down for the entire string. FACE and SACE are more tolerant of outliers (for example Japanese strings composed of many kana and a few kanji, which is typical of Japanese text), and more human-friendly. References: FACE borrows heavily from UTF-5, SACE, and RACE. http://www.i-d-n.net/