AMC-ACE-O version 0.0.5 (2001-May-27-Sun)
http://www.cs.berkeley.edu/~amc/charset/amc-ace-o
Adam M. Costello

Abstract

    AMC-ACE-O is a reversible map from a sequence of Unicode [UNICODE] characters to a sequence of letters (A-Z, a-z), digits (0-9), and hyphen-minus (-), henceforth called LDH characters. Such a map (called an "ASCII-Compatible Encoding", or ACE) might be useful for internationalized domain names [IDN], because host name labels are currently restricted to LDH characters by [RFC952] and [RFC1123].

    Besides domain names, there might also be other contexts where it is useful to transform Unicode characters into "safe" (delimiter-free) ASCII characters. (If other contexts consider hyphen-minus to be unsafe, a different character could be used to play its role, like underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding and decoding algorithms
    Signature
    Case sensitivity models
    Comparison with other ACEs
    Example strings
    Security considerations
    Credits
    References
    Author
    Example implementation

Features

    Uniqueness: Every Unicode string maps to at most one LDH string.

    Completeness: Every Unicode string maps to an LDH string. Restrictions on which Unicode strings are allowed, and on length, may be imposed by higher layers.

    Efficient encoding: The ratio of encoded size to original size is small for all Unicode strings. This is important in the context of domain names because [RFC1034] restricts the length of a domain label to 63 characters.

    Simplicity: The encoding and decoding algorithms are reasonably simple to implement. The goals of efficiency and simplicity are at odds; AMC-ACE-O aims at a good balance between them.
    Case-preservation: If the Unicode string has been case-folded prior to encoding, it is possible to record the case information in the case of the letters in the encoding, allowing a mixed-case Unicode string to be recovered if desired, but a case-insensitive comparison of two encoded strings is equivalent to a case-insensitive comparison of the Unicode strings. This feature is optional; see section "Case sensitivity models".

    Readability: The letters A-Z and a-z and the digits 0-9 appearing in the Unicode string are represented as themselves in the label. This comes for free because it is usually the most efficient encoding anyway.

Name

    AMC-ACE-O is a working name that should be changed if it is adopted. (The O merely indicates that it is the fifteenth ACE devised by this author. BRACE was the third. Most were not worth releasing.) Rather than waste good names on experimental proposals, let's wait until one proposal is chosen, then assign it a good name. Suggestions (assuming the primary use is in domain names):

        UniHost
        UTF-D ("D" for "domain names")
        UTF-37 (there are 37 characters in the output repertoire)
        NUDE (Normal Unicode Domain Encoding)

Overview

    AMC-ACE-O maps characters to characters--it does not consume or produce code points, code units, or bytes, although the algorithm makes use of code points, and implementations will of course need to represent the input and output characters somehow, usually as bytes or other code units.

    Each character in the Unicode string is represented by an integral number of characters in the encoded string. There is no intermediate bit string or octet string.

    The encoded string alternates between two modes: literal mode and base-32 mode. LDH characters in the Unicode string are encoded literally, except that hyphen-minus is doubled. Non-LDH characters in the Unicode string are encoded using base-32, in which each character of the encoded string represents five bits (a "quintet").
    A non-paired hyphen-minus in the encoded string indicates a mode change.

    In base-32 mode a variable-length code sequence of one to five quintets represents a delta, which is added to a reference point to yield a Unicode code point, which in turn represents a Unicode character. (Surrogates, which are code units used by UTF-16 in pairs to refer to code points, are not used with AMC-ACE-O.) There is one reference point for each code length; they are chosen by the encoder based on the input string and declared at the beginning of the encoded string, and never change. Locality among the code points is discovered and exploited by the encoder to make the encoding more compact.

Base-32 characters

    "a" =  0 = 0x00 = 00000    "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001    "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010    "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011    "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100    "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101    "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110    "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111    "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000    "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001    "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010    "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011    "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100    "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101    "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110    "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111    "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase forms of the base-32 characters. The case may or may not convey information, as described in section "Case sensitivity models".

Encoding and decoding algorithms

    The algorithms are given below as commented pseudocode. All ordering of bits and quintets is big-endian (most significant first). The >> and << operators used below mean bit shift, as in C.
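As a small illustration of the quintet scheme, the following C sketch turns a delta into its base-32 characters using the table above: the delta is split into nybbles, most significant first, and each nybble becomes a quintet whose high bit is 1 except on the last one. The function name delta_to_base32, the array b32_alphabet, and the use of a string literal (which assumes an ASCII execution character set) are hypothetical choices made for this sketch, not part of the specification.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of the specification.            */
/* Assumes an ASCII execution character set.                      */
static const char b32_alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Writes the k base-32 characters that represent delta into      */
/* out[] (most significant nybble first) and null-terminates it.  */
/* The caller must provide room for k+1 characters, 1 <= k <= 5.  */
static void delta_to_base32(unsigned long delta, unsigned int k, char out[])
{
  unsigned int j;

  assert(k >= 1 && k <= 5);
  assert((delta >> (4 * k)) == 0);   /* delta must fit in k nybbles */

  out[k] = '\0';

  /* Last quintet: high bit 0 marks the end of the sequence. */
  out[k - 1] = b32_alphabet[delta & 0xF];

  /* Earlier quintets, filled right to left: high bit 1. */
  for (j = 2; j <= k; ++j) {
    delta >>= 4;
    out[k - j] = b32_alphabet[0x10 | (delta & 0xF)];
  }
}
```

For instance, a delta of 0xF1 in a two-quintet code gives the quintets 11111 and 00001, that is "9b", and a delta of 9 in a one-quintet code gives "j". Because the values are unsigned, the shifts behave the same on every conforming C compiler.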
For >> there is no question of logical versus arithmetic shift because AMC-ACE-O makes no use of negative numbers. primitives: to_codepoint() # maps a character to a Unicode code point from_codepoint() # maps a Unicode code point to a character # These are no-ops if the implementation represents characters # using Unicode code points. subroutine names: encode # main encoding function decode # main decoding function find_refpoint # scan the reference points for a suitable one encode_point # encode one code point as base-32 decode_point # decode one code point from base-32 choose_refpoints # choose good reference points for the input census # used by choose_refpoints encode_refpoints # encode the reference points decode_refpoints # decode the reference points bootstrap # used by en/decode_refpoints shared variables: # All others are local to each subroutine. the input/output strings taken/returned by encode() and decode() array refpoint[1..5] # refpoint[k] is for sequences of length k # The rest are used only by the encoder: array prefix[1..3] # prefix[k] is used to encode refpoint[k] integers best_count, best_refpoint constants: array special_refpoint[0..7] = 0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270 # Generally, prefix[k] << (4*k) == refpoint[k], # but for prefix[2] == 0xD8..0xDF, refpoint[2] == # special_refpoint[0..7] respectively. These prefixes would # not otherwise be used because they correspond to surrogates. # These special reference points are used to assist the Latin # script because, unlike almost every other small script, # Latin is split across multiple rows with inconvenient # boundaries, and therefore has a hard time compressing well. function encode(input string): if any input character's codepoint is outside 0..10FFFF then fail # Too-large values could cause array bounds errors later. 
  choose_refpoints()
  encode_refpoints()
  let literal = false
  for each character in the input string (in order) do begin
    if the character is hyphen-minus then output two hyphen-minuses
    else if the character is an LDH character then begin
      if not literal then output hyphen-minus and toggle literal
      output the character
    end
    else begin
      if literal then output hyphen-minus and toggle literal
      encode_point(to_codepoint(character))
    end
  end
  return the output string

function decode(input string):
  decode_refpoints()
  let literal = false
  while not end-of-input do begin
    if the next character is a hyphen-minus not followed by another
      then consume it and toggle literal
    if the next character is a hyphen-minus
      then consume two characters and output a hyphen-minus
    else if literal then consume the character and output it
    else output from_codepoint(decode_point())
  end
  let check = encode(the output string)
  if check != the input string then fail
  # This comparison must be case-insensitive if ACEs are always
  # compared case-insensitively (which is true of domain names),
  # case-sensitive otherwise. See also section "Case sensitivity
  # models". This check is necessary to guarantee the uniqueness
  # property (there cannot be two distinct encoded strings
  # representing the same Unicode string).
  return the output string

function find_refpoint(start,n):
  let i = start
  while n < refpoint[i] or (n - refpoint[i]) >> (4*i) != 0
    do increment i
  return i

procedure encode_point(n):
  let k = find_refpoint(1,n)
  let delta = n - refpoint[k]
  extract the k least significant nybbles of delta
  # A nybble is 4 bits.
prepend 0 to the last nybble and prepend 1 to the rest output the base-32 characters corresponding to the quintets function decode_point(): input characters and convert them to quintets until a quintet beginning with 0 is obtained (expect at most four quintets beginning with 1) fail upon encountering anything unexpected let k = the number of quintets obtained strip the first bit of each quintet concatenate the resulting nybbles to form delta return refpoint[k] + delta procedure encode_refpoints(): # refpoint[4..5] always end up as 0 and 0x10000. # refpoint[1..3] are implied by prefix[1..3], which are encoded # in reverse order because that often yields a compact encoding. let refpoint[1..2] = 0, 0x10 for k = 3 down to 1 do begin encode_point(prefix[k]) bootstrap(k, prefix[k]) end procedure decode_refpoints(): let refpoint[1..5] = 0, 0x10, 0, 0, 0x10000 for k = 3 down to 1 do bootstrap(k, decode_point()) procedure bootstrap(k,p): # The prefixes need to be left-shifted to become reference # points. As this happens, the current reference points often # become helpful for encoding/decoding the next prefix. for j = 4 down to 2 do let refpoint[j] = refpoint[j-1] << 4 if k == 2 and 0xD8 <= p <= 0xDF then let refpoint[1] = special_refpoint[p - 0xD8] >> 4 else let refpoint[1] = p << 4 procedure choose_refpoints(): # First choose refpoint[1] so that it will be used as often as # possible, then choose refpoint[2] similarly, then refpoint[3]. 
  let refpoint[1..5] = 0, 0, 0, 0, 0x10000
  let prefix[1..3] = 0, 0, 0
  for k = 1 to 3 do begin
    let best_count = 0
    let best_refpoint = 0
    # Try the input code point prefixes, then the special prefixes:
    for each input character in order do
      census(k, to_codepoint(character) >> (4*k))
    if k == 2 then for i = 0 to 7 do census(k, 0xD8 + i)
    if k == 3 then census(k,0xD)
    let refpoint[k] = best_refpoint
  end

function census(k,p):
  # Determine how many times the reference point corresponding to
  # prefix p would be used to encode input characters and other
  # reference points if it were chosen as refpoint[k], and update
  # best_count, best_refpoint, and prefix[k] accordingly.
  if k == 2 and 0xD8 <= p <= 0xDF
    then let refpoint[k] = special_refpoint[p - 0xD8]
    else let refpoint[k] = p << (4*k)
  let count = the number of non-LDH input characters for which
    find_refpoint(1, to_codepoint(character)) == k
  # Don't forget the non-LDH requirement.
  increment count once for each i such that 1 <= i < k and
    find_refpoint(i+1, prefix[i] << (4*i)) == k
  if count > best_count then begin
    let best_count = count
    let best_refpoint = refpoint[k]
    let prefix[k] = p
  end

Signature

    The issue of how to distinguish ACE strings from unencoded strings is largely orthogonal to the encoding scheme itself, and is therefore not specified here. In the context of domain name labels, a standard prefix and/or suffix (chosen to be unlikely to occur naturally) would presumably be attached to ACE labels. (In that case, it would probably be good to forbid the encoding of Unicode strings that appear to match the signature, to avoid confusing humans about whether they are looking at a Unicode string or an ACE string.)

    In order to use AMC-ACE-O in domain names, the choice of signature must be mindful of the requirement in [RFC952] that labels never begin or end with hyphen-minus. The raw encoded string will never begin with a hyphen-minus, and will end with a hyphen-minus iff the Unicode string ends with a hyphen-minus.
If the Unicode strings are forbidden from ending with hyphen-minus (which seems prudent anyway), then there is no problem. Otherwise, AMC-ACE-O would need to use a suffix as the signature. Case sensitivity models The higher layer must choose one of the following four models. Models suitable for domain names: * Case-insensitive: Before a string is encoded, all its non-LDH characters must be case-folded so that any strings differing only in case become the same string (for example, strings could be forced to lowercase). Folding LDH characters is optional. The case of base-32 characters and literal-mode characters is arbitrary and not significant. Comparisons between encoded strings must be case-insensitive. The original case of non-LDH characters cannot be recovered from the encoded string. * Case-preserving: The case of the Unicode characters is not considered significant, but it can be preserved and recovered, just like in non-internationalized host names. Before a string is encoded, all its non-LDH characters must be case-folded as in the previous model. LDH characters are naturally able to retain their case attributes because they are encoded literally. The case attribute of a non-LDH character is recorded in the last of the base-32 characters that represent it, which is guaranteed to be a letter rather than a digit. If the base-32 character is uppercase, it means the Unicode character is caseless or should be forced to uppercase after being decoded (which is a no-op if the case folding already forces to uppercase). If the base-32 character is lowercase, it means the Unicode character is caseless or should be forced to lowercase after being decoded (which is a no-op if the case folding already forces to lowercase). The case of the other base-32 characters in a multi-quintet encoding is arbitrary and not significant. Only uppercase and lowercase attributes can be recorded, not titlecase. 
Comparisons between encoded strings must be case-insensitive, and are equivalent to case-insensitive comparisons between the Unicode strings. The intended mixed-case Unicode string can be recovered as long as the encoded characters are unaltered, but altering the case of the encoded characters is not harmful--it merely alters the case of the Unicode characters, and such a change is not considered significant. In this model, the input to the encoder and the output of the decoder can be the unfolded Unicode string (in which case the encoder and decoder are responsible for performing the case folding and recovery), or can be the folded Unicode string accompanied by separate case information (in which case the higher layer is responsible for performing the case folding and recovery). Whichever layer performs the case recovery must first verify that the Unicode string is properly folded, to guarantee the uniqueness of the encoding. It is not very difficult to extend the nameprep algorithm [NAMEPREP03] to remember case information. The case-insensitive and case-preserving models are interoperable. If a domain name passes from a case-preserving entity to a case-insensitive entity, the case information will be lost, but the domain name will still be equivalent. This phenomenon already occurs with non-internationalized domain names. Models unsuitable for domain names, but possibly useful in other contexts: * Case-sensitive: Unicode strings may contain both uppercase and lowercase characters, which are not folded. Base-32 characters must be lowercase. Comparisons between encoded strings must be case-sensitive. * Case-flexible: Like case-preserving, except that the choice of whether the case of the Unicode characters is considered significant is deferred. Therefore, base-32 characters must be lowercase, except for those used to indicate uppercase Unicode characters. 
Comparisons between encoded strings may be case-sensitive or case-insensitive, and such comparisons are equivalent to the corresponding comparisons between the Unicode strings. Comparison with other ACEs Please refer to the comparison in [AMCACEW]. Example strings In the ACE encodings below, no signatures are shown. AMC-ACE-O is abbreviated AMC-O. Backslashes show where line breaks have been inserted in strings too long for one line. The first several examples are all translations of the sentence "Why can't they just speak in ?" (courtesy of Michael Kaplan's "provincial" page [PROVINCIAL]). Word breaks and punctuation have been removed, as is often done in domain names. (A) Arabic (Egyptian): u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F AMC-O: ageekhfuhuiukdefivevjvbuiktr (B) Chinese (simplified): u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587 AMC-O: eqpg8nvk6awisp259eupyx2h (C) Czech: Proprostnemluvesky U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D u+0065 u+0073 u+006B u+0079 AMC-O: piq-Pro-p-prost-9m-nemluv-6pp-esky (D) Hebrew: u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 u+05D1 u+05E8 u+05D9 u+05EA AMC-O: afpnqeep8e8jfinaqdb8ijp8cb8ij8k (E) Hindi (Devanagari): u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 u+0939 u+0948 u+0902 AMC-O: ajeurvjvcmthvjvruipugatfpurmscuivjascunmvcvitfuehvjisc (F) Japanese (kanji and hiragana): u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B AMC-O: gvagkxnzr3dkx8fzun243q3c24zbxhgwr2nkweqwm (G) Korean (Hangul syllables): u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C 
u+B4E4 u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C (Hangul syllables) AMC-O: m6hwq6tvi466exi44ia6s4nz2neze7xxn47yp6x5e3znze7xze7xxnu\ 8e4ze6x5n36is3i622mwe48wn (H) Russian (Cyrillic): U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A u+0438 AMC-O: aedRqwhfnwdgfqpipfdqcqwawrwcrqwawdwbwbki (I) Spanish: PorqunopuedensimplementehablarenEspaol U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 u+0061 u+00F1 u+006F u+006C AMC-O: aaq-Porqu-j-nopuedensimplementehablarenEspa-9b-ol (J) Taiwanese: u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587 AMC-O: eqpgxstbzuvc6a385psp244kupyx2h (K) Vietnamese: Taisaohokhngthchi\ noitingVit U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069 u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA u+0323 u+0074 AMC-O: aava-Ta-vud-isaoho-vud-kh-9e-ngth-8kj-chi-j-no-b-iti-8k\ b-ngVi-8kvud-t The next several examples are all names of Japanese music artists, song titles, and TV programs, just because the author happens to have them handy (but Japanese is useful for providing examples of single-row text, two-row text, ideographic text, and various mixtures thereof). 
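Before the remaining examples, it may help to see how the leading characters of each encoded string above are consumed. The sketch below is one possible C rendering of the decode_refpoints, bootstrap, and decode_point pseudocode. The type name cp_t, the helper b32_value, the function signatures, and the reliance on an ASCII execution character set are assumptions made for illustration; they are not taken from the example implementation.

```c
#include <assert.h>

typedef unsigned long cp_t;  /* hypothetical code point type */

static const cp_t special_refpoint[8] =
  { 0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270 };

/* Value 0..31 of a base-32 character, or 32 if it is not one. */
/* Assumes ASCII; accepts both uppercase and lowercase.        */
static unsigned int b32_value(char c)
{
  static const char alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";
  unsigned int i;

  if (c >= 'A' && c <= 'Z') c = (char)(c - 'A' + 'a');
  for (i = 0; i < 32; ++i) if (alphabet[i] == c) return i;
  return 32;
}

/* Reads one code of 1..5 quintets from *in, advances *in, and */
/* stores refpoint[k] + delta in *n.  Returns 1 on success, 0  */
/* on malformed input.                                         */
static int decode_point(const char **in, const cp_t refpoint[6], cp_t *n)
{
  cp_t delta = 0;
  unsigned int k, q;

  for (k = 1; k <= 5; ++k) {
    q = b32_value(*(*in)++);
    if (q == 32) return 0;
    delta = (delta << 4) | (q & 0xF);
    if ((q >> 4) == 0) {          /* high bit 0: last quintet */
      *n = refpoint[k] + delta;
      return 1;
    }
  }
  return 0;                       /* more than five quintets */
}

/* Shifts the current reference points up one slot and installs */
/* the point implied by prefix p (for length k) in slot 1.      */
static void bootstrap(cp_t refpoint[6], unsigned int k, cp_t p)
{
  unsigned int j;

  assert(k >= 1 && k <= 3);
  for (j = 4; j >= 2; --j) refpoint[j] = refpoint[j - 1] << 4;
  refpoint[1] = (k == 2 && p >= 0xD8 && p <= 0xDF)
              ? special_refpoint[p - 0xD8] >> 4
              : p << 4;
}

/* Decodes the three prefixes at the start of an encoded string. */
static int decode_refpoints(const char **in, cp_t refpoint[6])
{
  unsigned int k;
  cp_t p;

  refpoint[0] = 0;  /* unused */
  refpoint[1] = 0;  refpoint[2] = 0x10;  refpoint[3] = 0;
  refpoint[4] = 0;  refpoint[5] = 0x10000;
  for (k = 3; k >= 1; --k) {
    if (!decode_point(in, refpoint, &p)) return 0;
    bootstrap(refpoint, k, p);
  }
  return 1;
}
```

Under these assumptions, the leading "aaq" of example (I) yields refpoint[1] = 0xE0, so the later "j" decodes to u+00E9 and "9b" (a two-quintet code relative to refpoint[2] = 0) decodes to u+00F1. The leading "piq" of example (C) yields refpoint[1..3] = 0x100, 0x20, 0xD000, where 0x20 is the first special reference point.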
(L) 3B u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F AMC-O: fb8h-3-e-B-z7we3t7bymwizxtr (M) -with-SUPER-MONKEYS u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D U+004F U+004E U+004B U+0045 U+0059 U+0053 AMC-O: fmij4e3wiz92qyszf---with--SUPER--MONKEYS (N) Hello-Another-Way- U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D u+305D u+308C u+305E u+308C u+306E u+5834 u+6240 AMC-O: daf-Hello--Another--Way---p2nq2nyqx2veyuwa (O) 2 u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032 AMC-O: dagzciex6wmy2vjqw8sm-2 (P) MajiKoi5 U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 u+308B u+0035 u+79D2 u+524D AMC-O: dag-Maji-h-Koi-xj2m-5-z37cxuwp (Q) de u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 AMC-O: dapbf4d9n-de-8m9da (R) u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 AMC-O: dagxpq5j7e9n6jh The last example is an ASCII string that breaks not only the existing rules for host name labels but also the rules proposed in [NAMEPREP03] for internationalized domain names. (S) -> $1.00 <- u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 u+003C u+002D AMC-O: aac--vqae-1-q-00-avn-- Security considerations Users expect each domain name in DNS to be controlled by a single authority. If a Unicode string intended for use as a domain label could map to multiple ACE labels, then an internationalized domain name could map to multiple ACE domain names, each controlled by a different authority, some of which could be spoofs that hijack service requests intended for another. Therefore AMC-ACE-O is designed so that each Unicode string has a unique encoding. However, there can still be multiple Unicode representations of the "same" text, for various definitions of "same". 
This problem is addressed to some extent by the Unicode standard under the topic of canonicalization, and this work is leveraged for domain names by "nameprep" [NAMEPREP03]. Credits AMC-ACE-O reuses a number of preexisting techniques. The basic encoding of integers to nybbles to quintets to base-32 comes from UTF-5 [UTF5], and the particular variant used here comes from AMC-ACE-M [AMCACEM]. The idea of avoiding 0, 1, o, and l in base-32 strings was taken from SFS [SFS]. The idea of encoding deltas from reference points declared at the beginning of the encoded string was taken from RACE (of which the latest version is [RACE03]), which may have gotten the idea from Unicode Technical Standard #6 [UTS6]. The latter also uses predefined reference points in the Latin range. From BRACE [BRACE] comes the idea of switching between literal mode and base-32 mode, and the technique of counting how many code points fall within a window (as opposed to checking whether all do). The general idea of using the alphabetic case of base-32 characters to record the desired case of the Unicode characters was suggested by this author, and first applied to the UTF-5-style encoding in DUDE (of which the latest version is [DUDE01]). The bootstrapping method of encoding reference points, which does not require them to nest but takes advantage of nesting when it occurs, is new in AMC-ACE-O. References [AMCACEM] Adam Costello, "AMC-ACE-M version 0.1.4", 2001-Apr-01, update of draft-ietf-idn-amc-ace-m-00, latest version at http://www.cs.berkeley.edu/~amc/charset/amc-ace-m. [AMCACEW] Adam Costello, "AMC-ACE-W version 0.0.0", 2001-May-27-Sun, draft-ietf-idn-amc-ace-w-00, latest version at http://www.cs.berkeley.edu/~amc/charset/amc-ace-w. [BRACE] Adam Costello, "BRACE: Bi-mode Row-based ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19, draft-ietf-idn-brace-00, latest version at http://www.cs.berkeley.edu/~amc/charset/brace. 
[DUDE01] Mark Welter, Brian Spolarich, "DUDE: Differential Unicode Domain Encoding", 2001-Mar-02, draft-ietf-idn-dude-01. [IDN] Internationalized Domain Names (IETF working group), http://www.i-d-n.net/, idn@ops.ietf.org. [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01. [NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation of Internationalized Host Names", 2001-Feb-24, draft-ietf-idn-nameprep-03. [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", http://www.trigeminal.com/samples/provincial.html. [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding for IDN", 2000-Nov-28, draft-ietf-idn-race-03. [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host Table Specification", 1985-Oct, RFC 952. [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", 1987-Nov, RFC 1034. [RFC1123] Internet Engineering Task Force, R. Braden (editor), "Requirements for Internet Hosts -- Application and Support", 1989-Oct, RFC 1123. [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)", draft-ietf-idn-sace-*. [SFS] David Mazieres et al, "Self-certifying File System", http://www.fs.net/. [UNICODE] The Unicode Consortium, "The Unicode Standard", http://www.unicode.org/unicode/standard/standard.html. [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*. [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*. [UTS6] Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag, "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode", http://www.unicode.org/unicode/reports/tr6/. [UTFCONV] Mark Davis, "UTF Converter", http://www.macchiato.com/unicode/convert.html. Author Adam M. 
Costello
http://www.cs.berkeley.edu/~amc/

Example implementation

/******************************************/
/* amc-ace-o.c 0.0.2 (2001-May-24-Thu)    */
/* Adam M. Costello                       */
/******************************************/

/* This is ANSI C code (C89) implementing AMC-ACE-O version 0.0.*. */

/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <limits.h>

enum amc_ace_status {
  amc_ace_success,
  amc_ace_bad_input,
  amc_ace_big_output  /* output would exceed the space provided */
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x10FFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum amc_ace_status amc_ace_o_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] );

/* amc_ace_o_encode() converts Unicode to AMC-ACE-O. The input */
/* must be represented as an array of Unicode code points */
/* (not code units; surrogate pairs are not allowed), and the */
/* output will be represented as null-terminated ASCII. The */
/* input_length is the number of code points in the input. The */
/* output_size is an in/out argument: the caller must pass */
/* in the maximum number of characters that may be output */
/* (including the terminating null), and on successful return */
/* it will contain the number of characters actually output */
/* (including the terminating null, so it will be one more than */
/* strlen() would return, which is why it is called output_size */
/* rather than output_length). The uppercase_flags array must */
/* hold input_length boolean values, where nonzero means the */
/* corresponding Unicode character should be forced to uppercase */
/* after being decoded, and zero means it is caseless or should */
/* be forced to lowercase. Alternatively, uppercase_flags may */
/* be a null pointer, which is equivalent to all zeros.
The */ /* letters a-z and A-Z are always encoded literally, regardless */ /* of the corresponding flags. The encoder always outputs */ /* lowercase base-32 characters except when nonzero values */ /* of uppercase_flags require otherwise, so the encoder is */ /* compatible with any of the case models. The return value */ /* may be any of the amc_ace_status values defined above; if */ /* not amc_ace_success, then output_size and output may contain */ /* garbage. On success, the encoder will never need to write an */ /* output_size greater than input_length*5+10, because of how the */ /* encoding is defined. */ enum amc_ace_status amc_ace_o_decode( enum case_sensitivity case_sensitivity, char scratch_space[], const char input[], unsigned int *output_length, u_code_point output[], unsigned char uppercase_flags[] ); /* amc_ace_o_decode() converts AMC-ACE-O to Unicode. The input */ /* must be represented as null-terminated ASCII, and the output */ /* will be represented as an array of Unicode code points. */ /* The case_sensitivity argument influences the check on the */ /* well-formedness of the input string; it must be case_sensitive */ /* if case-sensitive comparisons are allowed on encoded strings, */ /* case_insensitive otherwise (see also section "Case sensitivity */ /* models" of the AMC-ACE-O specification). The scratch_space */ /* must point to space at least as large as the input, which will */ /* get overwritten (this allows the decoder to avoid calling */ /* malloc()). The output_length is an in/out argument: the */ /* caller must pass in the maximum number of code points that */ /* may be output, and on successful return it will contain the */ /* actual number of code points output. The uppercase_flags */ /* array must have room for at least output_length values, or it */ /* may be a null pointer if the case information is not needed. 
*/
/* A nonzero flag indicates that the corresponding Unicode */
/* character should be forced to uppercase by the caller, while */
/* zero means it is caseless or should be forced to lowercase. */
/* The letters a-z and A-Z are output already in the proper case, */
/* but their flags will be set appropriately so that applying the */
/* flags would be harmless. The return value may be any of the */
/* amc_ace_status values defined above; if not amc_ace_success, */
/* then output_length, output, and uppercase_flags may contain */
/* garbage. On success, the decoder will never need to write */
/* an output_length greater than the length of the input (not */
/* counting the null terminator), because of how the encoding is */
/* defined. */

/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* is_ldh(cp) returns 1 if the code point cp represents an LDH */
/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */

static int is_ldh(u_code_point cp)
{
  return cp <= 122 && ( cp >= 97 || cp == 45 ||
                        (cp >= 48 && cp <= 57) ||
                        (cp >= 65 && cp <= 90) );
}

/* base32[n] is the lowercase base-32 character representing */
/* the number n from the range 0 to 31. Note that we cannot */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII. */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.
*/

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
/* then ASCII A-Z are considered equal to a-z respectively. */

static int unequal( enum case_sensitivity case_sensitivity,
                    const char s1[], const char s2[] )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}

/* The decoder_state and encoder_state structures contain */
/* variables that are shared among several of the functions below. */

struct decoder_state {
  const char *in;                       /* unread part of ACE input */
  u_code_point refpoint[6];             /* reference pts, [0] unused */
};

struct encoder_state {
  char *out, *out_end;                  /* ACE output, unwritten part */
  const u_code_point *input, *in_end;   /* entire Unicode input */
  u_code_point refpoint[6], prefix[4];  /* reference pts and prefixes */
  unsigned int best_count;              /* max found by census() */
  u_code_point best_refpoint;           /* corresponding refpoint */
};

/* refpoint[k] is for base-32 sequences of length k, and prefix[k] */
/* is used to encode refpoint[k]. Generally, prefix[k] << (4*k) */
/* == refpoint[k], but for prefix[2] == 0xD8 + i, where i = 0..7, */
/* refpoint[2] == special_refpoint[i]. These prefixes, which would */
/* otherwise correspond to surrogates, are instead used to encode */
/* special reference points that help the Latin script compress */
/* better, because unlike most other small scripts it is split */
/* across multiple rows with inconvenient boundaries.
*/

static const u_code_point special_refpoint[] = {
  0x20, 0x50, 0x70, 0xA0, 0xC0, 0xE0, 0x140, 0x270
};

/* find_refpoint(refpoint,start,n) scans the refpoint array, starting */
/* at position start, for a reference point suitable for encoding n, */
/* and returns the index of the first match. */

static unsigned int find_refpoint(
  u_code_point refpoint[6], u_code_point start, u_code_point n )
{
  while ((n - refpoint[start]) >> (4*start) != 0) ++start;
  return start;
}

/* encode_point(state,n) encodes n as a sequence of base-32 */
/* characters representing a delta from a reference point.  The delta */
/* is divided into a big-endian sequence of quartets; each quartet */
/* is expanded to a quintet with a highest bit of 0 for the last */
/* quartet, 1 for the others; and the quintets are mapped to base-32 */
/* characters.  Returns amc_ace_success or amc_ace_big_output. */

static enum amc_ace_status encode_point(
  struct encoder_state *state, u_code_point n )
{
  int k, j;

  k = find_refpoint(state->refpoint, 1, n);
  if (state->out_end - state->out < k) return amc_ace_big_output;
  n -= state->refpoint[k];
  state->out += k;
  state->out[-1] = base32[n & 0xF];

  for (j = 2; j <= k; ++j) {
    n >>= 4;
    state->out[-j] = base32[0x10 | (n & 0xF)];
  }

  return amc_ace_success;
}

/* decode_point(state,n) is the reverse of encode_point(): it */
/* consumes base-32 characters and writes the code point into *n. */
/* Returns amc_ace_success or amc_ace_bad_input.
*/

static enum amc_ace_status decode_point(
  struct decoder_state *state, u_code_point *n )
{
  u_code_point q, delta;
  unsigned int k;

  for (delta = 0, k = 1; ; ++k) {
    q = base32_decode(*state->in++);
    if (q == base32_invalid || k > 5) return amc_ace_bad_input;
    delta = (delta << 4) | (q & 0xF);
    if (q >> 4 == 0) break;
  }

  *n = state->refpoint[k] + delta;
  return amc_ace_success;
}

/* census(state,k,p) sets refpoint[k] to the reference point */
/* corresponding to prefix p, then calculates how many times */
/* that reference point would get used, and sets prefix[k] to */
/* p if the result exceeds the previous maximum. */

static void census(
  struct encoder_state *state, unsigned int k, u_code_point p )
{
  unsigned int count, i;
  u_code_point *refpoint = state->refpoint, *prefix = state->prefix;
  const u_code_point *in;

  refpoint[k] = k == 2 && p - 0xD8 <= 7 ?
                special_refpoint[p - 0xD8] : p << (4*k);

  /* count times used to encode input code points: */

  for (count = 0, in = state->input; in < state->in_end; ++in) {
    if (!is_ldh(*in) && find_refpoint(refpoint, 1, *in) == k) ++count;
  }

  /* count times used to encode other reference points: */

  for (i = 1; i < k; ++i) {
    if (find_refpoint(refpoint, i+1, prefix[i] << (4*i)) == k) ++count;
  }

  if (count > state->best_count) {
    state->best_count = count;
    state->best_refpoint = refpoint[k];
    prefix[k] = p;
  }
}

/* bootstrap(refpoint,k,p) adjusts the existing reference points so */
/* they can be used for encoding/decoding another reference point. */

static void bootstrap(
  u_code_point refpoint[6], unsigned int k, u_code_point p )
{
  unsigned int j;

  for (j = 4; j >= 2; --j) refpoint[j] = refpoint[j-1] << 4;
  refpoint[1] = k == 2 && p - 0xD8 <= 7 ?
                special_refpoint[p - 0xD8] >> 4 : p << 4;
}

/* Main encode function: */

enum amc_ace_status amc_ace_o_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] )
{
  struct encoder_state dummy = {0} /* all zeros */, *state = &dummy;
  const u_code_point *in, *in_end;
  char *out_end;
  unsigned int literal, k, i;
  u_code_point codept;
  enum amc_ace_status status;

  /* Initialization: */

  state->out = output;
  state->out_end = out_end = output + *output_size;
  state->input = input;
  state->in_end = in_end = input + input_length;

  /* Verify that all code points are in 0..10FFFF: */

  for (in = input; in < in_end; ++in) {
    if (*in > 0x10FFFF) return amc_ace_bad_input;
  }

  /* Choose the reference points: Choose refpoint[1] so that it will */
  /* be used as often as possible, then choose refpoint[2] similarly, */
  /* then refpoint[3]. */

  state->refpoint[5] = 0x10000;
  /* refpoint[1..4] and prefix[1..3] are already 0 */

  for (k = 1; k <= 3; ++k) {
    state->best_count = 0;
    state->best_refpoint = 0;

    /* Try prefixes of the input code points, then the special prefixes: */
    for (in = input; in < in_end; ++in) census(state, k, *in >> (4*k));
    if (k == 2) for (i = 0; i <= 7; ++i) census(state, k, 0xD8 + i);
    if (k == 3) census(state, k, 0xD);

    state->refpoint[k] = state->best_refpoint;
  }

  /* Encode the reference points: */

  state->refpoint[1] = 0;
  state->refpoint[2] = 0x10;

  for (k = 3; k >= 1; --k) {
    status = encode_point(state, state->prefix[k]);
    if (status != amc_ace_success) return status;
    bootstrap(state->refpoint, k, state->prefix[k]);
  }

  /* Main encoding loop: */

  literal = 0;

  for (i = 0; i < input_length; ++i) {
    codept = input[i];

    if (codept == 45) {
      /* hyphen-minus is doubled */
      if (out_end - state->out < 2) return amc_ace_big_output;
      *state->out++ = 45;
      *state->out++ = 45;
    }
    else if (is_ldh(codept)) {
      /* encode LDH character literally */
      if (out_end - state->out < 1+!literal) return amc_ace_big_output;
      /* switch to
         literal mode if necessary: */
      if (!literal) *state->out++ = 45;
      literal = 1;
      *state->out++ = codept;
    }
    else {
      /* encode non-LDH character using base-32 */
      if (out_end - state->out < 1+literal) return amc_ace_big_output;
      /* switch to base-32 mode if necessary: */
      if (literal) *state->out++ = 45;
      literal = 0;
      status = encode_point(state,codept);
      if (status != amc_ace_success) return status;
      /* the last base-32 character can record the uppercase flag: */
      if (uppercase_flags && uppercase_flags[i]) state->out[-1] -= 32;
    }
  }

  /* null terminator: */
  if (out_end - state->out < 1) return amc_ace_big_output;
  *state->out++ = 0;
  *output_size = state->out - output;
  return amc_ace_success;
}

/* Main decode function: */

enum amc_ace_status amc_ace_o_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  struct decoder_state dummy = {0} /* all zeros */, *state = &dummy;
  unsigned int k, max_out, literal, out, input_size, scratch_size;
  enum amc_ace_status status;
  u_code_point p;

  /* Initialization: */

  state->in = input;
  max_out = *output_length;

  /* Decode the reference points: */

  state->refpoint[2] = 0x10;
  state->refpoint[5] = 0x10000;
  /* refpoint[1,3,4] are already 0 */

  for (k = 3; k >= 1; --k) {
    status = decode_point(state, &p);
    if (status != amc_ace_success) return status;
    bootstrap(state->refpoint, k, p);
  }

  /* Main decoding loop: */

  literal = 0;

  for (out = 0; *state->in != 0; ++out) {

    /* At the start of each iteration, out is the number of items */
    /* already output, or equivalently, the index of the next item */
    /* to be output.
    */

    if (state->in[0] == 0x2D && state->in[1] != 0x2D) {
      /* unpaired hyphen-minus toggles mode */
      literal = !literal;
      ++state->in;
    }

    if (max_out - out < 1) return amc_ace_big_output;

    if (*state->in == 0x2D) {
      /* double hyphen-minus represents a hyphen-minus */
      state->in += 2;
      output[out] = 0x2D;
    }
    else {
      if (literal) output[out] = *state->in++;
      else {
        /* decode one base-32 code point */
        status = decode_point(state, output + out);
        if (status != amc_ace_success) return status;
      }
    }

    /* case of last character determines uppercase flag: */
    if (uppercase_flags) {
      uppercase_flags[out] = state->in[-1] >= 65 && state->in[-1] <= 90;
    }
  }

  /* Re-encode the output and compare to the input: */

  input_size = state->in - input + 1;
  scratch_size = input_size;
  status = amc_ace_o_encode(out, output, uppercase_flags,
                            &scratch_size, scratch_space);
  if (status != amc_ace_success ||
      scratch_size != input_size ||
      unequal(case_sensitivity, scratch_space, input)
     ) return amc_ace_bad_input;
  *output_length = out;
  return amc_ace_success;
}

/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option.
*/

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive  /* suitable for host names */
};

static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes an AMC-ACE-O string.\n"
    "%s -d reads an AMC-ACE-O string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "An AMC-ACE-O string is a newline-terminated sequence of LDH\n"
    "characters (without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}

static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";

/* The following string is used to convert LDH */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";

int main(int argc, char **argv)
{
  enum amc_ace_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = amc_ace_o_encode(input_length, input, uppercase_flags,
                              &output_size, output);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Convert to native charset and output: */

    for (p = output; *p != 0; ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the AMC-ACE-O input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input; *p != 0; ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp -
           ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = amc_ace_o_decode(test_case_sensitivity, scratch, input,
                              &output_length, output, uppercase_flags);
    if (status == amc_ace_bad_input) fail(invalid_input);
    if (status == amc_ace_big_output) fail(too_big);
    assert(status == amc_ace_success);

    /* Output the result: */

    for (i = 0; i < output_length; ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
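
The quartet-to-quintet step used by encode_point() and decode_point() can be tried in isolation. The following is only a minimal sketch, not part of the implementation above: the names demo_base32, demo_encode_delta, and demo_decode_delta are hypothetical, the reference-point subtraction is left out (the caller passes the delta and the length k directly), and, unlike the listing above, it assumes an ASCII execution character set so that a string literal can hold the base-32 alphabet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Base-32 alphabet in value order: a-k, m-n, p-z, 2-9 (the letters  */
/* l and o are skipped).  Assumption: ASCII execution character set. */
static const char demo_base32[] = "abcdefghijkmnpqrstuvwxyz23456789";

/* Hypothetical helper: writes delta as k base-32 characters.  The   */
/* delta is split into k big-endian quartets; every quintet except   */
/* the last gets its high bit set.  Returns k.  Does not terminate   */
/* the output with a null character.                                 */
static size_t demo_encode_delta(unsigned long delta, size_t k, char *out)
{
  size_t j;

  assert(k >= 1 && k <= 5);
  out[k - 1] = demo_base32[delta & 0xF];

  for (j = 2; j <= k; ++j) {
    delta >>= 4;
    out[k - j] = demo_base32[0x10 | (delta & 0xF)];
  }

  return k;
}

/* Hypothetical helper: reads base-32 characters until one with a    */
/* clear high bit is seen, stores the accumulated delta, and returns */
/* the number of characters consumed, or 0 on invalid input.         */
static size_t demo_decode_delta(const char *in, unsigned long *delta)
{
  unsigned long d = 0;
  size_t k = 0;

  for (;;) {
    const char *p;
    unsigned int q;

    if (k >= 5 || in[k] == 0) return 0;
    p = strchr(demo_base32, in[k]);
    if (p == NULL) return 0;
    q = (unsigned int) (p - demo_base32);
    ++k;
    d = (d << 4) | (q & 0xF);
    if ((q >> 4) == 0) break;
  }

  *delta = d;
  return k;
}
```

With these helpers, the delta 0x2A3 written with k = 3 has quartets 2, A, 3, which become quintets 0x12, 0x1A, 0x03 and therefore the characters "u4d"; decoding "u4d" yields 0x2A3 again.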