Diffstat (limited to 'jni/iconv/NOTES')
-rw-r--r-- | jni/iconv/NOTES | 399 |
1 files changed, 399 insertions, 0 deletions
diff --git a/jni/iconv/NOTES b/jni/iconv/NOTES
new file mode 100644
index 0000000..0755a2e
--- /dev/null
+++ b/jni/iconv/NOTES
@@ -0,0 +1,399 @@
+Q: Why does libiconv support encoding XXX? Why does libiconv not support
+   encoding ZZZ?
+
+A: libiconv, as an internationalization library, supports those character
+   sets and encodings which are in wide-spread use in at least one territory
+   of the world.
+
+   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
+   page "Languages, countries, and the charsets typically used for them".
+   From this table, we can conclude that the following are in active use:
+
+     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
+                          English, Faroese, Finnish, French, Galician, German,
+                          Icelandic, Irish, Italian, Norwegian, Portuguese,
+                          Scottish, Spanish, Swedish
+     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
+                          Slovenian
+     ISO-8859-3           Esperanto, Maltese
+     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
+                          Serbian, Ukrainian
+     ISO-8859-6           Arabic
+     ISO-8859-7           Greek
+     ISO-8859-8           Hebrew
+     ISO-8859-9, CP1254   Turkish
+     ISO-8859-10          Inuit, Lapp
+     ISO-8859-13          Latvian, Lithuanian
+     ISO-8859-15          Estonian
+     KOI8-R               Russian
+     SHIFT_JIS            Japanese
+     ISO-2022-JP          Japanese
+     EUC-JP               Japanese
+
+   Ordered by frequency on the web (1997):
+     ISO-8859-1, CP1252   96%
+     SHIFT_JIS             1.6%
+     ISO-2022-JP           1.2%
+     EUC-JP                0.4%
+     CP1250                0.3%
+     CP1251                0.2%
+     CP850                 0.1%
+     MACINTOSH             0.1%
+     ISO-8859-5            0.1%
+     ISO-8859-2            0.0%
+
+   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
+
+     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
+                          English, Estonian, Faroese, Finnish, French,
+                          Galician, German, Greenlandic, Icelandic,
+                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
+                          Occitan, Portuguese, Scottish, Spanish, Swedish,
+                          Walloon, Welsh
+     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
+                          Romanian, Serbian, Slovak, Slovenian
+     ISO-8859-3           Esperanto
+     ISO-8859-4           Estonian, Latvian, Lithuanian
+     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
+                          Serbian, Ukrainian
+     ISO-8859-6           Arabic
+     ISO-8859-7           Greek
+     ISO-8859-8           Hebrew
+     ISO-8859-9           Turkish
+     ISO-8859-14          Breton, Irish, Scottish, Welsh
+     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
+                          Faroese, Finnish, French, Galician, German,
+                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
+                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
+                          Swedish, Walloon, Welsh
+     KOI8-R               Russian
+     KOI8-U               Russian, Ukrainian
+     EUC-JP (alias eucJP)       Japanese
+     ISO-2022-JP (alias JIS7)   Japanese
+     SHIFT_JIS (alias SJIS)     Japanese
+     U90                        Japanese
+     S90                        Japanese
+     EUC-CN (alias eucCN)       Chinese
+     EUC-TW (alias eucTW)       Chinese
+     BIG5                       Chinese
+     EUC-KR (alias eucKR)       Korean
+     ARMSCII-8                  Armenian
+     GEORGIAN-ACADEMY           Georgian
+     GEORGIAN-PS                Georgian
+     TIS-620 (alias TACTIS)     Thai
+     MULELAO-1                  Laothian
+     IBM-CP1133                 Laothian
+     VISCII                     Vietnamese
+     TCVN                       Vietnamese
+     NUNACOM-8                  Inuktitut
+
+   Hint3: The character sets supported by Netscape Communicator 4.
+
+     Where is this documented? For the complete picture, I had to use
+     "strings netscape" and then a lot of guesswork. For a quick take,
+     look at the "View - Character set" menu of Netscape Communicator 4.6:
+
+     ISO-8859-{1,2,5,7,9,15}
+     WINDOWS-{1250,1251,1253}
+     KOI8-R               Cyrillic
+     CP866                Cyrillic
+     Autodetect           Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
+     EUC-JP               Japanese
+     SHIFT_JIS            Japanese
+     GB2312               Chinese
+     BIG5                 Chinese
+     EUC-TW               Chinese
+     Autodetect           Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
+
+     UTF-8
+     UTF-7
+
+   Hint4: The character sets supported by Microsoft Internet Explorer 4.
+
+     ISO-8859-{1,2,3,4,5,6,7,8,9}
+     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
+     KOI8-R               Cyrillic
+     KOI8-RU              Ukrainian
+     ASMO-708             Arabic
+     EUC-JP               Japanese
+     ISO-2022-JP          Japanese
+     SHIFT_JIS            Japanese
+     GB2312               Chinese
+     HZ-GB-2312           Chinese
+     BIG5                 Chinese
+     EUC-KR               Korean
+     ISO-2022-KR          Korean
+     WINDOWS-874          Thai
+     WINDOWS-1258         Vietnamese
+
+     UTF-8
+     UTF-7
+     UNICODE              actually UNICODE-LITTLE
+     UNICODEFEFF          actually UNICODE-BIG
+
+     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
+
+   We take the union of all these four sets. The result is:
+
+   European and Semitic languages
+     * ASCII.
+       We implement this because it is occasionally useful to know or to
+       check whether some text is entirely ASCII (i.e. if the conversion
+       ISO-8859-x -> UTF-8 is trivial).
+     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
+       We implement this because they are widely used. Except ISO-8859-4
+       which appears to have been superseded by ISO-8859-13 in the baltic
+       countries. But it's an ISO standard anyway.
+     * ISO-8859-13
+       We implement this because it's a standard in Lithuania and Latvia.
+     * ISO-8859-14
+       We implement this because it's an ISO standard.
+     * ISO-8859-15
+       We implement this because it's increasingly used in Europe, because
+       of the Euro symbol.
+     * ISO-8859-16
+       We implement this because it's an ISO standard.
+     * KOI8-R, KOI8-U
+       We implement this because it appears to be the predominant encoding
+       on Unix in Russia and Ukraine, respectively.
+     * KOI8-RU
+       We implement this because MSIE4 supports it.
+     * KOI8-T
+       We implement this because it is the locale encoding in glibc's Tajik
+       locale.
+     * PT154
+       We implement this because it is the locale encoding in glibc's Kazakh
+       locale.
+     * RK1048
+       We implement this because it's a standard in Kazakhstan.
+     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
+       We implement these because they are the predominant Windows encodings
+       in Europe.
+     * CP850
+       We implement this because it is mentioned as occurring in the web
+       in the aforementioned statistics.
+     * CP862
+       We implement this because Ron Aaron says it is sometimes used in web
+       pages and emails.
+     * CP866
+       We implement this because Netscape Communicator does.
+     * CP1131
+       We implement this because it is the locale encoding of a Belorusian
+       locale in FreeBSD and MacOS X.
+     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
+       Mac{Hebrew,Arabic}
+       We implement these because the Sun JDK does, and because Mac users
+       don't deserve to be punished.
+     * Macintosh
+       We implement this because it is mentioned as occurring in the web
+       in the aforementioned statistics.
+   Japanese
+     * EUC-JP, SHIFT_JIS, ISO-2022-JP
+       We implement these because they are widely used. EUC-JP and SHIFT_JIS
+       are more used for files, whereas ISO-2022-JP is recommended for email.
+     * CP932
+       We implement this because it is the Microsoft variant of SHIFT_JIS,
+       used on Windows.
+     * ISO-2022-JP-2
+       We implement this because it's the common way to represent mails which
+       make use of JIS X 0212 characters.
+     * ISO-2022-JP-1
+       We implement this because it's in the RFCs, but I don't think it is
+       really used.
+     * U90, S90
+       We DON'T implement this because I have no informations about what it
+       is or who uses it.
+   Simplified Chinese
+     * EUC-CN = GB2312
+       We implement this because it is the widely used representation
+       of simplified Chinese.
+     * GBK
+       We implement this because it appears to be used on Solaris and Windows.
+     * GB18030
+       We implement this because it is an official requirement in the
+       People's Republic of China.
+     * ISO-2022-CN
+       We implement this because it is in the RFCs, but I have no idea
+       whether it is really used.
+     * ISO-2022-CN-EXT
+       We implement this because it's in the RFCs, but I don't think it is
+       really used.
+     * HZ = HZ-GB-2312
+       We implement this because the RFCs recommend it for Usenet postings,
+       and because MSIE4 supports it.
+   Traditional Chinese
+     * EUC-TW
+       We implement it because it appears to be used on Unix.
+     * BIG5
+       We implement it because it is the de-facto standard for traditional
+       Chinese.
+     * CP950
+       We implement this because it is the Microsoft variant of BIG5, used
+       on Windows.
+     * BIG5+
+       We DON'T implement this because it doesn't appear to be in wide use.
+       Only the CWEX fonts use this encoding. Furthermore, the conversion
+       tables in the big5p package are not coherent: If you convert directly,
+       you get different results than when you convert via GBK.
+     * BIG5-HKSCS
+       We implement it because it is the de-facto standard for traditional
+       Chinese in Hongkong.
+   Korean
+     * EUC-KR
+       We implement these because they appear to be the widely used
+       representations for Korean.
+     * CP949
+       We implement this because it is the Microsoft variant of EUC-KR, used
+       on Windows.
+     * ISO-2022-KR
+       We implement it because it is in the RFCs and because MSIE4 supports
+       it, but I have no idea whether it's really used.
+     * JOHAB
+       We implement this because it is apparently used on Windows as a locale
+       encoding (codepage 1361).
+     * ISO-646-KR
+       We DON'T implement this because although an old ASCII variant, its
+       glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
+       say it's a tilde, but Ken Lunde's "CJKV information processing" says
+       it's an overline. And it is not ISO-IR registered.
+   Armenian
+     * ARMSCII-8
+       We implement it because XFree86 supports it.
+   Georgian
+     * Georgian-Academy, Georgian-PS
+       We implement these because they appear to be both used for Georgian;
+       Xfree86 supports them.
+   Thai
+     * ISO-8859-11, TIS-620
+       We implement these because it seems to be standard for Thai.
+     * CP874
+       We implement this because MSIE4 supports it.
+     * MacThai
+       We implement this because the Sun JDK does, and because Mac users
+       don't deserve to be punished.
+   Laotian
+     * MuleLao-1, CP1133
+       We implement these because XFree86 supports them. I have no idea which
+       one is used more widely.
+   Vietnamese
+     * VISCII, TCVN
+       We implement these because XFree86 supports them.
+     * CP1258
+       We implement this because MSIE4 supports it.
+   Other languages
+     * NUNACOM-8 (Inuktitut)
+       We DON'T implement this because it isn't part of Unicode yet, and
+       therefore doesn't convert to anything except itself.
+   Platform specifics
+     * HP-ROMAN8, NEXTSTEP
+       We implement these because they were the native character set on HPs
+       and NeXTs for a long time, and libiconv is intended to be usable on
+       these old machines.
+   Full Unicode
+     * UTF-8, UCS-2, UCS-4
+       We implement these. Obviously.
+     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
+       We implement these because they are the preferred internal
+       representation of strings in Unicode aware applications. These are
+       non-ambiguous names, known to glibc. (glibc doesn't have
+       UCS-2-INTERNAL and UCS-4-INTERNAL.)
+     * UTF-16, UTF-16BE, UTF-16LE
+       We implement these, because UTF-16 is still the favourite encoding of
+       the president of the Unicode Consortium (for political reasons), and
+       because they appear in RFC 2781.
+     * UTF-32, UTF-32BE, UTF-32LE
+       We implement these because they are part of Unicode 3.1.
+     * UTF-7
+       We implement this because it is essential functionality for mail
+       applications.
+     * C99
+       We implement it because it's used for C and C++ programs and because
+       it's a nice encoding for debugging.
+     * JAVA
+       We implement it because it's used for Java programs and because it's
+       a nice encoding for debugging.
+     * UNICODE (big endian), UNICODEFEFF (little endian)
+       We DON'T implement these because they are stupid and not standardized.
+   Full Unicode, in terms of `uint16_t' or `uint32_t'
+   (with machine dependent endianness and alignment)
+     * UCS-2-INTERNAL, UCS-4-INTERNAL
+       We implement these because they are the preferred internal
+       representation of strings in Unicode aware applications.
+
+Q: Support encodings mentioned in RFC 1345 ?
+A: No, they are not in use any more. Supporting ISO-646 variants is pointless
+   since ISO-8859-* have been adopted.
+
+Q: Support EBCDIC ?
+A: No!
+
+Q: How do I add a new character set?
+A: 1. Explain the "why" in this file, above.
+   2. You need to have a conversion table from/to Unicode. Transform it into
+      the format used by the mapping tables found on ftp.unicode.org: each line
+      contains the character code, in hex, with 0x prefix, then whitespace,
+      then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
+      counts as a comment delimiter until end of line.
+      Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
+      can include it in his collection.
+   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
+      tools directory to generate the C code for the conversion. You may tweak
+      the resulting C code if you are not satisfied with its quality, but this
+      is rarely needed.
+      If it's a two-dimensional character set (with rows and columns), use the
+      'cjk_tab_to_h' program in the tools directory to generate the C code for
+      the conversion. You will need to modify the main() function to recognize
+      the new character set name, with the proper dimensions, but that shouldn't
+      be too hard. This yields the CCS. The CES you have to write by hand.
+   4. Store the resulting C code file in the lib directory. Add a #include
+      directive to converters.h, and add an entry to the encodings.def file.
+   5. Compile the package, and test your new encoding using a program like
+      iconv(1) or clisp(1).
+   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
+      encoding, create the complete table as a TXT file. For a stateful encoding,
+      provide a text snippet encoded using your new encoding and its UTF-8
+      equivalent.
+   7. Update the README and man/iconv_open.3, to mention the new encoding.
+      Add a note in the NEWS file.
+
+Q: What about bidirectional text? Should it be tagged or reversed when
+   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
+   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
+A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
+   ISO-8859-E remains to be implemented.
+   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
+   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
+   the same as ISO-8859-8-I. I'm confused.
+
+Other character sets not implemented:
+"MNEMONIC" = "csMnemonic"
+"MNEM" = "csMnem"
+"ISO-10646-UCS-Basic" = "csUnicodeASCII"
+"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
+"ISO-10646-J-1"
+"UNICODE-1-1" = "csUnicode11"
+"csWindows31Latin5"
+
+Other aliases not implemented (and not implemented in glibc-2.1 either):
+  From MSIE4:
+    ISO-8859-1: alias ISO8859-1
+    ISO-8859-2: alias ISO8859-2
+    KSC_5601: alias KS_C_5601
+    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
+
+
+Q: How can I integrate libiconv into my package?
+A: Just copy the entire libiconv package into a subdirectory of your package.
+   At configuration time, call libiconv's configure script with the
+   appropriate --srcdir option and maybe --enable-static or --disable-shared.
+   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
+   'install-lib' is a special (not GNU standardized) target which installs
+   only the include file - in $(includedir) - and the library - in $(libdir) -
+   and does not use other directory variables. After "installing" libiconv
+   in your package's build directory, building of your package can proceed.
+
+Q: Why is the testsuite so big?
+A: Because some of the tests are very comprehensive.
+   If you don't feel like using the testsuite, you can simply remove the
+   tests/ directory.
+
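Note on the mapping-table format from step 2 of "How do I add a new character set?": the layout described there (0x-prefixed character code, whitespace, 0x-prefixed Unicode code point, '#' starting a comment) can be read with a few lines of code. The following is only a minimal sketch in Python, not part of libiconv and not a replacement for the '8bit_tab_to_h' tool. The names `parse_mapping_table`, `decode_8bit` and `SAMPLE` are hypothetical, the sample rows are an illustrative excerpt rather than a real table file, and skipping lines that carry a code but no Unicode value is an assumption about how unmapped entries are written.

```python
def parse_mapping_table(text):
    """Parse 'code <whitespace> unicode' lines into a dict {code: code point}."""
    table = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # '#' begins a comment
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            # Assumption: a code with no Unicode value means "unmapped".
            continue
        code, ucs = fields[0], fields[1]
        if not (code.lower().startswith("0x") and ucs.lower().startswith("0x")):
            raise ValueError(f"line {lineno}: expected two 0x-prefixed hex values")
        table[int(code, 16)] = int(ucs, 16)
    return table


def decode_8bit(data, table):
    """Decode bytes of a single-byte character set using the parsed table.

    Raises KeyError for a byte that has no entry in the table.
    """
    return "".join(chr(table[byte]) for byte in data)


# Hypothetical excerpt for illustration only, not a real table file.
SAMPLE = """\
# code  unicode
0x41    0x0041  # LATIN CAPITAL LETTER A
0x80    0x20AC  # EURO SIGN
0x81            # UNDEFINED
"""

if __name__ == "__main__":
    table = parse_mapping_table(SAMPLE)
    print(decode_8bit(b"A\x80", table))
```

A table parsed this way is handy for spot-checking a new encoding against the output of iconv(1) in step 5, before the complete TXT table is added to the testsuite in step 6.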