Diffstat (limited to 'jni/iconv/NOTES')
-rw-r--r-- | jni/iconv/NOTES | 399 |
1 files changed, 0 insertions, 399 deletions
diff --git a/jni/iconv/NOTES b/jni/iconv/NOTES
deleted file mode 100644
index 0755a2e..0000000
--- a/jni/iconv/NOTES
+++ /dev/null
@@ -1,399 +0,0 @@
-Q: Why does libiconv support encoding XXX? Why does libiconv not support
-   encoding ZZZ?
-
-A: libiconv, as an internationalization library, supports those character
-   sets and encodings which are in wide-spread use in at least one territory
-   of the world.
-
-   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
-   page "Languages, countries, and the charsets typically used for them".
-   From this table, we can conclude that the following are in active use:
-
-     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
-                          English, Faroese, Finnish, French, Galician, German,
-                          Icelandic, Irish, Italian, Norwegian, Portuguese,
-                          Scottish, Spanish, Swedish
-     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
-                          Slovenian
-     ISO-8859-3           Esperanto, Maltese
-     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
-                          Serbian, Ukrainian
-     ISO-8859-6           Arabic
-     ISO-8859-7           Greek
-     ISO-8859-8           Hebrew
-     ISO-8859-9, CP1254   Turkish
-     ISO-8859-10          Inuit, Lapp
-     ISO-8859-13          Latvian, Lithuanian
-     ISO-8859-15          Estonian
-     KOI8-R               Russian
-     SHIFT_JIS            Japanese
-     ISO-2022-JP          Japanese
-     EUC-JP               Japanese
-
-   Ordered by frequency on the web (1997):
-     ISO-8859-1, CP1252   96%
-     SHIFT_JIS            1.6%
-     ISO-2022-JP          1.2%
-     EUC-JP               0.4%
-     CP1250               0.3%
-     CP1251               0.2%
-     CP850                0.1%
-     MACINTOSH            0.1%
-     ISO-8859-5           0.1%
-     ISO-8859-2           0.0%
-
-   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
-
-     ISO-8859-1                 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
-                                English, Estonian, Faroese, Finnish, French,
-                                Galician, German, Greenlandic, Icelandic,
-                                Indonesian, Irish, Italian, Lithuanian, Norwegian,
-                                Occitan, Portuguese, Scottish, Spanish, Swedish,
-                                Walloon, Welsh
-     ISO-8859-2                 Albanian, Croatian, Czech, Hungarian, Polish,
-                                Romanian, Serbian, Slovak, Slovenian
-     ISO-8859-3                 Esperanto
-     ISO-8859-4                 Estonian, Latvian, Lithuanian
-     ISO-8859-5                 Bulgarian, Byelorussian, Macedonian, Russian,
-                                Serbian, Ukrainian
-     ISO-8859-6                 Arabic
-     ISO-8859-7                 Greek
-     ISO-8859-8                 Hebrew
-     ISO-8859-9                 Turkish
-     ISO-8859-14                Breton, Irish, Scottish, Welsh
-     ISO-8859-15                Basque, Breton, Catalan, Danish, Dutch, Estonian,
-                                Faroese, Finnish, French, Galician, German,
-                                Greenlandic, Icelandic, Irish, Italian, Lithuanian,
-                                Norwegian, Occitan, Portuguese, Scottish, Spanish,
-                                Swedish, Walloon, Welsh
-     KOI8-R                     Russian
-     KOI8-U                     Russian, Ukrainian
-     EUC-JP (alias eucJP)       Japanese
-     ISO-2022-JP (alias JIS7)   Japanese
-     SHIFT_JIS (alias SJIS)     Japanese
-     U90                        Japanese
-     S90                        Japanese
-     EUC-CN (alias eucCN)       Chinese
-     EUC-TW (alias eucTW)       Chinese
-     BIG5                       Chinese
-     EUC-KR (alias eucKR)       Korean
-     ARMSCII-8                  Armenian
-     GEORGIAN-ACADEMY           Georgian
-     GEORGIAN-PS                Georgian
-     TIS-620 (alias TACTIS)     Thai
-     MULELAO-1                  Laothian
-     IBM-CP1133                 Laothian
-     VISCII                     Vietnamese
-     TCVN                       Vietnamese
-     NUNACOM-8                  Inuktitut
-
-   Hint3: The character sets supported by Netscape Communicator 4.
-
-     Where is this documented? For the complete picture, I had to use
-     "strings netscape" and then a lot of guesswork. For a quick take,
-     look at the "View - Character set" menu of Netscape Communicator 4.6:
-
-     ISO-8859-{1,2,5,7,9,15}
-     WINDOWS-{1250,1251,1253}
-     KOI8-R                Cyrillic
-     CP866                 Cyrillic
-     Autodetect            Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
-     EUC-JP                Japanese
-     SHIFT_JIS             Japanese
-     GB2312                Chinese
-     BIG5                  Chinese
-     EUC-TW                Chinese
-     Autodetect            Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)
-
-     UTF-8
-     UTF-7
-
-   Hint4: The character sets supported by Microsoft Internet Explorer 4.
-
-     ISO-8859-{1,2,3,4,5,6,7,8,9}
-     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
-     KOI8-R                Cyrillic
-     KOI8-RU               Ukrainian
-     ASMO-708              Arabic
-     EUC-JP                Japanese
-     ISO-2022-JP           Japanese
-     SHIFT_JIS             Japanese
-     GB2312                Chinese
-     HZ-GB-2312            Chinese
-     BIG5                  Chinese
-     EUC-KR                Korean
-     ISO-2022-KR           Korean
-     WINDOWS-874           Thai
-     WINDOWS-1258          Vietnamese
-
-     UTF-8
-     UTF-7
-     UNICODE               actually UNICODE-LITTLE
-     UNICODEFEFF           actually UNICODE-BIG
-
-     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
-
-   We take the union of all these four sets. The result is:
-
-   European and Semitic languages
-     * ASCII.
-       We implement this because it is occasionally useful to know or to
-       check whether some text is entirely ASCII (i.e. if the conversion
-       ISO-8859-x -> UTF-8 is trivial).
-     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
-       We implement this because they are widely used. Except ISO-8859-4
-       which appears to have been superseded by ISO-8859-13 in the baltic
-       countries. But it's an ISO standard anyway.
-     * ISO-8859-13
-       We implement this because it's a standard in Lithuania and Latvia.
-     * ISO-8859-14
-       We implement this because it's an ISO standard.
-     * ISO-8859-15
-       We implement this because it's increasingly used in Europe, because
-       of the Euro symbol.
-     * ISO-8859-16
-       We implement this because it's an ISO standard.
-     * KOI8-R, KOI8-U
-       We implement this because it appears to be the predominant encoding
-       on Unix in Russia and Ukraine, respectively.
-     * KOI8-RU
-       We implement this because MSIE4 supports it.
-     * KOI8-T
-       We implement this because it is the locale encoding in glibc's Tajik
-       locale.
-     * PT154
-       We implement this because it is the locale encoding in glibc's Kazakh
-       locale.
-     * RK1048
-       We implement this because it's a standard in Kazakhstan.
-     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
-       We implement these because they are the predominant Windows encodings
-       in Europe.
-     * CP850
-       We implement this because it is mentioned as occurring in the web
-       in the aforementioned statistics.
-     * CP862
-       We implement this because Ron Aaron says it is sometimes used in web
-       pages and emails.
-     * CP866
-       We implement this because Netscape Communicator does.
-     * CP1131
-       We implement this because it is the locale encoding of a Belorusian
-       locale in FreeBSD and MacOS X.
-     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
-       Mac{Hebrew,Arabic}
-       We implement these because the Sun JDK does, and because Mac users
-       don't deserve to be punished.
-     * Macintosh
-       We implement this because it is mentioned as occurring in the web
-       in the aforementioned statistics.
-   Japanese
-     * EUC-JP, SHIFT_JIS, ISO-2022-JP
-       We implement these because they are widely used. EUC-JP and SHIFT_JIS
-       are more used for files, whereas ISO-2022-JP is recommended for email.
-     * CP932
-       We implement this because it is the Microsoft variant of SHIFT_JIS,
-       used on Windows.
-     * ISO-2022-JP-2
-       We implement this because it's the common way to represent mails which
-       make use of JIS X 0212 characters.
-     * ISO-2022-JP-1
-       We implement this because it's in the RFCs, but I don't think it is
-       really used.
-     * U90, S90
-       We DON'T implement this because I have no informations about what it
-       is or who uses it.
-   Simplified Chinese
-     * EUC-CN = GB2312
-       We implement this because it is the widely used representation
-       of simplified Chinese.
-     * GBK
-       We implement this because it appears to be used on Solaris and Windows.
-     * GB18030
-       We implement this because it is an official requirement in the
-       People's Republic of China.
-     * ISO-2022-CN
-       We implement this because it is in the RFCs, but I have no idea
-       whether it is really used.
-     * ISO-2022-CN-EXT
-       We implement this because it's in the RFCs, but I don't think it is
-       really used.
-     * HZ = HZ-GB-2312
-       We implement this because the RFCs recommend it for Usenet postings,
-       and because MSIE4 supports it.
-   Traditional Chinese
-     * EUC-TW
-       We implement it because it appears to be used on Unix.
-     * BIG5
-       We implement it because it is the de-facto standard for traditional
-       Chinese.
-     * CP950
-       We implement this because it is the Microsoft variant of BIG5, used
-       on Windows.
-     * BIG5+
-       We DON'T implement this because it doesn't appear to be in wide use.
-       Only the CWEX fonts use this encoding. Furthermore, the conversion
-       tables in the big5p package are not coherent: If you convert directly,
-       you get different results than when you convert via GBK.
-     * BIG5-HKSCS
-       We implement it because it is the de-facto standard for traditional
-       Chinese in Hongkong.
-   Korean
-     * EUC-KR
-       We implement these because they appear to be the widely used
-       representations for Korean.
-     * CP949
-       We implement this because it is the Microsoft variant of EUC-KR, used
-       on Windows.
-     * ISO-2022-KR
-       We implement it because it is in the RFCs and because MSIE4 supports
-       it, but I have no idea whether it's really used.
-     * JOHAB
-       We implement this because it is apparently used on Windows as a locale
-       encoding (codepage 1361).
-     * ISO-646-KR
-       We DON'T implement this because although an old ASCII variant, its
-       glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
-       say it's a tilde, but Ken Lunde's "CJKV information processing" says
-       it's an overline. And it is not ISO-IR registered.
-   Armenian
-     * ARMSCII-8
-       We implement it because XFree86 supports it.
-   Georgian
-     * Georgian-Academy, Georgian-PS
-       We implement these because they appear to be both used for Georgian;
-       Xfree86 supports them.
-   Thai
-     * ISO-8859-11, TIS-620
-       We implement these because it seems to be standard for Thai.
-     * CP874
-       We implement this because MSIE4 supports it.
-     * MacThai
-       We implement this because the Sun JDK does, and because Mac users
-       don't deserve to be punished.
-   Laotian
-     * MuleLao-1, CP1133
-       We implement these because XFree86 supports them. I have no idea which
-       one is used more widely.
-   Vietnamese
-     * VISCII, TCVN
-       We implement these because XFree86 supports them.
-     * CP1258
-       We implement this because MSIE4 supports it.
-   Other languages
-     * NUNACOM-8 (Inuktitut)
-       We DON'T implement this because it isn't part of Unicode yet, and
-       therefore doesn't convert to anything except itself.
-   Platform specifics
-     * HP-ROMAN8, NEXTSTEP
-       We implement these because they were the native character set on HPs
-       and NeXTs for a long time, and libiconv is intended to be usable on
-       these old machines.
-   Full Unicode
-     * UTF-8, UCS-2, UCS-4
-       We implement these. Obviously.
-     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
-       We implement these because they are the preferred internal
-       representation of strings in Unicode aware applications. These are
-       non-ambiguous names, known to glibc. (glibc doesn't have
-       UCS-2-INTERNAL and UCS-4-INTERNAL.)
-     * UTF-16, UTF-16BE, UTF-16LE
-       We implement these, because UTF-16 is still the favourite encoding of
-       the president of the Unicode Consortium (for political reasons), and
-       because they appear in RFC 2781.
-     * UTF-32, UTF-32BE, UTF-32LE
-       We implement these because they are part of Unicode 3.1.
-     * UTF-7
-       We implement this because it is essential functionality for mail
-       applications.
-     * C99
-       We implement it because it's used for C and C++ programs and because
-       it's a nice encoding for debugging.
-     * JAVA
-       We implement it because it's used for Java programs and because it's
-       a nice encoding for debugging.
-     * UNICODE (big endian), UNICODEFEFF (little endian)
-       We DON'T implement these because they are stupid and not standardized.
-   Full Unicode, in terms of `uint16_t' or `uint32_t'
-   (with machine dependent endianness and alignment)
-     * UCS-2-INTERNAL, UCS-4-INTERNAL
-       We implement these because they are the preferred internal
-       representation of strings in Unicode aware applications.
-
-Q: Support encodings mentioned in RFC 1345 ?
-A: No, they are not in use any more. Supporting ISO-646 variants is pointless
-   since ISO-8859-* have been adopted.
-
-Q: Support EBCDIC ?
-A: No!
-
-Q: How do I add a new character set?
-A: 1. Explain the "why" in this file, above.
-   2. You need to have a conversion table from/to Unicode. Transform it into
-      the format used by the mapping tables found on ftp.unicode.org: each line
-      contains the character code, in hex, with 0x prefix, then whitespace,
-      then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
-      counts as a comment delimiter until end of line.
-      Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
-      can include it in his collection.
-   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
-      tools directory to generate the C code for the conversion. You may tweak
-      the resulting C code if you are not satisfied with its quality, but this
-      is rarely needed.
-      If it's a two-dimensional character set (with rows and columns), use the
-      'cjk_tab_to_h' program in the tools directory to generate the C code for
-      the conversion. You will need to modify the main() function to recognize
-      the new character set name, with the proper dimensions, but that shouldn't
-      be too hard. This yields the CCS. The CES you have to write by hand.
-   4. Store the resulting C code file in the lib directory. Add a #include
-      directive to converters.h, and add an entry to the encodings.def file.
-   5. Compile the package, and test your new encoding using a program like
-      iconv(1) or clisp(1).
-   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
-      encoding, create the complete table as a TXT file. For a stateful encoding,
-      provide a text snippet encoded using your new encoding and its UTF-8
-      equivalent.
-   7. Update the README and man/iconv_open.3, to mention the new encoding.
-      Add a note in the NEWS file.
-
-Q: What about bidirectional text? Should it be tagged or reversed when
-   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
-   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
-A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
-   ISO-8859-E remains to be implemented.
-   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
-   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
-   the same as ISO-8859-8-I. I'm confused.
-
-Other character sets not implemented:
-"MNEMONIC" = "csMnemonic"
-"MNEM" = "csMnem"
-"ISO-10646-UCS-Basic" = "csUnicodeASCII"
-"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
-"ISO-10646-J-1"
-"UNICODE-1-1" = "csUnicode11"
-"csWindows31Latin5"
-
-Other aliases not implemented (and not implemented in glibc-2.1 either):
-  From MSIE4:
-    ISO-8859-1: alias ISO8859-1
-    ISO-8859-2: alias ISO8859-2
-    KSC_5601: alias KS_C_5601
-    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
-
-
-Q: How can I integrate libiconv into my package?
-A: Just copy the entire libiconv package into a subdirectory of your package.
-   At configuration time, call libiconv's configure script with the
-   appropriate --srcdir option and maybe --enable-static or --disable-shared.
-   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
-   'install-lib' is a special (not GNU standardized) target which installs
-   only the include file - in $(includedir) - and the library - in $(libdir) -
-   and does not use other directory variables. After "installing" libiconv
-   in your package's build directory, building of your package can proceed.
-
-Q: Why is the testsuite so big?
-A: Because some of the tests are very comprehensive.
-   If you don't feel like using the testsuite, you can simply remove the
-   tests/ directory.
-
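Illustration (not part of the deleted file): step 2 of the "How do I add a new character set?" answer describes the mapping-table format as one entry per line, with the character code in hex (0x prefix), whitespace, the Unicode code point in hex (0x prefix), and '#' starting a comment that runs to end of line. The Python sketch below reads a table in that format. The function name parse_mapping_table and the sample text are hypothetical, chosen only for this example; this is not the parser used by 8bit_tab_to_h or cjk_tab_to_h.

```python
def parse_mapping_table(text):
    """Parse a mapping table into {character_code: unicode_code_point}.

    Assumed format (as described in the NOTES): two hex fields with 0x
    prefix separated by whitespace; '#' starts a comment to end of line.
    """
    table = {}
    for lineno, raw_line in enumerate(text.splitlines(), start=1):
        # Drop everything from the first '#' onwards, then trim whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"line {lineno}: expected two hex fields")
        # int(..., 16) accepts the 0x prefix directly.
        code = int(fields[0], 16)
        code_point = int(fields[1], 16)
        table[code] = code_point
    return table


# Hypothetical two-entry excerpt, for illustration only.
SAMPLE = """\
# character code -> Unicode code point
0x41 0x0041   # LATIN CAPITAL LETTER A
0xA4 0x20AC   # EURO SIGN
"""

if __name__ == "__main__":
    for code, code_point in sorted(parse_mapping_table(SAMPLE).items()):
        print(f"0x{code:02X} -> U+{code_point:04X}")
```

A dictionary keyed by character code is enough for a quick sanity check of a table (duplicate or malformed entries) before handing it to the generator tools.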