summaryrefslogtreecommitdiff
path: root/jni/iconv/NOTES
diff options
context:
space:
mode:
Diffstat (limited to 'jni/iconv/NOTES')
-rw-r--r--jni/iconv/NOTES399
1 files changed, 0 insertions, 399 deletions
diff --git a/jni/iconv/NOTES b/jni/iconv/NOTES
deleted file mode 100644
index 0755a2e..0000000
--- a/jni/iconv/NOTES
+++ /dev/null
@@ -1,399 +0,0 @@
-Q: Why does libiconv support encoding XXX? Why does libiconv not support
- encoding ZZZ?
-
-A: libiconv, as an internationalization library, supports those character
- sets and encodings which are in wide-spread use in at least one territory
- of the world.
-
- Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
- page "Languages, countries, and the charsets typically used for them".
- From this table, we can conclude that the following are in active use:
-
- ISO-8859-1, CP1252 Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
- English, Faroese, Finnish, French, Galician, German,
- Icelandic, Irish, Italian, Norwegian, Portuguese,
- Scottish, Spanish, Swedish
- ISO-8859-2 Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
- Slovenian
- ISO-8859-3 Esperanto, Maltese
- ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
- Serbian, Ukrainian
- ISO-8859-6 Arabic
- ISO-8859-7 Greek
- ISO-8859-8 Hebrew
- ISO-8859-9, CP1254 Turkish
- ISO-8859-10 Inuit, Lapp
- ISO-8859-13 Latvian, Lithuanian
- ISO-8859-15 Estonian
- KOI8-R Russian
- SHIFT_JIS Japanese
- ISO-2022-JP Japanese
- EUC-JP Japanese
-
- Ordered by frequency on the web (1997):
- ISO-8859-1, CP1252 96%
- SHIFT_JIS 1.6%
- ISO-2022-JP 1.2%
- EUC-JP 0.4%
- CP1250 0.3%
- CP1251 0.2%
- CP850 0.1%
- MACINTOSH 0.1%
- ISO-8859-5 0.1%
- ISO-8859-2 0.0%
-
- Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
-
- ISO-8859-1 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
- English, Estonian, Faroese, Finnish, French,
- Galician, German, Greenlandic, Icelandic,
- Indonesian, Irish, Italian, Lithuanian, Norwegian,
- Occitan, Portuguese, Scottish, Spanish, Swedish,
- Walloon, Welsh
- ISO-8859-2 Albanian, Croatian, Czech, Hungarian, Polish,
- Romanian, Serbian, Slovak, Slovenian
- ISO-8859-3 Esperanto
- ISO-8859-4 Estonian, Latvian, Lithuanian
- ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
- Serbian, Ukrainian
- ISO-8859-6 Arabic
- ISO-8859-7 Greek
- ISO-8859-8 Hebrew
- ISO-8859-9 Turkish
- ISO-8859-14 Breton, Irish, Scottish, Welsh
- ISO-8859-15 Basque, Breton, Catalan, Danish, Dutch, Estonian,
- Faroese, Finnish, French, Galician, German,
- Greenlandic, Icelandic, Irish, Italian, Lithuanian,
- Norwegian, Occitan, Portuguese, Scottish, Spanish,
- Swedish, Walloon, Welsh
- KOI8-R Russian
- KOI8-U Russian, Ukrainian
- EUC-JP (alias eucJP) Japanese
- ISO-2022-JP (alias JIS7) Japanese
- SHIFT_JIS (alias SJIS) Japanese
- U90 Japanese
- S90 Japanese
- EUC-CN (alias eucCN) Chinese
- EUC-TW (alias eucTW) Chinese
- BIG5 Chinese
- EUC-KR (alias eucKR) Korean
- ARMSCII-8 Armenian
- GEORGIAN-ACADEMY Georgian
- GEORGIAN-PS Georgian
- TIS-620 (alias TACTIS) Thai
- MULELAO-1 Laothian
- IBM-CP1133 Laothian
- VISCII Vietnamese
- TCVN Vietnamese
- NUNACOM-8 Inuktitut
-
- Hint3: The character sets supported by Netscape Communicator 4.
-
- Where is this documented? For the complete picture, I had to use
- "strings netscape" and then a lot of guesswork. For a quick take,
- look at the "View - Character set" menu of Netscape Communicator 4.6:
-
- ISO-8859-{1,2,5,7,9,15}
- WINDOWS-{1250,1251,1253}
- KOI8-R Cyrillic
- CP866 Cyrillic
- Autodetect Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
- EUC-JP Japanese
- SHIFT_JIS Japanese
- GB2312 Chinese
- BIG5 Chinese
- EUC-TW Chinese
- Autodetect Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
-
- UTF-8
- UTF-7
-
- Hint4: The character sets supported by Microsoft Internet Explorer 4.
-
- ISO-8859-{1,2,3,4,5,6,7,8,9}
- WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
- KOI8-R Cyrillic
- KOI8-RU Ukrainian
- ASMO-708 Arabic
- EUC-JP Japanese
- ISO-2022-JP Japanese
- SHIFT_JIS Japanese
- GB2312 Chinese
- HZ-GB-2312 Chinese
- BIG5 Chinese
- EUC-KR Korean
- ISO-2022-KR Korean
- WINDOWS-874 Thai
- WINDOWS-1258 Vietnamese
-
- UTF-8
- UTF-7
- UNICODE actually UNICODE-LITTLE
- UNICODEFEFF actually UNICODE-BIG
-
- and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
-
- We take the union of all these four sets. The result is:
-
- European and Semitic languages
- * ASCII.
- We implement this because it is occasionally useful to know or to
- check whether some text is entirely ASCII (i.e. if the conversion
- ISO-8859-x -> UTF-8 is trivial).
- * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
- We implement this because they are widely used. Except ISO-8859-4
- which appears to have been superseded by ISO-8859-13 in the baltic
- countries. But it's an ISO standard anyway.
- * ISO-8859-13
- We implement this because it's a standard in Lithuania and Latvia.
- * ISO-8859-14
- We implement this because it's an ISO standard.
- * ISO-8859-15
- We implement this because it's increasingly used in Europe, because
- of the Euro symbol.
- * ISO-8859-16
- We implement this because it's an ISO standard.
- * KOI8-R, KOI8-U
- We implement this because it appears to be the predominant encoding
- on Unix in Russia and Ukraine, respectively.
- * KOI8-RU
- We implement this because MSIE4 supports it.
- * KOI8-T
- We implement this because it is the locale encoding in glibc's Tajik
- locale.
- * PT154
- We implement this because it is the locale encoding in glibc's Kazakh
- locale.
- * RK1048
- We implement this because it's a standard in Kazakhstan.
- * CP{1250,1251,1252,1253,1254,1255,1256,1257}
- We implement these because they are the predominant Windows encodings
- in Europe.
- * CP850
- We implement this because it is mentioned as occurring in the web
- in the aforementioned statistics.
- * CP862
- We implement this because Ron Aaron says it is sometimes used in web
- pages and emails.
- * CP866
- We implement this because Netscape Communicator does.
- * CP1131
- We implement this because it is the locale encoding of a Belorusian
- locale in FreeBSD and MacOS X.
- * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
- Mac{Hebrew,Arabic}
- We implement these because the Sun JDK does, and because Mac users
- don't deserve to be punished.
- * Macintosh
- We implement this because it is mentioned as occurring in the web
- in the aforementioned statistics.
- Japanese
- * EUC-JP, SHIFT_JIS, ISO-2022-JP
- We implement these because they are widely used. EUC-JP and SHIFT_JIS
- are more used for files, whereas ISO-2022-JP is recommended for email.
- * CP932
- We implement this because it is the Microsoft variant of SHIFT_JIS,
- used on Windows.
- * ISO-2022-JP-2
- We implement this because it's the common way to represent mails which
- make use of JIS X 0212 characters.
- * ISO-2022-JP-1
- We implement this because it's in the RFCs, but I don't think it is
- really used.
- * U90, S90
- We DON'T implement this because I have no informations about what it
- is or who uses it.
- Simplified Chinese
- * EUC-CN = GB2312
- We implement this because it is the widely used representation
- of simplified Chinese.
- * GBK
- We implement this because it appears to be used on Solaris and Windows.
- * GB18030
- We implement this because it is an official requirement in the
- People's Republic of China.
- * ISO-2022-CN
- We implement this because it is in the RFCs, but I have no idea
- whether it is really used.
- * ISO-2022-CN-EXT
- We implement this because it's in the RFCs, but I don't think it is
- really used.
- * HZ = HZ-GB-2312
- We implement this because the RFCs recommend it for Usenet postings,
- and because MSIE4 supports it.
- Traditional Chinese
- * EUC-TW
- We implement it because it appears to be used on Unix.
- * BIG5
- We implement it because it is the de-facto standard for traditional
- Chinese.
- * CP950
- We implement this because it is the Microsoft variant of BIG5, used
- on Windows.
- * BIG5+
- We DON'T implement this because it doesn't appear to be in wide use.
- Only the CWEX fonts use this encoding. Furthermore, the conversion
- tables in the big5p package are not coherent: If you convert directly,
- you get different results than when you convert via GBK.
- * BIG5-HKSCS
- We implement it because it is the de-facto standard for traditional
- Chinese in Hongkong.
- Korean
- * EUC-KR
- We implement these because they appear to be the widely used
- representations for Korean.
- * CP949
- We implement this because it is the Microsoft variant of EUC-KR, used
- on Windows.
- * ISO-2022-KR
- We implement it because it is in the RFCs and because MSIE4 supports
- it, but I have no idea whether it's really used.
- * JOHAB
- We implement this because it is apparently used on Windows as a locale
- encoding (codepage 1361).
- * ISO-646-KR
- We DON'T implement this because although an old ASCII variant, its
- glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
- say it's a tilde, but Ken Lunde's "CJKV information processing" says
- it's an overline. And it is not ISO-IR registered.
- Armenian
- * ARMSCII-8
- We implement it because XFree86 supports it.
- Georgian
- * Georgian-Academy, Georgian-PS
- We implement these because they appear to be both used for Georgian;
- Xfree86 supports them.
- Thai
- * ISO-8859-11, TIS-620
- We implement these because it seems to be standard for Thai.
- * CP874
- We implement this because MSIE4 supports it.
- * MacThai
- We implement this because the Sun JDK does, and because Mac users
- don't deserve to be punished.
- Laotian
- * MuleLao-1, CP1133
- We implement these because XFree86 supports them. I have no idea which
- one is used more widely.
- Vietnamese
- * VISCII, TCVN
- We implement these because XFree86 supports them.
- * CP1258
- We implement this because MSIE4 supports it.
- Other languages
- * NUNACOM-8 (Inuktitut)
- We DON'T implement this because it isn't part of Unicode yet, and
- therefore doesn't convert to anything except itself.
- Platform specifics
- * HP-ROMAN8, NEXTSTEP
- We implement these because they were the native character set on HPs
- and NeXTs for a long time, and libiconv is intended to be usable on
- these old machines.
- Full Unicode
- * UTF-8, UCS-2, UCS-4
- We implement these. Obviously.
- * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
- We implement these because they are the preferred internal
- representation of strings in Unicode aware applications. These are
- non-ambiguous names, known to glibc. (glibc doesn't have
- UCS-2-INTERNAL and UCS-4-INTERNAL.)
- * UTF-16, UTF-16BE, UTF-16LE
- We implement these, because UTF-16 is still the favourite encoding of
- the president of the Unicode Consortium (for political reasons), and
- because they appear in RFC 2781.
- * UTF-32, UTF-32BE, UTF-32LE
- We implement these because they are part of Unicode 3.1.
- * UTF-7
- We implement this because it is essential functionality for mail
- applications.
- * C99
- We implement it because it's used for C and C++ programs and because
- it's a nice encoding for debugging.
- * JAVA
- We implement it because it's used for Java programs and because it's
- a nice encoding for debugging.
- * UNICODE (big endian), UNICODEFEFF (little endian)
- We DON'T implement these because they are stupid and not standardized.
- Full Unicode, in terms of `uint16_t' or `uint32_t'
- (with machine dependent endianness and alignment)
- * UCS-2-INTERNAL, UCS-4-INTERNAL
- We implement these because they are the preferred internal
- representation of strings in Unicode aware applications.
-
-Q: Support encodings mentioned in RFC 1345 ?
-A: No, they are not in use any more. Supporting ISO-646 variants is pointless
- since ISO-8859-* have been adopted.
-
-Q: Support EBCDIC ?
-A: No!
-
-Q: How do I add a new character set?
-A: 1. Explain the "why" in this file, above.
- 2. You need to have a conversion table from/to Unicode. Transform it into
- the format used by the mapping tables found on ftp.unicode.org: each line
- contains the character code, in hex, with 0x prefix, then whitespace,
- then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
- counts as a comment delimiter until end of line.
- Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
- can include it in his collection.
- 3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
- tools directory to generate the C code for the conversion. You may tweak
- the resulting C code if you are not satisfied with its quality, but this
- is rarely needed.
- If it's a two-dimensional character set (with rows and columns), use the
- 'cjk_tab_to_h' program in the tools directory to generate the C code for
- the conversion. You will need to modify the main() function to recognize
- the new character set name, with the proper dimensions, but that shouldn't
- be too hard. This yields the CCS. The CES you have to write by hand.
- 4. Store the resulting C code file in the lib directory. Add a #include
- directive to converters.h, and add an entry to the encodings.def file.
- 5. Compile the package, and test your new encoding using a program like
- iconv(1) or clisp(1).
- 6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
- encoding, create the complete table as a TXT file. For a stateful encoding,
- provide a text snippet encoded using your new encoding and its UTF-8
- equivalent.
- 7. Update the README and man/iconv_open.3, to mention the new encoding.
- Add a note in the NEWS file.
-
-Q: What about bidirectional text? Should it be tagged or reversed when
- converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
- this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
-A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
- ISO-8859-E remains to be implemented.
- On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
- is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
- the same as ISO-8859-8-I. I'm confused.
-
-Other character sets not implemented:
-"MNEMONIC" = "csMnemonic"
-"MNEM" = "csMnem"
-"ISO-10646-UCS-Basic" = "csUnicodeASCII"
-"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
-"ISO-10646-J-1"
-"UNICODE-1-1" = "csUnicode11"
-"csWindows31Latin5"
-
-Other aliases not implemented (and not implemented in glibc-2.1 either):
- From MSIE4:
- ISO-8859-1: alias ISO8859-1
- ISO-8859-2: alias ISO8859-2
- KSC_5601: alias KS_C_5601
- UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
-
-
-Q: How can I integrate libiconv into my package?
-A: Just copy the entire libiconv package into a subdirectory of your package.
- At configuration time, call libiconv's configure script with the
- appropriate --srcdir option and maybe --enable-static or --disable-shared.
- Then "cd libiconv && make && make install-lib libdir=... includedir=...".
- 'install-lib' is a special (not GNU standardized) target which installs
- only the include file - in $(includedir) - and the library - in $(libdir) -
- and does not use other directory variables. After "installing" libiconv
- in your package's build directory, building of your package can proceed.
-
-Q: Why is the testsuite so big?
-A: Because some of the tests are very comprehensive.
- If you don't feel like using the testsuite, you can simply remove the
- tests/ directory.
-