jni/iconv/NOTES


Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in wide-spread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS             1.6%
     ISO-2022-JP           1.2%
     EUC-JP                0.4%
     CP1250                0.3%
     CP1251                0.2%
     CP850                 0.1%
     MACINTOSH             0.1%
     ISO-8859-5            0.1%
     ISO-8859-2            0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                          English, Estonian, Faroese, Finnish, French,
                          Galician, German, Greenlandic, Icelandic,
                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
                          Occitan, Portuguese, Scottish, Spanish, Swedish,
                          Walloon, Welsh
     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                          Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3           Esperanto
     ISO-8859-4           Estonian, Latvian, Lithuanian
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9           Turkish
     ISO-8859-14          Breton, Irish, Scottish, Welsh
     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                          Faroese, Finnish, French, Galician, German,
                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
                          Swedish, Walloon, Welsh
     KOI8-R               Russian
     KOI8-U               Russian, Ukrainian
     EUC-JP (alias eucJP)      Japanese
     ISO-2022-JP (alias JIS7)  Japanese
     SHIFT_JIS (alias SJIS)    Japanese
     U90                       Japanese
     S90                       Japanese
     EUC-CN (alias eucCN)      Chinese
     EUC-TW (alias eucTW)      Chinese
     BIG5                      Chinese
     EUC-KR (alias eucKR)      Korean
     ARMSCII-8                 Armenian
     GEORGIAN-ACADEMY          Georgian
     GEORGIAN-PS               Georgian
     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
     VISCII                    Vietnamese
     TCVN                      Vietnamese
     NUNACOM-8                 Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

     Where is this documented? For the complete picture, I had to use
     "strings netscape" and then a lot of guesswork. For a quick take,
     look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R               Cyrillic
     CP866                Cyrillic
     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP               Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     BIG5                 Chinese
     EUC-TW               Chinese
     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R               Cyrillic
     KOI8-RU              Ukrainian
     ASMO-708             Arabic
     EUC-JP               Japanese
     ISO-2022-JP          Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     HZ-GB-2312           Chinese
     BIG5                 Chinese
     EUC-KR               Korean
     ISO-2022-KR          Korean
     WINDOWS-874          Thai
     WINDOWS-1258         Vietnamese

     UTF-8
     UTF-7
     UNICODE             actually UNICODE-LITTLE
     UNICODEFEFF         actually UNICODE-BIG

     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of all these four sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and MacOS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are more used for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent mails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * U90, S90
       We DON'T implement these because I have no information about what they
       are or who uses them.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because although an old ASCII variant, its
       glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV information processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because they appear to be both used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be the standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character sets on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
   Full Unicode, in terms of `uint16_t' or `uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.
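
   The difference between the fixed-endianness names (UCS-2BE, UCS-2LE) and
   the machine dependent UCS-2-INTERNAL can be illustrated with a small
   Python sketch. This is not libiconv code: the function name ucs2_encode
   and the byteorder keywords are hypothetical, and the sketch assumes the
   input contains only characters from the Basic Multilingual Plane.

```python
import struct
import sys


def ucs2_encode(text, byteorder):
    """Encode BMP-only text as 16-bit units.

    byteorder is 'big' (like UCS-2BE), 'little' (like UCS-2LE) or
    'native' (like UCS-2-INTERNAL: whatever the host machine uses).
    These keywords are illustrative, not libiconv names.
    """
    fmt = {"big": ">H", "little": "<H", "native": "=H"}[byteorder]
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # UCS-2 has no surrogate pairs, so only plain BMP code points fit.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("U+%04X is not representable in UCS-2" % cp)
        out += struct.pack(fmt, cp)
    return bytes(out)


if __name__ == "__main__":
    for order in ("big", "little", "native"):
        print(order, ucs2_encode("A", order).hex())
    print("host byte order:", sys.byteorder)
```

   The 'native' result matches either the 'big' or the 'little' result,
   depending on the machine, which is why the INTERNAL names are only
   suitable for in-memory strings and not for data exchange.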

Q: Support encodings mentioned in RFC 1345 ?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC ?
A: No!

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
   the format used by the mapping tables found on ftp.unicode.org: each line
   contains the character code, in hex, with 0x prefix, then whitespace,
   then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
   counts as a comment delimiter until end of line.
   Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
   can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
   tools directory to generate the C code for the conversion. You may tweak
   the resulting C code if you are not satisfied with its quality, but this
   is rarely needed.
   If it's a two-dimensional character set (with rows and columns), use the
   'cjk_tab_to_h' program in the tools directory to generate the C code for
   the conversion. You will need to modify the main() function to recognize
   the new character set name, with the proper dimensions, but that shouldn't
   be too hard. This yields the CCS. The CES you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
   directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
   iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
   encoding, create the complete table as a TXT file. For a stateful encoding,
   provide a text snippet encoded using your new encoding and its UTF-8
   equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
   Add a note in the NEWS file.
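
   The table format described in step 2 can be checked before it is fed to
   the tools. The following Python sketch is not part of libiconv and does
   not replace '8bit_tab_to_h' or 'cjk_tab_to_h'; the function name
   parse_mapping_table is hypothetical. It assumes that lines with fewer
   than two fields (for example, codes listed without a Unicode value)
   should simply be skipped.

```python
def parse_mapping_table(text):
    """Parse a unicode.org-style mapping table into {code: code_point}.

    Each data line holds the character code in hex with 0x prefix,
    whitespace, then the Unicode code point in hex with 0x prefix.
    '#' starts a comment that runs to the end of the line.
    """
    table = {}
    for raw_line in text.splitlines():
        # Drop the comment part, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # Assumption: a code without a mapping is treated as undefined.
            continue
        code = int(fields[0], 16)        # int() accepts the 0x prefix
        code_point = int(fields[1], 16)
        table[code] = code_point
    return table


if __name__ == "__main__":
    sample = (
        "# Example table (made-up excerpt)\n"
        "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
        "0xA4\t0x20AC\t# EURO SIGN\n"
    )
    for code, cp in sorted(parse_mapping_table(sample).items()):
        print("0x%02X -> U+%04X" % (code, cp))
```

   A check like this catches malformed lines early; the generated C code
   still comes from the programs in the tools directory.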

Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-8-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8


Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.