[I18n-sig] First draft of Unicode howto

"Martin v. Löwis" martin at v.loewis.de
Sun Aug 7 17:35:25 CEST 2005


A.M. Kuchling wrote:
> The 'Tips for Writing Unicode-aware Programs' is also very sparse,
> because I couldn't come up with much of anything very helpful.
> Suggestions for this section would also be appreciated.  

Some remarks as I go through:
- UTF-8 uses 4 bytes for characters at or above U+10000 (i.e. non-BMP
  characters), and 3 bytes in the range U+0800...U+FFFF

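- if you want to, you can further restrict the value ranges for
  the UTF-8 bytes: the second, third, and fourth bytes are always
  between 128 and 191; the first byte is 192..223 for two-byte,
  224..239 for three-byte, and 240..247 for four-byte sequences.
  (Well-formed UTF-8 is narrower still: 192, 193, and 245..247 never
  occur, because they would only encode overlong forms or code
  points above U+10FFFF.)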

  Because of this property, you can resynchronize after a damaged
  or truncated sequence by skipping ahead to the next byte outside
  128..191 (not that I'm aware of any application that commonly
  uses resynchronization). For the same reason, it is unlikely that
  you will encounter bytes that look like UTF-8 but aren't.
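
To make the byte ranges and the resynchronization idea concrete, here
is a minimal sketch, assuming Python 3 (where indexing a bytes object
yields integers). The names sequence_length and resynchronize are
hypothetical helpers, not part of any library. They use the ranges
exactly as listed above and do not validate whole sequences.

```python
def sequence_length(first_byte):
    """Length of the UTF-8 sequence that first_byte would start.

    Returns 0 if the byte cannot start a sequence under the ranges above.
    """
    if first_byte < 128:
        return 1                      # plain ASCII
    if 192 <= first_byte <= 223:
        return 2
    if 224 <= first_byte <= 239:
        return 3
    if 240 <= first_byte <= 247:
        return 4
    return 0                          # 128..191 (continuation) or 248..255


def resynchronize(data, pos):
    """Index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and 128 <= data[pos] <= 191:
        pos += 1
    return pos


# One character from each length class: U+0061, U+00E9, U+20AC, U+10348.
text = "a\u00e9\u20ac\U00010348"
lengths = [len(ch.encode("utf-8")) for ch in text]   # [1, 2, 3, 4]

encoded = text.encode("utf-8")
# Index 4 is in the middle of the three-byte sequence for U+20AC;
# skipping continuation bytes lands on the start of the next character.
restart = resynchronize(encoded, 4)                  # 6
```

Decoding encoded[restart:] then yields the final character intact,
which is the sense in which a reader can recover after landing in the
middle of a sequence.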

- The example for Unicode literals with encoding errors renders
  incorrectly (I see a question mark)

- If you mention Unicode character categories, you should elaborate
  a bit. Unicode categories are things like "Letter", "Symbol",
  "Punctuation", with subcategories like "Uppercase" or "Dash".
  The list of all categories is at

http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
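
As a small illustration of categories and subcategories, the standard
unicodedata module reports a two-letter code for each character: the
first letter is the major category and the second the subcategory.
This is only a sketch, assuming Python 3; the sample characters are
arbitrary choices.

```python
import unicodedata

# Arbitrary sample characters, chosen to hit the categories named above.
samples = ["A", "a", "-", "+", "1"]

# Map each character to its two-letter general category code.
categories = {ch: unicodedata.category(ch) for ch in samples}

# "A" -> "Lu"  Letter, uppercase
# "a" -> "Ll"  Letter, lowercase
# "-" -> "Pd"  Punctuation, dash
# "+" -> "Sm"  Symbol, math
# "1" -> "Nd"  Number, decimal digit

# The first letter alone is enough to test for the major category.
letters = [ch for ch in samples if categories[ch].startswith("L")]
```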

- reading data: you could point out that IO libraries sometimes
  already accept and return Unicode directly, the most prominent
  examples being GUI toolkits, XML parsers, and database interfaces;
  developers should check whether their library supports Unicode
  before adding their own decoding step.
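
For instance, two standard-library modules already hand back Unicode
strings without any explicit decoding by the caller. This is a sketch
only, assuming Python 3; the element name, table name, and sample
value are made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

name = "L\u00f6wis"   # contains a non-ASCII character

# XML: the parser receives UTF-8 bytes and returns text as str.
xml_bytes = ("<name>" + name + "</name>").encode("utf-8")
element = ET.fromstring(xml_bytes)
from_xml = element.text

# Database: TEXT values are stored and fetched as str.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people VALUES (?)", (name,))
(from_db,) = conn.execute("SELECT name FROM people").fetchone()
conn.close()
```

In both cases the application never touches the encoded bytes itself,
so there is nothing left for it to decode.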

Regards,
Martin

