Newbie question about text encoding

Fri Mar 6 09:50:08 EST 2015

Rustom Mody wrote:

> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

> Strawman.

Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
they really should say is "Yes, that's a good argument, I'm afraid I can't
argue against it, at least not without considerable thought", I'd be a
wealthy man...

> Lets please stick to UTF-16 shall we?
> 
> Now tell me:
> - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate variable-width
encoding, and considerably better than most other variable-width encodings.

However, many implementations of UTF-16 are faulty: they assume a fixed
width of two bytes per code point, and so mishandle anything outside the
Basic Multilingual Plane. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)
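
To make the variable-width point concrete, here is a rough sketch in
Python (assuming 3.3 or later, where len() counts code points). The helper
name utf16_code_units is my own invention for illustration, not anything
from the standard library:

```python
def utf16_code_units(text):
    """Count the 16-bit code units in the UTF-16 encoding of text."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"   # U+1F600, outside the BMP

# One code point each, but the second needs a surrogate pair in UTF-16.
print(len(bmp), utf16_code_units(bmp))
print(len(astral), utf16_code_units(astral))
```

An implementation that treats code units as characters will report the
second string as having length two, and can slice it in half.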

> - Is it widely used or not?

It's quite widely used: Java, JavaScript and the Windows API all use it
for their native strings.

> - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses UTF-16
and whether the implementation is buggy. That will help them decide whether
or not to use that language.

> - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
content of text files), use UTF-8. It is well-implemented by nearly all
languages that support Unicode, as far as I know.

- If you are designing your own language, your implementation of Unicode
strings should use something like Python's FSR, or UTF-8 with tweaks to
make string indexing O(1) rather than O(N), or correctly-implemented
UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
2015, you design your Unicode implementation as if UTF-16 were a fixed
two-bytes-per-code-point format, you fail.

- If you are using an existing language, be aware of any bugs and
limitations in its Unicode implementation. You may or may not be able to
work around them, but at least you can decide whether or not you wish to
try.

- If you are writing your own file system layer, it's 2015 fer fecks sake,
file names should be Unicode strings, not bytes! (That's one part of the
Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
system, whichever you please, but again remember that both are
variable-width formats.
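
If you want to see the "both are variable-width" point for yourself, a
minimal sketch along these lines should do (Python 3 assumed; the function
encoded_widths and the sample characters are just my choices for
illustration):

```python
def encoded_widths(ch):
    """Return (utf8_bytes, utf16_bytes, utf32_bytes) for one code point."""
    # The "-le" codecs are used so that no byte order mark is counted.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# ASCII letter, Latin-1 letter, euro sign, and a code point beyond the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print("U+%04X" % ord(ch), encoded_widths(ch))
```

Only the UTF-32 column stays constant. UTF-8 ranges from one to four bytes
per code point, and UTF-16 is either two or four.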

-- 
Steven