Newbie question about text encoding

Fri Mar 6 04:54:00 EST 2015

On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody <rustompmody at gmail.com> wrote:
>> Broken systems can be shown up by anything. Suppose you have a program
>> that breaks when it gets a NUL character (not unknown in C code); is
>> the fault with the Unicode consortium for allocating something at
>> codepoint 0, or the code that can't cope with a perfectly normal
>> character?
>
> Strawman.

Not really, no. I know of lots of programs that can't handle embedded
NULs, and which fail in various ways when given them (the most common
is simple truncation, but it's far from the only one). And it's
exactly the same: a program that purports to handle arbitrary Unicode
text should be able to handle arbitrary Unicode text, not "Unicode
text as long as it contains only codepoints within the range X-Y". It
doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
U+1F4A3 - if your code blows up, it's a failure in your code.
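
A quick way to check that claim against your own pipeline is a
round-trip test. This is only a sketch: the helper name round_trips
and the choice of encodings are mine, and you would substitute
whatever your program actually does to the text for the plain
encode/decode pair used here.

```python
# The code points named above: NUL, backslash, the object replacement
# character, and one character from outside the 16-bit range.
SAMPLES = ["\u0000", "\u005c", "\ufffc", "\U0001F4A3"]


def round_trips(text, encoding):
    """Return True if text survives an encode/decode cycle unchanged."""
    return text.encode(encoding).decode(encoding) == text


# Embed each sample between ordinary characters, so that truncation at
# the sample (the classic NUL failure) would show up as a mismatch.
results = {
    (hex(ord(ch)), enc): round_trips("a" + ch + "b", enc)
    for ch in SAMPLES
    for enc in ("utf-8", "utf-16-le")
}

for key, ok in sorted(results.items()):
    print(key, "ok" if ok else "BROKEN")
```

Python 3's own codecs pass all of these; the interesting results come
when you put your own processing step in the middle.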

> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

No, UTF-16 is not itself broken. (It would be if we expected
codepoints above U+10FFFF, and it's because of UTF-16 that that's the
cap on Unicode, but it's looking unlikely that we'll need any more
than that anyway.) What's broken is code that treats UTF-16 as if it
were UCS-2, assuming one 16-bit code unit per character, and then
breaks on surrogate pairs.
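
To make the difference concrete, here is a small sketch of what
UCS-2-style code gets wrong. The function names (utf16_code_units,
combine, count_code_points) are made up for this example; only
str.encode and struct.unpack come from the standard library.

```python
import struct

BOMB = "\U0001F4A3"  # an SMP character, needs two UTF-16 code units


def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 encoding."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def combine(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def count_code_points(units):
    """Count code points, treating each surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = utf16_code_units("a" + BOMB + "b")
ucs2_length = len(units)                # what UCS-2-style code reports
real_length = count_code_points(units)  # what the text actually contains
largest = combine(0xDBFF, 0xDFFF)       # highest pair UTF-16 can express

print([hex(u) for u in units], ucs2_length, real_length, hex(largest))
```

Code that counts, slices, or reverses by code unit sees four
"characters" here and can cut the pair in half. The last line also
shows where the 0x10FFFF cap comes from: it is what the highest
possible surrogate pair decodes to.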

Yes, it's widely used. Programmers should probably be warned about it,
but only because its tradeoffs are generally poorer than UTF-8's. If
you use it correctly, there's no problem.
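
As one illustration of the kind of tradeoff involved (size and ASCII
compatibility are my choice of example here, and they are not the
only considerations), compare the encoded forms directly. The helper
name sizes is made up for this sketch.

```python
def sizes(text):
    """Return the encoded length in bytes for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


ascii_sizes = sizes("hello")             # plain ASCII text
cjk_sizes = sizes("\u65e5\u672c\u8a9e")  # three BMP CJK characters

# UTF-16 puts zero bytes inside ordinary ASCII text, which is what
# trips up byte-oriented code that treats NUL as a terminator.
has_nul_bytes = b"\x00" in "hello".encode("utf-16-le")

print(ascii_sizes, cjk_sizes, has_nul_bytes)
```

ASCII-heavy text doubles in size under UTF-16, while some non-Latin
text comes out smaller, so it does cut both ways depending on what
your data looks like.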

> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> assume he is safe all over?

I don't know what you mean here. Do you mean that your Python 3
program is "at risk" in some way because there might be some other
program that misuses UTF-16? Well, sure. And there might be some other
program that misuses buffer sizes, SQL queries, or shell invocations,
and makes your overall system vulnerable to buffer overruns or
injection attacks. Those are significantly more likely AND more
serious than UTF-16 misuse. And you still haven't shown that SMP
characters are a problem, only that code can be broken. Broken code
is still broken code, whatever form the brokenness takes.

ChrisA