Newbie question about text encoding

Rustom Mody rustompmody at gmail.com
Fri Mar 6 05:07:34 EST 2015


On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
> >> Broken systems can be shown up by anything. Suppose you have a program
> >> that breaks when it gets a NUL character (not unknown in C code); is
> >> the fault with the Unicode consortium for allocating something at
> >> codepoint 0, or the code that can't cope with a perfectly normal
> >> character?
> >
> > Strawman.
> 
> Not really, no. I know of lots of programs that can't handle embedded
> NULs, and which fail in various ways when given them (the most common
> is simple truncation, but it's by far not the only way).

Ah well, if you insist on pursuing the NUL-char example...
No, the Unicode Consortium (or its ASCII equivalent) is not wrong in allocating codepoint 0.
Nor is the code that "can't cope with a perfectly normal character".

The fault lies with C, for having a data structure called string with a 'hole' in it.
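
To make the 'hole' concrete, here is a minimal sketch using ctypes from Python. The variable names are just for illustration. It assumes nothing beyond the standard library, and shows the same bytes read once as a NUL-terminated C string and once as raw memory:

```python
import ctypes

data = b"abc\x00def"                     # 7 bytes with a NUL in the middle

# A C char array holding the data, plus one extra byte for the terminator.
buf = ctypes.create_string_buffer(data)

as_c_string = buf.value                  # read as a C string: stops at the first NUL
as_raw_bytes = buf.raw                   # the whole underlying buffer

print(len(data))                         # 7
print(as_c_string)                       # b'abc'
print(as_raw_bytes)                      # b'abc\x00def\x00'
```

The data is all there in memory; it is the NUL-terminated convention that cannot represent it, which is the simple-truncation failure mentioned above.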

> And it's
> exactly the same: a program that purports to handle arbitrary Unicode
> text should be able to handle arbitrary Unicode text, not "Unicode
> text as long as it contains only codepoints within the range X-Y". It
> doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
> U+1F4A3 - if your code blows up, it's a failure in your code.
> 
> > Let's please stick to UTF-16, shall we?
> >
> > Now tell me:
> > - Is it broken or not?
> > - Is it widely used or not?
> > - Should programmers be careful of it or not?
> > - Should programmers be warned about it or not?
> 
> No, UTF-16 is not itself broken. (It would be if we expected
> codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
> on Unicode, but it's looking unlikely that we'll be needing any more
> than that anyway.) What's broken is code that tries to treat UTF-16 as
> if it's UCS-2, and then breaks on surrogate pairs.
> 
> Yes, it's widely used. Programmers should probably be warned about it,
> but only because its tradeoffs are generally poorer than UTF-8's. If
> you use it correctly, there's no problem.
> 
> > Also:
> > Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> > assume he is safe all over?
> 
> I don't know what you mean here. Do you mean that your Python 3
> program is "at risk" in some way because there might be some other
> program that misuses UTF-16?

Yes, some other program/library/API etc. connected to the Python one.

> Well, sure. And there might be some other
> program that misuses buffer sizes, SQL queries, or shell invocations,
> and makes your overall system vulnerable to buffer overruns or
> injection attacks. These are significantly more likely AND more
> serious than UTF-16 misuses. And you still have not proven anything
> about SMP characters being a problem, but only that code can be
> broken. Broken code is still broken code, no matter what your actual
> brokenness.

Roy Smith (and many other links I've cited) prove exactly that - an
SMP character broke the code.

Note: I have no objection to people supporting full Unicode 7.
I'm just saying it may be significantly harder than just "Use python3 and you are done".
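
A rough sketch of what I mean, assuming a Python 3 program that hands text to some other component that counts or cuts in 16-bit units. The helper names (utf16_code_units, non_bmp_chars, cut_at_code_units) are made up for illustration and are not from any library:

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def non_bmp_chars(text):
    """Characters above U+FFFF, each of which needs a surrogate pair in UTF-16."""
    return [c for c in text if ord(c) > 0xFFFF]


def cut_at_code_units(text, units):
    """Truncate the UTF-16 bytes after `units` code units, the way a
    UCS-2-minded component might, ignoring surrogate pairs."""
    return text.encode("utf-16-le")[: units * 2]


s = "a\U0001F4A3b"            # U+1F4A3 is an SMP codepoint

print(len(s))                 # 3 code points as Python 3 sees it
print(utf16_code_units(s))    # 4 code units as a UTF-16 consumer sees it
print(non_bmp_chars(s) == ["\U0001F4A3"])

# Cutting after 2 code units splits the surrogate pair in half.
broken = cut_at_code_units(s, 2)
try:
    broken.decode("utf-16-le")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
print(still_valid)            # False
```

Inside Python 3 the string is fine. The trouble starts at the boundary, where lengths and offsets no longer agree between the two sides.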


