[Python-ideas] Processing surrogates in

Andrew Barnert abarnert at yahoo.com
Sat May 16 01:44:23 CEST 2015


On May 15, 2015, at 14:52, random832 at fastmail.us wrote:
> 
>> On Fri, May 15, 2015, at 15:37, Andrew Barnert wrote:

>> I explicitly mentioned opening the file in binary mode, reading it in,
>> and passing it to some fromstring function that takes bytes, so yes, of
>> course you have a byte array.
> 
> Why would a fromstring function take bytes?

I just gave you a specific example of this (simplejson.loads), and explained why they do it (because the same code is how they work with str in 2.x), in the very next paragraph, which you snipped out. And I'd already explained it in the previous email. I'm not sure how many other ways there are to explain it. I'd bet that the vast majority of modules on PyPI that have a fromstring/loads/parsetxt/readcsv/etc.-style function can take bytes; how is this surprising to you?
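
To make that concrete, here's a minimal sketch of the simplejson case (current simplejson.loads accepts bytes as well as str; the filename is just an illustration):

    import simplejson

    with open("data.json", "rb") as f:   # binary mode: no decoding happens
        raw = f.read()                   # raw is a bytes object
    obj = simplejson.loads(raw)          # loads() takes the bytes directly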

> How would you use re.split
> on it?

On a bytes? This is explained in the second line of the re docs: re works with byte patterns and strings just as it works with Unicode patterns and strings.
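
For instance, a bytes pattern splits a bytes haystack exactly the way a str pattern splits a str:

    import re

    data = b"alpha\r\nbeta\r\ngamma"
    re.split(rb"\r\n", data)    # -> [b'alpha', b'beta', b'gamma']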

>> But anyway, I'll grant that you usually shouldn't have WCHARs before
>> you've decoded.
>> 
>> But you definitely should not have WCHARs _after_ you've decoded. In
>> fact, you _can't_ have them after you've decoded, because a WCHAR isn't
>> big enough to hold a Unicode code point.
> 
> You're nitpicking on word choice.

No, I'm not. Pretending 16-bit wide chars are "Unicode" is not just a trivial matter of bad word choice; it's wrong, and it's exactly how the world created the problems that this thread is trying to help solve.

Win32, Cocoa, and Java have the good excuse that they were created back when Unicode only had 64K code points and, as far as anyone believed, always would. So they were based on UCS-2, and later going from there to UTF-16 broke less code than going from there to UCS-4 would have. But that isn't a good reason for any new framework, library, or app to use UTF-16.

> Going from bytes to UTF-16 words
> [whether as WCHAR or unsigned short] is a form of decoding.

Only in the same sense that going from Shift-JIS to UTF-8 is a form of decoding. Or, for that matter, going from UTF-16 to Baudot 6-bit units, if that's what your code wants to work on.

If your code treats UTF-8 or UTF-16 or Shift-JIS strings as sequences of Unicode characters, it makes sense to call that decoding. If your code treats them as sequences of bytes or words, then what you have is still encoded bytes or words, not strings, and it's misleading to call that decoding.
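
To sketch that distinction in Python terms: transcoding Shift-JIS bytes into UTF-8 bytes leaves you holding encoded bytes on both ends, so from that code's point of view nothing was really decoded:

    data = "テスト".encode("shift_jis")               # Shift-JIS bytes
    utf8 = data.decode("shift_jis").encode("utf-8")  # still encoded bytes,
                                                     # just a different encoding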

> Or don't you
> think python narrow builds' decode function was properly named?

The real problem was that Python narrow builds shouldn't have existed in the first place. That was fixed in 3.3, so I don't think I need to argue that it should be fixed.

>> But many specific static patterns _do_ work with ASCII compatible
>> encodings. Again, think of HTTP responses. Even though the headers and
>> body are both text, they're defined as being separated by b"\r\n\r\n".
> 
> Right, but those aren't UTF-8. Working with ASCII is fine, but don't
> pretend you've actually found a way to work with UTF-8.

But the same functions _do_ work for UTF-8. That's one of the whole points of UTF-8: every byte is unambiguously either a single character, a leading byte, or a continuation byte. This means you can search any UTF-8 encoded string for any UTF-8-encoded substring (or any regex pattern) and it will never have false positives (or negatives), whether that substring or pattern is b'\r\n\r\n' or '🏠'.encode('utf-8').
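
Here's a quick sketch of that property with Python bytes, reusing the HTTP example from above:

    response = ("HTTP/1.1 200 OK\r\n"
                "Content-Type: text/plain; charset=utf-8\r\n"
                "\r\n"
                "maison 🏠 house").encode("utf-8")

    headers, _, body = response.partition(b"\r\n\r\n")  # safe on UTF-8 bytes
    body.find("🏠".encode("utf-8"))  # finds the emoji; never a false positive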

And that's the only reason that searching UTF-16 works: every word is unambiguously either a single character, a leading surrogate, or a continuation surrogate. So UTF-16 is exactly the same as UTF-8 here, for exactly the same reason; it's not better.
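
The parallel classification for UTF-16, as a sketch (the helper name is mine, not a standard API): every 16-bit code unit falls into exactly one of three ranges:

    def classify_utf16_unit(unit):
        # Mirrors the UTF-8 byte classification above.
        if 0xD800 <= unit <= 0xDBFF:
            return "lead surrogate"         # starts a two-unit character
        elif 0xDC00 <= unit <= 0xDFFF:
            return "trail surrogate"        # continues one
        else:
            return "single-unit character"  # a whole BMP character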

>> Preferring UTF-32 over UTF-8 makes perfect sense. But that's not what you
>> started out arguing. Nick mentioned off-hand that UTF-16 has the worst of
>> both worlds of UTF-8 and UTF-32, Stephen explained that further to
>> someone else, and you challenged his explanation, arguing that UTF-16
>> doesn't introduce any problems over UTF-8.
>> But it does. It introduces all
>> the same problems as UTF-32, but without any of the benefits.
> 
> No, because UTF-32 has the additional problem, shared with UTF-8, that
> (Windows) libc doesn't support it.

But Windows libc doesn't support UTF-16. When you call wcslen on "🏠", that emoji counts as 2 characters, not 1. The documentation says it returns "the count of characters", measured in "wide (two-byte) characters", but those aren't actually characters.
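
You can reproduce wcslen's arithmetic from Python, which has counted code points since 3.3 (UTF-16-LE being the byte order Windows uses):

    s = "🏠"
    len(s)                           # 1 -- one code point
    len(s.encode("utf-16-le")) // 2  # 2 -- UTF-16 code units, wcslen's answer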

> My point was that if you want the benefits of using libc you have to pay
> the costs of using libc, and that means using libc's native encodings.
> Which, on Windows, are UTF-16 and (e.g.) Codepage 1252. If you don't
> want the benefits of using libc, then there's no benefit to using UTF-8.


The traditional libc functions like strlen and strstr don't care what your native encoding or actual encoding are. Some of them will produce the right result with UTF-8 even if it isn't your locale encoding (strstr); some will produce the wrong result even if it is (strlen). There are also some newer functions that do care (mbslen), which are right only if UTF-8 is your locale encoding (which it probably isn't, and you're probably not going to set LC_CTYPE yourself).

The ones that are always right with UTF-8 have corresponding wide functions that are right with UTF-16; the ones that are always wrong with UTF-8 have corresponding wide functions that are always wrong with UTF-16; and the ones that are locale-dependent have no corresponding wide functions at all, forcing you to use functions that are always wrong.

Microsoft's libc documentation is seriously misleading about the wide functions, describing functions like wcslen as returning "the count in characters", but it's generally not misleading about strlen. Catching UTF-8 strlen-style bugs before release requires testing some non-English text; catching UTF-16 wcslen-style bugs requires testing very specific kinds of text (you pretty much have to know what an astral is to even guess what kind of text you need--although emoji are making the problem more noticeable).
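
In Python terms, as a rough analogue: the strlen-style count (bytes of UTF-8) miscounts any non-ASCII text, while the wcslen-style count (UTF-16 code units) only miscounts astral text, so only the second test string below exposes it:

    s1 = "héllo"                      # any non-English text catches strlen
    len(s1.encode("utf-8"))           # 6 -- strlen analogue, wrong (5 chars)
    len(s1.encode("utf-16-le")) // 2  # 5 -- wcslen analogue looks fine here

    s2 = "I 🏠 Unicode"               # astral text is needed to catch wcslen
    len(s2.encode("utf-16-le")) // 2  # 12 -- wcslen analogue, wrong (11 chars)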

Which part of that is an advantage for UTF-16 with libc? In every case, it's either the same as UTF-8 (strstr) or worse (both mbslen and strlen, for different reasons).

