Everything you did not want to know about Unicode in Python 3

Tue May 13 04:38:03 EDT 2014

On Tue, May 13, 2014 at 6:25 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Johannes Bauer <dfnsonfsduifb at gmx.de>:
>
>> Having dealt with the UTF-8 problems on Python2 I can safely say that
>> I never, never ever want to go back to that freaky hell. If I deal
>> with strings, I want to be able to sanely manipulate them and I want
>> to be sure that after manipulation they're still valid strings.
>> Manipulating the bytes representation of unicode data just doesn't
>> work.
>
> Based on my background (network and system programming), I'm a bit
> suspicious of strings, that is, text. For example, is the stuff that
> goes to syslog bytes or text? Does an XML file contain bytes or
> (encoded) text? The answers are not obvious to me. Modern computing is
> full of ASCII-esque binary communication standards and formats.

These are problems that Unicode can't solve. In theory, XML should
contain text in a known encoding (defaulting to UTF-8). With syslog,
it's problematic - I don't remember what it's meant to be, but I know
there are issues. Same with other log files.

> Python 2's ambiguity allows me not to answer the tough philosophical
> questions. I'm not saying it's necessarily a good thing, but it has its
> benefits.

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

ChrisA