Everything you did not want to know about Unicode in Python 3

Tue May 13 01:10:00 EDT 2014

On Tuesday, May 13, 2014 6:48:35 AM UTC+5:30, Steven D'Aprano wrote:
> On Mon, 12 May 2014 17:47:48 +0000, alister wrote:
> 
> > Surely those example programs are not the pythonic way to do things or
> > am I missing something?
> 
> 
> 
> Feel free to show us your version of "cat" for Python then. Feel free to 
> target any version you like. Don't forget to test it against files with 
> names and content that:
> 
> 
> - aren't valid UTF-8;
> 
> 
> - are valid UTF-8, but not valid in the local encoding.

Thanks for a non-defensive appraisal!

> 
> 
> > if those code samples are anything to go by this guy makes JMF look
> > sensible.
> 
> 
> 
> Armin Ronacher is an extremely experienced and knowledgeable Python 
> developer, and a Python core developer. He might be wrong, but he's not 
> *obviously* wrong.
> 
> 
> 
> Unicode is hard, not because Unicode is hard, but because of legacy 
> problems. I can create a file on a machine that uses ISO-8859-7 for the 
> file name, put Shift-JIS encoded text inside it, transfer it to a 
> machine that uses Windows-1251 as the file system encoding, then SSH into 
> that machine from a system using Big5, and try to make sense of it. If 
> everybody used UTF-8 any time data touched a disk or network, we'd be 
> laughing. It would all be so simple.
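
To make the legacy problem above concrete, here is a minimal sketch of what
happens when bytes written under one encoding are read back under another.
The sample word and the helper name read_back are my own assumptions for
illustration; the codec names are the standard Python spellings of two of
the encodings mentioned in the quoted paragraph.

```python
def read_back(text, written_as, read_as):
    """Encode text with one codec, then decode the same bytes with another.

    Hypothetical helper, for illustration only.
    """
    raw = text.encode(written_as)
    return raw.decode(read_as, errors="replace")


name = "Ελληνικά"  # a Greek word, standing in for a file name

# Same codec on both sides: the text survives intact.
same = read_back(name, "iso8859_7", "iso8859_7")

# Written as ISO-8859-7, read as Windows-1251: no error is raised,
# but the result is a different (Cyrillic-looking) string.
garbled = read_back(name, "iso8859_7", "cp1251")

# The same bytes are not valid UTF-8 at all, so a strict decode fails.
try:
    name.encode("iso8859_7").decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(same == name, garbled == name, valid_utf8)
```

The uncomfortable part is the middle case: nothing fails, the text is
simply wrong, and the bytes themselves carry no record of which encoding
was intended.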

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. Doing without Unicode is not an option

> 
> 
> 
> Reading Armin's post, I think that all that is needed to simplify his 
> Python 3 version is:
> 
> 
> 
> - have a bytes version of sys.argv (bargv? argvb?) and read 
>   the file names from that;
> 
> - have a simple way to write bytes to stdout and stderr.
> 
> 
> Most programs won't need either of those, but file system utilities will.
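
For readers who want to see roughly what those two pieces could look like,
here is a hedged sketch of a bytes-only "cat". It is not Armin's code nor
the proposed API: argv_bytes and cat are hypothetical names of my own. It
leans on two things that do exist in Python 3, os.fsencode() and the binary
sys.stdout.buffer stream, and it assumes a POSIX system where undecodable
argument bytes were smuggled into sys.argv via the surrogateescape handler.

```python
import os
import shutil
import sys


def argv_bytes(argv=None):
    """Return the command-line arguments as bytes.

    Hypothetical helper: os.fsencode() reverses the decoding Python
    applied to the arguments at startup (an assumption that holds on
    POSIX, where surrogateescape preserves undecodable bytes).
    """
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


def cat(paths, out):
    """Copy the raw bytes of each file in paths to the binary stream out.

    Nothing is decoded, so neither file names nor contents need to be
    valid in any particular encoding.
    """
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
```

A script would then call cat(argv_bytes()[1:], sys.stdout.buffer). Note
that sys.stdout.buffer is not guaranteed to exist if stdout has been
replaced by some other file-like object, which is part of why a simpler,
official way to write bytes was being asked for.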

About the technical merits of Armin's post and your suggestions, I've 
nothing to say, since I am an ignoramus on (the mechanics of) Unicode.

[Consider me an eager, early, ignorant adopter :-) ]

It is, however, worth noting that Unicode is rather unique in the history
not just of IT/CS but of humanity: no one (to the best of my knowledge)
had ever before tried to come up with an all-encompassing umbrella for
all of humanity's scripts and writing systems.

So hiccups and mistakes are only to be expected.  The absence of these would
be much more surprising!