Everything you did not want to know about Unicode in Python 3

Mon May 12 21:18:35 EDT 2014

On Mon, 12 May 2014 17:47:48 +0000, alister wrote:

> On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:
> 
>> This was *NOT* written by our resident unicode expert
>> http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
>> 
>> Posted as I thought it would make a rather pleasant change from
>> interminable threads about names vs values vs variables vs objects.
> 
> Surely those example programs are not the pythonoic way to do things or
> am i missing something?

Feel free to show us your version of "cat" for Python then. Feel free to 
target any version you like. Don't forget to test it against files with 
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.

> if those code samples are anything to go by this guy makes JMF look
> sensible.

Armin Ronacher is an extremely experienced and knowledgeable Python 
developer, and a Python core developer. He might be wrong, but he's not 
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy 
problems. I can create a file on a machine that uses ISO-8859-7 for the 
file name, put JShift-JIS encoded text inside it, transfer it to a 
machine that uses Windows-1251 as the file system encoding, then SSH into 
that machine from a system using Big5, and try to make sense of it. If 
everybody used UTF-8 any time data touched a disk or network, we'd be 
laughing. It would all be so simple.

Reading Armin's post, I think that all that is needed to simplify his 
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read 
  the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/