[Python-Dev] Python3 "complexity"

Thu Jan 9 11:15:08 CET 2014

> -----Original Message-----
> From: Python-Dev [mailto:python-dev-
> bounces+kristjan=ccpgames.com at python.org] On Behalf Of Stefan Ring
> Sent: 9. janúar 2014 09:32
> To: python-dev at python.org
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> > just became harder to use for that purpose.
> 
> The entire discussion reminds me very much of the situation with file names
> in OS X. Whenever I want to look at an old zip file or tarball which happens to
> have been lying around on my hard drive for a decade or more, I can't
> because OS X insist that file names be encoded in
> UTF-8 and just throw errors if that requirement is not met. And certainly I
> cannot be required to re-encode all files to the then-favored encoding
> continually – although favors don’t change often and I’m willing to bet that
> UTF-8 is here to stay, but it has already happened twice in my active
> computer life (DOS -> latin-1 -> UTF-8).

Well, yes.
Also, the problem I'm describing comes from real-world use.
This is the Python 2 program:
with open(fn1) as f1:
    with open(fn2, 'w') as f2:
        f2.write(process_text(f1.read()))

Moving to Python 3, I found that this quickly caused problems.  So I explicitly added an encoding, guessing one that was likely, e.g. cp1252:
with open(fn1, encoding='cp1252') as f1:
    with open(fn2, 'w', encoding='cp1252') as f2:
        f2.write(process_text(f1.read()))

This mostly worked.  But with real-world data, we sometimes found that even files we had declared to be cp1252 contained bytes that are not valid in cp1252.  Was the file really in cp1252?  Or did someone mess up somewhere?  Or simply take a little poetic license with the specification?
This is when it started to become annoying.  Clearly something was broken at some point, or I don't know the exact encoding of the file.  But this is not the place to correct that mistake.  I want my program to be robust against such errors, and these errors exist.
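
A minimal sketch of the failure, using a made-up byte string (not real data from my files).  Python's cp1252 codec leaves a handful of byte values undefined, 0x81 among them, so a strict decode raises on the first one it meets:

```python
# Hypothetical sample: mostly cp1252 text with one stray byte (0x81),
# which Python's cp1252 codec does not map to any character.
data = b"caf\xe9 \x81 menu"

try:
    data.decode("cp1252")  # strict error handling is the default
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start  # byte offset of the first undecodable byte
```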

So, the third version was:
with open(fn1, 'rb') as f1:
    with open(fn2, 'wb') as f2:
        f2.write(process_bytes(f1.read()))

This works, but now I have a bytes object, which is rather limited in what it can do.  Also, all string constants in my process_bytes() function have to be b'foo' rather than 'foo'.

Only much later did I learn about 'surrogateescape'.  How is a new user of Python to know about it?  The final version would probably be this:
with open(fn1, encoding='cp1252', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='cp1252', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))

Will this always work?  I don't know.  I hope so.  But it seems very verbose when all you want to do is munge some bytes.  And the 'surrogateescape' error handler is not something that a newcomer to the language, or someone coming from Python 2, is likely to know about.
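
As a rough check of the round trip, here is a self-contained sketch.  The file names, the sample bytes and the body of process_text() are all invented for illustration; only the open() arguments come from the version above.

```python
import os
import tempfile

def process_text(text):
    # Hypothetical stand-in for the real processing step.
    return text.replace("menu", "MENU")

raw = b"caf\xe9 \x81 menu"  # 0x81 is not a valid cp1252 byte

with tempfile.TemporaryDirectory() as tmp:
    fn1 = os.path.join(tmp, "in.txt")
    fn2 = os.path.join(tmp, "out.txt")
    with open(fn1, "wb") as f:
        f.write(raw)

    # Undecodable bytes become lone surrogates on read and are turned
    # back into the original bytes on write.
    with open(fn1, encoding="cp1252", errors="surrogateescape") as f1:
        with open(fn2, "w", encoding="cp1252", errors="surrogateescape") as f2:
            f2.write(process_text(f1.read()))

    with open(fn2, "rb") as f:
        result = f.read()
```

In this sketch the stray byte survives unchanged.  Two caveats I am aware of: text mode still translates line endings unless newline='' is passed, and writing will still fail if process_text() produces characters that cp1252 cannot represent.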

Could this be made simpler?  What if we had an encoding that combines 'ascii' and 'surrogateescape'?  Something that allows you to read ASCII text with unknown high-order bytes without this unneeded verbosity?  Something that would be immediately obvious to the newcomer?
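
To show what I mean, here is a sketch using what exists today.  The helper name open_ascii_escaped() is hypothetical, as are the file name and sample bytes; it simply wraps open() with encoding='ascii' and errors='surrogateescape', which is the combination I would like to have under one obvious name.

```python
import os
import tempfile

def open_ascii_escaped(path, mode="r"):
    # Hypothetical helper (not in the standard library): treat the file
    # as ASCII text and carry every byte >= 0x80 through as a lone
    # surrogate, so it can be written back unchanged.
    return open(path, mode, encoding="ascii", errors="surrogateescape")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "wb") as f:
        f.write(b"na\xefve text")  # 0xEF in an unknown encoding

    with open_ascii_escaped(path) as f:
        text = f.read()

    # Ordinary str methods and 'foo'-style constants work on the ASCII part.
    with open_ascii_escaped(path, "w") as f:
        f.write(text.upper())

    with open(path, "rb") as f:
        result = f.read()
```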

K