[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Stephen J. Turnbull stephen at xemacs.org
Fri May 27 12:20:24 CEST 2011


Nick Coghlan writes:
 > On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy at gmail.com> wrote:
 > > I love unicode and use unicode when I can use it.
 > > But this is a problem in the real world.
 > > For example, Python 2 is convenient for analyzing line based logs
 > > containing some different encodings.

Where's the use case for bytes here?

 > > Python 3
 > 
 > ...deliberately makes that difficult because it is *wrong*.

Nick, you should have stopped there. :-)  I can see very little
difference between Python 2 and Python 3 in this use case, except that
Python 2 makes it much easier to write crash-prone programs.  In
both versions, one safe thing to do for such a program is to
slurp the whole log with open(log, encoding=<whatever>,
errors=<something nonfatal>) (that's Python 3 code; Python 2 makes
this more tedious, in fact).  There's no need for reading as bytes in
Python 3 visible here, so move along, people!
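
A minimal sketch of that approach in Python 3, for illustration only.
The temporary file, its contents, and the choice of UTF-8 with
errors="replace" are assumptions made up for the example; any encoding
and any nonfatal error handler would serve.

```python
import os
import tempfile

# Hypothetical log: one line encoded as UTF-8, one as Latin-1.
raw = "caf\u00e9 ok\n".encode("utf-8") + "caf\u00e9 ok\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(raw)
    log_path = tmp.name

try:
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    with open(log_path, encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
finally:
    os.remove(log_path)

for line in lines:
    print(repr(line))
```

The Latin-1 line comes back with a replacement character where the
undecodable byte was, and the program keeps going.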

Alternatively, one could write a function that reads lines from the
log as bytes, and tries different encodings for each line (perhaps
interacting with the user) and eventually uses some default encoding
and a nonfatal error handler to get *something*.  This requires
reading as bytes, but it's no easier to write in Python 2 AFAICS.
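
Roughly what I have in mind, as a sketch rather than a finished tool.
The name decode_line, the candidate list, and the UTF-8 default are
all hypothetical, and the "ask the user" step is left out.

```python
def decode_line(raw, candidates=("utf-8", "euc_jp"), default="utf-8"):
    """Decode one line of bytes, trying each candidate encoding strictly.

    Returns (text, encoding). If no candidate decodes cleanly, falls back
    to `default` with a nonfatal error handler and returns (text, None).
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: get *something*, marking bad bytes with U+FFFD.
    return raw.decode(default, errors="replace"), None


def read_log(path):
    """Yield (text, encoding) for each line of a log opened in binary mode."""
    with open(path, "rb") as f:
        for raw in f:
            yield decode_line(raw.rstrip(b"\r\n"))
```

Note that the order of candidates matters: a permissive codec such as
latin_1 never fails, so it only makes sense as the last one tried.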

Granted, such a function will not easily be portable between Python 2
and 3, but that's a different problem.

 > Binary files containing a mixture of encodings cannot be safely
 > treated as text.

"Safety" is use-case-dependent.  I suppose Inada-san considers using
Python 2 strs to receive file input safe enough for his log analyzer.
While we shouldn't encourage that (and either errors='ignore' or
errors='surrogateescape' should be easy enough for him in the log
analysis case[1]), I don't think we should demand GIGO with 100%
fidelity in all use cases, either.
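
To make the difference between the two handlers concrete, here is a
small sketch; the sample log line is made up for illustration.

```python
# One stray Latin-1 byte (0xE9) inside an otherwise ASCII log line.
raw = b"user=caf\xe9 status=200\n"

# 'ignore' silently drops the undecodable byte: lossy, but never raises.
ignored = raw.decode("utf-8", errors="ignore")

# 'surrogateescape' maps the byte to a lone surrogate (U+DCE9), so the
# original bytes can be recovered when encoding with the same handler.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(repr(ignored))
print(repr(escaped))
print(restored == raw)
```

With 'ignore' the garbage is gone for good; with 'surrogateescape' it
survives a round trip, which matters only if the analyzer writes the
lines back out.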

Footnotes: 
[1]  In new code.  Again, a port of existing Python 2 code to Python 3
might not be trivial, depending on how he handles unexpected encodings
and how pervasively they are manipulated in his program.



