[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.
Stephen J. Turnbull
stephen at xemacs.org
Fri May 27 12:20:24 CEST 2011
Nick Coghlan writes:
> On Fri, May 27, 2011 at 4:14 PM, INADA Naoki <songofacandy at gmail.com> wrote:
> > I love unicode and use unicode when I can use it.
> > But this is a problem in the real world.
> > For example, Python 2 is convenient for analyzing line based logs
> > containing some different encodings.
Where's the use case for bytes here?
> > Python 3
>
> ...deliberately makes that difficult because it is *wrong*.
Nick, you should have stopped there. :-) I can see very little
difference between Python 2 and Python 3 in this use case, except that
Python 2 makes it much easier to write programs that crash easily. In
both versions, one safe thing to do for such a program is to slurp the
whole log with open(log, encoding=<whatever>, errors=<something
nonfatal>) (that's Python 3 code; Python 2 makes this more tedious, in
fact). I see no need here for reading as bytes in Python 3, so move
along, people!
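A minimal sketch of the "slurp with a nonfatal error handler" approach,
for illustration only. The helper name slurp_log, the choice of UTF-8,
and errors="replace" are assumptions, not anything from the original
discussion; the temporary file just stands in for a real log.

```python
import os
import tempfile


def slurp_log(path, encoding="utf-8", errors="replace"):
    """Read a whole log as text without raising on undecodable bytes.

    `encoding` and `errors` are illustrative defaults (assumptions).
    """
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()


# Build a small mixed-encoding log: one UTF-8 line, one Latin-1 line.
raw = "caf\u00e9 ok\n".encode("utf-8") + "caf\u00e9 bad\n".encode("latin-1")
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

text = slurp_log(demo_path)
os.remove(demo_path)

# The Latin-1 byte 0xE9 is not valid UTF-8 here, so it is replaced by
# U+FFFD instead of crashing the program.
print(ascii(text))
```

With errors="replace" the undecodable byte is lost, which is usually
acceptable when the goal is only to grep or count log lines.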
Alternatively, one could write a function that reads lines from the
log as bytes, and tries different encodings for each line (perhaps
interacting with the user) and eventually uses some default encoding
and a nonfatal error handler to get *something*. This requires
reading as bytes, but it's no easier to write in Python 2 AFAICS.
Granted, such a function will not easily be portable between Python 2
and 3, but that's a different problem.
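The per-line alternative could look roughly like the following in
Python 3. This is a sketch under stated assumptions: the function names,
the candidate encodings (UTF-8, then EUC-JP), and the UTF-8 fallback
with errors="replace" are all hypothetical choices, and the interactive
"ask the user" step is left out.

```python
import io


def decode_line(raw, candidates=("utf-8", "euc_jp"), fallback="utf-8"):
    """Return (text, encoding_used) for one line of bytes.

    Each candidate is tried strictly; if all fail, the fallback is used
    with a nonfatal error handler so that *something* is returned.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode(fallback, errors="replace"), fallback


def decode_lines(binary_file, **kwargs):
    """Yield (text, encoding_used) for each line of a binary file object."""
    for raw in binary_file:
        yield decode_line(raw.rstrip(b"\r\n"), **kwargs)


# "\u65e5\u672c\u8a9e" is the word "Nihongo" (Japanese) in kanji.
word = "\u65e5\u672c\u8a9e"
sample = io.BytesIO((word + "\n").encode("utf-8") + (word + "\n").encode("euc_jp"))
results = list(decode_lines(sample))

# A line that no candidate accepts falls through to the nonfatal fallback.
fallback_result = decode_line(b"caf\xe9", candidates=("ascii",))
```

Trying encodings in a fixed order is only a heuristic: a line can decode
without error under the wrong encoding, so the order of candidates
matters and the result should be treated as a best guess.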
> Binary files containing a mixture of encodings cannot be safely
> treated as text.
"Safety" is use-case-dependent. I suppose Inada-san considers using
Python 2 strs to receive file input safe enough for his log analyzer.
While we shouldn't encourage that (and either errors='ignore' or
errors='surrogateescape' should be easy enough for him in the log
analysis case[1]), I don't think we should demand GIGO with 100%
fidelity in all use cases, either.
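To make the difference between the two handlers concrete, here is a
small sketch on a made-up log line (the sample bytes are an assumption
for illustration). errors='ignore' silently drops the undecodable byte,
while errors='surrogateescape' keeps it as a lone surrogate that can be
turned back into the original byte on output.

```python
# A UTF-8 log line with one stray Latin-1 byte (0xE9) in it.
raw = b"status=ok user=caf\xe9\n"

# 'ignore' discards the byte that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# 'surrogateescape' maps the byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

print(ascii(ignored))
print(ascii(escaped))
print(round_tripped == raw)
```

Note that a string containing lone surrogates cannot be encoded with the
default strict handler, so the same handler has to be used wherever the
text is written back out.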
Footnotes:
[1] In new code. Again, a port of existing Python 2 code to Python 3
might not be trivial, depending on how he handles unexpected encodings
and how pervasively they are manipulated in his program.