[Python-ideas] TextIOWrapper callable encoding parameter

Wed Jun 13 17:56:02 CEST 2012

On 06/12/2012 03:48 PM, Victor Stinner wrote:
>> >> 1.  The most straight-forward way to handle this is to open
>> >> the file twice, first in binary mode or with latin1 encoding
>> >> and again in text mode after the encoding has been determined.
>> >> This of course has a performance cost since the data is read
>> >> twice.  Further, it can't be used if the data source is
>> >> a pipe, socket or other non-rewindable source.  This
>> >> includes sys.stdin when it comes from a pipe.
> > 
> > Some months ago, I proposed to automatically detect if a file contains
> > a BOM and use it to set the encoding. Various methods were proposed
> > but there was no real consensus. One proposition was to use a codec
> > (e.g. "bom") which uses the BOM if it is present, and so doesn't need to
> > read the file twice.
> > 
> > For the pipe issue: it depends where the encoding specification is. If
> > the encoding is written at the end of your "file" (stream), you have
> > to store the whole stream content (a few MB or maybe much more?) in
> > memory. If it is in the first lines, you have to store these lines in
> > a buffer. It's not easy to decide on the threshold.

That's always a problem.  When trying to determine a
character encoding one may have to read the entire file 
because it could consist of all ASCII characters except
the very last one.  (And of course there is no guarantee 
one can determine *the* encoding at all).

Nevertheless, I think there is a very large class of
problems that can be usefully handled by looking at a 
limited amount of data at the start of a file (or stream).

The Python coding declaration is one example (obviously
picked hoping it would have some resonance here).
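
As a rough illustration of looking at only the start of the data,
the stdlib's tokenize.detect_encoding() can be pointed at a few
already-read bytes.  The helper name sniff_coding and the sample
inputs are made up for this sketch; only detect_encoding() itself
is an existing API.

```python
import io
import tokenize


def sniff_coding(head):
    """Return the source encoding implied by the first bytes of a file.

    `head` is whatever bytes have been read so far.  detect_encoding()
    reads at most two lines, honours a UTF-8 BOM and a coding
    declaration, and falls back to UTF-8 when neither is present.
    """
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(head).readline)
    return encoding


declared = sniff_coding(b"# -*- coding: latin-1 -*-\nprint('x')\n")
default = sniff_coding(b"print('x')\n")
```

A function of this shape (bytes in, encoding name out) is the kind
of thing a callable encoding parameter would be given.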

The buffer object used by TextIOWrapper already reads the 
start of the stream and buffers the first few lines, so
why not take advantage of that rather than repeating the 
work?

One of the things I am not sure about is if there are 
cases when the buffered read returns, say, only one
line, as might happen with tty input.
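
For comparison, something close to this can be approximated today
with BufferedReader.peek(), which returns buffered bytes without
consuming them and so needs no seek.  This is only a sketch: the
names wrap_after_peek and choose are hypothetical, and peek() may
return fewer bytes than asked for (it does at most one raw read),
which is exactly the short-read case above.

```python
import io
import os


def wrap_after_peek(binary, choose_encoding, peek_size=1024):
    """Pick an encoding from buffered bytes, then wrap without seeking."""
    # peek() may return fewer or more bytes than requested.
    head = binary.peek(peek_size)[:peek_size]
    return io.TextIOWrapper(binary, encoding=choose_encoding(head))


def choose(head):
    # Toy chooser for the demo; a real one would parse the declaration.
    return "latin-1" if b"coding: latin-1" in head else "utf-8"


def demo():
    # A pipe stands in for any non-seekable source.
    read_fd, write_fd = os.pipe()
    with os.fdopen(write_fd, "wb") as writer:
        writer.write("# coding: latin-1\ncaf\xe9\n".encode("latin-1"))
    reader = os.fdopen(read_fd, "rb")  # a BufferedReader
    with wrap_after_peek(reader, choose) as text:
        return text.read()
```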

> > I don't like the codec approach because the codec is disconnected from
> > the stream. For example, the codec doesn't know the current position
> > in the stream nor can it read a few more bytes forward or backward. If you
> > open the file in "append" mode, you are not writing at the beginning
> > but at the end of the file. You may also seek to an arbitrary position
> > before the first read...
> > 
> > There are also some special cases. For example, when a text file is
> > opened in write mode, the file is seekable and the file position is
> > not zero, TextIOWrapper calls encoder.setstate(0) to not write the BOM
> > in the middle of the file. (See also Lib/test/test_io.py for related
> > tests.)

A callable encoding parameter would not be terribly useful
with a file opened in write or append mode, but its behavior
would be predictable: a write would result in an error
because the encoding hadn't been set.  A read in the middle
of the file would work the same way as at the beginning.
This is probably not very useful, but is consistent.
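
To make those semantics concrete, here is a toy stand-in written
on top of the existing io classes.  It is not the patch and not
how TextIOWrapper works; the class name DeferredTextWrapper, its
peek_size argument and the choice of ValueError are all
assumptions made for the sketch.

```python
import io


class DeferredTextWrapper:
    """Hypothetical wrapper whose encoding may be a callable."""

    def __init__(self, buffer, encoding, peek_size=1024):
        self._buffer = buffer        # must offer peek(), e.g. BufferedReader
        self._encoding = encoding    # str, or callable(bytes) -> str
        self._peek_size = peek_size
        self._text = None            # real TextIOWrapper, created lazily

    def _resolve(self):
        # Settle the encoding on first use, from bytes at the current position.
        if self._text is None:
            encoding = self._encoding
            if callable(encoding):
                encoding = encoding(self._buffer.peek(self._peek_size))
            self._text = io.TextIOWrapper(self._buffer, encoding=encoding)
        return self._text

    def read(self, size=-1):
        return self._resolve().read(size)

    def write(self, text):
        # A write before any read has nothing to base the encoding on.
        if self._text is None and callable(self._encoding):
            raise ValueError("encoding has not been determined yet")
        return self._resolve().write(text)


raw = io.BufferedRandom(io.BytesIO(b"caf\xc3\xa9\n"))
wrapper = DeferredTextWrapper(raw, lambda head: "utf-8")
try:
    wrapper.write("x")
    write_failed = False
except ValueError:
    write_failed = True
content = wrapper.read()
```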

Of course one could choose to implement a callable encoding
parameter such that some or all of these paths are detected
at open and declared illegal then.  One could prohibit the
encoding call after a seek, though I'm not sure there is any
point to that.

>> >> 2.  Alternatively, with a little more expertise, one can rewrap
>> >> the open binary stream in a TextIOWrapper to avoid a second
>> >> OS file open.
> > 
> > That's my favorite method because you have full control over the
> > stream. (I wrote tokenize.open). But yes, it does not work on
> > non-seekable streams (e.g. pipes).
> > 
>> >> This too seems to read the data twice and of course the
>> >> seek(0) prevents this method also from being usable with
>> >> pipes, sockets and other non-seekable sources.
> > 
> > Does it really matter? You usually need to read only a few bytes to get the encoding.

It certainly matters if input is from a pipe.  Quoting from
my other message:

  $ cat test.utf8 | python3 stdin.py reopen1
  got exception: [Errno 29] Illegal seek
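
The failure can be reproduced without the shell.  The sketch
below is not the stdin.py script quoted above; rewrap_with_seek
and try_on_pipe are hypothetical names, and the exact exception
text varies between Python versions, so only the OSError family
is caught.

```python
import io
import os


def rewrap_with_seek(binary, choose_encoding, sniff_size=1024):
    """Method 2: sniff some bytes, rewind, then rewrap as text."""
    head = binary.read(sniff_size)
    binary.seek(0)  # this is the step a pipe cannot do
    return io.TextIOWrapper(binary, encoding=choose_encoding(head))


def try_on_pipe():
    read_fd, write_fd = os.pipe()
    with os.fdopen(write_fd, "wb") as writer:
        writer.write(b"plain ascii\n")
    with os.fdopen(read_fd, "rb") as reader:
        try:
            rewrap_with_seek(reader, lambda head: "utf-8")
        except OSError as exc:
            return exc
    return None


# Works on a seekable source, fails on the pipe.
seekable_result = rewrap_with_seek(io.BytesIO(b"abc\n"), lambda head: "ascii").read()
pipe_error = try_on_pipe()
```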

The whole point of my suggestion was that you've already
read those few bytes -- but by the time you have access
to them, you've already been forced to choose an encoding.
My suggestion simply defers that encoding setting until
after you've had a chance to look at the bytes.

>> >> 9. In other non-read paths where encoding needs to be known,
>> >>  raise an error if it is still None.
> > 
> > Why not read data until the encoding is known instead?

That's how I do it now -- open the file in binary mode,
read it, buffer it, determine the encoding, and henceforth
decode the bytes data "by hand" to text.
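
In outline, the by-hand version looks something like the sketch
below.  It is a simplification, not my actual code: decode_by_hand
and its size parameters are made-up names, and it does no newline
handling or line splitting, which is the part that ends up
duplicating TextIOWrapper.

```python
import codecs
import io


def decode_by_hand(binary, choose_encoding, sniff_size=1024, chunk_size=8192):
    """Yield decoded text chunks from a binary stream in a single pass."""
    head = binary.read(sniff_size)
    # The incremental decoder copes with characters split across chunks.
    decoder = codecs.getincrementaldecoder(choose_encoding(head))(errors="strict")
    yield decoder.decode(head)
    while True:
        chunk = binary.read(chunk_size)
        if not chunk:
            break
        yield decoder.decode(chunk)
    # Flush; raises if the stream ended in the middle of a character.
    yield decoder.decode(b"", final=True)


sample = "h\xe9llo w\xf6rld\n"
pieces = decode_by_hand(io.BytesIO(sample.encode("utf-8")),
                        lambda head: "utf-8", sniff_size=2, chunk_size=3)
decoded = "".join(pieces)
```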

But that's an awful lot like what TextIOWrapper does, yes?
Why can't I use TextIOWrapper instead of rewriting it myself?
(Yes, I know I can reopen or rewrap the binary stream, but
as I said, that loses the one-pass processing, which breaks
pipes.)

>> >> I have modified a copy of the _pyio module as described and
>> >> the changes required seemed unsurprising and relatively
>> >> few, though I am sure there are subtleties and other
>> >> considerations I am missing.  Hence this post seeking
>> >> feedback...
> > 
> > Can you post the modified module somewhere so I can play with it?

I put a diff against the Python-3.2.3 _pyio.py file at:

  http://pastebin.com/kZHmcBdm

Much of the diff is just moving existing stuff around.
The note at the bottom says:

| It is in no way supposed to be a serious patch.
| 
| It was the minimal changes I could make in order to 
| see if my suggestion to allow a callable encoding parameter
| in TextIOWrapper was feasible, and allow some timing tests.
| 
| I am quite sure it will not pass Python's tests.
|
| It does, I hope, give some idea of the nature and scale of the
| code changes needed to implement a callable encoding parameter.