[Python-Dev] io.BufferedReader.peek() Behaviour in python3.1

Thu Jun 11 03:38:57 CEST 2009

Greetings,

As I'm sure you all know, there are currently two implementations of
the io module: one in Python and a much faster one in C.  As I recall,
the Python version was used in Python 3.0 and the C version is now used
by default in Python 3.1.x.  The behavior of the two differs in some
ways, especially regarding io.BufferedReader.peek().

I wrote an email to the authors of the new C code last Friday and also
sent a copy of it to the python list for comments.  Antoine Pitrou
suggested that I raise the question either here or as a bug report.  I
elected to write here because I am not sure it constitutes a bug.  In
my earlier email I stated that I was willing to submit patches if the
old behavior was wanted back and the code author was fine with the
changes but didn't want to implement them.  Antoine said this: "If
people need more sophisticated semantics, I'm open to changing peek()
to accommodate it."

Antoine: If I have done wrong by quoting you, you are free to chastise me.

So my basic question is: the behavior of io.BufferedReader.peek() has
changed; should that change stay as is, be reverted, or become
something different entirely?

Here are the two behaviors:

The Python version of io.BufferedReader.peek() behaves as follows:
If the buffer holds less than requested (up to the buffer size), read
the difference, or up to EOF, from the raw stream into the buffer.
Return the requested number of bytes from the start of the buffer.
This may advance the raw stream but not the local stream.  This version
can guarantee a peek of one chunk (4096 bytes here).

The C version behaves as follows:
If the buffer holds 0 bytes, fill it from the raw stream, or up to EOF.
Return what is in the buffer.  This may advance the raw stream but not
the local stream.  This version cannot guarantee a peek of more than 1
byte if reads of arbitrary length are being used at all and are not
tracked.
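
To make the difference concrete, here is a minimal sketch of the two
behaviors described above.  ToyBuffered and its method names are
hypothetical, made up for illustration; this is a toy in-memory model
that assumes a blocking raw stream, not the actual io code from either
implementation.

```python
import io


class ToyBuffered:
    """Hypothetical toy model of a buffered reader; not the real io code."""

    def __init__(self, raw, buffer_size=4096):
        self.raw = raw
        self.buffer_size = buffer_size
        self.buf = b""  # bytes read from raw but not yet consumed

    def read(self, n):
        # Consume n bytes, refilling one chunk at a time as needed.
        while len(self.buf) < n:
            chunk = self.raw.read(self.buffer_size)
            if not chunk:
                break  # EOF
            self.buf += chunk
        out, self.buf = self.buf[:n], self.buf[n:]
        return out

    def peek_py_style(self, n):
        # Python-version behavior as described: top the buffer up to
        # min(n, buffer_size) bytes (or EOF), then return from its start.
        want = min(n, self.buffer_size)
        if len(self.buf) < want:
            self.buf += self.raw.read(want - len(self.buf))
        return self.buf[:want]

    def peek_c_style(self):
        # C-version behavior as described: refill only when the buffer
        # is empty, then return whatever happens to be buffered.
        if not self.buf:
            self.buf = self.raw.read(self.buffer_size)
        return self.buf


data = bytes(range(256)) * 40  # 10240 bytes of sample input

a = ToyBuffered(io.BytesIO(data))
a.read(4000)  # leaves 96 bytes in the buffer
print(len(a.peek_py_style(4096)))  # 4096

b = ToyBuffered(io.BytesIO(data))
b.read(4000)  # leaves 96 bytes in the buffer
print(len(b.peek_c_style()))  # 96
```

After a 4000-byte read that leaves 96 bytes buffered, the first style
tops the buffer up and returns a full chunk, while the second returns
only the 96 leftover bytes.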

Neither case limits what is possible, though, in my opinion, one makes
it easier to accomplish certain things and is more efficient in those
cases.  Take the following two parser examples:

s = a stream wrapped in io.BufferedReader that, in most cases, does
    not support negative (backward) seeks.
f = an output file handle or similar.

Python version workflow:

import re

are = re.compile(b'(\r\n|\r|\n)')
while True:
    d = s.peek(4096)  # chunk size or so.
    found = are.search(d)
    if found:
        w = d[:found.start()]
        # whence=1: skip forward past the bytes just written
        s.seek(f.write(w), 1)
        p = s.peek(74)
        if p.startswith(multipart_boundary):
            s.seek(len(multipart_boundary), 1)
            # other code containing more possible splits
            # across boundaries
            continue
        w = d[found.start():found.end()]
        s.seek(f.write(w), 1)
        continue
    f.write(d)
    # more code
    continue

C version workflow:

old = b''
are = re.compile(b'(\r\n|\r|\n)')
while True:
    d = old if old != b'' else s.read1(4096)
    found = are.search(d)
    if found:
        w = d[:found.start()]
        f.write(w)
        w = d[found.start():]
        p = w if len(w) >= 74 else w + s.read(73)
        if p.startswith(multipart_boundary):
            # Other code containing more possible splits
            # across boundaries and joins to p.
            old = ???
            continue
        f.write(d[found.start():found.end()])
        old = p[found.end() - found.start():]  # rest after the newline
        continue
    old = b''
    f.write(d)
    #more code
    continue

These two examples are not real code, but they get the point across and
are based on code I put into a multipart parser.  The former was written
for Python 3.0.  I later tried running that parser on 3.1, after the new
io layer landed, and found it broken, so I rewrote it for the new
interface.  That rewrite is represented, somewhat, in the latter.  This
is only one example; others may vary, of course.  Peek seems to me to
have little use outside of parsers, so I used parsers as an example.

My opinion is that it would be better for the C implementation to have
a peek function similar to the one in the Python implementation, as
follows:

peek(n):
If n is less than 0, None, or not set, return the buffer contents
without advancing the stream position.  If the buffer is empty, read a
full chunk and return the buffer.  Otherwise return exactly n bytes, up
to the _chunk size_ (not the buffer contents), without advancing the
stream position.  If the buffer holds fewer than n bytes, buffer an
additional chunk from the "raw" stream beforehand.  If EOF is
encountered during any raw read, return as much as we can, up to n.
(Maybe I should write that in code form?)
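
In code form, the proposal above might look roughly like the following.
This is only a sketch under assumptions: PeekableReader, chunk_size and
the helper _fill_chunk are hypothetical names, the raw stream is assumed
to be blocking, and the real change would live in the C BufferedReader
rather than in a wrapper like this.

```python
import io


class PeekableReader:
    """Hypothetical sketch of the proposed peek(n); names are illustrative."""

    def __init__(self, raw, chunk_size=4096):
        self.raw = raw
        self.chunk_size = chunk_size
        self.buf = b""    # buffered but not yet consumed bytes
        self.eof = False  # set once the raw stream returns no data

    def _fill_chunk(self):
        # Buffer one more chunk from the raw stream; remember EOF.
        chunk = self.raw.read(self.chunk_size)
        if chunk:
            self.buf += chunk
        else:
            self.eof = True

    def peek(self, n=None):
        if n is None or n < 0:
            # No size given: return the buffer, filling it first if empty.
            if not self.buf and not self.eof:
                self._fill_chunk()
            return self.buf
        n = min(n, self.chunk_size)  # never promise more than one chunk
        while len(self.buf) < n and not self.eof:
            self._fill_chunk()
        return self.buf[:n]  # shorter than n only at EOF

    def read(self, n):
        # Consume n bytes, advancing the stream position.
        while len(self.buf) < n and not self.eof:
            self._fill_chunk()
        data, self.buf = self.buf[:n], self.buf[n:]
        return data


r = PeekableReader(io.BytesIO(b"abcdefghij"), chunk_size=4)
r.read(3)         # consumes b"abc"; b"d" stays buffered
print(r.peek(3))  # b'def'
print(r.peek())   # b'defgh'
```

Calling peek() with no argument gives the "return whatever is buffered"
behavior, while peek(n) guarantees n bytes (up to one chunk) unless EOF
is reached first.  Note that the buffer can temporarily hold close to
two chunks in this sketch.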

This allows us to obtain the behavior of the current C io
implementation easily and would give us the old python
implementation's behavior when n is given.

The basis for this is:

1. Code reduction and Simplicity

Looking at the examples, the code reduction should be obvious.  The
logic needed to maintain a bytestring of the various required lengths,
so that it can be checked, would no longer be necessary.  The need to
carry a bytestring over to the next iteration would go away as well.
Other pieces of data handling would also be simpler.

2. Speed

It would require less handling in the "slower" interpreter if we used
the buffer in the buffered reader.  Also, all the logic mentioned in 1
is either moved to the faster C code or done away with.  There is very
little need for peek outside of parsers, so the speed of read-through
and random reads would not have to be affected.

I have other reasons and arguments, but I want to know what everyone
else thinks.  This will most likely show me what I have missed or am
not seeing, if anything.  I have babbled enough.

Thanks so much for the consideration.

Frederick Reeve