[Python-ideas] Prefetching on buffered IO files

Guido van Rossum guido at python.org
Tue Sep 28 18:44:38 CEST 2010


On Tue, Sep 28, 2010 at 7:32 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
[Guido]
>> I wonder if it wouldn't be better to add an extra buffer to GzipFile so
>> small seek() and read() calls can be made more efficient?
>
> The problem is that, since the buffer of the unpickler and the buffer of
> the GzipFile are not aware of each other, the unpickler could easily ask
> to seek() backwards past the current GzipFile buffer, and fall back on
> the slow algorithm.

But AFAICT unpickle doesn't use seek()?

[...]
> But, if the stream had prefetch(), the unpickling would be simplified: I
> would only have to call prefetch() once when refilling the buffer,
> rather than two read()'s followed by a peek().
>
> (I could try to coalesce the two reads, but it would complicate the code
> a bit more...)
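
As a rough illustration only: prefetch() is not an existing io method, and
the class below is hypothetical. One way to read the proposal on top of
today's buffered API is "make sure at least n bytes are buffered and return
them without consuming them", which peek() already approximates:

    import io

    class PrefetchingReader(io.BufferedReader):
        def prefetch(self, n):
            # peek() refills the internal buffer without advancing the
            # file position; it may return fewer (or more) bytes than n.
            return self.peek(n)

    # Usage sketch (names illustrative, not the _pickle.c internals):
    #   f = PrefetchingReader(io.FileIO("data.pkl"))
    #   head = f.prefetch(8192)   # look ahead without consuming
    #   ... parse opcodes from head, then f.read(consumed) to advance ...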

Where exactly would the peek be used? (I must be confused because I
can't find either peek or seek in _pickle.c.)

It still seems to me that the "right" way to solve this would be to
insert a transparent extra buffer somewhere, probably in the GzipFile
code, and work on reducing the call overhead.
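
A minimal sketch of that "transparent extra buffer" idea: wrap the GzipFile
in an io.BufferedReader, so small read()/peek() calls and short relative
seeks are served from an in-memory buffer instead of hitting the
decompressor each time. The file name and buffer size are illustrative:

    import gzip
    import io

    def open_buffered_gzip(path, buffer_size=64 * 1024):
        # The BufferedReader absorbs many small read() calls;
        # GzipFile only sees a few large ones.
        return io.BufferedReader(gzip.GzipFile(path, "rb"),
                                 buffer_size=buffer_size)

    # f = open_buffered_gzip("data.pkl.gz")
    # obj = pickle.load(f)      # many small reads now hit the buffer

The catch Antoine mentions remains, though: a backward seek() that falls
outside the BufferedReader's buffer still reaches GzipFile, which rewinds
and re-decompresses from the start.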

>> I want to push back on this more, primarily because a new primitive
>> I/O operation has high costs: it can never be removed, it has to be
>> added to every stream implementation, developers need to learn to use
>> the new operation, and so on.
>
> I agree with this (except that most developers don't really need to
> learn to use it: common uses of readable files are content with read()
> and readline(), and need neither peek() nor prefetch()). I don't intend
> to push this for 3.2; I'm throwing the idea around with a hypothetical
> 3.3 landing if it seems useful.

So far it seems more awkward than useful.

>> Also, if you can believe the multi-core crowd, a very different
>> possible future development might be to run the gunzip algorithm and
>> the unpickle algorithm in parallel, on separate cores. Truly such a
>> solution would require totally *different* new I/O primitives, which
>> might have a higher chance of being reusable outside the context of
>> pickle.
>
> Well, it's a bit of a pie-in-the-sky perspective :)
> Furthermore, such a solution won't improve CPU efficiency, so if your
> workload is already able to utilize all CPU cores (which it can easily
> do if you are in a VM, or have multiple busy daemons), it doesn't gain
> you anything.

Agreed it's pie in the sky... Though the interface between the two
CPUs might actually be designed to be faster than the current buffered
I/O. I have (mostly :-) fond memories of async I/O on a mainframe I
used in the '70s, which worked this way.
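
For what it's worth, something in that direction can be sketched today with
a plain thread and an OS pipe. The function name and chunk size below are
made up, error handling is omitted, and it assumes the file holds a single
pickle; in CPython the two stages only truly overlap to the extent that
zlib releases the GIL during decompression:

    import gzip
    import os
    import pickle
    import threading

    def unpickle_gzip_pipelined(path, chunk_size=64 * 1024):
        rfd, wfd = os.pipe()

        def pump():
            # Decompress in the background and push plain pickle bytes
            # into the pipe; closing the write end signals EOF.
            with gzip.GzipFile(path, "rb") as gz, open(wfd, "wb") as sink:
                while True:
                    chunk = gz.read(chunk_size)
                    if not chunk:
                        break
                    sink.write(chunk)

        producer = threading.Thread(target=pump)
        producer.start()
        with open(rfd, "rb") as source:
            obj = pickle.load(source)   # reads from the pipe as data arrives
        producer.join()
        return obj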

-- 
--Guido van Rossum (python.org/~guido)


