[Python-3000] iostack, second revision

Wed Sep 13 18:21:52 CEST 2006

"Anders J. Munch" <ajm at flonidan.dk> wrote:
> Josiah Carlson wrote:
> > "Anders J. Munch" <ajm at flonidan.dk> wrote:
> > > I don't expect file methods and system calls to map one to one, but
> > > you're right, the first time the length is needed, that's an extra
> > > system call.
> > 
> > Every time the length is needed, a system call is required (you can
> > have multiple writers of the same file)...
> 
> Point taken.  It's very rarely a good idea to do so, but the
> possibility of multiple writers shouldn't be ignored.  Still there is
> no real performance issue.  If anything, replacing
> f.seek(0,2);f.tell() with f.length in various places might save a few
> system calls.

Any sane person uses os.stat(f.name) or os.fstat(f.fileno()), unless
they want to seek to the end of the file for later writing, or for
reading data that has yet to be written.  Interestingly, both of these
cases basically read and write the same file at the same time (perhaps
even in the same process), something about which you yourself said, "In
all my programming days I don't believe I written to and read from the
same file handle even once. Use cases exist, like if you're implementing
a DBMS..."
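
To make the comparison concrete, here is a minimal sketch of the two
ways of getting a file's length, written for a current Python 3.  The
helper names length_by_seek and length_by_fstat and the temporary file
are illustrative assumptions, not part of any proposal in this thread.

```python
import os
import tempfile

def length_by_seek(f):
    """Find the length by seeking to the end; this moves the file position."""
    f.seek(0, os.SEEK_END)
    return f.tell()

def length_by_fstat(f):
    """Ask the OS for the size; the file position is left where it was."""
    f.flush()  # push any buffered writes out so the OS sees them
    return os.fstat(f.fileno()).st_size

with tempfile.TemporaryFile() as f:
    f.write(b"hello world")
    f.seek(0)
    size_stat = length_by_fstat(f)
    pos_after_stat = f.tell()   # still at the start
    size_seek = length_by_seek(f)
    pos_after_seek = f.tell()   # now at the end of the file
```

Either way, the size is only a snapshot: with another writer on the
same file it can be stale as soon as the call returns.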

> > Flushing during seek is important.  By not flushing during seek in
> > your FileBytes object, you are unnecessarily delaying writes, which
> > could cause file corruption.
> 
> That's what the flush method is for.  The real reason seek implies
> flush is to save the library author the bother of getting the
> interactions between input and output buffering right.
> Anyway, FileBytes has no seek and no concept of current file position,
> so I really don't know what you're talking about :)

I was talking about your earlier statement, which I quoted in my earlier
reply to you:

> My micro-optimisation circuitry blew a fuse when I discovered that
> seek always implies flush.  You won't get good performance out of code
> that does a lot of seeks, whatever you do.  Use my upcoming FileBytes
> class :)

And with the context of a previous message from you:

> FileBytes would support the sequence protocol, mimicking bytes objects.
> It would support random-access read and write using __getitem__ and
> __setitem__, allowing slice assignment for slices of equal size.  And
> there would be append() to extend the file, and partial __delitem__
> support for truncating.

While it doesn't have seek or tell methods, the underlying
implementation needs to use seek and tell (or a memory-mapped file, mmap).
You were also talking about buffering writes to reduce the overhead of
the underlying seeks and tells, as part of the apparent "optimizations"
you wanted to make.  Here is a data integrity optimization you can make
for me: flush when accessing the file non-sequentially.  Any other
behavior could corrupt the data of users who have been relying on "seek
implies flush".
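
A rough sketch of that rule, under stated assumptions: the class name
FlushOnJumpWriter, its write_at method and the flushes counter are
hypothetical and are not the FileBytes design discussed in this thread.
Sequential writes are buffered; any write that does not continue where
the buffer ends forces the buffer out to the file first.

```python
import os
import tempfile

class FlushOnJumpWriter:
    """Hypothetical writer: buffers sequential writes, flushes on a jump."""

    def __init__(self, fd):
        self._fd = fd
        self._buf = bytearray()
        self._buf_start = 0   # file offset at which the buffer begins
        self.flushes = 0      # counts real flushes, for illustration only

    def write_at(self, offset, data):
        # A write is sequential if it starts exactly where the buffer ends.
        if offset != self._buf_start + len(self._buf):
            self.flush()
            self._buf_start = offset
        self._buf += data

    def flush(self):
        if not self._buf:
            return
        os.lseek(self._fd, self._buf_start, os.SEEK_SET)
        pending = bytes(self._buf)
        while pending:        # os.write may write fewer bytes than asked
            written = os.write(self._fd, pending)
            pending = pending[written:]
        self._buf_start += len(self._buf)
        self._buf.clear()
        self.flushes += 1

fd, path = tempfile.mkstemp()
try:
    w = FlushOnJumpWriter(fd)
    w.write_at(0, b"abc")
    w.write_at(3, b"def")                   # sequential: stays buffered
    buffered_size = os.fstat(fd).st_size    # nothing on disk yet
    w.write_at(0, b"X")                     # jump: "abcdef" is flushed first
    size_after_jump = os.fstat(fd).st_size
    w.flush()
    os.lseek(fd, 0, os.SEEK_SET)
    contents = os.read(fd, 16)
finally:
    os.close(fd)
    os.remove(path)
```

The point of the sketch is only the ordering guarantee: data written
before a jump reaches the file before anything written after it.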

I would also mention that your FileBytes class is essentially a fake
memory-mapped file.  I have implemented an equivalent class myself (for
low-memory testing purposes in a DBMS-like situation), and I find
using an mmap to be far faster and generally more reliable (and
usable with buffer()) than my FileBytes equivalent.  Never mind that the
vast majority of users don't want a sequence interface to a file; they
want a stream interface.  That is why you don't see many FileBytes-like
objects out in the wild, or really anyone suggesting that such a wrapper
object be in the standard library.
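
For comparison, here is a small sketch of the sequence-style access that
mmap already provides in a current Python 3: indexing, len() and
equal-size slice assignment.  The temporary file and its contents are
made up for the example, and mmap cannot grow the file through slice
assignment, so this is not a full stand-in for the proposed append().

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()                      # the file must be non-empty before mapping

    # Length 0 maps the whole file, read/write by default.
    with mmap.mmap(f.fileno(), 0) as m:
        length = len(m)            # sequence protocol: len()
        first = m[0:4]             # slicing returns bytes
        m[2:5] = b"abc"            # slice assignment must keep the same size
        m.flush()                  # push the modified pages to the file

    f.seek(0)
    data = f.read()                # the change is visible through the file
```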

With that said, I'm not sure your FileBytes object is really necessary
or desired for the future io library.  If people want that kind of
interface, they can use mmap (and push for the various mmap bugs/feature
requests to be fixed); otherwise they should be using readable,
writable, or read/write streams, something that Tomer has been working
towards.

 - Josiah