Writev

Mon Dec 20 08:55:24 EST 2004

On Mon, 2004-12-20 at 02:18, Steven Bethard wrote:
> Adam DePrince wrote:
> > file.writelines( seq ) and map( file.write, seq ) are the same; the
> > former is syntactic sugar for the latter.
> 
> Well, that's not exactly true.  For one thing, map(file.write, seq) 
> returns a list of Nones, while file.writelines returns only the single 
> None that Python functions with no return statement do.  More 
> substantially, file.writelines (as far as I can tell from the C code) 
> doesn't make any call to file.write.
> 
> I looked at fileobject.c and it looks like file.writelines makes a call 
> to 'fwrite' for each item in the iterable given.  Your code, if I read 
> it right, makes a call to 'writev' for each item in the iterable.

No, my code makes one call to writev for every n items of the iterable,
where n is usually 1024.  Writev is the POSIX equivalent of writelines.
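
A minimal sketch of that batching, for illustration only.  It assumes
os.writev from the present-day Python standard library (Unix only), not
the patch under discussion, and writev_batched is a hypothetical helper
name.  The default of 1024 simply mirrors the figure above.

```python
import itertools
import os
import tempfile

def writev_batched(fd, seq, n=1024):
    """Write byte strings from seq to fd, one os.writev call per n items.

    Returns the number of writev calls made.
    """
    it = iter(seq)
    calls = 0
    while True:
        batch = list(itertools.islice(it, n))
        if not batch:
            return calls
        total = sum(len(b) for b in batch)
        written = os.writev(fd, batch)
        calls += 1
        if written < total:
            # Short write: push the leftover bytes out with plain write.
            rest = b"".join(batch)[written:]
            while rest:
                rest = rest[os.write(fd, rest):]

# 2500 items with n=1024 means batches of 1024, 1024 and 452.
items = [b"%d\n" % i for i in range(2500)]
with tempfile.TemporaryFile() as f:
    calls = writev_batched(f.fileno(), items)
    f.seek(0)
    data = f.read()
```

Because the helper pulls only n items at a time from the iterator, it
never holds more than one batch in memory.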

> 
> I looked at the fwrite() and writev() docs and read your comments, but I 
> still couldn't quite figure out what makes 'writev' more efficient than 
> 'fwrite' for the same size buffer...  Mind trying to explain it to me again?

Okay.  Imagine that you had a list of strings that you want to write to
a file.  

You normally have two choices:

1. Copy all of the strings to a buffer and write that buffer.
2. Call write a lot

Remember that write and fwrite require a single buffer of data to
write.  When you have a sequence, your items are not lined up one right
after the other in memory, so Python or libc has to memcpy each
element's contents to prepare it for a write.  Every sequence element
will result in either a memcpy or a write.  fwrite is special: it
buffers for you, converting scenario #2 into #1.

Writev gives you a third option.  Rather than moving the data to one
place in preparation for the write, you can give the operating system a
list of where all of the bits are and it will get them for you.
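
The three options can be put side by side in a rough sketch.  It assumes
the os.write and os.writev calls from the present-day Python standard
library (Unix only); the function names are made up for the example.

```python
import os
import tempfile

chunks = [b"alpha\n", b"beta\n", b"gamma\n"]

def copy_then_write(fd, chunks):
    # Option 1: copy everything into one buffer, then one write call.
    return os.write(fd, b"".join(chunks))

def write_each(fd, chunks):
    # Option 2: one write call per element.
    return sum(os.write(fd, c) for c in chunks)

def gather_write(fd, chunks):
    # Option 3: one writev call; the OS gathers from the separate buffers.
    return os.writev(fd, chunks)

# Run each strategy against its own temporary file and keep the contents.
results = {}
for strategy in (copy_then_write, write_each, gather_write):
    with tempfile.TemporaryFile() as f:
        strategy(f.fileno(), chunks)
        f.seek(0)
        results[strategy.__name__] = f.read()
```

All three leave the same bytes in the file.  What differs is who does
the copying and how many system calls are made.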

> 
> > There is one more time that writev would be beneficial ... perhaps you
> > want to write a never ending sequence with a minimum of overhead? 
> > 
> > def camera():
> > 	while 1:
> > 		yield extract_entropy( grab_frame() )
> > 
> > open( "/tmp/entropy_daemon_pipe", "w+" ).writev( camera(), 5 ) 
> 
> I tried running:
> 
> py> def gen():
> ...     i = 1
> ...     while True:
> ...         yield '%i\n' % i
> ...         i *= 10
> ...
> py> open('integers.txt', 'w').writelines(gen())
> 
> and, while (of course) it runs forever, I don't appear to get any memory 
> problems.  I seem to be able to write fairly large integers too:

Wrong example.  Your example doesn't have memory problems, it has
efficiency problems.  Memory problems occur with:

''.join( list( gen()))

> 
> $ tail -n 1 integers.txt | wc
>        1       1    5001
> 
> How big do the items in the iterable need to be for writev to be necessary?
> 
> Steve
> 
> P.S.  I certainly don't have anything against including your patch (not 
> that my opinion counts for anything) ;) but if it improves a common 
> file.writelines usage, I'd like to see it used there too when possible.

I think you are looking at writev the wrong way.  Notice that it is part
of posixmodule.c.  You are trying to see why it is better to use from a
Python perspective.  I'm including it for the benefit of the underlying
operating system, not the Python programmer.  

writelines applies to a general Python file object.  writev applies only
to C file descriptors.  Writev can't replace writelines; after all, it
makes no sense for cStringIO or gzip files, because these are not valid C
file descriptors.  Generally, writelines works fine.
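
To illustrate the distinction, here is a hedged sketch using present-day
Python, with io.BytesIO standing in for cStringIO and os.writev (Unix
only) standing in for the patch.  write_lines is a hypothetical helper,
not part of any library.

```python
import io
import os
import tempfile

def write_lines(f, lines):
    """Use os.writev when f has a real descriptor, else f.writelines.

    Sketch only: short writes are not retried, and wrappers such as
    gzip.GzipFile report the descriptor of the file they wrap, so a
    fileno() check alone is not a safe test for them.
    """
    lines = list(lines)
    try:
        fd = f.fileno()
    except (AttributeError, io.UnsupportedOperation):
        # In-memory object: there is no descriptor to hand to the OS.
        f.writelines(lines)
        return "writelines"
    f.flush()  # keep the object's buffered data ahead of the raw write
    os.writev(fd, lines)
    return "writev"

lines = [b"one\n", b"two\n"]

mem = io.BytesIO()
mem_route = write_lines(mem, lines)

with tempfile.TemporaryFile() as real:
    real_route = write_lines(real, lines)
    real.seek(0)
    real_data = real.read()
```

The in-memory object falls back to writelines, while the temporary file
goes through writev.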

Now why would you want to use writev?  Optimization on the C side.

Understand that when you do file I/O you have to either:

a) Copy all of the strings to a new memory location
b) Call write over and over again

Sometimes, when you do b), userspace libraries (fwrite) will "optimize"
by buffering and doing a) for you.  But you cannot escape the fact that
so long as your write parameter takes a single string for each
invocation, the underlying libraries are forced to choose between the
two options above.

Writev is the vector version of write.  Whereas write accepts a single
pointer and length parameter, writev accepts a *list* of pointers and
size parameters.  It represents a strategy c) -- just hand the list to
the operating system.

I want to include it because POSIX has a single OS call that
conceptually maps pretty closely to writelines.  writev can be faster
because you don't have to do memory copies to buffer data in one place
for it -- the OS will do that, and can sometimes delegate that chore to
the underlying network or SCSI card.  

If you are still scratching your head, just think of writev as an
optimization of writelines, for C file descriptors only, that offloads
the memory copy cost of buffering to the OS in the hope that it can pass
the buck to the hardware.  (IIRC, BSD handles this correctly: it has
zero-copy network code.  Think how cool it is that your network card is
going to traverse your list for you with a little help from the writev
function.)

Adam DePrince