CSV writer question

Mon Oct 24 17:09:16 EDT 2011

Jason Swails wrote:

> Hello,
> 
> I have a question about a csv.writer instance.  I have a utility that I
> want to write a full CSV file from lots of data, but due to performance
> (and memory) considerations, there's no way I can write the data
> sequentially. Therefore, I write the data in chunks to temporary files,
> then combine them all at the end.  For convenience, I declare each writer
> instance via a statement like
> 
> my_csv = csv.writer(open('temp.1.csv', 'wb'))
> 
> so the open file object isn't bound to any explicit reference, and I don't
> know how to reference it inside the writer class (the documentation doesn't
> say, unless I've missed the obvious).  Thus, the only way I can think of
> to make sure that all of the data is written before I start copying these
> files sequentially into the final file is to unbuffer them so the above
> command is changed to
> 
> my_csv = csv.writer(open('temp.1.csv', 'wb', 0))
> 
> unless, of course, I add an explicit reference to track the open file object
> and manually close or flush it (but I'd like to avoid it if possible).  My
> question is 2-fold.  Is there a way to do that directly via the CSV API,
> or is the approach I'm taking the only way without binding the open file
> object to another reference?  Secondly, if these files are potentially very
> large
> (anywhere from ~1KB to 20 GB depending on the amount of data present),
> what kind of performance hit will I be looking at by disabling buffering
> on these types of files?
> 
> Tips, answers, comments, and/or suggestions are all welcome.
> 
> Thanks a lot!
> Jason
> 
> As an afterthought, I suppose I could always subclass the csv.writer class
> and add the reference I want to that, which I may do if there's no other
> convenient solution.

A contextmanager might help:

import csv
from contextlib import contextmanager

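@contextmanager
def filewriter(filename):
    # Binary mode ("wb") is what the csv module expects on Python 2.
    with open(filename, "wb") as outstream:
        # The file is flushed and closed when the caller's with block exits.
        yield csv.writer(outstream)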

if __name__ == "__main__":
    with filewriter("tmp.csv") as writer:
        writer.writerows([
                ["alpha", "beta"],
                ["gamma", "delta"]])
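
Applied to the chunked workflow from the question, the same idea might look
like the sketch below. It assumes Python 3, where the csv module wants a
text-mode file opened with newline="" rather than "wb". The helper names
write_chunks and combine, and the temp.N.csv naming, are made up for
illustration. Because each temporary file is closed (and therefore flushed)
when its with block exits, buffering can stay enabled.

```python
import csv
import shutil
from contextlib import contextmanager

@contextmanager
def filewriter(filename):
    # Python 3 variant: text mode with newline="" so the csv module
    # controls the line endings itself.
    with open(filename, "w", newline="") as outstream:
        yield csv.writer(outstream)

def write_chunks(chunks, prefix="temp"):
    """Write each chunk of rows to its own file (hypothetical helper).

    Returns the list of file names in the order they were written.
    """
    names = []
    for i, rows in enumerate(chunks, 1):
        name = "%s.%d.csv" % (prefix, i)
        with filewriter(name) as writer:
            writer.writerows(rows)
        # Leaving the with block closed the file, so its data is on disk.
        names.append(name)
    return names

def combine(names, final_name):
    """Concatenate the chunk files into one file (hypothetical helper)."""
    with open(final_name, "wb") as out:
        for name in names:
            with open(name, "rb") as part:
                # Copies in fixed-size blocks, so large files are never
                # loaded into memory in one piece.
                shutil.copyfileobj(part, out)
```

Usage would be along the lines of
combine(write_chunks(list_of_row_lists), "final.csv"); no reference to the
underlying file objects is needed outside filewriter.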