Processing large CSV files - how to maximise throughput?

Walter Hurry walterhurry at lavabit.com
Sat Oct 26 04:53:03 EDT 2013


On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:

> Hi,
> 
> We have a directory of large CSV files that we'd like to process in
> Python.
> 
> We process each input CSV, then generate a corresponding output CSV
> file.
> 
> input CSV -> munging text, lookups etc. -> output CSV
> 
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming will also be reasonably fast.)
> 
> For the reading, I'd do something like this:
> 
>     from csv import DictReader, DictWriter
> 
>     with open('input.csv', 'r') as infile, \
>             open('output.csv', 'w') as outfile:
>         # DictWriter needs the output column names up front
>         csv_writer = DictWriter(outfile, fieldnames=output_fields)
>         csv_writer.writeheader()
>         for line in DictReader(infile):
>             # Do some processing for that line...
>             out_row = process_line(line)
>             # Write the processed row to the output file
>             csv_writer.writerow(out_row)
> 
> So for the reading, it'll iterate over the lines one by one, rather than
> reading the whole file into memory, which is good.
> 
> For the writing - my understanding is that it writes a line to the file
> object on each loop iteration; however, this only gets flushed to disk
> every now and then, based on the system's default buffer size, right?
> 
> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
> 
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
> 
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
> 
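On the multi-core question: since each file is independent, the usual
approach is to hand the list of files to something like
multiprocessing.Pool. A rough sketch only - convert() and the output
naming below are just placeholders for your own per-file logic:

    import glob
    from multiprocessing import Pool

    def convert(in_path):
        # Placeholder: read in_path, munge each row, write the output CSV.
        out_path = in_path.replace('.csv', '_out.csv')
        # ... your existing per-line processing goes here ...
        return out_path

    if __name__ == '__main__':
        pool = Pool()                     # defaults to one worker per core
        pool.map(convert, glob.glob('*.csv'))
        pool.close()
        pool.join()

That only buys you anything if the per-file processing, rather than the
disk, is the bottleneck.
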
I'm guessing that the idea is to load the output CSV into a database.

If that's the case, why not load the input CSV into some kind of staging 
table in the database first, and do the processing there?
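
For instance, with the sqlite3 module from the standard library it might
look something like this (just a sketch - the table and column names are
invented, and your real schema and lookups will differ):

    import csv
    import sqlite3

    conn = sqlite3.connect('staging.db')    # hypothetical database file

    # Hypothetical staging table; use whatever columns your input has.
    conn.execute('CREATE TABLE IF NOT EXISTS staging (col_a TEXT, col_b TEXT)')

    with open('input.csv') as infile:
        reader = csv.DictReader(infile)
        conn.executemany(
            'INSERT INTO staging (col_a, col_b) VALUES (:col_a, :col_b)',
            reader)
    conn.commit()

    # Do the munging/lookups in SQL, or pull rows back out as needed:
    for row in conn.execute('SELECT col_a, col_b FROM staging'):
        pass    # process each row

That keeps the heavy lifting in the database rather than in Python.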



