Working with Huge Text Files

Fri Mar 18 19:57:00 EST 2005

Lorn Davies wrote:
> Hi there, I'm a Python newbie hoping for some direction in working with
> text files that range from 100MB to 1G in size. Basically certain rows,
> sorted by the first (primary) field maybe second (date), need to be
> copied and written to their own file, and some string manipulations
> need to happen as well. An example of the current format:
>
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
>  |
>  | followed by like a million rows similar to the above, with
>  | incrementing date and time, and then on to next primary field
>  |
> ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
>  |
>  | etc., there are usually 10-20 of the first field per file
>  | so there's a lot of repetition going on
>  |
>
> The export would ideally look like this where the first field would be
> written as the name of the file (XYZ.txt):
>
> 19930104, 93027, 2887, 7600, 40, 0, Z, N
>
> Pretty ambitious for a newbie? I really hope not. I've been looking at
> simpleParse, but it's a bit intense at first glance... not sure where
> to start, or even if I need to go that route. Any help from you guys in
> what direction to go or how to approach this would be hugely
> appreciated.
>
> Best regards,
> Lorn

You could use the csv module.

Here's the example from the manual, with your sample data in a file
named some.csv:

import csv

# newline="" lets the csv module handle line endings itself
with open("some.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N ']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N ']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.
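
As a rough sketch of that per-row processing, something like the
following might do. The name convert_row and the MONTHS table are
made up for illustration. I'm also assuming the price is converted
by simply dropping the decimal point (28.87 -> 2887), since that is
all the sample output shows; a value like 28.875 may need different
handling.

```python
# Month abbreviations as they appear in the sample data (e.g. "04JAN1993").
MONTHS = {
    "JAN": "01", "FEB": "02", "MAR": "03", "APR": "04",
    "MAY": "05", "JUN": "06", "JUL": "07", "AUG": "08",
    "SEP": "09", "OCT": "10", "NOV": "11", "DEC": "12",
}


def convert_row(row):
    """Return the output fields for one input row, leaving out row[0]."""
    # Strip stray whitespace such as the trailing space in 'N '.
    fields = [field.strip() for field in row]
    date, time, price = fields[1], fields[2], fields[3]

    # 04JAN1993 -> 19930104 (year, month number, day)
    date = date[5:9] + MONTHS[date[2:5].upper()] + date[0:2]
    # 9:30:27 -> 93027
    time = time.replace(":", "")
    # 28.87 -> 2887 (assumption: just drop the decimal point)
    price = price.replace(".", "")

    return [date, time, price] + fields[4:]


row = ['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
print(", ".join(convert_row(row)))
# prints: 19930104, 93027, 2887, 7600, 40, 0, Z, N
```

An explicit month table is used here rather than a locale-dependent
date parser, so the result does not depend on the machine's locale.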

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.
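
The loop described above could be sketched roughly as follows. The
function names and the out_dir parameter are made up for illustration,
and strip_fields is only a placeholder for whatever per-row processing
you settle on. It assumes the input really is sorted by the first
field; if a key shows up again later, its file is reopened and the
earlier contents are overwritten.

```python
import csv
import os


def strip_fields(row):
    """Placeholder processing: drop row[0] and strip whitespace."""
    return [field.strip() for field in row[1:]]


def split_by_first_field(src_path, out_dir, convert=strip_fields):
    """Write each run of rows sharing row[0] to out_dir/<row[0]>.txt."""
    current_key = None  # the row[0] value of the file being written
    out = None          # the file currently open for writing
    try:
        with open(src_path, newline="") as src:
            for row in csv.reader(src):
                if not row:
                    continue  # skip blank lines
                key = row[0].strip()
                if key != current_key:
                    # First field changed: close the old file, open a new one.
                    if out is not None:
                        out.close()
                    out = open(os.path.join(out_dir, key + ".txt"), "w")
                    current_key = key
                out.write(", ".join(convert(row)) + "\n")
    finally:
        if out is not None:
            out.close()
```

Because the reader hands over one row at a time, only the current line
is held in memory, so the size of the input file should not matter much.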

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.
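
If you can't guarantee that, one possible workaround is to keep one
output file open per first-field value and look it up for each row.
This is only a sketch under the assumption that the number of distinct
first fields stays small (you mention 10-20 per file), so the open
files fit comfortably within the operating system's limit. The name
split_unsorted and the out_dir parameter are made up for illustration,
and the per-row processing is reduced to stripping whitespace.

```python
import csv
import os


def split_unsorted(src_path, out_dir):
    """Write rows to out_dir/<row[0]>.txt without assuming sorted input."""
    outputs = {}  # maps each row[0] value to its open output file
    try:
        with open(src_path, newline="") as src:
            for row in csv.reader(src):
                if not row:
                    continue  # skip blank lines
                key = row[0].strip()
                if key not in outputs:
                    # First time this key is seen: open its file once.
                    outputs[key] = open(os.path.join(out_dir, key + ".txt"), "w")
                fields = [field.strip() for field in row[1:]]
                outputs[key].write(", ".join(fields) + "\n")
    finally:
        for out in outputs.values():
            out.close()
```

Rows keep their original relative order within each output file, so
any sorting by date and time in the input is preserved.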