Working with Huge Text Files

mensanator at aol.com mensanator at aol.com
Fri Mar 18 19:57:00 EST 2005


Lorn Davies wrote:
> Hi there, I'm a Python newbie hoping for some direction in working
with
> text files that range from 100MB to 1G in size. Basically certain
rows,
> sorted by the first (primary) field maybe second (date), need to be
> copied and written to their own file, and some string manipulations
> need to happen as well. An example of the current format:
>
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
>  |
>  | followed by like a million rows similar to the above, with
>  | incrementing date and time, and then on to next primary field
>  |
> ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
>  |
>  | etc., there are usually 10-20 of the first field per file
>  | so there's a lot of repetition going on
>  |
>
> The export would ideally look like this where the first field would
be
> written as the name of the file (XYZ.txt):
>
> 19930104, 93027, 2887, 7600, 40, 0, Z, N
>
> Pretty ambitious for a newbie? I really hope not. I've been looking
at
> simpleParse, but it's a bit intense at first glance... not sure where
> to start, or even if I need to go that route. Any help from you guys
in
> what direction to go or how to approach this would be hugely
> appreciated.
>
> Best regards,
> Lorn

You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N ']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N ']
"""

The csv module while bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.




More information about the Python-list mailing list