Working with Huge Text Files

Fri Mar 18 21:15:24 EST 2005

mensanator at aol.com wrote:
> Lorn Davies wrote:
> > Hi there, I'm a Python newbie hoping for some direction in working
> > with text files that range from 100MB to 1G in size. Basically
> > certain rows, sorted by the first (primary) field maybe second
> > (date), need to be copied and written to their own file, and some
> > string manipulations need to happen as well. An example of the
> > current format:
> >
> > XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> > XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> >  |
> >  | followed by like a million rows similar to the above, with
> >  | incrementing date and time, and then on to next primary field
> >  |
> > ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
> >  |
> >  | etc., there are usually 10-20 of the first field per file
> >  | so there's a lot of repetition going on
> >  |
> >
> > The export would ideally look like this where the first field would
> > be written as the name of the file (XYZ.txt):
> >
> > 19930104, 93027, 2887, 7600, 40, 0, Z, N
> >
> > Pretty ambitious for a newbie? I really hope not. I've been looking
> > at simpleParse, but it's a bit intense at first glance... not sure
> > where to start, or even if I need to go that route. Any help from
> > you guys in what direction to go or how to approach this would be
> > hugely appreciated.
> >
> > Best regards,
> > Lorn
>
> You could use the csv module.
>
> Here's the example from the manual with your sample data in a file
> named simple.csv:

Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.
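
A minimal sketch of the "pass the name as an argument" option, written
for current Python 3. The helper names input_name and rows are made up
for illustration, and "some.csv" is only the fallback default:

```python
import csv
import sys

def input_name(argv, default="some.csv"):
    """Return the file name given on the command line, else the default."""
    return argv[1] if len(argv) > 1 else default

def rows(path):
    """Yield each line of the CSV file as a list of strings, one at a time."""
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="") as f:
        yield from csv.reader(f)
```

A script would then loop with "for row in rows(input_name(sys.argv)):",
which reads one row at a time rather than loading the whole file.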

>
> import csv
>
> # newline="" lets the csv module handle line endings itself
> with open("some.csv", newline="") as f:
>     for row in csv.reader(f):
>         print(row)
>
> """
> ['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N']
> ['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N']
> ['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N']
> """
>
> The csv module will bring each line in as a list of strings.
> Of course, you want to process each line before printing it.
> And you don't just want to print it, you want to write it to a file.
>
> So after reading the first line, open a file for writing with the
> first field (row[0]) as the file name. Then you want to process
> fields row[1], row[2] and row[3] to get them in the right format
> and then write all the row fields except row[0] to the file that's
> open for writing.
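
The field processing could look roughly like the sketch below (Python 3).
The names MONTHS, convert_row and format_row are made up for
illustration. It assumes the day is always two digits and that the price
conversion just drops the decimal point, which is only a guess from the
single example (28.87 -> 2887); what 28.875 should become is not stated.

```python
# Month abbreviations as they appear in the sample data.
MONTHS = {"JAN": "01", "FEB": "02", "MAR": "03", "APR": "04",
          "MAY": "05", "JUN": "06", "JUL": "07", "AUG": "08",
          "SEP": "09", "OCT": "10", "NOV": "11", "DEC": "12"}

def convert_row(row):
    """Return the output fields for one input row, leaving out row[0]."""
    date, time, price = row[1], row[2], row[3]
    date = date[5:] + MONTHS[date[2:5]] + date[:2]  # 04JAN1993 -> 19930104
    time = time.replace(":", "")                    # 9:30:27 -> 93027
    price = price.replace(".", "")                  # assumed rule: 28.87 -> 2887
    return [date, time, price] + row[4:]

def format_row(row):
    """Join the converted fields the way the desired export shows them."""
    return ", ".join(convert_row(row))
```

format_row(row) returns the text to write to the output file, followed
by a newline.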
>
> On every subsequent line you must check to see if row[0] has changed,
> so you'll have to store row[0] in a variable. If it's changed, close
> the file you've been writing to and open a new file with the new
> row[0]. Then continue processing lines as before.
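
Put together, the "remember row[0] and switch files when it changes"
loop might be sketched like this in Python 3. The function name
split_sorted and the out_dir parameter are made up for illustration, and
the field conversion is left out to keep it short, so rows are written
unchanged apart from dropping row[0]:

```python
import csv
import os

def split_sorted(in_path, out_dir="."):
    """Write each run of rows sharing row[0] to <out_dir>/<row[0]>.txt."""
    current = None  # the row[0] value of the file being written
    out = None      # the output file currently open, if any
    try:
        with open(in_path, newline="") as src:
            for row in csv.reader(src):
                if not row:
                    continue  # skip blank lines
                if row[0] != current:
                    # First field changed: close the old file, open a new one.
                    if out is not None:
                        out.close()
                    current = row[0]
                    out = open(os.path.join(out_dir, current + ".txt"), "w")
                out.write(", ".join(row[1:]) + "\n")
    finally:
        if out is not None:
            out.close()
```

Because each file is opened with mode "w", a first field that shows up
again later in the input would overwrite its earlier output.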
>
> It will only be this simple if you can guarantee that the original
> file is actually sorted by the first field.
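
If that guarantee can't be made, one possible alternative (not the
approach described above) is to keep one output file open per first
field and look it up in a dictionary. With only 10-20 distinct first
fields per file, that is a small number of open files. A sketch in
Python 3, where the name split_unsorted and the out_dir parameter are
made up for illustration and field conversion is again left out:

```python
import csv
import os

def split_unsorted(in_path, out_dir="."):
    """Write rows to <out_dir>/<row[0]>.txt regardless of input order."""
    outputs = {}  # maps row[0] -> its open output file
    try:
        with open(in_path, newline="") as src:
            for row in csv.reader(src):
                if not row:
                    continue  # skip blank lines
                key = row[0]
                if key not in outputs:
                    # First time this key is seen: open its file once.
                    outputs[key] = open(os.path.join(out_dir, key + ".txt"), "w")
                outputs[key].write(", ".join(row[1:]) + "\n")
    finally:
        for out in outputs.values():
            out.close()
```

Rows for each first field end up in their file in the order they appear
in the input, so this does not sort by date or time.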