Working with Huge Text Files

cwazir at yahoo.com
Fri Mar 18 21:27:52 EST 2005


Hi,

Lorn Davies wrote:

> ..... working with text files that range from 100MB to 1G in size.
> .....
> XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
> XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
> .....

I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. A parsing
library is convenient when the data format is complex, but otherwise
it's overkill. In this particular case even the csv module doesn't buy
you much. I'd just use split.
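For records like the samples quoted above, a single split on the comma already recovers the fields (a minimal sketch; the sample line is taken from the quoted message):

```python
# Split one sample record on commas - no parsing library needed
line = 'XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N'
fields = line.strip().split(',')
# fields[0] is the ticker, fields[1] the date, fields[2] the time,
# fields[3] the price; the remaining fields pass through unchanged
```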

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    filename = fields[0]    # the ticker symbol names the output file
    if filename not in output_files:
        output_files[filename] = open(filename + '.txt', 'w')
    # DDMONYYYY -> YYYYMMDD
    fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
    fields[2] = fields[2].replace(':', '')    # strip colons from the time
    fields[3] = fields[3].replace('.', '')    # drop the decimal point from the price
    print >>output_files[filename], ', '.join(fields[1:])
for filename in output_files:
    output_files[filename].close()
data_file.close()
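The three per-field rewrites can be sanity-checked in isolation; this sketch applies them to the first sample record, using the same months table the loop uses:

```python
# The same month-name table the loop above relies on
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}

date, time, price = '04JAN1993', '9:30:27', '28.87'

# DDMONYYYY -> YYYYMMDD: year, then numeric month, then day
iso_date = date[5:] + months[date[2:5]] + date[:2]   # '19930104'
compact_time = time.replace(':', '')                 # '93027'
compact_price = price.replace('.', '')               # '2887'
```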

Note that it works even with unsorted data, at the minor cost of
keeping all the output files open until the end of the process.
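If the number of distinct symbols ever approached the OS limit on open file descriptors, one alternative (a hypothetical variation, not part of the code above) is to open each file on demand - write mode on the first record for a symbol, append mode afterwards - so at most one handle is open at a time:

```python
import os
import tempfile

# Hypothetical variant: trade extra open/close calls for a bounded
# number of simultaneously open file handles.
out_dir = tempfile.mkdtemp()
seen = set()

def write_record(ticker, record):
    # 'w' truncates on the first encounter of a ticker, 'a' appends after
    mode = 'a' if ticker in seen else 'w'
    seen.add(ticker)
    with open(os.path.join(out_dir, ticker + '.txt'), mode) as f:
        f.write(record + '\n')

write_record('XYZ', '19930104, 93027, 2887, 7600, 40, 0, Z, N')
write_record('XYZ', '19930104, 93028, 2887, 1600, 40, 0, Z, N')
```

This is slower for files with many symbols repeated line after line, so the original keep-everything-open approach is preferable when the handle limit is not a concern.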

Chirag Wazir
http://chirag.freeshell.org



