text file parsing (awk -> python)

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Wed Nov 22 13:02:20 EST 2006


Peter Otten, your solution is very nice: it uses groupby to split on
empty lines, so it doesn't need to read the whole file into memory.

But Daniel Nogradi says:
> But the names of the fields (node, x, y) keeps changing from file to
> file, even their number is not fixed, sometimes it is (node, x, y, z).

Your version with the converters dict fails to convert the node
number, the z field, etc. (in general such a converters dict is an
elegant solution, since it lets you declare string, float, etc.
fields). One simple fix is to fall back to a default converter for
field names not listed in the dict; see the sketch after the quoted
dict:

> converters = dict(
>     x=int,
>     y=int
> )
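
A minimal sketch of that idea (the "label" field and the
default-to-int policy are my own assumptions, not part of your code):
look each field name up in the dict and fall back to a default
converter, so fields like node or z still get converted:

converters = dict(
    label=str,     # hypothetical non-numeric field, just for illustration
)

def convert(key, value, default=int):
    # Field names not listed in converters fall back to the default.
    return converters.get(key, default)(value)

print convert("node", "10")    # -> 10
print convert("z", "6")        # -> 6
print convert("label", "abc")  # -> 'abc'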


I have created a version with an RE, but it's probably too rigid: it
doesn't handle files with the z field, and so on:

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
# VERBOSE pattern: exactly three (name, signed integer) pairs per
# block, which is why it breaks on blocks with a different number of fields.
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
    # groups() is a flat (key1, value1, key2, value2, key3, value3) tuple
    block = obj.groups()
    d = dict((block[i], int(block[i+1])) for i in xrange(0, 6, 2))
    result.append(d)

print result
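
A possible less rigid variant (only a sketch, reusing the data string
and the re module from above, and assuming every value is a signed
integer): split the text into blocks on blank lines and match one
key/value pair at a time, so the number of fields per block doesn't
matter:

pair = re.compile(r"(\S+) \s+ ([-+]? \d+)", re.VERBOSE)

result2 = []
for block in re.split(r"\n\s*\n", data):
    # findall gives one (name, value) tuple per line of the block.
    d = dict((k, int(v)) for k, v in pair.findall(block))
    if d:
        result2.append(d)

print result2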


So I have just modified and simplified your nice solution (I have
removed the pprint, but the result is the same):

# Fake open() for testing: it ignores the filename and returns the
# sample data above as a file-like object.
def open(filename):
    from cStringIO import StringIO
    return StringIO(data)

from itertools import groupby

records = []
# groupby splits the lines into runs of blank and non-blank lines;
# each non-blank run is one record.
for empty, record in groupby(open("records.txt"), key=str.isspace):
    if not empty:
        pairs = ([k, int(v)] for k, v in map(str.split, record))
        records.append(dict(pairs))

print records
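
For completeness, a sketch of how the convert() helper from the first
sketch above could be plugged into this groupby version, so that field
names not listed in the converters dict still get converted (again
only a sketch against the example data):

records2 = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
    if not empty:
        # convert() is the per-field converter sketched above,
        # defaulting to int.
        pairs = ((k, convert(k, v)) for k, v in map(str.split, record))
        records2.append(dict(pairs))

print records2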

Bye,
bearophile



