On text processing
Daniel Nogradi
nogradi at gmail.com
Sat Mar 24 03:38:27 EDT 2007
> > I'm in the process of rewriting a bash/awk/sed script -- that grew too
> > big -- in python. I can rewrite it in a simple line-by-line way but
> > that results in ugly python code and I'm sure there is a simple
> > pythonic way.
> >
> > The bash script processed text files of the form:
> >
> > ###############################
> > key1 value1
> > key2 value2
> > key3 value3
> >
> > key4 value4
> > spec11 spec12 spec13 spec14
> > spec21 spec22 spec23 spec24
> > spec31 spec32 spec33 spec34
> >
> > key5 value5
> > key6 value6
> >
> > key7 value7
> > more11 more12 more13
> > more21 more22 more23
> >
> > key8 value8
> > ###################################
> >
> > I guess you get the point. If a line has two entries it is a key/value
> > pair which should end up in a dictionary. If a key/value pair is
> > followed by consecutive lines with more than two entries, it is a
> > matrix that should end up in a list of lists (matrix) that can be
> > identified by the key preceding it. The empty line after the last
> > line of a matrix signifies that the matrix is finished and we are back
> > to a key/value situation. Note that a matrix is always preceded by a
> > key/value pair so that it can really be identified by the key.
> >
> > Any elegant solution for this?
>
>
> My solution expects correctly formatted input and parses it into
> separate key/value and matrix holding dicts:
>
>
> from StringIO import StringIO
>
> fileText = '''\
> key1 value1
> key2 value2
> key3 value3
>
> key4 value4
> spec11 spec12 spec13 spec14
> spec21 spec22 spec23 spec24
> spec31 spec32 spec33 spec34
>
> key5 value5
> key6 value6
>
> key7 value7
> more11 more12 more13
> more21 more22 more23
>
> key8 value8
> '''
> infile = StringIO(fileText)
>
> keyvalues = {}
> matrices = {}
> for line in infile:
>     fields = line.strip().split()
>     if len(fields) == 2:
>         keyvalues[fields[0]] = fields[1]
>         lastkey = fields[0]
>     elif fields:
>         matrices.setdefault(lastkey, []).append(fields)
>
> ==============
> Here is the sample output:
>
> >>> from pprint import pprint as pp
> >>> pp(keyvalues)
> {'key1': 'value1',
>  'key2': 'value2',
>  'key3': 'value3',
>  'key4': 'value4',
>  'key5': 'value5',
>  'key6': 'value6',
>  'key7': 'value7',
>  'key8': 'value8'}
> >>> pp(matrices)
> {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
>           ['spec21', 'spec22', 'spec23', 'spec24'],
>           ['spec31', 'spec32', 'spec33', 'spec34']],
>  'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
> >>>

Paddy, thanks, this looks even better.

Paul, pyparsing looks like overkill; even the config parser module
is too complex for me for such a simple task. The text files are
actually input files to a program and will never be longer than
20-30 lines, so Paddy's solution is perfectly fine. In any case it's
good to know that there exists a module called pyparsing :)
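
For reference, here is a minimal sketch of the same approach wrapped in a
function and written for Python 3 (where StringIO lives in the io module).
The function name parse_blocks, the shortened sample text, and the
ValueError raised for a matrix row that does not follow a key/value pair
are assumptions of this sketch, not part of Paddy's post. It also treats
a blank line as the end of a matrix, as described in the original
question.

```python
from io import StringIO  # Python 3 home of StringIO


def parse_blocks(lines):
    """Return (keyvalues, matrices) parsed from an iterable of lines.

    parse_blocks is a hypothetical name chosen for this sketch.
    """
    keyvalues = {}
    matrices = {}
    lastkey = None  # key that a following matrix row would belong to
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            # Two entries: a key/value pair.
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif fields:
            # Any other non-empty line: a matrix row for the last key.
            if lastkey is None:
                # Assumption: a row with no preceding key is an error.
                raise ValueError("matrix row without a key: %r" % line)
            matrices.setdefault(lastkey, []).append(fields)
        else:
            # Blank line: the current matrix (if any) is finished.
            lastkey = None
    return keyvalues, matrices


# Shortened, made-up sample in the same format as the question.
sample = """\
key1 value1

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key5 value5
"""

keyvalues, matrices = parse_blocks(StringIO(sample))
```

Passing an open file object instead of StringIO(sample) should work the
same way, since the function only iterates over lines.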