On text processing

Sat Mar 24 03:38:27 EDT 2007

> > I'm in the process of rewriting a bash/awk/sed script -- that grew too
> > big -- in python. I can rewrite it in a simple line-by-line way but
> > that results in ugly python code and I'm sure there is a simple
> > pythonic way.
> >
> > The bash script processed text files of the form:
> >
> > ###############################
> > key1    value1
> > key2    value2
> > key3    value3
> >
> > key4    value4
> > spec11  spec12   spec13   spec14
> > spec21  spec22   spec23   spec24
> > spec31  spec32   spec33   spec34
> >
> > key5    value5
> > key6    value6
> >
> > key7    value7
> > more11   more12   more13
> > more21   more22   more23
> >
> > key8    value8
> > ###################################
> >
> > I guess you get the point. If a line has two entries it is a key/value
> > pair which should end up in a dictionary. If a key/value pair is
> > followed by consecutive lines with more than two entries, it is a
> > matrix that should end up in a list of lists (matrix) that can be
> > identified by the key preceding it. The empty line after the last
> > line of a matrix signifies that the matrix is finished and we are back
> > to a key/value situation. Note that a matrix is always preceded by a
> > key/value pair so that it can really be identified by the key.
> >
> > Any elegant solution for this?
>
>
> My solution expects correctly formatted input and parses it into two
> dicts, one holding the key/value pairs and one holding the matrices:
>
>
> from StringIO import StringIO
>
> fileText = '''\
> key1    value1
> key2    value2
> key3    value3
>
> key4    value4
> spec11  spec12   spec13   spec14
> spec21  spec22   spec23   spec24
> spec31  spec32   spec33   spec34
>
> key5    value5
> key6    value6
>
> key7    value7
> more11   more12   more13
> more21   more22   more23
>
> key8    value8
> '''
> infile = StringIO(fileText)
>
> keyvalues = {}
> matrices  = {}
> for line in infile:
>     fields = line.strip().split()
>     if len(fields) == 2:
>         keyvalues[fields[0]] = fields[1]
>         lastkey = fields[0]
>     elif fields:
>         matrices.setdefault(lastkey, []).append(fields)
>
> ==============
> Here is the sample output:
>
> >>> from pprint import pprint as pp
> >>> pp(keyvalues)
> {'key1': 'value1',
>  'key2': 'value2',
>  'key3': 'value3',
>  'key4': 'value4',
>  'key5': 'value5',
>  'key6': 'value6',
>  'key7': 'value7',
>  'key8': 'value8'}
> >>> pp(matrices)
> {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
>           ['spec21', 'spec22', 'spec23', 'spec24'],
>           ['spec31', 'spec32', 'spec33', 'spec34']],
>  'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
> >>>

Paddy, thanks, this looks even better.
Paul, pyparsing looks like overkill here; even the config parser module
is more complex than I need for such a simple task. The
text files are actually input files to a program and will never be
longer than 20-30 lines, so Paddy's solution is perfectly fine. In any
case it's good to know that a module called pyparsing exists :)
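
For the archive, here is roughly how Paddy's loop could be wrapped into a
function. This is only a sketch under a few assumptions of mine: it targets
Python 3 (hence io.StringIO instead of the StringIO module), the name
parse_blocks is made up for illustration, and unlike Paddy's version it
treats a blank line as the end of the current matrix and raises an error on
lines that are neither a key/value pair nor a matrix row.

```python
from io import StringIO  # assumption: Python 3; Paddy's code used the Python 2 StringIO module


def parse_blocks(lines):
    """Return (keyvalues, matrices) parsed from an iterable of text lines.

    parse_blocks is a hypothetical helper name, not part of any library.
    """
    keyvalues = {}
    matrices = {}
    lastkey = None  # key that a following matrix would belong to
    for line in lines:
        fields = line.split()  # split() with no argument also drops the newline
        if not fields:
            # A blank line finishes any matrix in progress.
            lastkey = None
        elif len(fields) == 2:
            # Two entries: a key/value pair.
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif len(fields) > 2:
            # More than two entries: a row of the matrix named by lastkey.
            if lastkey is None:
                raise ValueError("matrix row without a preceding key: %r" % line)
            matrices.setdefault(lastkey, []).append(fields)
        else:
            # A single entry fits neither shape described in the question.
            raise ValueError("unexpected line: %r" % line)
    return keyvalues, matrices


sample = """\
key1    value1

key4    value4
spec11  spec12   spec13
spec21  spec22   spec23

key8    value8
"""

keyvalues, matrices = parse_blocks(StringIO(sample))
print(keyvalues)
print(matrices)
```

As in Paddy's version, a key that introduces a matrix (key4 here) shows up in
both dicts: once with its own value and once with its matrix. A matrix row
that happens to have exactly two entries would still be read as a key/value
pair, so the format has to guarantee that matrix rows have more than two
entries.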