[TriZPUG] More Fun With Text Processing
Josh Johnson
jj at email.unc.edu
Mon Apr 6 16:22:45 CEST 2009
Josh Johnson wrote:
> Ok all,
> Since we've got a brain trust of pythonistas that know how to deal
> with strings, here's a problem I'm facing right now that I'd like some
> input on:
>
> I've got a tabular list (the output from a command-line program),
> and I need to parse it into some sort of structure.
>
> Here's an example of the data (the headings and column width will vary):
> TARGET  VOLUME GROUP  LENGTH      AVAILABLE   NPE      MIRROR
> 1.1     HIGHAVAIL     5001.023GB  4501.008GB  1192337  2.1
> 1.3     BACKUP        5001.023GB  4250.759GB  1192337
> 1.4     BACKUP        3000.613GB  3000.353GB  715402
> 2.2     HIGHAVAIL     5001.023GB  5001.015GB  1192337  1.2
> 2.3     BACKUP        5001.023GB  5000.763GB  1192337
> 2.4     BACKUP        3000.613GB  3000.353GB  715402
>
> I'd like a structure I can work with, like say, a list of hashes.
>
> My initial approach involves treating the header row as the guide for
> the field lengths, and then extracting substrings for each field in
> each row.
>
> I also thought about just doing a split on spaces, but some of the
> fields could have spaces in their data.
>
> What do you guys think?
>
> JJ
> _______________________________________________
> TriZPUG mailing list
> TriZPUG at python.org
> http://mail.python.org/mailman/listinfo/trizpug
> http://trizpug.org is the Triangle Zope and Python Users Group
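
For reference, the header-guided idea described in the quoted message (use the header row to find the field offsets, then cut substrings out of each row) might look roughly like the sketch below. The function name `parse_fixed_width` is hypothetical, and the sketch assumes that every value is left-aligned under its header and never runs over into the next column; neither assumption is confirmed anywhere in the thread.

```python
import re


def parse_fixed_width(output):
    """Slice each row at the offsets where the header fields start.

    Hypothetical helper: assumes left-aligned columns whose values never
    spill past the start of the next header field.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0]
    # A header field is a run of non-space characters, optionally joined
    # by single spaces, so 'VOLUME GROUP' counts as one field.
    starts = [m.start() for m in re.finditer(r'\S+(?: \S+)*', header)]
    ends = starts[1:] + [None]  # the last field runs to the end of the line
    names = [header[s:e].strip() for s, e in zip(starts, ends)]
    rows = []
    for line in lines[1:]:
        # Slicing past the end of a short row simply yields ''.
        values = [line[s:e].strip() for s, e in zip(starts, ends)]
        rows.append(dict(zip(names, values)))
    return rows


sample = (
    "TARGET  VOLUME GROUP  LENGTH      AVAILABLE   NPE      MIRROR\n"
    "1.1     HIGHAVAIL     5001.023GB  4501.008GB  1192337  2.1\n"
    "1.3     BACKUP        5001.023GB  4250.759GB  1192337\n"
)
rows = parse_fixed_width(sample)
print(rows[0]['VOLUME GROUP'])
print(repr(rows[1]['MIRROR']))
```

Because each value is taken from a fixed position, an empty column in the middle of a row stays empty instead of shifting the values to its right.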
I appreciate everyone's help with this. It was seriously driving me nuts
last week, and now I've got a viable solution.
The solution I've got is a hybrid of almost everything everybody
suggested. It's beautiful, really :)
Thanks everybody, you guys are awesome :)
JJ
import re
import csv

try:
    from StringIO import StringIO  # Python 2, as originally posted
except ImportError:
    from io import StringIO  # Python 3


def parse_table(output):
    """
    Parse tabular output from the coraid VS in commands like ``lsvg``
    and ``lslun``

    The first line in output is considered to be the header.

    @param output: the tabular output from some command.
    @return: list of hashes, each hash indexed by the header fields
             ex: [{'NAME':'HIGHAVAIL',
                   'LENGTH':'10002.047GB'},
                  -etc-]

    Test data: lspv
    ---------------
    >>> output = '''TARGET  VOLUME GROUP  LENGTH      AVAILABLE   NPE      MIRROR
    ... 1.1     HIGHAVAIL     5001.023GB  4501.008GB  1192337  2.1
    ... 1.3     BACKUP        5001.023GB  4250.759GB  1192337
    ... 1.4     BACKUP        3000.613GB  3000.353GB  715402
    ... '''
    >>> data = parse_table(output)

    Check some random data

    >>> data[0]['TARGET']
    '1.1'
    >>> data[1]['VOLUME GROUP']
    'BACKUP'

    Check the last column, some are empty (a missing value comes back
    as None, so nothing is printed)

    >>> data[0]['MIRROR']
    '2.1'
    >>> data[1]['MIRROR']
    >>> data[2]['MIRROR']

    Test Data: lslun
    ----------------
    >>> output = '''LUN  LENGTH      ONLINE  TARGET  LABEL
    ... 51   2000.003GB  ON      99.51   Lab 1
    ... 68   500.002GB   ON      99.68   Admin
    ... 130  750.000GB   ON      99.130  Admin Backup
    ... 131  3000.001GB  ON      99.131  Lab 1 Backup
    ... '''
    >>> data = parse_table(output)
    >>> data[0]['TARGET']
    '99.51'
    >>> data[0]['LABEL']
    'Lab 1'

    Test Data: lsvg
    ---------------
    >>> output = '''NAME       LENGTH       AVAILABLE    EXTSZ  PVS
    ... HIGHAVAIL  10002.047GB  7502.016GB   4MB    02/02
    ... BACKUP     16003.274GB  12253.231GB  4MB    04/04
    ... '''
    >>> data = parse_table(output)
    >>> data[0]['LENGTH']
    '10002.047GB'
    >>> data[1]['EXTSZ']
    '4MB'
    """
    # Collapse every run of two or more spaces/tabs into a single tab.
    # The match is limited to spaces and tabs (rather than \s) so that a
    # blank line or trailing padding cannot glue two rows together.
    output = re.sub(r'[ \t]{2,}', '\t', output)
    # Let the csv module read the result as tab-separated data; the
    # first line supplies the dictionary keys.
    output = StringIO(output)
    reader = csv.DictReader(output, dialect='excel-tab')
    data = []
    for row in reader:
        data.append(row)
    return data
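
One caveat that seems worth illustrating: collapsing runs of whitespace into a single tab works as long as only trailing columns are ever empty, which is the case in all of the sample data above. The sketch below is a compact Python 3 restatement of the same two steps (the name `parse_table_py3` is hypothetical) and shows what would presumably happen if a middle column were blank. Its regex matches only spaces and tabs, an assumption made here so that newlines are never swallowed.

```python
import csv
import io
import re


def parse_table_py3(output):
    """Collapse gaps of 2+ spaces/tabs into one tab, then read as TSV."""
    tabbed = re.sub(r'[ \t]{2,}', '\t', output)
    return list(csv.DictReader(io.StringIO(tabbed), dialect='excel-tab'))


# Only the trailing column is missing: DictReader fills it with None.
sample = (
    "LUN  LENGTH      LABEL\n"
    "51   2000.003GB  Lab 1\n"
    "68   500.002GB\n"
)
rows = parse_table_py3(sample)
print(rows[0]['LABEL'])  # Lab 1
print(rows[1]['LABEL'])  # None

# A blank *middle* column disappears in the collapse, so the value that
# belongs under C ends up under B.
shifted = parse_table_py3("A  B  C\n1     3\n")
print(shifted[0]['B'])  # 3
print(shifted[0]['C'])  # None
```

If the command output can contain blank middle columns, slicing by header offsets is likely the safer choice.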