[TriZPUG] More Fun With Text Processing

Mon Apr 6 16:22:45 CEST 2009

Josh Johnson wrote:
> Ok all,
> Since we've got a brain trust of pythonistas that know how to deal 
> with strings, here's a problem I'm facing right now that I'd like some 
> input on:
>
> I've got a tabular list, it's the output from a command-line program, 
> and I need to parse it into some sort of structure.
>
> Here's an example of the data (the headings and column width will vary):
> TARGET         VOLUME GROUP        LENGTH     AVAILABLE         NPE  MIRROR
> 1.1               HIGHAVAIL    5001.023GB    4501.008GB     1192337  2.1
> 1.3                  BACKUP    5001.023GB    4250.759GB     1192337
> 1.4                  BACKUP    3000.613GB    3000.353GB      715402
> 2.2               HIGHAVAIL    5001.023GB    5001.015GB     1192337  1.2
> 2.3                  BACKUP    5001.023GB    5000.763GB     1192337
> 2.4                  BACKUP    3000.613GB    3000.353GB      715402
>
> I'd like a structure I can work with, like say, a list of hashes.
>
> My initial approach involves treating the header row as the guide for 
> the field lengths, and then extracting substrings for each field in 
> each row.
>
> I also thought about just doing a split on spaces, but some of the 
> fields could have spaces in their data.
>
> What do you guys think?
>
> JJ
I appreciate everyone's help with this. It was seriously driving me nuts 
last week, and now I've got a viable solution.

The solution I've got is a hybrid of almost everything everybody 
suggested. It's beautiful, really :)

Thanks everybody, you guys are awesome :)

JJ

import re
import csv
from io import StringIO  # Python 3 location; Python 2 used the StringIO module

def parse_table(output):
    """
    Parse tabular output from the Coraid VS in commands like ``lsvg``
    and ``lslun``
   
    The first line in output is considered to be the header.
   
    @param output: the tabular output from some command.
    @return: list of hashes, each hash indexed by the header fields
             ex: [{'NAME':'HIGHAVAIL',
                   'LENGTH':'10002.047GB'},
                   -etc-]
   
   
    Test data: lspv
    ---------------
    >>> output = '''TARGET         VOLUME GROUP        LENGTH     AVAILABLE         NPE  MIRROR
    ... 1.1               HIGHAVAIL    5001.023GB    4501.008GB     1192337  2.1
    ... 1.3                  BACKUP    5001.023GB    4250.759GB     1192337
    ... 1.4                  BACKUP    3000.613GB    3000.353GB      715402
    ... '''
    >>> data = parse_table(output)
   
    Check some random data
    >>> data[0]['TARGET']
    '1.1'
    >>> data[1]['VOLUME GROUP']
    'BACKUP'
   
    Check the last column, some are empty
    >>> data[0]['MIRROR']
    '2.1'
    >>> data[1]['MIRROR'] is None
    True
    >>> data[2]['MIRROR'] is None
    True
   
    Test Data: lslun
    ----------------
    >>> output = '''LUN           LENGTH  ONLINE      TARGET  LABEL
    ... 51        2000.003GB      ON       99.51  Lab 1
    ... 68         500.002GB      ON       99.68  Admin
    ... 130        750.000GB      ON      99.130  Admin Backup
    ... 131       3000.001GB      ON      99.131  Lab 1 Backup
    ... '''
    >>> data = parse_table(output)
   
    >>> data[0]['TARGET']
    '99.51'
    >>> data[0]['LABEL']
    'Lab 1'
   
    Test Data: lsvg
    ---------------
    >>> output = '''NAME                      LENGTH        AVAILABLE  EXTSZ    PVS
    ... HIGHAVAIL            10002.047GB       7502.016GB    4MB  02/02
    ... BACKUP               16003.274GB      12253.231GB    4MB  04/04
    ... '''
    >>> data = parse_table(output)
   
    >>> data[0]['LENGTH']
    '10002.047GB'
    >>> data[1]['EXTSZ']
    '4MB'
    """
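    # Collapse each run of two or more spaces into a single tab. Match
    # spaces only: \s would also match newlines, so trailing blanks at the
    # end of a row would merge it with the next row.
    output = re.sub(r' {2,}', '\t', output)

    # DictReader uses the first line as the field names. Rows that are
    # missing trailing fields get None for those keys.
    reader = csv.DictReader(StringIO(output), dialect='excel-tab')

    return list(reader)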
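
For anyone following along, here is a minimal sketch of the two ideas the function leans on: splitting on runs of two or more spaces instead of on any whitespace, and letting csv.DictReader fill in missing trailing columns. The row is taken from the lslun test data above. The helper name split_row and the tiny A/B table are made up for this illustration and are not part of the function above.

```python
import csv
import re
from io import StringIO


def split_row(line):
    """Split one table row on runs of two or more spaces (hypothetical helper)."""
    return re.split(r' {2,}', line.strip())


row = "130        750.000GB      ON      99.130  Admin Backup"

# A plain whitespace split breaks the label into two fields.
naive = row.split()

# Splitting on 2+ spaces keeps "Admin Backup" together.
fields = split_row(row)

# DictReader fills keys for missing trailing fields with None (its default
# restval), which is why an empty last column shows up as None.
short = list(csv.DictReader(StringIO("A\tB\n1\n"), dialect='excel-tab'))

print(naive)
print(fields)
print(short[0]['B'])
```

This only holds while columns are separated by at least two spaces and no field contains a double space of its own.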