table (ascii text) lin ayout recognition

Wed Sep 13 10:28:50 EDT 2006

"James Stroud" <jstroud at mbi.ucla.edu> wrote in message 
news:8RPNg.748$TV3.408 at newssvr21.news.prodigy.com...
> vbfoobar at gmail.com wrote:
>> Hello,
>>
>> I am looking for Python code to process
>> tables that are in ASCII text. The code must
>> determine where the columns (fields) are.
>> The tables in my application vary,
>> but their columns are not very hard
>> for a human to locate, because even
>> when ignoring the semantics of the words,
>> our eyes see vertical alignments.
>>
>> Here is a sample table (must be viewed
>> with fixed-width font to see alignments):
>> =================================
>>
>> 44544      ipod          apple     black         102
>> GFGFHHF-12 unknown thing bizar     brick mortar  tbc
>> 45fjk      do not know   + is less               biac
>>            disk          seagate   250GB         130
>> 5G_gff                   tbd       tbd
>> gjgh88hgg  media record  a and b                 12
>> hjj        foo           bar       hop           zip
>> hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
>> qdsd       zert                    nope          nope
>>
>> =================================
>>
>> I want the python code that builds a representation
>> of this table (for example a list of lists, where each list
>> represents a table line, each element of the list
>> being a field value).
>>
>> Any hints?
>> thanks
>>
>
> As promised. I call this the "cast a shadow" algorithm for table 
> discovery. This is about as obfuscated as I could make it. It will be up 
> to you to explain it to your teacher ;-)
>

James -

I used your same algorithm, but I guess I used more brute force (and didn't 
use pyparsing, either!).

-- Paul

data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope""".split('\n')

# find rightmost space characters delimiting text columns
spaceCols = set(range(max(map(len, data)))) - \
            set( [col for line in data
                      for col,c in enumerate(line.expandtabs())
                      if not c.isspace() ] )
spaceCols -= set( [c for c in spaceCols if c+1 in spaceCols ] )

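# convert to sorted list of leading col characters
# (a list comprehension rather than map(), so the result is a real list
# that can be concatenated below on Python 3 as well as Python 2)
spaceCols = [c + 1 for c in sorted(spaceCols)]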

# get and pretty-print data fields
dataFields = \
    [ [line.expandtabs()[start:stop] for (start,stop) in
        zip([0]+spaceCols,spaceCols+[None])] for line in data ]
import pprint
pprint.pprint( dataFields )

Gives:

[['44544      ', 'ipod          ', 'apple     ', 'black         ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar  ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '              ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB         ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '              ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop           ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope          ', 'nope']]