table (ascii text) layout recognition
Paul McGuire
ptmcg at austin.rr._bogus_.com
Wed Sep 13 10:28:50 EDT 2006
"James Stroud" <jstroud at mbi.ucla.edu> wrote in message
news:8RPNg.748$TV3.408 at newssvr21.news.prodigy.com...
> vbfoobar at gmail.com wrote:
>> Hello,
>>
>> I am looking for Python code to process
>> tables that are in ASCII text. The code must
>> determine where the columns (fields) are.
>> The tables in my application vary,
>> but their columns are not very hard
>> for a human to locate, because even
>> when ignoring the semantics of the words,
>> our eyes see the vertical alignments.
>>
>> Here is a sample table (must be viewed
>> with fixed-width font to see alignments):
>> =================================
>>
>> 44544      ipod          apple     black        102
>> GFGFHHF-12 unknown thing bizar     brick mortar tbc
>> 45fjk      do not know   + is less              biac
>>            disk          seagate   250GB        130
>> 5G_gff                   tbd       tbd
>> gjgh88hgg  media record  a and b                12
>> hjj        foo           bar       hop          zip
>> hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
>> qdsd       zert                    nope         nope
>>
>> =================================
>>
>> I want Python code that builds a representation
>> of this table (for example a list of lists, where each list
>> represents a table line, each element of the list
>> being a field value).
>>
>> Any hints?
>> thanks
>>
>
> As promised. I call this the "cast a shadow" algorithm for table
> discovery. This is about as obfuscated as I could make it. It will be up
> to you to explain it to your teacher ;-)
>
James -
I used your same algorithm, but I guess I used more brute force (and didn't
use pyparsing, either!).
-- Paul
data = """\
44544      ipod          apple     black        102
GFGFHHF-12 unknown thing bizar     brick mortar tbc
45fjk      do not know   + is less              biac
           disk          seagate   250GB        130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                12
hjj        foo           bar       hop          zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope         nope""".split('\n')
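import pprint

# find rightmost space characters delimiting text columns:
# start with every column position, then remove each position where
# any line has a non-space character (the "shadow" cast by the text)
spaceCols = set(range(max(len(line.expandtabs()) for line in data))) - \
    set( [col for line in data
              for col,c in enumerate(line.expandtabs())
              if not c.isspace() ] )
# within each run of space columns, keep only the rightmost one
spaceCols -= set( [c for c in spaceCols if c+1 in spaceCols ] )

# convert to sorted list of leading col characters
# (a real list, so that it can be concatenated below under Python 3 too)
spaceCols = [c+1 for c in sorted(spaceCols)]

# get and pretty-print data fields
dataFields = \
    [ [line.expandtabs()[start:stop] for (start,stop) in
        zip([0]+spaceCols,spaceCols+[None])] for line in data ]
pprint.pprint( dataFields )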
Gives:
[['44544      ', 'ipod          ', 'apple     ', 'black        ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '             ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB        ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '             ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop          ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope         ', 'nope']]
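
If you want the fields without their padding, the same "shadow" idea can be wrapped in a couple of functions. This is only a sketch under a few assumptions: the names find_column_starts and split_table and the small sample table are made up for illustration and do not come from the posts above. Unlike the script above, it begins a column at the start of each run of occupied positions, so an indented table does not get an empty leading field, and it strips every field.

```python
def find_column_starts(lines):
    """Return the offsets at which columns begin (hypothetical helper)."""
    lines = [line.expandtabs() for line in lines]
    # The "shadow": every position where at least one line has text.
    shadow = {col for line in lines
                  for col, ch in enumerate(line) if not ch.isspace()}
    # A column starts where a shadowed position has no shadow to its left.
    return sorted(col for col in shadow if col - 1 not in shadow)


def split_table(lines):
    """Split each line at the detected column starts and strip the fields."""
    starts = find_column_starts(lines)
    stops = starts[1:] + [None]
    return [[line.expandtabs()[a:b].strip() for a, b in zip(starts, stops)]
            for line in lines]


# Made-up sample: "red apple" stays one field because "name" and "plum"
# put text in the position where its embedded space sits.
sample = [
    "id     name        qty",
    "a1     red apple   3",
    "       plum",
    "b22    fig         12",
]

for row in split_table(sample):
    print(row)
```

The usual caveat still applies: a multi-word field is only kept together if some other line has text at the position of its embedded space, otherwise that position looks like a column gap.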