[Tutor] extract meaningful data from garbage

Alan Gauld alan.gauld at btinternet.com
Sun Jan 3 09:56:22 CET 2010


"Shashwat Anand" <anand.shashwat at gmail.com> wrote 

> here are the examples : http://codepad.org/wF8APZV3
> 
>> I need to extract some meaningful data from garbage.

>> How should I parse info if I'm not certain of any definite rules. This is
>> my first time dealing with real-life data.

Unfortunately, to parse it you will need to define a set of rules.
The company name seems to consistently follow a line like

1511261 - 08/12/2006

So that should be relatively easy to extract. However the 
address data seems much more random.
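
A minimal sketch of that first rule, assuming the header is always a run
of digits, " - ", and a dd/mm/yyyy date alone on a line, and that the
company name is the first non-blank line after it. The function name and
the sample company are made up for illustration; I have not run this
against your real data.

```python
import re

# Assumed header shape: digits, " - ", then a dd/mm/yyyy date, alone on a line.
HEADER = re.compile(r"^\s*\d+\s*-\s*\d{2}/\d{2}/\d{4}\s*$")

def company_names(lines):
    """Return the first non-blank line following each header line."""
    names = []
    expecting_name = False
    for line in lines:
        if HEADER.match(line):
            expecting_name = True
        elif expecting_name and line.strip():
            names.append(line.strip())
            expecting_name = False
    return names

sample = [
    "1511261 - 08/12/2006",
    "ACME WIDGETS LTD",   # made-up company name
    "1 Example Street",   # made-up address line
]
print(company_names(sample))
```

If real headers turn out to vary (different date format, extra fields),
the pattern is the only part that should need changing.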

Also, you don't say how you want to treat the alternative
names/addresses (e.g. "trading as...").

As a first attempt, the address follows immediately
after the company name except when
1) the next line begins with "trading as" or
2) the next line begins with "(", in which case the address
   follows the closing ")"

Those two rules are sufficient for the 4 examples you posted. 
You may have other cases where they break down.
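
The two rules above could be sketched roughly as follows. This takes the
lines that come after the company name and returns what is left once the
alternative-name material is skipped. It reads the closing delimiter in
rule 2 as ")". It also assumes a "trading as" name fits on a single line
and that parentheses are not nested; neither is confirmed by the samples.
The function name and test strings are made up.

```python
def strip_alternatives(lines):
    """Drop leading 'trading as' / parenthesised lines; return the rest.

    `lines` is the list of lines that follow the company name.
    """
    rest = [ln.strip() for ln in lines if ln.strip()]
    while rest:
        first = rest[0]
        if first.lower().startswith("trading as"):
            # Assumption: the alternative name occupies this one line only.
            rest = rest[1:]
        elif first.startswith("("):
            # Scan forward to the line that holds the closing ")".
            i = 0
            while i < len(rest) and ")" not in rest[i]:
                i += 1
            if i == len(rest):
                break  # no closing ")" found: leave the lines untouched
            # Anything after ")" on that line is treated as address text.
            tail = rest[i].split(")", 1)[1].strip()
            rest = ([tail] if tail else []) + rest[i + 1:]
        else:
            break
    return rest

print(strip_alternatives(["(formerly BAR", "LTD) 1 High St", "Town"]))
```

Applying both rules in a loop also copes with a "trading as" line followed
by a parenthesised one, which goes slightly beyond what the four examples
need.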

The address seems to consistently stop on the line above
MANUFACTURER...

But again other data samples may show cases where that breaks
down. You will need to create more rules to cover those cases.

Unfortunately the address data itself is not consistent, which
makes it difficult to define a rule to recognise it in its own right.

HTH

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/