[Tutor] extract meaningful data from garbage

Lie Ryan lie.1296 at gmail.com
Sun Jan 3 10:23:18 CET 2010


On 1/3/2010 4:58 PM, Shashwat Anand wrote:
> I need to extract some meaningful data from garbage.
> Here are four examples. I need to get the date, company name and address
> from these.
> For the date I used a regex, but I'm unable to find any definite pattern
> for the address and company name.
> The format is more or less:
> garbage
> id - date
> garbage
> company name
> garbage
> company address
> garbage
>
> How should I parse info if I'm not certain of any definite rules. This
> is my first time dealing with real-life data.

Other than the "id - date" line, it seems quite difficult to reliably
extract the company names and addresses. Extracting them will have to be
done on a best-effort basis.

Tip: look for clue keywords. Company names often end with
ltd/sdn/bhd/berhad; a line that starts with "address" is often followed
by the actual address; etc.
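
A minimal sketch of the keyword idea, assuming the record has already been
split into a list of lines. The function names, the suffix list and the
handling of the "address" label are my own guesses for illustration, not
rules derived from your data, so expect to adjust them:

```python
import re

# Assumed clue keywords, taken from the tip above; extend as you find more.
COMPANY_SUFFIX = re.compile(r"\b(?:ltd|sdn|bhd|berhad)\b\.?\s*$", re.IGNORECASE)
# Assumed label format: "address", optionally followed by ":" or "-".
ADDRESS_LABEL = re.compile(r"^\s*address\b\s*[:\-]?\s*(.*)$", re.IGNORECASE)

def guess_company(lines):
    """Return the first line ending with a known company suffix, or None."""
    for line in lines:
        if COMPANY_SUFFIX.search(line):
            return line.strip()
    return None

def guess_address(lines):
    """Return the text that follows an "address" label, or None.

    If the label line carries no text of its own, fall back to the
    next non-empty line.
    """
    for i, line in enumerate(lines):
        match = ADDRESS_LABEL.match(line)
        if not match:
            continue
        rest = match.group(1).strip()
        if rest:
            return rest
        for following in lines[i + 1:]:
            if following.strip():
                return following.strip()
    return None
```

Both functions return None when no clue is found, so the caller can tell
"no guess" apart from a real value and flag that record for manual review.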

Tip: this is a good showcase for TDD (test-driven development). Pick
twenty or so cases, manually extract the information from them, and write
your program to match as many of these test cases as possible. While
extracting by hand you should notice additional patterns that you can use
later on when writing your program.
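
One possible way to set this up, as a rough sketch: keep the hand-extracted
answers in a table and count how many of them your extractor reproduces.
The name `score`, the `extract` argument and the sample record are all made
up for illustration; `extract` stands for whatever function you end up
writing, taking the raw text and returning a dict:

```python
# Hand-labelled cases: (raw text, expected fields). The record below is
# invented for illustration; replace it with twenty or so real ones.
CASES = [
    (
        "junk\n123 - 2010-01-03\njunk\nAcme Trading Sdn Bhd\njunk\n"
        "Address: 12 Example Road\njunk\n",
        {
            "date": "2010-01-03",
            "company": "Acme Trading Sdn Bhd",
            "address": "12 Example Road",
        },
    ),
]

def score(extract, cases=CASES):
    """Run extract(raw) on every case and return (passed, failures).

    failures is a list of (raw, expected, got) tuples, handy for
    spotting which pattern to handle next.
    """
    passed = 0
    failures = []
    for raw, expected in cases:
        got = extract(raw)
        if got == expected:
            passed += 1
        else:
            failures.append((raw, expected, got))
    return passed, failures
```

Since the extraction is best-effort, a count of passing cases is more
useful here than an all-or-nothing test run: you can watch the number go up
as you add rules, and notice when a new rule breaks an older case.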


