Heuristically processing documents

Hendrik van Rooyen mail at microcorp.co.za
Fri Mar 20 04:34:48 EDT 2009


"MRAB" <google at mrett.plus.com> wrote:

BJörn Lindqvist wrote:
 8< ---------------------------
>> For example, to find the email you can use a simple regexp. If there
>> is a match you can be certain that that is the author's email. But what
>> algorithms can you use to figure out the other information?
>>
>Tricky! :-)
>
>How would _you_ recognise them? Have a look at the documents and see if
>you can see a pattern. For example, names and addresses often consist of a
>sequence of words in title case, e.g. "Björn Lindqvist", which might help
>you narrow down the list of possibilities. What do telephone numbers
>look like, etc?

It may help you to think about the problem if you imagine yourself having
to extract the information from documents written in a language that
you do not understand.

An address may be identified by a number in a line (street address or PO box)
that is followed some lines later by another number (zip code).
But this hardly qualifies as an "algorithm".
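
A rough sketch of that number-then-number idea, in Python. The function
name and the max_gap window are my own assumptions, not anything from the
original documents, and any line containing a digit counts as a "number"
here, so dates and phone numbers will trigger it as well:

```python
import re


def find_address_block(lines, max_gap=4):
    """Return (start, end) indexes of the first candidate address block.

    A candidate is a line containing a digit (street number or PO box)
    followed within max_gap lines by another line containing a digit
    (possibly the zip code). Returns None if no such pair exists.
    """
    has_number = [bool(re.search(r"\d", line)) for line in lines]
    for start, flag in enumerate(has_number):
        if not flag:
            continue
        # Look a few lines further down for a second number-bearing line.
        stop = min(start + 1 + max_gap, len(lines))
        for end in range(start + 1, stop):
            if has_number[end]:
                return start, end
    return None
```

The result only marks where to look; deciding whether the block really is
an address still needs a human or a better heuristic.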

A "mailto:" and/or a set of "angle brackets" is a strong clue too...
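
Something along these lines would pick up both clues. It is only a
sketch: the patterns are deliberately loose, the names MAILTO, BRACKETED
and email_clues are made up for the example, and no attempt is made to
validate the address properly:

```python
import re

# Anything after "mailto:" up to whitespace, a quote or an angle bracket.
MAILTO = re.compile(r"mailto:([^\s<>\"']+)", re.IGNORECASE)
# something@something enclosed in <...>, with an optional mailto: prefix.
BRACKETED = re.compile(r"<(?:mailto:)?([^<>\s@]+@[^<>\s@]+)>", re.IGNORECASE)


def email_clues(text):
    """Return addresses hinted at by mailto: or angle brackets, in order, without repeats."""
    found = MAILTO.findall(text) + BRACKETED.findall(text)
    # dict.fromkeys keeps the first occurrence of each address and drops the rest.
    return list(dict.fromkeys(found))
```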

I don't have a clue about the name, though: plain title case might
work for "John Brown" but it fails with "Koos van der Merwe".
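
To make the failure concrete, here is a small sketch of the plain title
case test, followed by one possible relaxation that lets lower-case
particles sit between capitalised words. The PARTICLES set is purely
illustrative and nowhere near complete, and both function names are
invented for the example:

```python
def looks_like_plain_title_case(candidate):
    """True if there are at least two words and every word is title case."""
    words = candidate.split()
    return len(words) >= 2 and all(w.istitle() for w in words)


# Illustrative only: a handful of name particles, certainly not a full list.
PARTICLES = {"van", "der", "de", "den", "von", "la", "le"}


def looks_like_name(candidate):
    """Like the plain test, but inner words may also be known particles."""
    words = candidate.split()
    if len(words) < 2:
        return False
    # The first and last word must still be capitalised.
    if not (words[0].istitle() and words[-1].istitle()):
        return False
    return all(w.istitle() or w in PARTICLES for w in words[1:-1])
```

The relaxed test accepts "Koos van der Merwe", but it also accepts
"Cape Town", so at best it narrows the list of candidates.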

If there is an email addy in the doc, then it might serve as a clue
to where to look - based on the theory that the contact information
would be grouped together.
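
As a sketch of that grouping theory: find the first line with something
that looks like an email address and hand back the lines around it for
closer inspection. The radius of three lines is an arbitrary assumption,
and the EMAIL pattern is intentionally crude:

```python
import re

# Crude "looks like an address" pattern: something@something.something
EMAIL = re.compile(r"[^\s<>@]+@[^\s<>@]+\.[^\s<>@]+")


def lines_near_email(lines, radius=3):
    """Return the lines within radius of the first email-bearing line, or []."""
    for i, line in enumerate(lines):
        if EMAIL.search(line):
            first = max(0, i - radius)
            return lines[first:i + radius + 1]
    return []
```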

Another clue might be to look for the word "Author" or its
equivalent in a bunch of languages.
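
A sketch of the label hunt, assuming the label starts a line and the name
follows on the same line. The word list covers only a few languages and
is there for illustration; AUTHOR_WORDS and labelled_authors are names
invented for the example:

```python
import re

# "Author" in English, French, German/Spanish, Italian and Swedish.
AUTHOR_WORDS = ["author", "auteur", "autor", "autore", "författare"]

# Label at the start of a line, optional ":" or "-", then the rest of the line.
AUTHOR_LABEL = re.compile(
    r"^[ \t]*(?:" + "|".join(AUTHOR_WORDS) + r")\b[ \t]*[:\-]?[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


def labelled_authors(text):
    """Return whatever follows an author label on the same line."""
    return [m.group(1).strip() for m in AUTHOR_LABEL.finditer(text)]
```

The word boundary keeps "Authorization" from matching, but a document
that puts the label and the name on separate lines will slip through.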

"Tricky" is an understatement.

- Hendrik




