Finding People's Names in Files

John J. Lee jjl at pobox.com
Thu Oct 11 15:25:16 EDT 2007


brad <byte8bits at gmail.com> writes:

> Crazy question, but has anyone attempted this or seen Python code that
> does? For example, if a text file contained 'Guido' and or 'Robert'
> and or 'Susan', then we should return True, otherwise return False.
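
For the simple case where you already have a list of names, the check
described above might look something like the following.  This is only a
minimal sketch: the function names and the tokenising regular expression
are my own assumptions, not something from the original question, and it
matches whole words case-insensitively.

```python
import re

def contains_any_name(text, names):
    """Return True if any of the given names appears in text as a whole word."""
    # Lower-case everything so that 'guido' and 'Guido' both match.
    wanted = {name.lower() for name in names}
    # Treat runs of letters and apostrophes as tokens (an assumption).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return any(token in wanted for token in tokens)

def file_contains_any_name(path, names):
    """Same check, applied to the contents of a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return contains_any_name(handle.read(), names)
```

Because it compares whole tokens, 'Robertson' does not count as a match
for 'Robert'.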

A few ideas:

1. If you don't have a list of names, find a word list that doesn't
contain proper nouns (there are a few word lists out there, though I'm
not sure any of them exclude people's names).  Look for short runs of
two or three "words" (punctuation-separated tokens) in the text that
aren't in the word list.  Some of them will be people's names.

2. Send the text through Google Translate and look for runs of words
that come back unchanged.  Some of them will be people's names.

3. Search the literature and look for fancy algorithms.  Here are some
papers (the last mentions some commercial software to do this):

http://citeseer.ist.psu.edu/bikel99algorithm.html

http://citeseer.ist.psu.edu/618945.html

http://arxiv.org/html/cmp-lg/9706017
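
Idea 1 above could be sketched roughly as follows.  Everything here is
an assumption made for illustration: the function name, the min_run and
max_run parameters, and the tokenising regular expression are all made
up, and the word list is passed in as a plain Python collection rather
than read from any particular file.  Note that this simple version does
not break runs at sentence punctuation.

```python
import re

def candidate_name_runs(text, known_words, min_run=2, max_run=3):
    """Return runs of consecutive tokens that are not in known_words.

    Only runs whose length is between min_run and max_run tokens are
    kept; these are candidates for people's names, not confirmed names.
    """
    known = {word.lower() for word in known_words}
    runs = []
    current = []  # the run of unknown tokens seen so far
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() in known:
            # A dictionary word ends the current run, if there is one.
            if min_run <= len(current) <= max_run:
                runs.append(" ".join(current))
            current = []
        else:
            current.append(token)
    # Do not forget a run that reaches the end of the text.
    if min_run <= len(current) <= max_run:
        runs.append(" ".join(current))
    return runs
```

With a tiny hand-made word list, the text "Yesterday John Lee wrote a
reply to the list" gives ['John Lee'].  Passing min_run=1 would also
report single unknown words such as 'Guido', at the cost of many more
false positives.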


John


