re.compile for names

Mon May 21 17:35:42 EDT 2007

On 21/05/2007 11:46 PM, brad wrote:
> I am developing a list of 3 character strings like this:
> 
> and
> bra
> cam
> dom
> emi
> mar
> smi
> ...
> 
> The goal of the list is to have enough strings to identify files that 
> may contain the names of people. Missing a name in a file is unacceptable.

The constraint that you have been given (no false negatives) is utterly 
unrealistic. Given that constraint, forget the 3-letter substring 
approach. There are many two-letter names. I have seen a genuine 
instance of a one-letter surname ("O"). In jurisdictions which don't 
disallow it, people can change their name to a string of digits. These 
days you can't even rely on names starting with a capital letter ("i 
think paris hilton is <adjective> do u 2").

> 
> For example, the string 'mar' would get marc, mark, mary, maria... 'smi' 
> would get smith, smiley, smit, etc. False positives are OK (getting 
> common words instead of people's names is OK).
> 
> I may end up with a thousand or so of these 3 character strings.

If you get a large file of names and take every possible 3-letter 
substring that you find, you would expect to get well over a thousand.
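
To get a feel for how quickly that count grows, here is a minimal sketch that collects every 3-letter substring from a list of names. The function names and the sample list are made up for illustration; they are not from brad's data. Note that one- and two-letter names contribute nothing at all, which is exactly where the 3-letter approach produces false negatives.

```python
def trigrams(name):
    """Return the set of 3-letter substrings of a lowercased name."""
    name = name.lower()
    # For names shorter than 3 characters the range is empty.
    return {name[i:i + 3] for i in range(len(name) - 2)}


def all_trigrams(names):
    """Union of the 3-letter substrings of every name in the list."""
    result = set()
    for name in names:
        result |= trigrams(name)
    return result


# Hypothetical sample; a real run would read a large file of names.
sample_names = ["Smith", "Mary", "Mark", "Jones", "O", "Li"]
grams = all_trigrams(sample_names)
print(len(grams), sorted(grams))
```

Running the same thing over a real name file is the quickest way to check how far past "a thousand or so" the list actually goes.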

> Is that 
> too much for an re.compile to handle?

Suck it and see. I'd guess that re.compile("mar|smi|jon|bro|wil|...") is 
*NOT* the way to go.
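
If you do want to suck it and see, something like the following sketch lets you compare the big alternation against a plain set lookup over 3-character windows. The set-based version is my own suggestion for comparison, not something from the original question, and the five strings are only the illustrative ones used above. I make no claim here about which one is faster; measure on your own data.

```python
import re

# Illustrative strings only; the real list would be far longer.
substrings = ["mar", "smi", "jon", "bro", "wil"]

# Option A: one large alternation, escaped in case a string
# contains a regex metacharacter.
pattern = re.compile("|".join(re.escape(s) for s in substrings),
                     re.IGNORECASE)


def matches_regex(text):
    """True if any of the substrings occurs anywhere in text."""
    return pattern.search(text) is not None


# Option B: slide a 3-character window over the text and test
# each window for membership in a set.
substring_set = set(substrings)


def matches_set(text):
    """True if any 3-character window of text is in the set."""
    text = text.lower()
    return any(text[i:i + 3] in substring_set
               for i in range(len(text) - 2))
```

Both functions should give the same answer for any input; the standard timeit module is one way to time them against a representative file.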

> Also, is this a bad way to 
> approach this problem?

Yes. At the very least I'd suggest that you need to break up your file 
into "words" and then consider whether each word is part of a "name". 
Much depends on context, if you want to cut down on false positives -- 
"we went 2 paris n staid at the hilton", "the bill from the smith was 
too high".
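
As a rough sketch of the "break it into words" step, assuming you have a set of known names to compare against (KNOWN_NAMES below is a hypothetical stand-in for a real name list):

```python
import re

# Hypothetical name list; a real one would be loaded from a file.
KNOWN_NAMES = {"paris", "hilton", "smith", "o"}


def candidate_names(text):
    """Return the words in text that appear in KNOWN_NAMES, in order."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w in KNOWN_NAMES]


print(candidate_names("we went 2 paris n staid at the hilton"))
print(candidate_names("the bill from the smith was too high"))
```

Both sentences yield hits even though neither mentions a person, which is the point: a word lookup handles whole words, including very short ones, but it cannot use context to weed out false positives.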

> Any ideas for improvement are welcome!

1. Get the PHB to come up with a more realistic constraint.
2. http://en.wikipedia.org/wiki/Named_entity_recognition

HTH,
John