re.compile for names
John Machin
sjmachin at lexicon.net
Mon May 21 17:35:42 EDT 2007
On 21/05/2007 11:46 PM, brad wrote:
> I am developing a list of 3 character strings like this:
>
> and
> bra
> cam
> dom
> emi
> mar
> smi
> ...
>
> The goal of the list is to have enough strings to identify files that
> may contain the names of people. Missing a name in a file is unacceptable.
The constraint that you have been given (no false negatives) is utterly
unrealistic. Given that constraint, forget the 3-letter substring
approach. There are many two-letter names. I have seen a genuine
instance of a one-letter surname ("O"). In jurisdictions which don't
disallow it, people can change their name to a string of digits. These
days you can't even rely on names starting with a capital letter ("i
think paris hilton is <adjective> do u 2").
>
> For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
> would get smith, smiley, smit, etc. False positives are OK (getting
> common words instead of people's names is OK).
>
> I may end up with a thousand or so of these 3 character strings.
If you get a large file of names and take every possible 3-letter
substring that you find, you would expect to get well over a thousand.
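As a rough illustration of that point, here is a minimal sketch that collects every distinct 3-letter substring from a list of names. The function name and the sample list are made up for the example; a real name file would be far larger. Note that names shorter than three letters contribute nothing at all, which is exactly where the false negatives come from.

```python
def three_letter_substrings(names):
    """Collect every distinct 3-letter substring found in the given names."""
    found = set()
    for name in names:
        lowered = name.lower()
        # Names shorter than 3 characters yield an empty range here.
        for i in range(len(lowered) - 2):
            found.add(lowered[i:i + 3])
    return found

# Hypothetical sample data, not taken from any real name file.
sample_names = ["Mark", "Maria", "Smith", "Li", "O"]
trigrams = three_letter_substrings(sample_names)
```

Running len(trigrams) over a real name file is a quick way to see how fast the list grows.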
> Is that
> too much for an re.compile to handle?
Suck it and see. I'd guess that re.compile("mar|smi|jon|bro|wil....") is
*NOT* the way to go.
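For anyone who does want to try it, a sketch along these lines would let you compare the single big alternation against a plain substring loop. The fragment list is just the handful from the original question, and both function names are invented for the example; which one is faster on a thousand fragments is something to measure, not assume.

```python
import re

# Hypothetical fragment list; the real one would hold a thousand or so entries.
fragments = ["and", "bra", "cam", "dom", "emi", "mar", "smi"]

# One big alternation, escaped in case a fragment contains regex metacharacters.
pattern = re.compile("|".join(re.escape(f) for f in fragments), re.IGNORECASE)

def matches_regex(text):
    """True if any fragment occurs anywhere in text, using the compiled alternation."""
    return pattern.search(text) is not None

def matches_plain(text):
    """Same check using ordinary substring tests, no regex involved."""
    lowered = text.lower()
    return any(f in lowered for f in fragments)
```

Timing both with the standard timeit module on representative files would settle the question for your data. Either way, a text such as "Li Wu" matches nothing, so the no-false-negatives constraint is still not met.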
> Also, is this a bad way to
> approach this problem?
Yes. At the very least I'd suggest that you need to break up your file
into "words" and then consider whether each word is part of a "name".
Much depends on context, if you want to cut down on false positives --
"we went 2 paris n staid at the hilton", "the bill from the smith was
too high".
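A minimal sketch of the word-splitting step, assuming a set of known names is available (the set and the function name below are hypothetical, and the tokenising pattern is only one of many reasonable choices):

```python
import re

# Hypothetical name set; a real one would come from a large file of names.
known_names = {"paris", "hilton", "smith", "o"}

def candidate_names(text):
    """Return the words in text that appear in the known-name set."""
    # Lower-case first, since capitalisation can't be relied on.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w in known_names]
```

Both example sentences above produce hits ("paris" and "hilton" in the first, "smith" in the second), although neither mentions a person. Word lookup alone cannot tell those apart; that takes context.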
> Any ideas for improvement are welcome!
1. Get the PHB to come up with a more realistic constraint.
2. http://en.wikipedia.org/wiki/Named_entity_recognition
HTH,
John