re.compile for names

Mon May 21 10:02:02 EDT 2007

On May 21, 8:46 am, brad <byte8b... at gmail.com> wrote:
> I am developing a list of 3 character strings like this:
>
> and
> bra
> cam
> dom
> emi
> mar
> smi
> ...
>
> The goal of the list is to have enough strings to identify files that
> may contain the names of people. Missing a name in a file is unacceptable.
>
> For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
> would get smith, smiley, smit, etc. False positives are OK (getting
> common words instead of people's names is OK).
>
> I may end up with a thousand or so of these 3 character strings. Is that
> too much for an re.compile to handle? Also, is this a bad way to
> approach this problem? Any ideas for improvement are welcome!
>
> I can provide more info off-list for those who would like.
>
> Thank you for your time,
> Brad

There are only 26**3 = 17,576 possible 3-letter strings, so your list
must cover only a small fraction of them for this filter to be of any
use.  With a list of a dozen or so strings, this may work okay for you.
But the more strings you add, the more false positives you will get,
and the harder it will be to make any sense of the results.  I suspect
that a thousand or so of these strings will end up matching 95+% of
all files.
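
A rough way to see why is the back-of-the-envelope model below.  It is
only a sketch: it assumes every 3-letter string is equally likely and
that each position in a file is independent, neither of which holds for
real text, and the function name and the figure of 100 positions are
made up for illustration.

```python
TOTAL_TRIGRAMS = 26 ** 3  # 17,576 possible 3-letter strings


def match_probability(num_patterns, num_positions):
    """Chance that at least one of num_positions trigrams is on the list.

    ASSUMPTION: all trigrams are equally likely and positions are
    independent.  Real text is not like this, so treat the result as a
    rough indication only.
    """
    # Probability that a single trigram position misses every pattern.
    miss_one = 1.0 - num_patterns / TOTAL_TRIGRAMS
    # Probability that at least one position hits some pattern.
    return 1.0 - miss_one ** num_positions


if __name__ == "__main__":
    # 100 positions is an arbitrary, fairly short file.
    for count in (12, 100, 1000):
        print(count, match_probability(count, 100))
```

Under these assumptions a dozen strings match only a small share of
100-position files, while a thousand strings come out well above 0.95.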

You will also get better results if you constrain the location of the
match, for instance, looking for file names that *start* with
someone's name, instead of just containing it somewhere.
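
Here is a minimal sketch of the difference, using the sample strings
from your post.  The helper names (build_patterns, classify) and the
example paths are hypothetical, and matching on the base file name is
my assumption about what you want to test.

```python
import os
import re

# Sample of the 3-character strings from the original post.
PREFIXES = ["and", "bra", "cam", "dom", "emi", "mar", "smi"]


def build_patterns(prefixes):
    """Return (anywhere, at_start) compiled patterns for the prefixes."""
    # re.escape keeps any non-letter characters in a prefix literal.
    alternation = "|".join(re.escape(p) for p in sorted(set(prefixes)))
    anywhere = re.compile("(?:%s)" % alternation, re.IGNORECASE)
    # \A anchors the match to the very start of the string.
    at_start = re.compile(r"\A(?:%s)" % alternation, re.IGNORECASE)
    return anywhere, at_start


def classify(path, anywhere, at_start):
    """Return (matches_anywhere, matches_at_start) for the base file name."""
    name = os.path.basename(path)
    return bool(anywhere.search(name)), bool(at_start.search(name))


if __name__ == "__main__":
    anywhere, at_start = build_patterns(PREFIXES)
    for path in ("docs/mary_notes.txt", "reports/summary.txt"):
        print(path, classify(path, anywhere, at_start))
```

The unanchored pattern flags "summary.txt" because it contains "mar",
while the anchored one only flags names that begin with a listed
string.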

-- Paul