Finding People's Names in Files

Thu Oct 11 16:05:00 EDT 2007

On 10/11/07, brad <byte8bits at gmail.com> wrote:
> Chris Mellon wrote:
>
> > In case you're doing this for PCI validation, be aware that just the
> > CC number is considered sensitive and you'd get some false negatives
> > if you filter on anything except that.
> >
> > Random strings that match CC checksums are really quite rare and false
> > positives from that alone are unlikely to be a problem. Unless I
> > deployed this and there was a significant false positive rate I
> > wouldn't risk the false negatives, personally.
>
> Yes, it is for PCI. Our rate of false positives is low, very low. I
> wasn't aware that a number alone was a PCI violation. Thank you! On
> another note, we're a university (Virginia Tech) and we're subject to
> FERPA, HIPAA, GLBA, etc... in addition to PCI. So we do these checks for
> U.S. Social Security Numbers too in an effort to prevent or lessen the
> chance of ID theft. Unfortunately, there is no Luhn check for SSNs. We
> follow the Social Security Administration verification guideline
> religiously... here's a web front-end to my logic:
>
> http://black.cirt.vt.edu/public/valid_ssn/index.html
>
> but still have many false positives on SSNs, so being able to id *names
> and numbers* in files would still be a benefit to us.
>
> Brad

Defining the problem as "given a word, figure out if that word is
likely to be a name", it seems the simplest solution is to get a
corpus of names and then flag words based on their edit distance from
entries in the name list. Maybe Soundex? You're going to need a
*massive* corpus though, and that might be a problem if you distribute
this for people to run instead of doing it centrally.
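
A rough sketch of the edit-distance idea, in case it helps. The name
list here is a tiny hypothetical stand-in for a real corpus, and
edit_distance, looks_like_name and the max_distance threshold are names
and values I made up for illustration, not anything tuned:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


# Hypothetical stand-in for a real (much larger) corpus of names.
KNOWN_NAMES = {"brad", "chris", "mellon", "smith", "johnson"}


def looks_like_name(word, names=KNOWN_NAMES, max_distance=1):
    """True if word is within max_distance edits of any known name."""
    w = word.lower()
    return any(edit_distance(w, n) <= max_distance for n in names)
```

With this, looks_like_name("Chriss") is true and
looks_like_name("invoice") is false. Note that a linear scan like this
compares every word against every name, so against a massive corpus you
would probably want some kind of index rather than this loop, and a
loose threshold will flag ordinary words that happen to sit near a name.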

As a totally off-the-wall speculation, you might be able to train a
neural net against a large enough corpus (say, your student and
faculty member databases) and end up with something that can match a
name algorithmically without needing the table. This is a really hard
problem - maybe you can get your CompSci department to make it part of
someone's thesis ;)

Once you've got a way to tell if a word might be a name, and a way to
tell if another word is likely to be a SSN, you just need to match up
hits within the same document, use some sort of distance filter, and
then you'll be "done".
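
Something like the following is what I mean by matching up hits with a
distance filter. Everything in it is an assumption for the sake of the
sketch: the SSN regex only checks the NNN-NN-NNNN shape (it is not your
SSA validation logic), the name test is a plain lookup in a hypothetical
set, and max_gap is an arbitrary number of characters:

```python
import re

# Shape-only pattern; a stand-in for real SSN validation.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
WORD_RE = re.compile(r"[A-Za-z]+")

# Hypothetical stand-in for whatever name detector you end up with.
KNOWN_NAMES = {"brad", "chris", "mellon"}


def name_ssn_pairs(text, max_gap=40):
    """Return (name, ssn) pairs whose start offsets are within
    max_gap characters of each other in text."""
    names = [(m.start(), m.group())
             for m in WORD_RE.finditer(text)
             if m.group().lower() in KNOWN_NAMES]
    pairs = []
    for ssn in SSN_RE.finditer(text):
        for pos, name in names:
            if abs(pos - ssn.start()) <= max_gap:
                pairs.append((name, ssn.group()))
    return pairs
```

Run on "Student: Chris Mellon, SSN 123-45-6789." it pairs both "Chris"
and "Mellon" with the number; a name a few hundred characters away from
any SSN is ignored. Measuring the gap in characters rather than words
or lines is just the easiest thing to write down, not a recommendation.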

I assume this is intended primarily to catch files that people are
storing accidentally, rather than catching intentional identity theft
in action. It'd be trivial to hide from this sort of scan if you
were actively malicious.