difflib qualm
Gabriel Genellina
gagsl-py at yahoo.com.ar
Thu Jan 25 20:33:33 EST 2007
At Thursday 25/1/2007 21:49, Larry Bates wrote:
>Gabriel Genellina wrote:
> > At Wednesday 24/1/2007 23:05, Sick Monkey wrote:
> >
> >> I am trying to write a python script that will compare 2 files which
> >> contain names (millions of them).
> >>
> >> More specifically, I have 2 files (Files1.txt and Files2.txt).
> >> Files1.txt contains 180 thousand names and Files2.txt contains 34
> >> million names.
>
>Put the big list of names in a database and create soundex keys for the names
>and make the soundex keys an index so you can search quickly. Databases
>are really good at storing data that is searchable via an index. If you
>REALLY need speed you can consider an in-memory database.
>
>Create soundex keys for each name in your small list and use each key to
>query the table in the DB that is indexed on soundex keys.
>If you get a hit, the key is sufficiently "alike" to be a candidate. I'll
>leave the remainder to you. Perhaps there is other information that will
>help determine if there is a match?
Soundex was designed around English pronunciation, so it is only good for
English words and almost useless for non-English names. It must be used
with caution, if used at all.
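
A minimal sketch of the lookup described in the quoted message, with the
caveat above made visible. It assumes Python 3, the standard sqlite3 module
with an in-memory database, and a hand-written American Soundex. The helper
names (soundex, build_index, candidates) and the sample names are made up
for illustration and are not from the thread.

```python
import sqlite3
import unicodedata

# Letter -> digit table for American Soundex (illustrative implementation).
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name):
    """Return the 4-character Soundex key for name ('' if it has no letters)."""
    # Decompose accented letters (e.g. 'é' -> 'e' + accent), keep only A-Z.
    text = unicodedata.normalize("NFKD", name).upper()
    letters = [c for c in text if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def build_index(names):
    """Load names into an in-memory table indexed on their Soundex key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (name TEXT, sx TEXT)")
    conn.executemany(
        "INSERT INTO names VALUES (?, ?)", ((n, soundex(n)) for n in names)
    )
    conn.execute("CREATE INDEX idx_sx ON names (sx)")
    return conn


def candidates(conn, name):
    """Return stored names whose Soundex key equals that of name."""
    rows = conn.execute(
        "SELECT name FROM names WHERE sx = ? ORDER BY name", (soundex(name),)
    )
    return [row[0] for row in rows]


# Stand-in for the big file; the real one would be loaded line by line.
conn = build_index(["Robert", "Rupert", "Rubin", "Jimenez"])

print(candidates(conn, "Robert"))   # English names: Rupert shares the key R163
print(candidates(conn, "Gimenez"))  # Spanish variant of Jimenez: no candidates
```

The second query comes back empty because Soundex always keeps the first
letter: Jimenez gets J552 and Gimenez gets G552, although both spellings
name the same Spanish surname. Any hit is still only a candidate and needs a
closer comparison afterwards.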
--
Gabriel Genellina
Softlab SRL