fastest way for humongous regexp search?

Tue Nov 2 11:50:59 EST 2004

[Tim]
> I've got a list of 1000 common misspellings, and I'd like to check a set
> of text files for those misspellings.

[Istvan]
> A much simpler way would be to just store these misspellings as a
> dictionary (or set), read and split each line into words, then check
> whether each of the words is in the set.

[Tim]
> Thanks, I didn't know that would be faster.
> But I need to match against the misspellings in a case-insensitive
> way--that's the reason I'm using the regular expressions.

Make the misspelling set lower case, and convert the list of words from
the text file into lower case before comparing them:

>>> from sets import Set
>>> misspellings = Set(['speling', 'misteak'])
>>> text = "Does this text contain any common speling mistakes?"
>>> print [word for word in text.split() if word.lower() in misspellings]
['speling']
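
One thing to watch with a plain split() is that punctuation stays
attached to the words, so "speling," or "misteak." would slip through.
Below is a minimal sketch of one way around that, written for current
Python 3 (built-in sets rather than the sets module). The names
MISSPELLINGS, WORD_RE and find_misspellings are made up for
illustration, and treating a word as a run of letters and apostrophes
is my assumption, not something from the original question:

```python
import re

# Hypothetical sample data; a real list would hold the ~1000 misspellings.
# Lower-casing the set once makes the later lookups case-insensitive.
MISSPELLINGS = frozenset(w.lower() for w in ["speling", "misteak"])

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

def find_misspellings(text, misspellings=MISSPELLINGS):
    """Return the words in text whose lower-cased form is a known misspelling."""
    return [w for w in WORD_RE.findall(text) if w.lower() in misspellings]

print(find_misspellings("A Speling misteak, or two."))
# prints ['Speling', 'misteak']
```

Each word costs one set lookup however long the misspelling list is,
and the regular expression is only used to pull words out of the text,
not to encode the list itself.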

-- 
Richie Hindle
richie at entrian.com