Case-insensitive string comparison

Tue Apr 15 10:56:20 EDT 2003

"Avner Ben" <avner at skilldesign.com> wrote in
news:3e9c1413$1 at news.012.net.il: 

> I am searching for one or more members of a list of names (that
> may get 
> near 1000 pieces in a normal application and may average some 40
> characters) in a string (that may average 100 characters). For each
> name found in the string, I do something to the string, using the
> original name (in its proper case). At present, I am looking for the
> exact name as stated, with agreeable performance. However, this makes
> me miss many cases where the author of the string did not care for the
> casing (which happens frequently). I have little problem with either
> lowering or uppering the entire list once, because it is not expected
> to change often and when that happens once in a while, a subscription
> mechanism can take care of that. However, this requires holding two
> lists (I still need the original names as are) and lowering or
> uppering each string for comparison (and keep its original version for
> restructuring).

I take it from your description that the names are not in any way
delimited by spaces or punctuation in the strings. If they were, then
splitting the strings on the delimiters might lead to a nice solution. 

My initial gut feel would be to approach this by building a dictionary
mapping the lowercase version of each name to its original casing. You
only talk of 1000 names of about 40 characters, so keeping two copies
isn't going to be a concern. 

Then you have to find the best way to locate which of the lowercase
names occur in the lowercase version of the string. Three ways to try
would be to iterate over all the names using str.find to check whether
it occurs in the string, or to try concatenating all the names into one
humungous regular expression, or to try something based on your C++
method and lookup substrings in a dictionary. 

The last of these would be the most complex, but might be pretty fast.
What I'm thinking is that you could split your 100 character string into
each 5 character substring in turn, and look them up in a dictionary of
names keyed by the first 5 characters. That way you do perhaps 95
dictionary lookups instead of 1000 string find operations. Of course you
would have to handle very short names as a special case (i.e. find or
regex).

-- 
Duncan Booth                                            
duncan at rcp.co.uk int month(char
*p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3" 
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?