re.I slowness

Paul McGuire ptmcg at austin.rr._bogus_.com
Thu Mar 30 09:21:08 EST 2006


<vvikram at gmail.com> wrote in message
news:1143719899.018571.41330 at u72g2000cwu.googlegroups.com...
> We process a lot of messages in a file based on some regex pattern(s)
> we have in a db.
> If I compile the regex using re.I, the processing time is substantially
> longer than if I don't, i.e., using re.I is slow.
>
> However, more surprisingly, if we do something along the lines of:
>
> import re
> import string
>
> s = <regex string>
> s = s.lower()
> t = dict([(k, '[%s%s]' % (k, k.upper()))
>           for k in string.ascii_lowercase])
> for k in t:
>     s = s.replace(k, t[k])
> re.compile(s)
> ......
>
> it's much better than plain re.I.
>
> So the questions are:
> a) Why is re.I so slow in general?
> b) What is the underlying implementation, and what, if anything, is
> wrong with the above method? Why isn't it used instead?
>
> Thanks
> Vikram
>
Can't tell you why re.I is slow, but perhaps this expression will make your
RE transform a little plainer (no need to create that dictionary of uppers
and lowers).

s = <regex string>

# wrap each letter in a [xX] character class; leave everything else alone
makeReAlphaCharLowerOrUpper = lambda c: (
    c.isalpha() and "[%s%s]" % (c.lower(), c.upper()) or c)
s_optimized = "".join(makeReAlphaCharLowerOrUpper(k) for k in s)

or

s_optimized = "".join(map(makeReAlphaCharLowerOrUpper, s))
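For instance, on a made-up pattern (the pattern, test data, and rough
timing loop below are just my own sketch, not Vikram's actual regexes
or numbers):

import re
import time

makeReAlphaCharLowerOrUpper = lambda c: (
    c.isalpha() and "[%s%s]" % (c.lower(), c.upper()) or c)

s = "foo[0-9]+bar"
s_optimized = "".join(map(makeReAlphaCharLowerOrUpper, s))
print s_optimized                # -> [fF][oO][oO][0-9]+[bB][aA][rR]

data = "xyz FOO123BAR xyz" * 1000
r_flag = re.compile(s, re.I)          # case folding via the flag
r_expanded = re.compile(s_optimized)  # case folding baked into the RE

# crude timing comparison - enough to see a difference, not a benchmark
for name, r in (("re.I", r_flag), ("expanded", r_expanded)):
    t0 = time.clock()
    for i in xrange(1000):
        r.search(data)
    print name, time.clock() - t0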


Just curious, but what happens if your RE contains something like this
spelling-check error finder:
"[^c]ei"
(looking for violations of "i before e except after c")

Can []'s nest in an RE?
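A quick check suggests not - inside a character class, '[' is just a
literal in a Python RE, so the per-character transform quietly changes
the meaning (the test string "weird" is just my example):

import re

# from above
makeReAlphaCharLowerOrUpper = lambda c: (
    c.isalpha() and "[%s%s]" % (c.lower(), c.upper()) or c)

s = "[^c]ei"
t = "".join(map(makeReAlphaCharLowerOrUpper, s))
print t                               # -> [^[cC]][eE][iI]
# the class now excludes '[', 'c', and 'C', and a literal ']' must
# follow it - not at all the same RE as the original
print re.search(t, "weird")               # None: the transform broke it
print re.search(s, "weird", re.I).group() # -> 'wei'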

-- Paul
