Regex driving me crazy...

Wed Apr 7 23:26:02 EDT 2010

On Apr 7, 9:51 pm, Steven D'Aprano
<ste... at REMOVE.THIS.cybersource.com.au> wrote:

> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1  Short offline       Completed without error       00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ...     "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
> ...     "from __main__ import s")
> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple one-liner
rather than the more optimized version, which precompiles the pattern:

>>> from timeit import Timer
>>> s = '# 1  Short offline       Completed without error       00%'
>>> tre = Timer("splitter(s)",
...     "import re; from __main__ import s; "
...     "splitter = re.compile(' {2,}').split")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
... "from __main__ import s")
>>> min(tre.repeat(repeat=5))
1.893190860748291
>>> min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform
as well as split, but the delta is only a few percent (about 5% in
this run):

>>> s *= 10000
>>> min(tre.repeat(repeat=5, number=1000))
15.331652164459229
>>> min(tsplit.repeat(repeat=5, number=1000))
14.596404075622559
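
For anyone who wants to rerun the comparison outside the interactive
prompt, here is a minimal sketch as a standalone script. The helper names
(split_re, split_str, best_time) are made up for illustration, and the
sample line uses even-length runs of spaces, which is an assumption: only
then do the two approaches return the same fields. The numbers will differ
by machine and Python version.

```python
import re
from timeit import Timer

# Sample line. Assumption: every run of spaces has an even length, so that
# both splitting approaches produce identical fields.
s = '# 1  Short offline      Completed without error      00%'

# Precompile once and keep the bound split method, as in the session above.
splitter = re.compile(' {2,}').split


def split_re(text):
    """Split on runs of two or more spaces using the precompiled regex."""
    return splitter(text)


def split_str(text):
    """Split on exactly two spaces, then drop the empty leftovers."""
    return [x for x in text.split('  ') if x.strip()]


def best_time(func, text, repeat=3, number=1000):
    """Return the best total time of `repeat` runs of `number` calls each."""
    timer = Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number))


print("re   :", best_time(split_re, s))
print("split:", best_time(split_str, s))
```

One caveat worth keeping in mind: with an odd-length run of spaces,
split_str leaves a stray leading space on the following field, while
split_re does not, so the two are not interchangeable on arbitrary input.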

Regards,
Pat