How to make regexes faster? (Python v. OmniMark)

Van Gale cgale1 at cox.net
Fri Apr 19 03:41:19 EDT 2002


"Frederick H. Bartlett" <fbartletFIXIT at optonline.net> wrote in message
news:3CBF71D4.5668CF1A at optonline.net...

> So I did it in Python, too. But the best time I could get from Python
> was .57 sec, while OmniMark came in at .20 sec. What's the most
> efficient technique for Pythonesque regex-based text processing?
>
> My best time came from using a single rather large regex and findall; I
> also tried smaller regexes and scan and match.

Internally, OmniMark builds one big state machine for all the find rules,
which is pretty much what you did with the single large regex. That would
have been my only recommendation; beyond that, I don't know of any other
performance tricks for Python regexes.
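For what it's worth, the usual advice is to compile the combined pattern once
with re.compile() and let findall make a single pass over the data, rather
than running several small regexes over the same text.  A minimal sketch
(the pattern and file name here are just made up for illustration):

    import re

    # Compile the combined pattern once, outside any loop; one big
    # alternation means one pass over the data instead of one pass
    # per small regex.
    pattern = re.compile(r'<title>|<para>|&[a-z]+;')

    data = open('input.sgml').read()   # hypothetical input file
    hits = pattern.findall(data)       # single scan, all matches

It sounds like that is essentially what you already did, so I doubt it will
close the gap with OmniMark.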

> Has anyone else compared OmniMark and Python?

The OmniMark folks have had years and years to optimize their engine, and it's
the only thing they do.  I've never seen anything faster, and that makes a
HUGE difference when you're processing gigabytes of data.  However, the
language is a travesty.  Well, I should qualify that... it's still more
readable and writable than Perl, and I mostly used it back in the days before
they had those new-fangled "function" thingies.

I'd recommend OmniMark if the performance really will make a difference or
if you need 100% standard SGML or XML processing. Otherwise, it'll be easier
to find Python programmers, and the code will be maintainable :)

Van





