How to make regexes faster? (Python v. OmniMark)

Fri Apr 19 13:10:33 EDT 2002

Quoth claird at starbase.neosoft.com (Cameron Laird):
...
| Next, I'd determine whether my test examples are indeed
| regex-bound (it might well be I/O which constrains your
| performance).  After that ... well, part of the charm of
| regex-s for some people is that they're so flexible that
| different techniques are superior in different circumstances.

Indeed, it could be partly I/O.
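
One way to check is to time the line-reading loop by itself and then
again with the regex search added. The following is only a minimal
sketch, assuming a modern Python 3 interpreter (not what was current
when this was posted); time_passes and the in-memory io.StringIO input
are hypothetical stand-ins for the real program and its data.

```python
import io
import re
import time


def time_passes(text, pattern):
    """Time a read-only pass and a read-plus-regex pass over the same text.

    Returns (read_only_seconds, read_and_match_seconds, hits).
    """
    regex = re.compile(pattern)

    # Pass 1: fetch every line and do nothing with it.
    start = time.perf_counter()
    for line in io.StringIO(text):
        pass
    read_only = time.perf_counter() - start

    # Pass 2: fetch every line and run the regex on it.
    hits = 0
    start = time.perf_counter()
    for line in io.StringIO(text):
        if regex.search(line):
            hits += 1
    read_and_match = time.perf_counter() - start

    return read_only, read_and_match, hits
```

If the two timings come out close together, the regex is probably not
the bottleneck. Against a real file or pipe the read-only share would
presumably be larger than with an in-memory buffer.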

I recently went to a meeting and heard someone mention that he had
written a program in Python, his first, but was thinking of rewriting
it in Perl because he had determined that Perl was 10 times faster
at I/O and regular expressions.  At a site that employs hundreds of
at least occasional programmers, he is maybe the fourth person I've
seen show this much interest in Python, so I was kind of chagrined to
hear this announcement and went back to check it out.  I was even
more chagrined to find that it was not an unreasonable claim.

Part of the problem is that when you write something like a "grep"
in Python and in Perl, the Perl program will naturally be written
like while ($line = <STDIN>) {...}, and the Python program will
naturally be written like while 1: line = sys.stdin.readline() ...
That pits a Python method call per line against what must be an inline
operation in Perl.  As I recall, I concluded that "I/O", in this
practical sense of getting a line of data, might have been about half
the problem.
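
For concreteness, the "natural" Python grep loop described above looks
roughly like the sketch below. The name grep_readline is made up for
illustration, and it collects matches in a list instead of printing
them so that it is easy to try; the point is only that every line costs
one readline() call plus a test for end of file.

```python
import io
import re


def grep_readline(stream, pattern):
    """Return the lines of stream that match pattern, one readline() per line."""
    regex = re.compile(pattern)
    matches = []
    while 1:
        line = stream.readline()   # one method call for every input line
        if not line:               # empty string means end of file
            break
        if regex.search(line):
            matches.append(line)
    return matches


# Example use on an in-memory stream; sys.stdin would work the same way.
sample = io.StringIO("alpha 1\nbeta\nalpha 2\n")
found = grep_readline(sample, r"alpha \d")
```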

The xreadlines function available in later versions of Python did
reduce the disparity a little.  This optimization might help a lot
in the present case, if there's a lot of line-by-line I/O and if
Python is 2.1 or later.
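
A hedged sketch of what that change looks like: the loop is handed
lines by an iterator rather than calling readline() itself. The helper
lazy_lines is hypothetical. It uses the stream's xreadlines() method
where one exists (file objects in Python 2.1 and later 2.x releases)
and otherwise iterates over the stream directly, which is the usual
spelling in later Pythons, where xreadlines no longer exists.

```python
import io
import re


def lazy_lines(stream):
    """Return an iterator over the lines of stream without loading them all."""
    # Assumption: old file objects expose xreadlines(); newer ones do not,
    # but can be iterated directly.
    xreadlines = getattr(stream, "xreadlines", None)
    if xreadlines is not None:
        return xreadlines()
    return iter(stream)


def grep(stream, pattern):
    """Return the lines of stream that match pattern."""
    regex = re.compile(pattern)
    return [line for line in lazy_lines(stream) if regex.search(line)]


# Example use on an in-memory stream; sys.stdin would work the same way.
found = grep(io.StringIO("alpha 1\nbeta\nalpha 2\n"), r"alpha \d")
```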

	Donn Cave, donn at u.washington.edu