[issue24426] re.split performance degraded significantly by capturing group

Sat Jun 13 16:04:36 CEST 2015

Patrick Maupin added the comment:

> (stuff about CPython)

No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler).  Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for.
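
For anyone skimming the thread, here is a minimal sketch of the case in the issue title: the same separator pattern with and without a capturing group. The sample strings and the time_split helper are made up for illustration, and the timings will depend on the Python version and on whether the patch is applied.

```python
import re
import timeit

# Same separator, with and without a capturing group.
plain = re.compile(r"\s+")
captured = re.compile(r"(\s+)")

sample = "a b  c"
without_group = plain.split(sample)    # separators are dropped
with_group = captured.split(sample)    # separators are kept in the result

# Hypothetical helper: time repeated splits of a larger input.
def time_split(pattern, text, number=20):
    return timeit.timeit(lambda: pattern.split(text), number=number)

if __name__ == "__main__":
    big = "word " * 100000
    print("no group:  ", time_split(plain, big))
    print("with group:", time_split(captured, big))
```

The capturing version returns the separators as well, which is why it is handy for tokenizing and why its speed matters to me.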

> ... regex is more powerful and better supports Unicode.

Unfortunately, regex is still not competitive on speed.  For example, for one package I maintain (github.com/pmaupin/pdfrw), I have one unit test which reads in and parses several PDF files, and then outputs them to new PDF files:

Python 2.7 with re -- 5.9 s
Python 2.7 with regex -- 6.9 s
Python 3.4 with re -- 7.2 s
Python 3.4 with regex -- 8.2 s

A large part of the difference between 2.7 and 3.4 is that I'm too lazy (or find it too error-prone) to put b'xxx' in front of every string, so the package uses the same source code for both versions, which means unicode strings on 3.4 and byte strings on 2.7.

Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown (measured on 3.4) to change to a "better" re package seems like going in the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python 3.
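
The percentages come straight from the timings listed above; a quick sketch of the arithmetic (the slowdown helper and variable names are just for illustration):

```python
# Wall-clock times in seconds, copied from the list above.
timings = {
    ("2.7", "re"): 5.9,
    ("2.7", "regex"): 6.9,
    ("3.4", "re"): 7.2,
    ("3.4", "regex"): 8.2,
}

def slowdown(base, new):
    """Percentage increase in run time going from base to new."""
    return (new - base) / base * 100

# Switching from re to regex on the same interpreter.
regex_on_34 = slowdown(timings[("3.4", "re")], timings[("3.4", "regex")])
regex_on_27 = slowdown(timings[("2.7", "re")], timings[("2.7", "regex")])

# Switching from Python 2.7 to 3.4 while staying on re.
py3_with_re = slowdown(timings[("2.7", "re")], timings[("3.4", "re")])

if __name__ == "__main__":
    print("regex on 3.4: %.1f%%" % regex_on_34)
    print("regex on 2.7: %.1f%%" % regex_on_27)
    print("3.4 vs 2.7 with re: %.1f%%" % py3_with_re)
```

The 14% figure is the 3.4 comparison; on 2.7 the same switch works out to roughly 17%.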

> Do you mean documenting codes of compiled re pattern?

Yes.

> This is implementation detail and will be changed in future.

Understood, and that's fine.  If the documentation existed, it would help if I wanted to create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready.

Thanks,
Pat

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue24426>
_______________________________________