[issue24426] re.split performance degraded significantly by capturing group

Patrick Maupin report at bugs.python.org
Wed Jun 10 22:16:59 CEST 2015


Patrick Maupin added the comment:

1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.
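[For readers following along: the suggested pattern splits on the newline but "captures" it inside a zero-width lookbehind, so it produces the same result as a plain capturing group. A small sketch:]

```python
import re

text = "line1\nline2\nline3"

# Plain capturing group: re.split() returns the captured delimiter
# between the surrounding pieces.
with_group = re.split(r'(\n)', text)

# The suggested workaround: match '\n', then re-capture it inside a
# zero-width lookbehind so the group itself consumes nothing.
with_lookbehind = re.split(r'\n(?<=(\n))', text)

print(with_group)       # ['line1', '\n', 'line2', '\n', 'line3']
print(with_lookbehind)  # same result
```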

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)', because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file.  Yes, I'm sure this could be optimized as well.  But when I can simply remove the parentheses for a 10X speedup, and then recover the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that re.split() has performance issues with capturing groups, so that casual users like me know to recover the matching strings some other way?
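[The length-based workaround described above can be sketched roughly as follows; the pattern and input here are simplified stand-ins for illustration, not the actual Cadence output or regex:]

```python
import re

# Simplified stand-ins for the real data and pattern described above.
text = "garbage1 DELIM garbage2 DELIM garbage3"
pat = re.compile(r' DELIM ')  # no capturing group: the fast path

pieces = pat.split(text)

# Recover the delimiter text that split() threw away by walking the
# original string, using the piece lengths as offsets.
delims = []
pos = 0
for piece in pieces[:-1]:
    pos += len(piece)
    m = pat.match(text, pos)   # the delimiter starts right after the piece
    delims.append(m.group(0))
    pos = m.end()

print(pieces)  # ['garbage1', 'garbage2', 'garbage3']
print(delims)  # [' DELIM ', ' DELIM ']
```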


I assumed that, since I saw almost exactly the same performance degradation with \n as I did with my real pattern, the plain \n was a valid testcase.  If that assumption was wrong and this is insufficient to debug the issue, I can submit a bigger testcase.


But if this issue is going to be wontfixed for some reason, a documentation note should certainly be added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.
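[A scaled-down sketch of that comparison; the exact ratio depends on the interpreter version, and the ~10X figure above was observed on the CPython releases current at the time:]

```python
import re
import timeit

# Roughly the shape of the data described above, scaled down:
# large pieces separated by single newlines (~5 MB total).
big = ("x" * 5000 + "\n") * 1000

t_plain = timeit.timeit(lambda: re.split(r'\n', big), number=5)
t_group = timeit.timeit(lambda: re.split(r'(\n)', big), number=5)

print(f"no group:  {t_plain:.3f}s")
print(f"capturing: {t_group:.3f}s")
```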


Thanks,
Pat

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue24426>
_______________________________________
