Regarding Regex timeout behavior to minimize CPU consumption

Barry barry at barrys-emacs.org
Sun Dec 6 17:36:25 EST 2020



> On 5 Dec 2020, at 23:44, Peter J. Holzer <hjp-python at hjp.at> wrote:
> 
> On 2020-12-05 23:42:11 +0100, sjeik_appie at hotmail.com wrote:
>>   Timeout: no idea. But check out re.compile and re.finditer as they might
>>   speed things up.
> 
> I doubt that compiling regular expressions helps the OP much. Compiled
> regular expressions are cached, but more importantly, if a match takes
> long enough that specifying a timeout is useful, the time is almost
> certainly not spent compiling, but matching - most likely backtracking
> from lots of promising but ultimately unsuccessful partial matches.
> 
>>     regex = (r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*'
>>              r'(?:class\s*=\s*"\s*sticky-book-now\s*"|</ul>\s*</section>|id\s*=\s*"Location")')
>>     rooms_blocks_to_be_replace = re.findall(regex, html_template)
> 
> This part:
> 
>    \s*([\s\S]*?)\s*'
> 
> looks dangerous from a performance point of view. If that can be
> rewritten with less potential for backtracking, it might help.
> 
> Generally, it should be possible to implement a timeout for any
> operation by either scheduling an alarm with signal.alarm or by
> executing the operation in a separate thread and killing the thread if
> it takes too long.

I think that Python only runs signal handlers once control returns to the
ceval loop. Since re.match blocks inside C code, that is not going to happen.

Killing threads is not safe: even if your OS allows it, you end up with
Python's internal state messed up.

I think implementing this requires the re code itself to support a timeout.
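
Short of that, one option not mentioned so far in this thread is to run the
match in a separate process rather than a thread, since the OS can kill a
process without touching the parent interpreter's state. A minimal sketch,
assuming that starting a child interpreter per call is affordable;
findall_with_timeout and _CHILD are names made up for illustration:

```python
import json
import subprocess
import sys

# Script run by the child interpreter: read the job from stdin as JSON,
# run re.findall, write the result to stdout as JSON.
_CHILD = r'''
import json, re, sys
job = json.load(sys.stdin)
json.dump(re.findall(job["pattern"], job["text"]), sys.stdout)
'''


def findall_with_timeout(pattern, text, timeout):
    """Hypothetical helper: re.findall in a child process.

    Returns the list of matches, or None if the child did not finish
    within `timeout` seconds. Patterns with several groups come back as
    lists rather than tuples because the result travels as JSON.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout,  # subprocess.run kills the child on expiry
            check=True,       # an invalid pattern raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(done.stdout)
```

The cost is a process start per call, so this only makes sense when a hard
limit matters more than speed.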

It would be better for the OP to fix the regex so that it does not backtrack
so much, or to work on the input string in chunks.
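
As a rough illustration of the first suggestion, the two lazy [\s\S]*? runs
can be replaced by plain str.find calls plus one search for the terminator,
so a page with no terminator fails after a single forward scan instead of
backtracking. This is a sketch under assumptions: extract_room_blocks is a
hypothetical helper, the marker strings are lifted from the OP's regex, and I
have not run it against the OP's HTML.

```python
import re

# Start marker and terminator alternatives, taken from the OP's regex.
START = 'data-stid="section-room-list"'
END = re.compile(
    r'class\s*=\s*"\s*sticky-book-now\s*"'
    r'|</ul>\s*</section>'
    r'|id\s*=\s*"Location"'
)


def extract_room_blocks(html):
    """Hypothetical replacement for the re.findall call in the OP's code."""
    blocks = []
    pos = 0
    while True:
        start = html.find(START, pos)
        if start == -1:
            break
        # First '>' after the marker, i.e. the end of the opening tag.
        gt = html.find('>', start + len(START))
        if gt == -1:
            break
        # Nearest terminator after the tag; one forward scan, no retries.
        end = END.search(html, gt + 1)
        if end is None:
            break
        # strip() stands in for the \s* on either side of the capture group.
        blocks.append(html[gt + 1:end.start()].strip())
        pos = end.end()
    return blocks
```

When a start marker has no terminator after it, the loop stops at once,
which is the case where the original pattern would keep retrying every later
'>' as the end of the opening tag.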

Barry

> 
>        hp
> 
> -- 
>   _  | Peter J. Holzer    | Story must make more sense than reality.
> |_|_) |                    |
> | |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |       challenge!"


