Regarding Regex timeout behavior to minimize CPU consumption

Peter J. Holzer hjp-python at hjp.at
Sat Dec 5 18:34:13 EST 2020


On 2020-12-05 23:42:11 +0100, sjeik_appie at hotmail.com wrote:
>    Timeout: no idea. But check out re.compile and re.iterfind as they might
>    speed things up.

I doubt that compiling regular expressions helps the OP much. The re
module already caches compiled patterns, but more importantly, if a
match takes long enough that specifying a timeout is useful, the time is
almost certainly not spent on compiling but on matching, most likely on
backtracking out of lots of promising but ultimately unsuccessful
partial matches.
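
To illustrate the difference, here is a minimal sketch that times both
steps separately. The pattern and the input are made up for the
demonstration (they are not the OP's), and the absolute numbers will
vary from machine to machine:

```python
import re
import time

PATTERN = r"(a+)+$"      # hypothetical pattern with nested quantifiers
TEXT = "a" * 20 + "b"    # almost matches, so partial matches get retried

re.purge()               # empty the module's pattern cache first

t0 = time.perf_counter()
compiled = re.compile(PATTERN)
compile_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
result = compiled.search(TEXT)   # fails, but only after heavy backtracking
match_seconds = time.perf_counter() - t0

print(f"compile: {compile_seconds:.6f} s, match: {match_seconds:.6f} s")
```

On the systems I would expect this to run on, the compile step should
be negligible next to the failed search.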

>      regex = r'data-stid="section-room-list"[\s\S]*?>\s*([\s\S]*?)\s*' \
>              r'(?:class\s*=\s*"\s*sticky-book-now\s*"|</ul>\s*</section>|id\s*=\s*"Location")'
>      rooms_blocks_to_be_replace = re.findall(regex, html_template)

This part:

    \s*([\s\S]*?)\s*

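looks dangerous from a performance point of view: the lazy [\s\S]*? sits
between two \s*, so a run of whitespace can be divided among the three
in many ways, and the engine tries them all whenever none of the
alternatives that follow matches. If that can be rewritten with less
potential for backtracking, it might help.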
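
One possible rewrite, as a sketch only: find the start marker and the
following ">" with plain string searches, look for the first terminator
with a small regex, and strip the whitespace afterwards instead of
matching it. The names START, END and room_blocks are mine, and I have
not proven that this is equivalent to the original pattern for every
input, only that it should agree on typical ones:

```python
import re

START = 'data-stid="section-room-list"'
# The three terminators from the original pattern, unchanged.
END = re.compile(
    r'class\s*=\s*"\s*sticky-book-now\s*"'
    r'|</ul>\s*</section>'
    r'|id\s*=\s*"Location"'
)

def room_blocks(html):
    """Return the stripped text between each start tag and its terminator."""
    blocks = []
    pos = 0
    while True:
        start = html.find(START, pos)
        if start == -1:
            break
        gt = html.find(">", start + len(START))   # end of the opening tag
        if gt == -1:
            break
        end = END.search(html, gt + 1)            # first terminator after it
        if end is None:
            break
        blocks.append(html[gt + 1:end.start()].strip())
        pos = end.end()                           # continue after the match
    return blocks
```

Each character is then scanned a small, bounded number of times, so
there is nothing left for the engine to backtrack over except the short
\s* runs inside the terminators.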

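Generally, it should be possible to implement a timeout for any
operation by either scheduling an alarm with signal.alarm (Unix only,
and the handler has to be installed from the main thread) or by
executing the operation in a separate thread or process and killing it
if it takes too long. Python offers no supported way to kill a thread,
so in practice the second option means a separate process.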
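
A minimal sketch of the second approach, with a child interpreter taking
the place of the thread, since a process can be killed reliably. The
helper name findall_with_timeout and the JSON hand-over are my own
choices for the illustration, not an existing API:

```python
import json
import subprocess
import sys

# Program run by the child: read {"pattern": ..., "text": ...} as JSON
# from stdin and write the re.findall result as JSON to stdout.
_CHILD = (
    "import json, re, sys\n"
    "job = json.load(sys.stdin)\n"
    "json.dump(re.findall(job['pattern'], job['text']), sys.stdout)\n"
)

def findall_with_timeout(pattern, text, timeout):
    """Run re.findall in a child process; return None if it times out."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=timeout,   # subprocess.run kills the child on expiry
            check=True,        # a bad pattern raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(done.stdout)
```

Usage would look like findall_with_timeout(regex, html_template, 5).
Because the result travels as JSON, patterns with several groups come
back as lists rather than tuples, and starting an interpreter per call
is not cheap, so this only pays off for matches that are expensive
anyway.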

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"