[Python-ideas] re.compile_lazy - on first use compiled regexes

Sat Mar 23 15:35:18 CET 2013

On 2013-03-23, at 14:34 , Antoine Pitrou wrote:

> On Sat, 23 Mar 2013 14:26:30 +0100
> Masklinn <masklinn at masklinn.net> wrote:
>> 
>> Wouldn't it be better if there are *few* different regexes? Since the
>> module itself caches 512 expressions (100 in Python 2) and does not use
>> an LRU or other "smart" cache (it just clears the whole cache dict once
>> the limit is breached as far as I can see), *and* any explicit call to
>> re.compile will *still* use the internal cache (meaning even going
>> through re.compile will count against the _MAXCACHE limit), all regex
>> uses throughout the application (including the standard library et al.)
>> will count against the built-in cache and increase the chance of the
>> regex we want cached being thrown out, no?
> 
> Well, it mostly sounds like the re cache should be made a bit smarter.

It should, but even then I think it makes sense to cache regexps
explicitly in the application: the re cache feels like an optimization
rather than guaranteed semantics.
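
As a rough illustration of application-level caching (LazyPattern is a
made-up name for this sketch, not something in re), a wrapper could
compile on first use and then hold its own reference to the compiled
pattern, so whatever the module-level cache does afterwards no longer
matters:

```python
import re


class LazyPattern:
    """Hypothetical helper: compile a regex on first use and keep it."""

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None  # nothing is compiled at import time

    def _get(self):
        # Compile once; afterwards the object is owned by this instance,
        # independently of re's internal cache.
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return self._compiled

    def __getattr__(self, name):
        # Only reached for attributes not set in __init__, i.e. the
        # pattern methods (match, search, findall, ...).
        return getattr(self._get(), name)


WORD = LazyPattern(r"\w+")       # cheap: no compilation yet
words = WORD.findall("a bc")     # first use triggers the compile
```

The cost is an extra attribute lookup per call, which is the trade-off
for not compiling at module load.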

Either that, or the re module should provide an instantiable cache object
for lazy compilation and caching of regexps, e.g.
re.local_cache(maxsize=None), which would return an LRU-caching proxy to
re. The caching of a module's regexps would then be under the control of
the module using them if desired (and important), and urllib.parse could
fix its existing "ugly" pattern by using

    import re
    re = re.local_cache()

and removing the conditional compile calls (or even dropping the compile
calls altogether and using the re-level functions).

Optionally, the cache could take e.g. a *args of regexps to precompile
at module load/cache creation.
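
To make the idea a bit more concrete, here is a minimal sketch of what
such an object might look like. re.local_cache does not exist; the
LocalCache class, its constructor signature and the subset of functions
it proxies are all assumptions made for the example, built on
functools.lru_cache:

```python
import functools
import re


class LocalCache:
    """Hypothetical stand-in for the proposed re.local_cache()."""

    def __init__(self, *patterns, maxsize=None):
        # Per-instance LRU cache around re.compile; maxsize=None means
        # unbounded, matching the default suggested above.
        self._cached = functools.lru_cache(maxsize=maxsize)(re.compile)
        # Optional precompilation at cache creation.
        for pattern in patterns:
            self.compile(pattern)

    def compile(self, pattern, flags=0):
        # Always pass flags positionally so every caller shares one
        # cache key per (pattern, flags) pair.
        return self._cached(pattern, flags)

    # A few re-level functions, routed through the local cache.
    def match(self, pattern, string, flags=0):
        return self.compile(pattern, flags).match(string)

    def search(self, pattern, string, flags=0):
        return self.compile(pattern, flags).search(string)

    def sub(self, pattern, repl, string, count=0, flags=0):
        return self.compile(pattern, flags).sub(repl, string, count=count)


cache = LocalCache(r"\d+", maxsize=8)   # r"\d+" is compiled here
number = cache.match(r"\d+", "42")      # served from the local cache
```

The first compilation still passes through re's own cache, but the local
cache keeps its own reference to the compiled pattern, so the pattern
stays available to the owning module even if the global cache is cleared.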