[Tutor] Entity to UTF-8 [low level C details on regexmodule.c]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu May 8 15:17:02 2003


On Wed, 30 Apr 2003, Paul Tremblay wrote:

> You probably know this already, but I thought I'd offer it
> anyway.
>
> Your code has the lines:
>
> patt = '&#([^;]+);'
>
> ustr = re.sub(patt, ToUTF8, ustr)
>
> I believe this is inefficient, because Python has to compile the regular
> expression each time.  This code should be more efficient:
>
> patt = re.compile(r'&#([^;]+);')



Hi Paul,


Actually, there's a very low level implementation detail that, in the
common case, improves our situation here.  The last time I checked,
Python's regular expression engine does cache the last few regular
expressions that we use via the functions sub(), match(), and search().
So an explicit re.compile() may not be strictly necessary in this program.
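
To make the comparison concrete, here is a minimal sketch of both styles
side by side.  The to_utf8() callback is a hypothetical stand-in for the
ToUTF8 function in the original program, which isn't shown in this thread;
I'm assuming the entities are decimal character references.  It is written
for Python 3, where chr() accepts any code point (under Python 2 you would
reach for unichr() instead).

```python
import re

def to_utf8(match):
    # Hypothetical stand-in for ToUTF8: turn the decimal number inside
    # "&#...;" into the corresponding character.
    return chr(int(match.group(1)))

text = "caf&#233; cr&#232;me"

# Style 1: pass the pattern as a string.  The re module looks the pattern
# up in its internal cache (compiling it on a miss) on each call.
result_plain = re.sub(r'&#([^;]+);', to_utf8, text)

# Style 2: compile once explicitly and reuse the pattern object.
patt = re.compile(r'&#([^;]+);')
result_compiled = patt.sub(to_utf8, text)
```

Both calls produce the same string; the only difference is who is
responsible for holding on to the compiled pattern.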


Python's current regular expression engine, 're', uses the internal module
'sre' by default, and there's a section of 'sre' that defines a cache of
regular expressions:


### sre.py
_cache = {}
_cache_repl = {}
                           # some code cut
_MAXCACHE = 100
###


So up to 100 compiled regular expressions are saved internally in the 're'
module at any one time (as far as I can tell from the source, the cache is
simply emptied once it fills up, and then starts over).  When we reuse a
regular expression that is still in the cache, Python can pick out the
compiled version instead of recompiling it.  This caching behavior is not
something that we should really depend on, but it's good to know that it's
there.





[C code ahead]

For the curious C programmers among us: in Python 1.5.2, this sort of
caching was much more limited.  The old regex engine only cached the very
last regular expression!  We can see this in the update_cache() function
in Modules/regexmodule.c:


/******/
static PyObject *cache_pat;
static PyObject *cache_prog;

static int
update_cache(PyObject *pat)
{
        PyObject *tuple = Py_BuildValue("(O)", pat);
        int status = 0;

        if (!tuple)
                return -1;

        if (pat != cache_pat) {
                Py_XDECREF(cache_pat);
                cache_pat = NULL;
                Py_XDECREF(cache_prog);
                cache_prog = regex_compile((PyObject *)NULL, tuple);
                if (cache_prog == NULL) {
                        status = -1;
                        goto finally;
                }
                cache_pat = pat;
                Py_INCREF(cache_pat);
        }
  finally:
        Py_DECREF(tuple);
        return status;
}
/******/


Notice the static variables here, which keep state between calls.  The
idea of update_cache() is this: on every call to a regular expression
matching function, Python uses update_cache() to check whether it can
reuse the work it did on the very last regex call.  If the pattern we pass
in is the very same object as the last one we used (the C code compares
pointers, not string contents), we reuse that compiled regex object
without recompiling the expression.


Sorry about diving into C code like this!  It's just that I thought that
this optimization detail was cute: it covers the common case when we're
only dealing with a single regular expression repeatedly in a loop.


But even so, it apparently made more sense in later versions of Python to
yank the cache out of the C code entirely, and to maintain it externally
in the 'sre' Python module.





Good luck to you!