[Tutor] re.findall() weirdness. [looks like a bug!]

Tue, 26 Jun 2001 19:13:23 -0700 (PDT)

I've been looking at the source code a bit more.  It's definitely a bug,
and I'm really happy that you found it!  *grin*  Wow.

[note --- I'm in a very excited "hackish" mode right now, so this will
probably not make sense.  *grin* If you don't know C, you probably
don't want to read this message.]

In any case, there's an undocumented reason why we were getting those
results.  Apparently, the findall() of a compiled pattern internally takes
3 arguments: the string we're searching through, and the 'begin' and 'end'
positions within that string.  Normally, Python sets 'begin' and 'end' to
0 and INT_MAX (the largest value a C int can hold), respectively, but with
the bug in findall(), something else ends up in the 'begin' slot.
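
As a side note for readers who would rather poke at this from Python than
from C: compiled pattern objects accept the same two optional positions.
The sketch below assumes a reasonably recent Python, where these arguments
are documented as pos and endpos.  The sample string and variable names
are made up for illustration and are not from the original report.

```python
import re

# Illustrative data, not from the original bug report.
text = "<a> <b> <c>"
tag = re.compile(r"<.*?>")

# No positions given: the whole string is searched.
all_tags = tag.findall(text)

# pos=1 skips the '<' that opens "<a>", so the first match is "<b>".
from_one = tag.findall(text, 1)

# endpos=7 cuts the search off before "<c>", which starts at index 8.
up_to_seven = tag.findall(text, 0, 7)

print(all_tags)
print(from_one)
print(up_to_seven)
```

The point is only that a small integer in the second slot quietly moves
the starting point of the search; nothing raises an error.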

If we look at the C source code for regular expressions, we can see this
undocumented behavior in Modules/_sre.c.  For those who are curious, I'll
snip the part that's relevant:

###
static PyObject*
pattern_findall(PatternObject* self, PyObject* args, PyObject* kw)
{
    SRE_STATE state;
    PyObject* list;
    int status;
    int i;

    PyObject* string;
    int start = 0;
    int end = INT_MAX;
    static char* kwlist[] = { "source", "pos", "endpos", NULL };
    if (!PyArg_ParseTupleAndKeywords(args, kw, "O|ii:findall", kwlist,
                                     &string, &start, &end))
        return NULL;

    [ ... code omitted]
###

So, internally at the C level, findall takes three parameters: "source",
"pos", and "endpos".  Here's my guess at what was happening in the buggy
findall when we made this call, with re.I as the third argument:

###
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', re.I)
###

It helps if we look at the value of 're.I' --- not only is re.I a great
place to get outdoor supplies, but it's also an integer:

###
>>> re.I
2
###
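
The flags are small integers meant to be OR-ed together as a bitmask,
which is why nothing complains when one lands in a slot that expects a
position.  Here is a quick check, written so that it should work whether
the flags are plain ints or an integer-like enum (as I believe they are in
current Pythons); the variable names are just for illustration.

```python
import re

# int() works whether the flags are plain ints or enum members.
ignorecase = int(re.I)
multiline = int(re.M)

# Flags combine with bitwise OR, so the result is still just an integer.
combined = int(re.I | re.M)

print(ignorecase, multiline, combined)
```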

What I think was happening: the buggy findall handed its third argument
along as 'pos', so the search for all instances of '<.*?>' began at
position 2 of our string instead of position 0.  We can confirm this by
experiment:

###
## Using the buggy sre.py
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 2)
['</2>', '<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 3)
['</2>', '<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 4)
['</2>', '<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 5)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 6)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 7)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 8)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 9)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 10)
['<3>', '</4>']
>>> re.findall('<.*?>','<1> </2> \n<3> </4>', 11)
['</4>']
###
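
For anyone without the buggy sre.py at hand, the table above can be
imitated on a fixed Python by passing the number as pos on purpose.  This
is only a sketch: buggy_findall is a name I made up, and it mimics the
symptom, not the actual faulty code in sre.py.

```python
import re

def buggy_findall(pattern, string, third_arg=0):
    """Imitate the symptom: treat the third argument as a start position."""
    return re.compile(pattern).findall(string, third_arg)

text = '<1> </2> \n<3> </4>'

# Walk the same start positions as the experiment above.
results = {pos: buggy_findall('<.*?>', text, pos) for pos in range(2, 12)}
for pos, found in results.items():
    print(pos, found)
```

The results change exactly where a start position moves past the '<' of
a tag (from 4 to 5, and from 10 to 11), which is what we'd expect if the
number is being used as 'pos'.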

Of course re.findall is never supposed to do this, but it's nice to know
WHY it was doing that...

Ok, I'm pooped out.  Wake me up in the morning.  *grin*