sre \Z bug or feature?

Tue Jan 2 16:35:49 EST 2001

[posted & mailed]

[Pearu Peterson]
> Here is a complete re.match command that I use in my application:
>
> m=re.match(r'\A(?ms)\s*@\s*(?P<name>\w+)\s*{\s*(?P<body>.*?)\s*}\s*\Z',item)
>
> and it should match an item in the .bib files (the BibTeX input file).
> This item is expected to be of the following form:
>
> @name{body}
>
> where `name' is a purely alphabetic string, body may contain arbitrary
> characters, including `{',`}',`\n'. Between the parts `@', `name', `{',
> `body', and `}' there may be any number of whitespace characters,
> including newlines. Note also that `item' contains exactly one such
> bibtex item block (that is correctly ensured by the other parts of my
> application).

That's an excellent explanation.  Thanks!

> So, do you think that the pattern I use is somehow invalid or
> inefficient?

It's at least confusing <wink>.  Note that \A doesn't accomplish anything,
because .match *always* anchors to the start of the string.  (?m) doesn't
accomplish anything either, because you have no instances of "^" or "$"
metacharacters.  They don't hurt, except that they're of no actual use so
it's impossible to guess whether you *hoped* they were of some use.  It's
like seeing stmts of the form

   i = i + 0

in regular code.  Is that what the author intended?  Can't tell, but it
looks fishy.  Also unclear why the obscure (because rarely used) \Z is
there -- it doesn't appear to accomplish anything here that the usual $
wouldn't.
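
A quick sketch of those three claims, for anyone who wants to check them.
The sample string is made up, the braces are escaped only for clarity, and
the flags are passed as arguments because recent Python versions reject
inline global flags that aren't at the very start of the pattern:

```python
import re

# Hypothetical sample item, invented for illustration.
item = "@book{ title = {Some Title} }\n"

# .match always starts at position 0, so a leading \A changes nothing.
with_A = re.match(r"\A\s*@\s*(\w+)", item)
without_A = re.match(r"\s*@\s*(\w+)", item)
same_anchor = with_A.group(1) == without_A.group(1)

# MULTILINE only changes what ^ and $ mean; this pattern uses neither.
pat = r"\s*@\s*(\w+)\s*\{(.*)\}\s*\Z"
plain = re.match(pat, item, re.DOTALL)
multi = re.match(pat, item, re.DOTALL | re.MULTILINE)
same_flags = plain.groups() == multi.groups()

# With \s* in front of it, $ ends up at the same place as \Z here.
dollar = re.match(r"\s*@\s*(\w+)\s*\{(.*)\}\s*$", item, re.DOTALL)
same_end = dollar.groups() == plain.groups()

print(same_anchor, same_flags, same_end)
```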

The use of minimal matching .*? is certainly inefficient:  first it tries an
empty string, then it tries matching 1 character, then 2 characters, and on
& on.  Unless you typically have enormous amounts of trailing whitespace, it
will be more efficient to use a greedy .* instead:  that tries matching the
whole tail first, then cuts back one character at a time until it succeeds.
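
If you want to measure rather than take my word for it, here's a rough
timing sketch.  The input is invented (a long body with one trailing
newline), and the numbers will vary by machine and Python version, so
treat it as a way to experiment and not as a benchmark:

```python
import re
import timeit

# Hypothetical item: a long body followed by very little trailing whitespace.
item = "@book{" + "x = {y},\n" * 2000 + "}\n"

# Minimal match: grows the body one character at a time.
lazy = re.compile(r"\s*@\s*(\w+)\s*\{\s*(.*?)\s*\}\s*\Z", re.DOTALL)
# Greedy match: grabs the whole tail, then backs off until "}" fits.
greedy = re.compile(r"\s*@\s*(\w+)\s*\{(.*)\}\s*$", re.DOTALL)

lazy_time = timeit.timeit(lambda: lazy.match(item), number=20)
greedy_time = timeit.timeit(lambda: greedy.match(item), number=20)

print("lazy:   %.4f s" % lazy_time)
print("greedy: %.4f s" % greedy_time)
```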

So it took me a long time to figure out why you might be using a minimal
match to begin with.  Best I can guess is that it's simply because you don't
want trailing whitespace in the body.  But getting rid of excess whitespace
is what string.strip() is for!  Trying to do that with regexps instead is
strained.
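
To illustrate, assuming a made-up item with nested braces and some
trailing whitespace, the minimal-match pattern and a greedy pattern
followed by .strip() should hand back the same body.  (The str method is
used here in place of string.strip(), and the braces are escaped only for
clarity.)

```python
import re

# Hypothetical item, invented for illustration.
item = "@article{ key,\n  title = {A {Nested} Title}\n}  \n"

# Whitespace trimmed inside the regexp, via \s* around a minimal match.
lazy = re.compile(
    r"\s*@\s*(?P<name>\w+)\s*\{\s*(?P<body>.*?)\s*\}\s*\Z", re.DOTALL)
# Whitespace left in the body and trimmed afterwards.
greedy = re.compile(
    r"\s*@\s*(?P<name>\w+)\s*\{(?P<body>.*)\}\s*$", re.DOTALL)

lazy_body = lazy.match(item).group("body")
greedy_body = greedy.match(item).group("body").strip()

print(lazy_body == greedy_body)
```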

That leaves me with

matcher = re.compile(r"""
    \s*
    @
    \s*
    (?P<name> \w+)
    \s*
    {
        (?P<body> .*)
    }
    \s*
    $
""", re.VERBOSE | re.DOTALL).match

and then

    m = matcher(some_string)
    if m:
        name, body = m.group("name"), m.group("body").strip()
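
Put together as something you can run, with the pattern copied from above.
The parse_item helper name and the sample text are hypothetical, there
only to show the match and the no-match case:

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace.
matcher = re.compile(r"""
    \s*
    @
    \s*
    (?P<name> \w+)
    \s*
    {
        (?P<body> .*)
    }
    \s*
    $
""", re.VERBOSE | re.DOTALL).match

def parse_item(text):
    """Return (name, body) for one item, or None if it doesn't match.

    Hypothetical helper, not part of the original application.
    """
    m = matcher(text)
    if m is None:
        return None
    return m.group("name"), m.group("body").strip()

# Made-up sample input with whitespace between all the parts.
sample = "\n@ Article {\n  author = {A. N. Other},\n  year = 2000\n}\n"
result = parse_item(sample)
not_an_item = parse_item("no at-sign here")

print(result)
print(not_an_item)
```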

> In order to get my application to work under Python 2.0, I use now
>
> import pre as re
>
> that works perfectly.

Fine by me.

just-so-long-as-i-don't-have-to-maintain-it<wink>-ly y'rs  - tim