problems with re in binary files

Tim Peters tim.one at home.com
Thu Sep 20 04:32:33 EDT 2001


[Steven D. Arnold]
> I'm trying to parse a binary file using the re module.  At one point I
> use '<<(.*)>>' with the r.DOTALL option.  I expect this to find a '<<'
> string, find the last '>>' string in the string (which contains the
> whole document; it's small), and everything between would go in the
> group.  However, in practice, re seems to stop well before the end of
> the string, and well before the last instance of '>>'.  In other
> words, the group doesn't seem to contain everything it should.
> ...

It's always better to post a small, self-contained program than to try to
explain the problem in words.  For example, here's one:

import re
test = "<<" + "\x00\n>>" * 1000 + ">>"
pat = re.compile(r"<<(.*)>>", re.DOTALL)
print "Test string has", len(test), "chars."
m = pat.search(test)
if m:
    print "Group 1 spans slice %d:%d" % m.span(1)
else:
    print "Didn't match!"

What does that print when you run it?  When I run it, it prints

    Test string has 4004 chars.
    Group 1 spans slice 2:4002

This shows that embedded null bytes, and embedded newlines, and 1000 "early"
hits on ">>", don't fool re.  Therefore you have a bug in your Python, or
you haven't told us something *relevant* about why it isn't working for you.
If you show us the actual code, finding the cause is much easier than guessing.
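
The same check can be sketched for a modern interpreter. This assumes
Python 3, which postdates this post: there, a file opened in binary mode
yields bytes, so the pattern has to be a bytes pattern too. The test data is
the same made-up string as above, not the original poster's file.

```python
import re

# Same shape as the test string above, built as bytes (Python 3 assumption).
test = b"<<" + b"\x00\n>>" * 1000 + b">>"

# A bytes pattern is required to search bytes; DOTALL lets '.' match newline.
pat = re.compile(rb"<<(.*)>>", re.DOTALL)

print("Test string has", len(test), "bytes.")
m = pat.search(test)
if m:
    print("Group 1 spans slice %d:%d" % m.span(1))
else:
    print("Didn't match!")
```

Run as-is, it reports 4004 bytes and a group 1 slice of 2:4002, the same
numbers as the output above.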

> I'm manipulating a binary file (and therefore a string with 8-bit
> binary characters), so I thought perhaps the `.' was not matching the
> NULL character.

As above, it shouldn't matter:  a null byte is just another character to re.

> So I changed the expression above to '<<((.|\000)*)>>'.  My
> understanding is that this should match either the normal dot regular
> expression, or a literal zero (NULL) character,

DOTALL does the same but quicker.

> and this pattern would then be matched zero or more times.  However,
> the behavior is basically the same.

More evidence that you're not in the right ballpark yet.
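
A small sketch of why the alternation can't change anything. It uses a
made-up sample string (the real data was never posted) and assumes Python 3
syntax. Without DOTALL, '.' already matches a null byte; the one character it
skips is newline, and (.|\000) doesn't add that back.

```python
import re

# Made-up sample: a null byte, a newline, and an early '>>' before the last one.
data = "<<a\x00b\nc>>tail>>"

plain = re.compile(r"<<(.*)>>")                         # no DOTALL
alt = re.compile(r"<<((.|\000)*)>>")                    # '.' or NUL, no DOTALL
dotall = re.compile(r"<<(.*)>>", re.DOTALL)             # '.' matches anything
alt_dotall = re.compile(r"<<((.|\000)*)>>", re.DOTALL)  # redundant alternation

# A null byte alone is no obstacle, even without DOTALL.
print(plain.search("<<a\x00b>>").span(1))   # (2, 5)

# The newline is what stops '.', and the alternation doesn't cover it.
print(plain.search(data))                   # None
print(alt.search(data))                     # None

# With DOTALL both forms reach the last '>>'; the alternation adds nothing.
print(dotall.search(data).span(1))          # (2, 13)
print(alt_dotall.search(data).span(1))      # (2, 13)
```

If a real file behaves differently from this, the difference is in something
the sketch doesn't model, which is one more reason to post the failing case.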

> Anyone have any idea what I should do?
> ...

Post a failing test case, and the cause will be obvious to someone.

it's-always-the-last-place-you-look-ly y'rs  - tim




