Searching binary data

Tim Peters tim_one at email.msn.com
Wed Feb 2 22:49:06 EST 2000


[Darrell]
> Didn't have access to the internet today which forced me to have
> a creative thought of my own. Now to find out if I wasted my time.
>
> The problem is to find patterns in gobs of binary data.
> Treat it as a string you see something like this.
> MZ\220\000\003\000\000\000\004\000\000\000\377\377
>
> I found writing a re for patterns in that, a pain.
> What if I wanted r"[\000-\077]".
> It won't work because there are nulls in the result and re doesn't
> like that.

Actually, that works fine (if it didn't, what you just told us you did is
not what you actually did).  You can't pass a pattern with an actual null to
re (minor flaw of the implementation, IMO), but the raw string
r"[\000-\077]" doesn't contain an actual null:  it contains the 4-character
escape sequence "\000", which re converts to a null.

>>> import re
>>> p = re.compile(r"[\000-\001]")
>>> p.match(chr(0)).span(0)
(0, 1)
>>> p.match(chr(1)).span(0)
(0, 1)
>>> print(p.match(chr(2)))
None
>>>
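
To make the raw-string point concrete, here is a minimal sketch, assuming a current Python 3 interpreter (the names `raw` and `cooked` are made up for illustration). It compares the raw string, where the backslash and digits survive as ordinary characters for re to interpret, with the non-raw spelling, where Python itself has already turned the escape into a real null before re ever sees it.

```python
import re

# Raw string: the backslashes and digits are kept as literal text,
# so the pattern holds no null character at all.
raw = r"[\000-\077]"

# Ordinary string: Python converts \000 and \077 to single characters
# before re is involved, so this one does contain a real null.
cooked = "[\000-\077]"

print(len(raw), "\0" in raw)        # length and null check for the raw form
print(len(cooked), "\0" in cooked)  # same checks for the converted form

# re turns the escape text in the raw pattern into the null itself.
p = re.compile(raw)
print(p.match("\0") is not None)    # a null character is inside the range
print(p.match("@"))                 # "@" is octal 100, just outside the range
```

The raw form is 11 characters long and null-free; the other is 5 characters, one of which is the null.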

> Not to mention all this octal to hex is annoying

Hex escapes work fine too:  r"[\x00-\x3f]" means the same as the above.
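
As a quick check of that equivalence, the sketch below runs both spellings over the sample data from the question. It assumes a current Python 3, where binary data is a `bytes` object and the pattern must be a bytes pattern as well; the variable names are illustrative only.

```python
import re

# The same character class written with octal and with hex escapes.
octal_pat = re.compile(br"[\000-\077]")
hex_pat = re.compile(br"[\x00-\x3f]")

# The sample bytes quoted in the question.
data = b"MZ\220\000\003\000\000\000\004\000\000\000\377\377"

# Offsets of every byte that falls in the range 0x00..0x3f.
octal_hits = [m.start() for m in octal_pat.finditer(data)]
hex_hits = [m.start() for m in hex_pat.finditer(data)]

print(octal_hits)
print(octal_hits == hex_hits)
```

Both patterns report the same offsets, including the ones that land on null bytes.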

> and who knows what trouble Nulls will be.

I do:  none <wink>.  Really, nulls aren't special at all to re.  The glitch
in *passing* an actual null in the pattern to re has to do with the engine's
C interface, which uses a char* for the pattern without an additional count
argument.  That's as deep as this one goes.

> So I wrote an extension to convert everything to hex in the
> following format.
> 4d5aff000300000004000000ffff0000ff
> Now I can treat the whole thing as a string :)

That's fine too.
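
For anyone wanting to try the hex approach without writing an extension, here is a rough sketch using the standard `binascii` module. This is not the extension described above, and `find_bytes` is a hypothetical helper invented for this example. One thing to watch: each byte becomes two hex digits, so a match that starts on an odd character offset straddles two bytes and is not a real hit.

```python
import binascii
import re

data = b"MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff"

# Two lowercase hex digits per byte, as an ordinary text string.
hex_text = binascii.hexlify(data).decode("ascii")
print(hex_text)

def find_bytes(hex_pattern, text):
    """Return byte offsets where hex_pattern matches on a byte boundary.

    Hypothetical helper: the lookahead lets overlapping candidates be
    tested, and odd character offsets are dropped as misaligned.
    """
    lookahead = "(?=%s)" % hex_pattern
    return [m.start() // 2
            for m in re.finditer(lookahead, text)
            if m.start() % 2 == 0]

print(find_bytes("0003", hex_text))   # a null byte followed by byte 0x03
print(find_bytes("a9", hex_text))     # only occurs straddling "5a" and "90"
```

A plain `re.search("a9", hex_text)` does find a match, which is why the alignment filter matters.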





