Python 2.2 re bug?

Mon Aug 26 03:00:13 EDT 2002

Travis Shirk <travis at puddy.lan.kerrgulch.net> wrote:
>
>I'm running into what looks to be a bug in the python 2.2 re module.

I am hard-pressed to call this a bug.  It looks to me like you were relying
on unspecified/undocumented behavior, and that behavior has changed.  I
don't see the "\x" syntax defined anywhere in the "re" module
documentation. To me, the behavior you were seeing in 1.5 is unexpected.
For example, it means that I cannot include a control-A in the substitution
string in the simplest way, nor can I include the literal string "\xFF" in
the substitution string.

The 2.2 behavior seems much more rational: if you want character 255 in the
substitution string, then put character 255 in the substitution string by
using "\xFF".  What you are sending is r"\xFF", which is the same as
"\\xFF": the four literal characters backslash, x, F, F.  That is what ends
up in the substitution string, and that's exactly what the 2.2 substitution
emits.
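
A quick way to see the difference between the two spellings is to compare
their lengths.  This is only a sketch, written for a current Python 3
interpreter rather than 2.2, and the variable names are mine:

```python
# "\xFF" is processed by the Python compiler: one character, code point 255.
one_char = "\xFF"

# r"\xFF" and "\\xFF" are the same thing: four characters, a backslash
# followed by 'x', 'F', 'F'.  Nothing has interpreted the escape yet.
raw_form = r"\xFF"
escaped_form = "\\xFF"

print(len(one_char))             # 1
print(len(raw_form))             # 4
print(raw_form == escaped_form)  # True
print(ord(one_char))             # 255
```

Whatever the re module later does with the four-character form is up to re;
the one-character form never needs any interpretation by re at all.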

It is true that Python 2.x has a different regular expression engine than
Python 1.x.  You should be able to get the old engine by importing "pre"
instead of "re", although I was unable to get your example to work directly
using pre.

>This output is exactly what I expect, but now see what happens in 
>2.2.1:
>import re;
>data = "\xFF\x00\xE0\xD3\xD3\xE4\x95\xFF\x00\x00\x11\xFF\x00\xF5"
>data1 = re.compile(r"\xFF\x00([\xE0-\xFF])").sub(r"\xFF\1", data);
>print data1
>'\\xFF\xe0\xd3\xd3\xe4\x95\xff\x00\x00\x11\\xFF\xf5'

Well, THAT output is what I expect.  It sucked up the first three
characters of the string, and substituted the literal string "\xFF"
followed by the third character.

>I like the hex output over the octal in 1.5, but the substitution is
>clearly wrong.  Notice each spot containing "\\" in the last result.

You do realize that the "\\" is just the interpreter's way of displaying a
single "\" when it shows you the repr of a string, right?  Your result
starts with the four literal characters "\xFF", which are, in fact, the
first four characters of your substitution string.
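
To check this for yourself, compare the repr of a string with what print
writes and with the actual length.  A small sketch, assuming a current
Python 3 interpreter; the names are only for illustration:

```python
# Four characters: backslash, 'x', 'F', 'F'.
s = r"\xFF"

shown = repr(s)   # what the interactive prompt displays

print(shown)            # '\\xFF'  (backslash doubled, quotes added)
print(s)                # \xFF     (the real contents)
print(len(s))           # 4
print(s.count("\\"))    # 1: only one backslash is really there
```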

You can get the results you want this way, which is how I would have
expected to write it in the first place:

  data1 = re.compile("\xFF\x00([\xE0-\xFF])").sub("\xFF\\1", data)
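
For anyone reading this with a later interpreter: here is the same
substitution adapted to Python 3, where binary data like this would be a
bytes object.  This is my adaptation, not 2.2 code, and I am assuming bytes
patterns behave the way the str patterns do above:

```python
import re

data = b"\xFF\x00\xE0\xD3\xD3\xE4\x95\xFF\x00\x00\x11\xFF\x00\xF5"

# Non-raw literals: Python turns \xFF, \x00 and \xE0 into single bytes
# before the re module ever sees them.  "\\1" reaches re as backslash-1,
# which re reads as a reference to group 1.
pattern = re.compile(b"\xFF\x00([\xE0-\xFF])")
data1 = pattern.sub(b"\xFF\\1", data)

print(repr(data1))
```

Each match of FF 00 followed by a byte in the range E0-FF loses its 00 byte,
so the result is two bytes shorter than the input.  The FF 00 00 run in the
middle is left alone because its third byte is outside the range.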

>Is this a known bug?  Have the semantics changed wrt the 2.0 unicode aware
>re package?

The semantics have changed, but I do not believe it can be called a "bug".
It is a change in unspecified/undocumented behavior.
--
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.