null bytes in re pattern - difference between 1.5.2 and 2.0?

Tim Peters tim.one at home.com
Wed Dec 13 22:32:36 EST 2000


[posted and mailed]

[Skip Montanaro]
> I want to delete control characters from some strings.  Accordingly, I
> tried:
>
>     name = re.sub("[\000-\037\177]", "", name)
>
> This works in Python 2.0 but not in 1.5.2.  In 1.5.2 I find I need
> to use raw strings:

That's a good idea in 2.0 too, y'know.

>     name = re.sub(r"[\000-\037\177]", "", name)
>
> Unfortunately, using raw strings in 2.0 fails.

Eh?  Prove it.  That is, submit a bug report with a specific failing example
if that's true.  Works for me:

Python 2.0 (#8, Oct 16 2000, 17:27:58) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
IDLE 0.6 -- press F1 for help
>>> import re
>>> allchars = [chr(i) for i in range(256)]
>>> fat = "".join(allchars)
>>> print len(fat)
256
>>> skinny = re.sub(r"[\000-\037\177]", "", fat)
>>> print len(skinny)
223
>>> 256 - 223
33
>>>

> Is there some form that will work both in 1.5.2 and 2.0?

The r-string form.  Or use the optional deletechars argument to
string.translate, which should run much faster in either version.
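A minimal sketch of the translate approach, written in the modern Python 3 spelling rather than the 1.5.2/2.0 one (that choice is an assumption made for this example).  The names CONTROL_CHARS, DELETE_TABLE and strip_control are made up for illustration.  In 1.5.2 and 2.0 the equivalent call would be along the lines of string.translate(name, string.maketrans("", ""), deletechars), with an identity table as the second argument.

```python
# Characters to delete: 0x00-0x1F plus DEL (0x7F), the same set that the
# regexp class [\000-\037\177] matches.
CONTROL_CHARS = "".join(chr(i) for i in range(32)) + chr(127)

# The three-argument form of str.maketrans maps every character in the
# third argument to None, which translate() treats as "delete".
DELETE_TABLE = str.maketrans("", "", CONTROL_CHARS)


def strip_control(text):
    """Return text with the control characters above removed."""
    return text.translate(DELETE_TABLE)


fat = "".join(chr(i) for i in range(256))
skinny = strip_control(fat)
print(len(fat), len(skinny))
```

The table is built once and reused, so each call is a single pass over the string with no regexp machinery involved.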

> Is this a change I should have expected?

No.

> I assume it has something to do with Unicode support in 2.0.

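Much more mundane than that:  1.5.2 used a 3rd-party regexp engine (PCRE),
and its interface required passing in the pattern as a regular old C string.
A C string ends at its first null byte, so you couldn't pass a pattern with
a literal null byte in 1.5.2.  In the non-raw form, Python turns \000 into
exactly such a byte before the regexp engine ever sees the pattern; in the
r-string form the engine gets the backslash escape and decodes it itself.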
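The difference between the two spellings can be sketched as follows, assuming a modern Python 3 interpreter.  The truncation step only simulates what an interface that stops reading at the first null byte would be left with; it is not a reproduction of what 1.5.2 actually did.  The variable names are made up for the example.

```python
import re

plain = "[\000-\037\177]"   # Python decodes the escapes: real control chars
raw = r"[\000-\037\177]"    # backslashes survive: the regexp engine decodes them

# The non-raw pattern contains an embedded null; the raw one does not.
has_null_plain = "\0" in plain
has_null_raw = "\0" in raw

# Simulation (an assumption, not 1.5.2's real code path): keep only what
# precedes the first null byte, as a C-string reader would.
seen_as_c_string = plain.split("\0", 1)[0]

try:
    re.compile(seen_as_c_string)
    truncated_compiles = True
except re.error:
    # What is left is an unterminated character class.
    truncated_compiles = False

# An engine that accepts embedded nulls treats both spellings alike.
sample = "a\x00b\x1fc\x7fd"
same_result = re.sub(plain, "", sample) == re.sub(raw, "", sample)

print(has_null_plain, has_null_raw, repr(seen_as_c_string),
      truncated_compiles, same_result)
```

Only the r-string spelling keeps the null byte out of the pattern text, which is why it is the portable form.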

ghosts-chasing-ghosts-ly y'rs  - tim




