re.sub() backreference bug?

Thu Aug 17 19:02:46 EDT 2006

jemminger at gmail.com wrote:
> using this code:
>
> import re
> s = 'HelloWorld19-FooBar'
> s = re.sub(r'([A-Z]+)([A-Z][a-z])', "\1_\2", s)
> s = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
> s = re.sub('-', '_', s)
> s = s.lower()
> print "s: %s" % s
>
> i expect to get:
> hello_world19_foo_bar
>
> but instead i get:
> hell☺_☻orld19_fo☺_☻ar
>
> (in case the above doesn't come across the same, it's:
> hellX_Yorld19_foX_Yar, where X is a white smiley face and Y is a black
> smiley face !!)
>
> is this a bug, or am i doing something wrong?
>

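Tim's given you the solution to the problem: with the re module,
*always* use raw strings in regexes and substitution strings. In an
ordinary string literal, "\1" and "\2" are octal escapes for the
characters chr(1) and chr(2), so re.sub() never sees a backreference.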
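
For reference, here is a sketch of the original snippet with raw
replacement strings. I'm assuming this is the shape of Tim's fix. The
parenthesized print and the str.replace() call for the hyphen are my
own substitutions, not part of the original code:

```python
import re

s = 'HelloWorld19-FooBar'

# r"\1_\2" reaches re.sub() as backslash-1, underscore, backslash-2,
# so both groups are treated as backreferences.
s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', s)
s = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', s)

# A literal hyphen needs no regex; str.replace() is enough here.
s = s.replace('-', '_').lower()

print("s: %s" % s)
```

With that change the result should be the expected
hello_world19_foo_bar.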

Here's a simple diagnostic tool that you can use when the visual
presentation of a result leaves you wondering [did you get smiley faces
on Windows in IDLE? on Linux?]:

|>>> print repr(s)
'hell\x01_\x02orld19_fo\x01_\x02ar'
|>>> print "s: %r" % s
s: 'hell\x01_\x02orld19_fo\x01_\x02ar'
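
The same repr() trick can be pointed at the replacement string itself
to see what re.sub() was actually handed. A small sketch (the names
plain and raw are just illustrative):

```python
plain = "\1_\2"   # octal escapes: chr(1), '_', chr(2)
raw = r"\1_\2"    # backslash, '1', '_', backslash, '2'

# repr() shows the control characters that print would render as
# smiley faces (or as nothing at all) on some consoles.
print(repr(plain))
print(repr(raw))

# The lengths make the difference obvious: 3 characters vs. 5.
print("%d %d" % (len(plain), len(raw)))
```

The non-raw version contains the same \x01 and \x02 characters that
show up in the repr() output above.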

HTH,
John