re bug?

Fri Dec 1 07:08:05 EST 2000

On Fri, 1 Dec 2000 rubygeek at yahoo.com wrote:

> active-python2.0, win2k
> 
> > re.sub("(\d+) (\d+)","\1\2","abc 34 23")
> "abc \001\002"
> 
> > re.sub("(\d+) (\d+)","\\1\\2","abc 34 23")
> "abc 3423"
> 
> can anyone explain this?

I'll try.  This comes from the way Python handles backslash escapes in
string literals.  In normal strings (the kind you are using), the backslash
introduces an escape sequence: "\n" gives you a newline (character code
10).  In your case you are using "\d" and "\1"/"\2".  "\d" is not a
recognized escape sequence, so Python keeps both the backslash and the
character `d'.  But "\1" is an octal escape, here the character with code
1 (octal 1 is also decimal 1).  That's why you get the first result: the
replacement string contains the characters \001 and \002 instead of group
references.  In the second case you give Python "\\", which means a single
backslash character `\', so re.sub sees \1 and \2.  You are actually quite
lucky that "\d" has no special meaning.
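
If you want to see the difference for yourself, here is a small sketch that
just inspects the strings before re ever gets them.  The variable names are
only for illustration:

```python
# "\1" in a normal literal is an octal escape: one character with code 1.
octal_escape = "\1"
# "\\1" is an escaped backslash followed by the digit 1: two characters.
escaped_backslash = "\\1"
# "\n" is the newline escape: one character with code 10.
newline = "\n"

print(len(octal_escape), ord(octal_escape))  # 1 1
print(len(escaped_backslash))                # 2
print(ord(newline))                          # 10
```

So in the first call re.sub never receives a backslash in the replacement
at all, only the two control characters.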

To make it short: for regular expressions, where you often have to use
backslashes, it's best to use so-called raw strings r"...":

re.sub(r"(\d+) (\d+)", r"\1\2", "abc 34 23")
"abc 3423"

Raw strings do no backslash escape processing (a backslash can still keep
a quote from ending the string, but the backslash stays in the string).
For more information, see "2.4.1 String literals" in the Python Reference
Manual.
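
As a rough check of the above, this sketch compares a raw string with its
doubled-backslash equivalent and shows the quote case.  Again, the names
are made up for the example:

```python
import re

# A raw string and the doubled-backslash spelling are the same string.
raw_replacement = r"\1\2"
escaped_replacement = "\\1\\2"
print(raw_replacement == escaped_replacement)  # True

# Both therefore give the same substitution result.
result = re.sub(r"(\d+) (\d+)", raw_replacement, "abc 34 23")
print(result)  # abc 3423

# In a raw string, \" does not end the string, and the backslash is kept.
quote_in_raw = r"\""
print(len(quote_in_raw))  # 2
```

One consequence of the quote rule is that a raw string cannot end in a
single backslash.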

Cheers, Carsten
-- 
Carsten Geckeler:  carsten dot geckeler at gmx dot de
To get proper email-address replace `dot' and `at' by the corresponding symbols.