[Tutor] regex questions

Fri Feb 18 11:13:09 CET 2011

Albert-Jan Roskam wrote:

> So the raw string \b means "ASCII backspace". Is that another way of 
> saying that it means 'Word boundary'?

No.

Python string literals use backslash escapes for special characters, 
just as many other programming languages, including C, do.

So when you type "hello world\n" as a *literal* in source code, the \n 
doesn't mean backslash-n, but it means a newline character. The special 
escapes used by Python include:

\0  NULL (ASCII code 0)
\a  BELL character (ASCII code 7)
\b  BACKSPACE (ASCII code 8)
\n  newline
\t  tab
\r  carriage return
\'  single quote  (does not close string)
\"  double quote  (does not close string)
\\  backslash
\ooo  character with code ooo in octal (one to three octal digits)
\xXX  character with code XX in hex (exactly two hex digits)

\b (backspace) doesn't have anything to do with word boundaries.
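
You can check this at the interactive prompt. Here is a minimal sketch 
(the variable names are just made up for illustration) comparing a 
normal literal with a raw one:

```python
# In a normal literal, \b is ONE character: ASCII backspace (code 8).
normal = '\b'
# In a raw literal, \b stays as TWO characters: a backslash and the letter b.
raw = r'\b'

print(len(normal), ord(normal))          # 1 8
print(len(raw), [ord(c) for c in raw])   # 2 [92, 98]

# The same goes for \n: it counts as a single newline character.
print(len('hello world\n'))              # 12
```

Note that it is the normal literal, not the raw string, that gives you 
the backspace character.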

Regexes, however, are a computer language in themselves, and they use an 
*actual backslash* to introduce special meaning. Because that backslash 
clashes with the use of backslashes in Python string literals, you have 
to work around the clash. You could do any of these:

# Escape the backslash, so Python won't treat it as special:
pattern = '\\bword\\b'

# Use chr() to build up a non-literal string:
pattern = chr(92) + 'bword' + chr(92) + 'b'

# Use raw strings:
pattern = r'\bword\b'

When the Python compiler compiles a raw string, it treats a backslash as 
just an ordinary character, so the backslash reaches the regex engine 
intact. That makes raw strings the simplest and best solution.

> You're right: debugging regexes is a PIA. One teeny weeny mistake makes all the 
> difference. Could one say that, in general, it's better to use a Divide and 
> Conquer strategy and use a series of regexes and other string operations to 
> reach one's goal?

Absolutely!
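
For example, here is a rough sketch of that approach. The input line and 
all the names are hypothetical, invented only to show the idea: ordinary 
string methods do the coarse splitting, and each regex stays small enough 
to test on its own.

```python
import re

# Hypothetical input line, made up for this illustration.
line = 'user=alice; email=alice@example.com; age=42'

# Step 1: plain string methods split the line into fields.
fields = [field.strip() for field in line.split(';')]

# Step 2: one small regex handles a single "key=value" field at a time.
pair = re.compile(r'(\w+)=(.+)')
record = {}
for field in fields:
    match = pair.fullmatch(field)
    if match:
        record[match.group(1)] = match.group(2)

# Step 3: a second, separate regex checks just one of the values.
is_number = re.fullmatch(r'\d+', record['age']) is not None

print(record)
print(is_number)
```

If one step misbehaves, you can print its intermediate result and debug 
that step alone, instead of staring at one giant pattern.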

-- 
Steven