[Tutor] regex questions
Steven D'Aprano
steve at pearwood.info
Fri Feb 18 11:13:09 CET 2011
Albert-Jan Roskam wrote:
> So the raw string \b means "ASCII backspace". Is that another way of
> saying that it means 'Word boundary'?
No.
Python string literals use backslash escapes for special characters, as
do many other programming languages, including C.
So when you type "hello world\n" as a *literal* in source code, the \n
doesn't mean backslash-n; it means a single newline character. The
escapes recognized by Python include:
\0 NULL (ASCII code 0)
\a BELL character (ASCII code 7)
\b BACKSPACE (ASCII code 8)
\n newline
\t tab
\r carriage return
\' single quote (does not close string)
\" double quote (does not close string)
\\ backslash
\nnn character with ASCII code nnn in octal (one to three octal digits)
\xXX character with ASCII code XX in hex
\b (backspace) doesn't have anything to do with word boundaries.
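You can check this for yourself at the interpreter. This is only a
minimal sketch, and the variable names are just for illustration:

```python
# In an ordinary string literal, \b is ONE character: ASCII backspace.
backspace = "\b"

# In a raw string literal, \b stays as TWO characters: backslash and 'b'.
raw = r"\b"

print(len(backspace), ord(backspace))     # 1 8
print(len(raw), [ord(c) for c in raw])    # 2 [92, 98]
```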
Regexes, however, are a computer language in themselves, and they use an
*actual backslash* to introduce special meaning. Because that backslash
clashes with the use of backslashes in Python string literals, you have
to work around the clash. You could do any of these:
# Escape the backslash, so Python won't treat it as special:
pattern = '\\bword\\b'
# Use chr() to build up a non-literal string:
pattern = chr(92) + 'bword' + chr(92) + 'b'
# Use raw strings:
pattern = r'\bword\b'
In a raw string literal, the Python compiler treats a backslash as just
an ordinary character. So that's the simplest and best solution.
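As a rough check that the three spellings really are the same pattern,
and that forgetting the workaround silently breaks it, you could try
something like this (the sample text is made up):

```python
import re

escaped = '\\bword\\b'
built = chr(92) + 'bword' + chr(92) + 'b'
raw = r'\bword\b'

# All three spellings produce the identical string.
same = (escaped == built == raw)

# With the backslashes intact, the regex engine sees \b as a word boundary.
hit = re.search(raw, 'a word here')      # matches 'word'
miss = re.search(raw, 'swordfish')       # None: no boundary before the 'w'

# Without the workaround, the pattern contains two backspace characters,
# which ordinary text does not contain, so nothing matches.
broken = re.search('\bword\b', 'a word here')   # None
```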
> You're right: debugging regexes is a PIA. One teeny weeny mistake makes all the
> difference. Could one say that, in general, it's better to use a Divide and
> Conquer strategy and use a series of regexes and other string operations to
> reach one's goal?
Absolutely!
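Here is one way that might look, as a sketch only. The sample line and
the key = value format are invented for the example; the point is that
plain string methods do the coarse work and a single small regex
handles each piece:

```python
import re

line = "name = Alice ; age = 42 ; city = Oslo"   # made-up sample data

pairs = {}
for chunk in line.split(";"):      # step 1: split with a string method
    chunk = chunk.strip()          # step 2: trim surrounding whitespace
    # step 3: one short, easy-to-debug regex per piece
    m = re.match(r"(\w+)\s*=\s*(\w+)$", chunk)
    if m:
        pairs[m.group(1)] = m.group(2)

print(pairs)
```

Each step can be tested on its own, which is much easier than debugging
one big regex that tries to do everything at once.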
--
Steven