RegEx issues

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Sat Jan 24 19:46:43 EST 2009


On Sat, 24 Jan 2009 19:03:26 -0200, Sean Brown wrote:

> Using python 2.4.4 on OpenSolaris 2008.11
>
> I have the following string created by opening a url that has the
> following string in it:
>
> td[ct] = [[ ... ]];\r\n
>
> The ...  above is what I'm interested in extracting which is really a
> whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
> The problem is it appears that python is escaping the \ in the regex
> because I see this:
> >>> reg = '\[\[(.*)\]\];'
> >>> reg
> '\\[\\[(.*)\\]\\];'
>
> Now to me looks like it would match the string - \[\[ ... \]\];

No. Python's escape character is the backslash \; if you want to include a
backslash inside a string, you have to double it. For example, these are
all single-character strings: 'a'  '\n'  '\\'
Coincidentally (or not), the backslash is also the escape character in
regular expressions, so the two levels of escaping pile up: to get a string
containing \a (two characters) you have to write "\\a".
That's rather tedious and error-prone. To help with this, Python provides
"raw string literals", in which backslashes are not interpreted as escapes.
Just put an r before the opening quote: r"\(\d+\)" (seven characters;
matches numbers inside parentheses).
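
The claims above are easy to check by comparing lengths and values
directly. This is only a minimal sketch; the sample text "call(123) done"
is made up for illustration, and print() is written with parentheses so
the code also runs on Python 3.

```python
import re

# Each of these literals denotes a single character.
assert len('a') == 1
assert len('\n') == 1   # newline
assert len('\\') == 1   # one backslash

# A doubled backslash in a normal literal and a single backslash in a raw
# literal produce the same two-character string.
assert "\\a" == r"\a"
assert len(r"\a") == 2

# The raw pattern from the text: seven characters, matching digits
# enclosed in literal parentheses.
pattern = r"\(\d+\)"
assert len(pattern) == 7

match = re.search(pattern, "call(123) done")
print(match.group())  # (123)
```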

Also, note that when you *evaluate* an expression in the interpreter (like
the lone "reg" above), it prints the "repr" of the result: for a string,
that is the escaped contents surrounded by quotes. (That's very handy when
debugging, but may be confusing if you don't know how to interpret it.)

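Third, Python is very permissive with unrecognized escape sequences:
instead of flagging them as errors, it leaves them in the string unchanged.
In your case, \[ is not a valid escape sequence, so both the backslash and
the bracket stay in the string. Your non-raw literal therefore already held
the right value; the raw literal below produces exactly the same string.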
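
This can be checked without putting an unknown escape into the source
file itself. The sketch below evaluates the literal from a string. It
assumes an interpreter that still accepts unknown escapes: Python 3.6 and
later emit a warning for them (silenced here), and a future version may
reject them outright.

```python
import warnings

# Source text of a non-raw literal containing the unknown escape \[
literal_source = r"'\['"

with warnings.catch_warnings():
    # Newer interpreters warn about unknown escapes; ignore that here.
    warnings.simplefilter("ignore")
    value = eval(literal_source)

# The backslash was left in place: the same two characters as the raw literal.
assert value == r"\["
assert len(value) == 2
```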

py> reg = r'\[\[(.*)\]\];'
py> reg
'\\[\\[(.*)\\]\\];'
py> print reg
\[\[(.*)\]\];
py> len(reg)
13

> Which obviously doesn't match anything because there are no literal \ in
> the above string. Leaving the \ out of the \[\[ above has re.compile
> throw an error because [ is a special regex character. Which is why it
> needs to be escaped in the first place.

It works in this example:

py> txt = """
... Some text
... and td[ct] = [[ more things ]];
... more text"""
py> import re
py> m = re.search(reg, txt)
py> m
<_sre.SRE_Match object at 0x00AC66A0>
py> m.groups()
(' more things ',)

So maybe your regular expression doesn't match the actual text (is the
final ";" really there? is there unexpected whitespace?).
For more info, see the Regular Expressions HOWTO at
http://docs.python.org/howto/regex.html
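
One more candidate explanation, which the original post does not confirm:
the text between the brackets may span several lines. By default "." does
not match a newline, so (.*) cannot cross a line break. The sketch below
uses made-up sample text to show the difference re.DOTALL makes; the
non-greedy (.*?) is an extra precaution in case "]];" occurs more than
once in the page.

```python
import re

# Hypothetical page text: the bracketed part spans two lines.
txt = "td[ct] = [[ first line\r\nsecond line ]];\r\n"

reg = r'\[\[(.*)\]\];'

# Without DOTALL, "." stops at "\n", so the pattern cannot reach "]];".
m_default = re.search(reg, txt)
print(m_default)  # None

# With DOTALL, "." also matches newlines; the non-greedy ".*?" stops at
# the first "]];" instead of running on to the last one.
m_dotall = re.search(r'\[\[(.*?)\]\];', txt, re.DOTALL)
print(repr(m_dotall.group(1)))
```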

-- 
Gabriel Genellina



