Raw string substitution problem

MRAB python at mrabarnett.plus.com
Thu Dec 17 14:45:46 EST 2009


Alan G Isaac wrote:
>> Alan G Isaac<alan.isaac at gmail.com>  wrote:
>>>           >>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a','123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
>>>           True
>>> Why are the first two strings being treated as if they are the last one?
>  
> 
> On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
>> They aren't.  The last string is different.
> 
> Of course it is different.
> That is the basis of my question.
> Why is it being treated as if it is the same?
> (See the end of this post.)
> 
> 
>> Alan G Isaac<alan.isaac at gmail.com>  wrote:
>>> More simply, consider::
>>>
>>>           >>>  re.sub('abc', '\\', '123abcdefg')
>>>           Traceback (most recent call last):
>>>             File "<stdin>", line 1, in <module>
>>>             File "C:\Python26\lib\re.py", line 151, in sub
>>>               return _compile(pattern, 0).sub(repl, string, count)
>>>             File "C:\Python26\lib\re.py", line 273, in _subx
>>>               template = _compile_repl(template, pattern)
>>>             File "C:\Python26\lib\re.py", line 260, in _compile_repl
>>>               raise error, v # invalid expression
>>>           sre_constants.error: bogus escape (end of line)
>>>
>>> Why is this the proper handling of what one might think would be an
>>> obvious substitution?
> 
> 
> On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
>> Is this what you want?  What you have is a re expression consisting of
>> a single backslash that doesn't escape anything (EOL) so it barfs.
>>         >>> re.sub('abc', r'\\', '123abcdefg')
>>         '123\\defg'
> 
> 
> Turning again to the documentation:
>         "if it is a string, any backslash escapes in it are processed.
>         That is, \n is converted to a single newline character, \r is
>         converted to a linefeed, and so forth."
> So why is '\n' converted to a newline but '\\' does not become a literal
> backslash?  OK, I don't do much string processing, so perhaps this is where
> I am missing the point: how is the replacement being "converted"?
> (As Peter's example shows, if you supply the replacement via
> a function, this does not happen.) You suggest it is just a matter of
> it being an re, but::
> 
>         >>> re.sub('abc', 'a\\nc','1abcd') == re.sub('abc', 'a\nc','1abcd')
>         True
>         >>> re.compile('a\\nc') == re.compile('a\nc')
>         False
> 
> So I have two strings that are not the same, nor do they compile
> equivalently, yet apparently they are "converted" to something
> equivalent for the substitution. Why? Is my question clearer?
> 
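re.compile('a\\nc') _does_ compile to the same regex as
re.compile('a\nc'). In the first, the regex parser itself converts the
two-character escape \n into a newline; in the second, the pattern
already contains a literal newline.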
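
A quick way to check this, as a minimal sketch: compare what the two
patterns match rather than comparing the regex objects. The names
escaped, literal and text are mine, chosen only for illustration.

```python
import re

escaped = re.compile('a\\nc')   # pattern text: a, backslash, n, c
literal = re.compile('a\nc')    # pattern text: a, newline, c

text = 'xa\ncx'                 # contains a, newline, c

# The pattern strings differ...
patterns_differ = escaped.pattern != literal.pattern

# ...but both find the same match in the same place.
span_escaped = escaped.search(text).span()
span_literal = literal.search(text).span()

# Neither matches a backslash followed by the letter n.
no_match_escaped = escaped.search('a\\nc') is None
no_match_literal = literal.search('a\\nc') is None

print(patterns_differ, span_escaped, span_literal)
```

So the difference between the two pattern strings disappears once the
regex parser has handled the escape.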

However, regex objects are compared by identity, so two separately
compiled regex objects never compare equal, even when they are to all
intents identical. Your two pattern strings are different strings, so
they produce two separate regex objects, hence the False.

Comparing re.compile('a\nc') with re.compile('a\nc') gives True only
because the re module contains a cache (keyed on the string and options
supplied): the first re.compile('a\nc') puts the regex object in the
cache and the second re.compile('a\nc') returns that same regex object
from the cache. If you clear the cache between the two calls (do
re._cache.clear()) you'll get two different regex objects which won't
compare equal.
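
The cache behaviour can be seen with something like the following rough
sketch. It uses re.purge(), the documented call for emptying the cache,
instead of touching the private re._cache directly. It checks object
identity only, since whether == treats two distinct regex objects as
equal may depend on the Python version. The variable names are just for
illustration.

```python
import re

first = re.compile('a\nc')
second = re.compile('a\nc')     # same string and flags: served from the cache
same_object = first is second

re.purge()                      # empty the re module's cache
third = re.compile('a\nc')      # compiled afresh, so a new object
fresh_object = third is not first

other = re.compile('a\\nc')     # different pattern string, different cache key
distinct_key = other is not third

print(same_object, fresh_object, distinct_key)
```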

> If the answer looks too obvious to state, assume I'm missing it anyway
> and please state it.  As I said, I seldom use the re module.
> 


