re.sub unexpected behaviour

Tue Jul 6 13:32:07 EDT 2010

On Tue, 06 Jul 2010 19:10:17 +0200, Javier Collado wrote:

> Hello,
> 
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
> 
> def process(match):
>     return match.string
> 
> If we use that simple function with re.sub using a simple pattern and a
> string we get the expected output:
> re.sub('123', process, '123')
> '123'
> 
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> re.sub(r'123', process, '123\n')
> '123\n\n'

I don't know why you say it is unexpected. The regex "123" matched the
first three characters of "123\n". Those three characters are replaced by
whatever process() returns, and match.string is a copy of the whole string
you are searching, "123\n". That gives "123\n\n", exactly as expected.

Perhaps these examples might help:

>>> re.sub('W', process, 'Hello World')
'Hello Hello Worldorld'
>>> re.sub('o', process, 'Hello World')
'HellHello World WHello Worldrld'
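
To see where the extra text comes from, it can help to look at the match
object directly. A small sketch (the variable names are mine, not from
your code):

```python
import re

m = re.search('W', 'Hello World')

# .string is the whole string that was searched, not the matched part.
whole = m.string      # 'Hello World'
# .group() (same as .group(0)) is only the text the pattern matched.
matched = m.group()   # 'W'
# .span() is the slice of .string that re.sub would replace.
span = m.span()       # (6, 7)

print(repr(whole), repr(matched), span)
```

So a function that returns match.string drops the entire subject string
into the slot given by match.span().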

Here's a simplified pure-Python equivalent of what you are doing (it takes
a literal target rather than a regex, and handles only the first match):

def replace_with_match_string(target, s):
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n+len(target):]
    return s
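
To cover every occurrence, the way re.sub does, a sketch for a literal
(non-regex) target could look like the following. The helper name is made
up for illustration, and it assumes target is a non-empty plain string:

```python
import re

def replace_all_with_whole_string(target, s):
    # Hypothetical helper, not part of the re module: split s on every
    # occurrence of the literal target, then glue the pieces back together
    # with the whole original string s in place of each occurrence.
    # Assumes target is non-empty (str.split raises ValueError otherwise).
    return s.join(s.split(target))

def process(match):
    # Same function as in the original question.
    return match.string

result = replace_all_with_whole_string('o', 'Hello World')
print(repr(result))
print(result == re.sub('o', process, 'Hello World'))
```

For these inputs it should agree with the re.sub call shown earlier.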

> If we try to get the same result using a replacement string, instead of
> a function, the strange behaviour cannot be reproduced:
> re.sub(r'123', '123', '123')
> '123'
> 
> re.sub('123', '123', '123\n')
> '123\n'

The regex "123" matches the first three characters of "123\n", which are
then replaced by "123", giving "123\n", exactly as expected.

>>> re.sub("o", "123", "Hello World")
'Hell123 W123rld'
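
If what you actually wanted was for the function to hand back the text
that matched, my guess is that match.group() is what you were after rather
than match.string. A sketch, where keep_match is a name I made up:

```python
import re

def keep_match(match):
    # Return only the matched text, not the whole searched string.
    return match.group(0)

with_function = re.sub(r'123', keep_match, '123\n')

# In a replacement string, \g<0> stands for the whole match, so this is
# the string-based way of saying the same thing.
with_template = re.sub(r'123', r'\g<0>', '123\n')

print(repr(with_function), repr(with_template))
```

Both calls should leave the trailing newline alone and give '123\n'.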

-- 
Steven