re.sub

Tue Oct 16 14:38:28 EDT 2007

> Let me show you a very bad consequence of this...
> 
> a=open('file1.txt','rb').read()
> b=re.sub('x',a,'x')
> open('file2.txt','wb').write(b)
> 
> Now if file1.txt contains a \n or \" then file2.txt is not the
> same as file1.txt while it should be.

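That's functioning as designed.  re.sub() processes backslash 
escapes in the replacement string, so if you want the contents 
of file1.txt inserted literally, you have to escape them first.  
re.escape() does that for a *pattern*:

   http://docs.python.org/lib/node46.html#l2h-407

but it's the wrong tool for a replacement string, where only 
backslashes are special.  There, you can specially treat the 
two-character sequence backslash-n:

   b=re.sub('x', a.replace('\\n', '\\\\n'), 'x')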

or just escape all the backslashes in the incoming replacement string:

   b=re.sub('x', a.replace('\\', '\\\\'), 'x')
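
To make the difference concrete, here's a minimal sketch in Python 3 
syntax.  The sample string stands in for the contents of file1.txt, 
and the variable names are mine.  The last variant, passing a function 
as the replacement, is an alternative not mentioned above: re.sub() 
inserts a function's return value as-is, without escape processing.

```python
import re

# Stand-in for the contents of file1.txt: a backslash followed by 'n',
# then a real newline, then a backslash followed by a double quote.
a = 'one\\ntwo\nthree\\"four'

# Unescaped: re.sub() interprets the two characters \n in the
# replacement as a newline, so the result differs from the input.
naive = re.sub('x', a, 'x')

# Doubling the backslashes makes re.sub() emit each one literally.
escaped = re.sub('x', a.replace('\\', '\\\\'), 'x')

# A function used as the replacement has its return value inserted
# as-is, with no escape processing at all.
via_function = re.sub('x', lambda m: a, 'x')

print(naive == a)         # False
print(escaped == a)       # True
print(via_function == a)  # True
```

Doubling the backslashes is the smallest change to the original code; 
the function form may be handy when you'd rather not touch the data.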

The help for the re module's syntax explicitly notes the same kind 
of trap for backslashes in string literals:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember 
that Python also uses the backslash as an escape sequence in 
string literals; if the escape sequence isn't recognized by 
Python's parser, the backslash and subsequent character are 
included in the resulting string. However, if Python would 
recognize the resulting sequence, the backslash should be 
repeated twice. This is complicated and hard to understand, so 
it's highly recommended that you use raw strings for all but the 
simplest expressions.
"""

The short upshot:  "it's highly recommended that you use raw 
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp, 
not "Python's interpretation of a regexp before the regex engine 
gets to touch it".
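
As a rough illustration of what the raw-string advice buys you, here's 
a small sketch (Python 3 syntax; the pattern and sample text are made 
up for the example) comparing the same literal with and without the 
r prefix:

```python
import re

# In a normal literal, Python turns '\b' into a backspace character
# before the regex engine ever sees it.
cooked = '\bcat\b'

# In a raw literal, the backslash and the 'b' both survive, so the
# regex engine sees \b and treats it as a word boundary.
raw = r'\bcat\b'

text = 'a cat sat'
cooked_match = re.search(cooked, text)
raw_match = re.search(raw, text)

print(len(cooked), len(raw))  # 5 7
print(cooked_match)           # None
print(raw_match.group())      # cat
```

The cooked pattern looks for a literal backspace on either side of 
"cat", which is why it finds nothing in ordinary text.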

-tkc