inserting \ in regular expressions

Dave Angel d at davea.name
Wed Oct 26 16:47:54 EDT 2011


On 10/26/2011 03:48 PM, Ross Boylan wrote:
> I want to replace every \ and " (the two characters for backslash and
> double quotes) with a \ and the same character, i.e.,
> \ ->  \\
> " ->  \"
>
> I have not been able to figure out how to do that.  The documentation
> for re.sub says "repl can be a string or a function; if it is a string,
> any backslash escapes in it are processed. That is, \n is converted to a
> single newline character, \r is converted to a carriage return, and so
> forth. Unknown escapes such as \j are left alone."
>
> \\ is apparently unknown, and so is left as is. So I'm unable to get a
> single \.
>
> Here are some tries in Python 2.5.2.  The document suggested the result
> of a function might not be subject to the same problem, but it seems to
> be.
>>>> def f(m):
> ...    return "\\"+m.group(1)
> ...
>>>> re.sub(r"([\\\"])", f, 'Silly " quote')
> 'Silly \\" quote'
> <SNIP>
>>>> re.sub(r"([\\\"])", "\\\\\\1", 'Silly " quote')
> 'Silly \\" quote'
>
> Or perhaps I'm confused about what the displayed results mean.  If a
> string has a literal \, does it get shown as \\?
>
> I'd appreciate it if you cc me on the reply.
>
> Thanks.
> Ross Boylan
>
I can't really help on the regex aspect of your code, but I can tell you 
a little about backslashes, quote literals, the interpreter, and python.

First, I'd scrap the interpreter and write your stuff to a file.  Then 
test it by running that file.  The reason for that is that the 
interpreter is helpfully trying to reconstruct the string you'd have to 
type in order to get that result.  So while you may have successfully 
turned a double backslash into a single one, the interpreter helpfully 
does the inverse, and you don't see whether you're right or not.

Next, always assign to variables, and test those variables on a separate 
line with the regex.  This is probably what your document meant when it 
mentioned the result of a function.

Now some details about python.

When python compiles/interprets a quote literal, the syntax parsing has 
to decide where the literal stops, so quotes are treated specially.  
Sometimes you can sidestep the problem of embedding quotes inside 
literals by using single quotes on the outside and double inside, or 
vice versa, as you did in the 'Silly " quote' example.

But the more general way to put funny characters into a quote literal is 
to escape each with a backslash.  So there are a bunch of two-character 
escapes.  backslash-quote is how you can put either kind of quote into a 
literal, regardless of what's being used to delimit it.  backslash-n 
gets a newline, which would otherwise be a syntax error if you typed it 
directly inside the literal.  backslash-t and some others are usually 
less troublesome, but can be surprising.  And backslash-backslash 
represents a single backslash.  There are also backslash codes to 
represent arbitrary characters you might not have on your keyboard, and 
these may use multiple characters after the backslash.
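To make the escapes above concrete, here is a small sketch.  The name 
"samples" and the particular strings are just illustrations, and it uses 
print(...) with parentheses so that it should run under both Python 2 
and Python 3.  Each escape takes two or more characters in the source 
but ends up as a single character in the string object.

```python
# Each escape is several characters in the source, one character in the string.
samples = {
    "quote": 'isn\'t',        # backslash-quote inside single quotes
    "newline": "a\nb",        # backslash-n is one newline character
    "tab": "a\tb",            # backslash-t is one tab character
    "backslash": "a\\b",      # backslash-backslash is one backslash
    "hex escape": "\x41",     # a multi-character escape for one character
}

for name in sorted(samples):
    value = samples[name]
    print("%s: length %d" % (name, len(value)))
```

Checking len() like this is a quick way to confirm how many characters 
really ended up in the string, independent of how it gets displayed.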

So write a bunch of lines like
      a = 'this isn\'t a surprise'
      print a

and experiment.  Notice that if you use \n in such a string, the print 
will put it on two lines.  Likewise a \t is printed as an actual tab.

Now for a digression.  The interpreter uses  repr() to display strings.  
You can experiment with that by doing
      print a
      print repr(a)

Notice the latter puts quotes around the string.  They are NOT part of 
the string object in a.  And it re-escapes any embedded funny 
characters, sometimes differently than the way you entered them.
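As a rough illustration of the difference (the variable names here are 
only for the example), the following compares what print shows with what 
repr() shows for a string holding exactly one backslash.  This is likely 
why a result such as 'Silly \\" quote' at the interactive prompt looks as 
though it has two backslashes when the string itself has one.

```python
a = "one \\ backslash"     # two backslashes in the literal, one in the string

print(a)                   # the string itself: a single backslash appears
print(repr(a))             # a quoted literal: the backslash is doubled again

in_string = a.count("\\")          # backslashes actually in the string
in_repr = repr(a).count("\\")      # backslashes in its repr() form
print("%d in the string, %d in its repr" % (in_string, in_repr))
```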

Now, once you're confident that you can write a literal to express any 
possible string, try calling your regex.
     print re.sub(a, b, c)

or whatever.

Now, one way to cheat on the string if you know you'll want to put 
actual backslashes is to use the raw string. That works quite well 
unless you want the string to end with a single backslash (or any odd 
number of them).  There isn't a way to enter that as a single raw 
literal.  You'd have to do something like
      a = r"strange\literal\with\some\stuff" + "\\"

My understanding is that no valid regex ends with a lone, unescaped 
backslash, so this may not affect you.
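A small sketch of the raw-string limitation, assuming nothing beyond the 
builtin compile() function; the names a, b, two and raw_fails are just 
for illustration.  It builds the same string two ways and then checks 
that a raw literal ending in a single backslash is rejected.

```python
# A raw literal cannot end in an odd number of backslashes, so the last
# one is added from an ordinary (escaped) literal.
a = r"strange\literal\with\some\stuff" + "\\"
b = "strange\\literal\\with\\some\\stuff\\"   # same string, fully escaped
print(a == b)

# An even number of trailing backslashes is accepted in a raw literal.
two = r"ends with two\\"
print(len(two) - len(two.rstrip("\\")))       # number of trailing backslashes

# Compiling the source text  x = r"abc\"  shows that the literal is rejected.
try:
    compile('x = r"abc\\"', "<example>", "exec")
    raw_fails = False
except SyntaxError:
    raw_fails = True
print(raw_fails)
```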

Now there are other ways to acquire a string object. If you got it from 
a raw_input() call, it doesn't need to be escaped, but it can't have an 
embedded newline, since the enter key is how the input is completed.  If 
you read it from a file, it doesn't need to be escaped.

Now you're ready to see what other funny requirements regex needs.  You 
will be escaping stuff for their purposes, and sometimes that means your 
literal might have 4 or even more backslashes in a row.  But hopefully 
now you'll see how to separate the different problems.
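For what it's worth, here is one sketch of how the layers might stack up 
for the original question, offered as an illustration rather than the 
definitive answer.  It assumes raw strings for both the pattern and the 
replacement, so only the regex layer of escaping remains; the names text, 
pattern, repl and add_backslash are made up for the example.

```python
import re

text = 'Silly " quote and a \\ backslash'   # holds one quote and one backslash

# Pattern: a character class containing backslash and double quote.
# The regex engine wants \\ for one literal backslash.
pattern = r'([\\"])'

# Replacement template: \\ is one literal backslash, \1 is the matched character.
repl = r'\\\1'

result = re.sub(pattern, repl, text)
print(result)          # the actual characters in the result
print(repr(result))    # the form the interactive prompt would display

# A replacement function returns the new text directly, with no template escapes.
def add_backslash(m):
    return "\\" + m.group(1)

same = re.sub(pattern, add_backslash, text) == result
print(same)
```

Comparing the print and repr lines side by side separates the two 
problems: what the regex did, and how the result is being displayed.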
-- 

DaveA



