[Tutor] RE module is working ?

Steven D'Aprano steve at pearwood.info
Fri Feb 4 02:36:22 CET 2011


Karim wrote:

>>> *Indeed what's the matter with RE module!?*
>> You should really fix the problem with your email program first;
> Thunderbird issue with bold type (appears as stars) but I don't know how 
> to fix it yet.

A man went to a doctor and said, "Doctor, every time I do this, it 
hurts. What should I do?"

The doctor replied, "Then stop doing that!"

:)

Don't add bold or any other formatting to things which should be program 
code. Even if it looks okay in *your* program, you don't know how it 
will look in other people's programs. If you need to draw attention to 
something in a line of code, add a comment, or talk about it in the 
surrounding text.


[...]
> That is not the thing I want. I want to escape any " which are not 
> already escaped.
> The sed regex  '/\([^\\]\)\?"/\1\\"/g' is exactly what I need (I have 
> made regex on unix since 15 years).

Which regex? Perl regexes? sed or awk regexes? Extended regexes? GNU 
posix compliant regexes? grep or egrep regexes? They're all different.

In any case, I am sorry, I don't think your regex does what you say. 
When I try it, it doesn't work for me.

[steve at sylar ~]$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

I wouldn't expect it to work. See below.

By the way, you don't need to escape the brackets or the question mark:

[steve at sylar ~]$ echo 'Some \"text"' | sed -re 's/([^\\])?"/\1\\"/g'
Some \\"text\"


> For me the equivalent python regex is buggy: r'([^\\])?"', r'\1\\"'

No it is not.

The pattern you are matching does not do what you think it does. "Zero 
or one of not-backslash, followed by a quote" will match a double-quote 
character *regardless* of what is before it, because the optional part 
is allowed to match nothing at all. This is true even in sed: as you 
can see above, your sed regex matches both quotes.

The quote in \" still matches, because the regular expression can match 
zero characters, followed by a quote. So the Python regex is behaving 
correctly.

 >>> match = r'[^\\]?"'  # zero or one not-backslash followed by quote
 >>> re.search(match, r'aaa\"aaa').group()
'"'
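To see every match rather than only the first, here is a minimal sketch 
using re.findall with the same ungrouped pattern. The sample string is 
the one from the sed example above; the variable names are my own.

```python
import re

# Zero or one non-backslash character, followed by a double quote.
pattern = r'[^\\]?"'

# Same text as the sed example: one escaped quote, one plain quote.
sample = r'Some \"text"'

# findall returns the full text of each non-overlapping match.
matches = re.findall(pattern, sample)
print(matches)
```

Both quotes are found: the first as a bare quote (the optional part 
matched nothing), the second together with the "t" in front of it.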

Now watch what happens when you call re.sub:


 >>> match = r'([^\\])?"'  # group 1 equals a single non-backslash
 >>> replace = r'\1\\"'  # group 1 followed by \ followed by "
 >>> re.sub(match, replace, 'aaaa')  # no matches
'aaaa'
 >>> re.sub(match, replace, 'aa"aa')  # one match
'aa\\"aa'
 >>> re.sub(match, replace, '"aaaa')  # one match, but there's no group 1
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python3.1/re.py", line 166, in sub
     return _compile(pattern, flags).sub(repl, string, count)
   File "/usr/local/lib/python3.1/re.py", line 303, in filter
     return sre_parse.expand_template(template, match)
   File "/usr/local/lib/python3.1/sre_parse.py", line 807, in expand_template
     raise error("unmatched group")
sre_constants.error: unmatched group

Because group 1 was never matched, Python's re.sub raised an error. It 
is not a very informative error, but it is valid behaviour.
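If you want to keep the optional group anyway, one way to sidestep the 
exception is to pass a function as the replacement and handle the 
missing group yourself. This is only a sketch; add_backslash is a name 
I made up for illustration. (As far as I know, Python 3.5 and later 
substitute an empty string for an unmatched group instead of raising, 
so the traceback above is specific to older versions.)

```python
import re

pattern = r'([^\\])?"'  # group 1 is optional

def add_backslash(m):
    """Rebuild the match with a backslash in front of the quote."""
    # m.group(1) is None when the optional group took no part in the match.
    prefix = m.group(1) or ''
    # A string returned from a replacement function is used literally.
    return prefix + '\\"'

result = re.sub(pattern, add_backslash, '"aaaa')
print(result)
```

This avoids the error, but it does not change which quotes the pattern 
matches.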

If I try the same thing in sed, I get something different:

[steve at sylar ~]$ echo '"Some text' | sed -re 's/([^\\])?"/\1\\"/g'
\"Some text

It looks like this version of sed defines backreferences on the 
right-hand side to be the empty string, in the case that they don't 
match at all. But this is not standard behaviour. The sed FAQs say that 
this behaviour will depend on the version of sed you are using:

"Seds differ in how they treat invalid backreferences where no 
corresponding group occurs."

http://sed.sourceforge.net/sedfaq3.html

So you can't rely on this feature. If it works for you, great, but it 
may not work for other people.


If you delete the ? from the Python regex, group 1 takes part in every 
match, and you don't get an exception. Likewise, if you keep the ? but 
make sure every quote in the input follows a non-backslash character, 
there is no exception:

 >>> match = r'([^\\])?"'
 >>> replace = r'\1\\"'
 >>> re.sub(match, replace, 'a"a"a"a') # group 1 always matches
'a\\"a\\"a\\"a'

(It still won't do what you want, but that's a *different* problem.)
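To make that different problem concrete, here is a small sketch with 
the ? removed. The two input strings are my own examples, chosen to 
show quotes that the pattern skips.

```python
import re

match = r'([^\\])"'   # the ? removed: group 1 is now mandatory
replace = r'\1\\"'    # group 1 followed by \ followed by "

# A quote at the very start has no character before it, so no match.
leading = re.sub(match, replace, '"aaaa')
print(leading)

# The first quote is consumed as part of the match 'a"', so it is not
# available to serve as the character in front of the second quote.
doubled = re.sub(match, replace, 'a""a')
print(doubled)
```

No exception this time, but the leading quote and the second of two 
adjacent quotes are left unescaped.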



Jamie Zawinski wrote:

   Some people, when confronted with a problem, think "I know,
   I'll use regular expressions." Now they have two problems.

How many hours have you spent trying to solve this problem using 
regexes? This is a *tiny* problem that requires an easy solution, not 
wrestling with a programming language that looks like line-noise.

This should do what you ask for:

def escape(text):
     """Escape any double-quote characters if and only if they
     aren't already escaped."""
     output = []
     escaped = False
     for c in text:
         if c == '"' and not escaped:
             output.append('\\')
         elif c == '\\':
             output.append('\\')
             escaped = True
             continue
         output.append(c)
         escaped = False
     return ''.join(output)


Armed with this helper function, which took me two minutes to write, I 
can do this:

 >>> text = 'Some text with backslash-quotes \\" and plain quotes " together.'
 >>> print(escape(text))
Some text with backslash-quotes \" and plain quotes \" together.
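A few edge cases are worth checking as well. The sketch below repeats 
the helper so that it runs on its own; the test inputs are mine, not 
from the original thread.

```python
def escape(text):
    """Same logic as the helper above, repeated so this runs standalone."""
    output = []
    escaped = False
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        elif c == '\\':
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)

# Illustrative inputs: empty string, bare quote, escaped quote,
# adjacent quotes, doubled backslash before a quote, ordinary text.
cases = ['', '"', '\\"', '""', '\\\\"', 'say "hi"']
results = {s: escape(s) for s in cases}
for source, escaped_text in results.items():
    print(repr(source), '->', repr(escaped_text))
```

Note that a quote directly after any backslash is treated as already 
escaped, even after a doubled backslash. Whether that is what you want 
depends on what the backslashes mean in your input. For these inputs, 
running escape a second time changes nothing.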


Most problems that people turn to regexes for are best solved without 
regexes. Even Larry Wall, inventor of Perl, is dissatisfied with regex 
culture and syntax:

http://dev.perl.org/perl6/doc/design/apo/A05.html



-- 
Steven

