[Tutor] regex: don't match embedded quotes

Albert-Jan Roskam fomcl at yahoo.com
Tue Jun 11 12:56:34 CEST 2013


Hi,
 
I have written a regex that is supposed to match correctly quoted text (single quotes on both sides, or double quotes on both sides). It works, but it also matches quoted text that is embedded inside another quoted string, which I don't want.
I think I need to change the 'comment' group so that it refers back to 'quote' and excludes only the quote character that opened the match, so that the other kind of quote is still allowed inside. Background: I am playing around with this to see how hard it would be to write my own Pygments lexer, which I could then also use in the IPython notebook.
 
>>> import re
>>> s = "some enumeration 1 'test' 2 'blah' 3 'difficult \"One\"'."
>>> matches = re.finditer("(?P<quote>['\"])(?P<comment>[^'\"]*)(?P=quote)", s, re.DEBUG)
subpattern 1
  in
    literal 39
    literal 34
subpattern 2
  max_repeat 0 65535
    in
      negate None
      literal 39
      literal 34
groupref 1 
# follow-up to a previous thread about splitting on punctuation: I have no idea how the output of re.DEBUG could help me improve my regex.
>>> [match.group("comment") for match in matches]
['test', 'blah', 'One']  # I do not want to match "One"
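
One direction that might work, as a rough sketch rather than a finished answer: replace the character class in 'comment' with a negative lookahead on the 'quote' backreference, so that only the opening quote character ends the match. The names `pattern` and `comments` are just placeholders for this example, and it makes no attempt to handle backslash-escaped quotes or quoted text that spans several lines (`.` does not match a newline by default).

```python
import re

s = "some enumeration 1 'test' 2 'blah' 3 'difficult \"One\"'."

# (?!(?P=quote)). consumes one character at a time, but only while the next
# character is not the quote sign captured by the 'quote' group. The other
# kind of quote is therefore allowed inside 'comment'.
pattern = re.compile(
    r"""(?P<quote>['"])(?P<comment>(?:(?!(?P=quote)).)*)(?P=quote)"""
)

comments = [m.group("comment") for m in pattern.finditer(s)]
# comments == ['test', 'blah', 'difficult "One"']
```

Because finditer resumes scanning after the closing quote of each match, the inner "One" is consumed as part of the third match and is not reported separately.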

Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, a 
fresh water system, and public health, what have the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

