Making regex suck less

Harvey Thomas hst at empolis.co.uk
Mon Sep 2 07:06:08 EDT 2002


John La Rooy wrote
> 
> Carl Banks wrote:
> > Gerhard Häring wrote:
> > 
> >>>which means the real time is not spent in the compile() function, but
> >>>in the match or find function. So basically, couldn't one come up with
> >>>a *human readable* syntax for re, and compile that instead?
> >>
> >>That's equally powerful? Most probably not.
> > 
> > 
> > Why not?  It won't be as fast, but it should be able to do anything a
> > regexp can do, and would be much more versatile.
> > 
> >
> I think the main problem is that *human readable* doesn't map really 
> well onto regular expressions.
> 
> What would be the equivalent of r"(.)(.)(.)\3\2\1"?
> This means a "palindrome of 6 characters".
> But it is unlikely that the human readable processor would understand 
> that (isn't it??)
> 
> It would be more likely to look like this (I haven't put too much 
> thought into this)
> "anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
> or would you like to suggest something else?
> 
> palindrome_6 = re.compile(r"(.)(.)(.)\3\2\1")
> palindrome_6 = re.compile("anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st")
> 
> Sure there are some cases where the re is loaded with meta 
> characters...
> 
> hmmm
> OK is this about writing maintainable code or people not wanting to 
> learn all the ins and outs of re's?
> 
> John

I used to use OmniMark a lot when it was free. With OmniMark's
equivalent of REs, the palindrome would be

any => char1 any => char2 any => char3  char3 char2 char1
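For comparison, Python's re module can get part of the way there with named groups. This is only a minimal sketch: the group names char1 to char3 are borrowed from the OmniMark pattern, palindrome_6 is the name from John's post, and the test strings are made up for illustration.

```python
import re

# Named groups play the role of OmniMark's "=> char1" captures;
# (?P=name) is a backreference to whatever that group matched.
palindrome_6 = re.compile(
    r"(?P<char1>.)(?P<char2>.)(?P<char3>.)"
    r"(?P=char3)(?P=char2)(?P=char1)"
)

print(bool(palindrome_6.fullmatch("abccba")))  # True
print(bool(palindrome_6.fullmatch("abcabc")))  # False
```

It is wordier than the numbered form, but the backreferences say which character they repeat.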

I selected a non-trivial OmniMark RE from old code at random and came up with

('<!DOCTYPE' white-space+ [any except white-space]+ white-space+ "PUBLIC" white-space+ '"'
    upto-inc('"')) => a.whole (white-space+ "SYSTEM"? white-space* '"'
    upto-inc('"'))?

The (untested) Python RE equivalent is something like

"""(?P<a.whole><!DOCTYPE\s+[^\n]+\s+"PUBLIC"\s+"[^"]+")(?:\s+(?:"SYSTEM")?\s+"[^"]*")?"""

or, more readably if compiled with the re.VERBOSE flag

"""
(?P<a.whole>
   <!DOCTYPE
   \s+[^\n]+
   \s+"PUBLIC"
   \s+"[^"]+")
(?:\s+(?:"SYSTEM")?
\s+"[^"]*")?
"""


Which is the easiest to understand?

I'm used to REs, so I don't find the verbose Python RE too difficult to read. When I was
learning OmniMark, however, it was nice to be able to use "digit" and "letter" rather than
"\d" and "[a-zA-Z]", as creating non-trivial, effective search patterns is never easy.

