Making regex suck less
Harvey Thomas
hst at empolis.co.uk
Mon Sep 2 07:06:08 EDT 2002
John La Rooy wrote
>
> Carl Banks wrote:
> > Gerhard Häring wrote:
> >
> >>>which means the real time is not spent in the compile() function, but
> >>>in the match or find function. So basically, couldn't one come up with
> >>>a *human readable* syntax for re, and compile that instead?
> >>
> >>That's equally powerful? Most probably not.
> >
> >
> > Why not? It won't be as fast, but it should be able to do anything a
> > regexp can do, and would be much more versatile.
> >
> >
> I think the main problem is that *human readable* doesn't map really
> well onto regular expressions.
>
> What would the equivalent of r"(.)(.)(.)\3\2\1" be?
> This means a "palindrome of 6 characters"
> But it is unlikely that the human readable processor would understand
> that (isn't it??)
>
> It would be more likely to look like this (I haven't put too much
> thought into this)
> "anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
> or would you like to suggest something else?
>
> palindrome_6 = re.compile(r"(.)(.)(.)\3\2\1")
> palindrome_6 = re.compile("anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st")
>
> Sure there are some cases where the re is loaded with meta characters...
>
> hmmm
> OK is this about writing maintainable code or people not wanting to
> learn all the ins and outs of re's?
>
> John
I used to use OmniMark a lot when it was free. With OmniMark's
equivalent of REs, the palindrome would be
any => char1 any => char2 any => char3 char3 char2 char1
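For comparison, Python's named groups get fairly close to that form. A minimal sketch, where the group names char1 to char3 are borrowed from the OmniMark line and PALINDROME_6 and find_palindrome_6 are just illustrative names:

```python
import re

# Named groups play the role of OmniMark's "=> char1" captures;
# (?P=name) is a backreference to whatever that group matched.
PALINDROME_6 = re.compile(
    r"(?P<char1>.)(?P<char2>.)(?P<char3>.)"
    r"(?P=char3)(?P=char2)(?P=char1)"
)


def find_palindrome_6(text):
    """Return the first 6-character palindrome in text, or None."""
    match = PALINDROME_6.search(text)
    return match.group(0) if match else None
```

It is wordier than the numbered-group version, but the backreferences no longer depend on counting parentheses.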
I selected a non-trivial OmniMark RE from old code at random and came up with
('<!DOCTYPE' white-space+ [any except white-space]+ white-space+ "PUBLIC" white-space+ '"'
upto-inc('"')) => a.whole (white-space+ "SYSTEM"? white-space* '"'
upto-inc('"'))?
The (untested) Python RE equivalent is something like (with a_whole as the
group name, since Python group names cannot contain a dot)
r"""(?P<a_whole><!DOCTYPE\s+\S+\s+PUBLIC\s+"[^"]+")(?:\s+(?:SYSTEM)?\s*"[^"]*")?"""
or, more readably if compiled with the re.VERBOSE flag
r"""
(?P<a_whole>
<!DOCTYPE
\s+\S+
\s+PUBLIC
\s+"[^"]+")
(?:\s+(?:SYSTEM)?
\s*"[^"]*")?
"""
Which is the easiest to understand?
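To make the verbose form concrete, here is a runnable sketch. It rests on a few assumptions of mine: the group is called a_whole (Python group names must be valid identifiers), PUBLIC and SYSTEM are treated as bare keywords the way they appear in a real DOCTYPE declaration, and DOCTYPE_RE and public_part are illustrative names only.

```python
import re

# Assumption: a_whole stands in for OmniMark's a.whole, because a
# Python group name cannot contain a dot.
DOCTYPE_RE = re.compile(r"""
    (?P<a_whole>
        <!DOCTYPE
        \s+ \S+            # document type name (no white space)
        \s+ PUBLIC
        \s+ "[^"]+"        # quoted public identifier
    )
    (?:
        \s+ (?:SYSTEM)?
        \s* "[^"]*"        # optional quoted system identifier
    )?
""", re.VERBOSE)


def public_part(declaration):
    """Return the text captured as a_whole, or None if there is no match."""
    match = DOCTYPE_RE.match(declaration)
    return match.group("a_whole") if match else None
```

With re.VERBOSE the white space and the # comments inside the pattern are ignored, so the layout can follow the structure of the OmniMark pattern fairly closely.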
I'm used to REs, so I don't find the verbose Python RE too difficult to read. When I was learning
OmniMark, however, it was nice to be able to use "digit" and "letter" rather than
"\d" and "[a-zA-Z]", as creating effective non-trivial search patterns is never easy.