regexp questions

Mon Jan 21 19:51:02 EST 2002

jmike at alum.mit.edu (J. Michael Hammond) wrote in message news:<f9290115.0201211037.1e68321f at posting.google.com>...
> On a whim, I grabbed the regular expression test stuff from glibc and
> am trying to translate it and throw it at Python 1.5.2's 're' package

Doesn't really affect your problems, but any good reason not to be
using version 2.2?

> to see what it does with it.  Some of the results are a little
> confusing and I want to find out what the story is.
> 
> When I say "matches in the range (x,y)" I mean according to Python's
> convention for numbering substrings.  So, for example, the regular
> expression "abc" matches the string "xxxabcxxx" in the range (3,6). 
> OK?
> 
> 
> First off I have a question about backslashing.  
> 
> Some of the time, Python seems to handle backslashes the same way as
> C.  For example, the C test suite expects the regular expression "\\."
> to match the string "a.c" in the range (1,2).  Python works the same
> way:
> 
> >>> r = re.compile("\\.")
> >>> m = r.search("a.c")
> >>> m.start(0)
> 1
> >>> m.end(0)
> 2
> 
> OK, good, I think I understand how Python deals with backslashes.

What is happening under the covers: the pattern wants to match a
literal dot. As dot is special in regexes, you need to escape it with
\. As \ is special to both the C and Python lexers, you need to escape
it in turn with another \; OR, in Python, use the r"" notation, which
in this case is
r = re.compile(r"\.")

You can save some typing: m.span() returns (1, 2) in this case.
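
A quick sketch of both points, assuming a current Python 3 rather than
the 1.5.2 in the question (the variable names are mine, just for
illustration):

```python
import re

# Both spellings give the same two-character pattern: backslash, dot.
escaped = "\\."
raw = r"\."
same_pattern = (escaped == raw)

# span() returns (start, end) in a single call.
m = re.compile(raw).search("a.c")
span = m.span()
print(same_pattern, len(raw), span)
```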

> 
> But here's a problem:  the C test suite expects the regular expression
> "[\\]" to match the string "a\\c" in the range (1,2).  I guess "[\\]"
> turns into the string bracket-backslash-bracket, which means "match a
> backslash".  OK, I get it.  But Python doesn't seem to:
> 
> >>> r = re.compile("[\\]")
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
>   File "/usr/local/site/lib/python1.5/re.py", line 79, in compile
>     code=pcre_compile(pattern, flags, groupindex)
> pcre.error: ('missing terminating ] for character class', 3)

Yes. Once the lexers are done, both regexers see [\]. The C regexer
seems to be treating the \ literally, but Python's re treats it as
escaping the ] (making it non-special so that it will match a literal
]), so the character class is never closed; hence the message you got.
Presumably the C regexer lets you have a literal ] inside a character
class only by putting it at the beginning: "[]a-z]" matches a
lowercase letter or a right square bracket.
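
Here is roughly how that plays out on the Python side. This is a
sketch assuming a current Python 3, where the exception is re.error
rather than the pcre.error shown in the traceback; I deliberately do
not rely on the exact wording of the message.

```python
import re

# After the Python lexer, "[\\]" is the three characters [ \ ] and the
# backslash escapes the ], so the class is never closed.
try:
    re.compile("[\\]")
    unterminated_error = None
except re.error as exc:
    unterminated_error = exc

# A backslash inside a class has to be escaped at the regex level too.
backslash_span = re.search(r"[\\]", "a\\c").span()

# A ] placed first in the class is taken literally by Python's re.
bracket_span = re.search(r"[]a-z]", "X]Y").span()
letter_span = re.search(r"[]a-z]", "XYz").span()
print(unterminated_error, backslash_span, bracket_span, letter_span)
```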

> Yikes, it looks like I'm going to have to try to psyche out an
> intermediate level of processing here by "double double backslashing"
> the input string.  Sure enough:
> 
> >>> r = re.compile("[\\\\]")
> >>> m = r.search("a\\c")
> >>> m.start(0)
> 1
> >>> m.end(0)
> 2
> 
> But OK, if I had to double-double backslash it there, why didn't I
> have to double-double backslash it in the "\\." case?  If there are
> two levels of interpretation going on, then "\\." should have turned
> into backslash-period, which "at the second level of interpretation"
> should have, I don't know, turned into just period maybe? which should
> then have matched the input string in the range (0,1).
> 
> So what's the story?  How can I tell when I'm supposed to feed in an
> "extra level" of backslashing?

(a) Don't; use the r"raw" notation in all re work in Python
(b) RTFM
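
To make (a) concrete: there are exactly two layers, the Python lexer
and then the regex compiler, and a raw string takes the first one out
of the picture so that what you type is what re sees. A small sketch
(current Python 3 assumed; the list name is mine):

```python
import re

# (ordinary literal, raw literal) pairs from this thread; each pair is
# the same string once the Python lexer has finished with it.
pairs = [
    ("\\.", r"\."),       # regex: escaped dot, matches a literal dot
    ("[\\]", r"[\]"),     # regex: backslash escapes ], class left open
    ("[\\\\]", r"[\\]"),  # regex: class containing one backslash
]
all_equal = all(plain == raw for plain, raw in pairs)

# The subject string "a\\c" is only three characters: a, backslash, c.
subject = "a\\c"
found = re.search(r"[\\]", subject).span()
print(all_equal, len(subject), found)
```

Read in raw form, the second pattern shows at a glance why it failed
and the first did not: \. is a complete regex escape, while \] eats
the bracket that was supposed to close the class.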

> Is it naive of me to expect regular
> expression parsing to be cut-and-dried enough for one library's test
> suite to work against another language?

Indeed. You should normally expect only vague similarities between
regex implementations, especially in areas such as what is special and
what has to be escaped [*]. Design considerations seem to include the
phase of the moon and the square root of the designer's grandparents'
mean collar-size. Python's re deliberately aims to be compatible with
Perl's equivalent but this principle has had to be abandoned in at
least one case where perl.moon_phase == "full" (I refer to the \Z
versus \z problem).
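
For the curious, the Python side of that difference looks like the
following (a sketch, current Python 3 assumed): Python's \Z matches
only at the very end of the string, whereas Perl's \Z also allows a
trailing newline, much as $ does here.

```python
import re

text = "abc\n"

# $ matches at the end of the string or just before a final newline.
dollar = re.search(r"abc$", text)

# \Z matches only at the very end, so the trailing newline blocks it.
strict = re.search(r"abc\Z", text)

# Consuming the newline explicitly lets \Z succeed.
strict_full = re.search(r"abc\n\Z", text)
print(dollar, strict, strict_full)
```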

[*] You might like to look at chapter 3 of "Mastering Regular
Expressions", by Jeffrey E. F. Friedl -- published by O'Reilly -- this
expounds at length on the awful dissimilarity of features provided by
various regexers. The whole book is generally recommended to those who
want/need to dip more than one toe into the regex swamp.

> Am I wading five years too
> late into a debate that has already been opened,
> agreed-upon-to-disagree-about, and dropped?  help!

More than five.

> 
> Thanks,
>    --JMike

HTH,
John