regexp questions

Mon Jan 21 13:37:39 EST 2002

On a whim, I grabbed the regular expression test suite from glibc and
am trying to translate it and run it against Python 1.5.2's 're' module
to see how it behaves.  Some of the results are a little confusing, and
I want to find out what the story is.

When I say "matches in the range (x,y)" I mean Python's slice
convention: x is the index of the first matched character and y is one
past the last.  So, for example, the regular expression "abc" matches
the string "xxxabcxxx" in the range (3,6).  OK?
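
To pin the convention down, here is a minimal sketch.  It assumes a
current Python interpreter (the transcripts in this post are from
1.5.2), and the variable names are just for illustration.  It shows
that the reported range is the same pair of indices you would use to
slice the string, with the end index exclusive.

```python
import re

text = "xxxabcxxx"
m = re.search("abc", text)

# start(0)/end(0) give the range of the whole match.
span = (m.start(0), m.end(0))

# Slicing with the same pair of indices recovers the matched text,
# because the end index is exclusive.
matched = text[span[0]:span[1]]

print(span, matched)
```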

First off, I have a question about backslashing.

Some of the time, Python seems to handle backslashes the same way as
C.  For example, the C test suite expects the regular expression "\\."
to match the string "a.c" in the range (1,2).  Python works the same
way:

>>> r = re.compile("\\.")
>>> m = r.search("a.c")
>>> m.start(0)
1
>>> m.end(0)
2

OK, good, I think I understand how Python deals with backslashes.
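
A quick way to see what the regular expression engine actually
receives is to look at the string after Python has processed the
literal.  This is only a sketch, assuming a current Python interpreter;
'pattern' and 'chars' are names I made up for the example.

```python
import re

pattern = "\\."   # the source text has three characters between the quotes

# After the string literal is processed, only two characters remain:
# a single backslash followed by a period.
chars = list(pattern)
print(len(pattern), chars)

# The regex engine then reads backslash-period as "a literal period".
m = re.search(pattern, "a.c")
print(m.start(0), m.end(0))
```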

But here's a problem:  the C test suite expects the regular expression
"[\\]" to match the string "a\\c" in the range (1,2).  I guess "[\\]"
turns into the string bracket-backslash-bracket, which means "match a
backslash".  OK, I get it.  But Python doesn't seem to:

>>> r = re.compile("[\\]")
Traceback (innermost last):
  File "<stdin>", line 1, in ?
  File "/usr/local/site/lib/python1.5/re.py", line 79, in compile
    code=pcre_compile(pattern, flags, groupindex)
pcre.error: ('missing terminating ] for character class', 3)
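
Looking at the characters the engine receives suggests why the compile
fails.  A sketch, assuming a current Python interpreter, where the
exception is 're.error' rather than the 'pcre.error' shown above: the
engine gets bracket-backslash-bracket and, as far as I can tell, treats
the backslash as an escape for the closing bracket, so the character
class never ends.  POSIX bracket expressions, by contrast, treat a
backslash inside brackets as an ordinary character, which would explain
what the C test expects.

```python
import re

pattern = "[\\]"

# Three characters reach the regex engine: '[', a single backslash, ']'.
chars = list(pattern)
print(len(pattern), chars)

# Record whether the pattern compiles instead of letting the error escape.
try:
    re.compile(pattern)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("compile failed:", exc)
```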

Yikes, it looks like I'm going to have to psych out an intermediate
level of processing here by "double double backslashing" the pattern
string.  Sure enough:

>>> r = re.compile("[\\\\]")
>>> m = r.search("a\\c")
>>> m.start(0)
1
>>> m.end(0)
2
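
Raw string literals look like another way to write the same pattern
without the second round of doubling.  A sketch, assuming a current
Python interpreter; the names are illustrative.  Both spellings should
produce the same four-character pattern.

```python
import re

doubled = "[\\\\]"   # ordinary literal: four backslashes in the source
raw = r"[\\]"        # raw literal: backslashes are kept as typed

# Both are '[', backslash, backslash, ']' by the time the engine sees them.
same = (doubled == raw)
print(same, len(raw))

subject = "a\\c"     # three characters: a, backslash, c
m = re.search(raw, subject)
print(m.start(0), m.end(0))
```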

But OK, if I had to double-double backslash it there, why didn't I
have to do the same in the "\\." case?  If there are two levels of
interpretation going on, then "\\." should have turned into
backslash-period, which at the second level of interpretation should
have turned into (I'm guessing) just a period, which should then have
matched the input string in the range (0,1).
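
One way to test that guess is to compare the two patterns directly.  A
sketch, assuming a current Python interpreter: a bare period matches
any character and so matches at (0,1), while backslash-period matches
only a literal period at (1,2).  That suggests the second level does
not throw the backslash away; it reads backslash-period as a single
escaped character.

```python
import re

subject = "a.c"

bare = re.search(".", subject)       # period alone: any character
escaped = re.search("\\.", subject)  # backslash-period: a literal period

bare_span = (bare.start(0), bare.end(0))
escaped_span = (escaped.start(0), escaped.end(0))
print(bare_span, escaped_span)
```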

So what's the story?  How can I tell when I'm supposed to feed in an
"extra level" of backslashing?  Is it naive of me to expect regular
expression syntax to be cut-and-dried enough for one library's test
suite to work against another language's library?  Am I wading five
years too late into a debate that has already been opened,
agreed-upon-to-disagree-about, and dropped?  Help!

Thanks,
   --JMike