regex in python

Thu May 25 08:30:57 EDT 2006

On 25/05/2006 7:58 PM, gisleyt wrote:
> I'm trying to compile a perfectly valid regex, but get the error
> message:
> 
>  r =
> re.compile(r'([^\d]*)(\d{1,3}\.\d{0,2})?(\d*)(\,\d{1,3}\.\d{0,2})?(\d*)?.*')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.3/sre.py", line 179, in compile
>     return _compile(pattern, flags)
>   File "/usr/lib/python2.3/sre.py", line 230, in _compile
>     raise error, v # invalid expression
> sre_constants.error: nothing to repeat
> 
> What does this mean? I know that the regex
> ([^\d]*)(\d{1,3}\.\d{0,2})?(\d*)(\,\d{1,3}\.\d{0,2})?(\d*)?.*
> is valid because i'm able to use it in Regex Coach.

Say what??? From the Regex Coach website:
(1) "can be used to experiment with (Perl-compatible) regular expressions"
(2) "PCRE (which is used by projects like Python" -- once upon a time, 
way back in the dream-time, when the world was young, ...

The problem is this little snippet near the end of your regex:

 >>> re.compile(r'(\d*)?')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "C:\Python24\lib\sre.py", line 180, in compile
     return _compile(pattern, flags)
   File "C:\Python24\lib\sre.py", line 227, in _compile
     raise error, v # invalid expression
sre_constants.error: nothing to repeat

The message is a little cryptic; it should be something like "a repeat 
operator has an operand which may match nothing". In other words, you 
have said X? (optional occurrence of X) *BUT* X can already match a 
zero-length string. X in this case is (\d*).

This is a theoretically valid regex, but it's equivalent to just plain 
X, and leaves the reader (and the re implementors, obviously) wondering 
whether you (a) have made a typo (b) are a member of the re 
implementation quality assurance inspectorate or (c) just plain confused :-)
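
A minimal sketch of that point: dropping the redundant ? leaves a pattern 
that compiles and still matches an empty run of digits. Whether your 
interpreter accepts the original (\d*)? depends on its version (the 2.3 and 
2.4 shown in this thread reject it; later releases may well accept it), so 
the sketch only records what happens rather than assuming either outcome. 
The names fixed and status are made up for illustration.

```python
import re

# (\d*) can already match the empty string, so a trailing ? adds nothing.
fixed = re.compile(r'(\d*)')

print(repr(fixed.match('123abc').group(1)))   # leading run of digits
print(repr(fixed.match('abc').group(1)))      # no digits: empty string, still a match

# Version-dependent: some interpreters reject the original form outright.
try:
    re.compile(r'(\d*)?')
    status = 'accepted'
except re.error:
    status = 'rejected'
print(status)
```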

BTW, reading your regex was making my eyes bleed, so I did this to find 
out which piece was the problem:

import re
pat0 = r'([^\d]*)(\d{1,3}\.\d{0,2})?(\d*)(\,\d{1,3}\.\d{0,2})?(\d*)?.*'
pat1 = r'([^\d]*)'
pat2 =         r'(\d{1,3}\.\d{0,2})?'
pat3 =                            r'(\d*)'
pat4 =                                 r'(\,\d{1,3}\.\d{0,2})?'
pat5 =                                                      r'(\d*)?.*'
# The last number printed before the traceback identifies the bad piece.
for k, pat in enumerate([pat1, pat2, pat3, pat4, pat5]):
     print k+1
     re.compile(pat)
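
The same bisection can be wrapped in a small helper that catches the error 
instead of dying on it. This is only a sketch: first_bad_piece is a made-up 
name, and it assumes the pattern has already been split into pieces that 
are each meant to be valid on their own. The last piece in the usage lines 
is deliberately broken so that the result does not depend on which Python 
version you run.

```python
import re

def first_bad_piece(pieces):
    """Return (1-based index, piece) for the first piece that fails
    to compile, or None if every piece compiles."""
    for k, piece in enumerate(pieces):
        try:
            re.compile(piece)
        except re.error:
            return k + 1, piece
    return None

# Hypothetical pieces; the third has an unclosed parenthesis.
pieces = [r'([^\d]*)', r'(\d{1,3}\.\d{0,2})?', r'(\d+']
print(first_bad_piece(pieces))
```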

> But is Python's
> regex syntax different that an ordinary syntax?

Python aims to lift itself above the ordinary :-)

> 
> By the way, i'm using it to normalise strings like:
> 
> London|country/uk/region/europe/geocoord/32.3244,42,1221244
> to:
> London|country/uk/region/europe/geocoord/32.32,42,12
> 
> By using \1\2\4 as replace. I'm open for other suggestions to achieve
> this!
> 

Well, you are just about on the right track. You need to avoid the 
eye-bleed (by using VERBOSE patterns), use test data that doesn't have 
typos in it, and use more of it. You may like to roll your own test 
harness, in *Python*, for *Python* regexes, like the following:

C:\junk>type re_demo.py
import re

tests = [
     ["AA222.22333,444.44555FF", "AA222.22,444.44"],
     ["foo/geocoord/32.3244,42.1221244", "foo/geocoord/32.32,42.12"], # what you meant
     ["foo/geocoord/32.3244,42,1221244", "foo/geocoord/32.32,42,12"], # what you posted
     ]

pat0 = r'([^\d]*)(\d{1,3}\.\d{0,2})?(\d*)(\,\d{1,3}\.\d{0,2})?(\d*)?.*'
patx = r"""
     ([^\d]*)               # Grp 1: zero/more non-digits
     (\d{1,3}\.\d{0,2})?    # Grp 2: 1-3 digits, a dot, 0-2 digits (optional)
     (\d*)                  # Grp 3: zero/more digits
     (\,\d{1,3}\.\d{0,2})?  # Grp 4: like grp 2 with comma in front (optional)
     (\d*)                  # Grp 5: zero/more digits
     (.*)                   # Grp 6: any old rubbish
     """

rx = re.compile(patx, re.VERBOSE)
for testin, expected in tests:
     print "\ntestin:", testin
     mobj = rx.match(testin)
     if not mobj:
         print "no match"
         continue
     for k, grp in enumerate(mobj.groups()):
         print "Group %d matched %r" % (k+1, grp)
     actual = rx.sub(r"\1\2\4", testin)
     print "expected: %r; actual: %r; same: %r" % (expected, actual,
                                                    expected == actual)

C:\junk>re_demo.py

testin: AA222.22333,444.44555FF
Group 1 matched 'AA'
Group 2 matched '222.22'
Group 3 matched '333'
Group 4 matched ',444.44'
Group 5 matched '555'
Group 6 matched 'FF'
expected: 'AA222.22,444.44'; actual: 'AA222.22,444.44'; same: True

testin: foo/geocoord/32.3244,42.1221244
Group 1 matched 'foo/geocoord/'
Group 2 matched '32.32'
Group 3 matched '44'
Group 4 matched ',42.12'
Group 5 matched '21244'
Group 6 matched ''
expected: 'foo/geocoord/32.32,42.12'; actual: 'foo/geocoord/32.32,42.12'; same: True

testin: foo/geocoord/32.3244,42,1221244
Group 1 matched 'foo/geocoord/'
Group 2 matched '32.32'
Group 3 matched '44'
Group 4 matched None
Group 5 matched ''
Group 6 matched ',42,1221244'
Traceback (most recent call last):
   File "C:\junk\re_demo.py", line 28, in ?
     actual = rx.sub(r"\1\2\4", testin)
   File "C:\Python24\lib\sre.py", line 260, in filter
     return sre_parse.expand_template(template, match)
   File "C:\Python24\lib\sre_parse.py", line 782, in expand_template
     raise error, "unmatched group"
sre_constants.error: unmatched group

===
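
The "unmatched group" error arises because group 4 took no part in the 
match, and sub() in the versions shown here will not expand \4 in that 
case. One way around it, sketched below, is to pass a function instead of 
a template and treat an unmatched group as an empty string. The names 
keep_1_2_4 and normalise are hypothetical; the pattern is the one from 
re_demo.py, minus the unneeded backslash before the comma. Note that this 
only stops the exception: the input with the typo still loses everything 
after the first number, because the pattern cannot match ",42,".

```python
import re

rx = re.compile(r"""
    ([^\d]*)               # Grp 1: zero/more non-digits
    (\d{1,3}\.\d{0,2})?    # Grp 2: 1-3 digits, a dot, 0-2 digits (optional)
    (\d*)                  # Grp 3: zero/more digits
    (,\d{1,3}\.\d{0,2})?   # Grp 4: like grp 2 with comma in front (optional)
    (\d*)                  # Grp 5: zero/more digits
    (.*)                   # Grp 6: any old rubbish
    """, re.VERBOSE)

def keep_1_2_4(mobj):
    # group(k) is None when group k did not take part in the match;
    # substitute an empty string instead of raising.
    return ''.join([mobj.group(k) or '' for k in (1, 2, 4)])

def normalise(text):
    return rx.sub(keep_1_2_4, text)

print(normalise("foo/geocoord/32.3244,42.1221244"))   # what you meant
print(normalise("foo/geocoord/32.3244,42,1221244"))   # what you posted
```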

HTH,
John