[Tutor] superscripts in a regex

Wed Jul 31 15:28:08 CEST 2013

----- Original Message -----
> From: Peter Otten <__peter__ at web.de>
> To: tutor at python.org
> Cc: 
> Sent: Wednesday, July 31, 2013 2:24 PM
> Subject: Re: [Tutor] superscripts in a regex
> 
> Albert-Jan Roskam wrote:
> 
>> In the script below I want to filter out the digits and I do not want to
>> retain the decimal grouping symbol, if there are any. The weird thing is
>> that re.findall returns the expected result (group 1 with digits and
>> optionally group2 too), but re.sub does not (it just returns the entire
>> string). I tried using flags re.LOCALE, re.UNICODE, and re.DEBUG for
>> solutions/clues, but no luck
> 
>> regex = "(^\d+)[.,]?(\d*)[ \w]+"
>> surfaces = ["79 m\xb2", "1.000 m\xb2", "2,000 m\xb2"]
> 
>> print re.sub(regex, r"\1\2", surface)  # huh?!
>> print re.findall(regex, surface)  # works as expected
> 
> Instead of "huh?!" I would have appreciated a simple
> 
> Did... Expected... But got... Why?

Ok, sorry. Expected: one or two groups of digits. If a grouping symbol is present, there would be two non-empty groups, else one.

>> It's a no-no to ask this (esp. because it concerns a builtin) but: is this
>> a b-u-g?
> 
> No bug. Let's remove all the noise from your exposition. Then we get
> 
>>>> re.sub("(a+)b?(c+)d*", r"\1\2", "aaaabccdddeee")
> 'aaaacceee'
>>>> re.findall("(a+)b?(c+)d*", "aaaabccdddeee")
> [('aaaa', 'cc')]
> 
> The 'e's are left alone as they are not matched by the regexp. The fix
> is to include them in the bytes allowed after group #2:
> 
>>>> re.sub("(a+)b?(c+)[de]*", r"\1\2", "aaaabccdddeee")
> 'aaaacc'
> 
> Translating back to your regex, the byte "\xb2" is not matched by r"[ \w]":
> 
>>>> re.findall(r"[ \w]", "\xb2")
> []
> 
> Include it explicitly (why no $, by the way?)

No $ because of possible trailing blanks.
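
For what it's worth, here is a minimal sketch of what anchoring with $ would change. It assumes Python 3 and uses bytes patterns so that \w stays ASCII-only, as with the Python 2 byte strings in this thread. The names loose, anchored and fixed are made up for illustration. Since the character class already contains a blank, trailing blanks do not seem to get in the way of $:

```python
import re

# Bytes patterns keep \w ASCII-only, similar to Python 2 byte strings.
loose = rb"(^\d+)[.,]?(\d*)[ \w]+"         # original regex, no $
anchored = rb"(^\d+)[.,]?(\d*)[ \w]+$"     # same regex, anchored with $
fixed = rb"(^\d+)[.,]?(\d*)[ \w\xb2]+$"    # \xb2 allowed, anchored with $

surface = b"1.000 m\xb2  "  # note the trailing blanks

# Without $, the prefix matches and the unmatched tail is kept as-is.
loose_result = re.sub(loose, rb"\1\2", surface)

# With $, the regex cannot match at all, so the input comes back unchanged.
anchored_result = re.sub(anchored, rb"\1\2", surface)

# Once \xb2 is in the class, the blanks are consumed by the class as well.
fixed_result = re.sub(fixed, rb"\1\2", surface)

print(loose_result)
print(anchored_result)
print(fixed_result)
```

With the anchor, a character that the class does not cover makes the whole match fail instead of leaving a partial replacement behind, which may be easier to spot.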

> 
>>>> re.sub(r"(^\d+)[.,]?(\d*)[ \w\xb2]+", r"\1\2", "1.000 m\xb2")
> '1000'
> 
> or implicitly
> 
>>>> re.sub(r"(^\d+)[.,]?(\d*)\D+", r"\1\2", "1.000 m\xb2")
> '1000'
> 
> and you are golden.

Aaah, thank you so much! I had been staring at this for too long. It seemed so strange that the same regex would result in different groupings depending on whether re.sub or re.findall is used, but it doesn't, after all.
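
To make the point concrete, here is a small sketch, assuming Python 3 with bytes patterns standing in for the Python 2 byte strings (regex_fixed and results are names made up for this example). re.findall only reports the groups of each match, while re.sub replaces the matched part and leaves everything else in place, so the unmatched \xb2 survives:

```python
import re

regex = rb"(^\d+)[.,]?(\d*)[ \w]+"      # \xb2 is not matched by [ \w]
regex_fixed = rb"(^\d+)[.,]?(\d*)\D+"   # \D also covers \xb2
surfaces = [b"79 m\xb2", b"1.000 m\xb2", b"2,000 m\xb2"]

results = []
for surface in surfaces:
    groups = re.findall(regex, surface)               # groups of the match only
    replaced = re.sub(regex, rb"\1\2", surface)       # unmatched tail is kept
    cleaned = re.sub(regex_fixed, rb"\1\2", surface)  # tail is matched too
    results.append((groups, replaced, cleaned))
    print(groups, replaced, cleaned)
```

The groups are the same in both calls; only the text outside the match differs.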

> PS: I'll leave nudging you to use unicode instead of byte strings to someone
> else. Only so much (on a console using utf-8):
> 
>>>> re.findall("[¹]", "¹²³")
> ['\xc2', '\xb9', '\xc2', '\xc2']
>>>> print "".join(_)
> ¹��
> 
>>>> re.findall(u"[¹]", u"¹²³")
> [u'\xb9']
>>>> print _[0]
> ¹
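
A rough Python 3 equivalent of the comparison above, as a sketch only: str is the text type there and the UTF-8 bytes have to be produced explicitly. The escapes \u00b9, \u00b2 and \u00b3 stand for the three superscripts, and the variable names are made up for illustration.

```python
import re

text = "\u00b9\u00b2\u00b3"   # superscript one, two, three
pattern = "[\u00b9]"          # character class with superscript one

# Text pattern: the class holds a single character.
as_text = re.findall(pattern, text)

# UTF-8 bytes: the class holds the two bytes \xc2 and \xb9,
# and each of them is matched on its own.
as_bytes = re.findall(pattern.encode("utf-8"), text.encode("utf-8"))

print(ascii(as_text))
print(as_bytes)
```

The bytes result has the same four one-byte matches as the byte-string session above, three of which are only the first half of a character.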