possible bug in re expression?

MRAB python at mrabarnett.plus.com
Fri Apr 25 13:57:23 EDT 2014


On 2014-04-25 17:55, Chris Angelico wrote:
> On Sat, Apr 26, 2014 at 2:30 AM, Robin Becker <robin at reportlab.com> wrote:
>> Whilst translating some javascript code I find that this
>>
>> A=re.compile('.{1,+3}').findall(p)
>>
>> doesn't give any error, but doesn't manage to find the strings in p that I
>> want len(A)==>0, the correct translation should have been
>>
>> A=re.compile('.{1,3}').findall(p)
>>
>> which works fine.
>>
>> should
>>
>> re.compile('.{1,+3}')
>>
>> raise an error? It doesn't on python 2.7 or 3.3.
>
> I would say the surprising part is that your js code doesn't mind an
> extraneous character in the regex. In a brace like that, negative
> numbers have no meaning, so I would expect the definition of the regex
> to look for digits, not "anything that can be parsed as a number". So
> you've uncovered a bug in your code that just happened to work in js.
>
> Should it raise an error? Good question. Quite possibly it should,
> unless that has some other meaning that I'm not familiar with. Do you
> know how it's being interpreted? I'm not entirely sure what you mean
> by "len(A)==>0", as ==> isn't an operator in Python or JS. Best way to
> continue, I think, would be to use regular expression matching (rather
> than findall'ing) and something other than dot, and tabulate input
> strings, expected result (match or no match), what JS does, and what
> Python does. For instance:
>
> Regex: "^a{1,3}$"
>
> "": Not expected, not Python
> "a": Expected, Python
> "aa": Expected, Python
> "aaa": Expected, Python
> "aaaa": Not expected, not Python
>
> Just what we'd expect. Now try the same thing with the plus in there.
> I'm finding that none of the above strings yields a match. Maybe
> there's something else being matched?
>
The DEBUG flag helps to show what's happening:

 >>> r = re.compile('.{1,+3}', flags=re.DEBUG)
any None
literal 123
literal 49
max_repeat 1 4294967295
   literal 44
literal 51
literal 125
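
The numbers in that listing are character codes. A quick sketch to turn them back into characters (the list below is just copied by hand from the output above, not something the re module hands you):

```python
# Code points copied from the DEBUG listing, in the order they appear.
codes = [123, 49, 44, 51, 125]

# chr() maps each code point back to its character.
chars = [chr(c) for c in codes]
print(chars)
```

So apart from the leading "any", everything in the pattern has been compiled as a literal character, with the repeat applied only to the comma (code 44).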

When it's parsing the pattern it's doing this:

.    OK, match any character
{    Looks like the start of a quantifier
1    OK, the minimum count
,    OK, the maximum count probably follows
+    Not a digit or '}', so this isn't a quantifier; the '{' must have been a literal

Trying again from the brace:

{    Literal
1    Literal
,    Literal
+    Repeat the previous item (the ',') one or more times
3    Literal
}    Literal
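
If that reading is right, the pattern should only match text that literally contains the braces: any character, then '{1', one or more commas, then '3}'. A small check, assuming made-up sample strings (they are mine, not from Robin's data):

```python
import re

# The pattern as originally (mis)translated from the JavaScript.
pattern = re.compile('.{1,+3}')

# Hypothetical sample text containing the literal brace sequences.
sample = 'a{1,3} b{1,,,3} abc'
found = pattern.findall(sample)
print(found)

# Ordinary text has no literal '{1,3}' in it, so nothing is found,
# which would explain the empty result from findall.
no_match = pattern.findall('abc')
print(no_match)

# The intended pattern treats {1,3} as a quantifier on the dot.
intended = re.compile('.{1,3}').findall('abcdefg')
print(intended)
```

That would account for findall returning an empty list without any error being raised at compile time.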



