regex walkthrough

Sat Dec 8 13:08:36 EST 2012

On 2012-12-08 17:48, rh wrote:
>   Looking through some code, I found this and wondered what it does:
> ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
>
> Here's my walk through:
>
> 1) ^ match at start of string
> 2) ?P<salsipuedes> if a match is found it will be accessible in a variable
> salsipuedes
> 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
> 4) + one or more from the preceding char class
> 5) () the grouping we want returned (see #2)
> 6) $ end of the string to match against but before any newline
>
>
> more on #3
> the z-_ part looks wrong and seems that the - should be at the start
> of the char set otherwise we get another range z-_ or does the a-z
> preceding the z-_ negate the z-_ from becoming a range?  The "."
> might be ok inside a char set. The two slashes look wrong but maybe
> it has some special meaning in some case? I think only one slash is
> needed.
>
> I've looked at pydoc re, but it's cursory.
>
Python itself will help you:

 >>> import re
 >>> re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$", flags=re.DEBUG)
at at_beginning
subpattern 1
   max_repeat 1 65535
     in
       range (48, 57)
       range (65, 90)
       range (97, 122)
       literal 45
       literal 95
       literal 46
       literal 47
       literal 47
at at_end
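
The numbers in that output are character codes. As a rough sketch (the
variable names below are mine, not anything from the re module), they can
be mapped back to characters like this:

```python
# Character codes copied by hand from the DEBUG output above.
ranges = [(48, 57), (65, 90), (97, 122)]
literals = [45, 95, 46, 47, 47]

# chr() turns a code point back into its character.
decoded_ranges = ["%s-%s" % (chr(lo), chr(hi)) for lo, hi in ranges]
decoded_literals = [chr(code) for code in literals]

print(decoded_ranges)
print(decoded_literals)
```

The second list ends with the same character twice, which is the doubled
slash from the pattern.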

Inside the character set: "0-9", "A-Z" and "a-z" are ranges; "-", "_",
"." and "/" are literals. Doubling the "/" is unnecessary (it has no
special meaning, and the second one only repeats a character that is
already in the set). "-" is a literal here because it immediately
follows a complete range, so it can't be starting another range. It
defines a range only when it immediately follows a literal and isn't
immediately followed by an unescaped "]", which is why r"[a-]" is the
same as r"[a\-]".
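
If you want to convince yourself, here is a small check. The "cleaned"
pattern is only my suggestion for a tidier spelling (hyphen moved to the
end, single slash), and the sample strings are made up; it is a sketch,
not a proof that the two patterns agree on every input:

```python
import re

original = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")
# Hypothetical tidier spelling: hyphen last, one slash.
cleaned = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z_./-]+)$")

# Made-up samples: allowed characters, a space, a brace, the empty
# string, and a string with a trailing newline.
samples = ["a-b", "foo_bar/baz.txt", "-", "a b", "x{y", "", "abc\n"]

results = [(s, bool(original.match(s)), bool(cleaned.match(s)))
           for s in samples]
for s, matched_original, matched_cleaned in results:
    print(repr(s), matched_original, matched_cleaned)

# A trailing hyphen in a set is a literal, escaped or not.
unescaped = re.findall(r"[a-]", "a-b")
escaped = re.findall(r"[a\-]", "a-b")
print(unescaped, escaped)
```

Note that "abc\n" matches as well: without re.MULTILINE, "$" matches at
the end of the string or just before a newline at the end, as point 6 of
the walk through says.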

As for "(?P<salsipuedes>...)", it won't be accessible in a variable
"salsipuedes", but will be accessible as a named group in the match
object:

 >>> m = re.match(r"(?P<foo>[a-z]+)", "xyz")
 >>> m.group("foo")
'xyz'
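
The same thing with the original pattern, as a short sketch. The input
strings are invented for illustration; the group can be fetched by name,
by number, or through groupdict(), and a failed match gives None rather
than a match object:

```python
import re

pattern = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

# Invented input that uses only characters from the set.
m = pattern.match("some/path-1.0_final")

by_name = m.group("salsipuedes")   # lookup by group name
by_index = m.group(1)              # the same group, by position
as_dict = m.groupdict()            # all named groups as a dict

print(by_name)
print(by_index)
print(as_dict)

# A space is not in the set, so there is no match object at all.
no_match = pattern.match("has space")
print(no_match)
```

So code using this pattern should test the result of match() before
calling group() on it.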