[docs] copy&waste problem

Senthil Kumaran senthil at uthcode.com
Tue Mar 13 17:16:53 CET 2012


I understand your points. My reasoning was based on man 5 locale,
which states the following for space:

space  followed by a list of characters defined as white-space
characters.  Characters also  specified  as  upper,  lower,
alpha,  digit, graph, or xdigit are not allowed.  The
characters <space>, <form-feed>, <newline>, <carriage-return>,
<tab>, and <vertical-tab> are automatically included.
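
As a quick cross-check on the Python side, these six characters are the
same ones that \s matches when neither locale nor Unicode rules apply.
A minimal sketch, assuming Python 3, where the re.ASCII flag restricts
\s to [ \t\n\r\f\v]:

```python
import re

# The six characters the man page says are always in the "space" class.
POSIX_SPACE = [" ", "\f", "\n", "\r", "\t", "\v"]

# With re.ASCII, \s is limited to [ \t\n\r\f\v] and \S to its complement.
ascii_space = re.compile(r"\s", re.ASCII)
ascii_nonspace = re.compile(r"\S", re.ASCII)

# Every listed character should match \s, and none should match \S.
all_match_s = all(ascii_space.fullmatch(c) for c in POSIX_SPACE)
none_match_S = not any(ascii_nonspace.fullmatch(c) for c in POSIX_SPACE)
print(all_match_s, none_match_S)
```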

If the intersection logic were followed, it would completely remove
"<space>, <form-feed>, <newline>, <carriage-return>, <tab>, and
<vertical-tab>" from the match, since they are also included as space
characters in the locale definition. Wouldn't it?
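
One way to check this is a toy model with Python sets. The sketch below
assumes that \s is the union of the locale and Unicode space sets and
that \S is simply the complement of \s. The names space_locale and
space_unicode, and the extra characters placed in them, are made up for
illustration and are not taken from any real locale or from _sre.c.

```python
# Toy universe of characters. Which extra characters a real locale or the
# Unicode database would add is an assumption made for this sketch.
ASCII_SPACE = set(" \t\n\r\f\v")
NBSP = "\u00a0"      # stand-in for a locale-specific space
EM_SPACE = "\u2003"  # stand-in for a Unicode-only space
universe = ASCII_SPACE | {NBSP, EM_SPACE, "a", "Z", "0", "_"}

space_locale = ASCII_SPACE | {NBSP}             # hypothetical locale set
space_unicode = ASCII_SPACE | {NBSP, EM_SPACE}  # hypothetical Unicode set


def complement(chars):
    """Return every character of the toy universe not in chars."""
    return universe - chars


s_class = space_locale | space_unicode  # \s modelled as a union
S_class = complement(s_class)           # \S modelled as "anything but \s"

# De Morgan: the complement of a union is the intersection of complements.
S_by_de_morgan = complement(space_locale) & complement(space_unicode)

print(S_class == S_by_de_morgan)
print(ASCII_SPACE <= s_class, ASCII_SPACE.isdisjoint(S_class))
```

In this toy model the six standard characters are absent from \S but
still present in \s.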

Here is the bug report:

http://bugs.python.org/issue14258

Thanks,
Senthil


On Tue, Mar 13, 2012 at 12:17:37PM +0000, Hauke Rehr wrote:
> Hello again,
> 
> I’d rather use the ticket you started, but I couldn’t find where to post an
> answer/where the discussion is tracked.
> 
> So once again, to clarify what I had in mind:
> 
> As for the positive (lowercase) classes: Yes, that’s union (first one is
> matched, if there’s no success, the other one is tried). For the negative
> (uppercase) classes it’s De Morgan:
> complement(union(A, B)) = intersection(complement(A), complement(B))
> So we have
> \<uppercase> = complement(\<lowercase>)
> = complement(union(\<lowercase_locale>, \<lowercase_unicode>))
> = intersection(complement(\<lowercase_locale>),
>                complement(\<lowercase_unicode>))
> = intersection(\<uppercase_locale>, \<uppercase_unicode>).
> 
> At least, that’s how it should be, and it is what the code you quoted means
> (it doesn’t matter which one you try first: union is symmetric),
> for a match of an uppercase class means nothing but
> a char that doesn’t match the corresponding lowercase class.
> 
> So I still believe my corrections to be - well, correct; and your suggestion to
> be erroneous.
> 
> Hauke
> 
> 
> --- Senthil Kumaran <senthil at uthcode.com> wrote on Mon, 12.3.2012:
> 
> 
>     From: Senthil Kumaran <senthil at uthcode.com>
>     Subject: Re: [docs] copy&waste problem
>     To: "Hauke Rehr" <homo_laber at yahoo.de>
>     CC: docs at python.org
>     Date: Monday, 12 March 2012, 04:09
> 
>     Hello Hauke,
> 
>     I think you are mistaken about the meaning of the re.LOCALE flag for space. It
>     is not the intersection but the union of the locale's space characters with the
>     ASCII space characters.
> 
>     For \S, with the LOCALE flag set, it will match [^ \t\n\r\f\v] plus any
>     non-whitespace characters defined by that locale.
> 
> 
> 
>     +   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
>     +   these character classes will behave as if the union was given.
> 
>     Where did you find this logic? I see that the locale flag is checked first and
>     then unicode.
> 
>     In Modules/_sre.c:
> 
>     if (pattern->flags & SRE_FLAG_LOCALE)
>             state->lower = sre_lower_locale;
>         else if (pattern->flags & SRE_FLAG_UNICODE)
> 
> 
>     I am going ahead with the changes as I suggested previously and also
>     opening a bug report. Further discussions and changes can be tracked there.
>     Yeah, sometimes doc changes go through discussions and iterations too. :(
> 
>     -- 
>     Senthil
> 
> 
> 
>     On Fri, Mar 9, 2012 at 6:12 AM, Hauke Rehr <homo_laber at yahoo.de> wrote:
> 
>         Hello again,
> 
>         I can’t agree with your rewrite either, sorry - my suggestion based on
>         yours:
> 
> 
>         +   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
>         +   matches any non-whitespace character; this is equivalent to the set ``[^
>         +   \t\n\r\f\v]`` With :const:`LOCALE`, it will match those elements of the above set
>         +   not defined as space in the current locale. If :const:`UNICODE` is set, those elements
>         +   of ``[^ \t\n\r\f\v]`` not marked as space in the Unicode character properties database
>         +   will be matched.
> 
>         If I don’t get the meaning of \S (that is: anything but \s) wrong, this
>         should be correct.
>         The same applies to \W:
> 
>         +   this will match anything other than ``[0-9_]`` not classified as
>         +   alphanumeric in the Unicode character properties database.
> 
> 
>         For the additional sentence, I’d prefer:
> 
>         +   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
>         +   these character classes will behave as if the union was given.
> 
>         for that’s the logic behind it.
> 
>         Hauke
> 
> 
> 

