[docs] copy&waste problem

Senthil Kumaran senthil at uthcode.com
Wed Mar 14 01:39:16 CET 2012


Hello Hauke,

De Morgan's complement logic does not seem to apply here. I tried it
as an exercise, intersecting \S between two languages. The output
clearly did not include the space characters of the second language.
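
A minimal sketch of that kind of exercise, under assumptions that are
not from this thread: it uses Python 3, where re.ASCII and re.UNICODE
stand in for the two "languages" (re.LOCALE depends on the system
locale, so it is left out), and a small hand-picked sample of
characters rather than the full character set. The names UNIVERSE and
matching are hypothetical.

```python
import re

# Assumed sample "universe": ASCII whitespace, two characters that are
# whitespace only in Unicode mode, and a few ordinary characters.
UNIVERSE = set(" \t\n\r\f\v\xa0\u2003aZ9_")


def matching(pattern, flags):
    """Return the characters of UNIVERSE matched by a one-character pattern."""
    regex = re.compile(pattern, flags)
    return {ch for ch in UNIVERSE if regex.fullmatch(ch)}


space_ascii = matching(r"\s", re.ASCII)
space_unicode = matching(r"\s", re.UNICODE)
nonspace_ascii = matching(r"\S", re.ASCII)
nonspace_unicode = matching(r"\S", re.UNICODE)

# Intersection of the two \S sets: characters that are non-whitespace
# in both modes.
both_nonspace = nonspace_ascii & nonspace_unicode

# De Morgan on sets: complement(union(A, B)) equals
# intersection(complement(A), complement(B)).
de_morgan_holds = UNIVERSE - (space_ascii | space_unicode) == both_nonspace
```

In this sample, "\xa0" and "\u2003" are non-whitespace under re.ASCII
but drop out of both_nonspace because re.UNICODE treats them as
whitespace. The set identity in de_morgan_holds is true for any sets;
by itself it says nothing about how the re module combines the two
flags.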

Here is the tracker issue: http://bugs.python.org/issue14258

Thanks,
Senthil

On Tue, Mar 13, 2012 at 12:17:37PM +0000, Hauke Rehr wrote:
> Hello again,
> 
> I’d rather use the ticket you started, but I couldn’t find where to post an
> answer/where the discussion is tracked.
> 
> So once again, to clarify what I had in mind:
> 
> As for the positive (lowercase) classes: Yes, that’s union (first one is
> matched, if there’s no success, the other one is tried). For the negative
> (uppercase) classes it’s De Morgan’s law:
> complement(union(A, B)) = intersection(complement(A), complement(B))
> So we have
> \<uppercase> = complement(\<lowercase>)
> = complement(union(\<lowercase_locale>, \<lowercase_unicode>))
> = intersection(complement(\<lowercase_locale>),
>                complement(\<lowercase_unicode>))
> = intersection(\<uppercase_locale>, \<uppercase_unicode>).
> 
> At least, that’s how it should be, and it is what the code you quoted means
> (it doesn’t matter which one you try first: union is symmetric),
> for a match of an uppercase class means nothing but
> a char that doesn’t match the corresponding lowercase class.
> 
> So I still believe my corrections to be - well, correct; and your suggestion to
> be erroneous.
> 
> Hauke
> 
> 
> --- Senthil Kumaran <senthil at uthcode.com> wrote on Mon, 12.3.2012:
> 
> 
>     From: Senthil Kumaran <senthil at uthcode.com>
>     Subject: Re: [docs] copy&waste problem
>     To: "Hauke Rehr" <homo_laber at yahoo.de>
>     CC: docs at python.org
>     Date: Monday, 12 March 2012, 04:09
> 
>     Hello Hauke,
> 
>     I think you are mistaken about the meaning of the re.LOCALE flag for
>     whitespace. It is not the intersection but the union of the locale's
>     space characters with the ASCII space characters.
> 
>     For \S, with the LOCALE flag set, it will match [^ \t\n\r\f\v] plus any
>     non-whitespace characters defined by that locale.
> 
> 
> 
>     +   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
>     +   these character classes will behave as if the union was given.
> 
>     Where did you find this logic? I see that the locale flag is checked
>     first and then the unicode flag.
> 
>     In Modules\_sre.c    
> 
>     if (pattern->flags & SRE_FLAG_LOCALE)
>         state->lower = sre_lower_locale;
>     else if (pattern->flags & SRE_FLAG_UNICODE)
> 
> 
>     I am going ahead with the changes I suggested previously and am also
>     opening a bug report. Further discussion and changes can be tracked there.
>     Yeah, sometimes doc changes go through discussions and iterations too. :(
> 
>     -- 
>     Senthil
> 
> 
> 
>     On Fri, Mar 9, 2012 at 6:12 AM, Hauke Rehr <homo_laber at yahoo.de> wrote:
> 
>         Hello again,
> 
>         I can’t agree with your rewrite either, sorry - my suggestion based on
>         yours:
> 
> 
>         +   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
>         +   matches any non-whitespace character; this is equivalent to the set ``[^
>         +   \t\n\r\f\v]`` With :const:`LOCALE`, it will match those elements of the above set
>         +   not defined as space in the current locale. If :const:`UNICODE` is set, those elements
>         +   of ``[^ \t\n\r\f\v]`` not marked as space in the Unicode character properties database
>         +   will be matched.
> 
>         If I don’t get the meaning of \S (that is: anything but \s) wrong, this
>         should be correct.
>         The same applies to \W:
> 
>         +   this will match anything other than ``[0-9_]`` not classified as
>         +   alphanumeric in the Unicode character properties database.
> 
> 
>         For the additional sentence, I’d prefer:
> 
>         +   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
>         +   these character classes will behave as if the union was given.
> 
>         for that’s the logic behind it.
> 
>         Hauke
> 
> 
> 

