[docs] copy&waste problem
Senthil Kumaran
senthil at uthcode.com
Tue Mar 13 17:16:53 CET 2012
I understand your points. My reasoning was based on
man 5 locale, which states the following:
space followed by a list of characters defined as white-space
characters. Characters also specified as upper, lower,
alpha, digit, graph, or xdigit are not allowed. The
characters <space>, <form-feed>, <newline>, <carriage-return>,
<tab>, and <vertical-tab> are automatically included.
If the intersection logic were followed, it would completely
remove "<space>, <form-feed>, <newline>, <carriage-return>, <tab>,
and <vertical-tab>" from the match, since they are also included as
space characters in the locale definition.
Wouldn't it?
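The set identity under discussion can be checked with a small sketch. This is
only an illustration, not how the re module is implemented: the universe is
limited to the first 256 code points, and LOCALE_SPACE is a hypothetical
locale whitespace set (NO-BREAK SPACE is added purely as an example).

```python
import re

# Universe: the first 256 code points, enough for an illustration.
UNIVERSE = {chr(i) for i in range(256)}

# The six characters the man page says are always whitespace.
ASCII_SPACE = set(" \t\n\r\f\v")

# Hypothetical locale whitespace set (assumption: NO-BREAK SPACE is
# used here only as an example of a locale-specific addition).
LOCALE_SPACE = ASCII_SPACE | {"\xa0"}


def complement(chars):
    """Return everything in UNIVERSE that is not in chars."""
    return UNIVERSE - chars


# \s under both definitions is the union of the two whitespace sets.
s_class = ASCII_SPACE | LOCALE_SPACE

# \S is the complement of \s ...
big_s_class = complement(s_class)

# ... which, by De Morgan, equals the intersection of the complements.
assert big_s_class == complement(ASCII_SPACE) & complement(LOCALE_SPACE)

# The intersection is taken over the non-whitespace sets, so no
# whitespace character ends up in \S.
assert not (big_s_class & ASCII_SPACE)

# Cross-check with the re module under ASCII-only rules: \S matches
# exactly the characters that \s does not.
for ch in UNIVERSE:
    matches_upper = bool(re.fullmatch(r"\S", ch, re.ASCII))
    matches_lower = bool(re.fullmatch(r"\s", ch, re.ASCII))
    assert matches_upper != matches_lower
```

In this sketch the intersection applies to the complemented (non-whitespace)
sets, not to the whitespace sets themselves.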
Here is the bug report:
http://bugs.python.org/issue14258
Thanks,
Senthil
On Tue, Mar 13, 2012 at 12:17:37PM +0000, Hauke Rehr wrote:
> Hello again,
>
> I’d rather use the ticket you started, but I couldn’t find where to post an
> answer/where the discussion is tracked.
>
> So once again, to clarify what I had in mind:
>
> As for the positive (lowercase) classes: Yes, that’s union (first one is
> matched, if there’s no success, the other one is tried). For the negative
> (uppercase) classes it’s deMorgan:
> complement(union(A, B)) = intersection(complement(A), complement(B))
> So we have
> \<uppercase> = complement(\<lowercase>)
> = complement(union(\<lowercase_locale>, \<lowercase_unicode>))
> = intersection(complement(\<lowercase_locale>), complement(\<lowercase_unicode>))
> = intersection(\<uppercase_locale>, \<uppercase_unicode>).
>
> At least, that’s how it should be, and it is what the code you quoted means
> (it doesn’t matter which one you try first: union is symmetric),
> for a match of an uppercase class means nothing but
> a char that doesn’t match the corresponding lowercase class.
>
> So I still believe my corrections to be - well, correct; and your suggestion to
> be erroneous.
>
> Hauke
>
>
> --- Senthil Kumaran <senthil at uthcode.com> wrote on Mon, 12 Mar 2012:
>
>
> From: Senthil Kumaran <senthil at uthcode.com>
> Subject: Re: [docs] copy&waste problem
> To: "Hauke Rehr" <homo_laber at yahoo.de>
> CC: docs at python.org
> Date: Monday, 12 March 2012, 04:09
>
> Hello Hauke,
>
> I think you are mistaken about the meaning of the re.LOCALE flag for space. It
> is not the intersection but the union of the locale's space characters with the
> ASCII space characters.
>
> For \S, with the LOCALE flag set, it will match [^ \t\n\r\f\v] plus any
> non-whitespace characters defined by that locale.
>
>
>
> + In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
> + these character classes will behave as if the union was given.
>
> Where did you find this logic? As far as I can see, the locale flag is checked
> first and then the unicode flag.
>
> In Modules\_sre.c
>
>     if (pattern->flags & SRE_FLAG_LOCALE)
>         state->lower = sre_lower_locale;
>     else if (pattern->flags & SRE_FLAG_UNICODE)
>
>
> I am going ahead with the changes as I suggested previously and am also
> opening a bug report. Further discussions and changes can be tracked there.
> Yeah, sometimes doc changes go through discussions and iterations too. :(
>
> --
> Senthil
>
>
>
> On Fri, Mar 9, 2012 at 6:12 AM, Hauke Rehr <homo_laber at yahoo.de> wrote:
>
> Hello again,
>
> I can’t agree with your rewrite either, sorry - my suggestion based on
> yours:
>
>
> + When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
> + matches any non-whitespace character; this is equivalent to the set
> + ``[^ \t\n\r\f\v]``. With :const:`LOCALE`, it will match those elements of the above set
> + not defined as space in the current locale. If :const:`UNICODE` is set, those elements
> + of ``[^ \t\n\r\f\v]`` not marked as space in the Unicode character properties database
> + will be matched.
>
> If I don’t get the meaning of \S (that is: anything but \s) wrong, this
> should be correct.
> The same applies to \W:
>
> + this will match anything other than ``[0-9_]`` not classified as
> + alphanumeric in the Unicode character properties database.
>
>
> For the additional sentence, I’d prefer:
>
> + In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
> + these character classes will behave as if the union was given.
>
> for that’s the logic behind it.
>
> Hauke
>
>
>