split on NO-BREAK SPACE
Jean-Paul Calderone
exarkun at divmod.com
Sun Jul 22 15:37:32 EDT 2007
On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg <p.c.j.kleiweg at rug.nl> wrote:
>Carsten Haese wrote on the 22nd day of July 2007:
>
>> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
>> > > It's a feature. See help(str.split): "If sep is not specified or is
>> > > None, any whitespace string is a separator."
>> >
>> > Define "any whitespace".
>>
>> Any string for which isspace returns True.
>
>Define white space to isspace()
>
>> > Why is it different in <type 'str'> and <type 'unicode'>?
>>
>> >>> '\xa0'.isspace()
>> False
>> >>> u'\xa0'.isspace()
>> True
>
>Here is another "space":
>
> >>> u'\uFEFF'.isspace()
> False
>
>isspace() is inconsistent
It's only inconsistent if you expect it to behave according to the
name of a Unicode code point. It doesn't use the name, though; it
uses the general category. NO-BREAK SPACE (U+00A0) is in the Zs category
(Separator, Space). ZERO WIDTH NO-BREAK SPACE (U+FEFF) is in the Cf
category (Other, Format). Maybe that makes Unicode inconsistent (I won't
try to argue either way), but it's pretty clear that isspace is being
consistent with the data it has to work with.
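
The categories can be checked directly with the standard unicodedata
module. This is only a minimal sketch, and it assumes Python 3, where
every str is a Unicode string (on Python 2 the literals would need a u
prefix). The helper name describe is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, isspace result) for one character.

    Hypothetical helper, not part of any library.
    """
    return unicodedata.name(ch), unicodedata.category(ch), ch.isspace()

# Compare the two code points discussed in this thread.
for ch in ("\u00a0", "\ufeff"):
    name, category, is_space = describe(ch)
    print("U+%04X %-26s %s isspace=%s" % (ord(ch), name, category, is_space))
```

The two names both contain "SPACE", but only the character in the Zs
category is reported as whitespace by isspace.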
Jean-Paul