split on NO-BREAK SPACE

Sun Jul 22 15:37:32 EDT 2007

On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg <p.c.j.kleiweg at rug.nl> wrote:
>Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:
>
>> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
>> > > It's a feature. See help(str.split): "If sep is not specified or is
>> > > None, any whitespace string is a separator."
>> >
>> > Define "any whitespace".
>>
>> Any string for which isspace returns True.
>
>Define white space to isspace()
>
>> > Why is it different in <type 'str'> and <type 'unicode'>?
>>
>> >>> '\xa0'.isspace()
>> False
>> >>> u'\xa0'.isspace()
>> True
>
>Here is another "space":
>
>  >>> u'\uFEFF'.isspace()
>  False
>
>isspace() is inconsistent

It's only inconsistent if you think it should behave based on the
name of a Unicode code point.  It doesn't use the name, though.  It
uses the general category.  NO-BREAK SPACE (U+00A0) is in the Zs
category (Separator, Space).  ZERO WIDTH NO-BREAK SPACE (U+FEFF) is in
the Cf category (Other, Format).
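
You can check this yourself with the standard unicodedata module.  What
follows is only a rough sketch: describe() is a helper name made up for
this example, and it reports just the name and general category next to
what isspace() says.  As far as I know the real test also looks at the
bidirectional class, so don't read this as the full rule.

```python
import unicodedata

# The two code points discussed in this thread.
chars = [u'\u00a0', u'\ufeff']

def describe(ch):
    """Return (name, general category, isspace() result) for one character.

    Hypothetical helper for illustration only.
    """
    return unicodedata.name(ch), unicodedata.category(ch), ch.isspace()

for ch in chars:
    # Print the code point followed by its name, category and isspace() result.
    print("U+%04X %s %s %s" % ((ord(ch),) + describe(ch)))

# split() with no separator follows isspace(), not the character name.
print(u'a\u00a0b'.split())
print(u'a\ufeffb'.split())
```

The first split() call breaks the string at the NO-BREAK SPACE; the
second leaves the string in one piece, because U+FEFF is not whitespace
as far as isspace() is concerned.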

Maybe that makes Unicode inconsistent (I won't try to argue either way),
but it's pretty clear that isspace is being consistent with the data
it has to work with.

Jean-Paul