split on NO-BREAK SPACE

Sun Jul 22 18:13:45 EDT 2007

Jean-Paul Calderone wrote:
> On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg <p.c.j.kleiweg at rug.nl> wrote:
>> Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:
>>
>>> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
>>>>> It's a feature. See help(str.split): "If sep is not specified or is
>>>>> None, any whitespace string is a separator."
>>>> Define "any whitespace".
>>> Any string for which isspace returns True.
>> Define white space to isspace()
>>
>>>> Why is it different in <type 'str'> and <type 'unicode'>?
>>>>>> '\xa0'.isspace()
>>> False
>>>>>> u'\xa0'.isspace()
>>> True
>> Here is another "space":
>>
>>  >>> u'\uFEFF'.isspace()
>>  False
>>
>> isspace() is inconsistent
> 
> It's only inconsistent if you think it should behave based on the
> name of a unicode code point.  It doesn't use the name, though. It
> uses the category.  NO-BREAK SPACE is in the Zs category (Separator, Space).
> ZERO WIDTH NO-BREAK SPACE is in the Cf category (Other, Format).
> 
> Maybe that makes unicode inconsistent (I won't try to argue either way),
> but it's pretty clear that isspace is being consistent based on the data
> it has to work with.
> 
Well, if you're going to start answering questions with FACTS, how can 
questioners rely on their prejudices to guide them any more?
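
For anyone who wants to check the category claim for themselves, here is
a minimal sketch. It assumes Python 3, where every str is Unicode (the
session quoted above is Python 2), and the describe() helper is a
hypothetical name made up for this illustration. As far as I know,
str.isspace() also consults the bidirectional class, so the general
category is the main input rather than the only one.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, isspace() result) for one character."""
    # describe() is a hypothetical helper, not part of any library.
    return (unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch), ch.isspace())


# NO-BREAK SPACE and ZERO WIDTH NO-BREAK SPACE, the two code points in this thread.
for ch in ("\u00a0", "\ufeff"):
    print(describe(ch))

# split() with no separator follows the same whitespace test as isspace().
print("a\u00a0b".split())  # splits on NO-BREAK SPACE
print("a\ufeffb".split())  # does not split on ZERO WIDTH NO-BREAK SPACE

# Rough Python 3 analogue of the str/unicode difference quoted above:
# bytes.isspace() only recognises ASCII whitespace.
print(b"\xa0".isspace())
```

The first character reports category Zs and isspace() True, the second
reports Cf and False, which matches the behaviour quoted above.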

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden