split on NO-BREAK SPACE

Peter Kleiweg p.c.j.kleiweg at rug.nl
Sun Jul 22 15:13:02 EDT 2007


Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote: 
> > > It's a feature. See help(str.split): "If sep is not specified or is
> > > None, any whitespace string is a separator."
> > 
> > Define "any whitespace".
> 
> Any string for which isspace returns True.

Then define what counts as whitespace for isspace().
 
> > Why is it different in <type 'str'> and <type 'unicode'>?
> 
> >>> '\xa0'.isspace()
> False
> >>> u'\xa0'.isspace()
> True

Here is another "space":

  >>> u'\uFEFF'.isspace()
  False

So isspace() is inconsistent: U+00A0 NO-BREAK SPACE counts as
whitespace, but U+FEFF ZERO WIDTH NO-BREAK SPACE does not.
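
One way to see where the difference may come from is to look the two
characters up in the Unicode character database. This is only a sketch,
assuming Python 3 (where every str is a Unicode string). The helper name
describe() is made up for the illustration. U+00A0 has general category
Zs (space separator), while U+FEFF has category Cf (format), and
isspace() appears to follow that classification.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, isspace() result) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), ch.isspace()

# Compare the two "no-break" characters discussed in this thread.
for ch in (u'\u00a0', u'\ufeff'):
    print(describe(ch))
```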
 
> For byte strings, Python doesn't know whether 0xA0 is a whitespace
> because it depends on the encoding whether the number 160 corresponds to
> a whitespace character. For unicode strings, code point 160 is
> unquestionably a whitespace, because it is a no-break SPACE.

I question it. And so does the sre module:

  \s Matches any whitespace character; equivalent to [ \t\n\r\f\v]

Where is the NO-BREAK SPACE in there?
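
For comparison, here is a small sketch of how \s treats U+00A0,
assuming Python 3. There, str patterns use Unicode whitespace unless
the re.ASCII flag is given, and re.ASCII restricts \s to the set quoted
above. In the Python 2 of this thread the default is the other way
around, and the re.UNICODE flag has to be passed to get the Unicode
behaviour.

```python
import re

nbsp = u'\u00a0'

# With re.ASCII, \s is limited to [ \t\n\r\f\v].
ascii_match = re.search(r'\s', nbsp, re.ASCII)

# Without the flag, a str pattern in Python 3 uses Unicode whitespace.
unicode_match = re.search(r'\s', nbsp)

print(ascii_match is None, unicode_match is not None)
```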

 
> > Why does split() split when it says NO-BREAK?
> 
> Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

That is a stupid answer.
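
For anyone who wants a split that leaves the no-break space alone, one
possible workaround is to split on an explicit set of ASCII whitespace
characters instead of relying on the default separator. This is only a
sketch, assuming Python 3; the helper split_keep_nbsp() and the sample
line are invented for the illustration.

```python
import re

# Hypothetical helper: split on ASCII whitespace only, so that
# U+00A0 NO-BREAK SPACE stays inside a field.
_ASCII_WS = re.compile(r'[ \t\n\r\f\v]+')

def split_keep_nbsp(text):
    # Drop empty strings produced by leading or trailing whitespace.
    return [part for part in _ASCII_WS.split(text) if part]

line = u'10\u00a0km  from\there'
print(line.split())           # default split also breaks at U+00A0
print(split_keep_nbsp(line))  # U+00A0 is kept inside the first field
```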


-- 
Peter Kleiweg  L:NL,af,da,de,en,ia,nds,no,sv,(fr,it)  S:NL,de,en,(da,ia)
info: http://www.let.rug.nl/kleiweg/ls.html


