[ python-Bugs-1390608 ] split() breaks no-break spaces

SourceForge.net noreply at sourceforge.net
Fri Dec 30 13:35:01 CET 2005


Bugs item #1390608, was opened at 2005-12-26 16:03
Message generated for change (Comment added) made by doerwalter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: MvR (maxim_razin)
Assigned to: Nobody/Anonymous (nobody)
Summary: split() breaks no-break spaces

Initial Comment:
string.split(), str.split() and unicode.split(), when called
without arguments, split strings at the no-break space (U+00A0)
character.  This character is specifically intended not to
be a break point.

>>> u"Hello\u00A0world".split()
[u'Hello', u'world']


----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-30 13:35

Message:
Logged In: YES 
user_id=89016

What's wrong with the following?

import sys, unicodedata
spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
                  if unicodedata.category(unichr(c)) == "Zs" and c != 160)
foo.split(spaces)
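
One caveat with the snippet above: split(sep) treats a multi-character
sep as a single delimiter string, not as a set of characters, and
xrange(0, sys.maxunicode) stops one short of the last code point. A
minimal sketch of the same idea in Python 3 follows, using a regular
expression character class so that any one of the collected characters
acts as a delimiter. The helper name split_on_spaces is hypothetical
and not part of any library.

```python
import re
import sys
import unicodedata

# Collect every "Zs" (space separator) code point except U+00A0 NO-BREAK SPACE.
# range() excludes its end point, so add 1 to cover sys.maxunicode itself.
spaces = "".join(
    chr(c)
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Zs" and c != 0xA0
)

# A character class matches any single one of the characters; "+" makes a run
# of consecutive spaces count as one delimiter.
_space_run = re.compile("[" + re.escape(spaces) + "]+")


def split_on_spaces(text):
    """Hypothetical helper: split on Zs characters other than NBSP."""
    # Drop the empty strings produced by leading or trailing delimiters.
    return [part for part in _space_run.split(text) if part]
```

Because only category Zs is collected, tabs and newlines are not
delimiters here, and U+202F NARROW NO-BREAK SPACE (also Zs) is still
split on unless it is excluded as well.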

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2005-12-30 01:30

Message:
Logged In: YES 
user_id=55188

The Python documentation says that split() splits on
"whitespace characters", not "breaking characters", so the
current behavior is correct according to the documentation.
Moreover, the string methods depend heavily on the ctype
functions in libc. Therefore, we can't give the NBSP
special treatment.

However, I do see the need for a splitting function that
is aware of which characters are breaking and which are
not. How about adding it as unicodedata.split()?
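
No unicodedata.split() exists; as a rough illustration of what such a
break-aware function might do, here is a minimal Python 3 sketch. The
name split_breaking and the NON_BREAKING set are assumptions made for
this example, and the set is not a complete list derived from the
Unicode line-breaking data.

```python
# Characters treated as non-breaking here: NO-BREAK SPACE, FIGURE SPACE and
# NARROW NO-BREAK SPACE. This is an illustrative assumption, not a complete
# list taken from the Unicode line-breaking rules.
NON_BREAKING = frozenset("\u00a0\u2007\u202f")


def split_breaking(text):
    """Hypothetical helper: like str.split(), but keeps no-break spaces."""
    words = []
    current = []
    for ch in text:
        if ch.isspace() and ch not in NON_BREAKING:
            # Breaking whitespace ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Ordinary characters and no-break spaces stay inside the word.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

With this sketch, split_breaking("Hello\u00a0world foo") keeps
"Hello\u00a0world" as one item, while the built-in split() would
return three.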

----------------------------------------------------------------------

Comment By: Fredrik Lundh (effbot)
Date: 2005-12-29 21:42

Message:
Logged In: YES 
user_id=38376

split isn't a word-wrapping split, so I'm not sure that's
the right place to fix this.  ("no-break space" is
whitespace, according to the Unicode standard, and split
breaks on whitespace).
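
The whitespace status of U+00A0 can be checked directly. The following
is a small Python 3 sketch (the thread itself concerns Python 2.4); the
variable names are chosen only for this example.

```python
import unicodedata

nbsp = "\u00a0"

# Gather what the interpreter's Unicode database reports for U+00A0.
info = {
    "name": unicodedata.name(nbsp),          # official character name
    "category": unicodedata.category(nbsp),  # general category
    "isspace": nbsp.isspace(),               # str's own whitespace test
    "split": "Hello\u00a0world".split(),     # argument-less split()
}
print(info)
```

The general category "Zs" (space separator) is what makes the
argument-less split() treat the character as a delimiter.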

----------------------------------------------------------------------
