A bug for unicode strings in Python 2.4?

Wed Jan 11 02:31:21 EST 2006

Thomas Moore wrote:

> Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
> >>> u.split()
> [u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']
> >>>
>
> I think u should get split.

why?  called without arguments, split() splits on whitespace (basically
unicode category Zs), and there are no whitespace characters in that string:

>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> [c.isspace() for c in u]
[False, False, False, False, False, False]
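
For contrast, here is a small sketch of what happens when the string does contain a whitespace character. The U+3000 (IDEOGRAPHIC SPACE) in the middle is something I inserted for illustration; it is not in the original string.

```python
import unicodedata

# Same characters as above, with U+3000 IDEOGRAPHIC SPACE inserted
# after the second character (an assumption made for this example).
text = u'\u9019\u662f\u3000\u4e2d\u6587\u5b57\u4e32'

# U+3000 is in category Zs, so isspace() reports it as whitespace.
space_category = unicodedata.category(u'\u3000')
space_flags = [c.isspace() for c in text]

# split() with no arguments now has something to split on.
parts = text.split()
```

Here split() should return two pieces, one on each side of the ideographic space.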

there's no universal "split on words in all languages" function in the
standard Python library.  You may be able to roll your own using the
information in http://www.unicode.org/reports/tr29/ plus the functions
in the unicodedata module (which currently doesn't include the
BreakTest tables; patches are welcome).  Or maybe Google can
help you find an existing implementation.
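
As a very rough starting point for rolling your own, the sketch below treats every CJK unified ideograph as a token of its own and groups any other run of non-whitespace characters together. This is not an implementation of TR29, and the helper names (is_han, crude_split) are hypothetical. It also relies on the character name as a stand-in for a proper script test, which is an assumption made for brevity. Real word segmentation for Chinese generally needs a dictionary or a statistical model.

```python
import unicodedata


def is_han(ch):
    # Hypothetical helper: uses the Unicode character name as a crude
    # script test. Characters without a name fall back to u''.
    return unicodedata.name(ch, u'').startswith('CJK UNIFIED IDEOGRAPH')


def crude_split(text):
    # Hypothetical helper, not TR29: each ideograph becomes its own token,
    # whitespace separates tokens, everything else is grouped into runs.
    tokens = []
    current = []
    for ch in text:
        if ch.isspace() or is_han(ch):
            # Flush the run collected so far, if any.
            if current:
                tokens.append(u''.join(current))
                current = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append(u''.join(current))
    return tokens


han_tokens = crude_split(u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32')
mixed_tokens = crude_split(u'Python \u4e2d\u6587 2.4')
```

With the string from the original post this gives six one-character tokens, which is a per-character split and not a split into words.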

</F>