space / nonspace

Sun Jul 22 17:28:30 EDT 2007

On Sun, 2007-07-22 at 22:33 +0200, Peter Kleiweg wrote:
>     >>> import re
>     >>> s = u'a b\u00A0c d'
>     >>> s.split()
>     [u'a', u'b', u'c', u'd']
>     >>> re.findall(r'\S+', s)
>     [u'a', u'b\xa0c', u'd']  
> 

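Without flags, the re module classifies whitespace the ASCII way, so
U+00A0 (no-break space) counts as a non-space character. If you want
the Unicode interpretation of \S+, etc., pass the re.UNICODE flag: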

        >>> re.findall(r'\S+', s, re.UNICODE)
        [u'a', u'b', u'c', u'd']

See http://docs.python.org/lib/node46.html
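
For anyone trying this on a newer interpreter, here is a minimal sketch
of the same comparison, assuming Python 3. There, str patterns are
Unicode-aware by default, so re.UNICODE is redundant and re.ASCII (which
does not exist in Python 2) is what reproduces the result from the
original session. The variable names are only for illustration.

```python
import re

# U+00A0 (NO-BREAK SPACE) is whitespace in Unicode but not in ASCII.
s = u'a b\u00A0c d'

# Unicode interpretation of \S: the no-break space separates words.
# (On Python 3 the flag is implied for str patterns; it is spelled out
# here to match the session above.)
unicode_words = re.findall(r'\S+', s, re.UNICODE)

# ASCII interpretation of \S (Python 3 only): the no-break space is
# treated as an ordinary non-space character, so 'b' and 'c' stay joined.
ascii_words = re.findall(r'\S+', s, re.ASCII)

# str.split() with no argument uses the Unicode notion of whitespace.
split_words = s.split()

print(unicode_words)
print(ascii_words)
print(split_words)
```

With the Unicode interpretation, the regular expression and split()
agree on where the words are.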

> 
> This isn't documented either:
> 
>     >>> s = ' b c '
>     >>> s.split()
>     ['b', 'c']
>     >>> s.split(' ')
>     ['', 'b', 'c', '']

I believe the following documents it accurately:
http://docs.python.org/lib/string-methods.html

        If sep is not specified or is None, a different splitting
        algorithm is applied. First, whitespace characters (spaces,
        tabs, newlines, returns, and formfeeds) are stripped from both
        ends. Then, words are separated by arbitrary length strings of
        whitespace characters. Consecutive whitespace delimiters are
        treated as a single delimiter ("'1 2 3'.split()" returns "['1',
        '2', '3']"). Splitting an empty string or a string consisting of
        just whitespace returns an empty list.
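
To make the contrast concrete, here is a small sketch comparing the two
call forms side by side. The behaviour with an explicit separator is not
covered by the paragraph quoted above; the comments describe what the
interpreter does, and the variable names are only for illustration.

```python
s = ' b c '

# No separator: leading/trailing whitespace is dropped and runs of
# whitespace act as a single delimiter.
no_sep = s.split()

# Explicit separator: every single ' ' is a delimiter, so the leading
# and trailing spaces produce empty strings at both ends.
explicit = s.split(' ')

# Two consecutive separators delimit an empty string between them.
doubled = 'b  c'.split(' ')

# A whitespace-only string: empty list without a separator, but one
# empty string per gap when the separator is given explicitly.
blank_no_sep = '   '.split()
blank_explicit = '   '.split(' ')

print(no_sep)
print(explicit)
print(doubled)
print(blank_no_sep)
print(blank_explicit)
```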