space / nonspace

marduk marduk at nbk.hopto.org
Sun Jul 22 17:28:30 EDT 2007


On Sun, 2007-07-22 at 22:33 +0200, Peter Kleiweg wrote:
>     >>> import re
>     >>> s = u'a b\u00A0c d'
>     >>> s.split()
>     [u'a', u'b', u'c', u'd']
>     >>> re.findall(r'\S+', s)
>     [u'a', u'b\xa0c', u'd']
> 

If you want the Unicode interpretation of \S+, etc., pass the
re.UNICODE flag:

        >>> re.findall(r'\S+', s, re.UNICODE)
        [u'a', u'b', u'c', u'd']

See http://docs.python.org/lib/node46.html
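
For anyone trying this on Python 3: as far as I know, str patterns there
are Unicode-aware by default, so the difference is easiest to see by
requesting the ASCII-only classes explicitly with re.ASCII. A minimal
sketch, assuming Python 3 (the variable names are only for illustration):

```python
import re

# Same test string as above: U+00A0 is a no-break space.
s = 'a b\u00A0c d'

# ASCII-only classes: \S treats U+00A0 as a non-space character,
# so 'b\xa0c' stays together.
ascii_words = re.findall(r'\S+', s, re.ASCII)

# Unicode classes: \s includes U+00A0, so it separates 'b' and 'c'.
unicode_words = re.findall(r'\S+', s, re.UNICODE)

print(ascii_words)    # ['a', 'b\xa0c', 'd']
print(unicode_words)  # ['a', 'b', 'c', 'd']
```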

> 
> This isn't documented either:
> 
>     >>> s = ' b c '
>     >>> s.split()
>     ['b', 'c']
>     >>> s.split(' ')
>     ['', 'b', 'c', '']

I believe the following documents it accurately:
http://docs.python.org/lib/string-methods.html

        If sep is not specified or is None, a different splitting
        algorithm is applied. First, whitespace characters (spaces,
        tabs, newlines, returns, and formfeeds) are stripped from both
        ends. Then, words are separated by arbitrary length strings of
        whitespace characters. Consecutive whitespace delimiters are
        treated as a single delimiter ("'1 2 3'.split()" returns "['1',
        '2', '3']"). Splitting an empty string or a string consisting of
        just whitespace returns an empty list.
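
The two algorithms described there can be compared side by side. This is
only a small sketch of the behaviour as I understand it, with made-up
variable names:

```python
s = ' b c '

# No separator: leading/trailing whitespace is stripped and runs of
# whitespace count as a single delimiter.
no_sep = s.split()

# Explicit separator: every single ' ' is a delimiter, so the leading
# and trailing spaces produce empty strings at both ends.
with_sep = s.split(' ')

# A whitespace-only string shows the difference most clearly.
blank_no_sep = '   '.split()
blank_with_sep = '   '.split(' ')

print(no_sep)          # ['b', 'c']
print(with_sep)        # ['', 'b', 'c', '']
print(blank_no_sep)    # []
print(blank_with_sep)  # ['', '', '', '']
```

So s.split() and s.split(' ') are not interchangeable, even when the
input contains nothing but plain spaces.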



