space / nonspace
marduk
marduk at nbk.hopto.org
Sun Jul 22 17:28:30 EDT 2007
On Sun, 2007-07-22 at 22:33 +0200, Peter Kleiweg wrote:
> >>> import re
> >>> s = u'a b\u00A0c d'
> >>> s.split()
> [u'a', u'b', u'c', u'd']
> >>> re.findall(r'\S+', s)
> [u'a', u'b\xa0c', u'd']
>
If you want the Unicode interpretation of \S+, etc., pass the
re.UNICODE flag:
>>> re.findall(r'\S+', s, re.UNICODE)
[u'a', u'b', u'c', u'd']
See http://docs.python.org/lib/node46.html
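If you need this in more than one place, the flag can also be given once when compiling the pattern. A minimal sketch, assuming the same sample string as above; the name NONSPACE is made up for this example and is not part of the re module:

```python
import re

# Same sample string as above: U+00A0 is a no-break space.
s = u'a b\u00A0c d'

# Compile once with re.UNICODE so \S follows the Unicode notion of whitespace.
# NONSPACE is a hypothetical name chosen for this sketch.
NONSPACE = re.compile(r'\S+', re.UNICODE)

tokens = NONSPACE.findall(s)

# With the flag set, the regex and s.split() agree on the word boundaries.
assert tokens == s.split()
```

The only point of the sketch is that the flag travels with the compiled pattern, so the individual findall() calls do not have to repeat it.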
>
> This isn't documented either:
>
> >>> s = ' b c '
> >>> s.split()
> ['b', 'c']
> >>> s.split(' ')
> ['', 'b', 'c', '']
I believe the following documents it accurately:
http://docs.python.org/lib/string-methods.html
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from both
ends. Then, words are separated by arbitrary length strings of
whitespace characters. Consecutive whitespace delimiters are
treated as a single delimiter ("'1 2 3'.split()" returns "['1',
'2', '3']"). Splitting an empty string or a string consisting of
just whitespace returns an empty list.
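The two behaviours described in that paragraph can be checked side by side. A small sketch using the string from the question plus a few edge cases; the variable names are only for illustration:

```python
s = ' b c '

# No separator: leading/trailing whitespace is dropped and runs of
# whitespace count as a single delimiter.
no_sep = s.split()            # ['b', 'c']

# Explicit separator: every single ' ' is a delimiter, so empty
# strings show up at both ends.
with_sep = s.split(' ')       # ['', 'b', 'c', '']

# Consecutive delimiters collapse only when no separator is given.
runs = '1  2  3'.split()                # ['1', '2', '3']
runs_explicit = '1  2  3'.split(' ')    # ['1', '', '2', '', '3']

# An empty or all-whitespace string yields an empty list without a separator.
empty = ''.split()            # []
blank = '   '.split()         # []
```

So s.split(' ') is the "simple" algorithm that keeps empty fields, and s.split() is the whitespace-aware one quoted above.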