[ python-Bugs-1105286 ] Undocumented implicit strip() in split(None) string method

SourceForge.net noreply at sourceforge.net
Tue Nov 7 15:06:16 CET 2006


Bugs item #1105286, was opened at 2005-01-19 16:04
Message generated for change (Comment added) made by yohell
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1105286&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Documentation
Group: None
>Status: Open
Resolution: Fixed
Priority: 5
Private: No
Submitted By: YoHell (yohell)
Assigned to: Raymond Hettinger (rhettinger)
Summary: Undocumented implicit strip() in split(None) string method

Initial Comment:
Hi! 

I noticed that the string method split() first does an
implicit strip() before splitting when it's used with
no arguments or with None as the separator (sep in the
docs). There is no mention of this implicit strip() in
the docs.

Example 1:
s = " word1 word2 "

s.split() then returns ['word1', 'word2'] and not ['',
'word1', 'word2', ''] as one might expect.
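
The difference can be checked directly. A minimal sketch, written for Python 3, that contrasts the default whitespace split with an explicit single-space separator (the variable names are only illustrative):

```python
s = " word1 word2 "

# Default (sep=None): runs of whitespace act as one separator, and
# leading/trailing whitespace produces no empty strings.
default_split = s.split()

# Explicit separator: every single space is a delimiter, so the
# leading and trailing spaces each yield an empty string.
explicit_split = s.split(" ")

print(default_split)
print(explicit_split)
```

Passing " " explicitly gives the list with empty strings at both ends, but it only treats the space character as a delimiter, not tabs or newlines.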

WHY IS THIS BAD?

1. Because it's undocumented. See:
http://www.python.org/doc/current/lib/string-methods.html#l2h-197

2. Because it may lead to unexpected behavior in programs. 
Example 2:
FASTA sequence headers are one-line descriptors of
biological sequences and have this form: 
">" + Identifier + whitespace + free text description.

Let sHeader be a Python string containing a FASTA
header. One could then use the following syntax to
extract the identifier from the header:

sID = sHeader[1:].split(None, 1)[0]

However, this does not work if sHeader contains a
faulty FASTA header where the identifier is missing or
consists of whitespace. In that case sID will contain
the first word of the free text description, which is
not the desired behavior. 
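
One way to guard against this is to check for leading whitespace before splitting. The following is only a sketch: fasta_identifier is a hypothetical helper, the sample headers are made up, and it assumes the identifier must start immediately after the ">".

```python
def fasta_identifier(header):
    """Return the identifier of a FASTA header, or None if it is missing.

    Hypothetical helper: the identifier is assumed to be the run of
    non-whitespace characters immediately following ">".
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % (header,))
    body = header[1:]
    # An empty body, or one that starts with whitespace, has no identifier.
    # split(None, 1) would silently skip that whitespace and return the
    # first word of the description instead.
    if not body or body[0].isspace():
        return None
    return body.split(None, 1)[0]


good = ">seq1 example description"   # made-up header with an identifier
bad = "> example description"        # made-up header without one

naive_id = bad[1:].split(None, 1)[0]  # picks up a word of the description
print(fasta_identifier(good), fasta_identifier(bad), naive_id)
```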

WHAT SHOULD BE DONE?

The implicit strip() should be removed, or at least
programmers should be given the option to turn it off.
At the very least it should be documented so that
programmers have a chance of adapting their code to it.

Thank you for an otherwise splendid language!
/Joel Hedlund
Ph.D. Student
IFM Bioinformatics
Linköping University

----------------------------------------------------------------------

>Comment By: YoHell (yohell)
Date: 2006-11-07 15:06

Message:
Logged In: YES 
user_id=1008220

I'm opening this again, since the docs still don't reflect
the behavior of the method. 

From the docs:
"""
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from
both ends. 
"""

This is not true when maxsplit is given.

Example:

>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

When maxsplit is given, split() is obviously not stripping whitespace
from both ends of the string before splitting the rest of it. 
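
If stripped pieces are wanted together with maxsplit, one option is to call strip() explicitly first. A small sketch for Python 3 (variable names are illustrative):

```python
s = " foo bar "

unlimited = s.split(None)        # no maxsplit: no whitespace survives
limited = s.split(None, 1)       # maxsplit=1: the remainder keeps its trailing space

# Stripping explicitly before splitting removes the trailing whitespace
# from the last piece as well.
stripped_first = s.strip().split(None, 1)

print(unlimited, limited, stripped_first)
```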



----------------------------------------------------------------------

Comment By: Wummel (calvin)
Date: 2005-01-24 13:51

Message:
Logged In: YES 
user_id=9205

This should probably also be added to rsplit()?
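
For comparison, a short Python 3 sketch of how rsplit() treats whitespace at the ends when sep is None (variable names are illustrative):

```python
s = " foo bar "

# Without maxsplit, rsplit() gives the same result as split().
r_unlimited = s.rsplit(None)

# With maxsplit, splitting starts from the right, so it is the
# leftmost piece that keeps its leading whitespace.
r_limited = s.rsplit(None, 1)

print(r_unlimited, r_limited)
```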

----------------------------------------------------------------------

Comment By: Terry J. Reedy (tjreedy)
Date: 2005-01-24 08:15

Message:
Logged In: YES 
user_id=593130

To me, the removal of whitespace at the ends (stripping) is 
consistent with the removal (or collapsing) of extra 
whitespace in between so that .split() does not return empty 
words anywhere.  Consider:

>>> ',1,,2,'.split(',')
['', '1', '', '2', '']

If ' 1  2 '.split() were to return null strings at the beginning 
and end of the list, then to be consistent, it should also put 
one in the middle.  One can get this by being explicit (mixed 
WS can be handled by translation):

>>> ' 1  2 '.split(' ')
['', '1', '', '2', '']
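
A sketch of the translation idea for mixed whitespace. It assumes Python 3's str.maketrans (the 2.x releases discussed here offered string.maketrans instead), and the choice of whitespace characters to map is an assumption:

```python
# Map tab, newline, carriage return, form feed and vertical tab to a
# plain space, then split on that single explicit separator.
ws_to_space = str.maketrans("\t\n\r\f\v", "     ")

mixed = "\t1 \n2 "
normalized = mixed.translate(ws_to_space)   # every whitespace char is now ' '
parts = normalized.split(" ")

print(parts)
```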

Having said this, I also agree that the extra words proposed 
by jj are helpful.

BUG??  In 2.2, splitting an empty or whitespace only string 
produces an empty list [], not a list with a null word [''].

>>> ''.split()
[]
>>> '   '.split()
[]

which is what I see as consistent with the rest of the no-null-
word behavior.  Has this changed since?  (Yes, must 
upgrade.)  I could find no indication of such change in either 
the tracker or CVS.

----------------------------------------------------------------------

Comment By: YoHell (yohell)
Date: 2005-01-20 15:59

Message:
Logged In: YES 
user_id=1008220

Brilliant, guys!

Thanks again for a superb scripting language, and with
documentation to match!

Take care!
/Joel Hedlund

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-20 15:50

Message:
Logged In: YES 
user_id=80475

The proposed wording is fine.

If there are no objections or concerns, I'll apply it soon.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2005-01-20 15:28

Message:
Logged In: YES 
user_id=764593

Replacing the quoted line:

"""
...

If sep is not specified or is None, a different splitting 
algorithm is applied. First whitespace (spaces, tabs, 
newlines, returns, and formfeeds) is stripped from both 
ends. Then words are separated by arbitrary length 
strings of whitespace characters. Consecutive whitespace 
delimiters are treated as a single delimiter ("'1 2 3'.split()" 
returns "['1', '2', '3']"). Splitting an empty (or whitespace-
only) string returns "['']".
"""


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-20 15:04

Message:
Logged In: YES 
user_id=80475

What new wording do you propose to be added?

----------------------------------------------------------------------

Comment By: YoHell (yohell)
Date: 2005-01-20 11:15

Message:
Logged In: YES 
user_id=1008220

In RE to tim_one:
> I think the docs for split() under "String Methods" are quite 
> clear:

On the contrary, my friend, and here's why:

> """
> ...
> If sep is not specified or is None, a different splitting
> algorithm is applied. 

This sentence does not say that whitespace will be
implicitly stripped from the edges of the string.

> Words are separated by arbitrary length strings of whitespace 
> characters (spaces, tabs, newlines, returns, and formfeeds). 

Neither does this one.

> Consecutive whitespace delimiters are treated as a single delimiter
> ("'1 2 3'.split()" returns "['1', '2', '3']"). 

And not that one.

> Splitting an empty string returns "['']".
> """

And that last one does not mention it either. In fact, there
is no mention in the docs of how separators on the edges of
strings are treated by the split method. Furthermore,
there is no mention that s.split(sep) treats them
differently when sep is None than it does otherwise. Example:

>>> ",2,".split(',')
['', '2', '']
>>> " 2 ".split()
['2']

This inconsistent behavior is not in line with how
beautifully thought out the Python language is otherwise,
and how brilliantly everything else is documented on the
http://python.org/doc/ documentation pages. 

> This won't change, because mountains of code rely on this 
> behavior -- it's probably the single most common use case 
> for .split().

I thought as much. However, it would be really easy for
an admin to add a line of documentation to .split() to
explain this. That would certainly help make me a happier
man, and hopefully others too.

Cheers guys!
/Joel

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2005-01-19 17:56

Message:
Logged In: YES 
user_id=31435

I think the docs for split() under "String Methods" are quite 
clear:

"""
...

If sep is not specified or is None, a different splitting 
algorithm is applied. Words are separated by arbitrary length 
strings of whitespace characters (spaces, tabs, newlines, 
returns, and formfeeds). Consecutive whitespace delimiters 
are treated as a single delimiter ("'1 2 3'.split()" 
returns "['1', '2', '3']"). Splitting an empty string returns "['']". 
"""

This won't change, because mountains of code rely on this 
behavior -- it's probably the single most common use case 
for .split().


----------------------------------------------------------------------
