[Python-ideas] This seems like a wart to me...

Ron Adam rrr at ronadam.com
Fri Dec 12 00:58:38 CET 2008



skip at pobox.com wrote:
>     Guido> Which of the two would you choose for all? The empty string is the
>     Guido> only reasonable behavior for split-with-argument, it is the logical
>     Guido> consequence of how it behaves when the string is not empty. E.g.
>     Guido> "x:y".split(":") -> ["x", "y"], "x::y".split(":") -> ["x", "", "y"],
>     Guido> ":".split(":") -> ["", ""]. OTOH split-on-whitespace doesn't behave
>     Guido> this way; it extracts the non-empty non-whitespace-containing
>     Guido> substrings.
> 
> In my feeble way of thinking I go from something which evaluates to false to
> something which doesn't. It's almost like making matter out of empty space:
> 
>     bool("") -> False
>     bool("".split()) -> False
>     bool("".split("n")) -> True
> 
>     Guido> If anything is wrong, it's that they share the same name. This
>     Guido> wasn't always the case. Do you really want to go back to .split()
>     Guido> and .splitfields(sep)?
> 
> That might be preferable.  The same method having such strikingly different
> behavior throws me every time I try splitting a possibly empty string with a
> non-whitespace character.  It's a relatively uncommon case.  Most of the
> time when you split a string with a non-whitespace character I think you
> know that the input can't be empty.
> 
> Skip


It looks like there are several behaviors involved in split, and you want 
to split those behaviors out.



Behaviors of string split:


1. Split on whitespace characters by giving no argument.

This has the effect of splitting on any of several characters. A run of 
consecutive whitespace characters counts as a single separator, so it 
never produces empty strings in the result.

 >>> '       '.split()
[]
 >>> ' \t\n'.split()
[]



2. Split on a word by giving an argument. (A word can be a single char.)

In this case the split is strict: adjacent separators are not combined, 
and the empty strings between them are kept in the result.

 >>> '       '.split(' ')
['', '', '', '', '', '', '', '']
 >>> ' \t\n'.split(' ')
['', '\t\n']


There doesn't seem to be an obvious way to split on several different 
characters at once.


A programmer new to Python might try:

 >>> '1 (123) 456-7890'.split(' ()-')
['1 (123) 456-7890']

Expecting: ['1', '123', '456', '7890']


 >>> '1 (123) 456-7890'.split([' ', '(', ')', '-'])
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
TypeError: expected a character buffer object


When I needed to split on multiple chars other than the default 
whitespace, I used .replace() to replace each of the different splitting 
characters with one single char sequence, which I could then split on.
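
A rough sketch of that workaround, for illustration only. The helper name 
split_via_replace is made up here, and it assumes the text should be split 
on whitespace as well as on the given characters:

```python
def split_via_replace(text, chars):
    # Hypothetical helper, not part of the string API.
    # Turn every splitting character into a space...
    for ch in chars:
        text = text.replace(ch, ' ')
    # ...then let the no-argument split() collapse the runs of whitespace.
    return text.split()


print(split_via_replace('1 (123) 456-7890', '()-'))
# ['1', '123', '456', '7890']
```

This only works when empty fields are not wanted, since the final 
.split() discards them.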


It might be nice to have a .splitonchars() version of split that defaults 
to whitespace chars and takes an argument specifying a different set of 
characters to split on.
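
One way such a thing might behave, written as a plain function on top of 
re.split(). This is only a sketch under assumptions: the name, the 
signature, and the choice to drop empty strings (as the no-argument 
split() does) are guesses at the idea, not an existing API.

```python
import re


def splitonchars(text, chars=None):
    """Hypothetical sketch: split on any run of the characters in `chars`."""
    if chars is None:
        # Default: same as the existing no-argument split().
        return text.split()
    if not chars:
        raise ValueError('empty separator')
    # A character class matching one or more of the given characters.
    pattern = '[' + re.escape(chars) + ']+'
    # Drop empty pieces, mirroring how split() treats whitespace.
    return [piece for piece in re.split(pattern, text) if piece]


print(splitonchars('1 (123) 456-7890', ' ()-'))
# ['1', '123', '456', '7890']
print(splitonchars(' \t\n'))
# []
```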

The other behavior could be called .splitonwords(arg). The .splitonwords() 
method could possibly also accept a list of words.
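
A similar sketch for the strict, word-based behavior, again as a plain 
function using re.split(). The name and signature are assumptions, and 
trying longer words first is just one possible way to resolve overlapping 
separators.

```python
import re


def splitonwords(text, words):
    """Hypothetical sketch: strict split on one word or a list of words."""
    if isinstance(words, str):
        words = [words]
    if not words or '' in words:
        raise ValueError('empty separator')
    # Try longer words first so that 'ab' wins over 'a' at the same position.
    alternatives = sorted(words, key=len, reverse=True)
    pattern = '|'.join(re.escape(word) for word in alternatives)
    # Strict split: empty strings are kept, as with split(sep).
    return re.split(pattern, text)


print(splitonwords('x::y', ':'))
# ['x', '', 'y']
print(splitonwords('a, b; c', [', ', '; ']))
# ['a', 'b', 'c']
```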


That would make it possible to leave the current .split() behavior alone, 
so existing code would not break.

Alternatively, these could be functions in the string module.  In that 
case the current .split() could just continue to exist as is.

I find the name 'splitfields' less intuitive than 'splitonwords' and 
'splitonchars'.   While both of those take more letters to type than 
split, they are more readable, and when you do need to split on more than 
one char or word, they are far shorter and less error-prone than rolling 
your own function.

Ron