Behavior of re.split on empty strings is unexpected

Mon Aug 2 17:35:41 EDT 2010

On Mon, Aug 2, 2010 at 2:22 PM, John Nagle <nagle at animats.com> wrote:

> On 8/2/2010 12:52 PM, Thomas Jollans wrote:
>
>> On 08/02/2010 09:41 PM, John Nagle wrote:
>>
>>> On 8/2/2010 11:02 AM, MRAB wrote:
>>>
>>>> John Nagle wrote:
>>>>
>>>>> The regular expression "split" behaves slightly differently than
>>>>> string split:
>>>>>
>>>> occurrences of pattern", which is not too helpful.
>>>
>>>>
>>>>> It's the plain str.split() which is unusual in that:
>>>>
>>>> 1. it splits on sequences of whitespace instead of one per occurrence;
>>>>
>>>
>>>    That can be emulated with the obvious regular expression:
>>>
>>>     re.compile(r'\W+')
>>>
>>> 2. it discards leading and trailing sequences of whitespace.
>>>>
>>>
>>>    But that can't, or at least I can't figure out how to do it.
>>>
>>
>> [s for s in rexp.split(long_s) if s]
>>
>
>   Of course I can discard the blank strings afterward, but
> is there some way to do it in the "split" operation?  If
> not, then the default case for "split()" is too non-standard.
>
>   (Also, "if s" won't work;   if s != ''   might)
>
>                                John Nagle
> --
>

What makes it non-standard? The fact that it's not a one-line regex? The
default case for str.split is designed to handle the most common case: you
want to break a string into words, where a word is defined as a sequence of
non-whitespace characters.
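
For what it's worth, here is a small sketch comparing the two. The sample
string and the variable names are made up for illustration. It uses `\s+`
because `\s` is the whitespace class, whereas `\W` matches any non-word
character, punctuation included. As far as I can tell, filtering with a plain
"if s" is enough, since an empty string is falsy, and re.findall on `\S+`
gets the words in a single call without any filtering.

```python
import re

# Hypothetical sample input with leading and trailing whitespace.
text = "  the quick\tbrown  fox \n"

# Default str.split(): a run of whitespace counts as one separator,
# and leading/trailing whitespace produces no empty strings.
words = text.split()

# re.split on runs of whitespace leaves an empty string at each end
# when the input starts or ends with whitespace.
raw = re.split(r"\s+", text)

# Dropping the empty strings afterwards recovers the same words.
# An empty string is falsy, so "if s" is sufficient.
filtered = [s for s in raw if s]

# Alternative: match the words directly instead of splitting.
found = re.findall(r"\S+", text)

print(words)
print(raw)
print(filtered)
print(found)
```

So the default str.split() behavior can be had from the re module in one
call, just with findall rather than split.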
