Behavior of re.split on empty strings is unexpected

MRAB python at mrabarnett.plus.com
Mon Aug 2 14:02:47 EDT 2010


John Nagle wrote:
> The regular expression "split" behaves slightly differently than string 
> split:
> 
>  >>> import re
>  >>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)
> 
>  >>> kresplit2.split("   HELLO    THERE   ")
> ['', 'HELLO', 'THERE', '']
> 
>  >>> kresplit2.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
> 
> I'd thought that "split" would never produce an empty string, but
> it will.
> 
> The regular string split operation doesn't yield empty strings:
> 
>  >>> "   HELLO   THERE ".split()
> ['HELLO', 'THERE']
> 
Yes it does, when you give it an explicit separator:

 >>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']
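
As a rough illustration (a sketch assuming Python 3, with variable names
made up for the example), re.split with a pattern that matches exactly
one space appears to give the same list as str.split(" "), which suggests
the two differ only in what counts as a separator:

```python
import re

text = "   HELLO    THERE   "

# Every single space is its own separator here, so adjacent spaces
# produce empty strings in both results.
by_str = text.split(" ")
by_re = re.split(r" ", text)

print(by_str == by_re)
```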

> If I try to get the functionality of string split with re:
> 
>  >>> s2 = "   HELLO   THERE  "
>  >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>  >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
> 
> I still get empty strings.
> 
> The documentation just describes re.split as "Split string by the 
> occurrences of pattern", which is not too helpful.
> 
It's the plain str.split(), called with no separator, which is unusual
in that:

1. it treats each run of whitespace as a single separator instead of
splitting at every whitespace character;

2. it discards leading and trailing whitespace, so no empty strings
appear at either end.

Compare:

 >>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:

 >>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)
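
If the goal is to get the behaviour of plain str.split() out of re, two
possible approaches are sketched below. This assumes Python 3, where str
patterns are Unicode-aware by default, and the variable names are only
illustrative. Note that \W+ also splits on punctuation, so neither is an
exact equivalent of whitespace splitting:

```python
import re

s2 = "   HELLO   THERE  "

# Option 1: split on runs of non-word characters, then drop the empty
# strings that appear when the text starts or ends with a separator.
filtered = [part for part in re.split(r"\W+", s2) if part]

# Option 2: match the words themselves instead of the separators.
words = re.findall(r"\w+", s2)

print(filtered)  # ['HELLO', 'THERE']
print(words)     # ['HELLO', 'THERE']
```

The second form avoids the empty strings altogether, since there is
nothing to match before the first word or after the last one.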


