[issue17668] re.split loses characters matching ungrouped parts of a pattern

Tue Apr 9 08:21:24 CEST 2013

R. David Murray added the comment:

re.split returns the pieces between matches plus the text of any capturing groups, so only group the stuff you want to see in the result:

>>> re.split(r'(^>.*$)', '>Homo sapiens catenin (cadherin-associated)')
['', '>Homo sapiens catenin (cadherin-associated)', '']

>>> re.split(r'^(>.*)$', '>Homo sapiens catenin (cadherin-associated)')
['', '>Homo sapiens catenin (cadherin-associated)', '']

If you are using grouping to get alternatives, you can use a non-capturing group:

>>> re.split(r'(ca(?:t|d))', '>Homo sapiens catenin (cadherin-associated)')
['>Homo sapiens ', 'cat', 'enin (', 'cad', 'herin-associated)']
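
For comparison, here is a small sketch of what the same split looks like with no capturing group at all, and with a nested capturing group instead of the non-capturing one. The variable names (s, no_group, nested) are just for illustration:

```python
import re

s = '>Homo sapiens catenin (cadherin-associated)'

# No capturing group: the matched separators are dropped from the result.
no_group = re.split(r'ca(?:t|d)', s)
print(no_group)
# ['>Homo sapiens ', 'enin (', 'herin-associated)']

# Nested capturing group: both the outer and the inner group are inserted.
nested = re.split(r'(ca(t|d))', s)
print(nested)
# ['>Homo sapiens ', 'cat', 't', 'enin (', 'cad', 'd', 'herin-associated)']
```

The extra 't' and 'd' entries in the second result are what the non-capturing (?:...) form avoids.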

(By the way, I'm a bit confused as to what exactly you are splitting in your original example, since the pattern is anchored at both ends and so can only ever match the whole string. On the other hand, regular expressions regularly confuse me... :)

Indeed, I do not think it is worth complicating the interface to handle the unusual case of accepting and applying unknown regexes. The one change I could see as a possibility would be to allow all of the groups matched by the split regex to appear as a single sublist in the result. But I'm not the maintainer of this module either :)
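
For the unknown-regex case, something like the following can be done today without touching re.split. This is only a rough sketch built on re.finditer; the helper names split_keep_matches and split_group_sublists are made up for this example and are not part of the re module. The second helper illustrates the "single sublist" idea, not an existing feature. Behaviour for patterns that can match the empty string is not addressed here and may differ from re.split.

```python
import re

def split_keep_matches(pattern, text):
    """Split text on pattern, keeping each whole match (group 0),
    regardless of which groups the pattern happens to contain."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        parts.append(text[pos:m.start()])  # text before this match
        parts.append(m.group(0))           # the whole match
        pos = m.end()
    parts.append(text[pos:])               # text after the last match
    return parts

def split_group_sublists(pattern, text):
    """Hypothetical variant: put all groups of each match in one sublist."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        parts.append(text[pos:m.start()])
        parts.append(list(m.groups()))
        pos = m.end()
    parts.append(text[pos:])
    return parts

s = '>Homo sapiens catenin (cadherin-associated)'
print(split_keep_matches(r'ca(t|d)', s))
# ['>Homo sapiens ', 'cat', 'enin (', 'cad', 'herin-associated)']
print(split_group_sublists(r'(ca)(t|d)', s))
# ['>Homo sapiens ', ['ca', 't'], 'enin (', ['ca', 'd'], 'herin-associated)']
```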

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue17668>
_______________________________________