Different number of matches from re.findall and re.split

Mon Jan 11 11:03:42 EST 2010

Jeremy wrote:

> On Jan 11, 8:44 am, Iain King <iaink... at gmail.com> wrote:
>> On Jan 11, 3:35 pm, Jeremy <jlcon... at gmail.com> wrote:
>>
>> > Hello all,
>>
>> > I am using re.split to separate some text into logical structures.
>> > The trouble is that re.split doesn't find everything while re.findall
>> > does; i.e.:
>>
>> > > found = re.findall('^ 1', line, re.MULTILINE)
>> > > len(found)
>> > 6439
>> > > tables = re.split('^ 1', line, re.MULTILINE)
>> > > len(tables)
>> > 1
>>
>> > Can someone explain why these two commands are giving different
>> > results?  I thought I should have the same number of matches (or maybe
>> > different by 1, but not 6000!)
>>
>> > Thanks,
>> > Jeremy
>>
>> re.split doesn't take re.MULTILINE as a flag: it doesn't take any
>> flags. It does take a maxsplit parameter, to which you are passing the
>> value of re.MULTILINE (which happens to be 8 in my implementation).
>> Since the flag never reaches the pattern, ^ matches only at the very
>> start of the string, so your re.split never finds the line starts
>> that re.findall does.
> 
> Yep.  Thanks for pointing that out.  I guess I just assumed that
> re.split was similar to re.search/match/findall in what it accepted as
> function parameters.  I guess I'll have to use a \n instead of a ^ for
> split.

You can precompile the pattern and then invoke the split() method:

>>> re.compile("^X", re.MULTILINE).split("""X alpha
... beta
... X gamma
... delta X
... X
... zeta
... """)
['', ' alpha\nbeta\n', ' gamma\ndelta X\n', '\nzeta\n']
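
To make the difference visible side by side, here is a minimal sketch. The
sample text and the variable names are made up for illustration, and the
last call assumes a Python version in which re.split() accepts a flags
keyword (2.7 and 3.1 onwards, as far as I know); on older versions only
the precompiled form applies.

```python
import re

# Hypothetical sample data: three lines start with " 1".
text = " 1 alpha\nbeta\n 1 gamma\n 1 delta\n"
pattern = r"^ 1"

# The third positional parameter of re.split() is maxsplit, not flags,
# so re.split(pattern, text, re.MULTILINE) behaves like this call:
as_maxsplit = re.split(pattern, text, maxsplit=int(re.MULTILINE))

# Precompiling attaches the flag to the pattern itself.
compiled = re.compile(pattern, re.MULTILINE).split(text)

# Assumption: this Python's re.split() has a flags keyword.
with_flags = re.split(pattern, text, flags=re.MULTILINE)

found = re.findall(pattern, text, re.MULTILINE)

print(int(re.MULTILINE))             # the value that ended up as maxsplit
print(len(found))                    # number of line-start matches
print(len(as_maxsplit))              # ^ only matched at the string start
print(len(compiled), len(with_flags))
```

With the flag applied, split() returns one more piece than findall() has
matches, which is the "different by 1" the original poster expected.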

Peter