Please help... with re

Tim Hochberg tim.hochberg at ieee.org
Wed Jul 26 16:42:15 EDT 2000


Gilles Lenfant <glt at e-pack.net> writes:
> I made an horrid 68 lines monster to split a string to a list of
> substrings based on following example:
> 
> This is an "example of a \"splitted\" text " by my monster.
> 
> results to this list:
> 
> [ 'This' , 'is' , 'an' , 'example of a "splitted" text ' , 'by' , 'my' ,
> 'monster' ]
> 
> But the stuff is too slow to parse the lines of giant log files.
> I would like to use "re" package to make a shorter and faster script but
> understanding its patterns/methods is not in my poor brain capabilities.
> 
> I have burned my last neurons to try to do it, and I'm close to the edge
> of a nervous breakdown.
> Who can help me to get it at work ?

In general, I think you're better off breaking this into a number of
shorter, simpler operations. It'll probably be faster and will almost
certainly be easier to read. Consider the attached function,
minimonster: it parses your example correctly and can do 10,000
repetitions of it in about 2.5 s on a slowish (300 MHz) machine. I
haven't seen your monster, but I would suspect that this is
faster and easier to read. This may not be exactly what you want (it
assumes no embedded nulls, for example), but maybe it'll get you
started.


N=10000
sampletext = r'This is an "example of a \"splitted\" text " by my monster. '*N

import string, re

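# Match a " that is not preceded by a backslash. The lookbehind keeps the
# preceding character (e.g. a space just inside the quotes) from being
# swallowed by the substitution, and also matches a " at the very start.
quote = re.compile(r'(?<!\\)"')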
def minimonster(text):
    # Replace ", but not \" with \0 so we can split on \0 
    # (ASSUMES NO EMBEDDED NULLS)
    newtext = quote.sub('\0', text)
    # Now replace \" with "
    newtext = string.replace(newtext, r'\"', '"')
    # now split on nulls
    textlist = string.split(newtext, '\0')
    # Now split on even sections only (odd sections are quoted).
    result = []
    isEven = 1
    for item in textlist:
        if isEven:
            result.extend(string.split(item))
        else:
            result.append(item)
        isEven = not isEven
    return result


import time

t0 = time.clock()
answer = minimonster(sampletext)
print time.clock() - t0
print answer[:50]
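
For readers on Python 3, where the `string` module functions, the `print`
statement and `time.clock()` are no longer available, the same steps could be
sketched roughly as follows. The names `minimonster3` and `UNESCAPED_QUOTE`
are made up for this illustration, and the sketch assumes balanced quotes and
no embedded NUL characters, like the original.

```python
import re
import time

# Matches a double quote that is NOT preceded by a backslash.
UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')

def minimonster3(text):
    """Split on whitespace, keeping double-quoted sections as single items.

    Assumes the text contains no NUL characters and that quotes are balanced.
    """
    # Mark every unescaped quote with NUL, then turn \" into a plain quote.
    marked = UNESCAPED_QUOTE.sub('\0', text).replace('\\"', '"')
    result = []
    # Even-numbered chunks are outside quotes, odd-numbered chunks are inside.
    for index, chunk in enumerate(marked.split('\0')):
        if index % 2 == 0:
            result.extend(chunk.split())
        else:
            result.append(chunk)
    return result

sample = r'This is an "example of a \"splitted\" text " by my monster. '
print(minimonster3(sample))

# Rough timing of 10,000 repetitions; the figure depends on the machine.
t0 = time.perf_counter()
minimonster3(sample * 10000)
print(time.perf_counter() - t0)
```

The first `print` shows the seven items of the example, with the trailing
space inside the quoted item preserved and the final period still attached to
`monster.`, since nothing in this approach strips punctuation.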

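Since the question asked for an `re`-only solution, here is a sketch of what a
single-pattern tokenizer might look like in Python 3, for comparison with the
step-by-step approach. The helper name `tokenize` and the pattern `TOKEN` are
hypothetical, and the sketch assumes that backslash escapes only occur inside
quoted sections.

```python
import re

# One token is either a double-quoted run (which may contain backslash
# escapes such as \") or a run of characters that are neither whitespace
# nor a double quote.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|([^\s"]+)')

def tokenize(text):
    """Hypothetical helper: return the tokens of text as a list."""
    tokens = []
    for match in TOKEN.finditer(text):
        quoted, bare = match.groups()
        if quoted is not None:
            # Inside quotes, only \" is unescaped, as in the example.
            tokens.append(quoted.replace('\\"', '"'))
        else:
            tokens.append(bare)
    return tokens

print(tokenize(r'This is an "example of a \"splitted\" text " by my monster.'))
```

The pattern is compact, but it is arguably harder to read and to adjust than
the sequence of simple operations, which is the trade-off described above.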