Please help... with re

Wed Jul 26 17:43:43 EDT 2000

"Olivier Dagenais" <olivierS.dagenaisP at canadaA.comM> wrote in message
news:bzHf5.48352$1h3.670995 at news20.bellglobal.com...
    [snip]
> A - stream your input character by character
> B - when you encounter a space, add all "buffered" characters to the list
> C - if you encounter a quote, ignore rule B until you hit another quote
> D - if you hit a backslash, ignore rule C for the next character
> E - once you run out of characters, add all "buffered" characters to the
> list

Nice, clean approach.  Who knows about performance, since the re engine
is coded in C while this FSM would be coded in Python, but it's worth
giving it a try, I think.  Here's a rather straightforward coding of it.
It can no doubt be coded more elegantly by making the FSM explicit; this
version relies far too much on making its checks in a specific order,
and on 'continue' statements to avoid too-deep nesting.  Still, here it
comes, coded off the cuff:

def splitaline(line):
    result=[]
    curtok=[]
    insidequote=0
    literalnext=0
    for c in line:
        if literalnext:
            # rule D: the char after a backslash is taken literally
            curtok.append(c)
            literalnext=0
            continue
        if c=='\\':
            literalnext=1
            continue
        if insidequote:
            # rule C: spaces don't split until the closing quote
            if c=='"':
                result.append(''.join(curtok))
                curtok=[]
                insidequote=0
            else:
                curtok.append(c)
            continue
        if c=='"':
            insidequote=1
        elif c==' ':
            # rule B; skip empty tokens (repeated spaces, or the
            # space right after a closing quote)
            if curtok:
                result.append(''.join(curtok))
                curtok=[]
        else:
            curtok.append(c)
    # rule E: flush whatever is still buffered
    if curtok:
        result.append(''.join(curtok))
    return result
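
For what it's worth, here is a rough sketch of what "making the FSM
explicit" might look like.  The state names (PLAIN, QUOTED, ESC_PLAIN,
ESC_QUOTED) and the function name split_fsm are my own inventions for
illustration, not anything from the original 68-line version.  I'm
assuming the same rules A-E, plus one extra choice: a space that arrives
with nothing buffered (repeated spaces, or the space right after a
closing quote) does not produce an empty token.

```python
# Hypothetical state names for an explicit version of the splitter.
PLAIN, QUOTED, ESC_PLAIN, ESC_QUOTED = range(4)

def split_fsm(line):
    result = []
    curtok = []
    state = PLAIN
    for c in line:
        if state in (ESC_PLAIN, ESC_QUOTED):
            # The character after a backslash is always kept as-is,
            # then we return to the state we escaped from.
            curtok.append(c)
            state = PLAIN if state == ESC_PLAIN else QUOTED
        elif c == '\\':
            # Remember which state to come back to after the escape.
            state = ESC_PLAIN if state == PLAIN else ESC_QUOTED
        elif state == QUOTED:
            if c == '"':
                # A closing quote ends the token, even an empty one.
                result.append(''.join(curtok))
                curtok = []
                state = PLAIN
            else:
                curtok.append(c)
        elif c == '"':
            state = QUOTED
        elif c == ' ':
            # Assumption: spaces with nothing buffered are skipped.
            if curtok:
                result.append(''.join(curtok))
                curtok = []
        else:
            curtok.append(c)
    # Flush whatever is still buffered at end of input.
    if curtok:
        result.append(''.join(curtok))
    return result

print(split_fsm(r'This is an "example of a \"splitted\" text " by my monster.'))
```

Folding the "literal next" flag into two escape states means every
branch depends only on (state, character), so the order of the checks
matters less than in the flag-based version.  Note that the trailing
period stays attached to 'monster.', since nothing in rules A-E strips
punctuation.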

> I made an horrid 68 lines monster to split a string to a list of
> substrings

Well, at least this halves it:-).

> based on following example:
>
> This is an "example of a \"splitted\" text " by my monster.
>
> results to this list:
>
> [ 'This' , 'is' , 'an' , 'example of a "splitted" text ' , 'by' , 'my' ,
> 'monster' ]
>
> But the stuff is too slow to parse the lines of giant log files.

Alex