Can you do it faster? (parsing text file)

Marcus Stojek stojek at part-gmbh.de
Tue Jan 28 06:16:04 EST 2003


Hi,

I have to parse large text files (15 MB and more, 200,000 lines).
Each file consists of comma-separated keywords and alphanumeric data.
Each command line starts with some or no blanks, a '*' and the keyword,
followed by comma-separated parameters, for example:
*LINE,colour= red, thiCKneSS=2.3
 12.3,34.5,2.0
 67.0,3.1,45.9
 12.3,34.5,2.0
 67.0,3.1,45.9
*LINE,colour= blue, thiCKneSS=2.6,
 12.3,34.5,2.0
 67.0,3.1,45.9
**This is a comment
**default color and thickness
*LINE,, 
 67.0,3.1,45.9
Comment lines have to start with some or no blanks and a '**'.
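
Schematically, each raw line falls into one of three classes (a
sketch just to illustrate the format; the helper name classify is
made up):

def classify(line):
    s=line.lstrip()        # leading blanks never count
    if s[:2]=='**':
        return 'comment'   # '**...' and '***...' lines
    elif s[:1]=='*':
        return 'keyword'   # command lines like '*LINE,colour=red,...'
    else:
        return 'data'      # plain comma separated numbers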

To parse this file I want to generate a flat list containing all
tokens (keyword, parameter, data), one after the other.

The replace operations are done on the whole file at once instead of
line by line with
for li in fa.readlines():
because I think it's faster.
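
A quick way to check that assumption is something like this (a rough
sketch; it reuses the dummy file written by gen_file() below, and the
absolute numbers of course depend on the machine):

import string
import time

fa=open('d:/temp/bench.txt','r')
t=time.clock()
d=string.replace(fa.read(),' ','')      # one call on the whole string
print 'whole file:  ',time.clock()-t

fa.seek(0)
t=time.clock()
d=[string.replace(li,' ','') for li in fa.readlines()]  # one call per line
print 'line by line:',time.clock()-t
fa.close()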

To delete the comment lines I have to split the file into single
lines first and then filter them.

Afterwards the individual lines are split up and appended to the
token list that is then parsed.
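
Condensed, the whole preparation boils down to this pipeline (a
sketch of what read_file() below does, nothing more):

d=open(filename).read()
d=string.upper(string.replace(string.replace(d,' ',''),'\t',''))
rows=filter(delcomm,string.split(d,'\n'))  # drop the '**' comment lines
tok=[]
for r in rows:
    tok.extend(string.split(r,','))        # flatten into one token list
tok=filter(None,tok)                       # drop empty tokens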

Below you find an example that generates a dummy file and parses it.
On my office PC it takes 16 seconds to prepare the token list and
16 seconds to parse it. This is too slow for me. Does anybody see
where I am losing speed? I tried working with re to get rid of the
comment lines, but this is not much faster. Using string.translate()
instead of the two replace() and one upper() didn't increase the
speed either.
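
What I mean by the re attempt is roughly this (a sketch, assuming d
holds the whole file as one string):

import re
comment=re.compile(r'(?m)^[ \t]*\*\*.*\n?')  # a complete '**' comment line
d=comment.sub('',d)                          # remove them all in one pass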

Is there a different approach for parsing such a file?
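
The only other idea I have is to let re do all the splitting in one
go, something like the sketch below, but I don't know whether it
would actually be faster:

import re
import string

def read_file2(filename):
    d=open(filename).read()
    # normalize as before: strip blanks and tabs, uppercase
    d=string.upper(string.replace(string.replace(d,' ',''),'\t',''))
    d=re.sub(r'(?m)^\*\*.*\n?','',d)        # drop the comment lines
    # split on runs of commas and newlines, dropping empty tokens
    tok=[t for t in re.split(r'[,\n]+',d) if t]
    tok.append('*ENDFILE')                  # emulate EOF as above
    return tok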

Thanks for any help.

marcus



#---snip----------------------------------------------------
import string
import time

#----------------------------------------------------------
def delcomm(x):
    # keep everything that is not a comment line; blanks are already
    # stripped at this point, so comments start directly with '**'
    return x[:2]!='**'
#----------------------------------------------------------
def read_file(filename):
    fa=open(filename,'r')
    d=fa.read()                 # slurp the whole file in one go
    fa.close()

    d=string.replace(d,' ','')  # strip blanks ...
    d=string.replace(d,'\t','') # ... and tabs from the whole string
    d=string.upper(d)           # normalize case (thiCKneSS -> THICKNESS)

    d=string.split(d,'\n')

    d=filter(delcomm,d)         # throw away the '**' comment lines

    tok=[]
    for s in d:
        tok.extend(string.split(s,','))

    tok=filter(lambda x:x,tok)  # drop empty tokens from ',,' and trailing ','
    tok.append('*ENDFILE')      # emulate EOF

    print 'prepare token list:',time.clock()
    print tok[-10:]
    return tok
#----------------------------------------------------------
def parse_file(filename):
    # the light version, assuming number and order of parameters are fixed
    tok=read_file(filename)
    lines=[]
    i=0
    while tok[i]:                   # empty tokens were filtered out, so the
                                    # loop only ends via the '*ENDFILE' return
        if tok[i]=='*LINE':
            color=tok[i+1]          # e.g. 'COLOR=RED'
            thick=tok[i+2]          # e.g. 'THICKNESS=3.1'
            i+=3
            data_per_line=3
            while tok[i][0]!='*':   # collect x,y,z triples up to next keyword
                x,y,z=tok[i],tok[i+1],tok[i+2]
                lines.append((float(x),float(y),float(z)))
                i+=data_per_line
        elif tok[i]=='*ENDFILE':
            return lines
        else:
            i+=1
    return 0
#------------------------------------------------------------
def gen_file(filename):
    # write a dummy file: 400 *LINE blocks, each followed by
    # 1000 data lines and 30 comment lines
    fa=open(filename,'w')
    for i in range(400):
        fa.write('*LINE, Color =red, Thickness=3.1,\n')
        for j in range(500):
            fa.write('3.0000000,  2.0000000,  7.000000000\n')
            fa.write('13.0000000,  97.0000000,  -127.000000000,\n')
        for j in range(15):
            fa.write('** This is a comment\n')
            fa.write('     *** This is a comment, with a comma\n')
    fa.close()
#------------------------------------------------------------

filename='d:/temp/bench.txt'
gen_file(filename)
time.clock()                        # start the clock
L=parse_file(filename)
print 'parsing: ',time.clock()
#---snip----------------------------------------------------




