Can you do it faster? (parsing text file)

Stephen Simmons mail at stevesimmons.com
Tue Jan 28 08:44:05 EST 2003


Marcus,

Have you looked at Mike Fletcher's SimpleParse package
(http://sourceforge.net/projects/simpleparse/)? This is an EBNF interface
for Marc-Andre Lemburg's mxTextTools text tagging engine
(http://www.lemburg.com/files/python/mxTextTools.html). You write an EBNF
grammar for your data format, then use SimpleParse to build an mxTextTools
parser. It's simple to use, very fast (since the mxTextTools parser does all
its hard work in a C module rather than the Python interpreter), and--dare I
say it--a lot of fun!
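
For anyone who hasn't used it, the basic pattern looks something like
this. (An untested sketch from memory of the SimpleParse 2.x API, so
double-check the module path and the shape of the result against the docs.)

from simpleparse.parser import Parser

# A toy declaration: one production matching a signed decimal number.
decl = r'''
number := [-+]?, [0-9]+, ('.', [0-9]*)?
'''

parser = Parser(decl, 'number')

# parse() returns (success, resultTree, charsConsumed). The result tree
# is a nested list of (tagname, start, stop, children) tuples, i.e.
# slices into the original string, built by mxTextTools in C.
success, tree, consumed = parser.parse('-127.5')
print success, tree, consumed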

I use it for parsing complicated text structures out of 60Mb RTF text files.
The EBNF grammars are long and complicated with deeply nested entity
definitions (the grammar itself is ~200 lines long). Nevertheless
SimpleParse/mxTextTools takes around a minute to parse and tag, which is
several orders of magnitude faster than I feel it really ought to take.

Your data structure is much, much simpler, and your files smaller. So if
your code takes 32 seconds at the moment, maybe 5 seconds or less could be
possible.
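
For what it's worth, a grammar for your *LINE format might look roughly
like the one below. It's untested, the production names are my own, and
I'm writing the character-class escapes from memory, but it should give
you the flavour:

from simpleparse.parser import Parser

# Untested sketch of a declaration for the *LINE format. Alternatives
# are tried left to right, so '**' comments match before '*' commands.
decl = r'''
file      := (comment / command / dataline / blankline)+
comment   := ws, '**', -'\n'*, '\n'
command   := ws, '*', keyword, (',', ws, param?)*, '\n'
keyword   := [a-zA-Z]+
param     := -[,\n]+
dataline  := ws, number, (',', ws, number)*, ','?, ws, '\n'
blankline := ws, '\n'
number    := [-+]?, [0-9]+, ('.', [0-9]*)?
ws        := [ \t]*
'''

parser = Parser(decl, 'file')
data = open('d:/temp/bench.txt').read()
success, tree, consumed = parser.parse(data)

The tag tree comes back as (tagname, start, stop, children) tuples, so
pulling out each coordinate is just float(data[start:stop]) on the
slices, with none of the token-list surgery in pure Python.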

Best of luck,

Stephen

----- Original Message -----
From: "Marcus Stojek" <stojek at part-gmbh.de>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Tuesday, January 28, 2003 12:16 PM
Subject: Can you do it faster? (parsing text file)


> Hi,
>
> I have to parse large txt-files (15Mb and more, 200000 lines).
> The file consists of comma-separated keywords and alphanumerical data.
> Each command line starts with some or no blanks, a '*' and the keyword,
> followed by comma-separated parameters
> (*LINE,colour= red, thiCKneSS=2.3
>  12.3,34.5,2.0
>  67.0,3.1,45.9
>  12.3,34.5,2.0
>  67.0,3.1,45.9
> *LINE,colour= blue, thiCKneSS=2.6,
>  12.3,34.5,2.0
>  67.0,3.1,45.9
> **This is a comment
> **default color and thickness
> *LINE,,
>  67.0,3.1,45.9
> )
> Comment lines have to start with some or no blanks and a '**'.
>
> For parsing this file I want to generate a list containing all tokens
> (keyword, parameter, data), one after the other.
>
> The replace operations are done on the whole file rather than inside a
> 'for li in fa.readlines():' loop, because I think it's faster.
>
> To delete the comment lines I have to split the file, then filter it.
>
> Afterwards the individual lines are split into tokens and appended to
> the token list that will be parsed.
>
> Below is an example that generates a dummy file and parses it.
> On my office PC it takes 16 seconds to prepare the token list and
> 16 seconds to parse it. This is too slow for me. Does anybody see
> where I am losing speed? I tried working with re to get rid of the
> comment lines, but this is not much faster. Using string.translate()
> instead of the two replace() and one upper() didn't increase the
> speed either.
>
> Is there a different approach for parsing such a file?
>
> Thanks for any help.
>
> marcus
>
>
>
> #---snip----------------------------------------------------
> import string
> import time
>
> #----------------------------------------------------------
> def delcomm(x):
>     if x[:2]!='**':
>         return 1
>     else:
>         return 0
> #----------------------------------------------------------
> def read_file(filename):
>     fa=open(filename,'r')
>     d=fa.read()
>     fa.close()
>
>     d=string.replace(d,' ','')
>     d=string.replace(d,'\t','')
>     d=string.upper(d)
>
>     d=string.split(d,'\n')
>
>     d=filter(delcomm,d)
>
>     tok=[]
>     for s in d:
>         tok.extend(string.split(s,','))
>
>     tok=filter(lambda x:x,tok)
>     tok.append('*ENDFILE')#emulate EOF
>
>     print 'prepare token list:',time.clock()
>     print tok[-10:]
>     return tok
> #----------------------------------------------------------
> def parse_file(filename):
> # the light version, assuming number and order of parameters are fixed
>     tok=read_file(filename)
>     lines=[]
>     i=0
>     while tok[i]:
>         if tok[i]=='*LINE':
>             color=tok[i+1]
>             thick=tok[i+2]
>             i+=3
>             data_per_line=3
>             while tok[i][0]!='*':
>                 x,y,z=tok[i],tok[i+1],tok[i+2]
>                 lines.append((float(x),float(y),float(z)))
>                 i+=data_per_line
>         elif tok[i]=='*ENDFILE':
>             return lines
>         else:
>             i+=1
>     return 0
> #------------------------------------------------------------
> def gen_file(filename):
>     fa=open(filename,'w')
>     for i in range(400):
>         fa.write('*LINE, Color =red, Thickness=3.1,\n')
>         for j in range(500):
>             fa.write('3.0000000,  2.0000000,  7.000000000\n')
>             fa.write('13.0000000,  97.0000000,  -127.000000000,\n')
>         for j in range(15):
>             fa.write('** This is a comment\n')
>             fa.write('     *** This is a comment, with a comma\n')
>     fa.close()
> #------------------------------------------------------------
>
> filename='d:/temp/bench.txt'
> gen_file(filename)
> time.clock()
> L=parse_file(filename)
> print 'parsing: ',time.clock()
> #---snip----------------------------------------------------
>
>