Can you do it faster? (parsing text file)

Mike C. Fletcher mcfletch at rogers.com
Tue Jan 28 11:05:51 EST 2003


Since Stephen suggested it, here is a SimpleParse grammar that parses 
out the various features in the file, though it leaves the array of data 
values as a single block, the idea being that you'd want to use low-level 
processing for that (string.translate, then string.split, then map float, 
storing the result in Numeric Python arrays).  Parsing the individual 
floats during the parse would create 5 Python objects for each float _on 
top of_ what you need for the low-level manipulation you'll eventually 
have to do anyway.
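
Something along these lines would do for the data blocks once you have 
them (untested sketch; the extract_floats name is mine, and it assumes 
you have the Numeric package installed):

import string
import Numeric

def extract_floats( data_text ):
    """Low-level pass over one raw data block (illustration only)"""
    # translate commas to spaces so a single split() yields the tokens,
    # then map them to float and pack the result into a Numeric array
    translated = string.translate( data_text, string.maketrans( ',', ' ' ))
    return Numeric.array( map( float, string.split( translated )), Numeric.Float )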

Parsing the test file with this grammar takes ~2 seconds on a 1GHz 
machine, incidentally.  You'll still need to process the parse-tree, but 
that's not particularly slow.  Parsing the same file while extracting 
individual floats during the parse takes approximately three minutes.
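
For the tree-processing step I'd use a dispatch processor; here's a 
rough, untested outline (the CommandProcessor class and what it chooses 
to collect are just illustration, not something I've run against the 
test file):

from simpleparse.dispatchprocessor import DispatchProcessor, dispatchList, getString

class CommandProcessor( DispatchProcessor ):
    """Rough outline: collect name, parameters and raw data per command"""
    def __init__( self ):
        self.commands = []
    def command_line( self, (tag,start,stop,subtags), buffer ):
        # reported children arrive in document order: name, parameter*, data*
        self.commands.append( dispatchList( self, subtags, buffer ))
    def name( self, (tag,start,stop,subtags), buffer ):
        return getString( (tag,start,stop,subtags), buffer )
    def parameter( self, (tag,start,stop,subtags), buffer ):
        return dispatchList( self, subtags, buffer )
    def number( self, (tag,start,stop,subtags), buffer ):
        return float( getString( (tag,start,stop,subtags), buffer ))
    def data( self, (tag,start,stop,subtags), buffer ):
        # leave the block as raw text; convert it with the low-level
        # float extraction sketched above
        return buffer[start:stop]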

Testing this has pointed out a number of small memory leaks in a few 
features; I'll have to look at those when I get back to SimpleParse 
development.

BTW, Stephen, regarding speed:
    Keep in mind that a large part of the time spent in parsing is 
actually spent building the result tree (which isn't done if you're 
skipping text), and that the character-class and string-literal parsers 
are wickedly fast (they map directly to mxTextTools primitives).  Your 
grammar is running about three times faster than my VRML97 grammars 
(which run at about 200,000 cps on my 1GHz machine), but those build a 
full parse-tree for the entire file (I don't often try building 
parse-trees for files > 10 MB; it would probably be slower for those).  
I can't explain any other orders of magnitude ;), you'll have to ask 
Marc-André :).

from simpleparse.parser import Parser
from simpleparse import dispatchprocessor
from simpleparse.common import numbers, chartypes  # registers e.g. the "number" production used below
import time

definition = r"""
# '**' lines are comments; '*' lines start a command with optional
# parameters, followed by (possibly interleaved) raw data blocks
file             := (comment_line/command_line)+

<comment_line>   := '**', -[\n]*, '\n'?, ws
command_line     := '*', name, ws, parameter_list?, ws, EOL?,
                    (data/comment_line)*
## Note the simplified definition!!!
data             := [0-9,-eE. \t\n]+

>parameter_list< := ',', ws, (parameter, ','?, ws)+

parameter        := name, ws, '=', ws, (number/name), ws

name             := [a-zA-Z0-9-]+
<ws>             := [ \t]*
<EOL>            := '\n'
"""
p = Parser( definition, 'file' )
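
Rough usage sketch, continuing from the script above (the filename and 
the timing scaffolding are illustrative only, not the exact test I ran):

text = open( 'test_file.txt' ).read()   # hypothetical filename
start = time.clock()
success, children, next = p.parse( text )
print 'parsed %s of %s characters in %.2f seconds' % (
    next, len(text), time.clock() - start )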

Enjoy,
Mike

Stephen Simmons wrote:

>Marcus,
>
>Have you looked at Mike Fletcher's SimpleParse package
...

>I use it for parsing complicated text structures out of 60 MB RTF text files.
>The EBNF grammars are long and complicated, with deeply nested entity
>definitions (the grammar itself is ~200 lines long). Nevertheless,
>SimpleParse/mxTextTools takes around a minute to parse and tag, which is
>several orders of magnitude faster than I feel it really ought to take.
>
>Your data structure is much, much simpler, and your files are smaller. So if
>your code takes 32 seconds at the moment, maybe 5 seconds or less could be
>possible.
>
>Best of luck,
>
>Stephen
...

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/
