Can you do it faster? (parsing text file)
Mike C. Fletcher
mcfletch at rogers.com
Tue Jan 28 11:05:51 EST 2003
Since Stephen suggested it, here is a SimpleParse grammar that parses
out the various features in the file, though it leaves the array of data
values as a single block, with the idea that you'd want to use low-level
processing for that (string.translate, then string.split, then map
float, storing the result in Numeric Python arrays). (Parsing
individual floats would create 5 Python objects for each float _on top
of_ what you would need with the low-level manipulation you'll
eventually have to do.)
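
For illustration, the translate/split/map-float step described above might look roughly like the sketch below. It is written for current Python and uses the standard-library array module as a stand-in for Numeric arrays; the name parse_data_block and the choice of separators are assumptions for the example, not part of the original code.

```python
from array import array

# Translation table that turns every separator allowed in a data block
# (comma, tab, newline) into a plain space, so one split() call suffices.
_SEPARATORS = str.maketrans(",\t\n", "   ")


def parse_data_block(text):
    """Convert a raw data block such as '1.0, 2.5,\\n-3e2' to an array of doubles.

    Hypothetical helper: array('d') stands in for a Numeric array here.
    """
    tokens = text.translate(_SEPARATORS).split()
    return array("d", map(float, tokens))
```

The point of doing it this way is that only one float object per value is created, and the block is never walked character by character in Python code.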
Parsing the test file with this grammar takes ~2 seconds on a 1GHz
machine, incidentally. You'll still need to process the parse-tree, but
that's not particularly slow. Parsing the same file while extracting
individual floats during parsing takes approximately three minutes.
Testing this has pointed out a number of small memory leaks in a few
features; I will have to look at those when I get back to SimpleParse
development.
BTW, Stephen, regarding speed:
Keep in mind that a large part of the time spent in parsing is
actually spent building the result tree (which isn't done if you're
skipping text), and that the character-class and string-literal parsers
are wickedly fast (they map directly to mxTextTools primitives). Your
grammar is running about three times faster than my VRML97 grammars
(which run about 200,000cps on my 1GHz machine), but those build a full
parse-tree for the entire file (I don't often try building parse-trees
for files > 10 MB; it would probably be slower for those). I can't explain
any other orders of magnitude ;) you'll have to ask Marc-André :)
Enjoy,
Mike
from simpleparse.parser import Parser
from simpleparse import dispatchprocessor
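# Imported for their side effect: they make the common productions
# (such as "number", used below) available to the grammar.
from simpleparse.common import numbers, chartypes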
import time
definition = r"""
file := (comment_line/command_line)+
<comment_line> := '**', -[\n]*, '\n'?,ws
command_line := '*', name, ws, parameter_list?, ws, EOL?,
    (data/comment_line)*
## Note the simplified definition!!!
data := [-0-9,eE. \t\n]+
>parameter_list< := ',',ws, (parameter, ','?, ws)+
parameter := name, ws, '=', ws, (number/name), ws
name := [a-zA-Z0-9-]+
<ws> := [ \t]*
<EOL> := '\n'
"""
p = Parser( definition, 'file')
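
As a rough sketch of the parse-tree processing mentioned above: the result tree is a list of (tag, start, stop, children) tuples in the mxTextTools tag-list layout, which is what p.parse(text) should hand back as the middle element of its result. The tree below is hand-written for a made-up two-line input rather than produced by the parser, and iter_commands is a hypothetical helper name.

```python
def iter_commands(text, tree):
    """Yield (command_name, raw_data_blocks) for each command_line node.

    Hypothetical helper: assumes nodes are (tag, start, stop, children)
    tuples, with children being None or empty for leaf nodes.
    """
    for tag, start, stop, children in tree:
        if tag != "command_line":
            continue
        name = None
        blocks = []
        for child_tag, c_start, c_stop, _ in children or []:
            if child_tag == "name" and name is None:
                # The first name child is the command name itself.
                name = text[c_start:c_stop]
            elif child_tag == "data":
                # Keep the raw slice; convert to floats separately.
                blocks.append(text[c_start:c_stop])
        yield name, blocks


# Made-up sample input and a hand-built stand-in for the parser's output.
text = "*NODE, NSET=ALL\n1, 0.0, 1.0\n"
tree = [
    ("command_line", 0, 28, [
        ("name", 1, 5, None),
        ("parameter", 7, 15, [("name", 7, 11, None), ("name", 12, 15, None)]),
        ("data", 16, 28, None),
    ]),
]

commands = list(iter_commands(text, tree))
```

Each raw data slice can then be passed through the translate/split/map-float step, so the floats never go through the parser.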
Stephen Simmons wrote:
>Marcus,
>
>Have you looked at Mike Fletcher's SimpleParse package
>
>
...
>I use it for parsing complicated text structures out of 60Mb RTF text files.
>The EBNF grammars are long and complicated with deeply nested entity
>definitions (the grammar itself is ~200 lines long). Nevertheless
>SimpleParse/mxTextTools takes around a minute to parse and tag, which is
>several orders of magnitude faster than I feel it really ought to take.
>
>Your data structure is much, much simpler, and your files smaller. So if
>your code takes 32 seconds at the moment, maybe 5 seconds or less
>could be possible.
>
>Best of luck,
>
>Stephen
>
>
...
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/