Guru advice needed for mxTextTools

Pekka Niiranen krissepu at vip.fi
Mon Jun 3 16:07:04 EDT 2002


I am trying to optimize a function that searches for nested strings in a
set of (almost) flat files (about 2 MB each). If I use regular
expressions, I must fix the depth of nesting in advance:

----- code example ----
    import re

    def make_regular_expression(prefix, suffix):
        p, s = re.escape(prefix), re.escape(suffix)
        if prefix == suffix:
            # Identical delimiters cannot nest: shortest non-empty match.
            pattern = "%s.+?%s" % (p, s)
        else:
            # Run of characters that are neither delimiter. The delimiters
            # are not escaped here, so they must not be special inside [].
            repeat = "[^%s%s]+" % (prefix, suffix)
            # Outer pair containing zero or more inner pairs (two levels).
            pattern = "(%s%s(%s%s%s)*%s%s)" % (p, repeat, p, repeat, s,
                                               repeat, s)
        return pattern
---- code example ends ----

The code above supports two levels of nesting. If prefix is "?" and
suffix is "!", it evaluates to:

>>> pattern = re.compile("(\?[^?!]+(\?[^?!]+\!)*[^?!]+\!)")
>>> Line = "?AA?BB!CC!?DD!ee?EE!ff?FF?GG!HH!"
>>> print re.findall(pattern, Line)
[('?AA?BB!CC!', '?BB!'), ('?DD!', ''), ('?EE!', ''), ('?FF?GG!HH!', '?GG!')]

So far so good, but:

1)    The re module also returns empty matches, which I have to filter out:
            pars = filter(operator.truth,
                          reduce(operator.add, re.findall(pattern, Line)))

2)    The file is not flat: I also need to check the contents of the
       previous line. If the previous line does not contain the correct
       value, I do not have to run the regular expression on the
       current line:
            for i in range(1, len(lines), 2):
                test = lines[i-1].strip()
                if test == 'x' or test == 'y':
                    matches = re.findall(pattern, lines[i].strip())
                    if matches:
                        # Remove empty results with filter()
                        pars = filter(operator.truth,
                                      reduce(operator.add, matches))

3)    The depth of nesting may vary in the future.
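
Points 1) and 2) can be combined in one pass. The following is only a
minimal sketch, assuming the "?" / "!" pattern shown earlier and the
'x' / 'y' test values; the names collect_parameters, accepted and PATTERN
are made up for illustration and are not part of any library. It pairs
each odd-numbered line with the even-numbered line after it, so no line
index is needed, and it skips empty groups while flattening:

```python
import re

# Two-level pattern from the example above (prefix "?", suffix "!").
PATTERN = re.compile(r"(\?[^?!]+(\?[^?!]+\!)*[^?!]+\!)")

def collect_parameters(lines, accepted=("x", "y")):
    """Return the non-empty groups found on every even-numbered line
    whose preceding odd-numbered line is one of the accepted values."""
    pars = []
    # Pair line 1 with line 2, line 3 with line 4, and so on.
    # A trailing unpaired line is ignored, as in the range() loop.
    for test, data in zip(lines[0::2], lines[1::2]):
        if test.strip() not in accepted:
            continue  # previous line does not qualify: skip the regex
        for groups in PATTERN.findall(data.strip()):
            # findall() yields one tuple per match; drop empty groups.
            pars.extend(g for g in groups if g)
    return pars

lines = ["x", "?AA?BB!CC!?DD!", "z", "?EE!", "y", "ee?FF?GG!HH!"]
result = collect_parameters(lines)
# result == ['?AA?BB!CC!', '?BB!', '?DD!', '?FF?GG!HH!', '?GG!']
```

Unlike the loop in 2), this accumulates the results of all lines in one
list instead of overwriting pars on each line; whether that is wanted
depends on how pars is used afterwards.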

I have managed to speed up the search about 10x by using map() instead
of a for loop, and the current bottleneck is the regular expression.

I have considered using EBNF notation, which should be supported by
SimpleParse + mxTextTools.

My questions are:

1)    What is the mxTextTools tag table for the regular expression above,
       extended to support unlimited nesting? If the suffix is the same as
       the prefix, no nesting is assumed.
2)    Is it possible to parse the file without keeping track of the
       current line number? The values to be checked are always on odd
       line numbers and the regular expression is always run on even
       line numbers. If I could read two lines at a time and parse them
       together (as a single line) with mxTextTools (with lookAhead or
       whatever), could I gain some speed?
3)    Should I look for examples in XML tools instead, or write my own
       parser with C + SWIG?
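
To make question 1) concrete, the behaviour wanted from the tag table can
be described with a plain-Python reference sketch. This is not mxTextTools
code, and the name find_nested is hypothetical. It assumes single-character
delimiters with prefix different from suffix, and it uses a stack of open
positions so that the nesting depth is unlimited. It is not claimed to be
faster: a per-character Python loop may well be slower than the compiled
regular expression, so it only illustrates the matching logic.

```python
def find_nested(text, prefix="?", suffix="!"):
    """Return every balanced prefix...suffix span in text, at any depth.

    Spans are listed in the order their closing suffix is seen, so an
    inner span comes before the span that encloses it.
    """
    if len(prefix) != 1 or len(suffix) != 1 or prefix == suffix:
        # Assumption of this sketch; the prefix == suffix case
        # (no nesting) is not handled here.
        raise ValueError("need two different single-character delimiters")
    spans = []
    stack = []  # start offsets of prefixes that are not yet closed
    for pos, char in enumerate(text):
        if char == prefix:
            stack.append(pos)
        elif char == suffix and stack:
            start = stack.pop()
            spans.append(text[start:pos + 1])
    return spans

found = find_nested("?AA?BB!CC!?DD!ee?EE!ff?FF?GG!HH!")
# found == ['?BB!', '?AA?BB!CC!', '?DD!', '?EE!', '?GG!', '?FF?GG!HH!']
```

Note that this is slightly more permissive than the regular expression
above: it also accepts empty bodies such as "?!" and inner strings that
are separated by other text, and it silently ignores unbalanced
delimiters.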


-pekka-





