Speeding up a regular expression

Andrew Dalke dalke at dalkescientific.com
Tue Oct 23 14:53:46 EDT 2001


Michael Lerner:
> I have a text file with a bunch of lines of the form
>
> 1-1.1 2.2 -3.3  4.4     5.5 -6.6
>
>That is, an integer, followed by six floats, with an arbitrary number of
>spaces in between the numbers.

Might the float be of the form 1E+05?  Your regexp doesn't handle that.
What about '1.'?  '.9'?  '+9.8'?
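
For what it's worth, Python's built-in float() accepts all four of
those spellings, so code that converts with float() instead of
matching a hand-written pattern handles them without extra work.  A
quick check (the names 'candidates' and 'parsed' are only for
illustration):

```python
# Each of the spellings mentioned above is valid input to float().
candidates = ["1E+05", "1.", ".9", "+9.8"]
parsed = [float(text) for text in candidates]
```

Anything float() rejects raises ValueError, which is what the code
further down relies on.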

Often you can speed things up if you can use simple string
operations.  You can also speed things up if you know the input
is in the correct format, meaning you don't need to go through
the extra effort of verification.

In your other post you said the '-' of a number may be the only
thing to mark the beginning of that number.  (I.e., no spaces.)

One approach then is to replace all '-' characters with ' -',
which guarantees that all numbers are white-space separated.
Then do a string split and a conversion to float to see if you really
have numbers.  In other words, something like this would also work,
for a suitable definition of work.  (Assuming I didn't make any
typos.)

import string

def line_you_want(line):
  fields = string.split(string.replace(line, "-", " -"))
  if len(fields) != 7:
    return 0
  try:
    int(fields[0])
    map(float, fields[1:])
  except ValueError:
    # They weren't all floats
    return 0
  return 1

while 1:  # This becomes 'for line in infile:' in Python 2.2
  line = infile.readline()
  if not line:
    break
  if line_you_want(line):
    print repr(line)

This also has the advantage that you could change the function
to return the parsed data,

  try:
    return [int(fields[0])] + map(float, fields[1:])
  except ValueError:
    return None  # change the other 'return 0' to return None

and have the driver code use those results,

  results = line_you_want(line)
  if results is None:
    continue
  ...work with the results ...
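
Putting those pieces together, here is a minimal self-contained
sketch of the parsing variant, run against the sample line from the
original question.  The names 'parse_line' and 'sample_lines' are
made up for this example, and it uses string methods and a list
comprehension rather than the string module and map() so that it
behaves the same on newer Pythons:

```python
def parse_line(line):
    # Make every '-' start a new whitespace-separated field, then split.
    fields = line.replace("-", " -").split()
    if len(fields) != 7:
        return None
    try:
        # One integer followed by six floats.
        return [int(fields[0])] + [float(f) for f in fields[1:]]
    except ValueError:
        # At least one field was not a valid number.
        return None

sample_lines = [
    "1-1.1 2.2 -3.3  4.4     5.5 -6.6\n",  # the line from the question
    "not the line you want\n",             # wrong field count
]
results = [parse_line(line) for line in sample_lines]
```

Lines that don't fit come back as None, so the driver loop can skip
them with the 'is None' test shown above.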

Mind you, this approach fails if the floating point number can itself
contain a minus sign, as in '-1.2E-05'.  That could be fixed too, but
it's no worse than the code you have now.
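
One possible fix, sketched here as an assumption rather than as the
only way to do it, is to insert the space only before a '-' that is
not immediately preceded by 'e' or 'E'.  That needs the re module
again, though only for a single substitution, and I haven't timed
it.  The helper name 'split_fields' is made up for this example:

```python
import re

# Match a '-' that is not part of an exponent, i.e. one that is not
# immediately preceded by 'e' or 'E'.
_MINUS_NOT_IN_EXPONENT = re.compile(r"(?<![eE])-")

def split_fields(line):
    # Put a space in front of each sign, then split on whitespace.
    return _MINUS_NOT_IN_EXPONENT.sub(" -", line).split()

fields = split_fields("1-1.2E-05 2.2 -3.3")
```

The resulting fields can then go through the same int()/float()
checks as before.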

                    Andrew
                    dalke at dalkescientific.com






