need fast parser for comma/space delimited numbers
Gordon McMillan
gmcm at hypernet.com
Sat Mar 18 11:30:25 EST 2000
Les Schaffer wants speed:
> I have written an application for reading in large amounts of
> space/comma delimited numbers from ASCII text files for statistical
> processing.
...
> Still, the app takes about 5 minutes to parse a typical set of data
> files. I'd like to drop that down to a minute if possible.
We can speed up what you've got, but probably not that much!
...
> here's the core of what i have now. the lines of the ASCII data file
> are already in python as a list of strings (strLines passed to
> grabArrayData() ).
>
>
> def __parseIFF(self, str):
>
>     """Grab one int and the rest floats from string array
>     str. Return array with first element independent variable and
>     rest dependent variables"""
>
>     array = [string.atoi(str[0])]
>     for item in str[1:] :
>         array.append( string.atof(item) )
>     return array
First, use "def __parseIFF(self, str, atoi=string.atoi,
atof=string.atof):" and then access those as locals.
Second, benchmark against "int" and "float".
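A rough sketch of both suggestions, spelled for a current Python 3 (an assumption on my part: string.atoi and string.atof no longer exist there, so int and float stand in for them). The names parse_iff, parse_fff, _int and _float are made up for illustration, and the timings will vary by machine, so measure on your own data.

```python
import timeit

def parse_iff(fields, _int=int, _float=float):
    """One int followed by floats, from a list of numeric strings."""
    # _int and _float are bound once, when the def is executed, and are
    # then ordinary local variables inside the function body.
    result = [_int(fields[0])]
    append = result.append      # hoist the attribute lookup out of the loop
    for item in fields[1:]:
        append(_float(item))
    return result

def parse_fff(fields, _float=float):
    """All floats, from a list of numeric strings."""
    return list(map(_float, fields))

if __name__ == "__main__":
    sample = ["17", "0.25", "3.5e2", "-1.0"]
    for fn in (parse_iff, parse_fff):
        seconds = timeit.timeit(lambda: fn(sample), number=100000)
        print(fn.__name__, round(seconds, 3), "seconds")
```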
> def __parseFFF(self, str):
>
>     """Grab one set of floats from string array str. Return array with
>     first element independent variable and rest dependent
>     variables"""
>
>     return map( string.atof, str )
Same here.
> def __breakStringOnSpace(self, str):
>
>     """break one line str containing numbers in a string format on
>     whitespace (or comma), return array with the strings representing the
>     numbers."""
>
>     return filter(None, string.splitfields(str, self.splitStr) )
First, splitfields is obsolete, use "split". Second, special-case
the whitespace case, because that would just be "split(str)".
Third, use the same locals trick.
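Something along these lines, as a sketch in current Python 3 where split is a method on the string itself. The name break_string and the sep argument are hypothetical, standing in for __breakStringOnSpace and self.splitStr.

```python
def break_string(line, sep=None):
    """Split one line of numbers into a list of numeric strings."""
    if sep is None:
        # split() with no argument splits on runs of whitespace and never
        # returns empty strings, so no filter pass is needed here.
        return line.split()
    # With an explicit separator, "1,,2" yields an empty field, so the
    # empty strings are dropped, as filter(None, ...) did in the original.
    return [field for field in line.split(sep) if field]
```

The whitespace branch skips the filter entirely, which is where the saving comes from.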
> def setBreakExp(self, brk):
>
>     """set whether we are using commas instead of white-space for
>     splitting"""
>     self.splitStr = brk
>
> def grabArrayData( self, strLines ):
>
>     """ Feed grabArrayData an array of strings containing data,
>     strLines.
>
>     grabArrayData returns Numeric arrays for the independent and
>     dependent variables."""
>
>     (mPoints, mValues) = self.__createArrays(strLines)
>
>     # self.parse set to either __parseFFF or __parseIFF in
>     # __init__
>     parse = self.parse
>     breakString = self.__breakStringOnSpace
>
>     for i in range( 0, len(strLines) ):
>         num = parse( breakString( strLines[i] ) )
For the all-floats, all-whitespace case, this would just be
    num = map(float, split(strLines[i]))
and that might get you the speed you want.
For the comma case, you might try:
    s = join(split(strLines[i], ','), ' ')
    num = map(float, split(s))
or
    t = split(strLines[i], ',')
    t = map(strip, t)
    num = map(float, t)
with split, join and strip all being the string module functions, bound
as default args so they are looked up as locals.
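The two comma variants, sketched in current Python 3 (an assumption: split, join and strip are string methods there, and map returns an iterator, so list comprehensions are used instead). The function names are made up for illustration.

```python
def floats_via_join(line):
    """Replace commas with spaces, then split on whitespace."""
    spaced = " ".join(line.split(","))
    return [float(field) for field in spaced.split()]

def floats_via_strip(line):
    """Split on commas, strip each field, then convert."""
    return [float(field.strip()) for field in line.split(",")]

if __name__ == "__main__":
    line = "1, 2.5,3\n"
    print(floats_via_join(line))
    print(floats_via_strip(line))
```

The two agree on well-formed lines but not on sloppy ones: the join version quietly skips an empty field such as a trailing comma, while the strip version raises ValueError on it. Which behaviour is wanted depends on how much you trust the data files.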
>         mPoints[i] = num[0]
>         mValues[ 0:self.rows, i] = num[1:]
>
>     return mPoints, mValues
- Gordon