need fast parser for comma/space delimited numbers

Gordon McMillan gmcm at hypernet.com
Sat Mar 18 11:30:25 EST 2000


Les Schaffer wants speed:

> I have written an application for reading in large amounts of
> space/comma delimited numbers from ASCII text files for statistical
> processing. 
...
> Still, the app takes about 5 minutes to parse a typical set of data
> files. I'd like to drop that down to a minute if possible.

We can speed up what you've got, but probably not that much!
...
> here's the core of what i have now. the lines of the ASCII data file
> are already in python as a list of strings (strLines passed to
> grabArrayData() ).
> 
> 
>     def __parseIFF(self, str):
> 
> 	"""Grab one int and the rest floats from string array
> 	str. Return array with first element independent variable and
> 	rest dependent variables"""
> 
> 	array = [string.atoi(str[0])]
> 	for item in str[1:] :
> 	    array.append( string.atof(item)  )
> 	return  array

First, use "def __parseIFF(self, str, atoi=string.atoi, 
atof=string.atof):" and then access those as locals.

Second, benchmark against "int" and "float".
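A minimal sketch of both suggestions, assuming a present-day Python 3 interpreter, where the builtins int and float stand in for string.atoi and string.atof. The names parse_iff, parse_iff_plain, _int and _float are made up for illustration and are not from Les's code:

```python
import timeit


def parse_iff(fields, _int=int, _float=float):
    """Return the first field as an int and the rest as floats."""
    # _int and _float are bound once, when the def is executed, so the
    # loop body finds them as locals instead of doing a global/builtin
    # lookup on every iteration.
    row = [_int(fields[0])]
    for item in fields[1:]:
        row.append(_float(item))
    return row


def parse_iff_plain(fields):
    """Same result, looking up int and float the ordinary way."""
    return [int(fields[0])] + [float(item) for item in fields[1:]]


if __name__ == "__main__":
    # Hypothetical sample row; time both variants on your own data.
    sample = ["3", "1.5", "2.25", "-4e2"]
    for fn in (parse_iff, parse_iff_plain):
        elapsed = timeit.timeit(lambda: fn(sample), number=100000)
        print(fn.__name__, elapsed)
```

How much the default-argument binding buys you depends on the interpreter version and the data, so measure before committing to it.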
     
>     def __parseFFF(self, str):
> 
> 	"""Grab one set of floats from string array str. Return array with
> 	first element independent variable and rest dependent
> 	variables"""
> 		
> 	return map( string.atof, str )

Same here.
 
>     def __breakStringOnSpace(self, str):
> 
> 	"""break one line str containing numbers in a string format on
> 	whitespace (or comma), return array with the strings representing the
> 	numbers."""
> 
>         return filter(None, string.splitfields(str, self.splitStr)  )

First, splitfields is obsolete, use "split". Second, special case 
the whitespace case, because that would just be "split(str)". 
Third, use the same default-args-as-locals trick.
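Roughly what the special-casing could look like, again assuming Python 3 string methods rather than the string module. The function name break_line and the sep parameter are invented for this sketch (sep plays the role of self.splitStr):

```python
def break_line(line, sep=None):
    """Split one line of numbers into its string fields."""
    if sep is None:
        # split() with no separator splits on runs of whitespace and
        # never returns empty strings, so no filtering is needed.
        return line.split()
    # With an explicit separator, empty fields can appear ("1,,2"),
    # so drop them, as the original filter(None, ...) did.
    return [field for field in line.split(sep) if field]
```

For example, break_line("  1  2.5\t3\n") and break_line("1,,2.5,3", ",") both give ['1', '2.5', '3'].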
 
>     def setBreakExp(self, brk):
> 	
> 	"""set whether we are using commas instead of white-space for
> 	splitting"""
> 	self.splitStr = brk
>     
>     def grabArrayData( self, strLines ):
> 
> 	""" Feed grabArrayData an array of strings containing data,
> 	strLines.
> 	
> 	grabArrayData returns Numeric arrays for the independent and
> 	dependent variables."""
> 
> 	(mPoints, mValues) = self.__createArrays(strLines)
> 
> 	# self.parse is set to either __parseFFF or __parseIFF in
> 	# __init__
>         parse = self.parse
>         breakString = self.__breakStringOnSpace
> 
> 	for i  in range( 0, len(strLines) ):
> 	    num = parse( breakString( strLines[i] ) )

For the all floats, all whitespace case, this would just be
 num = map(float, split(strLines[i])) 
and that might get you the speed you want.

For the comma case, you might try:
  s = join(split(strLines[i], ','), ' ')
  num = map(float, split(s))
or
  t = split(strLines[i], ',')
  t = map(strip, t)
  num = map(float, t)

With split, join and strip all being functions from the string 
module, bound as default args so they get looked up as locals.
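The two comma variants, sketched as standalone functions under the assumption of Python 3 (string methods instead of string-module functions, and list() around map). The names floats_via_join and floats_via_strip are hypothetical:

```python
def floats_via_join(line):
    """Turn commas into spaces, then let whitespace split do the work."""
    return list(map(float, " ".join(line.split(",")).split()))


def floats_via_strip(line):
    """Split on commas, strip each field, then convert."""
    return list(map(float, map(str.strip, line.split(","))))
```

Both give the same answer on clean input such as "1, 2.5 ,3". They differ on empty fields: the join version silently skips them ("1,,2" gives [1.0, 2.0]), while the strip version raises ValueError, which may be what you want if a missing value means a bad file. In Python 3, float() already ignores surrounding whitespace, so the strip step may turn out to be unnecessary there.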
  

> 	    mPoints[i] = num[0]
>             mValues[ 0:self.rows, i] = num[1:]
> 	    
> 	return mPoints, mValues


- Gordon



