why huge speed difference btwn 1.52 and 2.1?

Tue Jun 5 14:34:20 EDT 2001

Hmmm, I don't see any big speed difference between 1.5.2 and 2.1. With the
script as posted, 2.1 has, if anything, a slight edge on my machine
(Pentium II 366, Linux).

Robin's original script (modified so it wouldn't stop reading at the first
blank line):
python 1.5.2: average 20 seconds
python 2.1:  average 19.5 seconds

This can be improved a lot, regardless of the version of python.

Duncan Booth wrote:
> As far as I can tell your code reads a line in and then looks to see
> whether the line contains a word that ends in a state name or a state
> abbreviation. So if the line "Today waz blowy, tomorrow may be better"
> is in the input it will be copied to the output files for Arizona and
> Wyoming. Is this correct?

Not exactly. As I read it, Robin's pattern requires a space before the state
name or abbreviation and a space or comma after it, and the match can be
anywhere in the line.
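To make that reading concrete, here is a minimal sketch of the pattern shape as I understand it. It is written for a current Python 3 rather than the 1.5.2/2.1 discussed in this thread, lists only two states, and the helper name find_state is made up for illustration.

```python
import re

# Assumed pattern shape: a space, then a state name or abbreviation,
# then a space or a comma. Only two states are listed for brevity.
pattern = re.compile(r" (ARIZONA|AZ|WYOMING|WY)[ ,]")

def find_state(line):
    """Return the matched state token, or None if the line has no match."""
    match = pattern.search(line.upper())
    return match.group(1) if match else None

# "waz" and "blowy" end in AZ and WY, but neither has a space in front
# of the abbreviation, so this line does not match.
no_match = find_state("Today waz blowy, tomorrow may be better")

# A properly delimited abbreviation matches anywhere in the line.
mid_line = find_state("Shipped to Phoenix AZ, arriving Friday")
```

So Duncan's example line would be skipped, while an address-style line with " AZ," in the middle would be picked up.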

> I would be tempted to rewrite the code, either to not use regular
> expressions at all, or to use a single regular expression for everything.

Me too. And if you do decide to use regexps, you only need to build and
compile the pattern once, before the while loop.
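A rough sketch of what "build it once" looks like, again in Python 3 syntax and with an illustrative subset of states. The names STATE_TOKENS, STATE_REGEX and matching_lines are my own, not from Robin's script.

```python
import re

STATE_TOKENS = ["ALABAMA", "AL", "WYOMING", "WY"]  # illustrative subset

# Built and compiled once, before any input is read.
STATE_REGEX = re.compile(" (" + "|".join(STATE_TOKENS) + ")[ ,]")

def matching_lines(lines):
    """Yield (state, line) pairs, reusing the precompiled regex for every line."""
    for line in lines:
        match = STATE_REGEX.search(line)
        if match:
            yield match.group(1), line

sample = ["MOBILE AL 36601", "NO STATE HERE", "CHEYENNE WYOMING, USA"]
found = list(matching_lines(sample))
```

The loop body only calls search() on an object that already exists; nothing about the pattern is rebuilt per line.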

> If you build one big regular expression that matches all states and state
> abbreviations, then you can extract the match out of the line and use what
> matched as a dictionary key to find the right filename (provided you first
> build a dictionary with both state names and abbreviations as keys mapping
> to the filenames).

Yeah! I thought of building the filename out of the match, but didn't think of
pre-building filenames for all possible matches. It would save some work on
lines that match the regexp. That could be a big win if you match a *lot* of
lines.
In general, you want to do as little work inside the loop as possible.
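Here is a small sketch of Duncan's dictionary idea, assuming Python 3 and a three-state subset. The helper output_name is hypothetical; the point is that every possible match is mapped to its output name before the loop starts.

```python
import re

# Full names mapped to abbreviations (illustrative subset).
states = {"ALABAMA": "AL", "ALASKA": "AK", "WYOMING": "WY"}

# Every possible match maps straight to its output name, so the loop
# does one dictionary lookup per matching line and nothing else.
filenames = {}
for name, abbr in states.items():
    target = name.replace(" ", "_")  # done once here, not once per line
    filenames[name] = target
    filenames[abbr] = target

# The same keys double as the alternatives in the regexp.
regex = re.compile(" (" + "|".join(filenames) + ")[ ,]")

def output_name(line):
    """Return the prebuilt output name for the first match, or None."""
    match = regex.search(line)
    return filenames[match.group(1)] if match else None
```

With this layout, output_name("JUNEAU AK 99801") and output_name("JUNEAU ALASKA, USA") both resolve to the same name without any string building inside the loop.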

> Oh, and you upper cased the input, so you don't need a case insensitive
> search.

Yep! One or the other. If you need the line uppercased for some reason, skip the
IGNORECASE; if not, try it both ways and see what's faster.
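If you want to try it both ways, something like the following would do. It is a sketch in Python 3 using the standard timeit module; the sample lines and repeat counts are arbitrary, and the numbers will depend entirely on your machine and Python version, so I'm not quoting any.

```python
import re
import timeit

PATTERN = " (ARIZONA|AZ|WYOMING|WY)[ ,]"
case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

# Made-up sample input, repeated to give the timer something to measure.
lines = ["Tucson az, 85701", "no state on this line", "Casper Wyoming 82601"] * 1000

def with_upper():
    """Uppercase each line, then search with the case-sensitive regex."""
    return [bool(case_sensitive.search(line.upper())) for line in lines]

def with_ignorecase():
    """Search the untouched line with the case-insensitive regex."""
    return [bool(case_insensitive.search(line)) for line in lines]

# Both approaches should agree on which lines match (for ASCII input).
same_answers = with_upper() == with_ignorecase()

upper_seconds = timeit.timeit(with_upper, number=20)
ignorecase_seconds = timeit.timeit(with_ignorecase, number=20)
print("upper():    %.3f s" % upper_seconds)
print("IGNORECASE: %.3f s" % ignorecase_seconds)
```

Whichever wins, remember that the uppercased version also changes what gets written to the output files, so the choice is not purely about speed.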

Here are the timing results for my version. I found a file with a bunch of
addresses, copied it to each of the input filenames the script expects, then
ran the script a few times under each version of Python.

My implementation of Duncan's suggestions:
python 1.5.2: average 5.7 seconds
python 2.1: average 5.1 seconds

What was the problem again? :)

And here's my version of the script. It does everything Duncan suggested except
avoiding regexps altogether. (You could do that by building a list of strings to
match, then looping through that list once for each line of the input file.)
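For what it's worth, the no-regexp route might look roughly like this. It is an untimed sketch in Python 3 with a made-up subset of states; needles and find_state are names I invented for the example.

```python
tokens = ["ALABAMA", "AL", "WYOMING", "WY"]  # illustrative subset

# Each token must be preceded by a space and followed by a space or a
# comma, mirroring the regexp. Build every delimited form once, up front.
needles = [(" " + token + end, token) for token in tokens for end in (" ", ",")]

def find_state(line):
    """Return the first listed token found in the upper-cased line, or None."""
    line = line.upper()
    for needle, token in needles:
        if needle in line:
            return token
    return None
```

One difference to keep in mind: the regexp reports the leftmost match in the line, while this loop reports whichever token comes first in the list, so the two can disagree on lines that mention more than one state.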

-------------------------------------

import re
import string

states = {
'ALABAMA':'AL',
'ALASKA':'AK',
'ARIZONA':'AZ',
'WISCONSIN':'WI',
'WYOMING':'WY'
}

filenames = {
'AL': 'ALABAMA',
'ALABAMA': 'ALABAMA',
'AK':'ALASKA',
'ALASKA':'ALASKA',
'AZ':'ARIZONA',
'ARIZONA':'ARIZONA',
'WI':'WISCONSIN',
'WISCONSIN':'WISCONSIN',
'WY':'WYOMING',
'WYOMING':'WYOMING'
}

# (STATENAME) OR (STATE ABBREVIATION)
# only needs to be built once
statepattern = " ("
for name, abbr in states.items():
    statepattern = statepattern + name + '|' + abbr + '|'
# take off the last "|"
statepattern = statepattern[:-1]
# close the state group; then match space or comma
statepattern = statepattern +")[ ,]"
## print `statepattern`

stateregex = re.compile(statepattern)

for year in range(1994, 1998):

    f = open('states/USA' + str(year) + '.TXT')
    counter = 1

    while 1:
        # progress report: year and line number
        print str(year), counter
        counter = counter + 1

        line = f.readline()

        # check for EOF
        if not line: break

        # Strip the line and convert it to allcaps (db outputs in
        # allcaps). I do this AFTER the EOF test because stripping
        # first would make the loop break on blank lines, not just EOF.
        # Is that what you wanted?
        # It could be faster to skip the upper() and make a
        # case-insensitive regexp... or not. Try that too.
        line = string.upper(string.strip(line))
        result = stateregex.search(line)
        if result:
            state = result.group(1)
            # get the filename
            filename = filenames[state]
            filename = string.replace(filename, ' ', '_')
            g = open('states/' + filename + '/' + str(year) + '.TXT', "a")
            g.write(line + "\n")
            g.close()
    f.close()