why huge speed difference btwn 1.52 and 2.1?
Paul Winkler
paul at calendargalaxy.com
Tue Jun 5 14:34:20 EDT 2001
Hmmm, I don't see any big speed diff between 1.5 and 2.1 - with the script as
posted, if anything 2.1 has a slight edge on my machine (Pentium II 366, linux).
Robin's original script (modified so it wouldn't close input on blank lines):
python 1.5.2: average 20 seconds
python 2.1: average 19.5 seconds
This can be improved a lot, regardless of the version of python.
Duncan Booth wrote:
> As far as I can tell your code reads a line in and then looks to see
> whether the line contains a word that ends in a state name or a state
> abbreviation. So if the line "Today waz blowy, tomorrow may be better"
> is in the input it will be copied to the output files for Arizona and
> Wyoming. Is this correct?
Not exactly. As I read it, Robin's pattern has a space before and a space or
comma after the state name or abbreviation, and it can match anywhere in the line.
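To check that reading against Duncan's example line, here is a minimal sketch in modern Python 3 (not the 1.5.2/2.1 syntax used elsewhere in this thread). The two-state pattern and the helper name find_state are my own, for illustration only; the shape (space, name or abbreviation, then space or comma) is the same as in the script at the end of this post.

```python
import re

# Reduced two-state pattern with the same shape as the full one:
# a leading space, a state name or abbreviation, then a space or comma.
pattern = re.compile(r" (ARIZONA|AZ|WYOMING|WY)[ ,]")

def find_state(line):
    """Return the matched state name/abbreviation, or None if nothing matches."""
    m = pattern.search(line.upper())
    return m.group(1) if m else None

print(find_state("Today waz blowy, tomorrow may be better"))
print(find_state("123 Main St, Phoenix, AZ 85001"))
print(find_state("Cheyenne, Wyoming, USA"))
```

With this pattern "waz" and "blowy," do not match, because neither AZ nor WY is preceded by a space there. Note also that a state at the very end of a stripped line (for example "Tucson AZ") is not matched either, since nothing follows it.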
> I would be tempted to rewrite the code, either to not use regular
> expressions at all, or to use a single regular expression for everything.
Me too. And if you do decide to use regexps, you only need to build it once,
before the while loop.
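A rough sketch of the difference, in modern Python 3 rather than the 1.5.2/2.1 of this thread. The names STATES, build_pattern, count_rebuilding and count_prebuilt are made up for the example, and the three-state table is just a stand-in for the full one.

```python
import re

# Small stand-in for the full state table (hypothetical subset).
STATES = {"ALABAMA": "AL", "ALASKA": "AK", "ARIZONA": "AZ"}

def build_pattern(states):
    """Build ' (NAME|ABBR|...)[ ,]' from a name -> abbreviation mapping."""
    parts = []
    for name, abbr in states.items():
        parts.extend([name, abbr])
    return " (" + "|".join(parts) + ")[ ,]"

def count_rebuilding(lines):
    """What to avoid: the pattern is rebuilt and recompiled for every line."""
    hits = 0
    for line in lines:
        regex = re.compile(build_pattern(STATES))
        if regex.search(line):
            hits += 1
    return hits

def count_prebuilt(lines):
    """Preferred: build and compile once, before the loop starts."""
    regex = re.compile(build_pattern(STATES))
    hits = 0
    for line in lines:
        if regex.search(line):
            hits += 1
    return hits
```

Both functions return the same count; only the amount of work per line differs. Current versions of the re module cache recently compiled patterns, so today most of the wasted time in the first version is the string building, but it is still work the loop does not need to do.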
> If you build one big regular expression that matches all states and state
> abbreviations, then you can extract the match out of the line and use what
> matched as a dictionary key to find the right filename (provided you first
> build a dictionary with both state names and abbreviations as keys mapping
> to the filenames).
Yeah! I thought of building the filename out of the match, but didn't think of
pre-building filenames for all possible matches. It would save some work on
lines that match the regexp. That could be a big win if you match a *lot* of
lines.
In general, you want to do as little work inside the loop as possible.
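Duncan's dictionary idea could look something like this. It is a sketch in modern Python 3, with a made-up three-state table and a hypothetical helper output_name; generating the lookup table from the name/abbreviation mapping is my own shortcut and not something anyone in the thread proposed.

```python
import re

# Hypothetical subset of the state table: full name -> abbreviation.
states = {"ALABAMA": "AL", "NEW YORK": "NY", "WYOMING": "WY"}

# Map every string the regex can match to its output file stem,
# so the loop body needs only one dictionary lookup per matching line.
filenames = {}
for name, abbr in states.items():
    stem = name.replace(" ", "_")   # spaces in names become underscores
    filenames[name] = stem
    filenames[abbr] = stem

# One regex for everything, built once from the same keys.
state_regex = re.compile(" (" + "|".join(map(re.escape, filenames)) + ")[ ,]")

def output_name(line):
    """Return the file stem for the first state found in line, else None."""
    m = state_regex.search(line)
    if m is None:
        return None
    return filenames[m.group(1)]
```

Doing the underscore replacement while building the table, rather than once per matching line, is one more small thing moved out of the loop.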
> Oh, and you upper cased the input, so you don't need a case insensitive
> search.
Yep! One or the other. If you need the line uppercased for some reason, skip the
IGNORECASE; if not, try it both ways and see what's faster.
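If you want to try it both ways, something like the following would do. This is a sketch in modern Python 3 using the standard timeit module; the two-state pattern and the names find_upper, find_ignorecase and time_both are invented for the example. I make no claim about which variant wins, since that depends on your data and Python version.

```python
import re
import timeit

ALTERNATIVES = "ARIZONA|AZ|WYOMING|WY"   # hypothetical two-state subset
CASE_SENSITIVE = re.compile(" (" + ALTERNATIVES + ")[ ,]")
CASE_INSENSITIVE = re.compile(" (" + ALTERNATIVES + ")[ ,]", re.IGNORECASE)

def find_upper(line):
    """Uppercase the line first, then run a case-sensitive search."""
    m = CASE_SENSITIVE.search(line.upper())
    return m.group(1) if m else None

def find_ignorecase(line):
    """Leave the line alone and let the regex ignore case."""
    m = CASE_INSENSITIVE.search(line)
    return m.group(1).upper() if m else None

def time_both(lines, number=1000):
    """Return (seconds for upper(), seconds for IGNORECASE) over the same lines."""
    t_upper = timeit.timeit(lambda: [find_upper(l) for l in lines], number=number)
    t_ignore = timeit.timeit(lambda: [find_ignorecase(l) for l in lines], number=number)
    return t_upper, t_ignore
```

Call time_both() with a list of lines from your real input and compare the two numbers; a handful of made-up lines will not tell you much.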
Here are my timing results for my version - I found a file with a bunch of addresses
and copied it to the necessary input filenames, then ran it a few times under
each version of python.
My implementation of Duncan's suggestions:
python 1.5.2: average 5.7 seconds
python 2.1: average 5.1 seconds
What was the problem again? :)
And here's my version of the script - it does everything Duncan suggested except
avoiding regexps altogether. (You could do that by building a list of strings to
match, then looping through that list once for each line of the input file.)
-------------------------------------
import re
import string
states = {
'ALABAMA':'AL',
'ALASKA':'AK',
'ARIZONA':'AZ',
'WISCONSIN':'WI',
'WYOMING':'WY'
}
filenames = {
'AL': 'ALABAMA',
'ALABAMA': 'ALABAMA',
'AK':'ALASKA',
'ALASKA':'ALASKA',
'AZ':'ARIZONA',
'ARIZONA':'ARIZONA',
'WI':'WISCONSIN',
'WISCONSIN':'WISCONSIN',
'WY':'WYOMING',
'WYOMING':'WYOMING'
}
#(STATENAME) OR (STATE ABBREVIATION)
# only needs to be built once
statepattern = " ("
for name, abbr in states.items():
    statepattern = statepattern + name + '|' + abbr + '|'
# take off the last "|"
statepattern = statepattern[:-1]
# close the state group; then match space or comma
statepattern = statepattern +")[ ,]"
## print `statepattern`
stateregex = re.compile(statepattern)
for year in range(1994, 1998):
    f = open('states/USA'+str(year)+'.TXT')
    counter = 1
    while(1):
        print str(year), counter
        counter = counter + 1
        #convert city name to allcaps (db outputs in allcaps)
        line = f.readline()
        #check for EOF
        if not line: break
        # Strip the line. I do this AFTER the above test
        # because it would break on blank lines,
        # not just EOF.
        # Is that what you wanted?
        line = string.upper(string.strip(line))
        # it could be faster to skip this and make a
        # case-insensitive regexp... or not. try that too.
        result = stateregex.search(line)
        if result:
            state = result.group(1)
            # get the filename
            filename = filenames[state]
            filename = string.replace(filename, ' ', '_')
            g = open('states/'+filename+'/'+str(year)+'.TXT', "a")
            g.write(line + "\n")
            g.close()
    f.close()
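For completeness, the no-regexp variant mentioned above might look roughly like this. It is a sketch in modern Python 3, not something I timed; the three-state table and the names needles and find_filename are invented for the example. One difference to be aware of: the regexp returns the leftmost match in the line, while this loop returns the first needle in list order that occurs anywhere, so lines mentioning two states can come out differently.

```python
# Hypothetical subset of the state table: full name -> abbreviation.
states = {"ALABAMA": "AL", "ARIZONA": "AZ", "WYOMING": "WY"}

# Every literal string that should count as a hit, paired with the
# output file stem: " NAME ", " NAME,", " ABBR ", " ABBR," for each state.
needles = []
for name, abbr in states.items():
    stem = name.replace(" ", "_")
    for key in (name, abbr):
        for tail in (" ", ","):
            needles.append((" " + key + tail, stem))

def find_filename(line):
    """Return the file stem of the first needle found in line, else None."""
    for needle, stem in needles:
        if needle in line:
            return stem
    return None
```

Whether plain substring tests beat one compiled regexp depends on how many states are in the table and how long the lines are, so time it on real input before switching.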