optimization
Nathan Clegg
nathan at islanddata.com
Mon Jul 5 21:12:59 EDT 1999
I'm writing an application in Python that receives data in a specified
format, but I can't decide which parsing method would be the most efficient. The
data format is simple. The data is a set of (key,value) pairs. One-line
values are of the format:
Key: one line value
Multiline values appear like this:
.KEY
multiline value
multiline value
multiline value
multiline value
.KEY
where each value line is escaped if a period actually falls into the first
column. I want to read these pairs into a dict. The process for the
first case seems simple:
    (key, value) = string.split(line, ':', 1)
    dict[key] = value
But I'm not sure which approach I should take for the multiline case,
because each has possible repercussions. Some of my initial thoughts, minus the nitpicky stuff
(like escaping periods and chomping lines):
1)
    temp = ''
    for line in file.readlines():
        if line[0] == '.':
            if temp:
                temp = ''              # closing .KEY line ends the block
            else:
                temp = line[1:]        # opening .KEY line starts a block
                dict[temp] = ''
        else:
            if temp:
                dict[temp] = dict[temp] + '\n' + line
            else:
                (key, value) = string.split(line, ':', 1)
                dict[key] = value
This approach is unappealing for two reasons. First, I don't like the
idea of keeping track of too much state (temp). I know it's not much, but
it just seems unnecessary. Second, and more importantly, the multiline
values are likely to be far larger than the single-line values are
numerous, so I would rather optimize for the multiline case, and I know
a = a + b is not the way to do that, since each concatenation copies the
whole string built so far.
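One way to sidestep the repeated concatenation is to collect the lines of a block in a list and join them once when the closing marker arrives. The following is a minimal sketch in current Python, not the poster's code: the function name `parse_pairs` is hypothetical, string methods stand in for the `string` module calls used in the post, and unescaping of leading periods is left out because the post does not specify the escape scheme.

```python
def parse_pairs(lines):
    """Parse 'Key: value' lines and .KEY-delimited blocks into a dict."""
    result = {}
    block_key = None   # name of the multiline block currently open, if any
    block = []         # lines collected so far for that block

    for line in lines:
        line = line.rstrip('\n')
        if block_key is not None:
            if line == '.' + block_key:
                # Closing marker: build the value with a single join.
                result[block_key] = '\n'.join(block)
                block_key, block = None, []
            else:
                block.append(line)
        elif line.startswith('.'):
            # Opening marker: remember the key and start collecting.
            block_key = line[1:]
        elif line:
            # One-line value; split only on the first colon.
            key, value = line.split(':', 1)
            result[key] = value.strip()
    return result
```

This still carries one piece of state (the open key), but the cost of building each value is a single join over its lines rather than one copy per line.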
2)
    lines = file.readlines()
    while lines:
        line = lines[0]; del lines[0]
        if line[0] == '.':
            index = lines.index(line)      # position of the closing .KEY line
            dict[line[1:]] = string.join(lines[:index], '\n')
            del lines[:index+1]
        else:
            (key, value) = string.split(line, ':', 1)
            dict[key] = value
This approach seems to handle the worst case better, by joining rather
than handling lines one by one. However, I would much rather use a for
loop in this case than a while. I'm also worried about all the slicing
going on, though individual lines are likely to be 80 characters or less.
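The second approach can also be written with a for loop and no list slicing or deletion by sharing one iterator between an outer and an inner loop. This is a sketch under assumptions, not the poster's code: `parse_with_iterator` is a hypothetical name, it is written for current Python, escape handling is again omitted, and an unterminated block is simply stored with whatever lines were read.

```python
def parse_with_iterator(lines):
    """Parse the same format using a shared iterator instead of slicing."""
    result = {}
    it = iter(lines)
    for line in it:
        line = line.rstrip('\n')
        if line.startswith('.'):
            block = []
            # The inner loop pulls from the same iterator, so the outer
            # loop resumes just after the closing marker.
            for inner in it:
                inner = inner.rstrip('\n')
                if inner == line:
                    break
                block.append(inner)
            result[line[1:]] = '\n'.join(block)
        elif line:
            key, value = line.split(':', 1)
            result[key] = value.strip()
    return result
```

Because nothing is deleted from the front of a list, each line is visited exactly once, and the only copying is the final join per block.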
3)
As yet uncoded: reading the file into a single string rather than line by
line, and pulling out the multiline values with a regular expression
search (something like \.(\w+)\n.*\.\1\n), but that sounds like it would
suffer too much from regex and string buffer copies.
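For comparison, the regular-expression idea might look like the sketch below. It adapts the pattern from the post by anchoring both markers to the start of a line and making the body non-greedy, so that two blocks with the same key are not merged. The names `BLOCK_RE` and `parse_with_regex` are hypothetical, keys are assumed to match `\w+` as in the posted pattern, and whether this is faster or slower than the line-based versions would have to be measured on real data.

```python
import re

# ^\.KEY newline, lazily matched body, then ^\.KEY newline with the same key.
BLOCK_RE = re.compile(r'^\.(\w+)\n(.*?)^\.\1\n', re.MULTILINE | re.DOTALL)

def parse_with_regex(text):
    """Parse the whole input as one string, extracting blocks by regex."""
    result = {}

    def store_block(match):
        # The body is empty or ends with a newline; drop only that newline.
        result[match.group(1)] = match.group(2)[:-1]
        return ''   # remove the block from the text

    remainder = BLOCK_RE.sub(store_block, text)
    # Whatever is left should be one-line 'Key: value' pairs.
    for line in remainder.splitlines():
        if line:
            key, value = line.split(':', 1)
            result[key] = value.strip()
    return result
```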
Any advice or new methods?
Please forgive typos and syntax issues...the above really is pseudocode.
----------------------------------
Nathan Clegg
nathan at islanddata.com