looking for speed-up ideas
Andrew Dalke
adalke at mindspring.com
Thu Feb 6 05:18:42 EST 2003
Ram Bhamidipaty wrote:
> It would be impressive if there were a _pure_ python script that could
> deliver the performance of the grep + sort + tail command pipe line.
Ahh, well, if you want pure performance and just want the filename,
then there are other ways that might make it faster:
  - try a memory mapped file instead of reading.  (Some OSes
    won't like a mmap'ed file >> main memory size.)
  - this could be combined with a regexp of the form
        r"\nF/(\d+)/([^\n]+)"
    and a findall (a rough sketch follows this list).
    (yeah, it loads all the filenames into memory.  But I'm
    trying to get performance here, and you've only a few
    million names --> less than a couple hundred MBs.)
    (sort, btw, can use intermediate files as extra memory)
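Here's a rough, untested sketch of that mmap + findall idea.  It assumes
a Unix-ish Python whose re module can scan an mmap'ed buffer directly,
and "filelist.txt" is just a stand-in name for your listing file:

import mmap, os, re

f = open("filelist.txt", "rb")      # stand-in name for the listing file
m = mmap.mmap(f.fileno(), os.path.getsize("filelist.txt"),
              access=mmap.ACCESS_READ)

pat = re.compile(r"\nF/(\d+)/([^\n]+)")
# findall pulls every (size, name) pair into memory in one go
# (the leading \n means the very first line is only seen if the file
#  starts with a newline)
pairs = [(-long(size), name) for size, name in pat.findall(m)]
pairs.sort()                        # negated sizes: largest files first
biggest = pairs[:200]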
To try the regexp approach even if mmap doesn't work, you
could also do
import re

# "infile" is the already-opened listing file; sizes get negated so
# an ascending sort puts the largest files first.
pat = re.compile(r"\nF/(\d+)/([^\n]+)")
data = []
min_size = 0
while 1:
    # grab a big chunk, then read on to the end of the partial last line
    s = infile.read(25*1024*1024) + infile.readline()
    if not s:
        break
    for size, name in pat.findall(s):
        try:
            size = -int(size)
        except ValueError:
            size = -long(size)
        if -size > min_size:
            data.append( (size, name) )
            if len(data) > 1000:
                data.sort()
                del data[200:]
                # smallest size still in the running for the top 200
                min_size = -data[-1][0]
# a final sort/trim leaves the 200 largest
data.sort()
del data[200:]
(Untested!)
The idea is that re parsing might be faster than a split and
tuple assignment.
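For comparison, the per-line split-and-tuple-assignment style that the
regexp is meant to beat would look something like this (again just a
sketch; "filelist.txt" stands in for your listing file):

data = []
for line in open("filelist.txt"):   # stand-in name for the listing file
    if line[:2] != "F/":
        continue
    # assumes every line ends with a newline
    tag, size, name = line[:-1].split("/", 2)
    data.append( (-long(size), name) )
data.sort()
biggest = data[:200]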
Andrew
dalke at dalkescientific.com