extending file?

Tue Nov 2 08:16:23 EST 2004

Chris Cioffi wrote:

>Hello all,
>
>Are there any docs or examples of extending the file type?  I work
>with EDI messages that are very like text files, just with a few
>quirks. ;-)  and I was wondering if I could stretch and twist the built
>in file type to make things a bit faster and more full featured.
>
>Specifically I would need to alter the iterator and ideally the line terminator.
>
>Chris
>  
>
I can't answer your question directly, but I hope someone else chimes in, 
because I would like to know the answer as well :-)  I noticed the recipe 
you posted the other day in response to Greg Lindstrom regarding EDI 
Tools.  If that recipe is what you are using to parse your EDI messages, 
you could change a couple of things in it rather than modifying "file" 
and still get a large performance boost (although it would be really cool 
to get readline to take a terminating character as an argument and 
therefore return an EDI segment).  Here is a simple, stupid script that 
takes a filename and a number as arguments and then reads the file 
<number> bytes at a time:

#!/usr/bin/env python

import sys
import os
import time

filename = sys.argv[1]
numbytes = int(sys.argv[2])
filesize = os.stat(filename).st_size

print("reading file %s (file is %s bytes long) %s bytes at a time" %
      (filename, filesize, numbytes))

# Time how long it takes to read the whole file numbytes at a time.
start_time = time.time()
infile = open(filename, 'r')
ch = "a"
while ch != "":  # read() returns "" at end of file
    ch = infile.read(numbytes)
end_time = time.time()
infile.close()
print("read file in %s seconds" % (end_time - start_time))

Here are some results:

[jmjones@qatestrunner tmp]$ ./readit.py mono-list.mbox 1000
reading file mono-list.mbox (file is 88108265 bytes long) 1000 bytes at a time
read file in 1.12803697586 seconds
[jmjones@qatestrunner tmp]$ ./readit.py mono-list.mbox 100
reading file mono-list.mbox (file is 88108265 bytes long) 100 bytes at a time
read file in 5.3266479969 seconds
[jmjones@qatestrunner tmp]$ ./readit.py mono-list.mbox 10
reading file mono-list.mbox (file is 88108265 bytes long) 10 bytes at a time
read file in 48.9884879589 seconds
[jmjones@qatestrunner tmp]$ ./readit.py mono-list.mbox 1
reading file mono-list.mbox (file is 88108265 bytes long) 1 bytes at a time
read file in 470.724639893 seconds
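
Dividing each timing above by the number of read() calls gives a rough 
per-call cost.  The following is only a back-of-the-envelope sketch in 
modern Python; read_calls is a name made up for this illustration, and 
it assumes the loop makes one extra, empty read() at end of file:

```python
# Timings copied from the runs above: bytes per read() -> elapsed seconds.
FILESIZE = 88108265
TIMINGS = {
    1000: 1.12803697586,
    100: 5.3266479969,
    10: 48.9884879589,
    1: 470.724639893,
}


def read_calls(filesize, numbytes):
    """Number of read() calls the loop makes, counting the final empty read."""
    full_and_partial_reads = -(-filesize // numbytes)  # ceiling division
    return full_and_partial_reads + 1


for numbytes in sorted(TIMINGS, reverse=True):
    calls = read_calls(FILESIZE, numbytes)
    per_call_us = TIMINGS[numbytes] / calls * 1e6
    print("%5d bytes/read: %9d calls, %.2f microseconds per call"
          % (numbytes, calls, per_call_us))
```

On these numbers the cost per call stays within roughly 5 to 13 
microseconds no matter how few bytes each call returns, so the total 
time is driven mostly by how many calls are made.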

I wrote an EDI parser at my last job and found that performance kinda 
sucked.  I ran it through a profiler, and while I can't remember exactly 
where the bottleneck was, the results above show the huge penalty for 
calling file.read() tens of millions of times versus tens of thousands 
of times.  Every function call in Python carries a small fixed cost (I 
can't remember exactly where I read that), and it adds up.  So I made 
the loop over the EDI file iterate fewer times by reading in a larger 
chunk of the file (500 bytes or so) and checking whether the character I 
wanted was in it.  I wound up with a kludgy finite state machine, but I 
got a huge performance boost.  Consequently, I'm thinking of writing a 
new EDI parser from scratch using the finite state machine class (found 
in this chapter: http://gnosis.cx/TPiP/chap4.txt: look for 
"statemachine.py") in David Mertz's excellent book "Text Processing in 
Python" (http://gnosis.cx/TPiP/).  If I do write it, I'll open source it.
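
The chunked approach described above can be sketched roughly as follows.  
This is not the parser from that job, just a minimal illustration in 
modern Python: iter_segments is a hypothetical name, the "~" terminator 
and the sample data are assumptions (terminators vary between EDI 
dialects), and the chunk size is arbitrary:

```python
import io


def iter_segments(fileobj, terminator="~", chunk_size=4096):
    """Yield terminator-delimited segments, reading chunk_size characters per read()."""
    pending = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        parts = pending.split(terminator)
        # The last piece may be an incomplete segment; keep it for the next chunk.
        pending = parts.pop()
        for segment in parts:
            yield segment
    # Anything left after the last terminator (ignoring pure whitespace).
    if pending.strip():
        yield pending


# Made-up sample data, read 8 characters at a time to exercise chunk boundaries.
sample = io.StringIO("ISA*00*SENDER~GS*PO*1~ST*850*0001~SE*1*0001~")
for seg in iter_segments(sample, terminator="~", chunk_size=8):
    print(seg)
```

Each read() call here can return many segments at once, so the number of 
calls depends on the chunk size rather than on the number of characters 
or segments in the file.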

The performance tips I found were probably either at:
http://manatee.mojam.com/~skip/python/fastpython.html
by Skip Montanaro or here:
http://www.python.org/doc/essays/list2str.html
by Guido.

But if you get "file" modified with a custom readline(), you will likely 
have overcome this problem anyway.  I hope someone can answer that 
question for you, though I fear that writing it in Python may not buy 
you as much performance as either of us would like.
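
For what a custom readline() might look like without touching the 
built-in type, here is a hedged sketch of a wrapper around an already 
open file object.  SegmentFile is a hypothetical class, not anything in 
the standard library, and the "~" default terminator is an assumption.  
Like the normal readline(), it keeps the terminator on the returned 
string and returns "" at end of file:

```python
import io


class SegmentFile:
    """Hypothetical wrapper adding a terminator-aware readline() to a file object."""

    def __init__(self, fileobj, terminator="~", chunk_size=4096):
        self._file = fileobj
        self._terminator = terminator
        self._chunk_size = chunk_size
        self._buffer = ""

    def readline(self):
        """Return the next segment including its terminator, or "" at end of file."""
        while self._terminator not in self._buffer:
            chunk = self._file.read(self._chunk_size)
            if not chunk:
                # End of file: hand back whatever is left (possibly "").
                rest, self._buffer = self._buffer, ""
                return rest
            self._buffer += chunk
        segment, _, self._buffer = self._buffer.partition(self._terminator)
        return segment + self._terminator

    def __iter__(self):
        # Call readline() repeatedly until it returns the "" sentinel.
        return iter(self.readline, "")


# Made-up sample data with a small chunk size to exercise the buffering.
for segment in SegmentFile(io.StringIO("ST*850*0001~SE*1*0001~"), chunk_size=5):
    print(segment)
```

The buffering still happens in Python, so this changes the interface 
more than the speed.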

Jeremy Jones