[Tutor] Re: processing large file

Andrei project5 at redrival.net
Fri Sep 12 01:10:04 EDT 2003


Hello Guillaume,

ALLEON Guillaume wrote:

> I have to read and process a large ASCII file containing a mesh : a list 
> of points and triangles.
> The file is 100 MBytes.

I just tried creating a million of your points. Python now occupies 213 MB of 
memory, but that's no problem. How many do you have? Btw, creating them also 
takes a bit of time (Athlon 2000+/512 MB/WinXP).

> I first tried to do it in memory but I think I am running out of memory 

You think or you know? Do you get an error?

> therefore I decide to use the shelve
> module to store my points and elements on disks.

I've never used the shelve module, so I can't comment on that.

> Despite the fact it is slow ... Any hint ? I think I have the same 
> memory problem but I don't understand why
> since  my aPoint should be removed by the gc.
> 
> Have you any idea ?

I can't really read the code because the indentation is impossible to follow. If 
this is not your mail client's fault, get a proper editor for Python and use 4 
spaces per indentation level (proper editors convert the Tab key to 4 spaces).

Generally speaking, you should divide your code into smaller functions and 
profile them. That shows you exactly where the time goes. To profile, add this 
at the bottom of your module:

import sys

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1].lower() == "profile":
        import profile
        profile.run('main()')
    else:
        main()

This of course assumes that your main function is called main() and that you 
want to pass "profile" as a command-line parameter. You can also just always 
call profile.run('main()') while debugging, without the parameter. Profiling is 
the only way of knowing where the speed problems are. I've put some comments in 
your code, but I'm not saying that they are the cause of the speed issue; 
they're more like general remarks. I have also assumed you have a modern Python 
version.
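
For anyone reading this with a current Python 3, here is a rough sketch of the 
same idea using cProfile and pstats (the faster stand-ins for the profile 
module). The functions main() and parse_numbers() are made-up placeholders, not 
your code, and profile_report() is just a helper name I picked.

```python
import cProfile
import io
import pstats


def parse_numbers(n):
    # Placeholder for the real work: convert n numeric strings to floats.
    return [float(str(i)) for i in range(n)]


def main():
    return sum(parse_numbers(10000))


def profile_report(func, limit=10):
    """Run func under the profiler and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(main))
```

The smaller your functions are, the more useful the report gets, because each 
one shows up on its own line with its own timings.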

> import string

You shouldn't use the string module. Use string methods and builtin functions 
instead.

> import os, sys, time, resource, shelve, psyco
> 
> psyco.full()
> 
> class point:
>  def __init__(self,x,y,z):
>    self.x = x
>    self.y = y
>    self.z = z
>   
> def SFMImport(filename):
>  print 'UNV Import ("%s")' % filename
> 
>  db = shelve.open('points.db')
> 
>  file = open(filename, "r")
> 
>  linenumber = 1
>  nbpoints   = 0
>  nbfaces    = 0

linenumber and nbfaces don't seem to be used

>  pointList = []

neither is pointList

>  faceList  = []
> 
>  line  = file.readline()
>  words = string.split(line)

shortcut: words = file.readline().split()

>  nbpoints = string.atoi(words[1])
>  nbtrias  = string.atoi(words[0])

The atoi function is deprecated. The int() function seems to be about 30% faster 
and does the same thing:

 >>> import timeit as t
 >>> a = t.Timer("import string; a=string.atoi; [a('345') for i in range(1000000)]")
 >>> a.timeit(1)
3.2120071126358241
 >>> b = t.Timer("import string; [int('345') for i in range(1000000)]")
 >>> b.timeit(1)
2.1893349573758627

I did more tests, but all results were comparable. But then again, we don't know 
whether this is the bottleneck because there's no profile.

>  print "found %s points and %s triangles" % (nbpoints, nbtrias)

I wouldn't print this much info.

>  t1 = time.time()
>  for i in range(nbpoints):
>    line  = file.readline()
>    words = string.split(line)

Shortcut above.

>    x = string.atof(words[1].replace("D","E"))
>    y = string.atof(words[2].replace("D","E"))
>    z = string.atof(words[3].replace("D","E"))

float() should be used instead of atof(). I don't know what the words are, but 
perhaps you can find a way of not doing all that replacing?
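
One possible way to cut down on the replacing, as a sketch only: do a single 
replace on the whole line before splitting it. This assumes that no other field 
on a point line can contain a "D", which I can't check without seeing the file. 
The function name and the sample line in the usage note are invented for 
illustration.

```python
def parse_point_line(line):
    """Parse one point line whose floats may use a 'D' exponent marker."""
    # One replace per line instead of one per coordinate.
    words = line.replace("D", "E").split()
    # words[0] is skipped, as in the original code; 1..3 are x, y, z.
    return float(words[1]), float(words[2]), float(words[3])
```

Calling parse_point_line("1 1.5D+00 -2.0D-01 3.0D+02") gives the three 
coordinates as floats.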

>    aPoint = point(x, y, z)
> 
>    as = "point%s" % i
> 
>    if (i%250000 == 0):
>      print "%7d points <%s>" % (i, time.time() - t1)
>      t1 = time.time()
> 
>    db[as] = aPoint
> 
>  print "%s points read in %s seconds" % (nbpoints, time.time() - t1)
>  bd.close()

What's bd? The shelf was opened as db, so this should be db.close().
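
Putting the remarks above together, the point-reading part could look roughly 
like this on a current Python 3. It is a sketch under assumptions: the header 
line holds the triangle count and then the point count (as in your code), each 
point line is "index x y z", and I store plain (x, y, z) tuples instead of 
point instances to keep the example short. The name import_points and both 
path parameters are mine.

```python
import shelve


def import_points(mesh_path, db_path):
    """Read the header and point lines, storing each point in a shelf."""
    # Both with-blocks close their resource on exit, so no close() call
    # can be forgotten or misspelled.
    with open(mesh_path) as mesh, shelve.open(db_path) as db:
        words = mesh.readline().split()
        nbtrias, nbpoints = int(words[0]), int(words[1])
        for i in range(nbpoints):
            words = mesh.readline().replace("D", "E").split()
            db["point%s" % i] = (
                float(words[1]),
                float(words[2]),
                float(words[3]),
            )
    return nbpoints, nbtrias
```

This only tidies the code; whether it is any faster is something only a 
profile can tell you.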

<snip>


-- 
Yours,

Andrei

=====
Mail address in header catches spam. Real contact info (decode with rot13):
cebwrpg5 at bcrenznvy.pbz. Fcnz-serr! Cyrnfr qb abg hfr va choyvp cbfgf. V ernq gur 
yvfg, fb gurer'f ab arrq gb PP.




