Anyone happen to have optimization hints for this loop?

bruno.desthuilliers at gmail.com bruno.desthuilliers at gmail.com
Wed Jul 9 16:22:15 EDT 2008


On Jul 9, 18:04, dp_pearce <dp_pea... at hotmail.com> wrote:
> I have some code that takes data from an Access database and processes
> it into text files for another application. At the moment, I am using
> a number of loops that are pretty slow. I am not a hugely experienced
> python user so I would like to know if I am doing anything
> particularly wrong or that can be hugely improved through the use of
> another method.
>
> Currently, all of the values that are to be written to file are pulled
> from the database and into a list called "domainVa". These values
> represent 3D data and need to be written to text files using line
> breaks to separate 'layers'. I am currently looping through the list
> and appending a string, which I then write to file. This list can
> regularly contain upwards of half a million values...
>
> count = 0
> dmntString = ""
> for z in range(0, Z):
>     for y in range(0, Y):
>         for x in range(0, X):
>             fraction = domainVa[count]
>             dmntString += "  "
>             dmntString += fraction
>             count = count + 1
>         dmntString += "\n"
>     dmntString += "\n"
> dmntString += "\n***\n"
>
> dmntFile     = open(dmntFilename, 'wt')
> dmntFile.write(dmntString)
> dmntFile.close()
>
> I have found that it is currently taking ~3 seconds to build the
> string but ~1 second to write the string to file, which seems wrong (I
> would normally guess the CPU/memory would outperform disc writing
> speeds).

Not necessarily - when the dataset becomes too big and your process
has eaten all available RAM, your OS starts swapping, and that's
worse than plain disk IO. IOW, for large datasets (for a definition
of 'large' depending on the available resources), you're sometimes
better off writing directly to disk - which, BTW, is usually
buffered by the OS anyway.
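
FWIW, a rough (untested) sketch of what writing as you go would look
like, reusing the X, Y, Z, domainVa and dmntFilename names from your
snippet:

# Write as you go and let the file object / OS do the buffering,
# instead of building one huge string in memory first.
dmntFile = open(dmntFilename, 'wt')
count = 0
for z in range(Z):
    for y in range(Y):
        for x in range(X):
            dmntFile.write("  ")
            dmntFile.write(domainVa[count])
            count += 1
        dmntFile.write("\n")
    dmntFile.write("\n")
dmntFile.write("\n***\n")
dmntFile.close()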

> Can anyone see a way of speeding this loop up? Perhaps by changing the
> data format?

Almost everyone told you to use a list and str.join()... which used
to be sound advice wrt both performance and readability, but nowadays
"only" makes your code more readable (and more pythonic...):

bruno at bibi ~ $ python
Python 2.5.1 (r251:54863, Apr  6 2008, 17:20:35)
[GCC 4.1.2 (Gentoo 4.1.2 p1.0.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import Timer
>>> def dostr():
...     s = ''
...     for i in xrange(10000):
...         s += ' ' + str(i)
...
>>> def dolist():
...     s = []
...     for i in xrange(10000):
...         s.append(str(i))
...     s = ' '.join(s)
...
>>> tstr = Timer("dostr", "from __main__ import dostr")
>>> tlist = Timer("dolist", "from __main__ import dolist")
>>> tlist.timeit(10000000)
1.4280490875244141
>>> tstr.timeit(10000000)
1.4347598552703857
>>>

The list + str.join version is only marginally faster... But you
should consider this solution even if it doesn't change performance
much - readability counts, too.
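
If you want to apply it to your own loop, it would look something
like this (untested sketch, again reusing the names from your
snippet):

# Collect the pieces in a list and join them once at the end,
# rather than growing a string in the inner loop.
parts = []
count = 0
for z in range(Z):
    for y in range(Y):
        for x in range(X):
            parts.append("  ")
            parts.append(domainVa[count])
            count += 1
        parts.append("\n")
    parts.append("\n")
parts.append("\n***\n")

dmntFile = open(dmntFilename, 'wt')
dmntFile.write("".join(parts))
dmntFile.close()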

> Is it wrong to append to a string and write once, or should I
> hold a file open and write at each instance?

Is it really a matter of one XOR the other? Perhaps you should try a
midway solution, ie building not-too-big chunks as lists and writing
them to the (already opened) file. This would avoid possible swapping
and reduce disk IO. I suggest you try this approach with different
list-size / write ratios, using the timeit module (and possibly the
"top" program on unix, or its equivalent on another platform, to
check memory/CPU usage) to find out which ratio works best for a
representative sample of your input data. That's at least what I'd
do...
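
As a starting point, here's a possible (untested) sketch of that
midway approach, writing one z-layer at a time and once more reusing
the names from your snippet:

# Build one z-layer at a time as a list, write it out, then start
# over.  The chunk size (here: one layer) is the knob to tune.
dmntFile = open(dmntFilename, 'wt')
count = 0
for z in range(Z):
    chunk = []
    for y in range(Y):
        for x in range(X):
            chunk.append("  ")
            chunk.append(domainVa[count])
            count += 1
        chunk.append("\n")
    chunk.append("\n")
    dmntFile.write("".join(chunk))
dmntFile.write("\n***\n")
dmntFile.close()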

HTH


