Is there no compression support for large sized strings in Python?

Claudio Grondi claudio.grondi at freenet.de
Thu Dec 1 10:08:13 EST 2005


"Fredrik Lundh" <fredrik at pythonware.com> wrote in message
news:mailman.1444.1133442090.18701.python-list at python.org...
> Claudio Grondi wrote:
>
> > What started as a simple test of whether it is better to load uncompressed
> > data directly from the hard disk or to load compressed data and uncompress
> > it (Windows XP SP 2, Pentium4 3.0 GHz system with 3 GByte RAM) seems to
> > show that none of the compression libraries available in Python really
> > works for large (i.e. 500 MByte) strings.
> >
> > Test the provided code and see for yourself.
> >
> > At least on my system:
> >  zlib fails to decompress raising a memory error
> >  pylzma fails to decompress running endlessly consuming 99% of CPU time
> >  bz2 fails to compress running endlessly consuming 99% of CPU time
> >
> > The same works with a 10 MByte string without any problem.
> >
> > So what? Is there no compression support for large sized strings in Python?
>
> you're probably measuring windows' memory management rather than the
> compression libraries themselves (Python delegates all memory allocations
> larger than 256 bytes to the system).
>
> I suggest using incremental (streaming) processing instead; from what I can
> tell, all three libraries support that.
>
> </F>

I have solved the problem with bz2 compression the way Fredrik suggested:

import bz2

# strSize500MB is the 500 MByte string from the original test code
chunkSize = 1048576  # compress 1 MByte at a time
objBZ2Compressor = bz2.BZ2Compressor()
lstCompressBz2 = []
for indx in range(0, len(strSize500MB), chunkSize):
  # a slice that runs past the end of the string is simply cut short,
  # so no explicit upper bound check is needed
  lstCompressBz2.append(objBZ2Compressor.compress(strSize500MB[indx:indx+chunkSize]))
#:for
lstCompressBz2.append(objBZ2Compressor.flush())  # emit the data still buffered
strSize500MBCompressed = ''.join(lstCompressBz2)
fObj = open(r'd:\strSize500MBCompressed.bz2', 'wb')
fObj.write(strSize500MBCompressed)
fObj.close()

:-)
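
A variation that might reduce memory use further: instead of collecting the
compressed pieces in a list and joining them, each piece can be written to the
file as soon as it is produced. This is only a minimal sketch; the function
name compress_to_file and its parameters are made up for illustration, and it
assumes a current Python 3 where the data to compress is a bytes object.

```python
import bz2

def compress_to_file(data, path, chunk_size=1048576):
    """Compress the bytes object `data` into `path`, chunk_size bytes at a time.

    Hypothetical helper for illustration; not part of the bz2 module.
    """
    compressor = bz2.BZ2Compressor()
    with open(path, 'wb') as f_out:
        for start in range(0, len(data), chunk_size):
            # compress() may return an empty bytes object while the
            # compressor is still buffering input; writing it is harmless
            f_out.write(compressor.compress(data[start:start + chunk_size]))
        # flush() returns whatever the compressor still holds internally
        f_out.write(compressor.flush())
```

With this approach only the input and one compressed piece at a time are held
in memory, not a second full copy of the compressed result.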

So I suppose that the decompression problems can also be solved that way,
but:

This still doesn't answer the question of what the core of the problem was,
how to avoid it, and what memory request limits should be considered when
working with large strings.
Is it actually the case that on systems other than Windows 2000/XP there is
no problem with the original code I provided?
Maybe that is a good reason to go for Linux instead of Windows? Do e.g. Suse
or Mandriva Linux also have a limit on the memory a single Python process
can use?
Please let me know about your experience.
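
For reference, the decompression side of the incremental approach could be
sketched roughly as follows. I have not run this against the 500 MByte file;
it is a minimal sketch, assuming a current Python 3 and a file containing a
single bz2 stream, and the name decompress_file_in_chunks is made up for
illustration.

```python
import bz2

def decompress_file_in_chunks(path, chunk_size=1048576):
    """Yield decompressed pieces of a .bz2 file, reading chunk_size bytes at a time.

    Hypothetical helper for illustration; assumes a single bz2 stream.
    """
    decompressor = bz2.BZ2Decompressor()
    with open(path, 'rb') as f_in:
        while not decompressor.eof:
            block = f_in.read(chunk_size)
            if not block:
                break  # end of file reached
            piece = decompressor.decompress(block)
            if piece:
                yield piece
```

Each yielded piece can be processed or written out right away, so the whole
decompressed string never has to exist in memory at once. Note that a piece
can be considerably larger than chunk_size when the data compresses well.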

Claudio




