Is there no compression support for large sized strings in Python?

Claudio Grondi claudio.grondi at freenet.de
Thu Dec 1 08:45:20 EST 2005


What started as a simple test if it is better to load uncompressed data
directly from the harddisk or
load compressed data and uncompress it (Windows XP SP 2, Pentium4  3.0 GHz
system with 3 GByte RAM)
seems to show that none of the in Python available compression libraries
really works for large sized
(i.e. 500 MByte) strings.

Test the provided code and see yourself.

At least on my system:
  zlib fails to decompress raising a memory error
  pylzma fails to decompress running endlessly consuming 99% of CPU time
  bz2 fails to compress running endlessly consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?
Am I doing something the wrong way here?
Is there any and if yes, what is the theoretical upper limit of string size
which can be processed by each of the compression libraries?

The only limit I know about is 2 GByte for the python.exe process itself,
but this seems not to be the actual problem in this case.
There are also some other strange effects when trying to create large
strings using following code:
m = 'm'*1048576
# str1024MB = 1024*m # fails with memory error, but:
str512MB_01 = 512*m # works ok
# str512MB_02 = 512*m # fails with memory error, but:
str256MB_01 = 256*m # works ok
str256MB_02 = 256*m # works ok
etc. . etc. and so on
down to allocation of each single MB in separate string to push python.exe
to the experienced upper limit
of memory reported by Windows task manager available to python.exe of
2.065.352 KByte.
Is the question why did the str1024MB = 1024*m instruction fail,
when the memory is apparently there and the target size of 1 GByte can be
achieved
out of the scope of this discussion thread, or is this the same problem
causing
the compression libraries to fail? Why is no memory error raised then?

Any hints towards understanding what is going on and why and/or towards a
workaround are welcome.

Claudio

============================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py

strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

print
print '  Created files: '
print '    %s \n    %s \n    %s \n    %s' %(
   r'c:\strSize500MB.dat'
  ,r'c:\strSize500MBCompressed.zlib'
  ,r'c:\strSize500MBCompressed.pylzma'
  ,r'c:\strSize500MBCompressed.bz2'
)

raw_input(' EXIT with Enter /> ')

============================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time

startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print '  loading uncompressed data from file: %7.3f
seconds'%(time.clock()-startTime,)

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f
seconds'%(time.clock()-startTime,)
import zlib
try:
  startTime = time.clock()
  strSize500MB = zlib.decompress(strSize500MBCompressed)
  print 'decompressing zlib data:           %7.3f
seconds'%(time.clock()-startTime,)
except:
  print 'decompressing zlib data FAILED'


startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f
seconds'%(time.clock()-startTime,)
import pylzma
try:
  startTime = time.clock()
  strSize500MB = pylzma.decompress(strSize500MBCompressed)
  print 'decompressing pylzma data:         %7.3f
seconds'%(time.clock()-startTime,)
except:
  print 'decompressing pylzma data FAILED'


startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f
seconds'%(time.clock()-startTime,)
import bz2
try:
  startTime = time.clock()
  strSize500MB = bz2.decompress(strSize500MBCompressed)
  print 'decompressing bz2 data:            %7.3f
seconds'%(time.clock()-startTime,)
except:
  print 'decompressing bz2 data FAILED'

raw_input(' EXIT with Enter /> ')





More information about the Python-list mailing list