[issue7551] SystemError/MemoryError/OverflowErrors on encode() a unicode string

Mon Dec 21 10:24:34 CET 2009

Marc-Andre Lemburg <mal at egenix.com> added the comment:

All string length calculations in Python 2.4 are done using C ints,
which are 32 bits wide even on 64-bit platforms.

Since UTF-8 can use up to 4 bytes per Unicode code point, the encoder
overallocates the output buffer to len*4 bytes. That size goes
straight over the 2GB limit the 32-bit int imposes as soon as you
try to encode a Unicode string of 512M code points.
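
To make the arithmetic concrete, here is a rough sketch in plain
Python. It only models the size calculation described above; it is
not the actual encoder code. The helper names (utf8_worst_case_bytes,
overflows_c_int, wrap_to_int32) are made up for illustration, and
INT_MAX assumes a signed 32-bit C int.

```python
# Illustrative model only: these helpers are hypothetical and do not
# exist in CPython. They mimic the len*4 size calculation.

INT_MAX = 2**31 - 1  # largest value of a signed 32-bit C int (assumed)


def utf8_worst_case_bytes(num_code_points):
    """Worst-case UTF-8 buffer size: 4 bytes per code point."""
    return num_code_points * 4


def overflows_c_int(num_code_points):
    """True if the worst-case buffer size does not fit in a 32-bit int."""
    return utf8_worst_case_bytes(num_code_points) > INT_MAX


def wrap_to_int32(value):
    """Show what a value looks like after wrapping to a signed 32-bit int."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


if __name__ == "__main__":
    n = 512 * 2**20  # 512M code points
    print(utf8_worst_case_bytes(n))                 # 2147483648, i.e. 2**31
    print(overflows_c_int(n))                       # True
    print(wrap_to_int32(utf8_worst_case_bytes(n)))  # -2147483648
    print(INT_MAX // 4)                             # 536870911, largest safe length
```

Under these assumptions, 512M code points times 4 is exactly 2**31,
one more than a signed 32-bit int can hold, so the computed size is
no longer a valid positive length.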

The reason for using ints to represent string length is simple:
no one really expected that someone would work with 2GB strings
in memory at the time the string API was designed (large hard
drives had around 2GB at that time) - strings of such size are
simply not supported by Python 2.4.

BTW: I wouldn't really count on Python 2.4 working properly on
64-bit platforms. A lot of issues were fixed in Python 2.5
related to 32/64-bit differences.

----------
nosy: +lemburg
title: SystemError/MemoryError/OverflowErrors on encode() a unicode string -> SystemError/MemoryError/OverflowErrors on encode() a	unicode string

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7551>
_______________________________________