Chardet, file, ... and the Flexible String Representation

Tue Sep 10 11:36:33 EDT 2013

On Mon, Sep 9, 2013, at 10:28, wxjmfauth at gmail.com wrote:
*time performance differences*
> 
> Comment: Such differences never happen with utf.

Why is this bad? Keeping in mind that otherwise they would all be almost
as slow as the UCS-4 case.

> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('\U0001d11e')
> 44
> 
> Comment: 18 bytes more than latin-1
> 
> Comment: With utf, a char (in string or not) never exceed 4 

A string is an object and needs to store the length, along with any
overhead relating to object headers. I believe there is also an appended
null character. Also, ASCII strings are stored differently from Latin-1
strings.

>>> sys.getsizeof('a'*999)
1048 = 49 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\xa4'*999)
1072 = 74 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\u20ac'*999)
2072 = 76 bytes overhead, 2 bytes per character.
>>> sys.getsizeof('\U0001d11e'*999)
4072 = 80 bytes overhead, 4 bytes per character.

(I bet sys.getsizeof('\xa4') will return 38 on your system, so 44 is
only six bytes more, not 18)

If we did not have the FSR, everything would be 4 bytes per character.
We might have less overhead, but a string only has to be 25 characters
long before the savings from the shorter representation outweigh even
having _no_ overhead, and every four bytes of overhead reduces that
number by one. And you have a 32-bit python build, which has less
overhead than mine - in yours, strings only have to be seven characters
long for the FSR to be worth it. Assume the minimum possible overhead is
two words for the object header, a size, and a pointer - i.e. sixteen
bytes, compared to the 25 you've demonstrated for ASCII, and strings
only need to be _two_ characters long for the FSR to be a better deal
than always using UCS4 strings.

The need for four-byte-per-character strings would not go away by
eliminating the FSR, so you're basically saying that everything should
be constrained to the worst-case performance scenario.