A few questions about encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jun 23 12:30:40 EDT 2013


On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:

> utf-8: how many bytes to hold an "a" in memory? one byte.
> 
> flexible string representation: how many bytes to hold an "a" in memory?
> One byte? No, two. (Funny, it consumes more memory to hold an ascii char
> than ascii itself)

Incorrect. Python strings carry a fixed per-object overhead, so the fair 
measurement is the difference that adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> import sys
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2 (wide build):
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4


How about a French é character? Of course, ASCII cannot store it *at 
all*, but let's see what Python can do:


# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1


# And Python 3.2 (wide build):
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4
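If you want to repeat the measurement for other characters, here is a 
minimal sketch. It assumes CPython 3.3 or later, since sys.getsizeof 
reports implementation details, and the helper name bytes_per_char is 
mine, made up for illustration:

```python
import sys

def bytes_per_char(ch, n=100):
    """Extra bytes used when one more copy of ch is added to a string.

    Subtracting two sizes cancels out the fixed per-object overhead.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# One sample from each range the flexible representation distinguishes.
samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 'é'", "é"),
    ("BMP '€'", "€"),
    ("astral U+1F600", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3 and later I would expect this to report 1, 1, 2 and 4 
bytes respectively.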



> utf-8: In a series of bytes implementing the encoded code points
> supposed to hold a string, picking a byte and finding to which encoded
> code point it belongs is no problem.

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump 
directly to character 1000 in a block of text; you have to scan from the 
start, checking whether each character takes 1, 2, 3 or 4 bytes.
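To make the cost concrete, here is a rough sketch of what finding the 
N-th character in a UTF-8 buffer involves. The function name 
utf8_offset_of_char is hypothetical, and the code assumes the input is 
valid UTF-8:

```python
def utf8_offset_of_char(data, index):
    """Return the byte offset at which character number `index` starts.

    Assumes `data` is valid UTF-8. The loop visits every byte before the
    target, so the cost grows with the position of the character.
    """
    count = -1
    for offset, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes; any other
        # byte begins a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                return offset
    raise IndexError("character index out of range")

text = "aé€b"                      # characters of 1, 2, 3 and 1 bytes
data = text.encode("utf-8")
print([utf8_offset_of_char(data, i) for i in range(len(text))])
```

The offsets come out as 0, 1, 3 and 6: there is no way to compute them 
from the index alone without looking at the preceding bytes.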


> flexible string representation: In a series of bytes implementing the
> encoded code points supposed to hold a string, picking a byte and
> finding to which encoded code point it belongs is ... impossible !

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one 
character. If it is a 2-byte string, then it is like a Python 3.2 narrow 
build without surrogate pairs, and each two bytes is a character. If it is 
a 4-byte string, then it is just like a Python 3.2 wide build, and each 
four bytes is a character. Within a single string, the number of bytes per 
character is fixed, and random access is easy and fast.
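Here is a toy model of the idea in pure Python. It is only a sketch of 
the principle, not CPython's actual C implementation; the names pack, 
char_at and char_index_of_byte are invented for illustration, and the 
little-endian byte order is an arbitrary choice:

```python
def pack(text):
    """Store text using one fixed width (1, 2 or 4 bytes) per character."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        width = 1
    elif largest < 0x10000:
        width = 2
    else:
        width = 4
    buf = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, buf

def char_at(width, buf, index):
    """Fetch character number `index` with plain arithmetic, no scanning."""
    start = index * width
    if not 0 <= start < len(buf):
        raise IndexError("character index out of range")
    return chr(int.from_bytes(buf[start:start + width], "little"))

def char_index_of_byte(width, byte_offset):
    """Given any byte offset, say which character that byte belongs to."""
    return byte_offset // width

for s in ["abc", "café", "a€b", "a\U0001F600b"]:
    width, buf = pack(s)
    print(width, [char_at(width, buf, i) for i in range(len(s))])
```

Both directions, character index to byte offset and byte offset to 
character index, are a single multiplication or division by the width.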



-- 
Steven


