A few questions about encoding
wxjmfauth at gmail.com
Tue Jun 25 16:16:33 EDT 2013
On Sunday, 23 June 2013 18:30:40 UTC+2, Steven D'Aprano wrote:
> On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:
>
> > utf-8: how many bytes to hold an "a" in memory? one byte.
> >
> > flexible string representation: how many bytes to hold an "a" in memory?
> > One byte? No, two. (Funny, it consumes more memory to hold an ascii char
> > than ascii itself)
>
> Incorrect. Python strings have overhead because they are objects, so
> let's see the difference adding a single character makes:
>
> # Python 3.3, with the hated flexible string representation:
> py> import sys
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 1
>
> # Python 3.2 (wide build):
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 4
>
> How about a French é character? Of course, ASCII cannot store it *at
> all*, but let's see what Python can do:
>
> # The hated Python 3.3 again:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 1
>
> # And Python 3.2:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 4
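
For anyone who wants to repeat the measurement quoted above, here is a minimal sketch. It assumes CPython 3.3 or later. The helper name bytes_per_char and the two extra sample characters (the euro sign and an emoji) are illustrative additions, not from the thread, and sys.getsizeof figures are an implementation detail that other interpreters need not share.

```python
import sys


def bytes_per_char(ch, n=100):
    """Marginal cost, in bytes, of one more copy of `ch` in a string of n copies.

    Subtracting two sizes cancels the fixed per-object overhead, leaving
    only the storage used by a single additional character.
    """
    return sys.getsizeof(ch * n) - sys.getsizeof(ch * (n - 1))


# Hypothetical sample set: ASCII, Latin-1, BMP, and a non-BMP code point.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On CPython with the flexible string representation this should report 1 for 'a' and 'é', 2 for the euro sign, and 4 for the emoji, because the width is chosen from the largest code point in the string.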
>
> > utf-8: In a series of bytes implementing the encoded code points
> > supposed to hold a string, picking a byte and finding to which encoded
> > code point it belongs is no problem.
>
> Incorrect. UTF-8 is unsuitable for random access, since it has
> variable-width characters, anything from 1 to 4 bytes. So you cannot just
> jump directly to character 1000 in a block of text, you have to inspect
> each byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.
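
The scan described above can be sketched as follows. This is only an illustration under stated assumptions: the function name utf8_char_offset and the sample text are made up for the example, and the input is assumed to be valid UTF-8.

```python
def utf8_char_offset(data, index):
    """Return the byte offset at which character number `index` starts.

    The only way to find it is to walk the bytes from the beginning,
    counting every byte that starts a character.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the bit pattern 10xxxxxx; any other
        # byte is the first byte of an encoded code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


# Hypothetical sample: characters taking 1, 2, 3, 4 and 1 bytes in UTF-8.
text = "a\u00e9\u20ac\U0001F600z"
data = text.encode("utf-8")

offset = utf8_char_offset(data, 4)
print(offset, data[offset:offset + 1])
```

The cost grows with the index: reaching character 1000 means examining every byte that precedes it.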
>
> > flexible string representation: In a series of bytes implementing the
> > encoded code points supposed to hold a string, picking a byte and
> > finding to which encoded code point it belongs is ... impossible !
>
> Incorrect. It is absolutely trivial. Each string is marked as either
> 1-byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
> character. If it is a 2-byte string, then it is just like Python 3.2
> narrow build, and each two bytes is a character. If it is a 4-byte
> string, then it is just like Python 3.2 wide build, and each four bytes
> is a character. Within a single string, the number of bytes per character
> is fixed, and random access is easy and fast.
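
A rough model of the fixed-width lookup described above, as a sketch only. It does not touch CPython's real string internals: the latin-1, utf-16-le and utf-32-le encodings stand in for the 1-, 2- and 4-byte layouts, and the names fixed_width_char and char_index_of_byte are invented for the example.

```python
def fixed_width_char(buffer, index, width):
    """Fetch character `index` from a buffer that uses `width` bytes per character."""
    start = index * width  # direct jump, no scanning
    return chr(int.from_bytes(buffer[start:start + width], "little"))


def char_index_of_byte(byte_offset, width):
    """Given any byte offset, return the index of the character it belongs to."""
    return byte_offset // width


# Hypothetical sample whose code points all fit in one byte.
text = "na\u00efve"

# Stand-ins for the three layouts (an assumption of this sketch).
one_byte = text.encode("latin-1")
two_byte = text.encode("utf-16-le")
four_byte = text.encode("utf-32-le")

for buf, width in ((one_byte, 1), (two_byte, 2), (four_byte, 4)):
    print(width, fixed_width_char(buf, 2, width), char_index_of_byte(5, width))
```

Both directions are a single arithmetic step, whichever width the string uses.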
>
> --
> Steven
:-)