A few questions about encoding

Fábio Santos fabiosantosart at gmail.com
Sun Jun 9 08:18:08 EDT 2013


On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <nikos.gr33k at gmail.com> wrote:
>
> A few questions about encoding please:
>
> >> Since 1 byte can hold 256 different values, why doesn't UTF-8 use
> >> 1 byte for values up to 256?
>
> >Because then how do you tell when you need one byte, and when you need
> >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
> >characters, with ordinal values 0x4C and 0xFA, or one character with
> >ordinal value 0x4CFA?
>
> I mean UTF-8 could use 1 byte for storing the first 256 characters. I meant
> up to 256, not above 256.
>
>
> >> UTF-8 and UTF-16 and UTF-32
> >> I thought the number next to UTF- declared how many bits the
> >> character set uses to store a character on the hdd, no?
>
> >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
> >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
> >values to make a surrogate pair.
>
> Is a surrogate pair like hitting, for example, Ctrl-A, meaning a
> combination character that consists of 2 different characters?
> Is this what a surrogate is? A pair of 2 chars?
>
>
> >UTF-8 uses 8-bit values, but sometimes
> >it combines two, three or four of them to represent a single code-point.
>
> 'a' to be UTF-8 encoded needs 1 byte to be stored? (since ordinal = 97)
> 'α΄' to be UTF-8 encoded needs 2 bytes to be stored? (since ordinal is >
> 127)
> 'a Chinese ideogram' to be UTF-8 encoded needs 4 bytes to be stored?
> (since ordinal > 65000)
>
> The number of bytes needed to store a character depends solely on the
> character's ordinal value in the Unicode table?

In short, a UTF-8 character takes 1 to 4 bytes. A UTF-16 character takes
either 2 or 4 bytes. A UTF-32 character always takes 4 bytes.
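
To make those numbers concrete, here is a small Python 3 sketch. The sample
characters and the helper name byte_lengths are just my picks for
illustration, and I use the "-le" codec variants on the assumption that you
want the size without a byte-order mark in front.

```python
# Byte length of one character in each encoding (Python 3).
# The "-le" codecs are used so that no byte-order mark is prepended.
samples = {
    "a": "LATIN SMALL LETTER A, U+0061",
    "\u03b1": "GREEK SMALL LETTER ALPHA, U+03B1",
    "\u4e2d": "CJK ideograph, U+4E2D",
    "\U0001F600": "emoji outside the Basic Multilingual Plane, U+1F600",
}


def byte_lengths(ch):
    """Return (utf-8, utf-16, utf-32) byte counts for a one-character string."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for ch, name in samples.items():
    print(name, byte_lengths(ch))
```

The byte count follows from the code point alone: in UTF-8, code points up to
U+007F take 1 byte, up to U+07FF take 2, up to U+FFFF take 3 (this covers
most Chinese ideographs), and everything above takes 4. In UTF-16 only code
points above U+FFFF need 4 bytes, and that 4-byte form is the surrogate pair.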

Turning characters into bytes is called encoding. The opposite, turning
bytes back into characters, is called decoding. Python handles both through
the encode() and decode() methods, so you normally don't need to care about
the byte-level details.
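
A rough Python 3 sketch of the round trip, in case it helps. The variable
names are mine, and the Latin-1 comparison is something I added to show why
the two bytes 0x4C 0xFA from the quoted question mean different things
depending on the encoding you assume.

```python
text = "\u03b1\u03b2\u03b3"        # three Greek letters: alpha, beta, gamma
data = text.encode("utf-8")        # encoding: str -> bytes
restored = data.decode("utf-8")    # decoding: bytes -> str

print(len(text), len(data))        # number of characters vs. number of bytes
print(restored == text)            # the round trip gives back the same string

# The two bytes from the quoted question, 0x4C 0xFA.
raw = bytes([0x4C, 0xFA])
as_latin1 = raw.decode("latin-1")  # one byte per character, so two characters
print(ascii(as_latin1))

utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:  # 0xFA cannot start a UTF-8 sequence
    utf8_error = exc
print(utf8_error)
```

This is the reason UTF-8 keeps single-byte values for 0 to 127 only: bytes
with the high bit set are reserved for marking multi-byte sequences, so a
decoder always knows how many bytes belong to the current character.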