A few questions about encoding

Cameron Simpson cs at zip.com.au
Fri Jun 14 06:14:54 EDT 2013


On 14Jun2013 09:59, Nikos as SuperHost Support <support at superhost.gr> wrote:
| On 14/6/2013 4:00 πμ, Cameron Simpson wrote:
| >On 13Jun2013 17:19, Nikos as SuperHost Support <support at superhost.gr> wrote:
| >| A code-point and the code-point's ordinal value are associated into
| >| a Unicode charset. They have the so called 1:1 mapping.
| >|
| >| So, i was under the impression that by encoding the code-point into
| >| utf-8 was the same as encoding the code-point's ordinal value into
| >| utf-8.
| >|
| >| So, now i believe they are two different things.
| >| The code-point *is what actually* needs to be encoded and *not* its
| >| ordinal value.
| >
| >Because there is a 1:1 mapping, these are the same thing: a code
| >point is directly _represented_ by the ordinal value, and the ordinal
| >value is encoded for storage as bytes.
| 
| So, you are saying that:
| 
| chr(16474).encode('utf-8')   #being the code-point encoded
| 
| ord(chr(16474)).encode('utf-8')     #being the code-point's ordinal
| encoded which gives an error.
| 
| that shows us that a character is what is being encoded to utf-8
| but the character's ordinal cannot.
| 
| So, why do you say "....and the ordinal value is encoded for storage
| as bytes." ?

No, I mean conceptually, there is no difference between a codepoint
and its ordinal value. They are the same thing.

Inside Python itself, a character is just a string of length 1 (there
is no separate character type), and a string is a distinct type from
an integer. Internally, the characters in a string are stored
numerically: as Unicode codepoints, i.e. as their ordinal values.
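
You can poke at this in the interpreter. A small sketch (the sample
string is just my own illustration, not anything from your code):

```python
# A "character" is simply a str of length 1; there is no char type.
c = chr(16474)
print(type(c).__name__, len(c))   # str 1

# Each character in a string corresponds to an integer codepoint.
s = "a\u03a9"                     # 'a' followed by GREEK CAPITAL LETTER OMEGA
codepoints = [ord(ch) for ch in s]
print(codepoints)                 # [97, 937]
```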

It is a meaningful idea to store a Python string encoded into bytes
using some text encoding scheme (utf-8, iso-8859-7, what have you).
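
As a rough illustration, here is one string transcribed with two
different text encodings. The choice of the letter alpha is mine,
purely as an example; the point is only that the same string yields
different bytes depending on the scheme:

```python
s = "\u03b1"                          # GREEK SMALL LETTER ALPHA, one character

utf8_bytes = s.encode("utf-8")        # two bytes in UTF-8
greek_bytes = s.encode("iso-8859-7")  # one byte in ISO-8859-7

print(utf8_bytes)                     # b'\xce\xb1'
print(len(greek_bytes))               # 1

# Decoding with the matching encoding recovers the same string.
print(utf8_bytes.decode("utf-8") == s)         # True
print(greek_bytes.decode("iso-8859-7") == s)   # True
```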

It is not a meaningful thing to store a number "encoded" without
some more context. The .encode() method that accepts an encoding
name like "utf-8" is specifically an encoding procedure FOR TEXT.

So strings have such a method, and integers do not.
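
To show what I mean by "more context": an integer can be stored as
bytes, but you have to say how many bytes and in which byte order,
which has nothing to do with text encodings. A sketch using
int.to_bytes() (the width of 2 bytes is my own choice for this
example):

```python
number = 16474                     # 0x405a

# Integers have no text-encoding method.
print(hasattr(number, "encode"))   # False

# Storing an integer needs different context: a width and a byte order.
as_big = number.to_bytes(2, "big")
as_little = number.to_bytes(2, "little")
print(as_big)                      # b'@Z'  (bytes 0x40, 0x5a)
print(as_little)                   # b'Z@'  (bytes 0x5a, 0x40)

# The same context is needed to get the number back.
print(int.from_bytes(as_big, "big"))   # 16474
```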

When you write:

  chr(16474)

you receive a _string_, containing the single character whose ordinal
is 16474. It is meaningful to transcribe this string to bytes using
a text encoding procedure like 'utf-8'.

When you write:

  ord(chr(16474))

you get an integer. Because ord() is the reverse of chr(), you get
the integer 16474.

Integers do not have .encode() methods that accept a _text_ encoding
name like 'utf-8' because integers are not text.
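
Putting your two expressions side by side, as a short sketch:

```python
s = chr(16474)                 # a str containing one character
n = ord(s)                     # back to the integer

print(n)                       # 16474
encoded = s.encode("utf-8")
print(encoded)                 # b'\xe4\x81\x9a'

# The integer has no .encode() method, so this raises AttributeError.
error = None
try:
    n.encode("utf-8")
except AttributeError as e:
    error = str(e)
    print("cannot text-encode an integer:", error)
```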

| >| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| >| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
| >|
| >| But byte objects are represented as '\x' instead of the
| >| aforementioned '0x'. Why is that?
| >
| >You're confusing a "string representation of a single number in
| >some base (eg 2 or 16)" with the "string-ish representation of a
| >bytes object".
| 
| >>> bin(16474)
| '0b100000001011010'
| that is a binary format string representation of number 16474, yes?

Yes.

| >>> hex(16474)
| '0x405a'
| that is a hexadecimal format string representation of number 16474, yes?

Yes.

| WHILE:
| b'abc\x1b\n' = a string representation of a byte, which in turn is a
| series of integers, so that makes this a string representation of
| integers, is this correct?

A "bytes" Python object. So not "a byte", 5 bytes.
It is a string representation of the series of byte values,
ON THE PREMISE that the bytes may well represent text.
On that basis, b'abc\x1b\n' is a reasonable way to display them.

In other contexts this might not be a sensible way to display these
bytes, and then another format would be chosen, possibly hand
constructed by the programmer or, equally reasonably, produced by the
hexlify() function from the binascii module.
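
For example, here are a few alternative displays of the same 5 bytes.
This is only a sketch; the space-separated format is a hand-rolled
one of my own, and bytes.hex() assumes Python 3.5 or later:

```python
import binascii

data = b'abc\x1b\n'

print(len(data))                 # 5
print(list(data))                # [97, 98, 99, 27, 10]

hexed = binascii.hexlify(data)   # a bytes object of hex digits
print(hexed)                     # b'6162631b0a'

# A hand-constructed format: two hex digits per byte, space separated.
spaced = " ".join(format(b, "02x") for b in data)
print(spaced)                    # 61 62 63 1b 0a

# Python 3.5+ also offers bytes.hex(), which returns a str.
print(data.hex())                # 6162631b0a
```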

| \x1b = ESC character

Considering the bytes to be representing characters, then yes.

| \ = for separating bytes

No, the \ introduces a sequence of characters with special meaning.

Normally a character in a b'...' item represents the byte value
matching the character's Unicode ordinal value. But several characters
are hard or confusing to place literally in a b'...' string. For
example a newline character or an escape character.

'a' means 97.
'\n' means 10 (newline, hence the 'n').
'\x1b' means 27 (escape, value 0x1b in hexadecimal).
And, of course, '\\' means a literal slosh, value 92.
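
You can check each of these by indexing a bytes object, which hands
back the byte value as an integer. A quick sketch:

```python
plain = b'a'[0]
newline = b'\n'[0]
escape = b'\x1b'[0]
slosh = b'\\'[0]

print(plain)         # 97
print(newline)       # 10
print(escape)        # 27
print(slosh)         # 92

# Each escape sequence stands for exactly one byte.
print(len(b'\n'), len(b'\x1b'), len(b'\\'))   # 1 1 1
```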

| x = to flag that the following bytes are going to be represented as
| hex values? whats exactly 'x' means here? character perhaps?

A slosh followed by an 'x' means there will be 2 hexadecimal digits
to follow, and those two digits represent the byte value.

So, yes.
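
A small sketch of the "exactly 2 digits" rule. The sample values are
mine; note how a third hex-looking character is just an ordinary
character following the escape:

```python
one = b'\x41'        # slosh, x, two hex digits: a single byte, value 0x41
print(one)           # b'A'
print(len(one))      # 1

two = b'\x411'       # the escape takes only "41"; the trailing "1" is separate
print(two)           # b'A1'
print(list(two))     # [65, 49]
```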

| Still its not clear into my head what the difference of '0x1b' and
| '\x1b' is:

They denote the same value, 27, in two similar but slightly different notations.

0x1b is a legitimate "bare" integer value in Python.

\x1b is a sequence you find inside strings (and "byte" strings, the
b'...' format).

| i think:
| 0x1b = an integer represented in hex format

Yes.

| \x1b = a character represented in hex format

Yes.
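
Side by side, as a rough sketch:

```python
n = 0x1b                      # a bare integer literal, written in hex
print(n, type(n).__name__)    # 27 int

c = '\x1b'                    # an escape sequence inside a string
print(len(c), type(c).__name__)   # 1 str
print(ord(c))                 # 27

b = b'\x1b'                   # the same escape inside a bytes literal
print(b[0])                   # 27

# All three notations carry the same value.
print(ord(c) == n == b[0])    # True
print(hex(ord(c)))            # 0x1b
```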

| >| How can i view this byte's object representation as hex() or as bin()?
| >
| >See above. A bytes is a _sequence_ of values. hex() and bin() print
| >individual values in hexadecimal or binary respectively.
| 
| >>> for value in b'\x97\x98\x99\x27\x10':
| ...     print(value, hex(value), bin(value))
| ...
| 151 0x97 0b10010111
| 152 0x98 0b10011000
| 153 0x99 0b10011001
| 39 0x27 0b100111
| 16 0x10 0b10000
| 
| 
| >>> for value in b'abc\x1b\n':
| ...     print(value, hex(value), bin(value))
| ...
| 97 0x61 0b1100001
| 98 0x62 0b1100010
| 99 0x63 0b1100011
| 27 0x1b 0b11011
| 10 0xa 0b1010
| 
| 
| Why these two give different values when printed?

97 is in base 10 (9*10+7=97), but the notation '\x97' is base 16, so 9*16+7=151.
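
In other words, your two loops iterate over different bytes. A short
sketch to make the arithmetic concrete:

```python
from_escape = b'\x97'[0]      # hex digits 9 and 7: 9*16 + 7
from_letter = b'a'[0]         # the letter 'a'

print(from_escape)            # 151
print(from_letter)            # 97
print(hex(from_letter))       # 0x61

# Reading the digits "97" in base 16 versus base 10:
print(int('97', 16))          # 151
print(int('97', 10))          # 97

# The letter 'a' written as an escape uses its hex value, 61.
print(b'\x61' == b'a')        # True
```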

Cheers,
-- 
Cameron Simpson <cs at zip.com.au>

I'm Bubba of Borg.  Y'all fixin' to be assimilated.


