A few questions about encoding

Cameron Simpson cs at zip.com.au
Thu Jun 13 21:00:37 EDT 2013


On 13Jun2013 17:19, Nikos as SuperHost Support <support at superhost.gr> wrote:
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
| 
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
| 
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.

Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.
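
A minimal sketch of that chain, assuming U+00E9 as an arbitrary example
character (it is not from the original question): the character and its
ordinal are interchangeable via ord() and chr(), and encoding to UTF-8
produces the bytes that get stored.

```python
# Assumption: U+00E9 (e with acute accent) is just an illustrative character.
ch = '\u00e9'

# The code point is represented by its ordinal value, 1:1.
ordinal = ord(ch)
assert chr(ordinal) == ch

# Encoding turns that code point into bytes for storage.
encoded = ch.encode('utf-8')

print(ordinal, hex(ordinal))   # 233 0xe9
print(encoded, list(encoded))  # b'\xc3\xa9' [195, 169]
```

Note that the ordinal (233) and the encoded bytes (195, 169) differ: the
encoding decides how the ordinal is laid out in bytes.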

| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
| 
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?

You're confusing a "string representation of a single number in
some base (eg 2 or 16)" with the "string-ish representation of a
bytes object".

The former is just notation for writing a number in different bases, eg:

  27    base 10
  1b    base 16
  33    base 8
  11011 base 2

A common convention, and the one used by hex(), oct() and bin() in
Python, is to prefix the non-base-10 representations with "0x" for
base 16, "0o" for base 8 ("o"ctal) and "0b" for base 2 ("b"inary):

  27
  0x1b
  0o33
  0b11011

This allows the human reader or a machine lexer to decide what base
the number is written in, and therefore to figure out what the
underlying numeric value is.
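
As a quick check of the prefixes, here is a small sketch; passing base 0
to int() asks Python to pick the base from the prefix, the same way the
lexer does for a literal in source code.

```python
n = 27

# Number -> string in each base, with the conventional prefix.
print(n, hex(n), oct(n), bin(n))   # 27 0x1b 0o33 0b11011

# String -> number: base 0 means "work out the base from the prefix".
texts = ['27', '0x1b', '0o33', '0b11011']
values = [int(text, 0) for text in texts]
print(values)                      # [27, 27, 27, 27]

# The same prefixes work as literals in Python source.
assert 0x1b == 0o33 == 0b11011 == 27
```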

Conversely, consider the bytes object consisting of the values [97,
98, 99, 27, 10]. In ASCII (and UTF-8 and the iso-8859-x encodings)
these may all represent the characters ['a', 'b', 'c', ESC, NL].
So when "printing" a bytes object, which is a sequence of small
integers representing values stored in bytes, it is compact to print:

  b'abc\x1b\n'

which is ['a', 'b', 'c', chr(27), newline].

The slosh (\) is the common convention in C-like languages and many
others for representing special characters that cannot be written
directly as themselves. So "\\" for a slosh, "\n" for a newline and
"\x1b" for character 27 (ESC).

The bytes object is still just a sequence of integers, but because
it is very common to have those integers represent text, and very
common to have some text one wants represented as bytes in a direct
1:1 mapping, this compact text form is useful and readable. It is
also legal Python syntax for making a small bytes object.
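
A short sketch of the round trip, using the same five values as above:
building the bytes object from a list of integers gives exactly the
object the literal syntax produces, and its repr is the compact form.

```python
values = [97, 98, 99, 27, 10]

# Build a bytes object directly from the integers.
data = bytes(values)

# It compares equal to the one written with literal syntax.
assert data == b'abc\x1b\n'

# repr() is what the interactive prompt shows: printable ASCII as itself,
# everything else as an escape.
print(repr(data))      # b'abc\x1b\n'

# Indexing gives back a plain integer, not a character.
print(data[3])         # 27
```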

To demonstrate that this is just a _representation_, run this:

  >>> [ i for i in b'abc\x1b\n' ]
  [97, 98, 99, 27, 10]

at an interactive Python 3 prompt. See? Just numbers.

| To encode a number we have to turn it into a string first.
| 
| "16474".encode('utf-8')
| b'16474'
| 
| That 'b' stand for bytes.

Syntactic details. Read this:
  http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

| How can i view this byte's object representation as hex() or as bin()?

See above. A bytes object is a _sequence_ of values. hex() and bin()
format individual values in hexadecimal or binary respectively. You
could do this:

  for value in b'16474':
    print(value, hex(value), bin(value))
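
If you want the whole object in one go rather than one value per line,
here is a sketch of two alternatives. It assumes Python 3.5 or later for
the bytes.hex() method; the '08b' format spec pads each value to a full
eight bits, which bin() does not do.

```python
data = b'16474'

# One hex string per byte value.
hex_values = [hex(value) for value in data]
print(hex_values)   # ['0x31', '0x36', '0x34', '0x37', '0x34']

# One zero-padded 8-bit binary string per byte value.
bin_values = [format(value, '08b') for value in data]
print(bin_values)

# Assumption: Python 3.5+, where bytes objects have a hex() method.
hex_string = data.hex()
print(hex_string)   # 3136343734
```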

Cheers,
-- 
Cameron Simpson <cs at zip.com.au>

Uhlmann's Razor: When stupidity is a sufficient explanation, there is no need
                 to have recourse to any other.
                        - Michael M. Uhlmann, assistant attorney general
                          for legislation in the Ford Administration
