[Tutor] Why difference between printing string & typing its object reference at the prompt?

eryksun eryksun at gmail.com
Thu Oct 11 10:40:52 CEST 2012


On Wed, Oct 10, 2012 at 9:23 PM, boB Stepp <robertvstepp at gmail.com> wrote:
>
>>     >>> aꘌꘌb = True
>>     >>> aꘌꘌb
>>     True
>>
>>     >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
>>     >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
>>     (1, 2, 3, 4, 5)
>
> Is doing this considered good programming practice?


The examples were meant to highlight the absurdity of using letter
modifiers and number letters in identifiers. I should have clearly
stated that I think these names are bad.
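For the curious, Python 3 decides which characters may start or continue an identifier by Unicode category. A small sketch with the standard unicodedata module (the category code shown is my reading of the Unicode database; ROMAN NUMERAL ONE is a "letter number", category Nl, which is why it's accepted as a name):

```python
import unicodedata

# Both an ordinary letter and ROMAN NUMERAL ONE are valid identifier
# characters, even though the latter looks like a numeral.
for ch in ("a", "\u2160"):   # "a", "Ⅰ"
    print(hex(ord(ch)), unicodedata.category(ch), ch.isidentifier())
```

That a character *can* be an identifier doesn't mean it *should* be, which was the point of the examples above.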


>> bytes have string methods as a convenience, such as find, split, and
>> partition. They also have the method decode(), which uses a specified
>> encoding such as "utf-8" to create a string from an encoded bytes
>> sequence.
>
> What is the intended use of byte types?


bytes objects are important for low-level data processing, such as
file and socket I/O. The fundamental addressable value in a computer
is a byte (at least for all common, modern computers). When you write
a string to a file or socket, it has to be encoded as a sequence of
bytes.
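For instance, a file opened in binary mode only accepts bytes, so a string has to be encoded on the way in and decoded on the way out (a minimal sketch using a temporary file):

```python
import tempfile

with tempfile.TemporaryFile() as f:   # binary mode by default
    # f.write("spam") would raise TypeError: bytes-like object required
    f.write("spam".encode("utf-8"))   # encode the str to bytes first
    f.seek(0)
    data = f.read()                   # bytes come back out
    print(data.decode("utf-8"))       # decode to recover the str
```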

For example, consider the character "𝟡" (MATHEMATICAL DOUBLE-STRUCK
DIGIT NINE) with decimal code 120801 (0x1d7e1 in hexadecimal):

    >>> ord("𝟡")
    120801

Three common ways to encode this character are as UTF-32, UTF-16, and UTF-8.

The UTF-32 encoding is the UCS4 format used by strings in main memory
on a "wide" build of Python (Python 3.3 uses a more efficient scheme
that uses 1, 2, or 4 bytes per character as required).

    >>> s = "𝟡"
    >>> s.encode("utf-32")
    b'\xff\xfe\x00\x00\xe1\xd7\x01\x00'

The "utf-32" string encoder also includes a byte order mark (BOM) in
the first 4 bytes of the encoded sequence (0xff 0xfe 0x00 0x00). The
byte order of the BOM indicates that this is a little-endian, 4-byte
encoding.

http://en.wikipedia.org/wiki/Endianness
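If you don't want the BOM, the byte order can be made explicit in the codec name instead; the endian-specific codecs omit it:

```python
s = "\U0001d7e1"   # "𝟡"

# Explicit-endian codecs produce just the 4 code bytes, no BOM.
print(s.encode("utf-32-le"))   # b'\xe1\xd7\x01\x00'
print(s.encode("utf-32-be"))   # b'\x00\x01\xd7\xe1'
```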

You can use int.from_bytes() to verify that b'\xe1\xd7\x01\x00' is the
number 120801 stored as 4 bytes in little-endian order:

    >>> int.from_bytes(b'\xe1\xd7\x01\x00', 'little')
    120801

or crunch the numbers in a generator expression:

    >>> sum(x * 256**i for i,x in enumerate(b'\xe1\xd7\x01\x00'))
    120801
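int.to_bytes() is the inverse operation, if you want to round-trip:

```python
n = 120801
b = n.to_bytes(4, "little")   # back to the 4-byte little-endian form
print(b)                      # b'\xe1\xd7\x01\x00'
print(int.from_bytes(b, "little") == n)   # True
```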

UTF-32 is an inefficient way to represent Unicode. Characters in the
BMP, which are by far the most common, require at most 2 bytes.

UTF-16 uses 2 bytes for BMP codes, like the original UCS2, and a
4-byte surrogate-pair encoding for characters in the supplementary
planes. Here's the character "𝟡" encoded as UTF-16:

    >>> list(map(hex, s.encode('utf-16')))
    ['0xff', '0xfe', '0x35', '0xd8', '0xe1', '0xdf']

Again there's a BOM, 0xfffe, which describes the order and number of
bytes per code (i.e. 2 bytes, little endian). The character itself is
stored as the surrogate pair [0xd835, 0xdfe1]. You can read more about
surrogate pair encoding in the UTF-16 Wikipedia article:

http://en.wikipedia.org/wiki/UTF-16
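As a sketch, the pair can be computed by hand from the layout described there: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves:

```python
cp = 0x1D7E1                     # code point of "𝟡"
offset = cp - 0x10000            # 20-bit value 0x0D7E1
high = 0xD800 + (offset >> 10)   # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))       # 0xd835 0xdfe1
```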

A "narrow" build of Python uses UCS2 + surrogates. It's not quite
UTF-16 since it doesn't treat a surrogate pair as a single character
for iteration, string length, and indexing. Python 3.3 eliminates
narrow builds.
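On 3.3 and later (and on old wide builds), a supplementary-plane character counts as a single character:

```python
s = "\U0001d7e1"     # "𝟡"
print(len(s))        # 1 here; a narrow build would have reported 2
print(s[0] == s)     # True: indexing returns the whole character
```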

Another common encoding is UTF-8. This maps each code to 1-4 bytes,
without requiring a BOM (though the 3-byte BOM 0xefbbbf can be used
when saving to a file). Since ASCII is so common, and since on many
systems backward compatibility with ASCII is required, UTF-8 includes
ASCII as a subset. In other words, codes below 128 are stored
unmodified as a single byte. Non-ASCII codes are encoded as 2-4 bytes.
See the UTF-8 Wikipedia article for the details:

http://en.wikipedia.org/wiki/UTF-8#Description

The character "𝟡" requires 4 bytes in UTF-8:

    >>> s = "𝟡"
    >>> sb = s.encode("utf-8")
    >>> sb
    b'\xf0\x9d\x9f\xa1'
    >>> list(sb)
    [240, 157, 159, 161]
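As a cross-check, the code point can be reassembled by hand from those four bytes using the UTF-8 bit layout (3 payload bits in the lead byte of a 4-byte sequence, 6 bits in each continuation byte):

```python
b = b'\xf0\x9d\x9f\xa1'
cp = ((b[0] & 0x07) << 18 |   # lead byte 0b11110xxx
      (b[1] & 0x3F) << 12 |   # continuation bytes 0b10xxxxxx
      (b[2] & 0x3F) << 6 |
      (b[3] & 0x3F))
print(cp, hex(cp))            # 120801 0x1d7e1
```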

If you iterate over the encoded bytestring, the numbers 240, 157, 159,
and 161 -- taken separately -- have no special significance. Nor does
the length of 4 tell you how many characters are in the bytestring.
With a decoded string, in contrast, you know how many characters it
has (assuming you've normalized to NFC form) and can iterate through
them in a simple for loop.
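The NFC caveat matters because some characters have both composed and decomposed spellings; unicodedata.normalize shows the difference:

```python
import unicodedata

s1 = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
s2 = unicodedata.normalize("NFC", s1)
print(len(s1), len(s2))             # 2 1
print(s2 == "\u00e9")               # True: the precomposed e-acute
```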

If your terminal/console uses UTF-8, you can write the UTF-8 encoded
bytes directly to the stdout buffer:

    >>> import sys
    >>> sys.stdout.buffer.write(b'\xf0\x9d\x9f\xa1' + b'\n')
    𝟡
    5

This wrote 5 bytes: 4 bytes for the "𝟡" character, plus b'\n' for a newline.


Strings in Python 2

In Python 2, str is a bytestring. Iterating over a 2.x str yields
single-byte characters. However, these generally aren't 'characters'
at all (the notion goes back to the C programming language's "char"
type) unless you're working with a single-byte encoding such as ASCII
or Latin-1. In Python 2, unicode is a separate type, and unicode
literals require a u prefix to distinguish them from bytestrings, just
as bytes literals in Python 3 require a b prefix to distinguish them
from strings.

Python 2.6 and 2.7 alias str to the name "bytes", and they support the
b prefix in literals. These were added to ease porting to Python 3,
but bear in mind that it's still a classic bytestring, not a bytes
object. For example, in 2.x you can use ord() with an item of a
bytestring, such as ord(b"ABC"[0]), but this won't work in 3.x because
b"ABC"[0] returns the integer 65. On the other hand, ord(b"A") does
work in 3.x.
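In 3.x the difference looks like this:

```python
b = b"ABC"
print(b[0])        # 65 -- indexing a bytes object yields an int
print(b[0:1])      # b'A' -- slicing yields a bytes object
print(ord(b"A"))   # 65 -- ord() accepts a length-1 bytes object
```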

Python 2.6 also added "from __future__ import unicode_literals" to
make string literals default to unicode without having to use the u
prefix; bytestrings then require the b prefix.

