[Tutor] Why difference between printing string & typing its object reference at the prompt?
eryksun
eryksun at gmail.com
Thu Oct 11 10:40:52 CEST 2012
On Wed, Oct 10, 2012 at 9:23 PM, boB Stepp <robertvstepp at gmail.com> wrote:
>
>> >>> aꘌꘌb = True
>> >>> aꘌꘌb
>> True
>>
>> >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
>> >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
>> (1, 2, 3, 4, 5)
>
> Is doing this considered good programming practice?
The examples were meant to highlight the absurdity of using letter
modifiers and number letters in identifiers. I should have clearly
stated that I think these names are bad.
>> bytes have string methods as a convenience, such as find, split, and
>> partition. They also have the method decode(), which uses a specified
>> encoding such as "utf-8" to create a string from an encoded bytes
>> sequence.
>
> What is the intended use of byte types?
bytes objects are important for low-level data processing, such as
file and socket I/O. The fundamental addressable value in a computer
is a byte (at least for all common, modern computers). When you write
a string to a file or socket, it has to be encoded as a sequence of
bytes.
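As a minimal sketch of that round trip (the file name here is arbitrary, chosen just for illustration): a str must be encoded before it can be written to a binary-mode file, and the bytes read back must be decoded to get a str again.

```python
import os
import tempfile

text = "héllo 𝟡"
path = os.path.join(tempfile.gettempdir(), "tutor_demo.txt")

# Write: binary-mode files accept bytes, so encode the str first.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Read: the file hands back bytes, which decode() turns into a str.
with open(path, "rb") as f:
    raw = f.read()

print(type(raw).__name__)            # bytes
print(raw.decode("utf-8") == text)   # True
os.remove(path)
```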
For example, consider the character "𝟡" (MATHEMATICAL DOUBLE-STRUCK
DIGIT NINE) with decimal code 120801 (0x1d7e1 in hexadecimal):
>>> ord("𝟡")
120801
Three common ways to encode this character are as UTF-32, UTF-16, and UTF-8.
The UTF-32 encoding is the UCS4 format used by strings in main memory
on a "wide" build (Python 3.3 uses a more efficient scheme that uses
1, 2, or 4 bytes as required).
>>> s = "𝟡"
>>> s.encode("utf-32")
b'\xff\xfe\x00\x00\xe1\xd7\x01\x00'
The "utf-32" string encoder also includes a byte order mark (BOM) in
the first 4 bytes of the encoded sequence (0xff 0xfe 0x00 0x00). The
byte order of the BOM indicates that this is a little-endian, 4-byte
encoding.
http://en.wikipedia.org/wiki/Endianness
You can use int.from_bytes() to verify that b'\xe1\xd7\x01\x00' is the
number 120801 stored as 4 bytes in little-endian order:
>>> int.from_bytes(b'\xe1\xd7\x01\x00', 'little')
120801
or crunch the numbers in a generator expression:
>>> sum(x * 256**i for i,x in enumerate(b'\xe1\xd7\x01\x00'))
120801
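Going the other direction, int.to_bytes() produces the same 4-byte little-endian sequence, which makes the round trip easy to check:

```python
n = 120801

# Pack the integer into 4 little-endian bytes.
b = n.to_bytes(4, "little")
print(b)                                  # b'\xe1\xd7\x01\x00'

# Unpack them again to confirm the round trip.
print(int.from_bytes(b, "little") == n)   # True
```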
UTF-32 is an inefficient way to represent Unicode. Characters in the
BMP, which are by far the most common, require at most 2 bytes each.
UTF-16 uses 2 bytes for BMP codes, like the original UCS2, and a
4-byte surrogate-pair encoding for characters in the supplementary
planes. Here's the character "𝟡" encoded as UTF-16:
>>> list(map(hex, s.encode('utf-16')))
['0xff', '0xfe', '0x35', '0xd8', '0xe1', '0xdf']
Again there's a BOM, 0xfffe, which describes the order and number of
bytes per code (i.e. 2 bytes, little endian). The character itself is
stored as the surrogate pair [0xd835, 0xdfe1]. You can read more about
surrogate pair encoding in the UTF-16 Wikipedia article:
http://en.wikipedia.org/wiki/UTF-16
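The surrogate-pair arithmetic is simple enough to verify by hand; this sketch applies the standard UTF-16 formula to the code point of "𝟡":

```python
code = ord("𝟡")              # 0x1D7E1
v = code - 0x10000           # surrogates encode the 20-bit offset above the BMP
high = 0xD800 + (v >> 10)    # high (lead) surrogate: top 10 bits
low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate: bottom 10 bits
print(hex(high), hex(low))   # 0xd835 0xdfe1

# Cross-check against the built-in encoder (no BOM with "utf-16-le").
print("𝟡".encode("utf-16-le") == b'\x35\xd8\xe1\xdf')  # True
```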
A "narrow" build of Python uses UCS2 + surrogates. It's not quite
UTF-16 since it doesn't treat a surrogate pair as a single character
for iteration, string length, and indexing. Python 3.3 eliminates
narrow builds.
Another common encoding is UTF-8. This maps each code to 1-4 bytes,
without requiring a BOM (though the 3-byte BOM 0xefbbbf can be used
when saving to a file). Since ASCII is so common, and since on many
systems backward compatibility with ASCII is required, UTF-8 includes
ASCII as a subset. In other words, codes below 128 are stored
unmodified as a single byte. Non-ASCII codes are encoded as 2-4 bytes.
See the UTF-8 Wikipedia article for the details:
http://en.wikipedia.org/wiki/UTF-8#Description
The character "𝟡" requires 4 bytes in UTF-8:
>>> s = "𝟡"
>>> sb = s.encode("utf-8")
>>> sb
b'\xf0\x9d\x9f\xa1'
>>> list(sb)
[240, 157, 159, 161]
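As a sketch of where those four values come from, here is the 4-byte UTF-8 pattern (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) applied by hand to the same code point:

```python
code = ord("𝟡")                     # 0x1D7E1, needs 4 bytes in UTF-8
b1 = 0xF0 | (code >> 18)            # leading byte carries the top 3 bits
b2 = 0x80 | ((code >> 12) & 0x3F)   # each continuation byte carries 6 bits
b3 = 0x80 | ((code >> 6) & 0x3F)
b4 = 0x80 | (code & 0x3F)
print([b1, b2, b3, b4])             # [240, 157, 159, 161]
print(bytes([b1, b2, b3, b4]) == "𝟡".encode("utf-8"))  # True
```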
If you iterate over the encoded bytestring, the numbers 240, 157, 159,
and 161 -- taken separately -- have no special significance. Neither
does the length of 4 tell you how many characters are in the
bytestring. With a decoded string, in contrast, you know how many
characters it has (assuming you've normalized to "NFC" format) and can
iterate through the characters in a simple for loop.
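A quick comparison makes the difference concrete (character-level results like these assume Python 3.3+, where len() counts characters rather than code units):

```python
s = "𝟡"
sb = s.encode("utf-8")

print(len(s))            # 1  -- one character
print(len(sb))           # 4  -- four bytes
print([c for c in s])    # ['𝟡']  -- iterating a str yields characters
print(list(sb))          # [240, 157, 159, 161]  -- iterating bytes yields ints
```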
If your terminal/console uses UTF-8, you can write the UTF-8 encoded
bytes directly to the stdout buffer:
>>> import sys
>>> sys.stdout.buffer.write(b'\xf0\x9d\x9f\xa1' + b'\n')
𝟡
5
This wrote 5 bytes: 4 bytes for the "𝟡" character, plus b'\n' for a newline.
Strings in Python 2
In Python 2, str is a bytestring. Iterating over a 2.x str yields
single-byte characters. However, these generally aren't 'characters'
at all (the usage goes back to the C "char" type) unless you're
working with a single-byte encoding such as ASCII or Latin-1. In
Python 2, unicode is a separate type, and unicode literals
require a u prefix to distinguish them from bytestrings, just as bytes
literals in Python 3 require a b prefix to distinguish them from
strings.
Python 2.6 and 2.7 alias str to the name "bytes", and they support the
b prefix in literals. These were added to ease porting to Python 3,
but bear in mind that it's still a classic bytestring, not a bytes
object. For example, in 2.x you can use ord() with an item of a
bytestring, such as ord(b"ABC"[0]), but this won't work in 3.x because
b"ABC"[0] returns the integer 65. On the other hand, ord(b"A") does
work in 3.x.
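The 3.x behavior described above is easy to check directly (Python 3 only):

```python
data = b"ABC"

print(data[0])        # 65 -- indexing bytes yields an int in Python 3
print(data[:1])       # b'A' -- slicing yields a bytes object
print(ord(b"A"))      # 65 -- ord() still accepts a length-1 bytes
print(chr(data[0]))   # 'A' -- chr() maps the int back to a character
```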
Python 2.6 also added "from __future__ import unicode_literals", which
makes string literals default to unicode without the u prefix;
bytestring literals then require the b prefix.