python 3.3 repr

Fri Nov 15 10:08:28 EST 2013

On Sat, Nov 16, 2013 at 1:43 AM, Robin Becker <robin at reportlab.com> wrote:
> ..........
>
>> I'm still stuck on Python 2, and while I can understand the controversy
>> ("It breaks my Python 2 code!"), this seems like the right thing to have
>> done.  In Python 2, unicode is an add-on.  One of the big design drivers in
>> Python 3 was to make unicode the standard.
>>
>> The idea behind repr() is to provide a "just plain text" representation of
>> an object.  In P2, "just plain text" means ascii, so escaping non-ascii
>> characters makes sense.  In P3, "just plain text" means unicode, so escaping
>> non-ascii characters no longer makes sense.
>>
>
> unfortunately the word 'printable' got into the definition of repr; it's
> clear that printability is not the same as unicode at least as far as the
> print function is concerned. In my opinion it would have been better to
> leave the old behaviour as that would have eased the compatibility.

"Printable" means many different things in different contexts. In some
contexts, the sequence \x66\x75\x63\x6b is considered unprintable, yet
each of those characters is perfectly displayable in its natural form.
Under IDLE, non-BMP characters can't be displayed (or at least, that's
how it has been; I haven't checked current status on that one). On
Windows, the console runs in codepage 437 by default (again, I may be
wrong here), so anything not representable in that has to be escaped.
My Linux box has its console set to full Unicode, everything working
perfectly, so any non-control character can be printed. As far as
Python's concerned, all of that is outside - something is "printable"
if it's printable within Unicode, and the other hassles are matters of
encoding. (Except the first one. I don't think there's an encoding
"g-rated".)

> The python gods don't count that sort of thing as important enough so we get
> the mess that is the python2/3 split. ReportLab has to do both so it's a
> real issue; in addition swapping the str - unicode pair to bytes str doesn't
> help one's mental models either :(

That's fixing, in effect, a long-standing bug - of a sort. The name
"str" needs to be applied to the most normal string type. As of Python
3, that's a Unicode string, which is as it should be. In Python 2, it
was the ASCII/bytes string, which still fit the description of "most
normal string type", but that means that Python 2 programs are
Unicode-unaware by default, which is a flaw. Hence the Py3 fix.

> Things went wrong when utf8 was not adopted as the standard encoding thus
> requiring two string types, it would have been easier to have a len function
> to count bytes as before and a glyphlen to count glyphs. Now as I understand
> it we have a complicated mess under the hood for unicode objects so they
> have a variable representation to approximate an 8 bit representation when
> suitable etc etc etc.

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

There are languages that do what you describe. It's very VERY easy to
break stuff. What happens when you slice a string?

>>> foo = "asdf"
>>> foo[:2],foo[2:]
('as', 'df')

>>> foo = "q\u1234zy"
>>> foo[:2],foo[2:]
('qሴ', 'zy')

Looks good to me. I split a four-character string, I get two
one-character strings. If that had been done in UTF-8, either I would
need to know "don't split at that boundary, that's between bytes in a
character", or else the indexing and slicing would have to be done by
counting characters from the beginning of the string - an O(n)
operation, rather than an O(1) pointer arithmetic, not to mention that
it'll blow your CPU cache (touching every part of a potentially-long
string) just to find the position.

The only reliable way to manage things is to work with true Unicode.
You can completely ignore the internal CPython representation; what
matters is that in Python (any implementation, as long as it conforms
with version 3.3 or later) lets you index Unicode codepoints out of a
Unicode string, without differentiating between those that happen to
be ASCII, those that fit in a single byte, those that fit in two
bytes, and those that are flagged RTL, because none of those
considerations makes any difference to you.

It takes some getting your head around, but it's worth it - same as
using git instead of a Windows shared drive. (I'm still trying to push
my family to think git.)

ChrisA