The Case Against Python 3

Chris Angelico rosuav at gmail.com
Fri Nov 25 04:31:10 EST 2016


On Fri, Nov 25, 2016 at 7:29 PM, Mark Summerfield <list at qtrac.plus.com> wrote:
> The article has a section called:
>
>     "Statically Typed Strings"
>
> The title is wrong of course because Python uses dynamic typing. But his chief complaint seems to be that you can't mix strings and bytes in Python 3. That's a deliberate design choice that several Python core developers have explained. Essentially they are saying that you can't concatenate a bunch of raw bytes with a string in the same way that you can't add a set to a list -- and this makes perfect sense because raw bytes could be just bytes, or they could be a representation of text in which case by specifying the encoding (i.e., converting them to a string) the concatenation can take place. And this is in keeping with Python's core philosophy of being explicit.
>

It's worse than that. Look at his comparison of Py3 and Py2. I've
shortened them some to highlight the part I'm looking at:

# Python 3
x = bytes("hello", 'utf-8')
y = "hello"
def addstring(a, b):
    return a + b

addstring(x, y)
# TypeError

==========

# Python 2 (where bytes is just another name for str)
def addstring(a, b):
    return a + b
x = "hello"
y = bytes("hello")
addstring(x, y)
# 'hellohello'

==========

He clearly does not understand the difference between bytes and text,
as has been proven earlier, but this demonstrates that he doesn't even
understand the difference between Python's data types. The first
example is trying to add a byte string to a Unicode string; the second
is actually adding two byte strings. He could have given a demo of how
Python 2 lets you join str and unicode, but it would have spoiled his
Py2 code by putting u'hellohello' into his output, and making Py3
actually look better. Can't have that.
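
For anyone who wants to try this on Python 3, here is a minimal sketch
(an illustration only, not code from the article) of the failure and of
the two explicit ways around it. The names outcome, combined and
combined_bytes are made up for the example.

```python
def addstring(a, b):
    return a + b

x = bytes("hello", "utf-8")   # a bytes object
y = "hello"                   # a str (Unicode text) object

# Mixing the two types is refused outright.
try:
    addstring(x, y)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# Stating the encoding turns the bytes into text, and then it just works.
combined = addstring(x.decode("utf-8"), y)

# Going the other way gives you bytes instead.
combined_bytes = addstring(x, y.encode("utf-8"))
```

Either direction is fine; the point is that the programmer, not the
interpreter, decides which encoding applies.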

Then he says:
> If they're going to require beginners to struggle with the difference between bytes and Unicode the least they could do is tell people what variables are bytes and what variables are strings.
>

The trouble is, by the time you're adding bytes and text, you're not
looking at variables any more. You're looking at objects. I don't
think he's properly understood Python's object model.
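
A small sketch of that distinction, assuming Python 3: the type belongs
to the object, and a name can be rebound to an object of a different
type at any time. The helper describe() is hypothetical, written only
for this illustration.

```python
data = b"hello"                    # the name 'data' refers to a bytes object
first_type = type(data).__name__

data = data.decode("ascii")        # same name, now bound to a str object
second_type = type(data).__name__

def describe(obj):
    """Hypothetical helper: say whether an object is bytes or text."""
    if isinstance(obj, bytes):
        return "bytes"
    if isinstance(obj, str):
        return "text"
    return type(obj).__name__
```

So "telling people what variables are bytes" doesn't really mean
anything; you can always ask the object itself.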

Here, have some FUD:
> Strings are also most frequently received from an external source, such as a network socket, file, or similar input. This means that Python 3's statically typed strings and lack of static type safety will cause Python 3 applications to crash more often and have more security problems when compared with Python 2.
>

What security problems? Any evidence of that?

On the face of it, without any actual specific examples, which of
these would you expect to be more security-problem-prone: mixing data
types, or throwing exceptions? In a web application, an exception can
be caught at a high level, logged, and handled by kicking a 500 back
to the client. In other applications, there may be an equivalent, or
you just terminate the server (client gets disconnected) and start up
again. At worst, this means that someone can exploit the whole "crash
and restart" thing as a way to DoS you. Here, let me walk you through
some different numeric types, and you tell me which ones are equal and
which aren't - and the security implications of that:

1) 1e2 == 100 ?
2) 1e2 == "100" ?
3) "1e2" == 100 ?
4) "1e2" == "100" ?

#1 makes perfect sense. Python says, yes, this is the case. (Not all
languages will; 1e2 is a floating-point literal, 100 is an integer,
and it's conceivable to keep them separate.)

#2 is acceptable to languages with "sloppy comparison" and "strict
comparison" operators, like ECMAScript/JavaScript. The number 100 is
(non-strictly) equal to the string "100".

#3 depends on whether sloppy comparisons are done by converting to
string or converting to number. ECMAScript treats them as equal, but
I'm just as happy with that being false (actually, probably slightly
happier).

#4 makes no sense to any sane programmer [1], which must be why PHP
chose to have that one be true.
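
For comparison, here is how the same four questions come out in
Python 3. This is just a sketch for illustration; the names results and
explicit are made up.

```python
results = [
    1e2 == 100,        # float vs int: compared by numeric value
    1e2 == "100",      # number vs str: never equal
    "1e2" == 100,      # str vs number: never equal
    "1e2" == "100",    # two different strings: not equal
]

# If you want a numeric comparison of two strings, you ask for it.
explicit = float("1e2") == int("100")
```

Only the first comparison is true, and the explicit conversion makes the
last one true on request.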

Security implications of two different hexadecimal strings comparing
equal (say, two hex digests that both happen to look like "0e" followed
by digits)... that can't have any bearing on passwords now, can it...

Is b"hello" == u"hello" ever a security consideration? If it is, my
money is on the exception being the *more* secure option.
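
One wrinkle worth spelling out, as a hedged sketch for Python 3: the ==
comparison between bytes and text quietly returns False (it only warns
or raises if the interpreter is started with -b or -bb), whereas
hmac.compare_digest from the standard library refuses the mixed types
with an exception. The names same, mixed and matched are made up.

```python
import hmac

# Plain equality: bytes and text are simply never equal.
same = (b"hello" == "hello")

# The constant-time comparison meant for secrets rejects mixed types.
try:
    hmac.compare_digest(b"hello", "hello")
    mixed = "compared"
except TypeError:
    mixed = "TypeError"

# With both sides as bytes, it compares normally.
matched = hmac.compare_digest(b"hello", "hello".encode("utf-8"))
```

Neither behaviour lets a byte string pass for a text string by accident.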

Straight-up false:
> The point being that character encoding detection and negotiation is a solved problem.

Nope, nope it isn't. One of my hobbies is collecting movie subtitles
in various languages [2]. They generally come to me in eight-bit
encodings with no declaration. Using only internal evidence, chardet
has about a 66% hit rate at a pass mark of "readable enough that I can
figure out the language", and a much lower hit rate at "actually the
correct encoding". With better heuristics (maybe a set of rules
specifically aimed at reading subtitle files), that could probably get
as far as 100% readable and 75% correct, but it is *never* going to be
perfect, because *the input is ambiguous*.
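
A quick way to see the ambiguity, sketched in Python 3 with only the
standard codecs (the sample word and the list of candidate encodings are
arbitrary choices for the example): the same undeclared bytes decode
without any error under several eight-bit encodings, each giving
different text.

```python
original = "привет"                  # Russian for "hello"
raw = original.encode("cp1251")      # what arrives, with no declaration

# Every one of these decodes successfully; only one is what was meant.
candidates = ["cp1251", "koi8-r", "latin-1"]
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```

Nothing in the bytes themselves says which reading is right; a detector
can only guess from what looks plausible.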

If other languages appear to have gotten this right, it's probably
because they either enforce a single encoding (eg UTF-8), or just
ignore the whole problem, assuming that someone else will have to deal
with it.

Mark says:
> He's right! The % formatting was kept to help port old code, the new .format() which is far more versatile is a bit verbose, so finally they've settled on f-strings. So, you do need to know that all three exist (e.g., for maintaining code), but you can easily choose the style that you prefer and just use that.
>

Not strictly true. An f-string is a special construct that isn't as
flexible as the other formatting types. You can read a bracey or
percent-marked string from a file, then interpolate it with values at
run time. You can't do that with an f-string, short of messing around
with eval (which is not as pretty as just msg.format(...), plus you
have to trust your external file as if it were code). This has strong
implications for i18n/l10n, and some other situations as well. So the
other string formatting facilities aren't ever going to die, and you
really should learn one of them.
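
A minimal sketch of that run-time case, assuming Python 3. The io.StringIO
object stands in for a real translations file, and the template text and
values are invented for the example.

```python
import io

# Stand-in for a file of message templates read at run time.
template_file = io.StringIO("Hello, {name}! You have {count} new messages.\n")
template = template_file.readline().rstrip("\n")

# Brace formatting: the template is plain data, filled in later.
message = template.format(name="Mark", count=3)

# Percent formatting can do the same with a mapping.
percent_template = "Hello, %(name)s! You have %(count)d new messages."
percent_message = percent_template % {"name": "Mark", "count": 3}
```

An f-string can't play this role, because it is evaluated where it is
written in the source rather than being a value you can load and fill in.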

I do agree that you're welcome to teach just one of them, though, and
worry about the other if and when it ever comes up. Personally, I
quite like percent-formatting, because it's the same as can be used in
a lot of other languages (including shell scripting, via GNU printf),
but brace-formatting lets you reorder the parameters, so it has
flexibility that can be important for i18n. So my conclusion would
probably be: Use f-strings for the simple cases where you'd be using a
literal, and then have a glance at each of the other two, so you know
they're there when the time comes.
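
To make the reordering point concrete, here is a small sketch (Python
3.6 or later for the f-string). The templates dictionary is a made-up
stand-in for a translation catalogue, and "xx" is not a real locale.

```python
name, item = "Anna", "book"

# Simple case with a literal: an f-string reads nicely.
simple = f"{name} bought a {item}."

# A translation may need the arguments in a different order.
templates = {
    "en": "{0} bought a {1}.",
    "xx": "A {1} was bought by {0}.",
}
rendered = {lang: t.format(name, item) for lang, t in templates.items()}
```

The calling code passes the arguments in one fixed order, and each
template decides where they land.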

He concludes that Py3 is still unusable because he keeps trying to
port code and failing. That says, to me, that he needs to take a step
back and learn about the fundamental difference between text and
bytes, and that might mean learning a bit of a language like Russian
or Japanese, where text obviously can't be squeezed into ASCII or into
a typical US-English eight-bit character set.

For my part, though, I can attest that *not one* of my students has
had a problem with Py3,
ever since the course switched over. And that includes three (so far)
who, after a month and a half of learning JavaScript, are given five
days to learn Python and do something useful with it. Five days. They
start on Monday, and by Friday close-of-business, they demonstrate
what they've learned (in a group where all the students have learned
something in a week - eg Angular.js, Socket.io, React Native, Ruby,
mobile app design, etc), and it goes into the portfolio. Now, if
Python 3 were impossible to learn, you would expect these people to
struggle. They don't. In fact, as I was mentoring one of them, I kept
telling him "scope it back, scope it back, you have only X days to
finish this" - but he charged ahead and did everything anyway. Python
is pretty easy to learn; all three of my flex week students had moved
beyond messing with the language before the end of Monday, and were
onto actual productive work on the project.

And it's probably even easier for a perfectly new programmer to
understand. You ignore bytes altogether until you start working with
networks (even with disks, you can read and write in text mode) or
actual binary data (graphic file formats or something). Text behaves
the way you'd expect text to. You can use English variable names - but
you can also use French, or Swedish, or Russian, or Japanese, because
Python doesn't restrict you to ASCII. It does exactly what you'd
expect. You just have to expect based on a human's outlook, rather
than a C programmer's.
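
A short sketch of what that looks like in Python 3, using a temporary
file so it can be run anywhere. The identifiers and the file name are
invented for the example.

```python
import os
import tempfile

приветствие = "Привет, мир"     # a Russian identifier and Russian text
挨拶 = "こんにちは"               # a Japanese identifier and Japanese text

path = os.path.join(tempfile.mkdtemp(), "greetings.txt")

# Text mode: you write text and read text; the encoding is stated once.
with open(path, "w", encoding="utf-8") as f:
    f.write(приветствие + "\n" + 挨拶 + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

# Bytes only show up if you explicitly ask for binary mode.
with open(path, "rb") as f:
    raw = f.read()
```

Until the last three lines, nothing in the program has to care that
bytes exist at all.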

ChrisA

[1] I am, however, open to the argument that computer programmers are
by definition not sane.
[2] eg for Disney's "Frozen":
https://github.com/Rosuav/LetItTrans/tree/master/entire


