"More About Unicode in Python 2 and 3"

Sun Jan 5 08:55:03 EST 2014

On Mon, Jan 6, 2014 at 12:22 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
> If anyone wants Python 3 uptake improved, the best thing would be to either
> explain to Armin how he missed the easy way to do what he wants (seems
> unlikely), or advocate to the core devs why they should change things to
> improve this situation.

I'm not sure that there is an "easy way". See, here's the deal. If all
your data is ASCII, you can shut your eyes to the difference between
bytes and text and Python 2 will work perfectly for you. Then some day
a non-ASCII character will come up (or maybe you get all of Latin-1
"for free" and it's a non-Latin-1 character that bites you; same
difference), and you start throwing in encode() and decode() calls in
places. But you feel like you're fixing little problems with little
solutions, so it's no big deal.
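
To make that first failure concrete, here's a rough sketch in Python 3
terms. Py2's implicit coercion can't be reproduced in Py3, so the
explicit decode("ascii") below stands in for what Py2 does behind your
back when bytes meet unicode. The sample strings are made up.

```python
# Python 3 sketch: an explicit ASCII decode stands in for Py2's implicit one.
ascii_data = b"/index.html"
latin1_data = "caf\u00e9".encode("latin-1")   # b'caf\xe9'

# All-ASCII bytes decode without complaint, so the problem stays hidden.
text = ascii_data.decode("ascii")

# The first non-ASCII byte is where it falls over.
try:
    latin1_data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Latin-1 only covers U+0000..U+00FF; the euro sign is outside it.
try:
    "\u20ac".encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```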

Making the switch to Python 3 forces you to distinguish bytes from
text, even when that text is all ASCII. Suddenly that's a huge job, a
huge change through all your code, and it's all because of this switch
to Python 3. The fact that you then get the entire Unicode range "for
free" doesn't comfort people who are dealing with URLs and are
confident they'll never see anything else (if they *do* see anything
else, it's a bug at the far end). Maybe it's the better way, but like
trying to get people to switch from MS Word onto an open system, it's
far easier to push for OpenOffice than for LaTeX. Getting your head
around a whole new way of thinking about your data is work, and people
want to be lazy. (That's not a bad thing, by the way. Laziness means
schedules get met.)
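
As a small illustration of what "forces you to distinguish" means in
practice, here's a sketch with made-up values. Mixing the two types is
an immediate TypeError in Py3, even though every character involved is
ASCII.

```python
prefix = b"http://my.server.name/"   # bytes
path = "docs"                        # text (str), all ASCII

try:
    url = prefix + path              # Py3 refuses to mix bytes and str
    mixed_ok = True
except TypeError:
    mixed_ok = False
    # You have to say which one you mean, e.g. by encoding the text.
    url = prefix + path.encode("ascii")
```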

So what can be done about it? Would it be useful to have a type that
represents an ASCII string? (Either 'bytes' or something else, it
doesn't matter what.) I'm inclined to say no, because as of the
current versions, encoding/decoding UTF-8 has (if I understand
correctly) been heavily optimized in the specific case of an
all-ASCII string; so the complaint that there's no "string formatting
for bytes" could be resolved by simply decoding to str, formatting,
then encoding back to bytes. I'd look on that as having two costs, a
run-time performance cost and a code readability cost, and then look
at reducing each of them, but without blurring the bytes/text
distinction. Yes, that
distinction is a cost. It's like any other mental cost, and it just
has to be paid. The way I'd explain it is that Py2 has the "cost
gap" between ASCII (or Latin-1) and the rest of Unicode, but Py3 puts
that cost gap before ASCII, and then gives you all of Unicode for the
same low price (just $19.99 a month, you won't even notice the
payments!).
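
Spelled out, the decode/format/encode round trip looks something like
this. It's only a sketch: the pattern and the two values are
placeholders, and I'm assuming ASCII is the right codec for URL data.

```python
pattern = b"http://my.server.name/%s/%s"   # bytes pattern from somewhere
path, fn = "docs", "index.html"            # text values to substitute

# Decode to str, do the formatting as text, encode back to bytes.
url = (pattern.decode("ascii") % (path, fn)).encode("ascii")
```

That one line carries both costs: two extra passes over the data, and
two extra method calls for the reader to wade through.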

Question for people who have large Py2 codebases that manipulate
mostly-ASCII text: how bad would it be for your code to do this:

# Py2: build a URL
url = "http://my.server.name/%s/%s" % (path, fn)

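# Py3: build a URL as bytes
def B(s):
    # Flip type: str -> bytes, bytes -> str (UTF-8 by default)
    if isinstance(s, str):
        return s.encode()
    return s.decode()

# Decode the pattern, format as text, re-encode (assumes path and fn are str)
url = B(B(b"http://my.server.name/%s/%s") % (path, fn))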

? This little utility function lets you do the formatting as text
(let's assume the URL pattern comes from somewhere else, or you'd just
strip off the b'' prefix), while still mostly working with bytes. Is
it an unacceptable level of code clutter?
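
For anyone who wants to poke at it, here's a sketch of how that behaves
with some made-up arguments. One thing to watch: the formatting step
happens on str, so any argument that's still bytes gets formatted as
its repr. The last few lines show one possible way around that, not
necessarily the best one.

```python
def B(s):
    # Flip type: str -> bytes, bytes -> str (UTF-8 by default)
    if isinstance(s, str):
        return s.encode()
    return s.decode()

pattern = b"http://my.server.name/%s/%s"

# With str arguments the round trip gives the bytes URL you'd expect.
url = B(B(pattern) % ("docs", "index.html"))

# With a bytes argument, %s formats its repr, b'' prefix and all.
oops = B(B(pattern) % (b"docs", "index.html"))

# One way around it: flip any bytes arguments to str before formatting.
raw_args = (b"docs", "index.html")
args = tuple(B(a) if isinstance(a, bytes) else a for a in raw_args)
fixed = B(B(pattern) % args)
```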

ChrisA