Cult-like behaviour [was Re: Kindness]

Marko Rauhamaa marko at pacujo.net
Sun Jul 15 09:57:11 EDT 2018


Chris Angelico <rosuav at gmail.com>:

> On Sun, Jul 15, 2018 at 9:17 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Remind me how it's such a mistake to treat that string as text all the
> way through?

Many times you need to make tricky ontological conversion decisions when
all you should need to do is relay the information.
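
A minimal sketch of what I mean by that, with a made-up filename (the
bytes below are an assumption for illustration, not taken from any real
system). Bytes that are not valid UTF-8 cannot become a Python3 str
until you pick a policy for them, whereas relaying the octets needs no
policy at all:

```python
# Hypothetical filename handed over by the OS: Latin-1 bytes, not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding forces a decision on you: it simply fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# One possible policy: carry the bad byte through as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# Relaying the octets unchanged involves no conversion decision.
relayed = raw

print(strict_ok, round_trip == raw, relayed == raw)
```

The surrogateescape handler is only one of several policies; the point
is that some policy has to be chosen before the data can be passed on
as text.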

> Python's Unicode type is an accurate representation of a Unicode text
> string, just as Python's float type is an accurate representation of
> IEEE 754 floating-point. Just as floats are not reals, so too is
> Unicode not perfectly able to represent all human text, and has to
> mess around with things like combining characters. It's not 100%
> perfect

The floating-point argument is a diversion.

The competing solutions are

 1. byte strings carrying UTF-8

 2. code point strings carrying UTF-32

The latter solution was supposed to relieve the programmer from the
downsides of the former. It turns out it does no such thing.

>> Here's the deal: text strings are irrelevant for most modern
>> programming needs. Most software is middleware between the human and
>> the terminal device. Carrying opaque octet strings from end to end is
>> often the most correct and least problematic thing to do.
>
> Uhh, so the human uses byte/octet strings? You can argue that the
> terminal device is fundamentally byte-oriented, but if you do, I'm
> going to dispute the use of the definite article, and say that *many*
> terminal devices are byte-oriented as of today. There's no fundamental
> reason for that to remain the case, and even today, we have
> fundamentally text-oriented terminal devices. I know this because I
> maintain one (okay, it's called a "MUD client" rather than a "terminal
> device", but it's basically the same thing).

The human user uses keyboards with character shapes painted on the keys,
icons to tap or click on, display devices with recognizable pixel
patterns and other audio-visual mechanisms. Then there's a lot of
middleware that routes information and carries it over distances in
chunks of octets.

>> On the other hand, Python3's code point strings mess things up for no
>> added value. You still can't upcase or downcase strings.
>
> Not entirely sure what the .upper() and .lower() methods do, then.
> Case conversion of arbitrary text strings is hard, but Python
> definitely gives you as good as you'll ever get without actually
> stipulating, not just the language, but the context.

So we agree.
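
For what it's worth, a small sketch of the limits we seem to agree on,
using only str.upper() and str.lower() (the sample words are my own):

```python
# Upper-casing can change the number of code points...
word = "straße"
upper = word.upper()        # the sharp s expands to two letters

# ...and lower-casing the result does not give the original back.
back = upper.lower()

# There is no language parameter: "i" always maps to a dotless "I",
# although Turkish text would call for a dotted capital instead.
turkish_problem = "i".upper()

print(upper, back, len(word), len(upper), turkish_problem)
```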

>> You still can't sort strings.
>
> Strings are intrinsically totally ordered in a mostly-sane way. If you
> want anything more than that, you have to stipulate the language.
> Python offers this in the 'locale' module, with strcoll and strxfrm.

So we agree.

>> You still can't perform random access on strings.
>
> Say what?

Indexing a str gives you the nth code point, not the nth glyph. You
can't look up the nth glyph in O(1).
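
A rough sketch of the difference. The clusters() helper is hypothetical
and only groups a base code point with the combining marks that follow
it, which is a crude stand-in for real grapheme segmentation:

```python
import unicodedata

def clusters(s):
    """Group each base code point with the combining marks after it.

    An approximation only: proper grapheme segmentation has more rules.
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the previous base
        else:
            out.append(ch)     # start a new cluster
    return out

s = "ne\u0301e"                # "nee" with an acute accent on the first e

code_point_2 = s[2]            # direct index: the bare combining accent
glyph_2 = clusters(s)[2]       # linear scan first: the final "e"

print(len(s), len(clusters(s)), repr(code_point_2), repr(glyph_2))
```

Getting at the third glyph takes a scan over the whole prefix, exactly
as it would with UTF-8 in a byte string.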

>> You still don't know how long your string is.
>
> How long is a piece of string?
>
> 1) Do you count code points? len(x)
> 2) Do you count code units? len(x.encode("..."))
> 3) Do you count base characters, ignoring combining characters?
> 4) Do you count pixels of display width?
> 5) Do you count advancement (like pixels, but negative for RTL text)?
>
> Two of them are easy. Two require font metrics (so they're the job of
> a display engine). Only #3 is moderately hard, and you could do that
> with a one-liner by checking the Unicode categories. But it isn't very
> useful except to "prove" that Python sucks.

I didn't say Python sucked. I said Python3's str objects are inferior to
Python2's str objects.

As you say yourself, Python3's str objects don't actually solve any of
the real problems that Python2's str objects (with UTF-8 inside) have.
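
To put numbers on the first three counts in the list above, here is a
sketch with a sample string of my own choosing. Counts 4 and 5 need
font metrics, so they are left out:

```python
import unicodedata

s = "ne\u0301e"   # n, e, COMBINING ACUTE ACCENT, e

code_points = len(s)                              # count 1
utf8_units = len(s.encode("utf-8"))               # count 2, UTF-8 octets
utf16_units = len(s.encode("utf-16-le")) // 2     # count 2, UTF-16 units
base_chars = sum(1 for ch in s
                 if not unicodedata.combining(ch))  # count 3

# Normalizing to NFC happens to match count 3 for this string, but only
# because a precomposed accented e exists in Unicode.
nfc_points = len(unicodedata.normalize("NFC", s))

print(code_points, utf8_units, utf16_units, base_chars, nfc_points)
```

None of these numbers is "the" length; which one you want depends on
what you are about to do with the string.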

> I'm fairly sure you have no clue about Unicode or Python, but I'll
> give you the benefit of the doubt and assume you're merely trolling.

Are you even aware of the ad hominems?


Marko


