PEP 393 vs UTF-8 Everywhere

Sun Jan 22 01:47:35 EST 2017

On Sunday 22 January 2017 06:58, Tim Chase wrote:

> Right.  It gets even weirder (edge-case'ier) when dealing with
> combining characters:
> 
> 
>>>> s = "man\N{COMBINING TILDE}ana"
>>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
> ...
> 0: m
> 1: a
> 2: n
> 3: ˜
> 4: a
> 5: n
> 6: a
>>>> ''.join(reversed(s))
> 'anãnam'
> 
> Offsetting s[3:] produces a (sub)string that begins with a combining
> character that doesn't have anything preceding it to combine with.

That doesn't matter. Unicode is a universal character set, not a universal 
*grapheme* set. But even speaking about characters is misleading: Unicode's 
"characters" (note the scare quotes) are abstract code points which can 
represent at least:

- letters of alphabets
- digits
- punctuation marks
- ideographs
- line drawing symbols
- emoji
- noncharacters

Since it doesn't promise to only provide graphemes (I can write "$\N{COMBINING 
TILDE}" which is not a valid grapheme in any human language) it doesn't matter 
if you end up with lone combining characters. Or rather, it does matter, but 
fixing that is not Unicode's responsibility. That should become a layer built 
on top of Unicode.
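
To make that concrete, here is a minimal sketch of such a layer, using only 
the standard unicodedata module. The helper name clusters() is made up for 
this example, and it only approximates grapheme clusters: it attaches 
combining marks to the preceding base code point and ignores the rest of the 
segmentation rules (emoji sequences, Hangul jamo and so on).

```python
import unicodedata

s = "man\N{COMBINING TILDE}ana"

# Seven code points, even though a reader sees six "letters".
code_point_count = len(s)

# Slicing at index 3 yields a string that starts with a lone combining mark.
tail = s[3:]
tail_starts_with_mark = unicodedata.combining(tail[0]) != 0


def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper, an approximation only: real grapheme segmentation
    has more rules than "base plus combining marks".
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to the previous base
        else:
            groups.append(ch)     # start a new group (also for a lone mark)
    return groups


groups = clusters(s)

# Reversing by group keeps the tilde on the "n" instead of moving it.
reversed_by_cluster = "".join(reversed(groups))
```

The string type itself stays a plain sequence of code points. The grouping is 
done by a separate function on top of it, and only by code that needs it.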

>> It's like seek() and tell() on text files: you cannot seek to
>> arbitrary positions, but only to the opaque positions returned by
>> tell. That's unacceptable for strings.
> 
> I'm still unclear on *why* this would be considered unacceptable for
> strings.

Sometimes you want to slice at a particular index which is *not* an opaque 
position returned by find().

text[offset + 1:]

Or for that matter:

middle_character = text[len(text)//2]

Forbidding those sorts of operations is simply too big a break with previous 
versions.
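
A short illustration of that kind of arithmetic on code point indexes, as it 
works in current Python. The sample text and variable names are invented for 
the example. The last line shows the corresponding UTF-8 byte position, to 
show that the two kinds of offset are not interchangeable.

```python
text = "имя=Иван"

offset = text.find("=")          # code point index of the separator
key = text[:offset]
value = text[offset + 1:]        # plain integer arithmetic on the offset
before_separator = text[offset - 1]

middle_character = text[len(text) // 2]   # an index never returned by find()

# The same separator sits at a different position in the UTF-8 encoding,
# because each Cyrillic letter takes two bytes.
byte_pos = text.encode("utf-8").find(b"=")
```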

> It makes sense when dealing with byte-strings, since they
> contain binary data that may need to get sliced at arbitrary
> offsets.  But for strings, slicing only makes sense (for every
> use-case I've been able to come up with) in the context of known
> offsets like you describe with tell().

I'm sorry, I find it hard to believe that you've never needed to add or 
subtract 1 from a given offset returned by find() or equivalent.

> The cost of not using opaque
> tell()-like offsets is, as you describe, slicing in the middle of
> characters.

>> You could avoid that error by increasing the offset by the right
>> amount:
>> 
>> stuff = text[offset + len("ф".encode('utf-8')):]
>> 
>> which is awful. I believe that's what Go and Julia expect you to do.
> 
> It may be awful, but only because it hasn't been pythonified.

No, it's awful no matter what. It makes it painful to reason about which code 
points will be picked up by a slice. What's the length of...?

text[offset:offset+5]

In current Python, that's got to be five code points (excluding the edge cases 
of slicing past the end of the string). But with opaque indexes, that could be 
anything from 1 to 5 code points.
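
A rough sketch of the difference, under the assumption that the opaque index 
is a UTF-8 byte offset (one possible design, not the only one). The sample 
string is invented for the example. It compares a five-wide slice by code 
point index with five-byte windows taken at each code point boundary.

```python
text = ("a\N{CYRILLIC SMALL LETTER EF}b\N{EURO SIGN}"
        "c\N{GRINNING FACE}d")
data = text.encode("utf-8")      # 1 + 2 + 1 + 3 + 1 + 4 + 1 = 13 bytes

# Code point indexing: a five-wide slice inside the string is five code points.
by_code_point = text[1:6]

# Byte offsets at which a code point starts (not a continuation byte).
starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

# Number of *complete* code points in each five-byte window. A sequence cut
# off at the end of the window is dropped by errors="ignore".
counts = [len(data[i:i + 5].decode("utf-8", errors="ignore")) for i in starts]

# Pure ASCII is the only case where five bytes are always five code points.
ascii_count = len("hello world".encode("utf-8")[0:5].decode("utf-8"))
```

Here the windows hold between one and three complete code points depending on 
where they start, so the width of the slice no longer says how many code 
points come back.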

> If the
> result from calling .find() on a string returns a "StringOffset"
> object, then it would make sense that its __add__/__radd__ methods
> would accept an integer and do such translation for you.

At the cost of predictability.

>> You can avoid this by having the interpreter treat the Python-level
>> indexes as opaque "code point offsets", and converting them to and
>> from "byte offsets" as needed. That's not even very hard. But it
>> either turns every indexing into O(N) (since you have to walk the
>> string to count which byte represents the nth code point)
> 
> The O(N) cost has to be paid at some point, but I'd put forth that
> other operations like .find() already pay that O(N) cost and can
> return an opaque "offset token" that can be subsequently used for O(1)
> indexing (multiple times if needed).

Sure -- but only at the cost of blowing out the complexity and memory 
requirements of the string, which completely negates the point of using UTF-8 
in the first place.
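
A minimal sketch of the two options, written in pure Python for illustration. 
The function names are hypothetical, and the per-code-point offset table is 
just one assumed way of getting O(1) indexing, not necessarily the scheme 
proposed above.

```python
from array import array


def byte_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point in UTF-8 data, found by an O(N) walk."""
    count = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:   # not a continuation byte: a code point starts
            if count == n:
                return i
            count += 1
    if count == n:               # n == number of code points: end of the data
        return len(data)
    raise IndexError("code point index out of range")


def build_index(data: bytes) -> array:
    """Table of start offsets, one entry per code point, for O(1) lookups."""
    return array("I", (i for i, b in enumerate(data) if (b & 0xC0) != 0x80))


unit = ("a\N{CYRILLIC SMALL LETTER EF}b\N{EURO SIGN}"
        "c\N{GRINNING FACE}d")
data = (unit * 1000).encode("utf-8")

index = build_index(data)
text_bytes = len(data)
index_bytes = index.itemsize * len(index)   # extra memory for the table
```

With the usual 4-byte array items, the table for this sample needs about 
twice the memory of the encoded text itself. That is the trade-off: either 
each lookup walks the string, or the compact encoding is carried around with 
a side table larger than the text.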

-- 
Steven
"Ever since I learned about confirmation bias, I've been seeing 
it everywhere." - Jon Ronson