PEP 393 vs UTF-8 Everywhere

Sat Jan 21 13:54:02 EST 2017

Chris Angelico writes:

> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8'):]
>>>
>>> which is awful. I believe that's what Go and Julia expect you to do.
>>
>> Julia provides a method to get the next index.
>>
>> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>>     while offset <= endof(text)
>>         print(text[offset], ".")
>>         offset = nextind(text, offset)
>>     end
>>     println()
>> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
>
> This implies that regular iteration isn't good enough, though.

It doesn't. Here's the straightforward iteration over the whole string:

let text = "ἐπὶ οἴνοπα πόντον"
    for c in text
        print(c, ".")
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

One can also join any iterable whose elements can be converted to
strings, and characters can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(text, "."), ".")
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

And strings, trivially, can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(split(text), "."), ".")
end # prints: ἐπὶ.οἴνοπα.πόντον.

> Here's a function that creates a numbered list:
>
> def print_list(items):
>     width = len(str(len(items)))
>     for idx, item in enumerate(items, 1):
>         print("%*d: %s" % (width, idx, item))
>
> In Python, this will happily accept anything that is iterable and has
> a known length. Could be a list or tuple, obviously, but can also just
> as easily be a dict view (keys or items), a range object, or.... a
> string. It's perfectly acceptable to enumerate the characters of a
> string. And enumerate() itself is implemented entirely generically.

I'll skip the formatting - I don't know off-hand how to do it - but keep
the width calculation, and I cut the character iterator short at 10
items to save some space. There, it's much the same in Julia:

let text = "ἐπὶ οἴνοπα πόντον"

    function print_list(items)
        width = endof(string(length(items)))
        println("width = ", width)
        for (idx, item) in enumerate(items)
            println(idx, '\t', item)
        end
    end

    print_list(take(text, 10))
    print_list([text, text, text])
    print_list(split(text))
end

That prints this:

width = 2
1	ἐ
2	π
3	ὶ
4	 
5	ο
6	ἴ
7	ν
8	ο
9	π
10	α
width = 1
1	ἐπὶ οἴνοπα πόντον
2	ἐπὶ οἴνοπα πόντον
3	ἐπὶ οἴνοπα πόντον
width = 1
1	ἐπὶ
2	οἴνοπα
3	πόντον

> If you have to call nextind() to get the next character, you've made
> it impossible to do any kind of generic operation on the text. You
> can't do a windowed view by slicing while iterating, you can't have a
> "lag" or "lead" value, you can't do any of those kinds of simple and
> obvious index-based operations.

Yet Julia does with ease many things that you seem to think it cannot
possibly do at all. The iteration system works on types that have
methods for certain generic functions. For strings, the default is to
iterate over something like its characters; I think another iterator
over valid indexes is available, or wouldn't be hard to write; it could
be forward or backward, and in Julia many of these things are often
peekable by default (because the iteration protocol itself does not have
state - see below at "more magic").

The usual things work fine:

let text = "ἐπὶ οἴνοπα πόντον"
    foreach(print, enumerate(zip(text, split(text))))
end # prints: (1,('ἐ',"ἐπὶ"))(2,('π',"οἴνοπα"))(3,('ὶ',"πόντον"))

How is that bad?

More magic:

let text = "ἐπὶ οἴνοπα πόντον"
    let ever = cycle(split(text))
        println(first(ever))
        println(first(ever))
        for n in 2:6
            println(join(take(ever, n), " "))
end end end

This prints the following. The cycle iterator, ever, produces an
endless repetition of the three words, but it doesn't have state like
Python iterators do, so it's possible to look at the first word twice
(and then five more times).

ἐπὶ
ἐπὶ
ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον
ἐπὶ οἴνοπα πόντον ἐπὶ
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα πόντον

> Oh, and Python 3.3 wasn't the first programming language to use this
> flexible string representation. Pike introduced an extremely similar
> string representation back in 1998:
>
> https://github.com/pikelang/Pike/commit/db4a4

Ok. Is GitHub that old?

> So yes, UTF-8 has its advantages. But it also has its costs, and for a
> text processing language like Pike or Python, they significantly
> outweigh the benefits.

I process text in my work but I really don't use character indexes much
at all. Rather split, join, startswith, endswith, that kind of thing,
and whether a string contains some character or substring anywhere.