How to waste computer memory?

Fri Mar 18 14:22:59 EDT 2016

Steven D'Aprano writes:

> And I don't understand this meme that indexing strings is not
> important.  Have people never (say) taken a slice of a string, or a
> look-ahead, or something similar?
>
> i = mystring.find(":")
> next_char = mystring[i+1]

The point is that O(1) indexing and slicing *can be done* in UTF-8! Once
you've found the byte index of that ':', the next index can be computed
locally.

Here's a UTF-8 string in Julia, with some 2-byte characters. Indexing is
by byte (1-based).

julia> s = "Heinäpaalin kierittäminen turvaköyttä pitkin on ältsin vaikeeta.";

The index of the 'ö' is found in O(n) time, same as you find the ":" in
Python:

julia> search(s, 'ö')
35

There are methods for finding the next valid index, in O(1) time:

julia> nextind(s, 35)
37

julia> next(s, 35)
('ö',37)

The character at a known index is found in O(1) time:

julia> s[37]
'y'

The index of the previous character is found in O(1) time:

julia> prevind(s, 35)
34

The index of the first space after the 'ö' is again found in O(n) time,
just like it would be in Python:

julia> search(s, ' ', 35)
42

The index of the character preceding that space is found in O(1) time;
this one is a two-byte character, so it's not the byte at 41 but at 40:

julia> prevind(s, 42)
40

And slicing involves no search:

julia> s[34:40]
"köyttä"

> # Strip the first and last chars from a string 
> mystring[1:-1]

julia> s[nextind(s, start(s)):prevind(s, endof(s))]
"einäpaalin kierittäminen turvaköyttä pitkin on ältsin vaikeeta"

That's a bit wordy but there's no O(n) search involved.

(I may not know the simplest way to do this in Julia. Or the above may
be the simplest way out of the box. Not sure. And not too worried.)

> But I think that for a programming language which may be dealing with
> strings of multiple tens of megabytes in size, I'm skeptical that
> UTF-8 doesn't cause a performance hit for at least some operations.

Possibly. I've never had an unstructured string of such size. Who works
with such?

I am processing hundreds of such text files right now. Thousands. Why
would anyone know the millionth character in any of them? It's just not
meaningful. The meaningful questions involve positions that are already
somehow identified.

>>> It's not the only drawback, either. If you want to know anything
>>> about the characters in the string that you're looking at, you need
>>> to know their codepoints.
>> 
>> Nonsense. That depends on what you want to know about it. You can
>> extract a single character from a string, as a string, without
>> knowing anything about it except what range the first byte is in. You
>> can use this string directly as an index to a hash table containing
>> information such as unicode properties, names, etc.
>
> I don't understand your comment. If I give you the index of the
> character, how do you know where its first byte is? With UTF-8,
> character i can be anywhere between byte i and 4*i.

If you give me the byte index, then the character is at that index.