How to waste computer memory?

Random832 random832 at fastmail.com
Fri Mar 18 11:31:17 EDT 2016



On Fri, Mar 18, 2016, at 11:17, Ian Kelly wrote:
> > Just to play devil's advocate, here, why is it so bad for indexing to be
> > O(n)? Some simple caching is all that's needed to prevent it from making
> > iteration O(n^2), if that's what you're worried about.
> 
> What kind of caching do you have in mind?

The byte offset of the last character index accessed. When accessing a
new index greater than that one, start scanning from there (_maybe_ also
support scanning backwards in this way, if the requested index is much
closer to the cached one than to zero). And even that's only necessary
if you actually _care_ about forward iteration by character indices
(i.e. "for i in range(len(s))"), rather than writing it off as bad
coding style.

> If you're just going to
> index the string, then that's at least an extra byte per character,
> which mostly kills the memory savings that is usually the goal of
> using UTF-8 in the first place.
> 
> It's not the only drawback, either. If you want to know anything about
> the characters in the string that you're looking at, you need to know
> their codepoints.

Nonsense. That depends on what you want to know about it. You can
extract a single character from a string, as a string, without knowing
anything about it except what range its first byte is in. You can then
use that string directly as a key into a hash table containing
information such as Unicode properties, names, etc.
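For instance (just a sketch; the helper name and the toy property table
are mine):

    def char_at(data: bytes, byte_i: int) -> bytes:
        """UTF-8 bytes of the character whose lead byte is at byte_i."""
        lead = data[byte_i]
        if lead < 0x80:
            length = 1    # 0xxxxxxx: ASCII
        elif lead < 0xE0:
            length = 2    # 110xxxxx
        elif lead < 0xF0:
            length = 3    # 1110xxxx
        else:
            length = 4    # 11110xxx
        return data[byte_i:byte_i + length]

    # toy property table keyed directly by the raw UTF-8 bytes
    properties = {"é".encode("utf-8"): "LATIN SMALL LETTER E WITH ACUTE"}
    s = "café".encode("utf-8")
    print(properties[char_at(s, 3)])    # no codepoint ever computed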

> If the string is simple UCS-2, that's easy. Just
> take the two bytes and cast them as a 16-bit integer (assuming that
> the endianness of the string matches the machine). If the string is
> UTF-8 then it has to be decoded, so you need to figure out exactly how
> many bytes are in this particular character, and then from those
> determine which bits you need and then mash those bits together to
> form the actual integer codepoint.

I think you're overestimating the actual performance impact of this, but
okay...
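For the record, the whole dance is a handful of masks and shifts (a
sketch, again assuming valid UTF-8 with a lead byte at byte_i):

    def codepoint_at(data: bytes, byte_i: int) -> int:
        """Decode the codepoint whose lead byte is at byte_i."""
        lead = data[byte_i]
        if lead < 0x80:
            return lead                    # ASCII, nothing to decode
        elif lead < 0xE0:
            cp, extra = lead & 0x1F, 1     # 110xxxxx
        elif lead < 0xF0:
            cp, extra = lead & 0x0F, 2     # 1110xxxx
        else:
            cp, extra = lead & 0x07, 3     # 11110xxx
        for k in range(1, extra + 1):
            # fold in the low six bits of each 10xxxxxx byte
            cp = (cp << 6) | (data[byte_i + k] & 0x3F)
        return cp

    assert codepoint_at("€".encode("utf-8"), 0) == 0x20AC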

> Now think about doing that over and
> over again in the context of a lexicographical sort.

Er... UTF-8 byte lexicographical order is isomorphic to codepoint
lexicographical order. Which is more than can be said for UTF-16.
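Concretely: U+E000 is below U+10000, but its UTF-16 code unit sorts
above the surrogates that encode U+10000.

    a, b = "\ue000", "\U00010000"                    # U+E000 < U+10000
    assert a.encode("utf-8") < b.encode("utf-8")            # order preserved
    assert a.encode("utf-16-be") > b.encode("utf-16-be")    # order broken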
