How to waste computer memory?

Ian Kelly ian.g.kelly at gmail.com
Fri Mar 18 11:17:32 EDT 2016


On Fri, Mar 18, 2016 at 8:56 AM, Random832 <random832 at fastmail.com> wrote:
> On Fri, Mar 18, 2016, at 03:00, Ian Kelly wrote:
>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes an O(n) operation.
>
> Just to play devil's advocate, here, why is it so bad for indexing to be
> O(n)? Some simple caching is all that's needed to prevent it from making
> iteration O(n^2), if that's what you're worried about.

What kind of caching do you have in mind? If you're just going to
store a byte-offset index for the string, that's at least an extra
byte per character, which mostly kills the memory savings that are
usually the goal of using UTF-8 in the first place.
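
To make that cost concrete, here's a rough sketch of what such a
per-character byte-offset table might look like. The names are
invented for illustration; this is not how CPython actually stores
str objects:

    # Hypothetical sketch: a byte-offset table giving O(1) character
    # indexing into a UTF-8 encoded string.
    class IndexedUTF8:
        def __init__(self, text):
            self.data = text.encode('utf-8')
            # One byte offset stored per character.  Even packed into
            # 4-byte ints this costs 4 bytes per character -- more than
            # the 1-4 bytes the UTF-8 data itself takes.
            self.offsets = []
            pos = 0
            for ch in text:
                self.offsets.append(pos)
                pos += len(ch.encode('utf-8'))

        def __getitem__(self, i):
            start = self.offsets[i]
            end = (self.offsets[i + 1] if i + 1 < len(self.offsets)
                   else len(self.data))
            return self.data[start:end].decode('utf-8')

    s = IndexedUTF8('héllo')   # 'é' takes two bytes in UTF-8
    print(s[1])                # é, found without scanning from the start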

It's not the only drawback, either. If you want to know anything about
the characters in the string you're looking at, you need their
codepoints. If the string is simple UCS-2, that's easy: just take the
two bytes and cast them to a 16-bit integer (assuming the endianness
of the string matches the machine). If the string is UTF-8, it has to
be decoded: first figure out from the lead byte exactly how many bytes
make up this particular character, then determine which bits you need
from each of those bytes and mash them together to form the actual
integer codepoint. Now think about doing that over and over again in
the context of a lexicographical sort.
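
For comparison, here's roughly what that bit-mashing looks like if you
spell it out by hand. This is illustrative only, not CPython's actual
decoder, and it assumes well-formed input with no error handling:

    # Pulling one codepoint out of UTF-8 by hand.  (The UCS-2 case is
    # just a single 16-bit read of the two bytes.)
    def decode_one(buf, pos):
        b0 = buf[pos]
        if b0 < 0x80:                     # 0xxxxxxx: 1 byte
            return b0, 1
        elif b0 < 0xE0:                   # 110xxxxx: 2 bytes
            return ((b0 & 0x1F) << 6) | (buf[pos+1] & 0x3F), 2
        elif b0 < 0xF0:                   # 1110xxxx: 3 bytes
            return (((b0 & 0x0F) << 12) | ((buf[pos+1] & 0x3F) << 6)
                    | (buf[pos+2] & 0x3F)), 3
        else:                             # 11110xxx: 4 bytes
            return (((b0 & 0x07) << 18) | ((buf[pos+1] & 0x3F) << 12)
                    | ((buf[pos+2] & 0x3F) << 6)
                    | (buf[pos+3] & 0x3F)), 4

    cp, width = decode_one('é'.encode('utf-8'), 0)
    print(hex(cp), width)   # 0xe9 2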


