How to waste computer memory?

Michael Torrie torriem at gmail.com
Fri Mar 18 10:59:27 EDT 2016


On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
> promising. Indexing is by bytes (1-based in Julia) but the value at a
> valid index is the whole UTF-8 character at that point, and an invalid
> index raises an exception.

This seems to me to be a leaky abstraction.  Julia's approach is
interesting, but it strikes me as somewhat broken: it pretends to offer
O(1) indexing, but in reality it's still O(n), because you still have to
iterate through the bytes until you find, say, the nth index that doesn't
raise an exception.  Except when dealing with the ASCII subset of UTF-8,
I can't really see any time when grabbing whatever resides at the nth
byte of a UTF-8 string would be useful.
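To illustrate the O(n) cost in Python (this is a sketch of the idea, not
Julia's actual implementation): character boundaries in UTF-8 can only be
found by inspecting each byte, since continuation bytes carry the pattern
0b10xxxxxx.

```python
def nth_char_offset(data: bytes, n: int) -> int:
    """Return the byte offset where the nth (0-based) code point starts.

    Requires a linear scan from the front -- there is no way to jump
    straight to the nth character of variable-width UTF-8 data.
    """
    count = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                return offset
    raise IndexError("string has fewer than n+1 characters")

s = "naïve".encode("utf-8")   # 'ï' occupies two bytes here
print(nth_char_offset(s, 3))  # byte offset of 'v', i.e. 4
```

Indexing by raw byte instead (as Julia allows) would land on offset 3 for
the "4th byte", which is the middle of 'ï' -- exactly the case that
raises an exception there.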

> I work with text all the time, but I don't think I ever _need_ arbitrary
> access to an nth character. What I require is access to the start and
> end of a string, searching, and splitting. These all seem compatible
> with using UTF-8 representations. Same with iterating over the string
> (forward or backward).

Indeed, this is the argument from the web site
http://utf8everywhere.org.  Their argument is that individual Unicode
code points often don't make sense by themselves, so there's no point
in chopping up a Unicode string.  Many Unicode strings only make sense
if you start at the beginning and read and interpret the code points as
you go.  Hence UTF-8's requirement that you always start at the
beginning when you want to find the nth code point is not a burden.
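A quick Python illustration of why a lone code point can be meaningless:
a single user-perceived character may be several code points, so grabbing
the "nth code point" can split a grapheme in half.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT displays as "café"
s = "cafe\u0301"

print(len(s))   # 5 code points, though only 4 visible characters
print(s[4])     # the combining accent alone -- meaningless by itself

# The composed (NFC) form really is 4 code points
print(len(unicodedata.normalize("NFC", s)))  # 4
```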

I guess whether or not you need to find the nth character depends on the
strength of the available string functions.  If I searched a string for
a particular delimiter, I could see it being useful to get whatever is
just past the delimiter, for example.  Though Python's split() method
eliminates the need to do that by hand.
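For example (the delimiter and data here are invented), split() hands
back the pieces directly, so there's no index arithmetic past the
delimiter at all:

```python
record = "name:Jussi:Finland"

# All fields at once
fields = record.split(":")
print(fields[1])          # "Jussi"

# Or everything after just the first delimiter, via maxsplit
_, rest = record.split(":", 1)
print(rest)               # "Jussi:Finland"
```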

