How to waste computer memory?

Chris Angelico rosuav at gmail.com
Fri Mar 18 16:15:59 EDT 2016


On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Michael Torrie <torriem at gmail.com>:
>
>> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>>> valid index is the whole UTF-8 character at that point, and an
>>> invalid index raises an exception.
>>
>> This seems to me to be a leaky abstraction.
>
> It may be that Python's Unicode abstraction is an untenable illusion
> because the underlying reality is 8-bit and there's no way to hide it
> completely.
>

The underlying reality is 1-bit. Or maybe the underlying reality is
actually electrical signals that don't even have a clear definition of
"bits" and bounce between two states for a few fractions of a second
before settling. And maybe someone's implementing Python on the George
Banks Kite CPU, which consists of two cents' worth of paper and
string, on which text is actually represented as glyphs. They're all
equally valid notions of "underlying reality".

Text is an abstract concept, just as numbers are. You fundamentally
cannot represent the notion of "three" in a computer; what you'll
generally do is encode it in some way. C does this by encoding it as
a machine word, then storing that word in memory, either least
significant byte lowest in memory or the other way around.
Congratulations, C! You've already made two conflicting encodings for
integers, and you still have to predeclare a maximum representable
value. If you go for arbitrary-precision integers, there are a whole
lot more ways to encode them. GMP has a bunch of tweakables like
"number of nail bits", or you can go for a simple variable-length
integer that has seven bits of payload per byte and sets the high bit
if there are more bytes to read (and again, you have to figure out
whether that's little-endian or big-endian), or you can go for a more
complex scheme.
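
To make that concrete, here's a rough Python sketch (my own
illustration, not any particular library's wire format): the same
number in the two conflicting fixed-width byte orders, then the
seven-bits-per-byte variable-length scheme, little-endian flavour.
The helper names are just mine.

import struct

# "Three hundred" as a fixed-size machine word, in both byte orders:
struct.pack('<i', 300)    # bytes 2c 01 00 00 (least significant first)
struct.pack('>i', 300)    # bytes 00 00 01 2c (most significant first)

def varint_encode(n):
    # Seven payload bits per byte, high bit set while more bytes
    # follow; least significant group first. Non-negative integers
    # only, to keep the sketch short.
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data):
    # Reassemble the seven-bit groups until a byte with the high
    # bit clear marks the end.
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7f) << (7 * shift)
        if not byte & 0x80:
            return n
    raise ValueError("truncated varint")

assert varint_encode(300) == b'\xac\x02'
assert varint_decode(b'\xac\x02') == 300

Same abstract "three hundred", three completely different byte
sequences, and none of them is any more "real" than the others.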

Python's Unicode abstraction *never* leaks information about how it's
stored in memory [1] [2]; a Unicode string in Python consists of a
series of codepoints in a well-defined order. This is exactly what you
would expect of a system in which codepoints are fundamental objects
that can truly be represented directly; if you can prove, from within
Python, that the interpreter uses bytes to represent text, I'd be
extremely surprised.
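
For instance (my own trivial example; nothing special about these
particular characters):

s = 'a\u00e9\u20ac\U0001f600'    # ASCII, Latin-1, BMP, astral
len(s)                           # 4: one per codepoint, however it's stored
s[3]                             # the emoji, as a single character
[hex(ord(c)) for c in s]         # ['0x61', '0xe9', '0x20ac', '0x1f600']

Whatever CPython does internally (one, two or four bytes per
codepoint since PEP 393), none of that is visible here.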

ChrisA

[1] Not since 3.3, at least. 2.7 narrow builds (e.g. on Windows) can
leak the UTF-16 level, but not further than that.
[2] Well, you might be able to figure stuff out based on timings. Only
in cryptography have I ever heard performance treated as a leak.

