How to waste computer memory?

Fri Mar 18 12:58:08 EDT 2016

On Sat, 19 Mar 2016 02:26 am, Marko Rauhamaa wrote:

> Michael Torrie <torriem at gmail.com>:
> 
>> On 03/18/2016 02:26 AM, Jussi Piitulainen wrote:
>>> I think Julia's way of dealing with its strings-as-UTF-8 [2] is more
>>> promising. Indexing is by bytes (1-based in Julia) but the value at a
>>> valid index is the whole UTF-8 character at that point, and an
>>> invalid index raises an exception.
>>
>> This seems to me to be a leaky abstraction.
> 
> It may be that Python's Unicode abstraction is an untenable illusion
> because the underlying reality is 8-bit and there's no way to hide it
> completely.
>
> There's no problem providing pure Unicode strings. Things get iffy when
> Python's OS abstraction pretends sys.stdin is text or filenames are
> strings.

The abstraction breaks only for historical reasons.

On Linux and Unix systems, the underlying file system allows file names to
be arbitrary byte strings (with a small number of restrictions, such as
disallowing the ASCII NUL and / (slash) bytes). But modern applications try
to pretend that file names are actually UTF-8. That would work fine if
people *only* accessed the file system with tools that use UTF-8. But
they don't.
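
To make that concrete, here is a minimal sketch of how a file name that is
not valid UTF-8 can still be carried around in a Python 3 text string, using
the "surrogateescape" error handler from PEP 383. The file name below is made
up for illustration, and on POSIX systems os.fsdecode() and os.fsencode()
apply the same handler with the file system encoding.

```python
# A made-up POSIX file name encoded as Latin-1 rather than UTF-8.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding fails: the byte 0xE9 does not start a valid
# UTF-8 sequence here.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xE9 -> U+DCE9), so the original bytes can be recovered exactly.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)               # False
print(ascii(as_text))          # 'caf\udce9.txt'
print(round_trip == raw_name)  # True
```

The resulting string is not well-formed Unicode (it contains a lone
surrogate), which is exactly where the "file names are strings" abstraction
shows its seams.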

On Windows, the file system is either UTF-16 or UCS-2; I'm not sure which.
But the NTFS file system itself enforces that all file names are valid
UTF-16 (or UCS-2, as the case may be). Since all valid UTF-16 strings are
valid Unicode (by definition), there's no problem there.

However, the problem on Windows is not the underlying file system, but the
Explorer interface. It still uses legacy encodings, which differ from
country to country, so it is inevitable that people end up with mojibake
file names.
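
As a rough illustration of how mojibake arises in general (not a claim about
what Explorer does internally), this sketch encodes a made-up file name with
one encoding and decodes it with another. The choice of cp1252 as the legacy
code page is an assumption for the example.

```python
# Hypothetical file name: 'café.txt', written with an escape so the
# example does not depend on the source file's own encoding.
name = "caf\u00e9.txt"

# One program writes the name as UTF-8 bytes...
utf8_bytes = name.encode("utf-8")

# ...and another reads those bytes as Windows-1252: the two bytes of
# the encoded 'é' (0xC3 0xA9) turn into two separate characters.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible only if you know both encodings involved.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled))    # 'caf\xc3\xa9.txt'
print(repaired == name)  # True
```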

>> Julia's approach is interesting, but it strikes me as somewhat broken
>> as it pretends to do O(1) indexing, but in reality it's still O(n)
> 
> If the underlying encoding is 8-bit, converting it to an O(1) structure
> would still be O(n).

Yes, but you only need to do that once, on input, or at most twice, on input
and output, not on every operation.
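
Here is a minimal sketch of the difference. The char_at() helper is my own
hypothetical imitation of the byte-indexing scheme described above (0-based
rather than Julia's 1-based, and assuming the buffer is valid UTF-8); it is
not Julia's actual implementation. The second half decodes once and then
indexes by code point, which is O(1) in CPython 3.3 and later.

```python
def char_at(buf, i):
    """Return the whole character starting at byte offset i (0-based).

    Raises IndexError if i points into the middle of a character.
    Assumes buf holds valid UTF-8.
    """
    first = buf[i]
    if first < 0x80:
        width = 1          # ASCII
    elif first >= 0xF0:
        width = 4          # lead byte of a 4-byte sequence
    elif first >= 0xE0:
        width = 3          # lead byte of a 3-byte sequence
    elif first >= 0xC0:
        width = 2          # lead byte of a 2-byte sequence
    else:
        # 0x80..0xBF is a continuation byte, not the start of a character.
        raise IndexError("byte offset %d is inside a character" % i)
    return buf[i:i + width].decode("utf-8")


# 'naïve €' as UTF-8: 7 characters, 10 bytes.
data = "na\u00efve \u20ac".encode("utf-8")

second_i = char_at(data, 2)       # byte 2 starts the 2-byte 'ï'
try:
    char_at(data, 3)              # byte 3 is the middle of 'ï'
    mid_char_ok = True
except IndexError:
    mid_char_ok = False

# Pay the O(n) cost once, on input; after that, indexing is by
# character and no index can land inside a character.
text = data.decode("utf-8")

print(len(data), len(text))       # 10 7
print(mid_char_ok)                # False
print(text[2] == second_i)        # True
```

Note that with byte indexing, finding the Nth character still means scanning
from the start, which is the O(n) cost mentioned in the quoted text.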

-- 
Steven