How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sat Mar 19 04:14:40 EDT 2016


On Sat, 19 Mar 2016 08:08 am, Chris Angelico wrote:

> On Sat, Mar 19, 2016 at 8:02 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> Chris Angelico <rosuav at gmail.com>:
>>> On Sat, Mar 19, 2016 at 2:26 AM, Marko Rauhamaa <marko at pacujo.net>
>>> wrote:
>>>> It may be that Python's Unicode abstraction is an untenable illusion
>>>> because the underlying reality is 8-bit and there's no way to hide it
>>>> completely.
>>>
>>> The underlying reality is 1-bit. Or maybe the underlying reality is
>>> actually electrical signals that don't even have a clear definition of
>>> "bits" and bounce between two states for a few fractions of a second
>>> before settling. And maybe someone's implementing Python on the George
>>> Banks Kite CPU, which consists of two cents' worth of paper and
>>> string, on which text is actually represented by glyph. They're all
>>> equally valid notions of "underlying reality".
>>>
>>> Text is an abstract concept, just as numbers are.
>>
>> The question is how tenable the illusion is. If the OS gave the
>> appropriate guarantees (say, all pathnames are encoded Unicode strings),
>> the abstraction could be maintained. Unfortunately, the legacy shines
>> through making you wonder if Python has overreached prematurely with its
>> Unicode HAL.
> 
> The problem is not Python's Unicode strings, then. The problem is the
> notion that path names are text. If they're text, they should be
> exclusively text (although, for low-level efficiency, they're more
> likely to be defined as "valid UTF-8 sequences" rather than "sequences
> of Unicode codepoints"); since they're not, they are fundamentally
> bytes. But that's not a problem with Python - it's a problem with the
> file system.


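One thing that NTFS gets mostly right is that path names are stored as
UTF-16 code units rather than arbitrary bytes, unlike the ext file systems
used on Linux. As far as I know NTFS does not actually reject unpaired
surrogates, so names are not strictly guaranteed to be well-formed Unicode,
but in practice they almost always are.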
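
To make the "arbitrary bytes" point concrete, here is a minimal sketch of
how Python carries a file name that is not valid UTF-8 through a str, using
the surrogateescape error handler. The byte string is made up for the
example and nothing touches the disk.

```python
# A hypothetical file name as raw bytes; 0xFF never appears in valid UTF-8.
raw_name = b"caf\xff.txt"

# Strict decoding fails, so this name is not text in any useful sense.
try:
    raw_name.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate (0xFF becomes U+DCFF), so the original bytes can be
# recovered exactly when encoding with the same handler.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strictly_valid)           # False
print(ascii(as_text))           # 'caf\udcff.txt'
print(round_trip == raw_name)   # True
```

The resulting str round-trips, but it contains a lone surrogate and so is
not well-formed Unicode either; the bytes underneath still shine through.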

I believe that HFS+ on Apple Macs goes one step further and guarantees that
paths are always fully normalised, so that it's impossible to have (e.g.)
two files ã (U+00E3 LATIN SMALL LETTER A WITH TILDE) and ã (U+0061 LATIN
SMALL LETTER A + U+0303 COMBINING TILDE) in the same directory.
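
A small sketch of why those two names count as different files unless
something normalises them. It uses only the standard unicodedata module;
the variable names are mine. As far as I know HFS+ stores a decomposed
form, but the comparison below works the same way with either NFC or NFD.

```python
import unicodedata

composed = "\u00e3"      # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"   # LATIN SMALL LETTER A + COMBINING TILDE

# The two strings render identically but are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# After normalising both to the same form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
print(nfc_equal, nfd_equal)              # True True
```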

Unfortunately, backwards compatibility is holding Linux file systems back...



-- 
Steven



