How to waste computer memory?

Marko Rauhamaa marko at pacujo.net
Fri Mar 18 18:03:14 EDT 2016


Chris Angelico <rosuav at gmail.com>:

> On Sat, Mar 19, 2016 at 8:28 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> The file system does not have a problem. Python has a problem because it
>> tries to present pathnames as Unicode strings, which isn't always
>> possible.
>
> But what does a file name *mean*?

A Linux/UNIX file name is an extended ASCII string, where the
interpretation of bytes in the range 128..255 is left ambiguous. That's
the legacy of the early 1980s. At that time 8-bit bytes were standard,
and the parity nonsense was virtually gone.
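
To make the ambiguity concrete, here is a minimal sketch. The file name
bytes are a made-up example, and the two charsets are arbitrary picks.
The surrogateescape error handler is the one Python's os.fsdecode()
uses on POSIX systems; here it is applied by hand with an assumed UTF-8
file system encoding.

```python
# Hypothetical file name bytes, as a Linux system call might return them.
raw = b"caf\xe9.txt"

# The same bytes read differently depending on the charset you assume.
as_latin1 = raw.decode("iso-8859-1")
as_cp437 = raw.decode("cp437")

# Strict UTF-8 decoding rejects the name outright.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The surrogateescape error handler carries the undecodable byte through
# as a lone surrogate, so the original bytes survive a round trip.
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(as_latin1), ascii(as_cp437), valid_utf8, round_trip == raw)
```

The escaped string is not well-formed Unicode text, which is the price
of getting the bytes back intact.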

C, Emacs and the OS supported those bytes without a problem but treated
them as some sort of control characters (they were represented with the
octal \nnn notation).

Some systems used the upper byte range for block graphics (CP/M).

Some systems used the upper byte range to represent Hebrew letters
(Atari).

Then came ISO-8859-x and the locales (yuck!). Sun scrambled to make
SunOS "8-bit clean". ISO-8859-1 was widely taken as the default for the
Civilized World. Pathnames reflected that colonialist mindset.

ISO-8859-1 was the state of the art around 1995 (HTML). UCS-2 was the
avant-garde adopted by Windows and Java. UTF-8 came later, and Linux
luckily avoided the UCS-2 mess.

All that "extended ASCII" legacy is still the reality on Linux and won't
go away in the foreseeable future. I suppose OSX is the only mainstream
operating system that had the full benefit of hindsight. And even they
messed it up with case-insensitive pathnames.

> If I were building an entire OS ecosystem from scratch today, I'd
> probably do a lot of things with a hybrid system of documented meaning
> atop implementation-detail APIs. In this particular case, I would
> define the API in terms of byte sequences, but clearly documenting
> that these byte sequences are to be understood to mean text strings,
> and thus must be valid UTF-8.

UTF-8 shouldn't have anything to do with the abstract pathnames (which
should be normalized Unicode). Also, special-casing '\0' and '/' is
lame. Why can't I have "Results 1/2016" as a filename?
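
A rough sketch of what "normalized Unicode" buys you, assuming NFC as
the normalization form (my pick for illustration; any form would do as
long as it is applied consistently). The file names are invented.

```python
import unicodedata

# Two spellings of the same visible name: precomposed letters versus
# base letters followed by a combining acute accent (U+0301).
composed = "r\u00e9sum\u00e9.txt"
decomposed = "re\u0301sume\u0301.txt"

# Compared code point by code point, they are different strings.
same_raw = composed == decomposed

# After normalization they name the same abstract pathname.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
same_nfc = nfc_a == nfc_b

# UTF-8 is only a serialization: the two spellings differ in byte length.
utf8_lengths = (len(composed.encode("utf-8")),
                len(decomposed.encode("utf-8")))

print(same_raw, same_nfc, utf8_lengths)
```

A byte-oriented file system would treat the two spellings as two
distinct files.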


Marko


