How to waste computer memory?

Chris Angelico rosuav at gmail.com
Fri Mar 18 17:39:26 EDT 2016


On Sat, Mar 19, 2016 at 8:28 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> The problem is not Python's Unicode strings, then. The problem is the
>> notion that path names are text. If they're text, they should be
>> exclusively text (although, for low-level efficiency, they're more
>> likely to be defined as "valid UTF-8 sequences" rather than "sequences
>> of Unicode codepoints"); since they're not, they are fundamentally
>> bytes. But that's not a problem with Python - it's a problem with the
>> file system.
>
> The file system does not have a problem. Python has a problem because it
> tries to present pathnames as Unicode strings, which isn't always
> possible.

But what does a file name *mean*? If it has no meaning, we should
simply use a hierarchical tree of IDs. The point of a file *name* is
that it has meaning to a human, which implies that it carries text,
not bytes. So I maintain that the problem here is with the file
system; it permits (for historical reasons) arbitrary byte sequences.
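
To make the mismatch concrete, here is a small sketch of a byte-sequence
name that is not valid UTF-8, and of the "surrogateescape" error handler
(PEP 383) that Python uses to smuggle such bytes through a str on POSIX
systems. The file name is made up for illustration, and the codec is
spelled out explicitly so the result doesn't depend on the local file
system encoding.

```python
# A made-up file name encoded as Latin-1; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Strict decoding refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF),
# so the bytes survive a round trip through str.
name = raw.decode("utf-8", "surrogateescape")
round_trip = name.encode("utf-8", "surrogateescape")

print(strict_ok)          # False
print(ascii(name))        # 'caf\udce9.txt'
print(round_trip == raw)  # True
```

The resulting str round-trips, but it isn't really text: encoding it as
strict UTF-8 (say, to print it or send it over the network) fails.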

If I were building an entire OS ecosystem from scratch today, I'd
probably do a lot of things with a hybrid system of documented meaning
atop implementation-detail APIs. In this particular case, I would
define the API in terms of byte sequences, but clearly document
that these byte sequences are to be understood to mean text strings,
and thus must be valid UTF-8. It's still efficient (moving bytes
around the kernel is easier than having heaps of text<->bytes
transitions), but it allows future changes to depend on all non-broken
usage fitting this pattern.
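
A minimal sketch of that rule, assuming a hypothetical validate_name()
helper (not a real OS or Python API): the interface takes bytes, and
anything that isn't valid UTF-8 is rejected up front.

```python
def validate_name(name: bytes) -> str:
    """Hypothetical check: the API takes bytes, but they must be valid UTF-8."""
    try:
        # Strict decoding rejects malformed sequences, overlong forms,
        # and encoded surrogates.
        return name.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"file name is not valid UTF-8: {name!r}") from exc


ok = validate_name("na\u00efve.txt".encode("utf-8"))

try:
    validate_name(b"caf\xe9.txt")  # Latin-1 bytes, not UTF-8
    rejected = False
except ValueError:
    rejected = True

print(ok == "na\u00efve.txt")  # True
print(rejected)                # True
```

Bytes still flow through the low-level layer untouched; the check only
has to happen at the boundary where a name is created.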

ChrisA


