[Python-Dev] When should pathlib stop being provisional?

Steven D'Aprano steve at pearwood.info
Tue Apr 5 22:51:55 EDT 2016


On Wed, Apr 06, 2016 at 10:02:30AM +1000, Chris Angelico wrote:

> My personal view on the text/bytes debate is that a path is
> fundamentally a human concept, and consists therefore of text. The
> fact that some file systems store (at the low level) bytes and some
> store (I think) UTF-16 code units should be immaterial; path
> components exist for people. We can smuggle unrecognized bytes around,
> but ultimately, those bytes came from characters at some point - we
> just don't know the encoding. So a Path object has no relationship
> with bytes, only with str.

That might usually be true in practice, but it is incorrect in
principle. Paths in POSIX systems like Linux are fundamentally
byte-strings, and file names have only two restrictions: the bytes \0
and \x2f ("/", the path separator) are forbidden.

The fact that paths in Linux mostly happen to look like English words 
(often heavily abbreviated) is a historical accident. The file system 
itself supported paths containing (say) \xff even back in the days when 
text was pure US-ASCII and bytes over \x7f had no textual meaning, and 
these days paths still support sequences of bytes that have no human 
meaning in any encoding.
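
To make that concrete, here is a minimal sketch. The file name is
made up for illustration, and the surrogateescape error handler (PEP
383) is my assumption about what "smuggling" bytes through str looks
like in practice; nothing here touches a real file system.

```python
# Hypothetical file name: legal on POSIX (no \0, no "/"), but the
# byte \xff can never appear in well-formed UTF-8.
raw_name = b"report-\xff.txt"

# A strict decode fails, so this name has no plain-text form in UTF-8.
try:
    raw_name.decode("utf-8")
    decodes_strictly = True
except UnicodeDecodeError:
    decodes_strictly = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (here U+DCFF), so the bytes survive a round trip through str.
smuggled = raw_name.decode("utf-8", "surrogateescape")
round_tripped = smuggled.encode("utf-8", "surrogateescape")

print(decodes_strictly)           # False
print(round_tripped == raw_name)  # True
```

The resulting str carries the original bytes around, but the lone
surrogate in it is not text that a human could read or type.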

I don't know if this makes the tiniest lick of difference for Pathlib. I 
would be perfectly content if we stuck with the design decision that 
Pathlib can only represent paths representable as Unicode strings, and 
left weird POSIX filenames to the legacy byte-string interface.
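
As a rough illustration of that split, and only as I understand the
current behaviour, pathlib builds paths from str and refuses bytes
outright. The paths below are hypothetical, and PurePosixPath is used
so that nothing on disk is touched.

```python
from pathlib import PurePosixPath

# Text paths compose as expected.
text_path = PurePosixPath("/tmp") / "report.txt"

# Passing bytes to the constructor raises TypeError, so a name such
# as b"report-\xff.txt" has to go through the bytes-based os
# functions instead.
try:
    PurePosixPath(b"/tmp/report-\xff.txt")
    accepts_bytes = True
except TypeError:
    accepts_bytes = False

print(str(text_path))   # /tmp/report.txt
print(accepts_bytes)    # False
```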


-- 
Steve

