Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Albert Hopkins marduk at letterboxes.org
Tue Nov 30 22:22:01 EST 2010


On Wed, 2010-12-01 at 02:14 +0000, MRAB wrote:
> If the filenames are to be shown to a user then there needs to be a
> mapping between bytes and glyphs. That's an encoding. If different
> users use different encodings then exchange of textual data becomes
> difficult.

That's presentation, that's separate.  Indeed, I have my user encoding
set to UTF-8, and if there is a filename that's not valid utf-8 then my
GUI (GNOME) will show "(invalid encoding)" and even allow me to rename it
and my shell (bash) will show '?' next to the invalid "characters" (and
make it a little more challenging to rename ;)).  And I can freely copy
these "invalid" files across different (Unix) systems, because the OS
doesn't care about encoding.

But that's completely different from the actual name of the file.  Unix
doesn't care about presentation in filenames. It just cares about the
data.  There are no "glyphs" in Unix, only in the UI that runs on top
of it.

Or to put it another way, Unix's filename encoding is RAW-DATA.  It's
not "textual" data.  The fact that most filenames contain mainly
human-readable text is a convenient convention, but not required or
enforced by the OS.
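
To make the "raw data" point concrete, here is a minimal sketch, assuming a
typical Linux filesystem that accepts arbitrary bytes in names.  The helper
name make_and_list and the sample filename are made up for illustration.  It
creates a file whose name is not valid UTF-8 and lists the directory using
bytes paths, so no decoding is ever attempted:

```python
import os
import tempfile


def make_and_list(raw_name: bytes) -> list:
    """Create an empty file called raw_name in a fresh temporary
    directory and return the directory listing as raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)               # directory path as bytes
        path = os.path.join(tmp_b, raw_name)   # bytes in, bytes out
        with open(path, "wb"):                 # open() accepts a bytes path
            pass
        # Passing a bytes path makes os.listdir() return bytes names.
        return os.listdir(tmp_b)


# b"\xe9" on its own is not valid UTF-8 (it is "e-acute" in Latin-1).
names = make_and_list(b"caf\xe9.txt")
```

Filesystems that do enforce an encoding may reject such a name, so treat this
as an illustration of the Unix behaviour rather than something portable.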

>  That's where encodings which can be used globally come in.
> By the time Python 4 is released I'd be surprised if Unix hadn't
> standardised on a single encoding like UTF-8. 

I have serious doubts about that.  At least in the Linux world the
kernel wants to stay out of encoding debates (except where it has to,
as with Windows filesystems). But the point is that:

The world does not revolve around Python.  Unix filenames have been
encoding-agnostic since long before Python was around.  If Python3 does not
support this then it's a regression on Python's part.
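
For the task in the subject line, one encoding-agnostic approach is to keep
the name as bytes from start to finish.  This is only a sketch under my own
assumptions: the function names are invented, names are taken to be
newline-terminated, and an in-memory stream stands in for sys.stdin.buffer so
the example is self-contained:

```python
import io
import os
import tempfile


def read_filename(stream) -> bytes:
    """Read one newline-terminated filename from a binary stream.
    The name is kept as raw bytes and never decoded."""
    # Caveat: a filename that itself contains b"\n" would be split here.
    return stream.readline().rstrip(b"\n")


def read_named_file(stream) -> bytes:
    """Open the file named on the stream and return its contents."""
    with open(read_filename(stream), "rb") as f:
        return f.read()


# Demo: a file whose name is not valid UTF-8, with io.BytesIO playing
# the role of sys.stdin.buffer.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(os.fsencode(tmp), b"\xff\xfename")
    with open(path, "wb") as f:
        f.write(b"payload")
    contents = read_named_file(io.BytesIO(path + b"\n"))
```

In a real script the call would be read_named_file(sys.stdin.buffer), which
reads the underlying byte stream instead of the decoded text layer.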




