Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Nobody nobody at nowhere.com
Tue Nov 30 20:28:50 EST 2010


On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:

>> I think this is wrong.  In Unix there is no concept of filename
>> encoding.  Filenames can have any arbitrary set of bytes (except '/' and
>> '\0').   But the filesystem itself neither knows nor cares about
>> encoding.
> 
> I think you misunderstood what I was trying to say. If you write a list of 
> filenames into files.txt, and use an encoding (ISO-8859-1, say) other than 
> that used by the shell to display file names (on Linux typically UTF-8 these 
> days) and then write a Python script exist.py that reads filenames and 
> checks for the files' existence, 

I think you misunderstood.

In the Unix kernel, there aren't any encodings. Strings of bytes are
/just/ strings of bytes. A text file containing a list of filenames
doesn't /have/ an encoding. The filenames passed to API functions don't
/have/ an encoding.
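
To make that concrete, here is a minimal sketch of the existence check done purely on bytes, with no decoding step anywhere. It assumes a Python 3 in which os.path.exists() accepts bytes paths. The name check_names is made up for illustration, and splitting on newlines is itself an assumption, since a newline is a legal byte in a Unix filename.

```python
import os


def check_names(stream):
    """Return (name, exists) pairs for each line of a binary stream.

    The names stay as bytes from start to finish, so no encoding is
    ever guessed or applied.
    """
    results = []
    for line in stream:
        # Strip only the trailing newline; every other byte is kept as-is.
        name = line.rstrip(b"\n")
        if name:
            results.append((name, os.path.exists(name)))
    return results
```

In a script this would be called as check_names(sys.stdin.buffer), or on a file opened with mode "rb", so that the names never pass through a text layer.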

This is why Unix filenames are case-sensitive: because there isn't any
"case". The number 65 has no more in common with the number 97 than it
does with the number 255. The fact that 65 is the ASCII code for "A" while
97 is the ASCII code for "a" doesn't come into it. Case-insensitive
filenames require knowledge of the encoding in order to determine when
filenames are "equivalent". DOS/Windows tried this and never really got it
right (it works fine on a standalone system, or within later versions of
a Windows-only ecosystem, but becomes a nightmare when files get
transferred between systems via older or non-Microsoft channels).
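
A small sketch of why "equivalent ignoring case" cannot be decided on bytes alone: the same two bytes are a case pair under one encoding and unrelated symbols under another. The helper name same_ignoring_case is hypothetical, and the two codecs are only examples I picked for illustration.

```python
def same_ignoring_case(a: bytes, b: bytes, encoding: str) -> bool:
    """Compare two byte strings case-insensitively under a given encoding."""
    return a.decode(encoding).casefold() == b.decode(encoding).casefold()


# As raw bytes, 65 and 97 are simply different numbers.
print(b"A" == b"a")

# Bytes 0xC9 and 0xE9 are an upper/lower pair in Latin-1 ...
print(same_ignoring_case(b"\xc9", b"\xe9", "latin-1"))

# ... but in code page 437 they are a box-drawing character and a Greek letter.
print(same_ignoring_case(b"\xc9", b"\xe9", "cp437"))
```

The answer changes with the encoding argument, which is exactly the piece of information the kernel does not have.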

Python 3.x's decision to treat filenames (and environment variables) as
text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
Python 2.x is still around when Python 4 is released.
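
For context, a sketch of what treating a filename as text involves when the bytes are not valid in the expected encoding. As far as I know, Python 3 on Unix handles this with the "surrogateescape" error handler (PEP 383), and later releases expose it as os.fsdecode() and os.fsencode(). The sketch calls the codec directly with an explicit "utf-8" so that it does not depend on the locale. The helper names to_text and to_bytes are made up.

```python
def to_text(raw: bytes) -> str:
    """Decode a filename, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Reverse to_text(), restoring the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.txt"           # Latin-1 bytes, not valid UTF-8
text = to_text(raw)            # the 0xE9 byte becomes the surrogate U+DCE9
assert to_bytes(text) == raw   # the round trip is lossless

try:
    text.encode("utf-8")       # a strict encode, e.g. when printing the name
except UnicodeEncodeError:
    print("the 'text' filename cannot be encoded as ordinary UTF-8")
```

The round trip preserves the bytes, but the intermediate str is not ordinary text, and any code that encodes it strictly has to deal with that.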



