Python 3 encoding question: Read a filename from stdin, subsequently open that filename

MRAB python at mrabarnett.plus.com
Tue Nov 30 21:14:09 EST 2010


On 01/12/2010 01:28, Nobody wrote:
> On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:
>
>>> I think this is wrong. In Unix there is no concept of filename
>>> encoding. Filenames can contain any arbitrary sequence of bytes (except
>>> '/' and '\0'). But the filesystem itself neither knows nor cares about
>>> encoding.
>>
>> I think you misunderstood what I was trying to say. If you write a list of
>> filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
>> that used by the shell to display file names (on Linux typically UTF-8 these
>> days) and then write a Python script exist.py that reads filenames and
>> checks for the files' existence,
>
> I think you misunderstood.
>
> In the Unix kernel, there aren't any encodings. Strings of bytes are
> /just/ strings of bytes. A text file containing a list of filenames
> doesn't /have/ an encoding. The filenames passed to API functions don't
> /have/ an encoding.
>
> This is why Unix filenames are case-sensitive: because there isn't any
> "case". The number 65 has no more in common with the number 97 than it
> does with the number 255. The fact that 65 is the ASCII code for "A" while
> 97 is the ASCII code for "a" doesn't come into it. Case-insensitive
> filenames require knowledge of the encoding in order to determine when
> filenames are "equivalent". DOS/Windows tried this and never really got it
> right (it works fine on a standalone system, or within later versions of
> a Windows-only ecosystem, but becomes a nightmare when files get
> transferred between systems via older or non-Microsoft channels).
>
> Python 3.x's decision to treat filenames (and environment variables) as
> text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
> Python 2.x is still around when Python 4 is released.
>
If the filenames are to be shown to a user, then there needs to be a
mapping between bytes and the characters (glyphs) displayed. That's an
encoding. If different users use different encodings, then exchanging
textual data becomes difficult. That's where encodings that can be used
globally come in. By the time Python 4 is released, I'd be surprised if
Unix hadn't standardised on a single encoding like UTF-8.
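
For what it's worth, both views can be accommodated in one script. The following is only a rough sketch, not a recommendation from anyone in this thread: it keeps filenames as bytes when talking to the OS and decodes them only for display. The helper names (decode_name, encode_name, check_files) are made up for illustration, and the sample name b"caf\xe9" is a hypothetical ISO-8859-1 filename. It assumes a Unix-like system where os.path.exists() accepts bytes paths.

```python
import os


def decode_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode filename bytes to text; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_name(name: str, encoding: str = "utf-8") -> bytes:
    """Inverse of decode_name: recover exactly the original bytes."""
    return name.encode(encoding, "surrogateescape")


def check_files(stream):
    """Yield (raw_name, exists) for each newline-terminated name in a binary stream."""
    for line in stream:
        raw = line.rstrip(b"\n")
        if raw:
            # Pass the bytes straight to the OS; no encoding is assumed here.
            yield raw, os.path.exists(raw)


# The same bytes are a different string depending on the assumed encoding.
raw = b"caf\xe9"                      # hypothetical ISO-8859-1 filename
latin1 = decode_name(raw, "iso-8859-1")
escaped = decode_name(raw)            # not valid UTF-8, so 0xE9 is escaped
assert encode_name(escaped) == raw    # the round trip loses nothing
```

To check names arriving on standard input, one could iterate over check_files(sys.stdin.buffer) after importing sys; the .buffer attribute is the binary layer underneath sys.stdin, so nothing is decoded on the way in. A string containing escaped surrogates cannot be printed with a strict encoder, so for display something like raw.decode("utf-8", "backslashreplace") may be more practical.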


