Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Peter Otten __peter__ at web.de
Tue Nov 30 12:53:14 EST 2010


Albert Hopkins wrote:

> On Tue, 2010-11-30 at 11:52 +0100, Peter Otten wrote:
>> Dan Stromberg wrote:
>> 
>> > I've got a couple of programs that read filenames from stdin, and then
>> > open those files and do things with them.  These programs sort of do
>> > the *ix xargs thing, without requiring xargs.
>> > 
>> > In Python 2, these work well.  Irrespective of how filenames are
>> > encoded, things are opened OK, because it's all just a stream of
>> > single byte characters.
>> 
>> I think you're wrong. The filenames' encoding as they are read from stdin
>> must be the same as the encoding used by the file system. If the file 
>> system expects UTF-8 and you feed it ISO-8859-1 you'll run into errors.
>> 
> I think this is wrong.  In Unix there is no concept of filename
> encoding.  Filenames can have any arbitrary set of bytes (except '/' and
> '\0').   But the filesystem itself neither knows nor cares about
> encoding.

I think you misunderstood what I was trying to say. Suppose you write a list 
of filenames into files.txt using an encoding (ISO-8859-1, say) other than 
the one the shell uses to display file names (on Linux typically UTF-8 these 
days), and then write a Python script exist.py that reads filenames and 
checks for the files' existence. Then

$ python3 exist.py < files.txt

will report that a file

b'\xe4\xf6\xfc.txt' 

doesn't exist. A user who sees the line

äöü.txt

in an editor with the encoding set to ISO-8859-1, and then goes to the 
console and types

$ ls
äöü.txt

will be confused, even though everything is working correctly. 
The system may be shuffling bytes, but the user thinks in codepoints and 
sometimes assumes that codepoints and bytes are the same.
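
To make the mismatch concrete, here is a minimal sketch (the variable names 
are just for illustration) that shows the same three characters turning into 
two different byte sequences, depending on the encoding used to write them:

```python
# The name the user sees in the editor and in the ls output: "äöü.txt"
name = "\xe4\xf6\xfc.txt"

# What ends up in files.txt when the editor saves as ISO-8859-1
latin1_bytes = name.encode("iso-8859-1")

# What a UTF-8 system typically stores as the actual filename
utf8_bytes = name.encode("utf-8")

print(latin1_bytes)                # b'\xe4\xf6\xfc.txt'
print(utf8_bytes)                  # b'\xc3\xa4\xc3\xb6\xc3\xbc.txt'
print(latin1_bytes == utf8_bytes)  # False
```

Same codepoints, different bytes, so a lookup by the ISO-8859-1 bytes 
cannot find a file whose name was stored as UTF-8.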

>> You always have to know either
>> 
>> (a) both the file system's and stdin's actual encoding, or
>> (b) that both encodings are the same.
>> 
>> 
> If this is true, then I think that it is wrong to do in Python3.  Any
> language should be able to deal with the filenames that the host OS
> allows.
> 
> Anyway, going on with the OP.. can you open stdin so that you can accept
> arbitrary bytes instead of strings and then open using the bytes as the
> filename? 

You can access the underlying sys.stdin.buffer, which feeds you the raw bytes 
with no attempt to shoehorn them into codepoints. open() accepts bytes as a 
filename, too, so you can use filenames that are not valid in the encoding 
that the system uses to display filenames:

$ ls
$ python3
Python 3.1.1+ (r311:74480, Nov  2 2009, 15:45:00)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open(b"\xe4\xf6\xfc.txt", "w") as f:
...     f.write("hello\n")
...
6
>>>
$ ls
???.txt
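
Putting the two pieces together, a rough sketch of the "read filenames from 
stdin, then open them" loop might look like the following. The helper name 
iter_filenames is made up for this example, and it assumes one filename per 
line, terminated by "\n" (so names containing a newline are not handled):

```python
import io
import os


def iter_filenames(stream):
    """Yield one filename per line as raw bytes, without the newline.

    `stream` is any binary file-like object, e.g. sys.stdin.buffer.
    Empty lines are skipped.
    """
    for line in stream:
        name = line.rstrip(b"\n")
        if name:
            yield name


# Stand-in for sys.stdin.buffer so the sketch runs without real input.
fake_stdin = io.BytesIO(b"\xe4\xf6\xfc.txt\nplain.txt\n")

for name in iter_filenames(fake_stdin):
    # os.path.exists() and open() both accept bytes filenames,
    # so the name is never decoded into codepoints.
    print(name, os.path.exists(name))
```

In a real script you would pass sys.stdin.buffer instead of the BytesIO 
object and call open(name, "rb") on each name.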

> I don't have that much experience with Python3 to say for sure.

Me neither.

Peter



