Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Peter Otten __peter__ at web.de
Tue Nov 30 05:52:27 EST 2010


Dan Stromberg wrote:

> I've got a couple of programs that read filenames from stdin, and then
> open those files and do things with them.  These programs sort of do
> the *ix xargs thing, without requiring xargs.
> 
> In Python 2, these work well.  Irrespective of how filenames are
> encoded, things are opened OK, because it's all just a stream of
> single byte characters.

I think you're wrong. Byte strings are passed to the file system untouched, 
so the filenames' encoding as they are read from stdin must be the same as 
the encoding used by the file system. If the file system expects UTF-8 and 
you feed it ISO-8859-1, the names won't match and open() will fail.

You always have to know either

(a) both the file system's and stdin's actual encoding, or 
(b) that both encodings are the same.

If byte strings work you are in situation (b) or just lucky. I'd guess the 
latter ;)
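
To illustrate why the two encodings have to agree, here is a small sketch. 
The filename is made up for the example; the point is only that the same 
character becomes different bytes depending on the codec, so a name decoded 
with one and re-encoded with the other no longer matches what is on disk.

```python
# Hypothetical filename containing a Spanish n-tilde (U+00F1).
name = "ma\u00f1ana.txt"

latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

print(latin1_bytes)  # b'ma\xf1ana.txt'      (one byte for the n-tilde)
print(utf8_bytes)    # b'ma\xc3\xb1ana.txt'  (two bytes for the n-tilde)

# Decoding UTF-8 bytes as ISO-8859-1 never fails, but it silently
# produces a different name, which is why the mismatch is easy to miss.
wrong = utf8_bytes.decode("iso-8859-1")
print(wrong == name)  # False
```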
 
> In Python 3, I'm finding that I have encoding issues with characters
> with their high bit set.  Things are fine with strictly ASCII
> filenames.  With high-bit-set characters, even if I change stdin's
> encoding with:
> 
> import io
> STDIN = io.open(sys.stdin.fileno(), 'r', encoding='ISO-8859-1')

I suppose you can handle (b) with

import sys
STDIN = sys.stdin.buffer

which gives you the filenames as bytes (open() accepts those), or

import io
import sys
STDIN = io.TextIOWrapper(sys.stdin.buffer,
                         encoding=sys.getfilesystemencoding())

in Python 3. I'd prefer the latter because it makes your assumptions 
explicit. (Disclaimer: I'm not sure whether I'm using the io API as Guido 
intended it.)
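
A slightly fuller sketch of both variants, under the assumption that there 
is one filename per line and that situation (b) holds. The helper names 
(iter_filenames, file_sizes, as_text) are invented for this example, and 
errors="surrogateescape" is an addition of mine so that bytes which cannot 
be decoded still survive a round trip; treat it as one possible approach 
rather than the canonical one.

```python
import io
import sys


def iter_filenames(binary_stream):
    """Yield one filename per line as bytes, without the trailing newline."""
    for line in binary_stream:
        name = line.rstrip(b"\n")
        if name:  # skip empty lines
            yield name


def file_sizes(binary_stream):
    """Open each named file in binary mode and yield (name, size in bytes)."""
    for name in iter_filenames(binary_stream):
        # open() accepts a bytes path and hands it to the OS unchanged.
        with open(name, "rb") as f:
            yield name, len(f.read())


def as_text(binary_stream, encoding=None):
    """Text variant: decode with the file system encoding by default.

    errors="surrogateescape" maps undecodable bytes to lone surrogates,
    so encoding the name again with the same settings restores the bytes.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="surrogateescape")
```

In a script you would pass sys.stdin.buffer to file_sizes() or as_text(). 
When reporting names, printing repr(name) sidesteps a second round of 
encoding trouble on stdout.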

> ...even with that, when I read a filename from stdin with a
> single-character Spanish n~, the program cannot open that filename
> because the n~ is apparently internally converted to two bytes, but
> remains one byte in the filesystem.  I decided to try ISO-8859-1 with
> Python 3, because I have a Java program that encountered a similar
> problem until I used en_US.ISO-8859-1 in an environment variable to
> set the JVM's encoding for stdin.
> 
> Python 2 shows the n~ as 0xf1 in an os.listdir('.').  Python 3 with an
> encoding of ISO-8859-1 wants it to be 0xc3 followed by 0xb1.
> 
> Does anyone know what I need to do to read filenames from stdin with
> Python 3.1 and subsequently open them, when some of those filenames
> include characters with their high bit set?
> 
> TIA!




