Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Dan Stromberg drsalists at gmail.com
Mon Dec 6 00:01:58 EST 2010


Ultimately I switched to reading the filenames from file descriptor 0
using os.read(). This returns bytes in 3.x and str (strings of
single-byte characters) in 2.x, which are similar enough for my
purposes, and it sidesteps the filesystem encoding(s) question nicely.
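
A minimal sketch of that approach, not the code from my programs: the
names read_all and filenames_from_fd are made up for illustration, and
it assumes newline-separated filenames on a POSIX system.

```python
import os


def read_all(fd=0, chunk_size=65536):
    """Read everything from a file descriptor as bytes, with no text decoding."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:
            # os.read() returns an empty bytes object at end of input.
            break
        chunks.append(chunk)
    return b''.join(chunks)


def filenames_from_fd(fd=0):
    """Return the newline-separated filenames read from fd, as bytes."""
    data = read_all(fd)
    # Skip empty entries, such as the one after a trailing newline.
    return [name for name in data.split(b'\n') if name]
```

Each name can then be passed to open(name, 'rb') unchanged. On POSIX,
Python 3 accepts a bytes filename and hands it to the operating system
without re-encoding it.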

I rewrote readline0
(http://stromberg.dnsalias.org/cgi-bin/viewvc.cgi/readline0/trunk/?root=svn)
for 2.x and 3.x to facilitate reading null-terminated strings from
stdin.  It's in better shape now anyway - more OOP than functional,
and with a bunch of unit tests.  The module now works on CPython 2.x,
CPython 3.x and PyPy 1.4 from the same code.
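
For anyone who only wants the general idea, here is a rough sketch of
splitting a byte stream on null terminators. It is not readline0's
actual interface; the function name and parameters are hypothetical.

```python
import os


def null_terminated(fd=0, chunk_size=65536):
    """Yield each NUL-terminated record read from fd, as bytes, without the NUL."""
    buffered = b''
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:
            break
        buffered += chunk
        pieces = buffered.split(b'\0')
        # The last piece is an incomplete record (or empty); keep it for later.
        buffered = pieces.pop()
        for piece in pieces:
            yield piece
    if buffered:
        # Input ended without a final terminator; yield what remains.
        yield buffered
```

This pairs with something like "find . -print0" on the producing side,
so filenames containing newlines also survive intact.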

On Mon, Nov 29, 2010 at 9:26 PM, Dan Stromberg <drsalists at gmail.com> wrote:
> I've got a couple of programs that read filenames from stdin, and then
> open those files and do things with them.  These programs sort of do
> the *ix xargs thing, without requiring xargs.
>
> In Python 2, these work well.  Irrespective of how filenames are
> encoded, things are opened OK, because it's all just a stream of
> single byte characters.
>
> In Python 3, I'm finding that I have encoding issues with characters
> with their high bit set.  Things are fine with strictly ASCII
> filenames.  With high-bit-set characters, even if I change stdin's
> encoding with:
>
>       import io
>       import sys
>       STDIN = io.open(sys.stdin.fileno(), 'r', encoding='ISO-8859-1')
>
> ...even with that, when I read a filename from stdin with a
> single-character Spanish n~, the program cannot open that filename
> because the n~ is apparently internally converted to two bytes, but
> remains one byte in the filesystem.  I decided to try ISO-8859-1 with
> Python 3, because I have a Java program that encountered a similar
> problem until I used en_US.ISO-8859-1 in an environment variable to
> set the JVM's encoding for stdin.
>
> Python 2 shows the n~ as 0xf1 in an os.listdir('.').  Python 3 with an
> encoding of ISO-8859-1 wants it to be 0xc3 followed by 0xb1.
>
> Does anyone know what I need to do to read filenames from stdin with
> Python 3.1 and subsequently open them, when some of those filenames
> include characters with their high bit set?
>
> TIA!
>
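
The two byte values quoted above are the same character in two
encodings. The sketch below only illustrates that; it assumes, without
confirmation from the original report, that the filesystem encoding
Python 3 used when opening the file was UTF-8.

```python
# U+00F1 is the Spanish n-with-tilde mentioned in the question.
name = u'\xf1'

# One byte in ISO-8859-1: what the filesystem actually held.
latin1_bytes = name.encode('iso-8859-1')

# Two bytes in UTF-8: what gets looked up if the decoded string is
# re-encoded as UTF-8 when the file is opened (assumed here).
utf8_bytes = name.encode('utf-8')
```

Decoding stdin as ISO-8859-1 gives the right character, but the lookup
fails if that character is then encoded differently from the bytes
stored on disk.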


