Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Thu Dec 2 18:12:27 EST 2010

On Thu, 02 Dec 2010 12:17:53 +0100, Peter Otten wrote:

>> This was actually a critical flaw in Python 3.0, as it meant that
>> filenames which weren't valid in the locale's encoding simply couldn't be
>> passed via argv or environ. 3.1 fixed this using the "surrogateescape"
>> encoding, so now it's only an annoyance (i.e. you can recover the original
>> bytes once you've spent enough time digging through the documentation).
> 
> Is it just that you need to harden your scripts against these byte sequences 
> or do you actually encounter them? If the latter, can you give some 
> examples?

Assume that you have a Python 3 script which takes filenames on the
command-line. If any of the filenames contain byte sequences which
aren't valid in the locale's encoding, each undecodable byte will be
decoded to a lone surrogate in the range U+DC80 to U+DCFF.
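
As a small illustration (a sketch only: the filename bytes are made up,
and a UTF-8 locale is assumed), decoding a Latin-1 encoded name with the
"surrogateescape" handler turns the undecodable byte 0xE9 into U+DCE9:

```python
# Hypothetical filename bytes: "cafe.txt" with an e-acute encoded as
# ISO-8859-1, so the byte 0xE9 is not valid UTF-8.
raw = b"caf\xe9.txt"

# Mimic how argv would be decoded under a UTF-8 locale (assumption).
name = raw.decode("utf-8", "surrogateescape")

# Collect the code points that are escaped bytes rather than real characters.
escaped = [hex(ord(ch)) for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]
print(escaped)
```

The low byte of each surrogate is the original byte value, which is what
makes the mapping reversible.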

To recover the original bytes, you need to use 'surrogateescape' as the
error handling method when encoding the strings back to bytes, e.g.:

	import sys
	# Re-encode with the codec and error handler used to decode argv.
	enc = sys.getfilesystemencoding()
	argv_bytes = [arg.encode(enc, 'surrogateescape') for arg in sys.argv]

Otherwise, with the default 'strict' error handler, encode() will raise
UnicodeEncodeError because it cannot encode the surrogate characters.
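
To make the difference concrete, here is a minimal sketch (again with a
made-up byte string and an assumed UTF-8 encoding) comparing the default
handler with 'surrogateescape':

```python
# Hypothetical undecodable filename, decoded the way argv would be
# under a UTF-8 locale (assumption).
raw = b"caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")

# The default 'strict' handler refuses to encode the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'surrogateescape' maps the surrogate back to the original byte.
recovered = name.encode("utf-8", "surrogateescape")
print(strict_failed, recovered == raw)
```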

Similarly for os.environ.
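
A sketch of the same trick applied to the environment, assuming a Unix
system (the name environ_bytes is just for illustration):

```python
import os
import sys

enc = sys.getfilesystemencoding()

# Re-encode both names and values with the handler used to decode them.
environ_bytes = {
    key.encode(enc, "surrogateescape"): value.encode(enc, "surrogateescape")
    for key, value in os.environ.items()
}
```

Python 3.2 and later also provide os.environb and os.fsencode() on Unix,
which do this for you.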

For anything else, you can just use sys.setfilesystemencoding('iso-8859-1')
at the beginning of the script (this function was removed in Python 3.2).
Decoding as ISO-8859-1 will never fail, because every byte value maps to
a character, and encoding as ISO-8859-1 will give you the original bytes.
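
The round-trip property of ISO-8859-1 is easy to check; a minimal sketch
covering all 256 byte values:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so decoding cannot fail.
text = all_bytes.decode("iso-8859-1")

# Encoding reverses the mapping exactly.
round_tripped = text.encode("iso-8859-1")
print(round_tripped == all_bytes)
```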

But argv and environ are decoded before your script can change the
encoding, so you need to know the "trick" to undo them if you want to
write a robust Python 3 script which works with byte strings in an
encoding-agnostic manner (i.e. a traditional Unix script).