Non-unicode file names

Thomas Jollans tjol at tjol.eu
Wed Aug 8 18:16:45 EDT 2018


On *nix, file names are bytes. In real life, we prefer to think of file
names as strings. How non-ASCII file names are created is determined by
the locale, and on most systems these days, every locale uses UTF-8 and
everybody's happy. Of course this doesn't mean you'll never run into an
old directory tree from the pre-UTF8 age using some other encoding, and
it doesn't prevent people from doing silly things in file names.

Python deals with this tolerably well: by convention, file names are
strings, but you can use bytes for file names if you wish. The docs [1]
warn you about the situation.

[1] https://docs.python.org/3/library/os.path.html
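As a small illustration of the "bytes in, bytes out" convention, here is a minimal sketch. It uses posixpath directly (which is what os.path is on *nix) so that it behaves the same everywhere; the directory name is made up for the example.

```python
import posixpath  # on *nix, os.path is this module

tmpdir = '/tmp/example'   # hypothetical directory, for illustration only
fn = b'\xa4.py'           # '€.py' encoded as Latin-9

# All-bytes arguments give a bytes path back.
path = posixpath.join(tmpdir.encode('ascii'), fn)
print(path)

# Mixing str and bytes components is refused with a TypeError.
try:
    posixpath.join(tmpdir, fn)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

This is why the demo script further down encodes the temporary directory name before joining it with a bytes file name.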

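If Python runs into a non-UTF8 (better: non-decodable) file name and has
to return a str, it uses surrogate escapes (the 'surrogateescape' error
handler from PEP 383): each undecodable byte 0xNN becomes the lone
surrogate U+DCNN. So far so good. Right?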
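To make the mechanism concrete, here is a minimal sketch that applies the 'surrogateescape' error handler by hand. Spelling out UTF-8 is an assumption standing in for whatever the filesystem encoding is on your system; the byte string is the Latin-9 file name from the demo script below.

```python
raw = b'\xa4.py'  # '€.py' encoded as Latin-9; 0xA4 is not valid UTF-8 here

# The undecodable byte 0xA4 is smuggled into the str as U+DCA4.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# which is what lets Python pass the name back to the OS unchanged.
round_trip = name.encode('utf-8', 'surrogateescape')
print(round_trip == raw)
```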

This leads to the unfortunate situation that you can't always print()
file names, as print() is strict and refuses to toy with surrogates.

To be more explicit, the script

    print(__file__)

will fail depending on the file name. This feels wrong... (though every
bit of behaviour is correct)
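If you do need to show such a name to a human, one possible workaround (a sketch, not the only option) is to escape the surrogates before printing, or to go back to the original bytes. The variable names here are made up for the example, and UTF-8 is again assumed as the encoding.

```python
name = b'\xa4.py'.decode('utf-8', 'surrogateescape')

# Strict encoding, which print() does by default on a UTF-8 terminal,
# rejects the lone surrogate.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 1: a lossy but printable form, with the surrogate written
# out as a backslash escape.
printable = name.encode('utf-8', 'backslashreplace').decode('utf-8')
print(printable)

# Option 2: recover the exact original bytes, e.g. for writing to a
# binary stream instead of going through print().
raw = name.encode('utf-8', 'surrogateescape')
print(raw)
```

The first option is safe for logs and error messages; the second is the one to use if the name has to be handed to another program unchanged.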

(The situation can't arise on Windows, and Python 2 will pretend nothing
happened in true UNIX style)

Demo script to try at home below.

-- Thomas


# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function

import sys
import os.path
import subprocess
import tempfile
import shutil

script = 'print(__file__)\n'

file_names = ['🐪.py', '€.py', '€.py'.encode('latin9')]

PY = sys.executable

tmpdir = tempfile.mkdtemp()

for fn in file_names:
    if isinstance(fn, bytes):
        path = os.path.join(tmpdir.encode('ascii'), fn)
    else:
        path = os.path.join(tmpdir, fn)

    print('► creating', path)
    with open(path, 'w') as fp:
        fp.write(script)

    print('► running', PY, path)
    status = subprocess.call([PY, path])
    print('► exited with status', status)

print('► cleaning up')
shutil.rmtree(tmpdir)

# End of script
#######################################################################
# Output from Python 3.6.5 on Linux (Ubuntu 18.04)::
#
#     ► creating /tmp/tmp_a4h5n22/🐪.py
#     ► running /usr/bin/python3 /tmp/tmp_a4h5n22/🐪.py
#     /tmp/tmp_a4h5n22/🐪.py
#     ► exited with status 0
#     ► creating /tmp/tmp_a4h5n22/€.py
#     ► running /usr/bin/python3 /tmp/tmp_a4h5n22/€.py
#     /tmp/tmp_a4h5n22/€.py
#     ► exited with status 0
#     ► creating b'/tmp/tmp_a4h5n22/\xa4.py'
#     ► running /usr/bin/python3 b'/tmp/tmp_a4h5n22/\xa4.py'
#     Traceback (most recent call last):
#       File "/tmp/tmp_a4h5n22/\udca4.py", line 1, in <module>
#         print(__file__)
#     UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 17: surrogates not allowed
#     ► exited with status 1
#     ► cleaning up
#
# Python 2.7.15rc1 on Linux (Ubuntu):
#
#     ► creating /tmp/tmp_U_LPp/🐪.py
#     ► running /usr/bin/python2 /tmp/tmp_U_LPp/🐪.py
#     /tmp/tmp_U_LPp/🐪.py
#     ► exited with status 0
#     ► creating /tmp/tmp_U_LPp/€.py
#     ► running /usr/bin/python2 /tmp/tmp_U_LPp/€.py
#     /tmp/tmp_U_LPp/€.py
#     ► exited with status 0
#     ► creating /tmp/tmp_U_LPp/�.py
#     ► running /usr/bin/python2 /tmp/tmp_U_LPp/�.py
#     /tmp/tmp_U_LPp/�.py
#     ► exited with status 0
#     ► cleaning up
#
# Python 3.7.0 on Windows 10::
#
#     ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
#     ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
#     C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
#     ► exited with status 0
#     ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
#     ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
#     C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
#     ► exited with status 0
#     ► creating b'C:\\Users\\tjol\\AppData\\Local\\Temp\\tmpzprwnyc2\\\xa4.py'
#     Traceback (most recent call last):
#       File ".\bytes_file_names2.py", line 25, in <module>
#         with open(path, 'w') as fp:
#     UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 45: invalid start byte
#
# Python 2.7.15 on Windows 10:
#
#     Traceback (most recent call last):
#       File ".\bytes_file_names2.py", line 24, in <module>
#         print('Ôû║ creating', path)
#       File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
#         return codecs.charmap_encode(input,errors,encoding_map)
#     UnicodeEncodeError: 'charmap' codec can't encode character u'\u25ba' in position 0: character maps to <undefined>





