Non-unicode file names

Thomas Jollans tjol at tjol.eu
Wed Aug 8 20:14:33 EDT 2018


On 09/08/18 01:48, MRAB wrote:
> On 2018-08-08 23:16, Thomas Jollans wrote:
>> On *nix, file names are bytes. In real life, we prefer to think of file
>> names as strings. How non-ASCII file names are created is determined by
>> the locale, and on most systems these days, every locale uses UTF-8 and
>> everybody's happy. Of course this doesn't mean you'll never run into an
>> old directory tree from the pre-UTF8 age using some other encoding, and
>> it doesn't prevent people from doing silly things in file names.
>>
>> Python deals with this tolerably well: by convention, file names are
>> strings, but you can use bytes for file names if you wish. The docs [1]
>> warn you about the situation.
>>
>> [1] https://docs.python.org/3/library/os.path.html
>>
>> If Python runs into a non-UTF8 (better: non-decodable) file name and has
>> to return a str, it uses surrogate escape codes. So far so good. Right?
>>
>> This leads to the unfortunate situation that you can't always print()
>> file names, as print() is strict and refuses to toy with surrogates.
>>
>> To be more explicit, the script
>>
>>      print(__file__)
>>
>> will fail depending on the file name. This feels wrong... (though every
>> bit of behaviour is correct)
>>
>> (The situation can't arise on Windows, and Python 2 will pretend nothing
>> happened in true UNIX style)
>>
>> Demo script to try at home below.
>>
> [snip]
> 
> Is it true that Unix filenames can contain control characters, e.g. \x07?
> 
> What happens when you print them out?
> 
> I think it's not just a problem with surrogate escapes.

Not a problem (or: not an exception), as control characters are ASCII and
thus valid UTF-8, so print() can encode them without complaint.

Python 3.6.5 (default, Apr  1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('\x07.py', 'w') as fp:
...     fp.write('print(__file__)\n')
...
16
>>> import sys; import subprocess
>>> subprocess.call([sys.executable, '\x07.py'])
.py
0
>>>

As you might expect, it beeped when printing '\x07.py' (and showed .py)
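
For contrast, here is a rough sketch of why the control character gets
through while a surrogate-escaped name does not. The names are made up for
illustration, and I'm assuming a UTF-8 locale with a strict stdout (the
decode is spelled out by hand so it doesn't depend on your environment).
printable() is a hypothetical helper, not something from the stdlib.

```python
# Illustrative names only; neither file needs to exist.
bell_name = '\x07.py'            # control character, plain ASCII
latin1_bytes = b'caf\xe9.py'     # Latin-1 bytes, not valid UTF-8

# Decode the way the filesystem layer would on a UTF-8 locale (assumption):
# the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
odd_name = latin1_bytes.decode('utf-8', 'surrogateescape')


def can_encode_strictly(name, encoding='utf-8'):
    """Return True if name encodes with the 'strict' error handler,
    which is what a strict output stream would apply on print()."""
    try:
        name.encode(encoding, 'strict')
    except UnicodeEncodeError:
        return False
    return True


def printable(name, encoding='utf-8'):
    """Hypothetical helper: recover the original bytes, then decode them
    again with undecodable bytes shown as \\xNN escapes."""
    raw = name.encode(encoding, 'surrogateescape')
    return raw.decode(encoding, 'backslashreplace')


if __name__ == '__main__':
    # repr() is used so the BEL character is shown rather than sounded.
    for name in (bell_name, odd_name):
        print(repr(name), can_encode_strictly(name), repr(printable(name)))
```

So the BEL name is merely annoying, while the surrogate-escaped one needs
something like printable() (or repr()/ascii()) before it can go to a strict
stream.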



