Non-unicode file names

MRAB python at mrabarnett.plus.com
Wed Aug 8 19:48:34 EDT 2018


On 2018-08-08 23:16, Thomas Jollans wrote:
> On *nix, file names are bytes. In real life, we prefer to think of file
> names as strings. How non-ASCII file names are created is determined by
> the locale, and on most systems these days, every locale uses UTF-8 and
> everybody's happy. Of course this doesn't mean you'll never run into an
> old directory tree from the pre-UTF8 age using some other encoding, and
> it doesn't prevent people from doing silly things in file names.
> 
> Python deals with this tolerably well: by convention, file names are
> strings, but you can use bytes for file names if you wish. The docs [1]
> warn you about the situation.
> 
> [1] https://docs.python.org/3/library/os.path.html
> 
> If Python runs into a non-UTF8 (better: non-decodable) file name and has
> to return a str, it uses surrogate escape codes. So far so good. Right?
> 
> This leads to the unfortunate situation that you can't always print()
> file names, as print() is strict and refuses to toy with surrogates.
> 
> To be more explicit, the script
> 
>      print(__file__)
> 
> will fail depending on the file name. This feels wrong... (though every
> bit of behaviour is correct)
> 
> (The situation can't arise on Windows, and Python 2 will pretend nothing
> happened in true UNIX style)
> 
> Demo script to try at home below.
> 
[snip]

Is it true that Unix filenames can contain control characters, e.g. \x07?

What happens when you print them out?

I think it's not just a problem with surrogate escapes.
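
For what it's worth, a POSIX file name may contain any byte except "/" and NUL, so control characters such as \x07 are allowed. The following is only a minimal sketch of the two cases discussed here, not anything from the original demo script. It assumes a UTF-8 locale and imitates the usual file name decoding by calling bytes.decode() with the "surrogateescape" error handler directly. The helper escape_unprintable() and the sample file name are made up for illustration.

```python
def escape_unprintable(name):
    """Return name with every non-printable character backslash-escaped.

    Hypothetical helper: str.isprintable() is False for control
    characters and for lone surrogates, so both cases are covered.
    """
    out = []
    for ch in name:
        if ch.isprintable():
            out.append(ch)
            continue
        cp = ord(ch)
        if cp < 0x100:
            out.append(f"\\x{cp:02x}")
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")
        else:
            out.append(f"\\U{cp:08x}")
    return "".join(out)


# Made-up file name: a BEL control character plus a Latin-1 byte (0xE9)
# that is not valid UTF-8 in this position.
raw = b"bell\x07-caf\xe9.txt"

# Assumption: UTF-8 file system encoding with surrogateescape, which is
# what Python typically uses for file names on POSIX.
name = raw.decode("utf-8", "surrogateescape")

# The undecodable byte became a lone surrogate, so a strict encode (what
# print() ends up doing on a UTF-8 stdout) raises UnicodeEncodeError.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encode failed:", exc.reason)

# The BEL character, by contrast, encodes without complaint and would be
# sent to the terminal unchanged. Escaping handles both cases.
print(escape_unprintable(name))

# The original bytes can still be recovered for passing back to the OS.
assert name.encode("utf-8", "surrogateescape") == raw
```

The escaped form is only meant for display. For actually opening the file, the unescaped str (or the original bytes) is still what has to be passed to the OS.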


