Making safe file names

Tue May 7 23:40:50 EDT 2013

On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:

> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS.  In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes.  So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my program
> will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings will
> generally be ignored. I haven't yet decided how I will handle artist
> names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is
> just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).

There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire 
Unicode range, unlike other encodings like Latin-1 or ASCII.

Well, that is to say, there may be characters that are not (yet) handled 
at all by Unicode, but there are no known legacy encodings that support 
such characters.

To a first approximation, Unicode covers the entire set of characters in 
human use, and for those which it does not, there is always the private 
use area. So for example, if you wish to record the Artist Formerly Known 
As "The Artist Formerly Known As Prince" as Love Symbol, you could pick 
an arbitrary private use code point, declare that for your application 
that code point means Love Symbol, and use that code point as the artist 
name. You could even come up with a custom font that includes a rendition 
of that character glyph.

However, there are byte combinations which are not valid UTF-8, which is 
a different story. If you're receiving bytes from (say) a file name, they 
may not necessarily make up a valid UTF-8 string. But this is not an 
issue if you are receiving data from something guaranteed to be valid 
UTF-8.

>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want they easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

Of course you do. You even talk below about Unicode characters like * 
and ? not being allowed on NTFS systems.

Perhaps you are thinking that there are a bunch of characters over here 
called "plain text ASCII characters", and a *different* bunch of 
characters with funny accents and stuff called "Unicode characters". If 
so, then you are labouring under a misapprehension, and you should start 
off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

then come back with any questions.

> This is not a Unicode issue since (modern) file systems will happily
> accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
>  \ / : * ? " < > | @ and the NUL character

These are all Unicode characters too. Unicode is a subset of ASCII, so 
anything which is ASCII is also Unicode.

> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
> and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be
> escaped.

If you have an artist with control characters in their name, like newline 
or carriage return or NUL, I think it is fair to just drop the control 
characters and then give the artist a thorough thrashing with a halibut.

Does your mapping really need to be guaranteed reversible? If you have an 
artist called "JoeBlow", and another artist called "Joe\0Blow", and a 
third called "Joe\nBlow", does it *really* matter if your application 
conflates them?

-- 
Steven