Encoding of file names

Tom Anderson twic at urchin.earth.li
Fri Dec 9 06:05:28 EST 2005


On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:

> utabintarbo wrote:
>
>> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>
> For all those who followed this thread, here is some more explanation:
>
> Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled 
> 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a 
> vertical line in the middle, plus a line from that going left) into a 
> file name. How he managed to do that, I can only guess: most likely, the 
> Samba installation assumes that the file system encoding on the Solaris 
> box is some IBM code page (say, CP 437 or CP 850). If so, the byte on 
> disk would be \xb4. Where this came from, I have to guess further: 
> perhaps it is ACUTE ACCENT from ISO-8859-*.
>
> Anyway, when he used listdir() to get the contents of the directory, 
> Windows applies the CP_ACP encoding (known as "mbcs" in Python). For 
> reasons unknown to me, the US and several European versions of XP map 
> this to \xa6, VERTICAL BAR (I can somewhat see that as meaningful for 
> U+2524, but not for U+2592).
>
> So when he then applies isfile to that file name, \xa6 is mapped to 
> U+00A6, which then isn't found on the Samba side.
>
> So while Unicode here is the solution, the problem is elsewhere; most 
> likely in a misconfiguration of the Samba server (which assumes some 
> encoding for the files on disk, yet the AIX application uses a different 
> encoding).

Isn't the key thing that Windows is applying a non-roundtrippable 
character encoding? If I've understood this right, Samba and Windows are 
talking in Unicode, with these (probably quite spurious, but never mind) 
U+25xx characters, and Samba is presenting a quite consistent view of the 
world: there's a file called "double bucky backslash grey box" in the 
directory listing, and if you ask for a file called "double bucky backslash 
grey box", you get it. Windows, however, maps that name to the 8-bit 
string "double bucky backslash vertical bar", but when you pass *that* 
back to it, it gets decoded to the Unicode string "double bucky backslash 
vertical bar", which Samba then doesn't recognise.
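
To make the failure concrete, here's a rough sketch in Python. The 
BEST_FIT table and the function names are my own invention, standing in 
for whatever Windows really does internally; the table just follows 
Martin's description that both characters come out as \xa6, and I'm 
assuming the ANSI code page is cp1252.

```python
# Hypothetical stand-in for the lossy Windows mapping described above:
# both box-drawing characters collapse to 0xA6. This table is an
# assumption taken from the thread, not a real Windows API.
BEST_FIT = {0x2592: 0xA6, 0x2524: 0xA6}


def lossy_encode(name):
    """Unicode name -> 8-bit name, losing information on the way."""
    return name.translate(BEST_FIT).encode("cp1252")


def decode(raw):
    """8-bit name -> Unicode name, as done when the name is passed back."""
    return raw.decode("cp1252")


# What the server really has in the directory.
original = "\u2524\u2592"
server_files = {original}

raw = lossy_encode(original)   # what an 8-bit listdir() would hand back
round_tripped = decode(raw)    # what gets sent back to the server

print(repr(raw), repr(round_tripped))
print(round_tripped in server_files)
```

The name that comes back is two U+00A6 characters, which is not the name 
the server listed, so the lookup misses even though the file is there.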

I don't know what Windows *should* do here. I know it shouldn't do this - 
it breaks some very basic invariants about files and directories, and so 
leads to the kind of confusion utabintarbo suffered. The solution is 
either to apply an information-preserving encoding (UTF-8, say), or to 
refuse to do it at all (i.e., raise an error if there are unencodable 
characters), neither of which is a particularly beautiful solution. I 
think Windows is in a bit of a rock/hard place situation here, poor thing.
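
For what it's worth, both alternatives are easy to sketch with Python's 
own codecs. The helper names are made up for illustration, and cp1252 is 
again my assumption for the 8-bit code page; this says nothing about what 
Windows itself could be made to do.

```python
name = "\u2524\u2592"  # the two box-drawing characters from the thread


def encode_preserving(filename):
    """Option 1: an information-preserving encoding (UTF-8)."""
    return filename.encode("utf-8")


def encode_or_refuse(filename, codepage="cp1252"):
    """Option 2: refuse, i.e. raise if any character is unencodable."""
    return filename.encode(codepage, errors="strict")


# Option 1 round-trips exactly.
assert encode_preserving(name).decode("utf-8") == name

# Option 2 raises instead of silently substituting another character.
try:
    encode_or_refuse(name)
    refused = False
except UnicodeEncodeError:
    refused = True

print(refused)
```

Either way the caller never ends up holding a name that points at a 
different (or nonexistent) file, which is the invariant that got broken.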

Incidentally, for those who haven't come across CP_ACP before, it's not 
yet another character encoding, it's a pseudovalue which means 'the 
system's current default character set'.
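
If you want to see what that default resolves to from Python, something 
like the following should do. It's only a sketch: default_encoding_name 
is a name I made up, and as far as I know locale.getpreferredencoding() 
reports the ANSI code page on Windows and the locale's encoding 
elsewhere.

```python
import codecs
import locale
import sys


def default_encoding_name():
    """Name of the encoding the system currently treats as its default."""
    return locale.getpreferredencoding(False)


encoding = default_encoding_name()
# Look the name up to see which concrete codec Python resolves it to.
print(encoding, "->", codecs.lookup(encoding).name)

if sys.platform == "win32":
    # The "mbcs" codec exists only on Windows and follows CP_ACP.
    print("mbcs ->", codecs.lookup("mbcs").name)
```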

tom

-- 
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies

