Encoding of file names
Tom Anderson
twic at urchin.earth.li
Fri Dec 9 06:05:28 EST 2005
On Thu, 8 Dec 2005, "Martin v. Löwis" wrote:
> utabintarbo wrote:
>
>> Fredrik, you are a God! Thank You^3. I am unworthy </ass-kiss-mode>
>
> For all those who followed this thread, here is some more explanation:
>
> Apparently, utabintarbo managed to get U+2592 (MEDIUM SHADE, a filled
> 50% grayish square) and U+2524 (BOX DRAWINGS LIGHT VERTICAL AND LEFT, a
> vertical line in the middle, plus a line from that going left) into a
> file name. How he managed to do that, I can only guess: most likely, the
> Samba installation assumes that the file system encoding on the Solaris
> box is some IBM code page (say, CP 437 or CP 850). If so, the byte on
> disk would be \xb4. Where this came from, I have to guess further:
> perhaps it is ACUTE ACCENT from ISO-8859-*.
>
> Anyway, when he used listdir() to get the contents of the directory,
> Windows applies the CP_ACP encoding (known as "mbcs" in Python). For
> reasons unknown to me, the US and several European versions of XP map
> this to \xa6, VERTICAL BAR (I can somewhat see that as meaningful for
> U+2524, but not for U+2592).
>
> So when he then applies isfile to that file name, \xa6 is mapped to
> U+00A6, which then isn't found on the Samba side.
>
> So while Unicode here is the solution, the problem is elsewhere; most
> likely in a misconfiguration of the Samba server (which assumes some
> encoding for the files on disk, yet the AIX application uses a different
> encoding).
Isn't the key thing that Windows is applying a non-roundtrippable
character encoding? If I've understood this right, Samba and Windows are
talking in Unicode, with these (probably quite spurious, but never mind)
U+25xx characters, and Samba is presenting a quite consistent view of the
world: there's a file called "double bucky backslash grey box" in the
directory listing, and if you ask for a file called "double bucky backslash
grey box", you get it. Windows, however, maps that name to the 8-bit
string "double bucky backslash vertical bar", but when you pass *that*
back to it, it gets decoded as the Unicode string "double bucky backslash
vertical bar", which Samba then doesn't recognise.
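
To make the failure concrete, here is a rough Python sketch of the round trip as Martin describes it. It is only a simulation under stated assumptions: the byte \xb1 for the grey box is my guess (Martin only names \xb4), and BEST_FIT and to_ansi_lossy are hypothetical stand-ins for whatever lossy conversion Windows performs, not a real API or a real mapping table.

```python
# Bytes as a DOS-code-page-minded reader would find them on disk.
# \xb4 is the byte Martin mentions; \xb1 is an assumption for the other one.
raw = b"\xb1\xb4"

as_cp437 = raw.decode("cp437")     # the name if the bytes are read as CP 437
as_latin1 = raw.decode("latin-1")  # what an ISO-8859-1 application may have meant

# Hypothetical stand-in for the lossy conversion to the "ANSI" code page.
# The two entries follow the description in this thread, nothing more.
BEST_FIT = {"\u2592": b"\xa6", "\u2524": b"\xa6"}

def to_ansi_lossy(name):
    """Encode to cp1252, substituting look-alikes for unencodable characters."""
    return b"".join(
        BEST_FIT[ch] if ch in BEST_FIT else ch.encode("cp1252") for ch in name
    )

listed = to_ansi_lossy(as_cp437)     # 8-bit name handed to the program
sent_back = listed.decode("cp1252")  # name that is asked for afterwards

print(ascii(as_cp437))   # '\u2592\u2524'
print(ascii(sent_back))  # '\xa6\xa6'
print(sent_back == as_cp437)  # False: the listed name no longer names the file
```

Both original characters collapse onto the same byte, so nothing on the way back can tell them apart, let alone recover them.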
I don't know what Windows *should* do here. I know it shouldn't do this -
this leads to breaking of some very basic invariants about files and
directories, and so to the kind of confusion utabintarbo suffered. The
solution is either to apply an information-preserving encoding (UTF-8,
say), or to refuse to do it at all (i.e., raise an error if there are
unencodable characters), neither of which is a particularly beautiful
solution. I think Windows is in a bit of a rock/hard place situation
here, poor thing.
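
For what it's worth, Python's own codecs show what those two options look like. This is just an illustration using the two characters from this thread and cp1252 as an assumed example of an 8-bit default code page; it says nothing about what Windows itself does.

```python
name = "\u2592\u2524"  # the two characters from the thread

# Option 1: an information-preserving encoding round-trips exactly.
round_tripped = name.encode("utf-8").decode("utf-8")

# Option 2: a strict 8-bit codec refuses rather than guessing a look-alike.
try:
    name.encode("cp1252")
    refused = False
except UnicodeEncodeError:
    refused = True

print(round_tripped == name)  # True
print(refused)                # True
```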
Incidentally, for those who haven't come across CP_ACP before, it's not
yet another character encoding, it's a pseudovalue which means 'the
system's current default character set'.
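
If you want to see what that pseudovalue resolves to on a given machine, something like the following should do. It is a sketch: ansi_code_page_name is a name I made up, it asks the Win32 GetACP() function through ctypes, and it simply returns None on anything that isn't Windows.

```python
import sys

def ansi_code_page_name():
    """Return a label such as 'cp1252' for the current ANSI code page.

    Returns None on non-Windows platforms, where CP_ACP has no meaning.
    """
    if sys.platform != "win32":
        return None
    import ctypes  # imported here so the module loads cleanly elsewhere
    return "cp%d" % ctypes.windll.kernel32.GetACP()

print(ansi_code_page_name())
```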
tom
--
Women are monsters, men are clueless, everyone fights and no-one ever
wins. -- cleanskies