[Python-Dev] zipfile and unicode filenames

"Martin v. Löwis" martin at v.loewis.de
Sun Jun 10 18:45:51 CEST 2007


> I don't think always encoding them to utf-8 (and using bit 11 of
> flag_bits) is a good idea, since there's a chance to create archives
> that won't be correctly readable by programs not supporting this bit
> (it's no secret that currently some programs just assume that
> filenames are encoded using one of system encodings).

I think it is also fairly uniformly agreed that these programs are
incorrect; the official encoding of file names in a zip file is
Windows/DOS code page 437.

> This is too
> complex and hazy to implement. Even if I know what is the situation on
> Windows (i.e. using OEM, also called DOS encoding, but I'm not sure
> how to determine its codec name from within python apart from calling
> GetConsoleCP), I'm totally unaware of the situation on other operating
> systems.

I don't think that the situation on Windows is that the OEM code page
should be used. Instead, CP 437 should be used, independent of the OEM
code page.
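
To illustrate why the choice matters: the same byte maps to different characters under different DOS code pages, so a reader that decodes with its local OEM code page would see different names on different machines. A small sketch in Python 3 syntax, with cp866 picked only as an example of another OEM code page:

```python
# The same raw byte string decodes differently depending on the code page.
raw_name = b"L\x94wis.txt"

as_cp437 = raw_name.decode("cp437")  # the code page named above as the official one
as_cp866 = raw_name.decode("cp866")  # example of a different (Cyrillic) OEM code page

print(as_cp437)
print(as_cp866)
```

Decoding with a fixed CP 437 gives the same result everywhere, which is the point of not consulting the OEM code page.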

>> The tricky question is what to do when reading in zipfiles with
>> non-ASCII characters (and yes, I understand that in your case
>> there were only ASCII characters in the file names).
> 
> I don't think it should be changed.

In Python 3, it will certainly change, since the string type
will be unicode-based. It probably should not change for the
rest of 2.x.

> Current zipfile seems to officially support ascii filenames only
> anyway

That's not true. You can use any byte string you want as the
file name, including non-ASCII strings encoded in CP437.

> +        filename = str(self.filename)

That would be incorrect, as it relies on the system encoding,
which shouldn't be relied upon. Plus, it would allow arbitrary
non-string things as filenames. What it should do instead
(IMO) is to encode in CP437. Bonus points if it falls back
to the UTF-8 feature of zip files if encoding as CP437 fails.
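
A minimal sketch of that approach, written in Python 3 syntax where str is the text type. The helper name encode_zip_filename and the constant UTF8_FLAG are hypothetical and not part of zipfile; the only assumption about the format is that bit 11 of flag_bits marks a UTF-8 file name, as mentioned in the quoted text:

```python
UTF8_FLAG = 0x800  # bit 11 of flag_bits: file name is UTF-8 encoded


def encode_zip_filename(name, flag_bits=0):
    """Return (encoded_name, flag_bits) for a text file name.

    Hypothetical helper: try CP437 first, and fall back to UTF-8
    (setting bit 11) only when the name cannot be encoded in CP437.
    """
    if not isinstance(name, str):
        # Reject arbitrary non-string objects instead of calling str() on them.
        raise TypeError("file name must be a text string")
    try:
        return name.encode("cp437"), flag_bits & ~UTF8_FLAG
    except UnicodeEncodeError:
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


# Usage: names that fit in CP437 stay readable by older tools;
# everything else is written as UTF-8 with the flag set.
for example in ("readme.txt", "L\u00f6wis.txt", "\u0444\u0430\u0439\u043b.txt"):
    encoded, flags = encode_zip_filename(example)
    print(repr(encoded), hex(flags))
```

This never consults the system encoding, and it raises on non-string input rather than silently converting it.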

Regards,
Martin

