[Python-Dev] zipfile and unicode filenames

Alexey Borzenkov snaury at gmail.com
Sun Jun 10 12:40:19 CEST 2007


> > Also note that I'm trying to ask if zipfile should be improved, how it
> > should be improved, and this possible improvement is not even for me
> > (because now I know how zipfile behaves and I will work correctly with
> > it, but someone else might stumble upon this very unexpectedly).
> If you want to come up with a patch: sure. The zipfile module should
> handle Unicode strings, encoding them in the encoding that the ZIP
> specification defines (both the formal one, and the
> informal-defined-by-pkwares-implementation).

I don't think always encoding them to utf-8 (and setting bit 11 of
flag_bits) is a good idea, since that can create archives that won't
be read correctly by programs that don't support this bit (it's no
secret that some programs currently just assume that filenames are
encoded using one of the system encodings). Encoding them properly is
too complex and hazy to implement: even though I know what the
situation is on Windows (i.e. it uses the OEM, also called DOS,
encoding, but I'm not sure how to determine its codec name from
within python apart from calling GetConsoleCP), I'm totally unaware
of the situation on other operating systems.
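
To make the trade-off concrete, here is a minimal sketch of one
possible policy: keep pure-ascii names untouched and only fall back to
utf-8 (setting bit 11) when a name cannot be expressed in ascii. The
names encode_zip_filename and UTF8_FLAG are hypothetical, chosen for
illustration; this is not what zipfile does in this revision.

```python
# Bit 11 (0x800) of the general purpose bit flag marks a utf-8 filename.
UTF8_FLAG = 0x800  # hypothetical constant name, for illustration only


def encode_zip_filename(filename, flag_bits=0):
    """Return (encoded_name, flag_bits) for a unicode filename.

    Hypothetical helper: ascii names are stored as-is, so readers that
    ignore bit 11 still see them correctly; anything else is stored as
    utf-8 with bit 11 set.
    """
    try:
        return filename.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


plain = encode_zip_filename(u"readme.txt")
accented = encode_zip_filename(u"r\xe9sum\xe9.txt")
```

With this policy, archives containing only ascii names come out the
same as before; the compatibility question only arises for non-ascii
names.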

> The tricky question is what to do when reading in zipfiles with
> non-ASCII characters (and yes, I understand that in your case
> there were only ASCII characters in the file names).

I don't think it should be changed.

> Ok, now I understand. If filename is a Unicode string, header is
> converted using the system encoding; depending on the exact value
> of header and depending on the system encoding, this may cause
> a decoding error.
>
> This bug has been reported as http://bugs.python.org/1170311

I see. Well, that's all easier now then, as I can just create a patch
for an already existing bug.
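
For reference, the failure can be sketched by spelling out the
implicit conversion explicitly. When a byte string and a unicode
string are concatenated, the byte string is first decoded with the
default encoding; the helper concat_implicit below is a hypothetical
stand-in for that step, ascii is assumed as the default encoding, and
the packed values are made up for illustration.

```python
import struct


def concat_implicit(header, filename, default_encoding="ascii"):
    """Hypothetical stand-in for `header + filename` with a unicode filename.

    The packed header is decoded with the default encoding before the
    two strings are joined, which fails on any byte >= 0x80.
    """
    return header.decode(default_encoding) + filename


# A small field value packs into bytes that are all below 0x80 ...
ok_header = struct.pack("<L", 0x10)
joined = concat_implicit(ok_header, u"a.txt")

# ... but a value such as 0x80 produces a byte that ascii cannot decode.
bad_header = struct.pack("<L", 0x80)
try:
    concat_implicit(bad_header, u"a.txt")
    failed = False
except UnicodeDecodeError:
    failed = True
```

So whether the error shows up depends on the header contents (sizes,
CRC, dates), not on the filename itself, which is why it can appear
even with ascii-only names.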

> > Because that's not supposed to work sanely when self.filename is
> > unicode I'm asking if the right behavior would be to a) disallow
> > unicode filenames in zipfile.ZipInfo, b) automatically convert
> > filename to str in zipfile.ZipInfo, c) leave everything as it is.
> The correct behavior would be b); the difficult details are what
> encoding to use.

Current zipfile seems to officially support ascii filenames only
anyway, so the patch can be as simple as this:

Index: Lib/zipfile.py
===================================================================
--- Lib/zipfile.py	(revision 55850)
+++ Lib/zipfile.py	(working copy)
@@ -252,12 +252,13 @@
             self.extract_version = max(45, self.extract_version)
             self.create_version = max(45, self.extract_version)

+        filename = str(self.filename)
         header = struct.pack(structFileHeader, stringFileHeader,
                  self.extract_version, self.reserved, self.flag_bits,
                  self.compress_type, dostime, dosdate, CRC,
                  compress_size, file_size,
-                 len(self.filename), len(extra))
-        return header + self.filename + extra
+                 len(filename), len(extra))
+        return header + filename + extra

     def _decodeExtra(self):
         # Try to decode the extra field.

This doesn't introduce new features; it just requires filenames to be
encodable as ascii (or whatever the default encoding is).
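
What the added str() call amounts to for a unicode filename can be
sketched as an explicit encode with the default encoding. The helper
header_filename is hypothetical, and ascii is assumed as the default
encoding.

```python
def header_filename(filename, default_encoding="ascii"):
    """Hypothetical equivalent of str(self.filename) for unicode input."""
    return filename.encode(default_encoding)


name = header_filename(u"docs/readme.txt")
name_length = len(name)  # the length that would be written into the header

try:
    header_filename(u"docs/\u0444\u0430\u0439\u043b.txt")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

A non-ascii name is then rejected with a UnicodeEncodeError that
points at the filename, instead of a decode error triggered by the
packed header bytes.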

