[New-bugs-announce] [issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

John Goerzen report at bugs.python.org
Tue Nov 19 21:52:22 EST 2019


New submission from John Goerzen <jgoerzen at users.sourceforge.net>:

The zipfile.py standard library module handles non-UTF8 filenames questionably in several places.  As the ZIP file format predates Unicode by a significant number of years, such filenames are fairly common in archives produced by older software.

Here is a very simple reproduction case:

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the 0xf7 byte is replaced with the three-byte UTF-8 sequence e2 89 88 (U+2248, which is what 0xf7 maps to in cp437); "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
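
The change in bytes can be reproduced without any archive at all.  The following is only a sketch, and it assumes that extraction went down the cp437 code path, which is consistent with the e2 89 88 in the dump above:

```python
# Name bytes as stored in the archive; 0xf7 is the division sign in ISO-8859-1.
raw = b"test\xf7.txt"

# Interpreting the bytes as cp437 maps 0xf7 to U+2248, not to U+00F7.
decoded = raw.decode("cp437")

# Writing that str to a UTF-8 filesystem yields the bytes seen after extraction.
on_disk = decoded.encode("utf-8")

print(ascii(decoded))
print(on_disk.hex())
```

Decoding the same bytes as "latin-1" instead would give U+00F7, which is why the choice of codec matters here.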

The impact within Python programs is equally bad.  Fundamentally, the zipfile interface is broken; it should not try to decode filenames into strings and should instead treat them as bytes and leave potential decoding up to applications.  It appears to try, down various code paths, to decode filenames as ascii, cp437, or utf-8.  However, the ZIP file format was often used on Unix systems as well, which didn't tend to use cp437 (iso-8859-* was more common).  In short, there is no way that zipfile.py can reliably guess the encoding of a filename in a ZIP file, so it is a data-loss bug that it attempts and fails to do so.  It is a further bug that extractall mangles filenames; unzip(1) is perfectly capable of extracting these files correctly.  I'm attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for people who care about correctness.
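
Until such an interface exists, something close to the stored bytes can usually be recovered by undoing the decode.  This is a minimal sketch, not part of zipfile: raw_member_names and UTF8_FLAG are hypothetical names, and it assumes that names without general-purpose flag bit 0x800 were decoded as cp437.  zipfile may also normalize a name after decoding it (for example, path separators), so the result is not guaranteed to match the archive byte for byte.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose flag bit 11: name is stored as UTF-8


def raw_member_names(zf):
    """Yield (ZipInfo, name_bytes) for each member of an open ZipFile.

    Assumption: zipfile decoded unflagged names as cp437, so encoding the
    str back with cp437 reproduces the bytes stored in the archive.
    """
    for info in zf.infolist():
        encoding = "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
        yield info, info.filename.encode(encoding)


# Demo: build a small archive in memory, then patch the stored name so that
# it contains the raw byte 0xf7, mimicking the archive from this report.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("testX.txt", "hi\n")
patched = buf.getvalue().replace(b"testX.txt", b"test\xf7.txt")

with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    names = [raw for _info, raw in raw_member_names(zf)]

print(names)
```

An application that knows the real encoding can then decode these bytes itself, or on a POSIX system pass them to open() unchanged as a bytes path.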

----------
files: t.zip
messages: 357023
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: zipfile: Corrupts filenames containing non-UTF8 characters
type: behavior
Added file: https://bugs.python.org/file48724/t.zip

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38861>
_______________________________________

