[issue40172] ZipInfo corrupts file names in some old zip archives

Tue Mar 22 02:31:22 EDT 2022

Gregory P. Smith <greg at krypto.org> added the comment:

I examined the Lib/zipfile.py code and the existing behavior makes sense. When writing, Python's zipfile module produces modern zipfiles: if a filename is not ASCII, it sets the utf-8 flag and stores the name as utf-8. This is desirable for use with all normal zip implementations from the past 10-15 years.
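
A minimal sketch of the writing behavior, assuming an in-memory archive and two made-up entry names. It only inspects general purpose bit 11 (0x800), the utf-8 flag, on the entries after reading them back:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as utf-8

# Write one ASCII name and one non-ASCII name (both names are illustrative).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"ascii name")
    zf.writestr("café.txt", b"non-ascii name")

# Read the archive back and record which entries carry the utf-8 flag.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}
```

With current CPython the ASCII entry should come back without the flag and the non-ASCII entry with it.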

When decoding a zipfile, if the utf-8 flag is not set, we assume cp437 per the PKWARE APPNOTE.TXT "spec". So our reading is correct as well, even for very old files.
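
To illustrate the reading side, here is a rough sketch that fakes a legacy archive. The byte patching is only a trick for this example (zipfile itself would not write such a name), and the entry name is made up:

```python
import io
import zipfile

# Write an entry with an ASCII placeholder name, so the utf-8 flag stays unset.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("caf_.txt", b"x")

# Patch the raw name bytes (local header and central directory) to emulate
# an old tool that stored the name as cp437. The byte length is unchanged.
legacy_bytes = buf.getvalue().replace(b"caf_.txt", "café.txt".encode("cp437"))

with zipfile.ZipFile(io.BytesIO(legacy_bytes)) as zf:
    info = zf.infolist()[0]
    flag_set = bool(info.flag_bits & 0x800)  # utf-8 flag
    decoded_name = info.filename             # decoded as cp437 by zipfile
```

Because the flag is not set, the name goes through the cp437 path and comes back intact.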

This is being strict in what we produce and lenient in what we accept. Caveats? Yes:

If someone does need to produce zipfiles for ancient software that does not support utf-8, and that software also does not treat the unknown utf-8 flag as an error condition, it will misinterpret non-ASCII names and show them corrupted.

Similarly, even if names are written as cp437 (as PR 19335 would do), an old zip implementation that blindly uses the user's locale encoding instead of cp437 will always see corrupt names. (aka mojibake?)
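
Both caveats can be reproduced with plain string codecs, no archive needed. In this sketch the file name and the choice of cp1252 as the "locale encoding" are assumptions made purely for illustration:

```python
name = "café.txt"

# Caveat 1: name stored as utf-8, but the reader ignores the flag
# and decodes the bytes as cp437.
utf8_read_as_cp437 = name.encode("utf-8").decode("cp437")

# Caveat 2: name stored as cp437, but the reader decodes with a
# locale codec instead (cp1252 is just an example locale).
cp437_read_as_cp1252 = name.encode("cp437").decode("cp1252")

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_read_as_cp437))
print(ascii(cp437_read_as_cp1252))
```

In neither case does the reader get the original name back, which is the corruption described above.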

These are not what I'd expect to be normal use cases. Do you have a common practical example of a need for this?

(The PR on issue28080 provides a way to _read_ legacy zip files that used a codec other than cp437 if you know what it was.)
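
Independent of that PR, a common workaround is to undo the cp437 decoding by hand when the real codec is known. The sketch below assumes a legacy archive whose names were stored as cp866. `fix_name` is a hypothetical helper, not part of zipfile, and the archive is faked by patching bytes:

```python
import io
import zipfile

def fix_name(info, codec):
    """Hypothetical helper: re-decode a legacy entry name with a known codec."""
    if info.flag_bits & 0x800:
        return info.filename  # utf-8 flag set, name is already correct
    # zipfile decoded the raw bytes as cp437; reverse that, then decode properly.
    return info.filename.encode("cp437").decode(codec)

# Fake a legacy archive: ASCII placeholder name patched to cp866 bytes
# of the same length (for illustration only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("______.txt", b"x")
raw = buf.getvalue().replace(b"______.txt", "привет.txt".encode("cp866"))

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.infolist()[0]
    as_read = info.filename              # mojibake from the cp437 assumption
    recovered = fix_name(info, "cp866")  # original name
```

The round trip works because Python's cp437 codec maps every byte value, so encoding the mojibake back to cp437 recovers the original bytes.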

---

https://www.loc.gov/preservation/digital/formats/fdd/fdd000354.shtml may also be of interest regarding the zip format.

----------
nosy: +gregory.p.smith

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40172>
_______________________________________