[issue16310] zipfile: allow surrogates in filenames

Sat Mar 16 04:44:45 CET 2013

Toshio Kuratomi added the comment:

I found some "standards" docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
"D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437."
[..]
"D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification."
[..]

So there's two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 takes up all 256 possible bit patterns available in a byte.

These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes.  If a filename is considered from the POV of a sequence of human characters, it can contain any possible sequence of unicode code points encoded as utf-8.  

The tricky bit: if the bytes are not valid utf-8 then officially the characters should be limited to the 256 characters of Code Page 437.   However, the client tools I've looked at exploit the fact that all bytes are possible to simply save the bytes that make up the filename into the zip file.

----------
nosy: +a.badger

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue16310>
_______________________________________