Unicode and Zipfile problems

vincent wehren vincent at visualtrans.de
Wed Nov 5 14:21:53 EST 2003

"vincent wehren" <vincent at visualtrans.de> wrote in message
news:bobbhj$j9n$1 at news4.tilbu1.nb.home.nl...
| "Gerson Kurz" <gerson.kurz at t-online.de> wrote in message
| news:3fa8d5ee.304218 at news.t-online.de...
| | AAAAAAAARG I hate the way Python handles unicode. Here is a nice
| | problem for y'all to enjoy: say you have a variable that's unicode
| |
| | directory = u"c:\\temp"
| |
| | It's unicode not because you want it to be, but because it is, for
| | example, read from _winreg, which returns unicode.
| |
| | You do an os.listdir(directory). Note that all filenames returned are
| | now unicode. (Change introduced I believe in 2.3).
| Wrong.
| That's only true if type(directory) gives you <type 'unicode'>.
| If you call str(directory) before doing os.listdir(directory),
| you (in most cases) won't even notice and can continue doing what you want

And when I say "in most cases", I mean all those cases where "directory"
contains no characters outside the ASCII range. In the other cases, you
just go:

directory =

before calling os.listdir(directory)
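
A minimal sketch of that distinction, written in Python 3 syntax so it runs
on a current interpreter (there, str is the unicode type and bytes plays the
role of the 2.3 byte string). The helper name to_bytes, the path containing
U+00E9, and the "latin-1" fallback are assumptions made for illustration;
pick whatever encoding your file system really uses.

```python
def to_bytes(path, encoding="latin-1"):
    """Hypothetical helper: encode as ASCII if possible, else use `encoding`."""
    try:
        # This is roughly what str(directory) attempted in Python 2.
        return path.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII characters need an explicitly chosen encoding.
        return path.encode(encoding)


plain = to_bytes("c:\\temp")          # ASCII only, converts without fuss
accented = to_bytes("c:\\t\u00e9mp")  # contains U+00E9, needs the fallback
```

The ASCII-only path goes through the first branch untouched; only the second
path ever reaches the explicit encoding.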



| do just fine - plus, and that's the good part - you can forget about
| those hacks you suggest later, which some would consider *evil*.
| It'll save you some time too.
| Hey, and leave my Swahili friends alone will ya! ;)
| HTH,
| Vincent Wehren
| |
| | You add the filenames to a zipfile.ZipFile object. Sometimes, you will
| | get this exception:
| |
| | Traceback (most recent call last):
| |   File "collect_trace_info.py", line 65, in CollectTraceInfo
| |     z.write(pathname)
| |   File "C:\Python23\lib\zipfile.py", line 416, in write
| |     self.fp.write(zinfo.FileHeader())
| |   File "C:\Python23\lib\zipfile.py", line 170, in FileHeader
| |     return header + self.filename + self.extra
| | UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 12:
| | ordinal not in range(128)
| |
| | After you have regained your composure, you find the reason: "header"
| | is a byte string generated by struct.pack(), whereas self.filename is
| | a unicode string, because os.listdir returned it as unicode. If
| | "header" contains any byte above 0x7F - which can but need not happen,
| | depending on the type of file - you have an exception waiting for
| | you. Sometimes. Great. (The same will probably occur if the filename
| | contains chars > 0x7F.) The problem does not occur if you have "str"
| | type filenames, because then no back-and-forth conversion takes
| | place.
| |
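
A minimal sketch of that failure, written in Python 3 syntax so it runs on a
current interpreter. Python 3 refuses to add bytes and str at all, so the
hypothetical helper concat_like_python2 spells out the implicit ASCII decode
that Python 2 performed when a byte string met a unicode string. The header
value and the file name are made up for illustration; they are not what
zipfile.py actually packs.

```python
import struct

# Made-up stand-in for the packed zip header; little-endian, so the
# resulting bytes are 0x50 followed by 0x88 (outside the ASCII range).
header = struct.pack("<H", 0x8850)
filename = "kernel32.dll"  # illustrative text (unicode) file name


def concat_like_python2(byte_part, text_part):
    """Hypothetical helper: mimic Python 2's implicit ASCII decode."""
    return byte_part.decode("ascii") + text_part


try:
    concat_like_python2(header, filename)
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the ASCII codec rejects the 0x88 byte

# With a byte-string file name nothing is decoded, so nothing fails.
joined = header + b"kernel32.dll"
```

Whether the real header contains such a byte depends on the file being
added, which would explain why the exception only shows up sometimes.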
| | There is a simple fix: byte-encode the pathname before calling
| | z.write(). Here is sample code that reproduces the problem:
| |
| | import os, zipfile, win32api
| |
| | def test(directory):
| |     z =
| |
| |     for filename in os.listdir(directory):
| |         z.write(os.path.join(directory, filename))
| |     z.close()
| |
| | if __name__ == "__main__":
| |     test(unicode(win32api.GetSystemDirectory()))
| |
| | Note: It might work on your system, depending on the types of files.
| | To fix it, use
| |
| | z.write(os.path.join(directory, filename).encode("latin-1"))
| |
| | But to my thinking, this is a bug in zipfile.py, really.
| |
| | Now, could anybody please just write an
| | "i-don't-care-if-my-app-can-display-klingon-characters" raw byte
| | encoding which doesn't throw any exceptions and doesn't care whether
| | or not the characters are within the 0x7F range? It's OK if I cannot
| | port my batch scripts to Swahili, really.
| |
| |
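
For what it's worth, "latin-1" already comes close to the raw byte encoding
asked for above: it maps code points 0 through 255 one-to-one onto byte
values 0 through 255, so any byte string survives a decode/encode round
trip. A minimal sketch in Python 3 syntax (all variable names are
illustrative):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding with latin-1 never fails: byte N becomes code point N.
as_text = all_bytes.decode("latin-1")

# Encoding back yields the original bytes unchanged.
round_trip = as_text.encode("latin-1")

# The limit: characters above U+00FF have no latin-1 byte value.
try:
    "\u0100".encode("latin-1")
    beyond_latin1_ok = True
except UnicodeEncodeError:
    beyond_latin1_ok = False
```

So it stops complaining about anything in the 0x80 to 0xFF range, but it is
not a free pass for characters beyond U+00FF.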
