Python3.1: gzip encoding with UTF-8 fails

Diez B. Roggisch deets at nospam.web.de
Sun Dec 20 11:52:24 EST 2009


Johannes Bauer wrote:
> Hello group,
> 
> with the following program:
> 
> #!/usr/bin/python3
> import gzip
> x = gzip.open("testdatei", "wb")
> x.write("ä")
> x.close()
> 
> I get a broken .gzip file when decompressing:
> 
> $ cat testdatei |gunzip
> ä
> gzip: stdin: invalid compressed data--length error
> 
> As it only happens with UTF-8 characters, I suppose the gzip module

UTF-8 is not Unicode; it is only one of several encodings of it. Even if
the source encoding above is UTF-8, I'm not sure which encoding is used
for the unicode string when it's written.

> writes a length of 1 in the gzip file header (one character "ä"), but
> then actually writes 2 characters (0xc3 0xa4).
> 
> Is there a solution?

What about writing a bytestring by explicitly encoding the string to 
utf-8 first?

x.write("ä".encode("utf-8"))
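
For completeness, here is a minimal sketch of that suggestion as a full
round trip, assuming Python 3. The helper names write_utf8_gzip and
read_utf8_gzip are made up for illustration and the file goes into a
temporary directory. It also shows that "ä" is one character but two
bytes once encoded as UTF-8.

```python
import gzip
import os
import tempfile


def write_utf8_gzip(path, text):
    # Hypothetical helper: a gzip file opened with "wb" holds bytes,
    # so the str is encoded explicitly before it is written.
    with gzip.open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_utf8_gzip(path):
    # Hypothetical helper: read the decompressed bytes and decode them
    # back to str with the same encoding.
    with gzip.open(path, "rb") as f:
        return f.read().decode("utf-8")


text = "\u00e4"  # the character "ä", written as an escape so the source encoding does not matter
encoded = text.encode("utf-8")

# One character, but two bytes (0xc3 0xa4) in UTF-8.
print(len(text), len(encoded))

path = os.path.join(tempfile.mkdtemp(), "testdatei")
write_utf8_gzip(path, text)
print(read_utf8_gzip(path) == text)
```

Newer Python versions (3.3 and later, so not the 3.1 discussed here)
also let gzip.open take a text mode such as "wt" together with an
encoding argument, which does the encoding step for you.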


Diez


