distinction between unzipping bytes and unzipping a file

Fri Jan 9 18:07:10 EST 2009

On Jan 10, 8:56 am, webcomm <rya... at gmail.com> wrote:
> On Jan 9, 4:12 pm, "Chris Mellon" <arka... at gmail.com> wrote:
>
> > It would really help if you could post a sample file somewhere.
>
> Here's a sample with some dummy data from the web service: http://webcomm.webfactional.com/htdocs/data.zip
>
> That's the zip created in this line of my code...
> f = open('data.zip', 'wb')

Your original problem is identical to that already reported by Chris
Mellon (gratuitous \0 bytes appended to the real archive contents).
Here's the output of the diagnostic gadget that I posted a few minutes
ago:
..........................................................
C:\downloads>python zip_susser_v2.py data.zip
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
                        signature : 'PK\x05\x06'
                    this_disk_num : 0
             central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
  central_dir_overall_num_entries : 1
                 central_dir_size : 50
               central_dir_offset : 844
                     comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
...................................
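
The arithmetic in that output (record at 894, plus the 22-byte end-of-archive record, plus a declared comment size of 0, gives 916 of the 1092 bytes, leaving 176 surplus) suggests a simple workaround: cut the buffer off where the archive says it ends. What follows is only a minimal sketch, written for a current Python 3 rather than the 2.x used in this thread. The names strip_trailing_junk and open_padded_zip are made up for illustration, and it assumes the last 'PK\x05\x06' in the buffer really is the end-of-archive record, which can be wrong if a genuine comment happens to contain that signature.

```python
import io
import struct
import zipfile

EOCD_FORMAT = "<4s4H2LH"                   # same 8 fields as the endArchive tuple
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)   # 22 bytes
EOCD_SIGNATURE = b"PK\x05\x06"


def strip_trailing_junk(raw):
    """Truncate raw just after the end-of-archive record and its declared comment."""
    # Assumption: the last occurrence of the signature is the real record.
    pos = raw.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(raw):
        raise ValueError("no end-of-archive record found")
    fields = struct.unpack(EOCD_FORMAT, raw[pos:pos + EOCD_SIZE])
    comment_size = fields[-1]              # what the archive claims, not what follows
    return raw[:pos + EOCD_SIZE + comment_size]


def open_padded_zip(raw):
    """Open an in-memory archive after discarding whatever follows its declared end."""
    return zipfile.ZipFile(io.BytesIO(strip_trailing_junk(raw)))
```

With the bytes of data.zip in hand, open_padded_zip(raw).read('data') should then return the member contents without complaint about the archive's tail.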

>
> If I open the file it contains as unicode in my text editor (EditPlus)
> on Windows XP, there is ostensibly nothing wrong with it.  It looks
> like valid XML.

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:

>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>>>
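
Since the file carries no BOM, whatever reads it has to be told (or has to guess) the encoding. Here is a rough sketch of how that decision might be wrapped up, again for Python 3. The function name decode_utf16_guess is hypothetical, and falling back to little-endian when no BOM is present is an assumption that happens to fit this file, not a general rule.

```python
import codecs


def decode_utf16_guess(buff):
    """Decode UTF-16 bytes, honouring a BOM if there is one."""
    if buff.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The plain utf_16 codec reads the BOM, picks the byte order, and drops it.
        return buff.decode("utf_16")
    # Assumption: BOM-less data is little-endian, as in the sample above.
    return buff.decode("utf_16_le")


def to_utf8(buff):
    """Re-encode the decoded text as UTF-8, e.g. before sending it elsewhere."""
    return decode_utf16_guess(buff).encode("utf-8")
```

Decoding first and then encoding to whatever the consumer expects is the point here; handing the raw utf_16_le bytes to something that assumes a single-byte or UTF-8 encoding is what produces a stray character after every real one.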

>  But if I return it to my browser with python+django,
> there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

>
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured here... http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".
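
To make the two kinds of "bad character" concrete, here is a small Python 3 illustration. It is not a reconstruction of whatever unzip or the browser did to your data: Python's own codecs accept '\x00' happily, so the U+FFFD case is shown with a byte that UTF-8 genuinely cannot decode. The variable names are just for the example.

```python
import unicodedata

utf16_bytes = "<Stat".encode("utf_16_le")   # b'<\x00S\x00t\x00a\x00t\x00'

# Decoding as a single-byte encoding keeps every NUL: a \x00 after each character.
as_latin1 = utf16_bytes.decode("latin-1")

# A decoder in "replace" mode substitutes U+FFFD for any byte it cannot decode.
# Here the offending byte is a lone \xe9, which is not valid UTF-8 on its own.
replaced = b"caf\xe9".decode("utf-8", "replace")

# The official name of the substituted character.
fffd_name = unicodedata.name(replaced[-1])
```

Either way the cure is the same: decode the bytes with the encoding they were actually written in.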

>
> If I unzip it like this...
> getzip('data.zip', ignoreable=30000)
> ...using the function at... http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
> ...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Cheers,
John