distinction between unzipping bytes and unzipping a file

John Machin sjmachin at lexicon.net
Sat Jan 10 16:18:14 EST 2009


On Jan 11, 6:15 am, webcomm <rya... at gmail.com> wrote:
> On Jan 9, 6:07 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
> > God^H^H^HGates intended:
>
> > >>> buff = open('data', 'rb').read()
> > >>> buff[:100]
>
> > '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00
> > <\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00
> > 0\x00.\x000\x000\x000\x000\x00
> > <\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00
> > <\x00S\x00t\x00a\x00t\x00'
> > >>> buff[:100].decode('utf_16_le')
>
> There it is.  Thanks.
>
> > u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>
> > >  But if I return it to my browser with python+django,
> > > there are bad characters every other character
>
> > Please consider that we might have difficulty guessing what "return it
> > to my browser with python+django" means. Show actual code.
>
> I did stop and consider what code to show.  I tried to show only the
> code that seemed relevant, as there are sometimes complaints on this
> and other groups when someone shows more than the relevant code.  You
> solved my problem with decode('utf_16_le').  I can't find any
> description of that encoding on the WWW... and I thought *everything*
> was on the WWW.  :)

Try searching using the official name UTF-16LE ... looks like a blind
spot in the approximate matching algorithm(s) used by the search
engine(s) that you tried :-(
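
For what it's worth, Python itself is relaxed about the spelling. A
small sketch (written for a current Python 3 rather than the 2.x
session quoted above, which is my assumption about what readers will
run) showing that the different spellings resolve to the same codec:

```python
import codecs

# Several spellings of the same encoding name; codecs.lookup()
# normalises case, hyphens and underscores before resolving the codec.
for spelling in ('utf_16_le', 'UTF-16LE', 'utf-16-le'):
    print(spelling, '->', codecs.lookup(spelling).name)

# Any of them can be passed to decode().
raw = b'<\x00R\x00e\x00g\x00'
print(raw.decode('UTF-16LE'))
```

So the name to search the web for and the name to hand to decode() can
be the same string.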

> I didn't know the data was utf_16_le-encoded because I'm getting it
> from a service.  I don't even know if *they* know what encoding they
> used.  I'm not sure how you knew what the encoding was.

I actually looked at the raw data. The pattern appeared to be an
alternation of one "meaningful" byte and one zero ('\x00') byte:
=> UTF-16 of some flavour. No BOM ('\xFE\xFF' or '\xFF\xFE') at the
start of the file: => UTF-16BE or UTF-16LE. The first byte is
meaningful: => UTF-16LE.
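
The same eyeball test can be sketched in code. This is only a rough
heuristic under stated assumptions: the function name
guess_utf16_flavour is made up for illustration, it is written for
Python 3, and it only works when the text is mostly in the ASCII range
(so that every second byte really is zero). It is not a general
encoding detector.

```python
import codecs

def guess_utf16_flavour(data):
    """Guess a codec name for bytes that look like UTF-16 text.

    Hypothetical helper: returns 'utf_16' when a BOM is present,
    'utf_16_le' or 'utf_16_be' when the zero bytes line up, else None.
    """
    # A BOM settles it; the plain 'utf_16' codec reads and strips the BOM.
    if data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf_16'
    sample = data[:100]
    if len(sample) < 2:
        return None
    even = sample[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = sample[1::2]   # bytes at offsets 1, 3, 5, ...
    # Meaningful byte first, zero byte second: little-endian.
    if odd.count(b'\x00') == len(odd) and even.count(b'\x00') == 0:
        return 'utf_16_le'
    # Zero byte first, meaningful byte second: big-endian.
    if even.count(b'\x00') == len(even) and odd.count(b'\x00') == 0:
        return 'utf_16_be'
    return None

sample = '<Registration><BalanceDue>0.0000</BalanceDue>'.encode('utf_16_le')
print(guess_utf16_flavour(sample))  # utf_16_le
```

Anything outside the ASCII range (accented letters, CJK text) breaks
the "every second byte is zero" assumption, in which case the function
returns None and you are back to looking at the bytes yourself.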

> > Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html
>
> Probably wouldn't hurt,

Definitely won't hurt. Could even help.

> though reading that HOWTO wouldn't have given
> me the encoding, I don't think.

It wasn't intended to give you the encoding. Just read it.

Cheers,
John


