Need help on UNICODE conversion

Erik Max Francis max at alcyone.com
Sun Sep 7 01:45:39 EDT 2003


Bernd Preusing wrote:

> Yes, sorry. Cut & paste was not possible, so I wrote it down
> with some errors, very tired and frustrated :-(
> I had tried to attach a small screenshot, but this is no binary news
> group...
> 
> My first fault was to cut off the first 7 bytes, but I had to
> eliminate 8.

Yeah, it looks like it's a NUL-terminated string ('UNICODE\0') followed
by UTF-16 encoded data.
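
A minimal sketch of checking for that marker before cutting it off, so an off-by-one like the 7-versus-8 slip can't happen silently. It is written for Python 3 (byte strings spelled b'...'), and the helper name strip_unicode_prefix is made up for illustration:

```python
# The 8-byte marker described above: the word UNICODE plus a terminating NUL.
PREFIX = b"UNICODE\x00"

def strip_unicode_prefix(raw):
    """Return the bytes that follow the marker, or raise if it is missing."""
    if not raw.startswith(PREFIX):
        raise ValueError("data does not start with the UNICODE marker")
    # Slice by the marker's real length instead of a hand-counted number.
    return raw[len(PREFIX):]

payload = strip_unicode_prefix(b"UNICODE\x00\x00K\x00o\x00m")
print(repr(payload))
```

Slicing by len(PREFIX) rather than a literal 7 or 8 keeps the count tied to the marker itself.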

> I had to cut off the beginning, which is "UNICODE\x00".
> The remainder means "Kommentar Unicode *äöüÄÖÜß*"
> (this contains german umlauts at the end)
> 
> Now I have a string
> ustring = "\x00K\x00o\x00m....."

I reconstructed this as:

>>> s
'\x00K\x00o\x00m\x00m\x00e\x00n\x00t\x00a\x00r\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e\x00 \x00*\x00\xe4\x00\xf6\x00\xfc\x00\xc4\x00\xd6\x00\xdc\x00\xdf\x00*\x00\r\x00\n\x00\r\x00\n'

> us2 = unicode(ustring, "utf_16")
> yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
> position 48-49: illegal encoding
> 
> Strange, because that position is at "00 dc" and not earlier!?

In this kind of situation, you can use the 'replace' errors directive
to maybe see what's going on:

>>> unicode(s, 'utf-16', 'replace')
u'\u4b00\u6f00\u6d00\u6d00\u6500\u6e00\u7400\u6100\u7200\u2000\u5500\u6e00\u6900\u6300\u6f00\u6400\u6500\u2000\u2a00\ue400\uf600\ufc00\uc400\ud600\ufffd\ufffd\u2a00\u0d00\u0a00\u0d00\u0a00'
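
The two \ufffd entries near the end also account for the reported position 48-49, which is where the bytes '\x00\xdc' sit. Here is a rough sketch, in Python 3 syntax, that splits the last few bytes into 16-bit units under both byte orders and flags the ones in the surrogate range (0xD800-0xDFFF), which are not valid on their own. The helper names code_units and is_surrogate are invented for the example:

```python
import struct

# Bytes for the last few characters of the message: Ö, Ü, ß and the closing *.
tail = b"\x00\xd6\x00\xdc\x00\xdf\x00*"

def code_units(data, byteorder):
    """Split data into 16-bit units; byteorder is '<' (little) or '>' (big)."""
    count = len(data) // 2
    return list(struct.unpack("%s%dH" % (byteorder, count), data))

def is_surrogate(unit):
    """True for values reserved for surrogate pairs, invalid when unpaired."""
    return 0xD800 <= unit <= 0xDFFF

for order in ("<", ">"):
    units = code_units(tail, order)
    print(order, [hex(u) for u in units], [is_surrogate(u) for u in units])
```

Read low byte first, '\x00\xdc' and '\x00\xdf' become 0xDC00 and 0xDF00, both unpaired surrogates, while everything before them maps to some legal (if wrong) character. That is why the decoder only complains that late.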

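Oops!  Those aren't Unicode code points for Latin letters, so there's a
byte ordering problem.  Since it's encoding a K as '\x00K', that means
that it's big-endian UTF-16, while the 'utf-16' codec, finding no byte
order mark, assumed the machine's native (here little-endian) order.
So prepend the proper byte order mark and voila: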

>>> import codecs
>>> u = unicode(codecs.BOM_UTF16_BE + s, 'utf-16')
>>> u
u'Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n'

... which I can convert to Latin-1 and then print to see the umlauts and
the double S.
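
An alternative to prepending the mark is to name the byte order in the codec itself: 'utf-16-be' expects no BOM. A small sketch of that, plus the Latin-1 step, in Python 3 syntax (bytes.decode() in place of unicode()). The byte string here is rebuilt from the expected text rather than taken from the original file, so treat it as a stand-in:

```python
import codecs

# Stand-in for the payload: the expected text, encoded as big-endian UTF-16.
text = u"Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"
s = text.encode("utf-16-be")

# Byte order named explicitly, so no byte order mark is needed.
decoded = s.decode("utf-16-be")

# Same result as prepending the big-endian BOM and using plain 'utf-16'.
via_bom = (codecs.BOM_UTF16_BE + s).decode("utf-16")

# Every character here is below U+0100, so Latin-1 can hold all of them.
latin1 = decoded.encode("latin-1")

print(repr(decoded))
print(repr(latin1))
```

Either route should give the same string; the explicit codec just avoids building a new byte string first.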

> Thanks again

You bet.

-- 
   Erik Max Francis && max at alcyone.com && http://www.alcyone.com/max/
 __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/  \ Give me chastity, but not yet.
\__/  St. Augustine



