Need help on UNICODE conversion
Erik Max Francis
max at alcyone.com
Sun Sep 7 01:45:39 EDT 2003
Bernd Preusing wrote:
> Yes, sorry. Cut & paste was not possible, so I wrote it down
> with some errors, very tired and frustrated :-(
> I had tried to attach a small screenshot, but this is no binary news
> group...
>
> My first fault was to cut off the first 7 bytes, but I had to
> eliminate 8.
Yeah, it looks like it's a NUL-terminated string ('UNICODE\0') followed
by UTF-16 encoded data.
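
A minimal sketch of that layout, written in Python 3 syntax (byte strings are `bytes`, and the `unicode()` call becomes `.decode()`). The helper name `split_tagged_text` is made up for illustration, and the big-endian default is only an assumption based on the sample in this thread, not something the tag itself guarantees:

```python
PREFIX = b"UNICODE\x00"  # 7 ASCII letters plus the NUL terminator = 8 bytes

def split_tagged_text(raw, encoding="utf-16-be"):
    """Strip the 8-byte 'UNICODE\\0' tag and decode the remainder.

    Hypothetical helper; the byte order is assumed big endian by default.
    """
    if not raw.startswith(PREFIX):
        raise ValueError("missing 'UNICODE\\0' prefix")
    return raw[len(PREFIX):].decode(encoding)

# Build a small sample in the same shape as the data discussed here.
sample = PREFIX + "Kommentar".encode("utf-16-be")
print(split_tagged_text(sample))
```

Slicing by `len(PREFIX)` rather than a hard-coded 7 or 8 avoids the off-by-one described above.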
> I had to cut off the beginning, which is "UNICODE\x00".
> The remainder means "Kommentar Unicode *äöüÄÖÜß*"
> (this contains german umlauts at the end)
>
> Now I have a string
> ustring = "\x00K\x00o\x00m....."
I reconstructed this as:
>>> s
'\x00K\x00o\x00m\x00m\x00e\x00n\x00t\x00a\x00r\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e\x00 \x00*\x00\xe4\x00\xf6\x00\xfc\x00\xc4\x00\xd6\x00\xdc\x00\xdf\x00*\x00\r\x00\n\x00\r\x00\n'
> us2 = unicode(ustring, "utf_16")
> yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
> position 48-49: illegal encoding
>
> Strange, because that position is at "00 dc" and not earlier!?
In this kind of situation, you can pass 'replace' as the errors
argument to see what's going on:
>>> unicode(s, 'utf-16', 'replace')
u'\u4b00\u6f00\u6d00\u6d00\u6500\u6e00\u7400\u6100\u7200\u2000\u5500\u6e00\u6900\u6300\u6f00\u6400\u6500\u2000\u2a00\ue400\uf600\ufc00\uc400\ud600\ufffd\ufffd\u2a00\u0d00\u0a00\u0d00\u0a00'
Oops! Those aren't the Unicode code points for Latin letters, so there's
a byte-ordering problem. Since it's encoding a K as '\x00K', that means
it's big-endian UTF-16, so prepend the proper byte order mark and
voila:
>>> import codecs
>>> u = unicode(codecs.BOM_UTF16_BE + s, 'utf-16')
>>> u
u'Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n'
... which I can then encode to Latin-1 and print to see the umlauts and
the sharp S.
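
As for why the error only showed up at position 48: without a byte order mark, the 'utf-16' codec falls back to the machine's native byte order, which was evidently little endian here, judging by the 'replace' output above. Most of the swapped pairs still form valid (if wrong) code units, but '\x00\xdc' read as little endian is 0xDC00, a low surrogate with no high surrogate in front of it, and that is the first pair the decoder cannot accept. Here is a sketch in Python 3 syntax that names the byte order explicitly instead of prepending a BOM. The variable names are mine, and the byte string is rebuilt from the decoded text rather than read from the original file:

```python
text = "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"
s = text.encode("utf-16-be")  # same bytes as the string s in this thread

# Wrong byte order: the first undecodable pair is b'\x00\xdc' (U umlaut),
# which reads as the lone low surrogate 0xDC00.
try:
    s.decode("utf-16-le")
except UnicodeDecodeError as exc:
    failed_at = exc.start
    print("decode failed at byte", failed_at)

# Right byte order, stated explicitly, so no BOM is needed.
u = s.decode("utf-16-be")
latin1 = u.encode("latin-1")  # one byte per character for this text
print(repr(u))
```

Naming 'utf-16-be' directly also avoids depending on the native byte order of whatever machine runs the code.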
> Thanks again
You bet.
--
Erik Max Francis && max at alcyone.com && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \ Give me chastity, but not yet.
\__/ St. Augustine