Unicode conversion
Edward K. Ream
edream at tds.net
Thu Oct 3 09:10:44 EDT 2002
My app presently will write Unicode in any format the user desires as long
as it is UTF-8 ;-)
Here is the code that I use to translate from the UTF-8 delivered by the Tk
Text widget to the desired encoding:
print `xml_encoding`
# Tk always uses utf-8 encoding.
print `s`,"tk"
s = s.encode("utf-8") # result is a string.
print `s`,"utf-8"
s = s.decode(xml_encoding) # result is unicode.
s = s.encode(xml_encoding) # result is a string.
print `s`,`xml_encoding`
If I start with:
aĂßÉd
a
U+0102 (Latin Capital Letter A with Breve)
U+00DF (Latin Small Letter Sharp S)
U+00C9 (Latin Capital Letter E with Acute)
d
and delete the trailing d, the output is:
u'a\u0102\xdf\xc9\n' tk
'a\xc4\x82\xc3\x9f\xc3\x89\n' utf-8
'a\xc4\x82\xc3\x9f\xc3\x89\n' 'ISO-8859-1'
As you can see, the results of the two encodes are identical. My app writes
the result of the second encode to the file. Viewing a file containing these
characters (say, with MS Word) works properly only if UTF-8 is used. Weird
characters appear when the desired ISO-8859-1 encoding is used.
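
One plausible reading of the identical output, sketched in Python 3 rather than the Python 2 used in this post: ISO-8859-1 maps every byte value to the code point with the same number, so decoding UTF-8 bytes with it never fails, and encoding the result back returns the same bytes. The variable names below are illustrative only; they are not taken from the app.

```python
# Python 3 sketch (the post itself is Python 2, where str is a byte string).
text = "a\u0102\u00df\u00c9\n"   # same characters as the Tk text shown above
xml_encoding = "ISO-8859-1"

utf8_bytes = text.encode("utf-8")

# Each byte 0x00-0xFF becomes the code point with the same number, so this
# decode cannot fail, but it yields mis-read text instead of the original.
misread = utf8_bytes.decode(xml_encoding)

# Encoding the mis-read text restores exactly the UTF-8 bytes we started from.
round_trip = misread.encode(xml_encoding)

print(utf8_bytes)
print(round_trip == utf8_bytes)
```

If this reading is right, the file always contains UTF-8 bytes regardless of the declared encoding, which would match the symptom that only UTF-8 displays correctly.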
BTW, without the first encode/decode pair I can get exceptions in the last
encode.
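
A likely cause of those exceptions, again sketched in Python 3 rather than the post's Python 2: U+0102 has no representation in ISO-8859-1, so encoding the Unicode text directly fails on that character. The `errors` argument of `str.encode` is one standard way to substitute unencodable characters; whether XML character references suit this app is an assumption, not something the post states.

```python
# Python 3 sketch; names are illustrative, not taken from the app.
text = "a\u0102\u00df\u00c9\n"

failed = None
try:
    text.encode("ISO-8859-1")
except UnicodeEncodeError as err:
    # err.start and err.end delimit the slice that could not be encoded.
    failed = text[err.start:err.end]
    print("cannot encode:", repr(failed))

# Replace unencodable characters with XML numeric character references.
fallback = text.encode("ISO-8859-1", errors="xmlcharrefreplace")
print(fallback)
```

Here the sharp s and the accented E encode to single Latin-1 bytes, while the A with breve is written as a numeric reference.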
Can anyone explain what is happening and what I should be doing? I'm totally
confused. Thanks.
Edward
--------------------------------------------------------------------
Edward K. Ream email: edream at tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------