utf8 and ftplib
Fredrik Lundh
fredrik at pythonware.com
Mon Jun 20 08:27:17 EDT 2005
Richard Lewis wrote:
> OK, I'm still not getting this unicode business.
obviously.
> <document>
> <a>aàáâã</a>
> <e>eèéêë</e>
> <i>iìíîï</i>
> <o>oòóôõ</o>
> <u>oùúûü</u>
> </document>
>
> (If testing, make sure you save this as utf-8 encoded.)
why? that XML snippet doesn't include any UTF-8-encoded characters.
:::
> file = codecs.open(sys.argv[1], "r", "utf-8")
> document = parse(file)
> file.close()
why do you insist on decoding the stream you pass to the XML parser,
when you've already been told that you shouldn't do that? change this
to:
document = parse(sys.argv[1])
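
A minimal sketch of why the parser should get the raw, undecoded stream, written in Python 3 syntax (an assumption; the code in this thread is Python 2). The sample documents and the helper name text_of are made up for illustration. The same content is serialized in two encodings, and the parser recovers identical text from both by reading the XML declaration itself:

```python
from xml.dom.minidom import parseString

# The same one-element document, serialized in two different encodings.
# (Illustrative data, not taken from the original poster's file.)
latin1_bytes = '<?xml version="1.0" encoding="iso-8859-1"?><a>aàáâã</a>'.encode("iso-8859-1")
utf8_bytes = '<?xml version="1.0" encoding="utf-8"?><a>aàáâã</a>'.encode("utf-8")

def text_of(raw):
    # Hand the parser raw bytes; it reads the encoding declaration
    # and does the decoding on its own.
    return parseString(raw).documentElement.firstChild.data

# Different bytes on disk, identical text after parsing.
same_text = text_of(latin1_bytes) == text_of(utf8_bytes)
```
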
> print document.toxml(encoding="utf-8")
this converts the document to UTF-8, and prints it to stdout. if you get
gibberish, your stdout wants some other encoding. if you get
"capital-A-with-tilde" gibberish, your stdout expects ISO-8859-1.
try changing this to:
print document.toxml(encoding=sys.stdout.encoding)
> out_str = unicode2charrefs(document.toxml(encoding="utf-8"))
this converts the document to UTF-8, and then translates the *encoded*
data to character references as if the document had been encoded as
ISO-8859-1. this makes no sense at all, and results in an XML document
full of "capital-A-with-tilde" gibberish.
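
A sketch of the difference, in Python 3 syntax (an assumption; the thread's code is Python 2). The unicode2charrefs function itself is not shown in this message, so both helpers below are hypothetical stand-ins: one builds character references from decoded text, the other from UTF-8 bytes misread as ISO-8859-1, which is the mistake described above.

```python
def charrefs_from_text(text):
    # Start from decoded text and replace only what ASCII cannot hold.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def charrefs_from_utf8_bytes(text):
    # The mistake: encode to UTF-8 first, then treat every byte
    # as if it were one ISO-8859-1 character.
    misread = text.encode("utf-8").decode("iso-8859-1")
    return misread.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "<a>a\u00e0</a>"                 # "a" followed by a-with-grave

right = charrefs_from_text(sample)        # one reference per character
wrong = charrefs_from_utf8_bytes(sample)  # one reference per UTF-8 byte
```

Here right contains the single reference &#224;, while wrong contains &#195;&#160;, and &#195; is capital A with tilde.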
> i.e., does anyone else get two byte sequences beginning with
> capital-A-with-tilde instead of the expected characters?
since you've requested UTF-8 output, "capital A with tilde" is the expected
result if you're directing output to an ISO-8859-1 stream.
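
The effect can be reproduced without any XML at all. A small sketch in Python 3 syntax (an assumption; the thread's code is Python 2), using one character from the sample document:

```python
import unicodedata

text = "\u00e0"                        # a-with-grave, as in the sample document
utf8 = text.encode("utf-8")            # the two bytes a UTF-8 writer emits for it
misread = utf8.decode("iso-8859-1")    # what an ISO-8859-1 stream shows for those bytes

# Name of the first character the ISO-8859-1 reader sees.
first_name = unicodedata.name(misread[0])

# Nothing is lost: reversing the wrong decode recovers the original text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
```

Every accented letter in the sample lies between U+00C0 and U+00FF, so each one encodes to a UTF-8 pair whose first byte is 0xC3, which ISO-8859-1 displays as capital A with tilde.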
> the output file is still wrong.
well, you're messing it up all by yourself. getting rid of all the codecs and
unicode2charrefs nonsense will fix this:
    import sys
    from xml.dom.minidom import parse

    document = parse(sys.argv[1])  # parser decodes
    # ... manipulate document ...
    file = open(..., "w")
    file.write(document.toxml(encoding="utf-8"))  # writer encodes
    file.close()
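
For readers on Python 3, a self-contained sketch of the same round trip. The function name roundtrip, the temporary file names, and the sample document are assumptions made for illustration. The one substantive difference from the code above is that toxml(encoding=...) returns bytes in Python 3, so the output file is opened in binary mode:

```python
import os
import tempfile
from xml.dom.minidom import parse

def roundtrip(src_path, dst_path):
    document = parse(src_path)                       # parser decodes
    # ... manipulate document ...
    with open(dst_path, "wb") as out:                # binary mode: toxml() returns bytes here
        out.write(document.toxml(encoding="utf-8"))  # writer encodes

# Illustrative input, written as UTF-8 bytes to a temporary file.
source = '<?xml version="1.0" encoding="utf-8"?><document><a>a\u00e0\u00e1</a></document>'

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.xml")
    dst = os.path.join(tmp, "out.xml")
    with open(src, "wb") as f:
        f.write(source.encode("utf-8"))
    roundtrip(src, dst)
    with open(dst, "rb") as f:
        written = f.read()
```

The output holds each accented character as a single UTF-8 sequence, with no extra decode or character-reference step in between.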
</F>