Newbie question: Unicode hiccup on reading file i just wrote

Mon Jan 30 17:33:25 EST 2006

Darcy wrote:
> hi all, i have a newbie problem arising from writing-then-reading a 
> unicode file, and i can't work out what syntax i need to read it in.
> 
> the syntax i'm using now (just using quick hack tmp files):
> BEGIN
> f=codecs.open("tt.xml","r","utf8")
> fwrap=codecs.EncodedFile(f,"ascii","utf8")
> try:
>     ss=u''
>     ss=fwrap.read()
>     print ss
>     ## rrr=xml.dom.minidom.parseString(f.read()) # originally
> finally:
>     f.close()
> END
> 
> barfs with this error:
> BEGIN
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in 
> position 5092: ordinal not in range(128)
> END
> 
> any ideas?

You're doing things three times over, and this time that's not even half
as good:

The

f=codecs.open("tt.xml","r","utf8")

gives you a file that will return unicode objects when reading. And

fwrap=codecs.EncodedFile(f,"ascii","utf8")

will wrap a normal, non-encoding-aware file to become an encoding-aware 
one. The result is that reading from the former already yields a 
unicode object, which is then handed to the second wrapper. That wrapper 
expects bytes from the file underneath it, so at best it is useless, and 
with "ascii" as its data encoding it cannot represent a character like 
u'\u2019' anyway.
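
To see that the first wrapper alone already hands back decoded text, here
is a minimal sketch. It assumes a modern Python 3 interpreter, where the
old unicode type is called str and byte strings are bytes. SAMPLE,
read_decoded and demo_decoded_read are made-up names for illustration,
and the temp file stands in for tt.xml.

```python
import codecs
import os
import tempfile

# Hypothetical sample content; U+2019 is the character from the error message.
SAMPLE = u"<doc>it\u2019s</doc>"


def read_decoded(path):
    """Read a UTF-8 file through codecs.open and return the decoded text."""
    f = codecs.open(path, "r", "utf8")
    try:
        return f.read()
    finally:
        f.close()


def demo_decoded_read():
    """Write SAMPLE as UTF-8 bytes to a temp file, then read it back decoded."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "wb") as out:
            out.write(SAMPLE.encode("utf8"))
        return read_decoded(path)
    finally:
        os.remove(path)
```

The value returned is text, not bytes, so there is nothing left for a
second decoding layer to do.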

And then you try and pass that unicode object of yours to the minidom. 
But guess what, the minidom parser expects a (byte) string, as it reads 
the encoding from the XML declaration (defaulting to UTF-8 if there is 
none) and will decode the contents itself. So the passed unicode object 
is converted to a byte string beforehand using the default ASCII codec, 
yielding the exception you see.
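
The failing conversion can be reproduced in isolation. The sketch below
assumes Python 3, where the encode step has to be written out explicitly
(Python 2 did it implicitly whenever a byte string was required).
ascii_encode_error is a hypothetical helper name.

```python
def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by ASCII-encoding text, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# U+2019 (right single quotation mark) has no ASCII representation.
err = ascii_encode_error(u"it\u2019s")
print(err)
```

The printed message has the same shape as the traceback in the question:
the 'ascii' codec, the offending character, its position, and "ordinal
not in range(128)".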

Just don't do any fancy encoding stuff at all. A simple

rrr=xml.dom.minidom.parseString(open("tt.xml").read())

should do.
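
For a self-contained check of that approach, here is a rough sketch,
again assuming Python 3 (hence the explicit "rb" mode to get raw bytes).
RAW_XML, parse_raw and demo_parse_raw are invented for the example, and
the temp file stands in for tt.xml.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical document: UTF-8 bytes with an XML declaration naming the encoding.
RAW_XML = u'<?xml version="1.0" encoding="utf-8"?><doc>it\u2019s</doc>'.encode("utf8")


def parse_raw(path):
    """Hand the undecoded bytes to minidom and let the parser do the decoding."""
    with open(path, "rb") as f:
        return xml.dom.minidom.parseString(f.read())


def demo_parse_raw():
    """Write RAW_XML to a temp file, parse it, and return the root's text."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "wb") as out:
            out.write(RAW_XML)
        dom = parse_raw(path)
        # Join all text children in case the parser produced more than one node.
        return u"".join(node.data for node in dom.documentElement.childNodes)
    finally:
        os.remove(path)
```

The text you get back from the DOM is already decoded, curly apostrophe
included, without any codecs wrapper in sight.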

Diez