the tostring and XML methods in ElementTree

Sun May 7 18:11:58 EDT 2006

mirandacascade at yahoo.com wrote:
> Question 1: assuming the following:
>  a) beforeCtag.text gets assigned a value of 'I\x92m confused'
>  b) afterRoot is built using the XML() method where the input to the
> XML() method is the results of a tostring() method from beforeRoot
> Are there any settings/arguments that could have been modified that
> would have resulted in afterCtag.text being of type <type 'str'> and
> afterCtag.text when printed displays:
>  I'm confused
>
> ?

The str type (also known as a byte string) is only suitable for ASCII
text. chr(0x92) is outside ASCII, so you should use unicode strings or
you\x92ll be confused :)

>>> print u"I\u2019m not confused"
I’m not confused
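
To spell out where the \x92 probably comes from, here is a small sketch.
It assumes the byte originated in Windows-1252 (cp1252) text, which is my
guess and not something the original question states; in that code page
0x92 is the curly apostrophe. The names raw, text and apostrophe_name are
made up for illustration.

```python
import unicodedata

raw = b"I\x92m confused"   # the byte string from the question

# Assumption: the bytes were produced by a Windows-1252 (cp1252) source.
text = raw.decode("cp1252")

# The second character is now a real Unicode character with a name.
apostrophe = text[1]
apostrophe_name = unicodedata.name(apostrophe)

# ASCII, by contrast, cannot represent byte 0x92 at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Once the text is decoded like this, the rest of the program can work with
U+2019 and never has to care which code page the bytes came from.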

> Question 2: Does the fact that resultToStr is equal to resultToStr2
> mean that an encoding of utf-8 is the defacto default when no encoding
> is passed as an argument to the tostring method, or does it only mean
> that in this particular example, they happened to be the same?

No. The de jure default encoding is ASCII; de facto, people try to
change it, but that's not a good idea. I'm not sure how you got the
strings to be the same, but it's definitely a host-specific result. When
I repeat your interactive session I get a different resultToStr at this
point:

>>> afterRoot = ElementTree.XML(resultToStr)
>>> resultToStr
'<beforeRoot><C>I&#146;m confused</C></beforeRoot>'
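
For comparison, a sketch of the same round trip done with a unicode string
instead of a byte string. It assumes ElementTree is importable from
xml.etree (its standard library location since Python 2.5; the standalone
package uses a different import), and the snake_case names are mine. With
no encoding argument the serializer emits ASCII and writes the apostrophe
as a numeric character reference; with "utf-8" it writes the encoded bytes
directly, so the two results differ.

```python
from xml.etree import ElementTree

# Build the tree with a unicode string, not a byte string.
before_root = ElementTree.Element("beforeRoot")
before_c = ElementTree.SubElement(before_root, "C")
before_c.text = u"I\u2019m confused"

# Default serialization is ASCII with character references.
as_ascii = ElementTree.tostring(before_root)

# Explicit UTF-8 serialization carries the encoded bytes instead.
as_utf8 = ElementTree.tostring(before_root, encoding="utf-8")

# Parsing either result gives back the original unicode text.
after_c_text = ElementTree.XML(as_ascii).find("C").text
after_c_text_utf8 = ElementTree.XML(as_utf8).find("C").text
```

Either serialization parses back to the same text, so the choice of output
encoding only affects the bytes on the wire, not what the parser returns.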

> 3) would it be possible to construct a statement of the form
>
> newResult = afterCtag.text.encode(?? some argument ??)
>
> where newResult was the same as beforeCtag.text?  If so, what should
> the argument be to the encode method?

Dealing with unicode doesn't require you to pollute your code with
encode calls. Just open the file using the codecs module and then write
unicode strings directly:

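import codecs
# codecs.open returns a file object that encodes unicode to UTF-8 on write
fileHandle = codecs.open('c:/output1.text', 'w', 'utf-8')
fileHandle.write(u"I\u2019m not confused, because I'm using unicode")
fileHandle.close()  # make sure the data is flushed to disk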
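
A slightly fuller sketch of the same idea, writing to a temporary
directory and reading the file back to check what landed on disk. The
tempfile path and the variable names are my own choices for illustration,
not anything from the original session.

```python
import codecs
import os
import tempfile

# Hypothetical location: a fresh temp directory instead of c:/output1.text
path = os.path.join(tempfile.mkdtemp(), "output1.text")
text = u"I\u2019m not confused, because I'm using unicode"

# Write unicode directly; the codecs wrapper does the UTF-8 encoding.
with codecs.open(path, "w", "utf-8") as file_handle:
    file_handle.write(text)

# Read the raw bytes to see what was actually stored.
with open(path, "rb") as raw_handle:
    raw = raw_handle.read()

# Read through codecs again to get unicode back without calling decode.
with codecs.open(path, "r", "utf-8") as file_handle:
    round_tripped = file_handle.read()
```

The encoding is named once, at the point where the file is opened, and
everything between the open calls deals in unicode only.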

> 4) what is the second character in encodedCtagtext (the character with
> an ordinal value of 194)?

That is a byte with value 194 (0xC2), not a character. It is the first
of the two bytes that encode Unicode code point U+0092 in UTF-8:

>>> '\xc2\x92'.decode("utf-8")
u'\x92'

This code point is a C1 control character with no name in the Unicode
database, so you shouldn't produce it:

>>> import unicodedata
>>> unicodedata.name('\xc2\x92'.decode("utf-8"))

Traceback (most recent call last):
  File "<pyshell#40>", line 1, in -toplevel-
    unicodedata.name('\xc2\x92'.decode("utf-8"))
ValueError: no such name
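
The same checks gathered into one sketch, in case it helps to see the
byte values next to each other. The variable names are mine. The last
line also touches question 3: latin-1 maps code points 0 to 255 straight
onto single bytes, so it is one codec that turns U+0092 back into the
byte 0x92, though I'd still rather fix the data than rely on that.

```python
import unicodedata

code_point = u"\x92"

# UTF-8 needs two bytes for U+0092; the first one is 194 (0xC2).
encoded = code_point.encode("utf-8")
byte_values = list(bytearray(encoded))

# The code point is a control character and has no name,
# so ask for a fallback instead of triggering ValueError.
category = unicodedata.category(code_point)
fallback = unicodedata.name(code_point, "<no name>")

# latin-1 maps U+0092 back onto the single byte 0x92.
latin1 = code_point.encode("latin-1")
```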