ElementTree: How to return only unicode?

Sun Mar 15 05:48:02 EDT 2009

Torsten Bronger wrote:
> Hello there!

and right back at you!

> Stefan Behnel writes:
> 
>> Torsten Bronger wrote:
>>
>>> [...]
>>>
>>> My problem is that if there is only ASCII, these methods return
>>> ordinary strings instead of unicode.  So sometimes I get str,
>>> sometimes I get unicode.  Can one change this globally so that
>>> they only return unicode?
>> That's a convenience measure to reduce memory and processing
>> overhead.
> 
> But is this really worth the inconsistency of having partly str and
> partly unicode, given that the common origin is unicode XML data?

Yes. It makes no difference in almost all use cases, as long as you assume Py2
string handling semantics. In Py3, you will always get Unicode strings anyway.

>> Could you explain why this is a problem for you?
> 
> I feed ElementTree's output to functions in the unicodedata module.
> And they want unicode input.  While it's not a big deal to write
> e.g. unicodedata.category(unicode(my_character)), I find this rather
> wasteful.

I just looked at the code. It seems that you can use your own
XMLTreeBuilder subclass and override the "._fixtext()" method like this:

    def _fixtext(self, text):
        # return the text unchanged instead of trying to encode it to ASCII
        return text

Then pass an instance of that as "parser" when parsing in ElementTree. That
should do the trick.
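
For completeness, here is a rough sketch of how the whole thing might be
wired together. The names UnicodeParser and parse_unicode are made up for
this example, and the fallback between XMLParser and XMLTreeBuilder is my
assumption about which name a given ElementTree version exposes. On Py3 the
"._fixtext()" hook is simply never called, so the override is harmless there.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: newer ElementTree versions call the parser class XMLParser,
# older ones XMLTreeBuilder. Use whichever one is available.
ParserBase = getattr(ET, "XMLParser", None) or ET.XMLTreeBuilder


class UnicodeParser(ParserBase):  # hypothetical name
    def _fixtext(self, text):
        # Skip the ASCII down-conversion and hand the text back as-is.
        return text


def parse_unicode(source):  # hypothetical helper
    # A parser instance is only good for one document,
    # so create a fresh one for each parse.
    return ET.parse(source, parser=UnicodeParser()).getroot()


root = parse_unicode(io.BytesIO(b"<doc>plain ascii</doc>"))
print(type(root.text))
```

Since "._fixtext()" is a private method, this may well break with a
different ElementTree version, so treat it as a workaround rather than a
supported API.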

Stefan