ElementTree.fromstring(unicode_html)

globophobe globophobe at gmail.com
Fri Jan 25 21:11:28 EST 2008


This is likely an easy problem; however, I couldn't think of
appropriate keywords for Google:

Basically, I have some raw data that needs to be preprocessed before
it is saved to the database, e.g.:

In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'

I need to turn this into an ElementTree element, but some of the data
is Japanese whereas the rest is HTML. This string contains a <br />.

In [2]: e = ET.fromstring('<data>%s</data>' % unicode_html)
In [3]: e.text
Out[3]: u'\u3055\u3080\u3044\uff0f\n\u3064\u3081\u305f\u3044\n'
In [4]: len(e)
Out[4]: 0

How can I decode the Unicode HTML <br /> into a string that
ElementTree can parse as an element?
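
One possible reading of the result above (len(e) == 0) is that the
<br /> reaches the parser entity-escaped, as &lt;br /&gt;, so it ends up
in e.text instead of becoming a child element. The following is only a
rough sketch under that assumption. The sample string escaped_html and
the helper to_element are made up for illustration; unescape comes from
xml.sax.saxutils in the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Hypothetical sample: assumes the <br /> arrived entity-escaped.
escaped_html = u'\u3055\u3080\u3044\uff0f\r\n&lt;br /&gt;\u3064\u3081\u305f\u3044\r\n'


def to_element(fragment, tag='data'):
    """Wrap a fragment in <tag> and parse it from UTF-8 encoded bytes."""
    xml = u'<%s>%s</%s>' % (tag, fragment, tag)
    # Passing encoded bytes sidesteps implicit ASCII encoding of non-ASCII text.
    return ET.fromstring(xml.encode('utf-8'))


# Escaped markup is parsed as character data: no child elements.
as_text = to_element(escaped_html)

# After unescaping &lt; and &gt;, the parser sees a real <br /> element.
as_markup = to_element(unescape(escaped_html))

print(len(as_text), len(as_markup))
print(repr(as_markup[0].tag), repr(as_markup[0].tail))
```

This only works if the unescaped fragment is well-formed XML: a bare
<br> without the closing slash, or a stray & produced by unescaping
&amp;, would make ET.fromstring raise a parse error.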
