ElementTree.fromstring(unicode_html)

Fri Jan 25 22:08:42 EST 2008

On Jan 26, 1:11 pm, globophobe <globoph... at gmail.com> wrote:
> This is likely an easy problem; however, I couldn't think of
> appropriate keywords for google:
>
> Basically, I have some raw data that needs to be preprocessed before
> it is saved to the database e.g.
>
> In [1]: unicode_html = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f
> \u3044\r\n'
>
> I need to turn this into an elementtree, but some of the data is
> japanese whereas the rest is html. This string contains a <br />.

>>> import unicodedata as ucd
>>> s = u'\u3055\u3080\u3044\uff0f\r\n\u3064\u3081\u305f\u3044\r\n'
>>> [ucd.name(c) if ord(c) >= 128 else c for c in s]
['HIRAGANA LETTER SA', 'HIRAGANA LETTER MU', 'HIRAGANA LETTER I',
'FULLWIDTH SOLIDUS', u'\r', u'\n', 'HIRAGANA LETTER TU', 'HIRAGANA
LETTER ME', 'HIRAGANA LETTER TA', 'HIRAGANA LETTER I', u'\r', u'\n']
>>>

Where in there is the <br /> ??