HTMLParser ignores unicode entities

Thomas Guettler zopestoller at thomas-guettler.de
Tue Dec 17 07:31:09 EST 2002


Hi!

MS Excel's HTML export encodes Cyrillic characters as numeric entities (&#NNNN;).

Example:
  &#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;   (renders as Отпадъци)

HTMLParser ignores these entities.

In the archives I found the following solution:

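import htmlentitydefs
import htmllib

def get_html_entities():
     # Map every named HTML entity to the unicode character it stands for.
     myentitydefs = htmlentitydefs.entitydefs.copy()
     for k, v in myentitydefs.items():
         if v.startswith('&#'):
             # Value is a character reference such as '&#913;'
             v = int(v[2:-1])
         else:
             # Value is a single Latin-1 byte
             v = ord(v)
         # Assign in both cases, not only in the else branch
         myentitydefs[k] = unichr(v)
     return myentitydefs

class MSExcelHTMLParser(htmllib.HTMLParser):
     entitydefs = get_html_entities()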

This only works for the named HTML entities.
I could add an entry for every Unicode character,
but there are far too many of them, so I don't think
that is the best solution.

Does anyone know how I can parse HTML files containing
Unicode entities?

  thomas
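
One possible direction, given only as a rough sketch: a numeric entity
already carries its code point, so no lookup table is needed. The example
below assumes the html.parser module of present-day Python 3 rather than the
htmllib module used above. The class name ExcelTextCollector and its text()
helper are made up for illustration, and it has not been run against a real
Excel export.

```python
from html.parser import HTMLParser


class ExcelTextCollector(HTMLParser):
    """Hypothetical helper: collects text and decodes numeric entities."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref()
        # instead of decoding the references silently.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name arrives without '&#' and ';', e.g. '1054' or 'x41E'.
        if name[:1] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        try:
            self.parts.append(chr(codepoint))
        except (ValueError, OverflowError):
            # Not a valid code point: keep the reference as written.
            self.parts.append('&#%s;' % name)

    def text(self):
        return ''.join(self.parts)


parser = ExcelTextCollector()
parser.feed('<td>&#1054;&#1090;&#1087;&#1072;&#1076;&#1098;&#1094;&#1080;</td>')
parser.close()
print(parser.text())
```

Named entities such as &amp; would still need handle_entityref(), which
this sketch leaves out.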



