HTML Encoded Translation

Fredrik Lundh fredrik at pythonware.com
Tue Oct 17 14:26:17 EDT 2006


Dave wrote:

> How can I translate this:
> 
> &#103;&#105;
> 
> to this:
> 
> "gi"

the easiest way is to run it through an HTML or XML parser (depending on 
what the source is).  or you could use something like this:

     import re

     def fix_charrefs(text):
         def fixup(m):
             text = m.group(0)
             try:
                 if text[:3] == "&#x":
                     return unichr(int(text[3:-1], 16))
                 else:
                     return unichr(int(text[2:-1]))
             except ValueError:
                 pass
             return text # leave as is
         return re.sub(r"&#\w+;", fixup, text)

     >>> fix_charrefs("&#103;&#105;")
     u'gi'
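
on Python 3, unichr no longer exists and chr covers the full Unicode range. the following is a minimal sketch of the same regex approach under that assumption; the stricter pattern (numeric references only) and the extra OverflowError guard are additions made for illustration, not part of the function above:

```python
import re

def fix_charrefs(text):
    # Python 3 sketch of the regex approach: chr() replaces unichr().
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3] == "&#x":
                # hexadecimal reference, e.g. &#x67;
                return chr(int(ref[3:-1], 16))
            # decimal reference, e.g. &#103;
            return chr(int(ref[2:-1]))
        except (ValueError, OverflowError):
            # not a valid number, or outside the Unicode range
            pass
        return ref  # leave as is

    # only numeric references (with "#") are matched; named
    # entities such as &amp; pass through untouched
    return re.sub(r"&#\w+;", fixup, text)

decoded = fix_charrefs("&#103;&#105;")
```

malformed references such as "&#zz;" are returned unchanged, which matches the "leave as is" behaviour of the original.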

also see:

     http://effbot.org/zone/re-sub.htm#strip-html

> I've tried urllib.unencode and it doesn't work.

those are HTML/XML character references, not encoded URL characters.
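
to make the difference concrete, here is a small sketch assuming Python 3, where URL decoding lives in urllib.parse.unquote and character references are handled by html.unescape (neither is mentioned in the original question; the sample strings are made up for illustration):

```python
from html import unescape
from urllib.parse import unquote

charrefs = "&#103;&#105;"   # HTML/XML character references
percent = "%67%69"          # URL percent-encoding of the same text

# unquote only understands %XX escapes, so character
# references pass through unchanged
url_decoded_charrefs = unquote(charrefs)

# unescape resolves numeric (and named) character references
html_decoded_charrefs = unescape(charrefs)

# and the reverse: percent escapes mean nothing to unescape
url_decoded_percent = unquote(percent)
html_decoded_percent = unescape(percent)
```

each function only decodes its own escape syntax, which is why a URL-decoding call leaves the references untouched.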

</F>



