HTML Encoded Translation

Dave davidworley at gmail.com
Tue Oct 17 15:25:49 EDT 2006


Got it, great. This worked like a charm. I knew I was barking up the
wrong tree with urllib, but I didn't know which tree to bark up...

Thanks!

Fredrik Lundh wrote:
> Dave wrote:
>
> > How can I translate this:
> >
> > gi
> >
> > to this:
> >
> > "gi"
>
> the easiest way is to run it through an HTML or XML parser (depending on
> what the source is).  or you could use something like this:
>
>      import re
>
>      def fix_charrefs(text):
>          def fixup(m):
>              text = m.group(0)
>              try:
>                  if text[:3] == "&#x":
>                      return unichr(int(text[3:-1], 16))
>                  else:
>                      return unichr(int(text[2:-1]))
>              except ValueError:
>                  pass
>              return text # leave as is
>          return re.sub(r"&#?\w+;", fixup, text)
>
>      >>> fix_charrefs("gi")
>      'gi'
>
> also see:
>
>      http://effbot.org/zone/re-sub.htm#strip-html
>
> > I've tried urllib.unencode and it doesn't work.
>
> those are HTML/XML character references, not encoded URL characters.
> 
> </F>
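
For anyone reading this on Python 3: unichr is gone there, so the helper quoted above needs a small change. What follows is only a sketch of how it might be ported, not something from the original post. The sample input "&#34;gi&#34;" is an assumption, since the archive appears to have swallowed the character references in the question. The sketch also shows html.unescape from the standard library (Python 3.4 and later), which handles named entities such as &amp; as well.

```python
import html
import re

def fix_charrefs(text):
    """Replace numeric character references; leave everything else as is."""
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3] == "&#x":
                # hexadecimal form, e.g. &#x22;
                return chr(int(ref[3:-1], 16))
            # decimal form, e.g. &#34;
            return chr(int(ref[2:-1]))
        except (ValueError, OverflowError):
            # not a numeric reference, or not a valid code point
            return ref
    return re.sub(r"&#?\w+;", fixup, text)

# Assumed sample input; the original did not survive in the archive.
sample = "&#34;gi&#34;"
print(fix_charrefs(sample))
print(html.unescape(sample))
```

The regex helper leaves named entities like &amp; untouched, while html.unescape converts them too, so which one fits depends on what the source text contains.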

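To illustrate the difference between encoded URL characters and character references, here is a short Python 3 sketch. The two sample strings are made up for the illustration. urllib.parse.unquote only understands percent escapes, so it passes character references through unchanged, which is presumably why the urllib attempt had no effect.

```python
import html
from urllib.parse import unquote

# Hypothetical sample strings, both meant to encode "gi" in double quotes.
url_encoded = "%22gi%22"      # percent-encoded, as found in URLs
char_refs = "&#34;gi&#34;"    # numeric character references, as found in HTML/XML

print(unquote(url_encoded))      # percent escapes are decoded
print(unquote(char_refs))        # no percent escapes, so nothing changes
print(html.unescape(char_refs))  # character references are decoded
```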


