HTML Encoded Translation
Dave
davidworley at gmail.com
Tue Oct 17 15:25:49 EDT 2006
Got it, great. This worked like a charm. I knew I was barking up the
wrong tree with urllib, but I didn't know which tree to bark up...
Thanks!
Fredrik Lundh wrote:
> Dave wrote:
>
> > How can I translate this:
> >
> > &#103;&#105;
> >
> > to this:
> >
> > "gi"
>
> the easiest way is to run it through an HTML or XML parser (depending on
> what the source is). or you could use something like this:
>
> import re
>
> def fix_charrefs(text):
>     def fixup(m):
>         text = m.group(0)
>         try:
>             if text[:3] == "&#x":
>                 return unichr(int(text[3:-1], 16))
>             else:
>                 return unichr(int(text[2:-1]))
>         except ValueError:
>             pass
>         return text # leave as is
>     return re.sub("&#?\w+;", fixup, text)
>
> >>> fix_charrefs("&#103;&#105;")
> 'gi'
>
> also see:
>
> http://effbot.org/zone/re-sub.htm#strip-html
>
> > I've tried urllib.unencode and it doesn't work.
>
> those are HTML/XML character references, not encoded URL characters.
>
> </F>
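
For anyone reading this on Python 3, here is a rough sketch of both suggestions from the quoted reply. It assumes an input such as "&#103;&#105;" (the decimal character references for "g" and "i"); the name fix_charrefs_py3 is made up for illustration and is not part of any library. unichr no longer exists in Python 3, so the port uses chr, and the "run it through a parser" route is approximated with the standard library's html.unescape. The last lines show why the urllib approach was the wrong tree: URL unquoting only touches %XX escapes.

```python
import html
import re
from urllib.parse import unquote


def fix_charrefs_py3(text):
    """Hypothetical Python 3 port of fix_charrefs (numeric references only)."""
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3] == "&#x":
                return chr(int(ref[3:-1], 16))  # hexadecimal reference
            return chr(int(ref[2:-1]))          # decimal reference
        except (ValueError, OverflowError):
            return ref  # not a valid numeric reference: leave as is
    return re.sub(r"&#?\w+;", fixup, text)


sample = "&#103;&#105;"  # assumed example input

print(fix_charrefs_py3(sample))   # gi
print(html.unescape(sample))      # gi

# URL unquoting decodes %XX escapes and leaves character references alone.
print(unquote("%67%69"))          # gi
print(unquote(sample))            # &#103;&#105;
```

Note that html.unescape also resolves named references such as &quot; and &amp;, while the port above deliberately leaves them untouched, as the original function does.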