URLs and ampersands

Wed Aug 6 03:41:12 EDT 2008

Matthew Woodcraft <mattheww at chiark.greenend.org.uk> wrote:

> Gabriel Genellina wrote:
>> Steven D'Aprano wrote:
> 
>>> I have searched for, but been unable to find, standard library
>>> functions that escapes or unescapes URLs. Are there any such
>>> functions?
> 
>> Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
> 
> I don't see a cgi.unescape in the standard library.
> 
> I don't think xml.sax.saxutils.unescape will be suitable for Steven's
> purpose, because it doesn't process numeric character references (which
> are both legal and seen in the wild in /href/ attributes).
> 

Here's the code I use. It handles decimal and hex entity references as well 
as all html named entities.

import re
from htmlentitydefs import name2codepoint
name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')
def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)

-- 
Duncan Booth http://kupuguy.blogspot.com