html escape sequences

Leif K-Brooks eurleif at ecritters.biz
Fri Mar 18 06:46:20 EST 2005


Will McGugan wrote:
> I'd like to replace html escape sequences, like &nbsp and &#39 with 
> single characters. Is there a dictionary defined somewhere I can use to 
> replace these sequences?

How about this?

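import re
from htmlentitydefs import name2codepoint

# Matches decimal (&#39;), hexadecimal (&#x27;), and named (&nbsp;) references.
_entity_re = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')

def _repl_func(match):
    dec, hex_, name = match.groups()
    if dec: # Decimal numeric character reference
        return unichr(int(dec))
    if hex_: # Hexadecimal numeric character reference
        return unichr(int(hex_, 16))
    if name in name2codepoint: # Named entity
        return unichr(name2codepoint[name])
    return match.group(0) # Unknown name: leave the text unchanged

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)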
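
For anyone reading this on Python 3, where htmlentitydefs is now html.entities and unichr is now chr, the same idea might look roughly like the sketch below. The hexadecimal branch and the pass-through for unknown names are defensive choices in this sketch, and it assumes every numeric reference is a valid code point. It only handles references that end in a semicolon.

```python
import html
import re
from html.entities import name2codepoint

# Decimal (&#39;), hexadecimal (&#x27;), or named (&nbsp;) reference.
_entity_re = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')

def _repl_func(match):
    dec, hexa, name = match.groups()
    if dec:
        return chr(int(dec))          # decimal numeric reference
    if hexa:
        return chr(int(hexa, 16))     # hexadecimal numeric reference
    if name in name2codepoint:
        return chr(name2codepoint[name])  # named entity
    return match.group(0)             # unknown name: leave it untouched

def handle_html_entities(string):
    return _entity_re.sub(_repl_func, string)

# The standard library's html.unescape (Python 3.4+) covers the same
# ground and also accepts references written without the semicolon.
print(repr(handle_html_entities("Tom&nbsp;&amp;&nbsp;Jerry&#39;s")))
print(repr(html.unescape("&nbsp and &#39")))
```

In practice html.unescape is probably the simpler choice on Python 3; the hand-written version is mainly useful if you want to control which references get replaced.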


