html escape sequences

Will McGugan news at NOwillmcguganSPAM.com
Fri Mar 18 06:53:27 EST 2005


Leif K-Brooks wrote:
> Will McGugan wrote:
> 
>> I'd like to replace HTML escape sequences, like &nbsp; and &#39;, with 
>> single characters. Is there a dictionary defined somewhere I can use 
>> to replace these sequences?
> 
> 
> How about this?
> 
> import re
> from htmlentitydefs import name2codepoint
> 
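> # Matches decimal (&#39;), hexadecimal (&#x27;) and named (&nbsp;) references
> _entity_re = re.compile(r'&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));')
> 
> def _repl_func(match):
>     dec, hexa, name = match.groups()
>     if dec:  # Decimal character reference
>         return unichr(int(dec))
>     if hexa:  # Hexadecimal character reference
>         return unichr(int(hexa, 16))
>     if name in name2codepoint:  # Known named entity
>         return unichr(name2codepoint[name])
>     return match.group(0)  # Unknown name: leave the text unchanged
> 
> def handle_html_entities(string):
>     return _entity_re.sub(_repl_func, string)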
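
For anyone reading this on Python 3, where htmlentitydefs and unichr no longer exist, here is a minimal sketch of the same idea as the quoted snippet. It assumes Python 3.4 or later. The names _ENTITY_RE and _replace are only illustrative, and the choice to leave unknown or out-of-range references untouched is an assumption of this sketch, not something stated in the thread.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Decimal (&#39;), hexadecimal (&#x27;) or named (&nbsp;) references.
_ENTITY_RE = re.compile(r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));")


def _replace(match):
    """Return the character for one matched reference, or the text as-is."""
    dec, hexa, name = match.groups()
    try:
        if dec:
            return chr(int(dec))
        if hexa:
            return chr(int(hexa, 16))
    except (ValueError, OverflowError):
        # Code point outside the Unicode range: keep the original text.
        return match.group(0)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    # Unknown entity name (assumption: leave it unchanged).
    return match.group(0)


def handle_html_entities(text):
    """Replace HTML character references in text with single characters."""
    return _ENTITY_RE.sub(_replace, text)


print(handle_html_entities("caf&eacute; &#39;ok&#x27; &bogus; &amp;"))
# prints: café 'ok' &bogus; &

# The standard library covers the same ground with one call:
print(unescape("caf&eacute; &#39;ok&#x27;"))
# prints: café 'ok'
```

In practice html.unescape is probably the simpler choice, since it also accepts some references written without the trailing semicolon. The hand-rolled version is mainly useful when you want control over what happens to unknown names.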

Many thanks!

Will McGugan


