Re: BeautifulSoup vs. loose & chars

Andreas Lysdal andelys at riddergarn.dk
Tue Dec 26 13:19:25 EST 2006



Duncan Booth wrote:
> "Felipe Almeida Lessa" <felipe.lessa at gmail.com> wrote:
>
>   
>> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
>>     
>>> So do you want to remove "&" or replace them with "&amp;"? If you
>>> want to replace it try the following;
>>>       
>> I think he wants to replace them, but just the invalid ones. I.e.,
>>
>> This &amp; this & that
>>
>> would become
>>
>> This &amp; this &amp; that
>>
>>
>> No, I don't know how to do this efficiently. =/...
>> I think some kind of regex could do it.
>>
>>     
>
> Since he's asking for valid xml as output, it isn't sufficient just to
> ignore entity definitions: HTML has a lot of named entities such as
> &nbsp; but xml only has a very limited set of predefined named entities.
> The safest technique is to convert them all to numeric escapes except
> for the very limited set also guaranteed to be available in xml. 
>
> Try this:
>
> # Python 2 code (cgi.escape, htmlentitydefs, unichr).
> from cgi import escape
> import re
> from htmlentitydefs import name2codepoint
>
> # Work on a copy so the module's own table is left untouched.
> name2codepoint = name2codepoint.copy()
> name2codepoint['apos'] = ord("'")
>
> # Group 1: decimal reference, group 2: hex reference, group 3: name.
> EntityPattern = re.compile(
>     r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')
>
> def decodeEntities(s, encoding='utf-8'):
>     def unescape(match):
>         code = match.group(1)
>         if code:
>             return unichr(int(code, 10))
>         code = match.group(2)
>         if code:
>             return unichr(int(code, 16))
>         # Raises KeyError for a name missing from name2codepoint.
>         return unichr(name2codepoint[match.group(3)])
>     return EntityPattern.sub(unescape, s)
>
>
> >>> escape(
> ...     decodeEntities("This &amp; this & that &eacute;")).encode(
> ...         'ascii', 'xmlcharrefreplace')
> 'This &amp; this &amp; that &#233;'
>
>
> P.S. apos is handled specially as it isn't technically a
> valid html entity (and Python doesn't include it in its entity
> list), but it is an xml entity and recognised by many browsers so some
> people might use it in html.
>   
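
For anyone reading this on Python 3: the modules used above were renamed
or removed (htmlentitydefs is now html.entities, unichr is chr, and
cgi.escape gave way to html.escape). Below is a rough sketch of the same
decode-then-escape idea. The names decode_entities and to_xml_safe_ascii
are made up for illustration, and leaving unknown entity names untouched
is my own assumption, not something the quoted code does.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy the table and add 'apos', mirroring the quoted code.
ENTITY_CODEPOINTS = dict(name2codepoint)
ENTITY_CODEPOINTS["apos"] = ord("'")

# Group 1: decimal reference, group 2: hex reference, group 3: entity name.
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));")


def decode_entities(text):
    """Replace character and entity references with the characters they name."""
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            if hexadecimal:
                return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Code point out of range: leave the reference as written.
            return match.group(0)
        codepoint = ENTITY_CODEPOINTS.get(name)
        # Assumption: an unknown name is left as written instead of raising.
        return match.group(0) if codepoint is None else chr(codepoint)
    return ENTITY_PATTERN.sub(unescape, text)


def to_xml_safe_ascii(text):
    """Decode every reference, then re-escape using only &amp; &lt; &gt;
    and numeric character references."""
    escaped = escape(decode_entities(text), quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_safe_ascii("This &amp; this & that &eacute;"))
```

quote=False keeps the behaviour of the old cgi.escape default, which
leaves quote characters alone.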
Hey, I found this site:
http://www.htmlhelp.com/reference/html40/entities/symbols.html

I hope that's what you mean.
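
As far as I know, Python's html.entities.name2codepoint table (Python 3
naming) holds the same HTML 4 entity names as lists like that one, so the
"some kind of regex" idea from earlier in the thread could look roughly
like the sketch below. It escapes only the loose "&" characters and
rewrites HTML-only names as numeric references without decoding anything.
fix_ampersands and XML_PREDEFINED are names I made up here, and treating
an unknown name as a loose "&" is an assumption.

```python
import re
from html.entities import name2codepoint

# The five entity names that XML predefines.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# An ampersand, optionally followed by the rest of a well-formed reference.
AMP_PATTERN = re.compile(
    r"&(#\d+;|#[xX][\da-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)?")


def fix_ampersands(text):
    """Escape loose ampersands and turn HTML-only entities into numeric ones."""
    def repl(match):
        ref = match.group(1)
        if ref is None:
            return "&amp;"              # loose ampersand
        if ref.startswith("#"):
            return match.group(0)       # numeric reference, keep as is
        name = ref[:-1]
        if name in XML_PREDEFINED:
            return match.group(0)       # already valid in XML
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        # Assumption: an unknown name means the "&" was meant literally.
        return "&amp;" + ref
    return AMP_PATTERN.sub(repl, text)


print(fix_ampersands("This &amp; this & that &eacute;"))
```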

/Scripter47


 




