unescape HTML entities

Tue Oct 31 02:38:28 EST 2006

Thanks a lot for all the answers!
Ray

Frederic Rentsch wrote:
> Rares Vernica wrote:
>> Hi,
>>
>> How can I unescape HTML entities like "&nbsp;"?
>>
>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;", 
>> "&lt;", and "&gt;".
>>
>> Also, I know about htmlentitydefs.entitydefs, but not only is this 
>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>
>> It has to be in Python 2.4.
>>
>> Thanks a lot,
>> Ray
>>
> One way is this:
> 
>  >>> import SE    # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>  >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')    # HTM2ISO.se is included
> 'output_file_name'
> 
> For repeated translations the SE object would be assigned to a variable:
> 
>  >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
> 
> SE objects take and return strings as well as file names which is useful 
> for translating string variables, doing line-by-line translations and 
> for interactive development or verification. A simple way to check a 
> substitution set is to use its definitions as test data. The following 
> is a section of the definition file HTM2ISO.se:
> 
> test_string = '''
> &oslash;=(xf8)   #  248  f8
> &ugrave;=(xf9)   #  249  f9
> &uacute;=(xfa)   #  250  fa
> &ucirc;=(xfb)    #  251  fb
> &uuml;=(xfc)     #  252  fc
> &yacute;=(xfd)   #  253  fd
> &thorn;=(xfe)    #  254  fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
> '''
> 
>  >>> print HTM_Decoder (test_string)
> 
> ø=(xf8)   #  248  f8
> ù=(xf9)   #  249  f9
> ú=(xfa)   #  250  fa
> û=(xfb)    #  251  fb
> ü=(xfc)     #  252  fc
> ý=(xfd)   #  253  fd
> þ=(xfe)    #  254  fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
> 
> Another feature of SE is modularity.
> 
>  >>> strip_tags = '''
>    ~<(.|\x0a)*?>~=(9)               # one tag to one tab
>    ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
> |                                   # run
>    "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
>    ~\t+~=(32)                       # one or more tabs to one space
>    ~\x20\t+~=(32)                   # one space and one or more tabs to one space
>    ~\t+\x20~=(32)                   # one or more tabs and one space to one space
> '''
> 
>  >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
> 
> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you would name 
> it together with HTM2ISO.se:
> 
>  >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
> 
> Or, if you have two SE objects, one for stripping tags and one for 
> decoding the ampersands, you can nest them like this:
> 
>  >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
> 
>  >>> print Tag_Stripper (HTM_Decoder (test_string))
>   René est un garçon qui paraît plus âgé.
> 
> Nesting works with file names too, because file names are returned:
> 
>  >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
> 'output_file_name'
> 
> 
> Frederic
> 
> 
>
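
For comparison with the SE approach quoted above, here is a minimal sketch that uses only the standard library. It relies on the name2codepoint table (in htmlentitydefs on Python 2, html.entities on Python 3), which maps entity names to code points and so runs in the direction the original question asked for. The function name unescape_entities and the regular expression are illustrative assumptions, not part of SE or of any answer in this thread.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2.3 and later
except ImportError:
    from html.entities import name2codepoint    # Python 3

try:
    unichr                                       # Python 2 builtin
except NameError:
    unichr = chr                                 # Python 3: chr() covers Unicode

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY = re.compile(r'&(#[xX]?[0-9a-fA-F]+|\w+);')


def unescape_entities(text):
    """Replace named and numeric character references in text.

    Unknown or malformed references are left unchanged.
    (Hypothetical helper written for this example.)
    """
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == '#x':
                return unichr(int(ref[2:], 16))    # hexadecimal reference
            if ref[0] == '#':
                return unichr(int(ref[1:]))        # decimal reference
            return unichr(name2codepoint[ref])     # named reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                  # keep what we cannot decode
    return _ENTITY.sub(replace, text)
```

On Python 2 the input should be a unicode string, since the replacements are unicode characters. The substitution is a single pass, so "&amp;lt;" becomes "&lt;" rather than "<". On Python 3.4 and later, html.unescape() in the standard library does the same job and also handles the HTML5 entity names.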