unescape HTML entities

Wed Nov 1 19:32:52 EST 2006

Hi,

Nice module!

I downloaded 2.3 and started to play with it. The files have funny 
names: they are all caps, including the extension.

For example, the main module file is "SE.PY". If you try "import SE", it 
will not work, as Python expects the file extension to be ".py".
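
A possible workaround, as a minimal sketch: lowercase the extensions after unpacking. The helper name lowercase_extensions is made up for illustration and is not part of SE; it assumes that only the extension case matters and leaves the base names untouched.

```python
import os

def lowercase_extensions(directory):
    """Rename files such as 'SE.PY' to 'SE.py'; return the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        base, ext = os.path.splitext(name)
        # Only touch files whose extension contains upper-case letters.
        if ext and ext != ext.lower():
            new_name = base + ext.lower()
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed
```

Calling lowercase_extensions on the directory that holds the unpacked files should make "import SE" find SE.py on a case-sensitive file system.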

Thanks,
Ray

Frederic Rentsch wrote:
> Rares Vernica wrote:
>> Hi,
>>
>> How can I unescape HTML entities like "&nbsp;"?
>>
>> I know about xml.sax.saxutils.unescape(), but it only deals with 
>> "&amp;", "&lt;", and "&gt;".
>>
>> Also, I know about htmlentitydefs.entitydefs, but not only is this 
>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>
>> It has to be in python 2.4.
>>
>> Thanks a lot,
>> Ray
>>
> One way is this:
> 
>  >>> import SE    # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>  >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')    # HTM2ISO.se is included
> 'output_file_name'
> 
> For repeated translations the SE object would be assigned to a variable:
> 
>  >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
> 
> SE objects take and return strings as well as file names, which is 
> useful for translating string variables, for line-by-line translations, 
> and for interactive development or verification. A simple way to check 
> a substitution set is to use its own definitions as test data. The 
> following is a section of the definition file HTM2ISO.se:
> 
> test_string = '''
> &oslash;=(xf8)   #  248  f8
> &ugrave;=(xf9)   #  249  f9
> &uacute;=(xfa)   #  250  fa
> &ucirc;=(xfb)    #  251  fb
> &uuml;=(xfc)     #  252  fc
> &yacute;=(xfd)   #  253  fd
> &thorn;=(xfe)    #  254  fe
> &eacute;=(xe9)
> &ecirc;=(xea)
> &euml;=(xeb)
> &igrave;=(xec)
> &iacute;=(xed)
> &icirc;=(xee)
> &iuml;=(xef)
> '''
> 
>  >>> print HTM_Decoder (test_string)
> 
> ø=(xf8)   #  248  f8
> ù=(xf9)   #  249  f9
> ú=(xfa)   #  250  fa
> û=(xfb)    #  251  fb
> ü=(xfc)     #  252  fc
> ý=(xfd)   #  253  fd
> þ=(xfe)    #  254  fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
> 
> Another feature of SE is modularity.
> 
>  >>> strip_tags = '''
>    ~<(.|\x0a)*?>~=(9)               # one tag to one tab
>    ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
> |                                   # run
>    "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
>    ~\t+~=(32)                       # one or more tabs to one space
>    ~\x20\t+~=(32)                   # one space and one or more tabs to one space
>    ~\t+\x20~=(32)                   # one or more tabs and one space to one space
> '''
> 
>  >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
> 
> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you'd name it 
> together with HTM2ISO.se:
> 
>  >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
> 
> Or, if you have two SE objects, one for stripping tags and one for 
> decoding the ampersands, you can nest them like this:
> 
>  >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
> 
>  >>> print Tag_Stripper (HTM_Decoder (test_string))
>   René est un garçon qui paraît plus âgé.
> 
> Nesting works with file names too, because file names are returned:
> 
>  >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
> 'output_file_name'
> 
> 
> Frederic
> 
> 
>
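
For comparison, the original question can also be approached with the standard library alone. The following is only a sketch, not SE's method: the function name unescape_entities and the regular expression are illustrative choices. It relies on the name2codepoint table, which lives in htmlentitydefs on Python 2.3+ and in html.entities on Python 3, and it leaves unknown references such as "&bogus;" untouched. On Python 2 the input is assumed to be a unicode string (or plain ASCII).

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2.3 and later

try:
    unichr                                      # Python 2
except NameError:
    unichr = chr                                # Python 3

# Matches named (&eacute;), decimal (&#233;) and hex (&#xe9;) references.
_ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_entities(text):
    """Replace HTML character references in text with their characters."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return unichr(int(ref[2:], 16))
            if ref[0] == "#":
                return unichr(int(ref[1:]))
            return unichr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the text as it was.
            return match.group(0)
    return _ENTITY.sub(replace, text)
```

On Python 3.4 and later, html.unescape() from the standard library covers the same ground and more, so a hand-written helper like this mainly makes sense for old interpreters.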