unescape HTML entities
Rares Vernica
rvernica at gmail.com
Tue Oct 31 02:38:28 EST 2006
Thanks a lot for all the answers!
Ray
Frederic Rentsch wrote:
> Rares Vernica wrote:
>> Hi,
>>
>> How can I unescape HTML entities like "&nbsp;"?
>>
>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
>> "&lt;", and "&gt;".
>>
>> Also, I know about htmlentitydefs.entitydefs, but not only is this
>> dictionary the opposite of what I need, it also does not have "&nbsp;".
>>
>> It has to be in python 2.4.
>>
>> Thanks a lot,
>> Ray
>>
> One way is this:
>
> >>> import SE  # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
> >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')  # HTM2ISO.se is included
> 'output_file_name'
>
> For repeated translations the SE object would be assigned to a variable:
>
> >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
>
> SE objects take and return strings as well as file names which is useful
> for translating string variables, doing line-by-line translations and
> for interactive development or verification. A simple way to check a
> substitution set is to use its definitions as test data. The following
> is a section of the definition file HTM2ISO.se:
>
> test_string = '''
> &oslash;=(xf8) # 248 f8
> &ugrave;=(xf9) # 249 f9
> &uacute;=(xfa) # 250 fa
> &ucirc;=(xfb) # 251 fb
> &uuml;=(xfc) # 252 fc
> &yacute;=(xfd) # 253 fd
> &thorn;=(xfe) # 254 fe
> &eacute;=(xe9)
> &ecirc;=(xea)
> &euml;=(xeb)
> &igrave;=(xec)
> &iacute;=(xed)
> &icirc;=(xee)
> &iuml;=(xef)
> '''
>
> >>> print HTM_Decoder (test_string)
>
> ø=(xf8) # 248 f8
> ù=(xf9) # 249 f9
> ú=(xfa) # 250 fa
> û=(xfb) # 251 fb
> ü=(xfc) # 252 fc
> ý=(xfd) # 253 fd
> þ=(xfe) # 254 fe
> é=(xe9)
> ê=(xea)
> ë=(xeb)
> ì=(xec)
> í=(xed)
> î=(xee)
> ï=(xef)
>
> Another feature of SE is modularity.
>
> >>> strip_tags = '''
> ~<(.|\x0a)*?>~=(9) # one tag to one tab
> ~<!--(.|\x0a)*?-->~=(9) # one comment to one tab
> | # run
> "~\x0a[ \x09\x0d\x0a]*~=(x0a)" # delete empty lines
> ~\t+~=(32) # one or more tabs to one space
> ~\x20\t+~=(32) # one space and one or more tabs to one space
> ~\t+\x20~=(32) # one or more tabs and one space to one space
> '''
>
> >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')  # Order doesn't matter
>
> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se' you'd name it
> together with HTM2ISO.se:
>
> >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')  # Order doesn't matter
>
> Or, if you have two SE objects, one for stripping tags and one for
> decoding the ampersands, you can nest them like this:
>
> >>> test_string = ("<p class=MsoNormal style='line-height:110%'>"
> ...     "<i>Ren&eacute;</i> est un gar&ccedil;on qui "
> ...     "para&icirc;t plus &acirc;g&eacute;. </p>")
>
> >>> print Tag_Stripper (HTM_Decoder (test_string))
> René est un garçon qui paraît plus âgé.
>
> Nesting works with file names too, because file names are returned:
>
> >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
> 'output_file_name'
>
>
> Frederic
>
>
>
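For comparison, here is a minimal standard-library sketch of the same kind of decoding, without SE. It is not from Frederic's message. It assumes the name2codepoint table (htmlentitydefs in Python 2.3 and later, html.entities in Python 3) covers the entities you care about; the names unescape_entities, _replace_entity and _to_char are made up for illustration. Under Python 2 it is safest to pass unicode strings, since the replacements are unicode characters.

```python
import re

try:                        # Python 2, including 2.4
    from htmlentitydefs import name2codepoint
except ImportError:         # Python 3
    from html.entities import name2codepoint

try:
    _to_char = unichr       # Python 2: code point -> unicode character
except NameError:
    _to_char = chr          # Python 3: chr() already returns Unicode

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def _replace_entity(match):
    """Return the character for one entity, or the entity itself if unknown."""
    body = match.group(1)
    try:
        if body[0] == '#':
            if body[1] in 'xX':
                return _to_char(int(body[2:], 16))   # hexadecimal reference
            return _to_char(int(body[1:]))           # decimal reference
        return _to_char(name2codepoint[body])        # named entity
    except (KeyError, ValueError, OverflowError):
        return match.group(0)                        # leave it untouched


def unescape_entities(text):
    """Replace every recognised HTML entity in text with its character."""
    return _ENTITY_RE.sub(_replace_entity, text)
```

Calling unescape_entities on a string such as 'Ren&eacute;&nbsp;est' replaces the named entities with the corresponding characters; unknown names like '&bogus;' are left as they are rather than raising an error.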