unescape HTML entities

Frederic Rentsch anthra.norell at vtxmail.ch
Thu Nov 2 07:17:47 EST 2006


Rares Vernica wrote:
> Hi,
>
> Nice module!
>
> I downloaded 2.3 and started to play with it. The files have funny 
> names: they are all caps, including the extension.
>
> For example, the main module file is "SE.PY". If you try "import SE", it 
> will not work, as Python expects the file extension to be "py".
>
> Thanks,
> Ray
>
> Frederic Rentsch wrote:
>   
>> Rares Vernica wrote:
>>     
>>> Hi,
>>>
>>> How can I unescape HTML entities like " "?
>>>
>>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;", 
>>> "&lt;", and "&gt;".
>>>
>>> Also, I know about htmlentitydefs.entitydefs, but not only is this 
>>> dictionary the opposite of what I need, it also does not have " ".
>>>
>>> It has to be in python 2.4.
>>>
>>> Thanks a lot,
>>> Ray
>>>
>>>       
>> One way is this:
>>
>>  >>> import SE    # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>>  >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')    # HTM2ISO.se is included
>> 'output_file_name'
>>
>> For repeated translations the SE object would be assigned to a variable:
>>
>>  >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
>>
>> SE objects take and return strings as well as file names which is useful 
>> for translating string variables, doing line-by-line translations and 
>> for interactive development or verification. A simple way to check a 
>> substitution set is to use its definitions as test data. The following 
>> is a section of the definition file HTM2ISO.se:
>>
>> test_string = '''
>> &oslash;=(xf8)   #  248  f8
>> &ugrave;=(xf9)   #  249  f9
>> &uacute;=(xfa)   #  250  fa
>> &ucirc;=(xfb)    #  251  fb
>> &uuml;=(xfc)     #  252  fc
>> &yacute;=(xfd)   #  253  fd
>> &thorn;=(xfe)    #  254  fe
>> &eacute;=(xe9)
>> &ecirc;=(xea)
>> &euml;=(xeb)
>> &igrave;=(xec)
>> &iacute;=(xed)
>> &icirc;=(xee)
>> &iuml;=(xef)
>> '''
>>
>>  >>> print HTM_Decoder (test_string)
>>
>> ø=(xf8)   #  248  f8
>> ù=(xf9)   #  249  f9
>> ú=(xfa)   #  250  fa
>> û=(xfb)    #  251  fb
>> ü=(xfc)     #  252  fc
>> ý=(xfd)   #  253  fd
>> þ=(xfe)    #  254  fe
>> é=(xe9)
>> ê=(xea)
>> ë=(xeb)
>> ì=(xec)
>> í=(xed)
>> î=(xee)
>> ï=(xef)
>>
>> Another feature of SE is modularity.
>>
>>  >>> strip_tags = '''
>>    ~<(.|\x0a)*?>~=(9)               # one tag to one tab
>>    ~<!--(.|\x0a)*?-->~=(9)          # one comment to one tab
>> |                                   # run
>>    "~\x0a[ \x09\x0d\x0a]*~=(x0a)"   # delete empty lines
>>    ~\t+~=(32)                       # one or more tabs to one space
>>    ~\x20\t+~=(32)                   # one space and one or more tabs to one space
>>    ~\t+\x20~=(32)                   # one or more tabs and one space to one space
>> '''
>>
>>  >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')   # Order doesn't matter
>>
>> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se', you would name 
>> it together with HTM2ISO.se:
>>
>>  >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se  HTM2ISO.se')   # Order doesn't matter
>>
>> Or, if you have two SE objects, one for stripping tags and one for 
>> decoding the ampersands, you can nest them like this:
>>
>>  >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
>>
>>  >>> print Tag_Stripper (HTM_Decoder (test_string))
>>   René est un garçon qui paraît plus âgé.
>>
>> Nesting works with file names too, because file names are returned:
>>
>>  >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
>> 'output_file_name'
>>
>>
>> Frederic
>>
>>
>>
>>     
>
>   
Arrrgh!

I did it again, capitalizing the extensions. We had solved this problem and 
here we have it again. I am so sorry. Fortunately it isn't hard to fix: 
just rename the files once the problem has been identified, which you 
did. I shall change the upload within the next sixty seconds.
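
Until the corrected upload is in place, the renaming can be scripted. What follows is only a minimal sketch, assuming the unpacked files all sit in one directory; the function name lowercase_extensions is made up for illustration and is not part of SE.

```python
import os

def lowercase_extensions(directory):
    """Rename files such as SE.PY to SE.py; return the (old, new) name pairs."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        if not ext or ext == ext.lower():
            continue  # no extension, or extension already lower case
        old_path = os.path.join(directory, name)
        new_name = stem + ext.lower()
        new_path = os.path.join(directory, new_name)
        if not os.path.isfile(old_path):
            continue  # leave directories alone
        # Do not clobber a different file that already has the target name.
        # On case-insensitive file systems both paths are the same file.
        if os.path.exists(new_path) and not os.path.samefile(old_path, new_path):
            continue
        os.rename(old_path, new_path)
        renamed.append((name, new_name))
    return renamed
```

Only the extension is changed, so "SE.PY" becomes "SE.py" and "import SE" still finds a module named SE.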

Frederic

I'm glad you find it useful.
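
For readers who cannot install SE, the entity decoding asked about in this thread can also be sketched with the standard library alone. This is an illustration and not SE's method: unescape_entities is a hypothetical helper built on the name2codepoint table (htmlentitydefs in Python 2.3 and later, html.entities in Python 3), and it assumes the input is a unicode string. On Python 3.4 and later, html.unescape() covers the same ground.

```python
import re

try:                                   # Python 3
    from html.entities import name2codepoint
    unichr = chr
except ImportError:                    # Python 2.3 and later
    from htmlentitydefs import name2codepoint

# Matches decimal (&#248;), hexadecimal (&#xf8;) and named (&oslash;) references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_entities(text):
    """Replace HTML character references in text with the characters they name."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return unichr(int(body[2:], 16))
            if body[0] == "#":
                return unichr(int(body[1:]))
            return unichr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            return match.group(0)      # unknown or out of range: leave as is
    return _ENTITY.sub(replace, text)
```

The substitution is a single pass, so text that is escaped twice is unescaped only one level per call.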