unescape HTML entities
Frederic Rentsch
anthra.norell at vtxmail.ch
Thu Nov 2 07:17:47 EST 2006
Rares Vernica wrote:
> Hi,
>
> Nice module!
>
> I downloaded 2.3 and I started to play with it. The file names have
> funny names, they are all caps, including extension.
>
> For example the main module file is "SE.PY". If you try "import SE" it
> will not work as Python expects the file extension to be "py".
>
> Thanks,
> Ray
>
> Frederic Rentsch wrote:
>
>> Rares Vernica wrote:
>>
>>> Hi,
>>>
>>> How can I unescape HTML entities like "&nbsp;"?
>>>
>>> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
>>> "&lt;", and "&gt;".
>>>
>>> Also, I know about htmlentitydefs.entitydefs, but not only this
>>> dictionary is the opposite of what I need, it does not have "&nbsp;".
>>>
>>> It has to be in python 2.4.
>>>
>>> Thanks a lot,
>>> Ray
>>>
>>>
>> One way is this:
>>
>> >>> import SE  # Download from http://cheeseshop.python.org/pypi/SE/2.2%20beta
>> >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')  # HTM2ISO.se is included
>> 'output_file_name'
>>
>> For repeated translations the SE object would be assigned to a variable:
>>
>> >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
>>
>> SE objects take and return strings as well as file names which is useful
>> for translating string variables, doing line-by-line translations and
>> for interactive development or verification. A simple way to check a
>> substitution set is to use its definitions as test data. The following
>> is a section of the definition file HTM2ISO.se:
>>
>> >>> test_string = '''
>> &oslash;=(xf8) # 248 f8
>> &ugrave;=(xf9) # 249 f9
>> &uacute;=(xfa) # 250 fa
>> &ucirc;=(xfb) # 251 fb
>> &uuml;=(xfc) # 252 fc
>> &yacute;=(xfd) # 253 fd
>> &thorn;=(xfe) # 254 fe
>> &#233;=(xe9)
>> &#234;=(xea)
>> &#235;=(xeb)
>> &#236;=(xec)
>> &#237;=(xed)
>> &#238;=(xee)
>> &#239;=(xef)
>> '''
>>
>> >>> print HTM_Decoder (test_string)
>>
>> ø=(xf8) # 248 f8
>> ù=(xf9) # 249 f9
>> ú=(xfa) # 250 fa
>> û=(xfb) # 251 fb
>> ü=(xfc) # 252 fc
>> ý=(xfd) # 253 fd
>> þ=(xfe) # 254 fe
>> é=(xe9)
>> ê=(xea)
>> ë=(xeb)
>> ì=(xec)
>> í=(xed)
>> î=(xee)
>> ï=(xef)
>>
>> Another feature of SE is modularity.
>>
>> >>> strip_tags = '''
>> ~<(.|\x0a)*?>~=(9) # one tag to one tab
>> ~<!--(.|\x0a)*?-->~=(9) # one comment to one tab
>> | # run
>> "~\x0a[ \x09\x0d\x0a]*~=(x0a)" # delete empty lines
>> ~\t+~=(32) # one or more tabs to one space
>> ~\x20\t+~=(32) # one space and one or more tabs to one space
>> ~\t+\x20~=(32) # one or more tabs and one space to one space
>> '''
>>
>> >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')  # Order doesn't matter
>>
>> If you write 'strip_tags' to a file, say 'STRIP_TAGS.se' you'd name it
>> together with HTM2ISO.se:
>>
>> >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')  # Order doesn't matter
>>
>> Or, if you have two SE objects, one for stripping tags and one for
>> decoding the ampersands, you can nest them like this:
>>
>> >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
>>
>> >>> print Tag_Stripper (HTM_Decoder (test_string))
>> René est un garçon qui paraît plus âgé.
>>
>> Nesting works with file names too, because file names are returned:
>>
>> >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
>> 'output_file_name'
>>
>>
>> Frederic
>>
>>
>>
>>
>
>
Arrrgh!
Did it again capitalizing extensions. We had solved this problem and
here we have it again. I am so sorry. Fortunately it isn't hard to
solve, renaming the files once one identifies the problem, which you
did. I shall change the upload within the next sixty seconds.
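
Until a corrected upload is available, the renaming described above can be
scripted. The following is only a sketch and not part of SE: the function
name lowercase_extensions is made up for illustration, and it assumes the
only problem is the upper-case extension (SE.PY, HTM2ISO.SE), so it leaves
the base name untouched. On a case-sensitive file system it would overwrite
an existing file that already has the lower-case name.

```python
import os


def lowercase_extensions(directory):
    """Rename files such as SE.PY to SE.py; return the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        old_path = os.path.join(directory, name)
        if not os.path.isfile(old_path):
            continue  # skip subdirectories
        stem, ext = os.path.splitext(name)
        if ext and ext != ext.lower():
            new_name = stem + ext.lower()
            os.rename(old_path, os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed
```

Run it once on the directory the archive was unpacked into, for example
lowercase_extensions('.'), and "import SE" should then find SE.py.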
Frederic
I'm glad you find it useful.
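
For the original question quoted above, a standard-library-only route is
also possible. The sketch below is not SE and not taken from this thread;
the names ENTITY, to_char and unescape_entities are made up for
illustration. It relies on htmlentitydefs.name2codepoint, which maps entity
names to code points (the direction the question asks for), and falls back
to html.entities on Python 3. On Python 2 it assumes the input is a unicode
string or plain ASCII. Unknown references are left as they are.

```python
import re

try:  # Python 2 (the thread targets 2.4)
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:  # Python 3
    from html.entities import name2codepoint
    to_char = chr

# &name;  &#233;  &#xe9;
ENTITY = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def unescape_entities(text):
    """Replace named and numeric character references in text."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == '#x':
                return to_char(int(ref[2:], 16))   # hexadecimal reference
            if ref[0] == '#':
                return to_char(int(ref[1:]))       # decimal reference
            return to_char(name2codepoint[ref])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                  # unknown: leave untouched
    return ENTITY.sub(replace, text)
```

Called as unescape_entities('gar&ccedil;on&nbsp;&amp; co'), it returns the
text with the c-cedilla, a non-breaking space and a plain ampersand in
place of the three references. On Python 3.4 and later, html.unescape()
from the standard library covers the same ground.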