Html character entity conversion

Mon Jul 31 02:47:48 EDT 2006

John Machin wrote:
> Claudio Grondi wrote:
> 
>>pak.andrei at gmail.com wrote:
>>
>>>Claudio Grondi wrote:
>>>
>>>
>>>>pak.andrei at gmail.com wrote:
>>>>
>>>>
>>>>>Here is my script:
>>>>>
>>>>
>>>>>from mechanize import *
>>>>>from BeautifulSoup import *
>>>>
>>>>>import StringIO
>>>>>b = Browser()
>>>>>f = b.open("http://www.translate.ru/text.asp?lang=ru")
>>>>>b.select_form(nr=0)
>>>>>b["source"] = "hello python"
>>>>>html = b.submit().get_data()
>>>>>soup = BeautifulSoup(html)
>>>>>print  soup.find("span", id = "r_text").string
>>>>>
>>>>>OUTPUT:
>>>>>&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
>>>>>----------
>>>>>In Russian it looks like:
>>>>>"привет питон"
>>>>>
>>>>>How can I translate this using standard Python libraries??
>>>>>
>>>>>--
>>>>>Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
>>>>>
>>>>
>>>>Translate to what and with what purpose?
>>>>
>>>>Assuming your intention is to get a Python Unicode string, what about:
>>>>
>>>>strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
>>>>strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
>>>>strUnicode = eval("u'%s'"%strUnicodeHexCode)
>>>>
>>>>?
>>>>
>>>>I am sure, there is a more elegant and direct solution, but just wanted
>>>>to provide here some quick response.
>>>>
>>>>Claudio Grondi
>>>
>>>
>>>Thank you, Claudio.
>>>A really interesting solution, but it doesn't work...
>>>
>>>In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
>>>
>>>In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
>>>
>>>In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)
>>>
>>>In [22]: print strUnicode
>>>---------------------------------------------------------------------------
>>>exceptions.UnicodeEncodeError                        Traceback (most
>>>recent call last)
>>>
>>>C:\Documents and Settings\dron\<ipython console>
>>>
>>>C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
>>>     16     def encode(self,input,errors='strict'):
>>>     17
>>>---> 18         return codecs.charmap_encode(input,errors,encoding_map)
>>>     19
>>>     20     def decode(self,input,errors='strict'):
>>>
>>>UnicodeEncodeError: 'charmap' codec can't encode characters in position
>>>0-5: character maps to <undefined>
>>>
>>>In [23]: print strUnicode.encode("utf-8")
>>>сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
>>><-- it's not my string "привет питон"
>>>
>>>In [24]: strUnicode.encode("utf-8")
>>>Out[24]:
>>>'\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90
>>>\xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
>>><-- and too many chars
>>>
>>
>>Have you considered that the HTML page specifies charset=windows-1251
>>in its
>><meta http-equiv="Content-Type" content="text/html;
>>charset=windows-1251"> tag?
>>You are apparently on Linux or so, so I can't track this problem down
>>having only a Windows box here, but in the meantime I know that there is
>>another problem with it:
>>I had erroneously assumed that the numbers in &#1087; are hexadecimal,
>>but they are decimal, so it is necessary to do hex(int('1087')) on them
>>to get the right code to put into eval().
>>Now that you know the idea, I hope you will succeed as I did with:
>>
>> >>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
>> >>> lstIntUnicodeDecimalCode
>>['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
>>'1090', '1086', '1085', '']
>> >>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
>> >>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
>> >>> lstHexUnicode
>>['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
>>'0x442', '0x43e', '0x43d']
>> >>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
>>u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
>> >>> strUnicode = eval(
>>'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
>> >>> print strUnicode
>>приветпитон
>>
>>Sorry for the mess of not taking the space into consideration, but I think
>>you can get the idea anyway.
> 
> 
> I hope he *doesn't* get that "idea".
> 
> #>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
> #>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
> #>>> strUnicode
> [u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
> u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
> #>>>
Knowing about the built-in function unichr() is a good thing, but ...
there are still drawbacks, because (not tested!) e.g.:
'100x hallo Python' translates to
'100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1055;&#1080;&#1090;&#1086;&#1085;'
and this can't be handled just by using unichr() instead of the eval()
stuff, because the approach itself is wrong: .replace() and .split()
work only on the given example, not in the general case.
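
To make that drawback concrete, here is a small sketch. It is written for
present-day Python 3 rather than the Python 2 used in this thread, so chr()
stands in for unichr(). The helper name naive_decode and the two sample
strings are made up for illustration. It applies the replace/split idea first
to a string holding only references, then to one with a literal '100x '
prefix:

```python
def naive_decode(text):
    """Replace/split approach: assumes the string holds nothing but
    &#NNNN; references (hypothetical helper, for illustration only)."""
    parts = text.replace('&#', '').split(';')
    # chr() is the Python 3 counterpart of Python 2's unichr()
    return ''.join(chr(int(part)) for part in parts if part)


pure = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;'
mixed = '100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;'

print(naive_decode(pure))  # fine: the input contains only references

try:
    naive_decode(mixed)
except ValueError as exc:
    # The first piece after splitting is '100x 1087', which int() rejects,
    # so the whole conversion fails.
    print('naive approach failed:', exc)
```

The failure comes from the literal text ending up inside the pieces handed to
int(), which is why swapping eval() for unichr() alone does not help.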
I am just too lazy to sit down and work on code that extracts the
&#....; sequences from the HTML and converts only them, leaving the rest
of the string unchanged, in order to arrive at a solution that works in
the general case (it should not be hard, and I suppose the OP has it
already :-) if he is at a Python skill level of playing around with the
mechanize module).
I am still convinced that there must be a more elegant and direct
solution, so the subject is still fully open for improvements towards
the actual final goal.
I suppose that, in addition to unichr(), one can also use unicode() as a
replacement for eval().
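
One way such code might look is sketched below, assuming present-day
Python 3 (chr() instead of Python 2's unichr()) and the re module. The names
NUMERIC_REF and decode_numeric_refs are invented for this example. It only
handles decimal references of the form &#NNNN;, which is the form seen in
this thread; hexadecimal references (&#x43f;) and named entities (&amp;) are
deliberately left alone.

```python
import re

# Matches decimal numeric character references such as &#1087;
NUMERIC_REF = re.compile(r'&#(\d+);')


def decode_numeric_refs(text):
    """Convert each &#NNNN; to its character and leave all other text
    unchanged (hypothetical helper, for illustration only)."""
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


sample = ('100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; '
          '&#1055;&#1080;&#1090;&#1086;&#1085;')
print(decode_numeric_refs(sample))
```

Because the substitution touches only what the pattern matches, the literal
'100x ' prefix and the space between the words pass through untouched. A
reference with a number outside the Unicode range would still make chr()
raise ValueError; this sketch does not guard against that.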
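
For readers on present-day Python: the standard library of Python 3.4 and
later has html.unescape(), which was not available in the Python 2 of this
thread. As a short sketch (the sample string, including the trailing
'&amp; Co.', is made up here to show that named entities are converted too):

```python
import html

sample = ('100x &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; '
          '&#1055;&#1080;&#1090;&#1086;&#1085; &amp; Co.')

# html.unescape() converts numeric and named character references and
# leaves the remaining text as it is.
print(html.unescape(sample))
```

Note that this works on an already decoded text string; decoding the page
bytes from charset=windows-1251 is a separate step.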

To Andrei: can you please post here what you have finally arrived at?

Claudio Grondi