Html character entity conversion

Sun Jul 30 18:52:51 EDT 2006

pak.andrei at gmail.com wrote:
> Claudio Grondi wrote:
> 
>>pak.andrei at gmail.com wrote:
>>
>>>Here is my script:
>>>
>>>from mechanize import *
>>>from BeautifulSoup import *
>>>import StringIO
>>>b = Browser()
>>>f = b.open("http://www.translate.ru/text.asp?lang=ru")
>>>b.select_form(nr=0)
>>>b["source"] = "hello python"
>>>html = b.submit().get_data()
>>>soup = BeautifulSoup(html)
>>>print  soup.find("span", id = "r_text").string
>>>
>>>OUTPUT:
>>>&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
>>>----------
>>>In russian it looks like:
>>>"привет питон"
>>>
>>>How can I translate this using standard Python libraries??
>>>
>>>--
>>>Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
>>>
>>
>>Translate to what and with what purpose?
>>
>>Assuming your intention is to get a Python Unicode string, what about:
>>
>>strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
>>strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
>>strUnicode = eval("u'%s'"%strUnicodeHexCode)
>>
>>?
>>
>>I am sure, there is a more elegant and direct solution, but just wanted
>>to provide here some quick response.
>>
>>Claudio Grondi
> 
> 
> Thank you, Claudio.
> Really interesting solution, but it doesn't work...
> 
> In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> 
> In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> 
> In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)
> 
> In [22]: print strUnicode
> ---------------------------------------------------------------------------
> exceptions.UnicodeEncodeError                        Traceback (most
> recent call last)
> 
> C:\Documents and Settings\dron\<ipython console>
> 
> C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
>      16     def encode(self,input,errors='strict'):
>      17
> ---> 18         return codecs.charmap_encode(input,errors,encoding_map)
>      19
>      20     def decode(self,input,errors='strict'):
> 
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-5: character maps to <undefined>
> 
> In [23]: print strUnicode.encode("utf-8")
> сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
> <-- it's not my string "привет питон"
> 
> In [24]: strUnicode.encode("utf-8")
> Out[24]:
> '\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90
> \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85' <-- and too many chars
> 
Have you considered that the HTML page specifies charset=windows-1251
in its
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
tag?
You are apparently on Linux or similar, so I can't track this problem down
with only a Windows box here, but in the meantime I have found another
problem:
I erroneously assumed that the numbers in &#1087; are hexadecimal,
but they are decimal, so it is necessary to apply hex(int('1087')) to them
to get the right code to put into eval().
Now that you know the idea, I hope you will succeed as I did with:

 >>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
 >>> lstIntUnicodeDecimalCode
['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080', 
'1090', '1086', '1085', '']
 >>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
 >>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
 >>> lstHexUnicode
['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438', 
'0x442', '0x43e', '0x43d']
 >>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
 >>> strUnicode = eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
 >>> print strUnicode
приветпитон

Sorry for the mess of not taking the space into consideration, but I think
you can get the idea anyway.
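
One way to keep the space and avoid eval() altogether is to replace each
decimal reference in place with a regular expression. The following is only
a minimal sketch, written for current Python 3 rather than the Python 2 used
in the session above; the names decode_decimal_refs, str_html and str_unicode
are made up for this illustration and are not part of any library.

```python
import re

def decode_decimal_refs(text):
    """Replace each decimal reference such as &#1087; with its character.

    Hypothetical helper for illustration only. Anything that is not a
    decimal numeric reference (for example the space) is left untouched.
    """
    # The digits are decimal, so int() needs no base argument.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

str_html = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
            "&#1087;&#1080;&#1090;&#1086;&#1085;")
str_unicode = decode_decimal_refs(str_html)
# str_unicode is now the 12-character string "привет питон",
# with the space between the two words preserved.
```

In Python 3.4 and later the standard library function html.unescape does
the same conversion and also handles hexadecimal and named references, so
the hand-written helper is mainly useful for seeing what happens.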

Claudio Grondi
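
The UnicodeEncodeError and the three-byte UTF-8 sequences in the quoted
session are consistent with the decimal number 1087 having been read as
hexadecimal. The short check below is a sketch in current Python 3 under
that assumption; the variable names are made up for this illustration.

```python
# Reading the reference as decimal gives the intended Cyrillic letter.
decimal_char = chr(1087)      # U+043F, the letter "п"
# Reading the same digits as hexadecimal gives a different,
# non-Cyrillic character.
hex_misread = chr(0x1087)     # U+1087

# The Cyrillic letter exists in cp866, so it encodes to a single byte.
cp866_bytes = decimal_char.encode("cp866")

# The misread character has no cp866 mapping, which is the
# "character maps to <undefined>" failure from the traceback.
try:
    hex_misread.encode("cp866")
    misread_fits_cp866 = True
except UnicodeEncodeError:
    misread_fits_cp866 = False

# In UTF-8 the misread character takes three bytes, matching the
# first '\xe1\x82\x87' group in the quoted Out[24].
misread_utf8 = hex_misread.encode("utf-8")
# The correctly decoded letter takes only two bytes.
correct_utf8 = decimal_char.encode("utf-8")
```

Here misread_fits_cp866 ends up False and misread_utf8 is three bytes long,
while correct_utf8 is two bytes long.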