HTML character entity conversion

Sun Jul 30 22:37:39 EDT 2006

Claudio Grondi wrote:
> pak.andrei at gmail.com wrote:
> > Claudio Grondi wrote:
> >
> >>pak.andrei at gmail.com wrote:
> >>
> >>>Here is my script:
> >>>
> >>>from mechanize import *
> >>>from BeautifulSoup import *
> >>>import StringIO
> >>>b = Browser()
> >>>f = b.open("http://www.translate.ru/text.asp?lang=ru")
> >>>b.select_form(nr=0)
> >>>b["source"] = "hello python"
> >>>html = b.submit().get_data()
> >>>soup = BeautifulSoup(html)
> >>>print  soup.find("span", id = "r_text").string
> >>>
> >>>OUTPUT:
> >>>&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
> >>>----------
> >>>In Russian it looks like:
> >>>"привет питон"
> >>>
> >>>How can I translate this using standard Python libraries??
> >>>
> >>>--
> >>>Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >>>
> >>
> >>Translate to what and with what purpose?
> >>
> >>Assuming your intention is to get a Python Unicode string, what about:
> >>
> >>strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> >>strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> >>strUnicode = eval("u'%s'"%strUnicodeHexCode)
> >>
> >>?
> >>
> >>I am sure, there is a more elegant and direct solution, but just wanted
> >>to provide here some quick response.
> >>
> >>Claudio Grondi
> >
> >
> > Thank you, Claudio.
> > Really interesting solution, but it doesn't work...
> >
> > In [19]: strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> >
> > In [20]: strUnicodeHexCode = strHTML.replace('&#','\u').replace(';','')
> >
> > In [21]: strUnicode = eval("u'%s'"%strUnicodeHexCode)
> >
> > In [22]: print strUnicode
> > ---------------------------------------------------------------------------
> > exceptions.UnicodeEncodeError                        Traceback (most
> > recent call last)
> >
> > C:\Documents and Settings\dron\<ipython console>
> >
> > C:\usr\lib\encodings\cp866.py in encode(self, input, errors)
> >      16     def encode(self,input,errors='strict'):
> >      17
> > ---> 18         return codecs.charmap_encode(input,errors,encoding_map)
> >      19
> >      20     def decode(self,input,errors='strict'):
> >
> > UnicodeEncodeError: 'charmap' codec can't encode characters in position
> > 0-5: character maps to <undefined>
> >
> > In [23]: print strUnicode.encode("utf-8")
> > сВЗсВИсВАсБ┤сБ╖сВР сВЗсВАсВРсВЖсВЕ
> > <-- it's not my string "привет питон"
> >
> > In [24]: strUnicode.encode("utf-8")
> > Out[24]:
> > '\xe1\x82\x87\xe1\x82\x88\xe1\x82\x80\xe1\x81\xb4\xe1\x81\xb7\xe1\x82\x90 \xe1\x82\x87\xe1\x82\x80\xe1\x82\x90\xe1\x82\x86\xe1\x82\x85'
> > <-- and too many chars
> >
> Have you considered that the HTML page specifies charset=windows-1251
> in its
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
> tag?
> You are apparently on Linux or so, so I can't track this problem down
> having only a Windows box here, but in the meantime I have found that
> there is another problem with it:
> I erroneously assumed that the numbers in &#1087; are hexadecimal,
> but they are decimal, so it is necessary to do hex(int('1087')) on them
> to get the right code to put into eval().
> Now that you know the idea, I hope you will succeed as I did with:
>
>  >>> lstIntUnicodeDecimalCode = strHTML.replace('&#','').split(';')
>  >>> lstIntUnicodeDecimalCode
> ['1087', '1088', '1080', '1074', '1077', '1090', ' 1087', '1080',
> '1090', '1086', '1085', '']
>  >>> lstIntUnicodeDecimalCode = lstIntUnicodeDecimalCode[:-1]
>  >>> lstHexUnicode = [ hex(int(item)) for item in lstIntUnicodeDecimalCode]
>  >>> lstHexUnicode
> ['0x43f', '0x440', '0x438', '0x432', '0x435', '0x442', '0x43f', '0x438',
> '0x442', '0x43e', '0x43d']
>  >>> eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
> u'\u043f\u0440\u0438\u0432\u0435\u0442\u043f\u0438\u0442\u043e\u043d'
>  >>> strUnicode = eval( 'u"%s"'%''.join(lstHexUnicode).replace('0x','\u0' ) )
>  >>> print strUnicode
> приветпитон
>
> Sorry for the mess of not taking the space into consideration, but I
> think you can get the idea anyway.
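
A small check of the decimal-versus-hexadecimal point made above may help. It is only a sketch, written for Python 3 as an assumption (the thread uses Python 2, where chr would be unichr), and the variable names are illustrative. It shows that reading 1087 as hexadecimal produces exactly the three-byte UTF-8 sequence seen in Out[24], while reading it as decimal gives U+043F, the Cyrillic letter that was expected.

```python
# Python 3 sketch; in the Python 2 of this thread, chr() would be unichr().
code_text = "1087"  # the digits from the numeric reference &#1087;

as_decimal = chr(int(code_text))      # intended reading: code point 1087
as_hex = chr(int(code_text, 16))      # mistaken reading: code point 0x1087

print(hex(ord(as_decimal)))           # 0x43f
print(hex(ord(as_hex)))               # 0x1087
print(as_hex.encode("utf-8"))         # b'\xe1\x82\x87'
```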

I hope he *doesn't* get that "idea".

#>>> strHTML = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
#>>> strUnicode = [unichr(int(x)) for x in strHTML.replace('&#','').split(';') if x]
#>>> strUnicode
[u'\u043f', u'\u0440', u'\u0438', u'\u0432', u'\u0435', u'\u0442',
u'\u043f', u'\u0438', u'\u0442', u'\u043e', u'\u043d']
#>>>
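
For anyone reading this later on Python 3 (an assumption; everything above is Python 2), the same idea can be sketched without eval() and without losing the space between the words. The function name decode_numeric_refs and the regular expression are illustrative choices, not something from this thread. The standard library's html.unescape, available since Python 3.4, is used only as a cross-check.

```python
import re
from html import unescape

# Matches decimal (&#1087;) and hexadecimal (&#x43f;) numeric references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references; leave all other text as-is."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)  # raises ValueError for out-of-range values
    return _NUMERIC_REF.sub(_replace, text)


str_html = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
            "&#1087;&#1080;&#1090;&#1086;&#1085;")
decoded = decode_numeric_refs(str_html)

# Print an ASCII-safe form so the result does not depend on the console
# encoding (the cp866 console above could not display the raw string).
print(decoded.encode("unicode_escape").decode("ascii"))

# Cross-check against the standard library.
assert unescape(str_html) == decoded
```

Because the substitution only touches the references themselves, the space between the two words survives, which the split(';') approaches above do not guarantee.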