HTML character entity conversion

yichun yichunwe at usc.edu
Sat Sep 9 20:58:47 EDT 2006


pak.andrei at gmail.com wrote:
> danielx wrote:
>> pak.andrei at gmail.com wrote:
>>> Here is my script:
>>>
>>> from mechanize import *
>>> from BeautifulSoup import *
>>> import StringIO
>>> b = Browser()
>>> f = b.open("http://www.translate.ru/text.asp?lang=ru")
>>> b.select_form(nr=0)
>>> b["source"] = "hello python"
>>> html = b.submit().get_data()
>>> soup = BeautifulSoup(html)
>>> print  soup.find("span", id = "r_text").string
>>>
>>> OUTPUT:
>>> &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
>>> &#1087;&#1080;&#1090;&#1086;&#1085;
>>> ----------
>>> In Russian it looks like:
>>> "привет питон"
>>>
>>> How can I translate this using standard Python libraries??
>>>
>>> --
> 
> Thank you for the response.
> It doesn't matter what 'BeautifulSoup' is...

However, the best solution is to ask BeautifulSoup to do that for you.
If you do

soup = BeautifulSoup(your_html_page, convertEntities="html")

you should no longer have the problem you describe. This converts all
the HTML entities (the five you see as soup.entitydefs) and all the
"&#xxx;" numeric references to their Python unicode strings.
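
For the general question (decoding with only the standard library), here
is a minimal sketch. It assumes a modern Python 3, where html.unescape is
available; that function did not exist in the Python 2 releases current
when this thread was written. The sample string and the helper name
decode_numeric_refs are made up for illustration, and the decimal "&#NNNN;"
form of the input is an assumption about what the page returned.

```python
import html
import re

# Assumed input: decimal numeric character references for two Russian words.
encoded = ("&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; "
           "&#1087;&#1080;&#1090;&#1086;&#1085;")

# html.unescape handles named entities (&amp;, &lt;, ...) as well as
# decimal and hexadecimal numeric references.
decoded = html.unescape(encoded)
print(decoded)


def decode_numeric_refs(text):
    """Hypothetical helper: decode only decimal references like &#1087;.

    Shown to illustrate what the conversion amounts to; it ignores named
    entities and hexadecimal references, which html.unescape does cover.
    """
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


manual = decode_numeric_refs(encoded)
```

Both calls give the same result for this input; the hand-written version is
only there to show that each "&#NNNN;" is simply the decimal Unicode code
point of one character.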

yichun


> General question is:
> 
> How can I convert encoded string
> 
> sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> 
> into human readable:
> 
> sDecodedHtmlText  == 'привет питон'
> 




