Using utidylib, empty string returned in some cases
Boris
savinovboris at gmail.com
Tue Jan 22 12:35:16 EST 2008
Hello
I'm using Debian Linux, Python 2.4.4, and utidylib
(http://utidylib.berlios.de/). I wrote simple functions to fetch a web
page, convert it from windows-1251 to UTF-8, and then clean the HTML
with utidylib.
Here are the two pages I use to test my program:
http://www.ya.ru/ (in this case everything works ok)
http://www.yellow-pages.ru/rus/nd2/qu5/ru15632 (in this case tidy
returns nothing, just an empty string)
code:
--------------
# coding: utf-8
import urllib, urllib2, tidy

def get_page(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
    headers = { 'User-Agent' : user_agent }
    data= {}
    req = urllib2.Request(url, data, headers)
    responce = urllib2.urlopen(req)
    page = responce.read()
    return page

def convert_1251(page):
    p = page.decode('windows-1251')
    u = p.encode('utf-8')
    return u

def clean_html(page):
    tidy_options = { 'output_xhtml' : 1,
                     'add_xml_decl' : 1,
                     'indent' : 1,
                     'input-encoding' : 'utf8',
                     'output-encoding' : 'utf8',
                     'tidy_mark' : 1,
                   }
    cleaned_page = tidy.parseString(page, **tidy_options)
    return cleaned_page

test_url = 'http://www.yellow-pages.ru/rus/nd2/qu5/ru15632'
#test_url = 'http://www.ya.ru/'
#f = open('yp.html', 'r')
#p = f.read()
print clean_html(convert_1251(get_page(test_url)))
--------------
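The conversion step can be checked on its own, without the network or
tidy, to rule it out as the cause. This is only a minimal sketch: the
sample string is made up for illustration and is not taken from either
test page, and it says nothing about what tidy does with the result.

```python
# Hypothetical sample text (six Cyrillic letters), written with escapes
# so the snippet does not depend on the source file's encoding.
sample = u'\u041f\u0440\u0438\u0432\u0435\u0442'

def convert_1251(page):
    # Decode windows-1251 bytes to unicode, then re-encode as UTF-8.
    return page.decode('windows-1251').encode('utf-8')

raw_1251 = sample.encode('windows-1251')
utf8_bytes = convert_1251(raw_1251)

# Round trip: decoding the result as UTF-8 must give the original text.
assert utf8_bytes.decode('utf-8') == sample
# Each of these letters is one byte in windows-1251 and two in UTF-8.
assert len(raw_1251) == 6
assert len(utf8_bytes) == 12
```

If these assertions pass, the bytes handed to tidy are valid UTF-8, at
least for input that really is windows-1251.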
What am I doing wrong? Can anyone help, please?