Using utidylib, empty string returned in some cases

Boris savinovboris at gmail.com
Tue Jan 22 12:35:16 EST 2008


Hello

I'm using Debian Linux, Python 2.4.4, and utidylib
(http://utidylib.berlios.de/). I wrote simple functions to fetch a web
page, convert it from windows-1251 to UTF-8, and then clean the HTML
with utidylib.

Here are the two pages I use to test my program:
http://www.ya.ru/ (in this case everything works fine)
http://www.yellow-pages.ru/rus/nd2/qu5/ru15632 (in this case tidy
returns nothing, just an empty string)


code:

--------------

# coding: utf-8
import urllib, urllib2, tidy

def get_page(url):
  # The string is split with implicit concatenation so it survives line wrapping.
  user_agent = ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; '
                '.NET CLR 1.1.4322; .NET CLR 2.0.50727)')
  headers = { 'User-Agent' : user_agent }
  # None keeps this a GET; any non-None data (even {}) makes urllib2 send a POST.
  data = None

  req = urllib2.Request(url, data, headers)
  response = urllib2.urlopen(req)
  page = response.read()

  return page

def convert_1251(page):
  p = page.decode('windows-1251')
  u = p.encode('utf-8')
  return u

def clean_html(page):
  tidy_options = { 'output_xhtml' : 1,
                   'add_xml_decl' : 1,
                   'indent' : 1,
                   'input-encoding' : 'utf8',
                   'output-encoding' : 'utf8',
                   'tidy_mark' : 1,
                 }
  cleaned_page = tidy.parseString(page, **tidy_options)
  return cleaned_page

test_url = 'http://www.yellow-pages.ru/rus/nd2/qu5/ru15632'
#test_url = 'http://www.ya.ru/'

#f = open('yp.html', 'r')
#p = f.read()

print clean_html(convert_1251(get_page(test_url)))

--------------
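
One way to narrow down where the empty string comes from is to check
the conversion step on its own, before tidy sees the data. The sketch
below is only an illustration: convert_and_check and the sample bytes
are hypothetical names and test data, not part of the program above,
and it uses only the standard library (written to run on Python 2.6+
and Python 3). If the conversion step reports a non-empty, valid UTF-8
result for the failing page, the problem is more likely in the tidy
step than in the encoding step.

```python
# Hypothetical helper (not in the original program): convert bytes from
# windows-1251 to UTF-8 and report a few facts about the result.
def convert_and_check(raw, source_encoding='windows-1251'):
    text = raw.decode(source_encoding)   # bytes -> unicode text
    utf8 = text.encode('utf-8')          # unicode text -> UTF-8 bytes
    return {
        'input_bytes': len(raw),
        'characters': len(text),
        'output_bytes': len(utf8),
        # True when the UTF-8 bytes decode back to the same text.
        'round_trips': utf8.decode('utf-8') == text,
        'utf8': utf8,
    }

# Assumed test data: a six-letter Cyrillic word encoded as windows-1251.
sample = u'\u041f\u0440\u0438\u0432\u0435\u0442'.encode('windows-1251')
report = convert_and_check(sample)
```

The same helper could be called on the bytes returned by get_page for
each test URL, to compare the two pages before they reach clean_html.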

What am I doing wrong? Can anyone help, please?


