tidy to convert google scholar page in xml

রুদ্র ব্যাণার্জী bnrj.rudra at gmail.com
Mon Oct 8 07:11:00 EDT 2012


Dear friends,
I am trying to convert a Google Scholar page to XML.
First, I fetch the page using this script:
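#!/usr/bin/python
# Python 2 script: fetch the Google Scholar results page and save it as sch.html
import urllib2

url = "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp="
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
response = urllib2.urlopen(request)
# Write the raw HTML to disk; the with-block closes the file afterwards
with open('sch.html', 'w') as f:
    f.write(response.read())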

This gives a sch.html that starts with:
<!doctype html><html><head><meta http-equiv="Content-Type"
content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
content="IE=Edge"><meta name="viewport"
content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">	

If I run tidy to convert this HTML page to XML, I get:
$ tidy <sch.html |more
line 3 column 40 - Warning: <style> isn't allowed in <div> elements
line 3 column 23 - Info: <div> previously mentioned
/**************************
AND MANY MORE WARNINGS
**************************/
Info: Document content looks like HTML 4.01 Transitional
Info: No system identifier in emitted doctype
131 warnings, 0 errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content=
"width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
<meta name="format-detection" content="telephone=no">
<title>albert einstein+1905 - Google Scholar</title>

<script type="text/javascript">
var gs_ts=Number(new Date());
</script>
<style type="text/css">
html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
top{position:relative;min-width:980px;_width:expression(document.documentElement
.clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
#gs_top{min-width:
300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back


So the output is still HTML, not XML. How can I convert the page to
XML?
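
One thing that might be worth trying is tidy's -asxml option, which asks tidy to emit well-formed XHTML instead of HTML 4.01. The sketch below is only an illustration under assumptions: it is written for Python 3 (not the Python 2 of the script above), the helper names tidy_command and convert_to_xhtml and the output name sch.xml are made up for this example, and whether the result is clean enough to parse as XML for this particular page is untested.

```python
import shutil
import subprocess


def tidy_command(html_path, xml_path):
    """Build a tidy command line that requests well-formed XHTML output.

    -q        suppress non-essential output
    -asxml    convert the HTML to well-formed XHTML
    -numeric  write numeric rather than named entities
    -o FILE   write the result to FILE instead of stdout
    """
    return ["tidy", "-q", "-asxml", "-numeric", "-o", xml_path, html_path]


def convert_to_xhtml(html_path, xml_path):
    """Run tidy if it is installed.

    Returns tidy's exit status (0 = clean, 1 = warnings, 2 = errors),
    or None when no tidy executable is found on the PATH.
    """
    if shutil.which("tidy") is None:
        return None
    return subprocess.call(tidy_command(html_path, xml_path))
```

Calling convert_to_xhtml("sch.html", "sch.xml") would then be roughly equivalent to running "tidy -q -asxml -numeric -o sch.xml sch.html" in the shell. An exit status of 1 only means warnings were reported, so the output file may still be usable.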



