tidy to convert google scholar page in xml

Mon Oct 8 08:11:02 EDT 2012

On 10/08/2012 07:11 AM, রুদ্র ব্যাণার্জী wrote:
> Dear friends,
> I am trying to convert a google scholar page to xml.
> First, I am getting the page using the script:
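> #!/usr/bin/python
> from HTMLParser import HTMLParser
> import urllib2
>
> # Fetch the search results page, sending a browser-like User-Agent header
> url = "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp="
> request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
> response = urllib2.urlopen(request)
>
> # Save the raw HTML; the with-block closes the file when done
> with open('sch.html', 'w') as f:
>     f.write(response.read())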
>
> This gives a sch.html that starts with:
> <!doctype html><html><head><meta http-equiv="Content-Type"
> content="text/html;charset=UTF-8"><meta http-equiv="X-UA-Compatible"
> content="IE=Edge"><meta name="viewport"
> content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no">	
>
> If I then run tidy on this HTML page to convert it to XML, I get:
> $ tidy <sch.html |more
> line 3 column 40 - Warning: <style> isn't allowed in <div> elements
> line 3 column 23 - Info: <div> previously mentioned
> /**************************
> AND MANY MORE WARNNING
> **************************/
> Info: Document content looks like HTML 4.01 Transitional
> Info: No system identifier in emitted doctype
> 131 warnings, 0 errors were found!
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <meta name="generator" content=
> "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
> <meta http-equiv="Content-Type" content=
> "text/html; charset=us-ascii">
> <meta http-equiv="X-UA-Compatible" content="IE=Edge">
> <meta name="viewport" content=
> "width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2">
> <meta name="format-detection" content="telephone=no">
> <title>albert einstein+1905 - Google Scholar</title>
>
> <script type="text/javascript">
> var gs_ts=Number(new Date());
> </script>
> <style type="text/css">
> html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:
> 0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_
> top{position:relative;min-width:980px;_width:expression(document.documentElement
> .clientWidth<982?"980px":"auto");}.gs_el_ph #gs_top,.gs_el_ta
> #gs_top{min-width:
> 300px;_width:expression(document.documentElement.clientWidth<302?"300px":"auto")
> ;}body,td{font-size:13px;font-family:Arial,sans-serif;line-height:1.24}body{back
>
>
> So the output is still HTML, not XML. How can I convert the page to
> XML?
>

What makes you think it's possible?  (Possible automatically, that is)
There is no well-defined mapping from arbitrary html to xml, so a program
that attempts the conversion has to guess in many places.  Further, many,
if not most, web pages are not even valid html, just good enough to work
in most browsers.  Now, if the page were valid xhtml, then it would
already be valid xml.
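
To make that last point concrete, here is a minimal sketch (Python 3,
standard library only).  The two snippets and the helper name
is_well_formed_xml are made up for illustration and are not taken from
the actual Scholar page.  An XML parser should accept the xhtml version
and reject the ordinary html one, where tags such as <meta> and <br>
are never closed.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets for illustration only.
# Ordinary html: <meta>, <br> and <p> are left unclosed.
HTML_SNIPPET = (
    '<html><head><meta name="viewport" content="width=device-width">'
    '<title>t</title></head><body><p>Hello<br>world</body></html>'
)

# The same content as xhtml: every element is explicitly closed.
XHTML_SNIPPET = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta name="viewport" content="width=device-width" />'
    '<title>t</title></head><body><p>Hello<br />world</p></body></html>'
)


def is_well_formed_xml(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml(HTML_SNIPPET))   # False
print(is_well_formed_xml(XHTML_SNIPPET))  # True
```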

Do you have a license from Google?  If not, better read their terms of
service.  While they probably won't pursue the occasional page scraping,
you should consider the costs before spending too much effort.  Besides,
they have APIs for most of their services, and there might be one
that'll be much easier to use than trying to scrape the html.

Do you have a plan for what to do when the page layout changes?

You should look into Beautiful Soup;  it's designed for parsing sloppily
written html.  I've no direct experience with it, but it gets
recommended a lot.
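
Beautiful Soup is a third-party package, so as a dependency-free
illustration of the same idea (tolerant parsing that does not choke on
unclosed tags or unquoted attributes), here is a rough sketch using the
standard library's html.parser in Python 3.  This is a stand-in and not
Beautiful Soup's API.  The class TitleAndLinks, the function extract,
the sample markup, and the example.com links are all hypothetical; only
the title string comes from the tidy output quoted above.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found on an <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data


# Hypothetical sloppy markup: no </head>, unclosed <div>, <a> and <p>,
# and one unquoted attribute value.
SLOPPY = (
    '<html><head><meta name="format-detection" content="telephone=no">'
    '<title>albert einstein+1905 - Google Scholar</title>'
    '<body><div><a href="http://example.com/paper1">Paper one'
    '<p><a href=paper2.html>Paper two'
)


def extract(html):
    """Parse html leniently and return (title, list of link targets)."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title, parser.links


title, links = extract(SLOPPY)
print(title)
print(links)
```

The point of the design is that the parser reports tags as events and
never demands a balanced tree, so broken markup degrades gracefully
instead of raising an error.  It still depends on the page layout,
though, so the earlier question about layout changes applies here too.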

-- 

DaveA