Urllib's urlopen and urlretrieve

Dave Angel davea at davea.name
Thu Feb 21 13:53:25 EST 2013


On 02/21/2013 07:12 AM, qoresucks at gmail.com wrote:
>
>  <snip>
> Why is it that when using urllib.urlopen then reading or urllib.urlretrieve, does it only give me parts of the sites, losing the formatting, images, etc...? How can I get around this?
>

Start by telling us if you're using Python2 or Python3, as this library 
is different for different versions.  Also what OS, as there are lots of 
useful utilities in Unix, and a different set in Windows or other 
places.  Even if the same program exists on both, it's likely to be 
named differently.
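
To illustrate why the version matters, here is a minimal sketch of an 
import that copes with either one.  The helper name fetch is my own 
invention for illustration, not something from the library.

```python
# Sketch: import urlopen/urlretrieve in a way that works on Python 2 and 3.
try:
    # Python 3: both functions live in urllib.request
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2: urlopen from urllib2, urlretrieve from urllib
    from urllib2 import urlopen
    from urllib import urlretrieve


def fetch(url):
    """Return the raw bytes of one resource (hypothetical helper).

    Only the single file named by the url is returned; nothing that
    the file refers to (css, images, ...) is fetched.
    """
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

Either way you get back exactly one file per call, which is the root 
of the "only parts of the site" symptom.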

My earlier reply assumed you were trying to get an accurate copy of your 
website, presumably because your own local copy had gotten out of synch. 
  rh assumed differently, so I'll try again.  If you're trying to 
download someone else's, you should realize that you may be violating 
copyright, and ought to get permission.  It's one thing to extract a 
file or two, but another entirely to try to capture the entire site. 
And many sites consider all of the details proprietary.  Others consider 
the images proprietary, and enforce the individual copyrights.

You can indeed copy individual files with urllib or urllib2, but that's 
just the start of the problem.  A typical web page is written in html 
(or xhtml, or ...), and displaying it is the job of a browser, not the 
cat command.  In addition, the page will generally refer to lots of 
other files, with the most common being a css file and a few jpegs.  So 
you have to parse the page to find all those dependencies, and copy them 
as well.
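
As a rough sketch of that parsing step, assuming Python 3 and only the 
standard library: the class and function names below are made up for 
illustration, and a real page will have more kinds of references than 
the three tags handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DependencyFinder(HTMLParser):
    """Collect the URLs that link, script and img tags refer to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.dependencies = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("href"):
            target = attrs["href"]
        elif tag in ("img", "script") and attrs.get("src"):
            target = attrs["src"]
        else:
            return
        # Turn a relative reference into an absolute url.
        self.dependencies.append(urljoin(self.base_url, target))


def find_dependencies(html_text, base_url):
    """Return absolute urls of the files the page depends on."""
    finder = DependencyFinder(base_url)
    finder.feed(html_text)
    finder.close()
    return finder.dependencies
```

Each url in the returned list would then need its own urlopen or 
urlretrieve call, and a css file can in turn refer to still more files.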

Next, the page may contain code (e.g. php, javascript), or it may be code 
(e.g. Python or perl).  In each of those cases, what you'll get isn't 
exactly what you'd expect.  If you try to fetch a python program, 
generally what happens is it gets run on the server, and you fetch its 
stdout instead.  php is likewise executed on the server before the page 
is sent, so you only see its output.  On the other hand javascript gets 
executed by the browser, so whatever it would have added to the page is 
missing from the raw file you download.  Finally, the page may make 
use of resources which simply won't be visible to you without becoming a 
hacker.  Like my rsync and scp examples, you'll probably need a userid 
and password to get into the guts.

If you want to play with some of this without programming, you could go 
to your favorite browser, and View->Source.  The method of doing that 
varies with browser brand, version & OS, but it should be there on some 
menu someplace.  In Chrome, it's Tools->ViewSource.

Examples below extracted from the main page at python.org

   <title>Python Programming Language – Official Website</title>

That simply sets the title for the page.  It is not even part of the 
body, it's part of the head section of the page.  In this case, the 
head continues for 77 lines, including meta tags, javascript stuff, css 
stuff, etc.

You might observe that angle brackets are used to enclose explicit kinds 
of data.  In the above example, it's a "title" element.  And it's 
enclosed with <title>  and </title>

In xhtml, these will always come in pairs, like curly braces in C 
programming.  However, most web pages are busted, so parsing them is 
sometimes troublesome.  Most people seem to recommend Beautiful Soup, in 
part because it tolerates many kinds of errors.
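
For a taste of what parsing looks like, here is a small sketch that 
pulls out the title element.  Note that it uses the standard library's 
html.parser (Python 3) rather than Beautiful Soup, simply so there's 
nothing to install; the names TitleGrabber and get_title are my own.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Remember the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title += data


def get_title(html_text):
    """Return the page title, or an empty string if there is none."""
    grabber = TitleGrabber()
    grabber.feed(html_text)
    grabber.close()
    return grabber.title.strip()
```

This parser doesn't insist on matched pairs elsewhere in the page, so 
an unclosed tag in the body won't stop it from finding the title.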

I'd get a good book on html programming, making sure it covers xhtml and 
css.  But I don't know what to recommend, as everything in my arsenal is 
thoroughly dated.

Much of the body is devoted to the complexity of setting up the page in 
a browser of variable size, varying fonts, user-overrides, etc.  Take 
the following excerpt:

 >    <div style="align:center; padding-top: 0.5em; padding-left: 1em">
 >       <a href="/psf/donations/"><img width="116" height="42"
 > src="/images/donate.png" alt="" title="" /></a>
 >     </div>

The whole thing is a "div" or division.  It's an individual chunk of the 
page that might be placed almost anywhere within a bigger div or the 
page itself.  It has a style attribute, which gives hints to the browser 
about what it wants.  More commonly, the style will be indirected 
through a separate css page.

It has an "a" tag, which shows a link.  The link may be underlined, but 
the css or the browser may override that.  The url for the link is 
specified in the 'href' attribute.  The "a" tag is enclosing an 'img' 
tag, which describes a png image file to be displayed (named by its 
'src' attribute), and specifies the width and height to display it at. 
The img tag's 'alt' attribute holds the alternate text and its 'title' 
attribute the tooltip; both are empty here.
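
If you'd like to check which attribute belongs to which tag in that 
excerpt, something like the following sketch will list them (Python 3, 
standard library only; TagLister is just a name I picked).

```python
from html.parser import HTMLParser

# The excerpt quoted above, with the mail quoting removed.
EXCERPT = """
<div style="align:center; padding-top: 0.5em; padding-left: 1em">
  <a href="/psf/donations/"><img width="116" height="42"
  src="/images/donate.png" alt="" title="" /></a>
</div>
"""


class TagLister(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(EXCERPT)
lister.close()
for tag, attrs in lister.tags:
    print(tag, attrs)
```

The output has one line per tag, so you can see at a glance that href 
sits on the "a" tag while src, width, height, alt and title sit on img.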

 >   <h4><a href="/about/help/">Help</a></h4>

The h4 tag marks a fourth-level heading.  How it displays (size, font, 
color) is normally specified in css.  The h1 through h6 tags are usually 
used for making larger and smaller versions of text for titles and such.

 >   <link rel="stylesheet" type="text/css" media="screen"
 >    id="screen-switcher-stylesheet"
 >         href="/styles/screen-switcher-default.css" />

This points to a css file, which refers to another one, called 
styles.css.  That's where you can see the definition for a style of h4


 >  H1,H2,H3,H4,H5 {
 >    font-family: Georgia, "Bitstream Vera Serif",
 >   "New York", Palatino, serif;
 >    font-weight:normal;
 >    line-height: 1em;
 >    }

This defines the common attributes for all the Hn series.  Then they are 
refined and overridden by:


 > H4
 >  {
 >   font-size: 125%;
 >   color: #366D9C;
 >   margin: 0.4em 0 0.0em 0;
 >  }

So we see that H4 is 25% bigger than default.  Similarly H3 is 35%, and 
H2 is 40% bigger.
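
The way the second rule layers on top of the first can be modelled, 
very loosely, with a pair of dictionaries.  This is only a toy: real 
css also weighs specificity, source order and inheritance, none of 
which is attempted here, and the variable names are mine.

```python
# The two rules quoted above, written as plain dictionaries.
common_heading_rule = {
    "font-family": 'Georgia, "Bitstream Vera Serif", "New York", Palatino, serif',
    "font-weight": "normal",
    "line-height": "1em",
}
h4_rule = {
    "font-size": "125%",
    "color": "#366D9C",
    "margin": "0.4em 0 0.0em 0",
}

# Start from the shared declarations, then let the H4 rule add to
# them and override anything it repeats.
h4_style = dict(common_heading_rule)
h4_style.update(h4_rule)

# 125% means 1.25 times the size it would otherwise have had.
scale = float(h4_style["font-size"].rstrip("%")) / 100.0
```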

It's a very complicated topic, and I wish you luck on it.  But it's not 
clear that the first step should involve any Python programming.  I got 
all the above just with Chrome in its default setup.  I haven't even 
mentioned things like the Tools->DeveloperTools, or other stuff you 
could get via plugins.

If you're copying these files with a view to being able to run them 
locally, realize that for most websites, you need lots of installed 
software to support being a webserver.  If you're writing your own, you 
can start simple, and maybe never need any of the extra tools.  For 
example, on my own website, I only needed static pages.  So the python 
code I used was to generate the web pages, which are then uploaded as is 
to the site.  They can be tested locally by simply making up a url which 
starts

file://

instead of

http://
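
A bare-bones sketch of that static-page approach might look like this. 
It is not the code from my site; write_page and the file names are 
hypothetical, and it assumes Python 3.

```python
from pathlib import Path
import tempfile


def write_page(directory, name, title, body_html):
    """Write one static html page; return a file:// url for testing it."""
    page = (
        "<html><head><title>{0}</title></head>"
        "<body>{1}</body></html>\n"
    ).format(title, body_html)
    path = Path(directory) / name
    path.write_text(page, encoding="utf-8")
    # as_uri() needs an absolute path, hence resolve().
    return path.resolve().as_uri()


# Generate a page into a scratch directory and show its local url.
out_dir = tempfile.mkdtemp()
url = write_page(out_dir, "index.html", "Test page", "<p>Hello</p>")
print(url)
```

Pasting the printed url into a browser shows the page with no 
webserver involved.
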

But as soon as I want database features, or counters, or user accounts, 
or data entry, or randomness, I might add code that runs on the server, 
and that's a lot trickier.  Probably someone who has done it can tell us 
I'm all wet, though.

-- 
DaveA


