What is the best way to "get" a web page?

Sun Sep 24 13:09:16 EDT 2006

Pete wrote:
> > > The file "temp.html" is definitely different than the first run, but
> > > still not anything close to www.python.org . Any other suggestions?
> >
> > If you mean that the page looks different in a browser, for one thing
> > you have to download the css files too. Here's the relevant extract
> > from the main page:
> >
> > <link media="screen" href="styles/screen-switcher-default.css"
> > type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
> > <link media="scReen" href="styles/netscape4.css" type="text/css"
> > rel="stylesheet" />
> > <link media="print" href="styles/print.css" type="text/css"
> > rel="stylesheet" />
> > <link media="screen" href="styles/largestyles.css" type="text/css"
> > rel="alternate stylesheet" title="large text" />
> > <link media="screen" href="styles/defaultfonts.css" type="text/css"
> > rel="alternate stylesheet" title="default fonts" />
> >
> > You may either hardcode the urls of the css files, or parse the page,
> > extract the css links and normalize them to absolute urls. The first is
> > simpler but the second is more robust, in case a new css is added or an
> > existing one is renamed or removed.
> >
> > George
>
> Thanks for the information on CSS. I'll look into that later, but now
> my question is on the first two lines of HTML code. Here's my latest
> python code:
>
> >>> import urllib
> >>> web_page = urllib.urlopen("http://www.python.org")
> >>> fileTemp = open("temp.html", "w")
> >>> web_page_contents = web_page.read()
> >>> fileTemp.write(web_page_contents)
> >>> fileTemp.close()
>
> Here are the first two lines of temp.html:
>
>       1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>       2 <html lang="en" xml:lang="en"
> xmlns="http://www.w3.org/1999/xhtml">
>
> Here are the first two lines of www.python.org as saved from Firefox:
>
>       1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>       2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
> lang="en"><head>
>
> The first lines are identical, but the second lines differ. Why would
> they differ? Hmmmm...

Functionally they are the same; the Firefox copy just has the start of
the third line (<head>) joined onto the second. Opera's View Source
command produces the same result as Python. It looks like Firefox makes
some cosmetic changes to the source when it saves a page, but nothing
that would change the way the code works. Notice that the attributes in
the second line are the same and have only been reordered?
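
If you want to check this rather than eyeball it, here is a minimal
sketch that parses both second lines and compares the <html> tags with
their attributes held in a dict, so the order they were written in does
not matter. It assumes Python 3's standard html.parser module (your
session uses the older urllib API), and the names StartTagCollector and
start_tags are made up for this illustration.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # A dict compares equal regardless of the order in which the
        # attributes appeared in the source.
        self.tags.append((tag, dict(attrs)))


def start_tags(markup):
    """Return a list of (tag, attribute dict) pairs found in markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# The two second lines quoted above, each joined onto one line.
python_line = '<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">'
firefox_line = '<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en"><head>'

# Compare only the first tag; the Firefox line also carries <head>.
same_html_tag = start_tags(python_line)[0] == start_tags(firefox_line)[0]
print(same_html_tag)
```

The only remaining difference is the <head> tag that the Firefox copy
carries on the same line.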

> 
> Thanks,
> Pete
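
For when you get back to the CSS: George's second option (parse the
page, extract the css links, normalize them to absolute urls) could be
sketched roughly as below. This again assumes Python 3's html.parser
and urllib.parse rather than the urllib you are using, and the names
StylesheetLinkParser and extract_stylesheet_urls are hypothetical. The
sample input is a few of the <link> tags George quoted, and the base URL
is assumed to be http://www.python.org/.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collect absolute URLs of <link> tags whose rel includes 'stylesheet'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheet_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several tokens, e.g. "alternate stylesheet".
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel_values and href:
            # Resolve the relative href against the address of the page.
            self.stylesheet_urls.append(urljoin(self.base_url, href))


def extract_stylesheet_urls(page_source, base_url):
    """Return the absolute stylesheet URLs referenced by page_source."""
    parser = StylesheetLinkParser(base_url)
    parser.feed(page_source)
    parser.close()
    return parser.stylesheet_urls


# A few of the <link> tags from the main page, as quoted earlier.
sample = '''<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />'''

stylesheet_urls = extract_stylesheet_urls(sample, "http://www.python.org/")
for url in stylesheet_urls:
    print(url)
```

Alternate stylesheets are included on purpose; drop the ones whose rel
contains "alternate" if you only want the default look. In your own
session you would pass web_page_contents in place of the sample string
and then fetch each returned URL the same way you fetched the page.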