[Tutor] Attempt to download web page failed due to use of Frames
Kalle Svensson
kalle@gnupung.net
Tue, 30 Jan 2001 20:36:56 +0100
Sez Mallett, Roger:
> Using Python, I attempted to download a web page that contains frames. I
> received a message from the web site that my browser (which in this case, is
> the Python script below) doesn't support frames. What can I do to get all
> of the information that would normally display in my I.E. browser when
> frames are involved?
You will have to fetch the page, parse it to get the URLs of the subpages
and fetch them. There should be a few <frame> tags that contain the
addresses you are interested in.
> Script Used:
>
> >>> import urllib
> >>> f=urllib.urlopen('http://www.transitionstrading.com/Quotespage.htm')
> >>> x=f.read()
>>> print x
<html>
<head>
<title>Quotes</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
</head>
<frameset framespacing="0" border="false" frameborder="0" cols="137,*">
<frame name="contents" target="main" src="quotes1.htm" scrolling="auto">
<frame name="main" src="http://www.barchart.com/pl/vsn/" scrolling="auto">
[...]
These two lines are the ones you are interested in.
One possible regexp to match the src attribute is:
r = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>')
Then, you can find all frames by looping:
start = 0
frames = {}
while 1:
    # search from the end of the previous match, or the loop never ends
    m = r.search(x, start)
    if not m:
        break
    frames[m.group(1)] = ""
    start = m.end()
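
The collection step can also be written without an explicit loop, since `re.findall` returns the captured group for every match in one pass. A minimal sketch, assuming Python 3 and using the frameset lines from the page above as sample input (the name `sample` is made up for illustration):

```python
import re

# Pattern from the message above: capture the src attribute of each <frame> tag.
r = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>')

# Sample input: the frameset portion of the page shown earlier.
sample = '''<frameset framespacing="0" border="false" frameborder="0" cols="137,*">
<frame name="contents" target="main" src="quotes1.htm" scrolling="auto">
<frame name="main" src="http://www.barchart.com/pl/vsn/" scrolling="auto">
'''

# findall returns the captured group for every non-overlapping match;
# dict.fromkeys builds a {src: ""} dictionary from those strings.
frames = dict.fromkeys(r.findall(sample), "")
print(list(frames))
```

The `<frameset>` tag is skipped because it has no src attribute before its closing `>`.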
Then, you'll have to get the frame contents.
for f in frames.keys():
    # this is extremely crude.
    if f[:4] == "http":
        frames[f] = urllib.urlopen(f).read()
    else:
        frames[f] = urllib.urlopen("http://www.transitionstrading.com/" +
                                   f).read()
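
A less crude way to handle relative addresses is to let the standard library resolve each src against the URL of the frameset page. This is only a sketch, assuming Python 3, where the function is `urllib.parse.urljoin`; the names `base`, `srcs` and `resolved` are illustrative and nothing is fetched here:

```python
from urllib.parse import urljoin

# URL of the page that contained the <frame> tags.
base = "http://www.transitionstrading.com/Quotespage.htm"

# The two src values found in the page above.
srcs = ["quotes1.htm", "http://www.barchart.com/pl/vsn/"]

# urljoin resolves relative references against base and
# leaves absolute URLs unchanged.
resolved = [urljoin(base, s) for s in srcs]
print(resolved)
```

Unlike the "http" prefix test, this also copes with src values that start with "/" or "../".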
The whole thing, untested:
import urllib, re

f = urllib.urlopen('http://www.transitionstrading.com/Quotespage.htm')
x = f.read()
f.close() # yeah, why not... <wink>

r = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>')

start = 0
frames = {}
while 1:
    # search from the end of the previous match, or the loop never ends
    m = r.search(x, start)
    if not m:
        break
    frames[m.group(1)] = ""
    start = m.end()

for f in frames.keys():
    # this is extremely crude.
    if f[:4] == "http":
        frames[f] = urllib.urlopen(f).read()
    else:
        frames[f] = urllib.urlopen("http://www.transitionstrading.com/" +
                                   f).read()
There are other ways; this is just a beginning.
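
One such other way, swapped in here for the regexp, is the HTML parser in the standard library, which copes with attribute order, quoting and case differences. A rough sketch, assuming Python 3 (`html.parser.HTMLParser`); the class name `FrameSrcParser` and the variable `page` are made up for illustration:

```python
from html.parser import HTMLParser


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "frame":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


# Sample input: the frameset portion of the page shown earlier.
page = '''<frameset framespacing="0" border="false" frameborder="0" cols="137,*">
<frame name="contents" target="main" src="quotes1.htm" scrolling="auto">
<frame name="main" src="http://www.barchart.com/pl/vsn/" scrolling="auto">
'''

parser = FrameSrcParser()
parser.feed(page)
print(parser.srcs)
```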
HTH,
Kalle
-- 
Email: kalle@gnupung.net | You can tune a filesystem, but you
Web: http://www.gnupung.net/ | can't tune a fish. -- man tunefs(8)
PGP fingerprint: 0C56 B171 8159 327F 1824 F5DE 74D7 80D7 BF3B B1DD