[Tutor] Attempt to download web page failed due to use of Frames

Tue, 30 Jan 2001 20:36:56 +0100

Sez Mallett, Roger:
> Using Python, I attempted to download a web page that contains frames.  I
> received a message from the web site that my browser (which in this case, is
> the Python script below)  doesn't support frames.  What can I do to get all
> of the information that would normally display in my I.E. browser when
> frames are involved?

You will have to fetch the frameset page, parse it to get the URLs of the
subpages, and then fetch each of those.  There should be a few <frame> tags
whose src attributes contain the addresses you are interested in.

> Script Used:
>
> >>> import urllib
> >>> f=urllib.urlopen('http://www.transitionstrading.com/Quotespage.htm')
> >>> x=f.read()

>>> print x
<html>

<head>
<title>Quotes</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
</head>

<frameset framespacing="0" border="false" frameborder="0" cols="137,*">
  <frame name="contents" target="main" src="quotes1.htm" scrolling="auto">
  <frame name="main" src="http://www.barchart.com/pl/vsn/" scrolling="auto">
[...]
These two lines are the ones you are interested in.

One possible regexp to match the src attribute is:

r = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>')

Then, you can find all frames by looping:

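start = 0
frames = {}
while 1:
    # resume after the previous match; without a start position the
    # search would return the first frame forever
    m = r.search(x, start)
    if not m:
        break
    frames[m.group(1)] = ""
    start = m.end()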
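
As a shorter alternative to the loop, the same pattern can be handed to
findall, which returns every captured src in document order.  A minimal
sketch, assuming Python 3; frame_sources and the sample string are
hypothetical stand-ins for the page text fetched above, and the
IGNORECASE flag is an addition of mine:

```python
import re

# Same pattern as above; re.IGNORECASE is an added assumption so that
# <FRAME SRC="..."> would match too.
FRAME_SRC = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>', re.IGNORECASE)

def frame_sources(html):
    """Return the src attribute of every <frame> tag, in document order."""
    return FRAME_SRC.findall(html)

# Hypothetical sample standing in for the downloaded page text.
sample = '''<frameset cols="137,*">
  <frame name="contents" target="main" src="quotes1.htm" scrolling="auto">
  <frame name="main" src="http://www.barchart.com/pl/vsn/" scrolling="auto">
</frameset>'''

print(frame_sources(sample))
```

A list keeps the frames in page order, which the dictionary does not
promise; convert it to a dictionary afterwards if you want to store the
contents by URL.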

Then, you'll have to get the frame contents.

for f in frames.keys():
    # this is extremely crude.
    if f[:4] == "http":
        frames[f] = urllib.urlopen(f).read()
    else:
        frames[f] = urllib.urlopen("http://www.transitionstrading.com/" +
                                   f).read()
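
The prefix check is crude because it only works for pages at the top of
the site.  A less crude sketch, assuming Python 3, where urljoin lives in
urllib.parse: resolve each src against the URL of the page it came from.
absolute_frame_urls is a hypothetical helper name, and nothing is fetched
here.

```python
from urllib.parse import urljoin

# The address of the frameset page, as used earlier in this message.
BASE = "http://www.transitionstrading.com/Quotespage.htm"

def absolute_frame_urls(base, sources):
    """Resolve each frame src against the URL of the page that contained it."""
    return [urljoin(base, src) for src in sources]

urls = absolute_frame_urls(
    BASE, ["quotes1.htm", "http://www.barchart.com/pl/vsn/"])
for url in urls:
    print(url)
```

Relative addresses are resolved against the directory of the base page,
and addresses that are already absolute come back unchanged, so no
special case is needed.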

The whole thing, untested:

import urllib, re

f = urllib.urlopen('http://www.transitionstrading.com/Quotespage.htm')
x = f.read()
f.close() # yeah, why not... <wink>
r = re.compile(r'<frame[^>]*src="([^"]*)"[^>]*>')

start = 0
frames = {}

while 1:
    # resume after the previous match; without a start position the
    # search would return the first frame forever
    m = r.search(x, start)
    if not m:
        break
    frames[m.group(1)] = ""
    start = m.end()

for f in frames.keys():
    # this is extremely crude.
    if f[:4] == "http":
        frames[f] = urllib.urlopen(f).read()
    else:
        frames[f] = urllib.urlopen("http://www.transitionstrading.com/" +
                                   f).read()

There are other ways; this is just a beginning.
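
One such other way, sketched under the assumption of Python 3: let the
standard library's html.parser do the tag parsing instead of a regexp.
FrameCollector and frame_sources_parsed are hypothetical names of my own,
and the sketch only collects the addresses; fetching them is left out.

```python
from html.parser import HTMLParser

class FrameCollector(HTMLParser):
    """Collect the src attribute of every <frame> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "frame":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)

def frame_sources_parsed(html):
    """Return the src of each <frame> tag in html, in document order."""
    parser = FrameCollector()
    parser.feed(html)
    parser.close()
    return parser.sources

# Hypothetical input mixing upper case, single quotes and double quotes.
print(frame_sources_parsed(
    "<FRAME SRC='quotes1.htm'><frame name=\"main\" src=\"main.htm\">"))
```

Unlike the regexp, this does not care about the case of the tag or
whether the attribute uses single or double quotes.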

HTH,
  Kalle
-- 
Email: kalle@gnupung.net     | You can tune a filesystem, but you
Web: http://www.gnupung.net/ | can't tune a fish. -- man tunefs(8)
PGP fingerprint: 0C56 B171 8159 327F 1824 F5DE 74D7 80D7 BF3B B1DD
