[Tutor] downloader-script (fwd)

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Fri, 27 Sep 2002 13:18:33 -0700 (PDT)


---------- Forwarded message ----------
Date: Fri, 27 Sep 2002 21:03:18 +0200
From: Magnus Lycka <magnus@thinkware.se>
To: Danny Yoo <dyoo@hkn.eecs.berkeley.edu>
Subject: Re: [Tutor] downloader-script

At 09:28 2002-09-27 -0700, you wrote:
>Wow, there's a lot of whitespace there.  So we may need to do additional
>processing like whitespace strip()ping.

If we are happy to get all the text separated by single spaces,
that's as simple as

text = " ".join(text.split())

.split() without parameters will split on any sequence of
whitespace.
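
To make that concrete, here is a small sketch of the split/join idiom
wrapped in a function. The name collapse_whitespace and the sample
string are made up for illustration; only the one-liner itself comes
from the text above.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    # split() with no arguments splits on any run of spaces, tabs, or
    # newlines and drops the empty strings at the start and end.
    return " ".join(text.split())


sample = "  Hello,\n\n   world!\t How are\r\n you?  "
print(collapse_whitespace(sample))  # prints: Hello, world! How are you?
```

Note that this throws away every line break, so paragraphs run together.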

In case we want to preserve some breaks, I think we
have to keep <p> and <br> tags in the text.
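
One way this might look, as a rough sketch: it assumes the other tags
have already been stripped and that only <p> and <br> variants are
left in the text. The regular expression and the function name
collapse_keeping_breaks are my own illustration, not something from
the original script, and a regex like this is no substitute for a
real HTML parser.

```python
import re

# Matches <br>, <br/>, <br />, <p>, </p>, and <p ...attributes>,
# ignoring case. It deliberately does not match tags such as <pre> or <b>.
BREAK_TAG = re.compile(r"<\s*(?:br\s*/?|/?p(?:\s[^>]*)?)\s*>", re.IGNORECASE)


def collapse_keeping_breaks(text):
    """Collapse whitespace inside each chunk, but keep one newline per break tag."""
    chunks = BREAK_TAG.split(text)
    # Normalize whitespace within each chunk separately.
    lines = [" ".join(chunk.split()) for chunk in chunks]
    # Drop chunks that were empty or whitespace only.
    return "\n".join(line for line in lines if line)


html = "<p>First   line\n of text</p><p>Second<br>  third  </p>"
print(collapse_keeping_breaks(html))
```

Each <p> or <br> becomes a single line break, and the whitespace
between them is collapsed as before.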

Another option would be

text = os.popen('lynx -dump %s' % url).read()

but that assumes that we have lynx installed (not a bad
idea anyway), and you will have to filter out the links that
lynx lists as footnotes, unless you are happy to keep them.
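
A sketch of how the lynx route and the footnote filtering could fit
together. It uses subprocess with an argument list rather than
os.popen, so the URL never passes through a shell. The helper names
are invented for this example, and the filtering rests on two
assumptions about the dump format: that the link list starts at a
line reading just "References", and that inline link markers look
like [1], [2], and so on. Check the output of your own lynx before
relying on it.

```python
import re
import subprocess


def lynx_dump(url):
    """Return the output of `lynx -dump URL`. Requires lynx on the PATH."""
    result = subprocess.run(
        ["lynx", "-dump", url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if lynx exits with an error
    )
    return result.stdout


def strip_link_footnotes(dump):
    """Remove the trailing link list and the inline [n] markers."""
    # Assumption: the link list begins at a line that is only "References".
    body = re.split(r"^References\s*$", dump, maxsplit=1, flags=re.MULTILINE)[0]
    # Assumption: inline markers are a number in square brackets.
    return re.sub(r"\[\d+\]", "", body).rstrip()


sample_dump = "Intro [1]link text here\n\nReferences\n\n   1. http://example.com/\n"
print(strip_link_footnotes(sample_dump))  # prints: Intro link text here
```

With lynx installed, strip_link_footnotes(lynx_dump(url)) would give
the cleaned text. I believe lynx also has a -nolist option that
suppresses the link list, but check the man page for your version.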


-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se