[Tutor] Extract main text from HTML document

Brian Lockwood t100ss at gmail.com
Sat May 5 17:31:33 EDT 2018


Two things. First, you can download the page as a string and delete
everything inside the tags (between each < and >), keeping only the text.
Second, it might be worth looking at Udacity CS101, as that course is all
about building a search engine.
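
A rough sketch of the tag-stripping idea, using only the standard library's
html.parser rather than deleting text by hand. The names TextExtractor and
extract_text are made up for illustration, and skipping <script> and <style>
is my assumption about what counts as "not useful". It does not try to work
out which part of the page is the "main" body, so treat it as a starting
point only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>.

    Hypothetical helper for illustration, not an existing library class.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []      # text fragments found so far
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title>"
        "<style>p { color: red; }</style></head>"
        "<body><p>Hello <b>world</b></p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(sample))
```

To run it on a real page you would first fetch the HTML as a string, for
example with urllib.request.urlopen(url).read().decode(), and pass that
string to extract_text. The encoding of the page is something you would
need to check yourself.
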
On Sat, 5 May 2018 at 22:27, Simon Connah <scopensource at gmail.com> wrote:

> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.