[Tutor] Extract main text from HTML document

Simon Connah scopensource at gmail.com
Mon May 7 07:05:15 EDT 2018


That looks like a useful combination. Thanks.
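
For reference, here is a minimal sketch of how the two libraries suggested
below might fit together. The function names extract_main_text and
fetch_main_text are made up for illustration, and the rule "prefer <main> or
<article>, otherwise fall back to <body>" is only an assumption about where
the main text lives; real pages often need site-specific selectors.

```python
import requests
from bs4 import BeautifulSoup


def extract_main_text(html):
    """Return the readable text of an HTML document as one string.

    Hypothetical helper: the choice of root element is a heuristic,
    not something Beautiful Soup decides for you.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are never readable prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Assumption: the main content sits in <main> or <article> when
    # present; otherwise use <body>, or the whole document as a last resort.
    root = soup.find("main") or soup.find("article") or soup.body or soup

    # Join text fragments with spaces, then collapse runs of whitespace.
    return " ".join(root.get_text(separator=" ").split())


def fetch_main_text(url):
    """Download a page and return its extracted text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_main_text(response.text)
```

Calling fetch_main_text("https://example.com/") would return a plain string
that can be stored in the database. Collapsing whitespace discards paragraph
breaks; if those matter, the text of each <p> element could be collected
separately instead.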

On 6 May 2018 at 17:32, Mark Lawrence <breamoreboy at gmail.com> wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>>
>> Hi,
>>
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> JavaScript and maybe even some CSS. None of which is useful to me.
>>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>>
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>>
>> Does anyone have any suggestions on this front at all?
>>
>> Thanks for any help.
>>
>> Simon.
>
>
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill.  Both are installable with pip and are regarded as best of
> breed.
>
> --
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence

