[Tutor] How to write a loop in python to find HTML tags in a text file

S Monzur sb.monzur at gmail.com
Wed Mar 17 07:27:53 EDT 2021


Thank you for explaining the process. Could you advise me on how to use
Beautiful Soup on this text file to a) separate the metadata from the
body text and b) remove all the HTML tags from the body text, and print the
clean body text of the three articles separately?

On Wed, Mar 17, 2021 at 4:35 PM Alan Gauld via Tutor <tutor at python.org>
wrote:

> On 17/03/2021 05:15, S Monzur wrote:
>
> > data from 3 news articles scraped from a news website. I would like to
> > write a loop that separates the metadata from the article body for each of
> > these three articles. The linked code <https://pastebin.com/FU2Axiuc> works
> > for a single news article only (i.e., if I keep only one article in the
> > text file). People have previously suggested using beautiful soup and
> > regular expressions, but please note that I just want to modify the
> > existing code to add a loop, and not use any other methods/functions.
>
> The correct answer is indeed to use an HTML parser like Beautiful Soup,
> not regular expressions, which are unreliable with HTML data.
>
> Your code is short enough to post, no need for pastebin.
>
> > with open('threenewsarticles.txt', 'r', encoding='utf8') as my_file:
> >     rawData = my_file.read()
> >     print(rawData)
> >
> > # Separating body text from metadata. This code only works if the
> > # text file has one article.
> >
> > articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> > articlemetaData = rawData[:articleStart]
> > articleBody = rawData[articleStart:]
>
> This works for a single article (provided the div tag never crosses
> a line boundary, which it is perfectly entitled to do).
> But you cannot find the closing </div> without a huge amount
> of effort, since there could be other divs within the body.
>
> But, to find subsequent articles you need to restart your search
> after your article. So you need to find the end of the article. And that
> is going to involve going through counting div and /div pairs until they
> come to zero and you hit an unmatched /div, which will be your closing
> /div. You can then reset the start of file at that point and go back to
> the start, i.e. the beginning of your "loop".
>
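
The div-counting loop described above can be sketched roughly as follows.
This is only an illustration that sticks to str.find, not the original
poster's code: the names find_matching_close and split_articles and the
sample string are made up for the example. It assumes each article has
exactly one story-element-text div, that tags are written in lower case,
and that "<div" never turns up inside comments or scripts.

```python
START_TAG = '<div class="story-element story-element-text">'


def find_matching_close(html, start):
    """Return the index just past the </div> matching the <div at `start`.

    Returns -1 if the divs are unbalanced. Hypothetical helper.
    """
    depth = 0
    pos = start
    while True:
        next_open = html.find("<div", pos)
        next_close = html.find("</div>", pos)
        if next_close == -1:
            return -1  # ran out of closing tags
        if next_open != -1 and next_open < next_close:
            depth += 1  # another div opens before the next one closes
            pos = next_open + len("<div")
        else:
            depth -= 1
            pos = next_close + len("</div>")
            if depth == 0:
                return pos  # this </div> closes the div we started on


def split_articles(raw):
    """Return a list of (metadata, body) pairs, one per article."""
    articles = []
    cursor = 0  # plays the role of "start of file" for each pass
    while True:
        body_start = raw.find(START_TAG, cursor)
        if body_start == -1:
            break  # no more articles
        body_end = find_matching_close(raw, body_start)
        if body_end == -1:
            body_end = len(raw)  # unbalanced html: take the rest
        articles.append((raw[cursor:body_start], raw[body_start:body_end]))
        cursor = body_end  # restart the search after this article
    return articles


# Made-up sample standing in for the real text file.
sample = (
    "Title A" + START_TAG + "<div><p>Body A</p></div></div>"
    "Title B" + START_TAG + "<p>Body B</p></div>"
)
for metadata, body in split_articles(sample):
    print(metadata)
    print(body)
```

The bodies still contain their tags at this point; the loop only finds
where each article starts and ends.
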
> [ You could of course ignore the end of file data and just search
> for the start of the next file tag. I'd suggest <html> should be the
> starting point rather than <H1> since there's nothing to prevent
> the author using H1 anywhere in the body. That will be easier than
> identifying the closing </div>. But it will make the body html
> unbalanced and therefore harder to process in later steps
> (if there are any).]
>
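
The shortcut in the bracketed note above (cut at the start of the next
article instead of hunting for the closing </div>) might look something
like this. The "<html" marker is an assumption: the real file may use a
different tag or capitalisation to begin each article. The function names
and the sample text are invented for the example.

```python
START_TAG = '<div class="story-element story-element-text">'


def split_on_marker(raw, marker="<html"):
    """Cut `raw` into chunks, each beginning at an occurrence of `marker`."""
    starts = []
    pos = raw.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = raw.find(marker, pos + len(marker))
    # Each chunk runs from one marker to the next (or to the end of the text).
    ends = starts[1:] + [len(raw)]
    return [raw[s:e] for s, e in zip(starts, ends)]


def split_article(chunk):
    """The single-article logic from the thread, guarded against a miss."""
    start = chunk.find(START_TAG)
    if start == -1:
        return chunk, ""  # no body div found: treat it all as metadata
    return chunk[:start], chunk[start:]


# Made-up sample standing in for the real text file.
sample = (
    "<html><h1>Headline A</h1>" + START_TAG + "<p>Body A</p></div></html>\n"
    "<html><h1>Headline B</h1>" + START_TAG + "<p>Body B</p></div></html>\n"
)
for chunk in split_on_marker(sample):
    metadata, body = split_article(chunk)
    print(metadata)
    print(body)
```

As the note warns, each body here runs to the end of its chunk, so it
carries the trailing </html> and is not balanced html.
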
> That's a lot of work to avoid using a parser which already does
> all that for you! And will extract all the articles (however
> many there are) in one pass. A parser will also cope with most
> tweaks or changes to the html that the publisher may make
> whereas not using a parser will require you to constantly
> tweak your code to match. It's a job for life, and if you
> have the time to spare that's OK. It's your time.
>
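
To illustrate the parser route, here is a small sketch that pulls the
tag-free body text of every article in one pass. Note the swap: it uses
the standard library's html.parser rather than Beautiful Soup, purely so
that it runs without installing anything; Beautiful Soup's find_all and
get_text would do the same job in fewer lines. The class name
BodyTextExtractor, the function extract_bodies and the sample text are
invented here, and the story-element-text class is taken from the code
quoted earlier in the thread.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect the text inside each <div class="... story-element-text">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a body div
        self.bodies = []  # one list of text fragments per body div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # a nested div inside the body
        elif "story-element-text" in classes:
            self.depth = 1   # a new article body starts here
            self.bodies.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.bodies[-1].append(data)


def extract_bodies(raw):
    """Return one clean, whitespace-normalised string per body div."""
    parser = BodyTextExtractor()
    parser.feed(raw)
    parser.close()
    return [" ".join("".join(parts).split()) for parts in parser.bodies]


# Made-up sample standing in for the real text file.
sample = (
    'Headline A<div class="story-element story-element-text">'
    "<div><p>First <b>body</b> text.</p></div></div>"
    'Headline B<div class="story-element story-element-text">'
    "<p>Second body.</p></div>"
)
for number, text in enumerate(extract_bodies(sample), start=1):
    print(number, text)
```

If the real articles spread their text over several story-element-text
divs, each div comes back as a separate entry and would need regrouping.
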
> A slightly easier approach if you have the option is to
> keep the articles in separate files. Looping over
> multiple files using code that works for one file
> is much easier.
>
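
A rough sketch of the separate-files idea: wrap the single-article code in
a function and call it once per file. The names split_article and
split_files are made up for the example, and the file names are whatever
you choose when you save each article.

```python
from pathlib import Path

START_TAG = '<div class="story-element story-element-text">'


def split_article(raw):
    """The single-article logic from the thread, guarded against a miss."""
    start = raw.find(START_TAG)
    if start == -1:
        return raw, ""  # no body div found: treat it all as metadata
    return raw[:start], raw[start:]


def split_files(paths):
    """Return {filename: (metadata, body)} for each one-article file."""
    results = {}
    for path in paths:
        raw = Path(path).read_text(encoding="utf8")
        results[str(path)] = split_article(raw)
    return results
```

With hypothetical file names, split_files(["article1.txt", "article2.txt",
"article3.txt"]) would return one (metadata, body) pair per file.
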
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.amazon.com/author/alan_gauld
> Follow my photo-blog on Flickr at:
> http://www.flickr.com/photos/alangauldphotos
>
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>

