[Tutor] How to write a loop in python to find HTML tags in a text file

Alan Gauld alan.gauld at yahoo.co.uk
Wed Mar 17 06:35:02 EDT 2021


On 17/03/2021 05:15, S Monzur wrote:

> data from 3 news articles scraped from a news website. I would like to
> write a loop that separates the metadata from the article body for each of
> these three articles. The linked code <https://pastebin.com/FU2Axiuc> works
> for a single news article only (i.e., if I keep only one article in the
> text file). People have previously suggested using beautiful soup and
> regular expressions, but please note that I just want to modify the
> existing code to add a loop, and not use any other methods/functions.

The correct answer is indeed to use an HTML parser like Beautiful Soup,
not regular expressions, which are unreliable with HTML data.

Your code is short enough to post, no need for pastebin.

> with open('threenewsarticles.txt', 'r', encoding='utf8') as my_file:
>     rawData = my_file.read()
>     print(rawData)
>
> # Separating body text from metadata. This code only works if the
> # text file has one article.
>
> articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> articlemetaData = rawData[:articleStart]
> articleBody = rawData[articleStart:]

This works for a single article (provided the opening div tag is never
split across lines, which HTML perfectly well allows).
But you cannot find the closing </div> without a huge amount
of effort, since there could be other divs within the body.

But to find subsequent articles you need to restart your search
after the end of the current article. So you need to find the end of
the article, and that is going to involve counting the nested <div> and
</div> pairs inside the body until they balance and you hit an
unmatched </div>, which will be your closing </div>. You can then treat
that point as the new start of the data and go back to the top,
i.e. the beginning of your "loop".
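
A minimal sketch of that counting loop, using only str.find and
slicing as in the original code. The function names are made up for
illustration, and it assumes each article contains exactly one
"story-element-text" div and that no other tag name begins with
"div". Treat it as a starting point rather than a robust solution:

```python
START_MARKER = '<div class="story-element story-element-text">'


def find_matching_close(html, open_pos):
    """Return the index just past the </div> that closes the <div at open_pos.

    Returns -1 if the divs are unbalanced.
    """
    depth = 0
    pos = open_pos
    while True:
        next_open = html.find("<div", pos)
        next_close = html.find("</div", pos)
        if next_close == -1:
            return -1  # ran out of closing tags: unbalanced HTML
        if next_open != -1 and next_open < next_close:
            depth += 1  # another div opens before the next one closes
            pos = next_open + len("<div")
        else:
            depth -= 1
            end = html.find(">", next_close)
            if end == -1:
                return -1
            pos = end + 1
            if depth == 0:
                return pos  # this </div> balanced the opening div


def split_articles(raw_data):
    """Return a list of (metadata, body) tuples, one per article."""
    articles = []
    search_from = 0
    while True:
        body_start = raw_data.find(START_MARKER, search_from)
        if body_start == -1:
            break  # no more articles
        body_end = find_matching_close(raw_data, body_start)
        if body_end == -1:
            body_end = len(raw_data)  # unbalanced: take the rest as body
        articles.append((raw_data[search_from:body_start],
                         raw_data[body_start:body_end]))
        search_from = body_end  # restart the search after this article
    return articles
```

Calling split_articles(rawData) would then give one (metadata, body)
pair per article instead of a single pair for the whole file.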

[ You could of course ignore the end of the article and just search
for the start tag of the next article. I'd suggest <html> should be the
starting point rather than <H1>, since there's nothing to prevent
the author using H1 anywhere in the body. That will be easier than
identifying the closing </div>. But it will leave the body html
unbalanced and therefore harder to process in later steps
(if there are any).]
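
That shortcut might look something like the sketch below. It assumes
every article really does begin with a lower-case "<html" tag, which I
haven't verified against your data, and the function name is only
illustrative. Each chunk can then be fed to the existing
single-article code:

```python
ARTICLE_START = "<html"  # assumed start-of-article marker


def split_on_article_start(raw_data, start_tag=ARTICLE_START):
    """Cut raw_data into chunks, each beginning at an occurrence of start_tag."""
    # Collect the position of every start tag in the data.
    starts = []
    pos = raw_data.find(start_tag)
    while pos != -1:
        starts.append(pos)
        pos = raw_data.find(start_tag, pos + len(start_tag))

    # Each chunk runs from one start tag up to the next (or to the end).
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(raw_data)
        chunks.append(raw_data[start:end])
    return chunks
```

Note that anything before the first start tag is dropped, and that
everything between one article's body and the next start tag ends up
in the earlier chunk.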

That's a lot of work to avoid using a parser which already does
all that for you! And will extract all the articles (however
many there are) in one pass. A parser will also cope with most
tweaks or changes to the html that the publisher may make
whereas not using a parser will require you to constantly
tweak your code to match. It's a job for life, and if you
have the time to spare that's OK. It's your time.
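
For comparison, here is a rough sketch of the parser route. To keep it
self-contained it uses the standard library's html.parser module
rather than Beautiful Soup (which would be shorter). The class name
StoryTextParser is invented for the example, and the
"story-element-text" class is taken from the code you posted:

```python
from html.parser import HTMLParser


class StoryTextParser(HTMLParser):
    """Collect the text found inside each 'story-element-text' div."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # div nesting depth inside the current body
        self.bodies = []     # one text string per body div found
        self._current = []   # text fragments of the body being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside a body
        else:
            classes = (dict(attrs).get("class") or "").split()
            if "story-element-text" in classes:
                self.depth = 1  # entering a new article body
                self._current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:  # the body div has just closed
                self.bodies.append("".join(self._current).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._current.append(data)
```

Usage would be along the lines of parser = StoryTextParser();
parser.feed(rawData); parser.close(), after which parser.bodies holds
the text of every body div in the file, however many articles there
are. The parser does the div counting for you and does not care
whether a tag is split across lines.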

A slightly easier approach, if you have the option, is to
keep the articles in separate files. Looping over
multiple files using code that works for one file
is much easier.
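
As a sketch of what that could look like, assuming the articles are
saved as separate .txt files in one directory (the directory layout
and the function names here are my assumptions, not anything from
your code):

```python
from pathlib import Path

START_MARKER = '<div class="story-element story-element-text">'


def split_article(raw_data):
    """Split one article into (metadata, body), as the original code does."""
    start = raw_data.find(START_MARKER)
    if start == -1:
        return raw_data, ""  # marker not found: treat it all as metadata
    return raw_data[:start], raw_data[start:]


def process_directory(directory, pattern="*.txt"):
    """Apply split_article to every matching file in the directory."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        raw = path.read_text(encoding="utf8")
        results[path.name] = split_article(raw)
    return results
```

The single-article logic stays exactly as it was. The only new part
is the for loop over the files.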

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



