[Tutor] How to write a loop in python to find HTML tags in a text file

Alan Gauld alan.gauld at yahoo.co.uk
Wed Mar 17 14:19:43 EDT 2021


On 17/03/2021 11:27, S Monzur wrote:
> Thank you for explaining the process. Might you advise me on how to use
> beautiful soup on this text file to a) separate the metadata from the
> bodytext and b) remove all the html tags

I don't have BS installed at present. Maybe someone who
does can contribute a solution?

I did try with the standard python HTML parser but it seems your file
only has part of the HTML rather than the complete html text. That
confuses the parser so it won't work (BS may work since it is much
more forgiving than the standard one).

If you have access to the original html and can save each message
as a separate file that will make everyone's life much easier.
If you don't then you may have to stick to your original strategy
and live with the pain.

FWIW Here is the standard library version for stripping the
messages. Hopefully the basic technique is obvious:

import html.parser

class MessageParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.inMessage = False

    def handle_starttag(self, name, atts):
        if name == "div":
            for key, val in atts:
                # parentheses let the condition span two lines
                if (key == 'class' and
                        val == "story-element story-element-text"):
                    self.inMessage = True

    def handle_endtag(self, name):
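        # only the closing div ends a message; an inner tag such as </p>
        # should not (this assumes the message divs are not nested)
        if name == "div":
            self.inMessage = False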

    def handle_data(self,data):
        if self.inMessage:
            print('\n----------------\n', data)

if __name__ == "__main__":
    with open('articles.txt') as htm:
        parser = MessageParser()
        parser.feed(htm.read())  # feed() wants a string, not a file object

For stripping tags you can use an external program, html2txt.
It's available for Linux and MacOS; I don't know about Windows.
That would be the simplest option if you have it available.

Otherwise you can write code like the above that detects
all "text like" tags (Hn, P, etc.) and only prints the body.
You will need to decide what to do with lists and tables etc.
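
As a rough sketch of that last idea, something like the following
might do. It is not tested against your file, and the names
BodyTextParser, TEXT_TAGS and body_text are ones I made up for
illustration. It assumes the text-like tags are properly closed, and
it simply skips lists and tables:

```python
from html.parser import HTMLParser

# Assumption: these are the tags treated as "text like"; adjust to suit.
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class BodyTextParser(HTMLParser):
    """Collect the text found inside 'text like' tags, dropping the markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many text-like tags are currently open
        self.chunks = []    # text fragments of the block being read
        self.blocks = []    # finished blocks, one per outermost text-like tag

    def handle_starttag(self, name, atts):
        if name in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, name):
        if name in TEXT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # join the fragments and tidy up the whitespace
                text = " ".join("".join(self.chunks).split())
                if text:
                    self.blocks.append(text)
                self.chunks = []

    def handle_data(self, data):
        # inline tags such as <b> or <a> are ignored, but their text is kept
        if self.depth > 0:
            self.chunks.append(data)


def body_text(html_text):
    """Return a list of text blocks with all tags removed."""
    parser = BodyTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = ("<div><h2>Title</h2><p>Some <b>bold</b> text.</p>"
              "<ul><li>skipped</li></ul></div>")
    for block in body_text(sample):
        print(block)
```

To include list items or table cells you could add "li" or "td" to
TEXT_TAGS, at the cost of losing the list or table layout.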

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



