How to loop over a text file (to remove tags and normalize) using Python

S Monzur sb.monzur at gmail.com
Wed Mar 10 07:19:38 EST 2021


I initially scraped the links using Beautiful Soup, and from those links
downloaded the specific content of the articles I was interested in
(titles, dates, contributor names, main texts) and stored that
information in a list. I then saved the list to a text file:
https://pastebin.com/8BMi9qjW . I am now trying to remove the HTML tags
from this text file, and am running into the issues mentioned in the
previous post.



On Wed, Mar 10, 2021 at 3:46 PM Peter Otten <__peter__ at web.de> wrote:

> On 10/03/2021 04:35, S Monzur wrote:
> > Thanks! I ended up using Beautiful Soup to remove the HTML tags and
> > create three lists (article titles, publication dates, main body), but
> > am still facing a problem where the list is not properly storing the
> > main body. There is something wrong with my code for that section, and
> > any comment would be really helpful!
> >
> > ListFile Text
> > <https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
>
> How did you create that file?
>
> > BeautifulSoup code for removing tags <https://pastebin.com/qvbVMUGD>
>
> > print(bodytext[0])  # so here, I'm only getting the first paragraph of
> >                     # the body of the first article, not all of the
> >                     # first article
> >
> > print(bodytext[1])  # here, I'm getting the second paragraph of the
> >                     # first article, and not the second article
>
> It may help if you process the individual articles with Beautiful Soup,
> not the whole list at once.
>
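
The suggestion above, processing each article separately, can be sketched
roughly as follows. This is an illustration under assumptions, not the
code from the pastebin: the sample HTML, the article_text helper, and the
idea that each article is available as its own HTML string are all
hypothetical. If find_all("p") is run over the whole file at once, it
presumably returns one flat list of paragraphs, which would explain why
bodytext[1] is the second paragraph of the first article.

```python
from bs4 import BeautifulSoup


def article_text(html):
    """Return the tag-free text of ONE article, paragraphs joined."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect every <p> of this article only, stripping inner tags.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n\n".join(paragraphs)


# Hypothetical input: one HTML string per article.
articles_html = [
    "<div><p>First article, <b>paragraph</b> one.</p>"
    "<p>First article, paragraph two.</p></div>",
    "<div><p>Second article, only paragraph.</p></div>",
]

# One entry per article, so bodytext[1] is the second article.
bodytext = [article_text(html) for html in articles_html]
```

With this layout bodytext[0] holds both paragraphs of the first article
and bodytext[1] holds the second article.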


More information about the Python-list mailing list