How to loop over a text file (to remove tags and normalize) using Python

Peter Otten __peter__ at web.de
Wed Mar 10 07:50:17 EST 2021


On 10/03/2021 13:19, S Monzur wrote:
> I initially scraped the links using beautiful soup, and from those links
> downloaded the specific content of the articles I was interested in
> (titles, dates, names of contributor, main texts) and stored that
> information in a list. I then saved the list to a text file.
> https://pastebin.com/8BMi9qjW . I am now trying to remove the html tags
> from this text file, and running into issues as mentioned in the previous
> post.

As I said in my previous post, when you process the list entries 
separately you will probably avoid the problem.
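
A minimal sketch of per-entry processing, assuming the scraped data is
still available as a Python list of HTML fragments. The names `entries`,
`TextExtractor` and `strip_tags` are hypothetical, and the standard
library's html.parser stands in for BeautifulSoup here only to keep the
example self-contained; with BeautifulSoup, calling get_text() on each
parsed entry serves the same purpose.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text of one HTML fragment with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join the text nodes with spaces, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


# Hypothetical stand-ins for the scraped list entries.
entries = [
    "<h1>Some <b>title</b></h1>",
    "<p>First paragraph.</p><p>Second paragraph.</p>",
]

# Each entry is cleaned on its own, so markup in one entry
# cannot interfere with its neighbours.
cleaned = [strip_tags(entry) for entry in entries]
```

Joining text nodes with a space is a simplification: it keeps adjacent
paragraphs apart but would also split a word that is only partly wrapped
in a tag.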

Unfortunately, with the format you chose to store your intermediate data,
you cannot reliably reconstruct the original list from the file.

I recommend that you either

(1) avoid the text file and extract the interesting parts from
BeautifulSoup directly or

(2) pick a different file format to store the result sets. For 
short-term storage pickle
<https://docs.python.org/3/library/pickle.html#examples> should work.
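
Option (2) could look roughly like the sketch below. It assumes each
article is held as a dict inside a list; the field names, the sample
values and the file name "articles.pickle" are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical result set: one dict per scraped article.
articles = [
    {
        "title": "A title, with a comma",
        "date": "2021-03-10",
        "author": "A. Writer",
        "text": "First line.\nSecond line with [brackets] and 'quotes'.",
    },
    {
        "title": "Another title",
        "date": "2021-03-09",
        "author": "B. Writer",
        "text": "<p>Markup can be stored as-is.</p>",
    },
]

# Hypothetical location; any writable path works.
path = Path(tempfile.gettempdir()) / "articles.pickle"

# Pickle files are binary, hence the "wb" and "rb" modes.
with open(path, "wb") as f:
    pickle.dump(articles, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The list comes back with the same structure and content,
# regardless of commas, quotes or newlines inside the strings.
assert restored == articles
```

Only unpickle files you created yourself or otherwise trust, because
loading a pickle can execute arbitrary code.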



