How to loop over a text file (to remove tags and normalize) using Python

S Monzur sb.monzur at gmail.com
Tue Mar 9 16:07:14 EST 2021


Dear List,

Newbie here. I am trying to loop over a text file to remove HTML tags,
punctuation marks, and stopwords. I have already used Beautiful Soup (Python
v3.8.3) to scrape the text (newspaper articles) from the site. It returns a
list, which I saved as a file. However, I am not sure how to write a loop
that processes all the items in the text file.

In the code below I have used listfilereduced.txt (containing data from one
news article; link to listfilereduced.txt here
<https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>).
However, I would like to run this code on listfile.text (containing data from
multiple articles; link to listfile.text
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).


Any help would be greatly appreciated!

P.S. The text is in a non-English script, but the tags are all in English.


# The code below is for a text file containing just one item. I am not sure
# how to tweak this to make it run for listfile.text (which contains raw data
# from multiple articles).
with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)

# Separating body text from other data
articleStart = rawData.find("<div class=\"story-element story-element-text\">")
articleData = rawData[:articleStart]
articleBody = rawData[articleStart:]
print(articleData)
print("*******")
print(articleBody)
print("*******")

# First, I define a function to strip tags from the body text
def stripTags(pageContents):
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif (insideTag == 1 and char == '>'):
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text

# Calling the function
articleBodyText = stripTags(articleBody)
print(articleBodyText)

## Isolating article title and publication date
TitleEndLoc = articleData.find("</h1>")
dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
titleString = articleData[:TitleEndLoc]
dateString = articleData[dateStartLoc:dateEndLoc]

## Call stripTags to clean
articleTitle = stripTags(titleString)
articleDate = stripTags(dateString)
print(articleTitle)
print(articleDate)

# Cleaning the date a bit more
startLocDate = articleDate.find(":")
endLocDate = articleDate.find(",")
articleDateClean = articleDate[startLocDate+2:endLocDate]
print(articleDateClean)

# Save all this data to a dictionary that holds the title, date and the body text
PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
print(PAloTextDict)

# Normalize text by:
# 1. Splitting paragraphs of text into lists of words
articleBodyWordList = articleBodyText.split()
print(articleBodyWordList)

# 2. Removing punctuation and stopwords
from bnlp.corpus import stopwords, punctuations

# A. Remove punctuation first
listNoPunct = []
for word in articleBodyWordList:
    for mark in punctuations:
        word = word.replace(mark, '')
    listNoPunct.append(word)
print(listNoPunct)

# B. Removing stopwords
banglastopwords = stopwords()
print(banglastopwords)
cleanList = []
for word in listNoPunct:
    if word in banglastopwords:
        continue
    else:
        cleanList.append(word)
print(cleanList)
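
One possible way to extend the single-article code to several articles is to
wrap the per-article steps in a function and call it once per article. The
following is only a rough sketch under assumptions: how listfile.text
separates one article from the next is not shown in this post, so
ARTICLE_SEPARATOR is a hypothetical placeholder, and the names strip_tags,
process_article and process_all are made up for illustration. The sample
string and the small stopword set stand in for the real file and for the
bnlp lists so that the sketch runs on its own. Unlike the code above, it
also drops words that become empty once punctuation is removed.

```python
import string

# Marker that opens the article body, taken from the code in this post.
BODY_MARKER = '<div class="story-element story-element-text">'
# Hypothetical separator between articles; the real file may differ.
ARTICLE_SEPARATOR = "\n=====\n"


def strip_tags(page_contents):
    """Drop everything between '<' and '>' (same idea as stripTags)."""
    inside_tag = False
    chars = []
    for char in page_contents:
        if char == "<":
            inside_tag = True
        elif inside_tag and char == ">":
            inside_tag = False
        elif not inside_tag:
            chars.append(char)
    return "".join(chars)


def process_article(raw, stop_words, punctuation_marks):
    """Return the title and the cleaned word list for one article."""
    start = raw.find(BODY_MARKER)
    if start == -1:
        start = len(raw)  # no body marker: treat the whole chunk as header
    header, body = raw[:start], raw[start:]
    title_end = header.find("</h1>")
    title_html = header[:title_end] if title_end != -1 else header
    words = []
    for word in strip_tags(body).split():
        for mark in punctuation_marks:
            word = word.replace(mark, "")
        # Skip stopwords and words that were nothing but punctuation.
        if word and word not in stop_words:
            words.append(word)
    return {"Title": strip_tags(title_html).strip(), "Words": words}


def process_all(raw_data, stop_words, punctuation_marks):
    """Split the raw data into articles and process each one in a loop."""
    results = []
    for chunk in raw_data.split(ARTICLE_SEPARATOR):
        if chunk.strip():
            results.append(process_article(chunk, stop_words, punctuation_marks))
    return results


# Made-up sample data standing in for the contents of the real file.
sample = (
    '<h1>First</h1><div class="story-element story-element-text">'
    "<p>Hello, the world!</p></div>"
    "\n=====\n"
    '<h1>Second</h1><div class="story-element story-element-text">'
    "<p>The end.</p></div>"
)
articles = process_all(
    sample, stop_words={"the", "The"}, punctuation_marks=string.punctuation
)
for article in articles:
    print(article)
```

With the real data, the sample string would be replaced by the text read
from the file, and stop_words and punctuation_marks by set(stopwords()) and
punctuations from bnlp.corpus as imported above. Date extraction could be
added to process_article in the same way as the title.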

