How to loop over a text file (to remove tags and normalize) using Python

Dan Stromberg drsalists at gmail.com
Wed Mar 10 09:11:43 EST 2021


If you want text without tags, sometimes it's easier to use a text-based
web browser, e.g.:

#!/bin/sh

# For mutt to view HTML e-mails.
#
# html2txt is a shell script that performs the conversion, e.g. by calling:

links -html-numbered-links 1 -html-images 1 -dump "file://$@"

#or
#
#lynx -force_html -dump "$@"
#
#or
#
#w3m -T text/html -F -dump "$@"
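
If shelling out to a browser is not an option, the same idea can be
sketched in Python with the standard library's html.parser module. This is
only a minimal sketch: TextExtractor and strip_tags are hypothetical names
that are not part of the original post, and text inside <script> or <style>
elements would be kept along with everything else.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)
```

Calling strip_tags(rawData) on the text read from the file would then give
the tag-free text, which can be split into words as before.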


On Tue, Mar 9, 2021 at 1:26 PM S Monzur <sb.monzur at gmail.com> wrote:

> Dear List,
>
> Newbie here. I am trying to loop over a text file to remove HTML tags,
> punctuation marks, and stopwords. I have already used Beautiful Soup (Python v
> 3.8.3) to scrape the text (newspaper articles) from the site. It returns a
> list that I saved as a file. However, I am not sure how to use a loop in
> order to process all the items in the text file.
>
> In the code below I have used listfilereduced.txt (containing data from one
> news article; link:
> <https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>).
> However, I would like to run this code on listfile.text (containing data from
> multiple articles; link:
> <https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
>
>
> Any help would be greatly appreciated!
>
> P.S. The text is in a Non-English script, but the tags are all in English.
>
>
> # The code below is for a textfile containing just one item. I am not sure
> # how to tweak this to make it run for listfile.text (which contains raw data
> # from multiple articles)
> with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
>     rawData = my_file.read()
> print(rawData)
>
> # Separating body text from other data
> articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> articleData = rawData[:articleStart]
> articleBody = rawData[articleStart:]
> print(articleData)
> print("*******")
> print(articleBody)
> print("*******")
>
> # First, I define a function to strip tags from the body text
> def stripTags(pageContents):
>     insideTag = 0
>     text = ''
>     for char in pageContents:
>         if char == '<':
>             insideTag = 1
>         elif (insideTag == 1 and char == '>'):
>             insideTag = 0
>         elif insideTag == 1:
>             continue
>         else:
>             text += char
>     return text
>
> # Calling the function
> articleBodyText = stripTags(articleBody)
> print(articleBodyText)
>
> ## Isolating article title and publication date
> TitleEndLoc = articleData.find("</h1>")
> dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> titleString = articleData[:TitleEndLoc]
> dateString = articleData[dateStartLoc:dateEndLoc]
>
> ## Call stripTags to clean
> articleTitle = stripTags(titleString)
> articleDate = stripTags(dateString)
> print(articleTitle)
> print(articleDate)
>
> # Cleaning the date a bit more
> startLocDate = articleDate.find(":")
> endLocDate = articleDate.find(",")
> articleDateClean = articleDate[startLocDate+2:endLocDate]
> print(articleDateClean)
>
> # save all this data to a dictionary that saves the title, data and the body text
> PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
> print(PAloTextDict)
>
> # Normalize text by:
> # 1. Splitting paragraphs of text into lists of words
> articleBodyWordList = articleBodyText.split()
> print(articleBodyWordList)
>
> # 2. Removing punctuation and stopwords
> from bnlp.corpus import stopwords, punctuations
>
> # A. Remove punctuation first
> listNoPunct = []
> for word in articleBodyWordList:
>     for mark in punctuations:
>         word = word.replace(mark, '')
>     listNoPunct.append(word)
> print(listNoPunct)
>
> # B. removing stopwords
> banglastopwords = stopwords()
> print(banglastopwords)
> cleanList = []
> for word in listNoPunct:
>     if word in banglastopwords:
>         continue
>     else:
>         cleanList.append(word)
> print(cleanList)
> --
> https://mail.python.org/mailman/listinfo/python-list
>
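
On the looping question itself: one possible approach is to wrap the
per-article steps in a function and call it once per item. The sketch below
assumes that listfile.text holds one article per line, which is only a guess
about the file layout. The names process_article, process_articles and
process_file are hypothetical, and date extraction and stopword removal are
left out for brevity.

```python
# Marker that starts the article body; taken from the code in the post
BODY_MARKER = '<div class="story-element story-element-text">'


def strip_tags(page_contents):
    """Drop everything between '<' and '>' (same idea as stripTags in the post)."""
    inside_tag = False
    text = []
    for char in page_contents:
        if char == "<":
            inside_tag = True
        elif char == ">" and inside_tag:
            inside_tag = False
        elif not inside_tag:
            text.append(char)
    return "".join(text)


def process_article(raw_data):
    """Split one raw article into a title and a tag-free body."""
    start = raw_data.find(BODY_MARKER)
    if start == -1:
        # find() returns -1 when the marker is missing; treat it all as body
        meta, body = "", raw_data
    else:
        meta, body = raw_data[:start], raw_data[start:]
    title_end = meta.find("</h1>")
    title_html = meta if title_end == -1 else meta[:title_end]
    return {
        "Title": strip_tags(title_html).strip(),
        "Text": strip_tags(body).strip(),
    }


def process_articles(items):
    """Run process_article on every non-blank item."""
    return [process_article(item) for item in items if item.strip()]


def process_file(path):
    """Assumption: the file stores one article per line."""
    with open(path, encoding="utf8") as my_file:
        return process_articles(my_file)
```

With that layout, process_file("listfile.text") would return one dictionary
per article. If the articles are separated some other way, only process_file
needs to change.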

