How to loop over a text file (to remove tags and normalize) using Python

Dan Ciprus (dciprus) dciprus at cisco.com
Tue Mar 9 17:32:08 EST 2021


No problem, the list just converts everything into text/plain, which is GREAT! :-)

So without digging deeply into what you need to do: I am assuming that your
input contains HTML tags. Why don't you use a library like
https://pypi.org/project/beautifulsoup4/ instead of committing harakiri by
parsing the data by hand, without even using regex? Just a hint.
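
To make the hint a bit more concrete, here is a minimal sketch of how
BeautifulSoup could pull title, date and body text out of a file holding
several articles. I have not looked at your actual file, so the layout is an
assumption: each article is taken to start with an <h1>, followed by sibling
<div> elements carrying the class names from your code. SAMPLE and
parse_articles are made-up names for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of the multi-article file.
# The real layout is an assumption: one <h1> per article, followed by its divs.
SAMPLE = """
<h1>First title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV">Published: 09 Mar 2021, 10:00</div>
<div class="story-element story-element-text"><p>Body <b>one</b></p></div>
<h1>Second title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV">Published: 10 Mar 2021, 11:30</div>
<div class="story-element story-element-text"><p>Body two.</p></div>
<div class="story-element story-element-text"><p>More text.</p></div>
"""

# Class names copied from the original post.
DATE_CLASS = "storyPageMetaData-m__publish-time__19bdV"
BODY_CLASS = "story-element-text"


def parse_articles(raw_html):
    """Return one dict per article, keyed Title/Date/Text."""
    soup = BeautifulSoup(raw_html, "html.parser")
    articles = []
    for h1 in soup.find_all("h1"):
        date = ""
        paragraphs = []
        # Walk the tags that follow this <h1> until the next <h1> starts.
        for sib in h1.find_next_siblings(True):
            if sib.name == "h1":
                break
            classes = sib.get("class") or []
            if DATE_CLASS in classes:
                date = sib.get_text(strip=True)
            elif BODY_CLASS in classes:
                # get_text() drops every tag and keeps only the text.
                paragraphs.append(sib.get_text(" ", strip=True))
        articles.append({
            "Title": h1.get_text(strip=True),
            "Date": date,
            "Text": " ".join(paragraphs),
        })
    return articles


articles = parse_articles(SAMPLE)
for article in articles:
    print(article)
```

For a real file you would read it with open(..., encoding='utf8') and pass the
string to parse_articles() instead of SAMPLE. If the articles are not laid out
as siblings like this, the inner loop needs adjusting.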

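For step 4 in the list quoted below (punctuation and stopwords), the same
loop-over-a-list idea applies once every article is a dictionary. This is only
a rough sketch: the tiny English PUNCTUATION and STOPWORDS sets are stand-ins
for the bnlp.corpus data used in the original code, and normalize is a
hypothetical helper name. Unlike the original loops, it also drops words that
become empty after the punctuation is removed.

```python
import string

# Hypothetical stand-ins: the original post takes both of these from
# bnlp.corpus; small English sets are used here so the sketch runs anywhere.
PUNCTUATION = set(string.punctuation)
STOPWORDS = {"the", "a", "of", "and"}


def normalize(text, stopwords=STOPWORDS, punctuation=PUNCTUATION):
    """Split text into words, strip punctuation, and drop stopwords."""
    words = []
    for word in text.split():
        cleaned = "".join(ch for ch in word if ch not in punctuation)
        # Skip words that were pure punctuation as well as stopwords.
        if cleaned and cleaned not in stopwords:
            words.append(cleaned)
    return words


# Hypothetical article dicts in the Title/Date/Text shape from the post.
articles = [
    {"Title": "First", "Date": "09 Mar 2021", "Text": "the cat, and the hat."},
    {"Title": "Second", "Date": "10 Mar 2021", "Text": "a tale of two cities!"},
]

# One pass over the list processes every article, however many there are.
for article in articles:
    article["Words"] = normalize(article["Text"])
    print(article["Title"], article["Words"])
```
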
On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:
>   Thank you and apologies! I did not realize how jumbled it was at the
>   receiver's end. 
>   The code is now at this site :  [1]https://pastebin.com/wSi2xzBh 
>   I'm basically trying to do a few things with my code-
>
>    1. Extract 3 strings from the text- title, date and main text
>
>    2. Remove all tags afterwards
>
>    3. Save in a dictionary, with three keys- title, date and bodytext.
>
>    4. Remove punctuation and stopwords (I've used a user generated function
>       for that).
>
>   I've been able to do all of these steps for the file [2]ListFileReduced,
>   as shown in the code (although it's clunky).
>
>   But, I would like to be able to do it for the other text file: [3]ListFile
>   which has more articles. I used BeautifulSoup to scrape the data from the
>   website, and then generated a list that I saved as a text file. 
>
>   Best,
>   Monzur
>   On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus)
>   <[4]dciprus at cisco.com> wrote:
>
>     If you could utilize pastebin or a similar site to show your code, it
>     would help tremendously, since it's an unindented mess now and cannot
>     be read easily.
>
>     On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
>     >Dear List,
>     >
>     >Newbie here. I am trying to loop over a text file to remove html tags,
>     >punctuation marks, stopwords. I have already used Beautiful Soup
>     >(Python v 3.8.3) to scrape the text (newspaper articles) from the site.
>     >It returns a list that I saved as a file. However, I am not sure how to
>     >use a loop in order to process all the items in the text file.
>     >
>     >In the code below I have used listfilereduced.text (containing data from
>     >one news article, link to listfilereduced.txt here
>     ><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
>     >however I would like to run this code on listfile.text (containing data
>     >from multiple articles, link to listfile.text
>     ><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
>     >
>     >
>     >Any help would be greatly appreciated!
>     >
>     >P.S. The text is in a Non-English script, but the tags are all in English.
>     >
>     >
>     >#The code below is for a textfile containing just one item. I am not sure
>     >#how to tweak this to make it run for listfile.text (which contains raw data
>     >#from multiple articles)
>     >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
>     >    rawData = my_file.read()
>     >print(rawData)
>     >
>     >#Separating body text from other data
>     >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
>     >articleData = rawData[:articleStart]
>     >articleBody = rawData[articleStart:]
>     >print(articleData)
>     >print("*******")
>     >print(articleBody)
>     >print("*******")
>     >
>     >#First, I define a function to strip tags from the body text
>     >def stripTags(pageContents):
>     >    insideTag = 0
>     >    text = ''
>     >    for char in pageContents:
>     >        if char == '<':
>     >            insideTag = 1
>     >        elif (insideTag == 1 and char == '>'):
>     >            insideTag = 0
>     >        elif insideTag == 1:
>     >            continue
>     >        else:
>     >            text += char
>     >    return text
>     >
>     >#Calling the function
>     >articleBodyText = stripTags(articleBody)
>     >print(articleBodyText)
>     >
>     >##Isolating article title and publication date
>     >TitleEndLoc = articleData.find("</h1>")
>     >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
>     >dateEndLoc=articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
>     >titleString = articleData[:TitleEndLoc]
>     >dateString = articleData[dateStartLoc:dateEndLoc]
>     >
>     >##Call stripTags to clean
>     >articleTitle= stripTags(titleString)
>     >articleDate = stripTags(dateString)
>     >print(articleTitle)
>     >print(articleDate)
>     >
>     >#Cleaning the date a bit more
>     >startLocDate = articleDate.find(":")
>     >endLocDate = articleDate.find(",")
>     >articleDateClean = articleDate[startLocDate+2:endLocDate]
>     >print(articleDateClean)
>     >
>     >#save all this data to a dictionary that saves the title, data and the body text
>     >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
>     >print(PAloTextDict)
>     >
>     >#Normalize text by:
>     >#1. Splitting paragraphs of text into lists of words
>     >articleBodyWordList = articleBodyText.split()
>     >print(articleBodyWordList)
>     >
>     >#2.Removing punctuation and stopwords
>     >from bnlp.corpus import stopwords, punctuations
>     >
>     >#A. Remove punctuation first
>     >listNoPunct = []
>     >for word in articleBodyWordList:
>     >    for mark in punctuations:
>     >        word=word.replace(mark, '')
>     >    listNoPunct.append(word)
>     >print(listNoPunct)
>     >
>     >#B. removing stopwords
>     >banglastopwords = stopwords()
>     >print(banglastopwords)
>     >cleanList=[]
>     >for word in listNoPunct:
>     >    if word in banglastopwords:
>     >        continue
>     >    else:
>     >        cleanList.append(word)
>     >print(cleanList)
>     >--
>     >[7]https://mail.python.org/mailman/listinfo/python-list
>
>     --
>
>     Daniel Ciprus                              .:|:.:|:.
>     CONSULTING ENGINEER.CUSTOMER DELIVERY   Cisco Systems Inc.
>
>     [8]dciprus at cisco.com
>
>     tel: +1 703 484 0205
>     mob: +1 540 223 7098
>
>References
>
>   Visible links
>   1. https://pastebin.com/wSi2xzBh
>   2. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
>   3. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
>   4. mailto:dciprus at cisco.com
>   5. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
>   6. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
>   7. https://mail.python.org/mailman/listinfo/python-list
>   8. mailto:dciprus at cisco.com

