How to loop over a text file (to remove tags and normalize) using Python

S Monzur sb.monzur at gmail.com
Tue Mar 9 22:35:06 EST 2021


Thanks! I ended up using Beautiful Soup to remove the HTML tags and create
three lists (article titles, publication dates, main body), but I am still
facing a problem where the list is not properly storing the main body.
There is something wrong with my code for that section, and any comments
would be really helpful!

ListFile text file:
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
BeautifulSoup code for removing tags: <https://pastebin.com/qvbVMUGD>
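[Editor's note: without seeing the pastebin code it is hard to say why the body list is not storing properly, but a common cause is assigning to a variable inside the loop instead of appending. A minimal self-contained sketch of building three parallel lists; the marker strings and the `extract_field` helper are illustrative assumptions, not the original code:]

```python
# Sketch: build parallel lists from a list of raw article strings.
# `articles` stands in for the list Beautiful Soup produced; the field
# markers are assumptions based on the snippets quoted later in the thread.

def extract_field(article, start_marker, end_marker):
    """Return the substring between two markers, or '' if absent."""
    start = article.find(start_marker)
    if start == -1:
        return ''
    start += len(start_marker)
    end = article.find(end_marker, start)
    return article[start:end] if end != -1 else article[start:]

articles = [
    '<h1>Title one</h1><div class="story-element story-element-text">Body one</div>',
    '<h1>Title two</h1><div class="story-element story-element-text">Body two</div>',
]

titles, bodies = [], []
for article in articles:
    titles.append(extract_field(article, "<h1>", "</h1>"))
    # Append (not assign!) so every article's body is kept:
    bodies.append(extract_field(article, 'story-element-text">', "</div>"))

print(titles)   # -> ['Title one', 'Title two']
print(bodies)   # -> ['Body one', 'Body two']
```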


On Wed, Mar 10, 2021 at 4:32 AM Dan Ciprus (dciprus) <dciprus at cisco.com>
wrote:

> No problem, the list just converts everything into plain text, which is
> GREAT! :-)
>
> So without digging deeply into what you need to do: I am assuming that your
> input contains HTML tags. Why don't you utilize a lib like
> https://pypi.org/project/beautifulsoup4/ instead of committing hara-kiri by
> parsing the data by hand, without even a regex? Just a hint ..
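[Editor's note: the suggestion above amounts to letting a real HTML parser do the tag handling. A minimal sketch, assuming beautifulsoup4 is installed; the snippet and class names are hypothetical but mirror the ones the quoted code searches for:]

```python
from bs4 import BeautifulSoup

# A hypothetical single-article snippet.
html = ('<h1>Headline</h1>'
        '<div class="story-element story-element-text"><p>Body text.</p></div>')

soup = BeautifulSoup(html, "html.parser")

# get_text() strips every tag in one call -- no hand-rolled state machine.
title = soup.h1.get_text()
# class_ matches an element that has this class among others.
body = soup.find("div", class_="story-element-text").get_text()

print(title)
print(body)
```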
>
> On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:
> >   Thank you and apologies! I did not realize how jumbled it was at the
> >   receiver's end.
> >   The code is now at this site: [1]https://pastebin.com/wSi2xzBh
> >   I'm basically trying to do a few things with my code:
> >
> >    1. Extract 3 strings from the text- title, date and main text
> >
> >    2. Remove all tags afterwards
> >
> >    3. Save in a dictionary, with three keys- title, date and bodytext.
> >
> >    4. Remove punctuation and stopwords (I've used a user-defined function
> >       for that).
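[Editor's note: step 4 above can be sketched with plain string methods. The punctuation set and stopword list here are small English stand-ins; in the original code, bnlp.corpus supplies the Bangla ones:]

```python
# Stand-in sets; assumptions so the sketch stays self-contained.
punctuations = set(",.!?;:\"'()-।")
stopwords = {"the", "a", "an", "and"}

def normalize(text):
    """Split into words, strip punctuation marks, then drop stopwords."""
    words = text.split()
    no_punct = ["".join(ch for ch in w if ch not in punctuations) for w in words]
    return [w for w in no_punct if w and w.lower() not in stopwords]

print(normalize("The cat, and the dog!"))  # -> ['cat', 'dog']
```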
> >
> >   I've been able to do all of these steps for the file [2]ListFileReduced,
> >   as shown in the code (although it's clunky).
> >
> >   But I would like to be able to do it for the other text file,
> >   [3]ListFile, which has more articles. I used BeautifulSoup to scrape the
> >   data from the website, and then generated a list that I saved as a text
> >   file.
> >
> >   Best,
> >   Monzur
> >   On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus)
> >   <[4]dciprus at cisco.com> wrote:
> >
> >     If you could utilize pastebin or a similar site to show your code, it
> >     would help tremendously, since it's an unindented mess now and cannot
> >     be read easily.
> >
> >     On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
> >     >Dear List,
> >     >
> >     >Newbie here. I am trying to loop over a text file to remove HTML
> >     >tags, punctuation marks, and stopwords. I have already used Beautiful
> >     >Soup (Python v3.8.3) to scrape the text (newspaper articles) from the
> >     >site. It returns a list that I saved as a file. However, I am not
> >     >sure how to use a loop in order to process all the items in the text
> >     >file.
> >     >
> >     >In the code below I have used listfilereduced.txt (containing data
> >     >from one news article; link to listfilereduced.txt here
> >     ><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> >     >however I would like to run this code on listfile.txt (containing
> >     >data from multiple articles; link to listfile.txt
> >     ><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
> >     >
> >     >
> >     >Any help would be greatly appreciated!
> >     >
> >     >P.S. The text is in a non-English script, but the tags are all in
> >     >English.
> >     >
> >     >
> >     >#The code below is for a text file containing just one item. I am not
> >     >#sure how to tweak this to make it run for listfile.txt (which
> >     >#contains raw data from multiple articles)
> >     >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
> >     >    rawData = my_file.read()
> >     >print(rawData)
> >     >
> >     >#Separating body text from other data
> >     >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> >     >articleData = rawData[:articleStart]
> >     >articleBody = rawData[articleStart:]
> >     >print(articleData)
> >     >print("*******")
> >     >print(articleBody)
> >     >print("*******")
> >     >
> >     >#First, I define a function to strip tags from the body text
> >     >def stripTags(pageContents):
> >     >    insideTag = 0
> >     >    text = ''
> >     >    for char in pageContents:
> >     >        if char == '<':
> >     >            insideTag = 1
> >     >        elif (insideTag == 1 and char == '>'):
> >     >            insideTag = 0
> >     >        elif insideTag == 1:
> >     >            continue
> >     >        else:
> >     >            text += char
> >     >    return text
> >     >
> >     >#Calling the function
> >     >articleBodyText = stripTags(articleBody)
> >     >print(articleBodyText)
> >     >
> >     >##Isolating article title and publication date
> >     >TitleEndLoc = articleData.find("</h1>")
> >     >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> >     >dateEndLoc = articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> >     >titleString = articleData[:TitleEndLoc]
> >     >dateString = articleData[dateStartLoc:dateEndLoc]
> >     >
> >     >##Call stripTags to clean
> >     >articleTitle = stripTags(titleString)
> >     >articleDate = stripTags(dateString)
> >     >print(articleTitle)
> >     >print(articleDate)
> >     >
> >     >#Cleaning the date a bit more
> >     >startLocDate = articleDate.find(":")
> >     >endLocDate = articleDate.find(",")
> >     >articleDateClean = articleDate[startLocDate+2:endLocDate]
> >     >print(articleDateClean)
> >     >
> >     >#Save all this data to a dictionary holding the title, date and body text
> >     >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean,
> >     >                "Text": articleBodyText}
> >     >print(PAloTextDict)
> >     >
> >     >#Normalize text by:
> >     >#1. Splitting paragraphs of text into lists of words
> >     >articleBodyWordList = articleBodyText.split()
> >     >print(articleBodyWordList)
> >     >
> >     >#2. Removing punctuation and stopwords
> >     >from bnlp.corpus import stopwords, punctuations
> >     >
> >     >#A. Remove punctuation first
> >     >listNoPunct = []
> >     >for word in articleBodyWordList:
> >     >    for mark in punctuations:
> >     >        word = word.replace(mark, '')
> >     >    listNoPunct.append(word)
> >     >print(listNoPunct)
> >     >
> >     >#B. Removing stopwords
> >     >banglastopwords = stopwords()
> >     >print(banglastopwords)
> >     >cleanList = []
> >     >for word in listNoPunct:
> >     >    if word in banglastopwords:
> >     >        continue
> >     >    else:
> >     >        cleanList.append(word)
> >     >print(cleanList)
> >     >--
> >     >[7]https://mail.python.org/mailman/listinfo/python-list
> >
> >     --
> >
> >     Daniel Ciprus                              .:|:.:|:.
> >     CONSULTING ENGINEER.CUSTOMER DELIVERY   Cisco Systems Inc.
> >
> >     [8]dciprus at cisco.com
> >
> >     tel: +1 703 484 0205
> >     mob: +1 540 223 7098
> >
> >References
> >
> >   Visible links
> >   1. https://pastebin.com/wSi2xzBh
> >   2.
> https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> >   3.
> https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> >   4. mailto:dciprus at cisco.com
> >   5.
> https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> >   6.
> https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> >   7. https://mail.python.org/mailman/listinfo/python-list
> >   8. mailto:dciprus at cisco.com
>
>
>


More information about the Python-list mailing list