[Tutor] Parsing and collecting keywords from a webpage

Alan Gauld alan.gauld at yahoo.co.uk
Thu Jun 21 04:55:50 EDT 2018


On 20/06/18 20:32, Daniel Bosah wrote:

>   reg = pattern.findall(str(soup))
> 
>   for i in reg:
>      if i in reg and paul: # this loop checks to see if elements are in
> both the regexed parsed list and the list. 

No it doesn't. It checks whether i is in reg and
whether paul is non-empty - which it always is.
So this if test is really just testing whether
i is in reg. That is also always true, since
the for loop is iterating over reg.

So you are effectively saying

if True and True

or

if True.

What you really wanted was something like

if i in reg and i in paul:

But since you know i is in reg you can drop
that bit to get

if i in paul:
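
As a minimal sketch, keeping your own names (reg, paul, sets)
and assuming paul is the second word list you mentioned:

   sets = []
   for i in reg:
       if i in paul:            # keep only items found in both lists
           sets.append(str(i))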



>             sets.append(str(i))

Because the if is always true, you always add i to sets.


>             with open('sets.txt', 'w') as f:
>                 f.write(str(sets))
>                 f.close()

Why not just wait until the end? Writing the entire sets structure
to the file each time is very wasteful. Alternatively, use append
mode and write just the new item to the file.

Also you don't need f.close if you use a with statement.
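
For example, a rough sketch that does the writing once, after
the loop has finished building the sets list:

   with open('sets.txt', 'w') as f:    # open the file once
       for item in sets:
           f.write(item + '\n')        # one matched item per line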

> However, every time I run the current code, I get all the
> textfile(sets.txt) from the previous ( regex ) function, even though all I
> want are words and phrases shared between the textfile from regex and the
> monum list from regexparse. How can I fix this?

I think that's due to the incorrect if expression above.

But I didn't check the rest of the code...

However, I do wonder about your use of soup as your
search string. Isn't soup the parsed HTML structure?
Is that really what you want to search with your regex?
But I'm no BeautifulSoup expert, so there might be some
magic at work there.
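
If what you actually want is to search the visible text rather
than the raw markup, something like this might be closer (only a
guess at your intent - get_text() is a standard BeautifulSoup
method):

   text = soup.get_text()          # page text with the tags stripped
   reg = pattern.findall(text)     # run the regex over the text only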

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



