[Tutor] deleting elements out of a list.

Cameron Simpson cs at cskk.id.au
Sat Jun 15 03:54:46 EDT 2019


On 15Jun2019 14:51, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>I am not sure how to tackle this issue. I am using Windows 10 and 
>Python 3.6 from Activestate.
>
>I have a list of x number of elements. Some of the elements are have similar
>words in them. For example:
>
>Dog food Pal
>Dog Food Pal qx1323
>Cat food kitty
>Absolute cleaning inv123
>Absolute Domestic cleaning inv 222
>Absolute d 3333
>Fitness first 02/19
>Fitness first

I'm going to assume that you have a list of strings, each being a line 
from a file.

>I wish to remove duplicates. I could use the collection.Count method. This
>fails due to the strings are not unique, only some of the words are.

You need to define this more tightly. Suppose the above were your input.  
What would it look like after "removing duplicates"? By providing an 
explicit example of what you expect afterwards it is easier for us to 
understand you, and will also help you with your implementation.

Do you intend to discard the second occurrence of every word, turning 
line 2 above into "qx1323"? Or to remove similar lines, for some 
definition of "similar", which might discard line 2 above?

Your code examples below seem to suggest that you want to discard words 
you've already seen.

>My
>thinking and is only rough sudo code as I am not sure how to do this and

Aside: "pseudo", not "sudo".

>wish to learn and not sure how to do without causing gtraceback errors. I
>want to delete the match pattern from the list of strings. Below is my
>attempt and I hope this makes sense.
>
>description = load_files() # returns a list
>for text in description:
>    words = text.split()
>    for i in enumerate(words):

enumerate() yields a sequence of (i, v), so you need i, v in the loop:

  for i, word in enumerate(words):

Or you need the loop variable to be a tuple and to pull out the 
enumeration counter and the associated value inside the loop:

  for x in enumerate(words):
    i, word = x
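
A small sketch of both forms, in case it helps to see them run. The sample 
line is taken from your list and the variable names are only for 
illustration:

```python
# Assumption: one sample line from the post stands in for real data.
text = "Dog Food Pal qx1323"
words = text.split()

# Form 1: unpack the (index, value) pair directly in the for statement.
pairs = []
for i, word in enumerate(words):
    pairs.append((i, word))

# Form 2: take the whole tuple, then unpack it inside the loop.
pairs_again = []
for x in enumerate(words):
    i, word = x
    pairs_again.append((i, word))

print(pairs)
```

Both loops collect the same (index, word) pairs; the first form is the 
one you will see most often.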

>        Word = ' '.join(words[:i])

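Variable names in Python are case sensitive. You assign to "Word" here 
but print "word" below; pick one spelling, probably "word".

Also, "words" is the list you got from text.split(), so the expression 
"words[:i]" is a slice of that list: the words from index 0 through to 
i-1, not the letters of a word. For example, with "Cat food kitty" and 
"i" being 2 you get ["Cat", "food"], and the join produces:

  "Cat food"

On the first pass "i" is 0, so the slice is empty and the join gives an 
empty string. If you only want each word of the line, the loop variable 
from enumerate() already gives you that.

Be careful what you hand to join: it joins any iterable of strings, and 
a string is itself iterable. You get each character, but as a string 
(Python does not have a distinct "character" type, it just has single 
character strings). So joining the single string "kitt" gives:

  "k i t t"

Likely not what you want.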
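
If in doubt about what a slice or a join produces, it can help to try 
each in isolation. A minimal sketch, assuming one of your sample lines as 
input:

```python
# Assumption: one sample line from the post stands in for real data.
words = "Cat food kitty".split()   # a list of three words

# Slicing the list of words gives the leading words.
first_two = words[:2]
phrase = ' '.join(first_two)

# Slicing a single string gives its leading characters.
prefix = words[2][:4]

# join() accepts any iterable of strings, including a string itself,
# in which case it walks the individual characters.
spaced = ' '.join(prefix)

print(first_two, phrase, prefix, spaced)
```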

What _do_ you want?

>        print (word)
>        answer = input('Keep word?')
>        if answer == 'n':
>            continue
>        for i, v in enumerate(description):
>            if word in description[i]:
>                description.pop[i]

There are some problems here. A small one is that pop is a method, so it 
needs round brackets: description.pop(i), not description.pop[i]. The 
big one is that you're modifying a list while you're iterating over it. 
This is always hazardous: it usually leads to accidentally skipping 
elements. Or not, depending how the iteration happens.

It is generally safer to iterate over the list and construct a distinct 
new list to replace it, without modifying the original list. This way 
the iteration cannot get confused. So instead of discarding from the 
list, you conditionally add to the new list:

  # "word" is the word chosen in your outer loop
  new_description = []
  for text in description:
    if word not in text:
      new_description.append(text)

Note the "not" above. We invert the condition ("not in" instead of "in") 
because we're inverting the action (appending something instead of 
discarding it).
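
As a runnable sketch of that idiom: the data below is copied from your 
post, and the fixed value of "word" is an assumption standing in for 
whatever the user chose to discard. The match is a plain case sensitive 
substring test, which may or may not be the definition of "similar" you 
are after.

```python
# Assumptions: sample data from the post; "word" is hard-coded here
# instead of coming from input().
descriptions = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
    "Fitness first 02/19",
    "Fitness first",
]
word = "Dog"

# Build a new list of the descriptions we keep; the original list
# is never modified, so the loop cannot lose its place.
new_descriptions = []
for description in descriptions:
    if word not in description:
        new_descriptions.append(description)

print(new_descriptions)
```

The same filter can be written as a list comprehension, 
[d for d in descriptions if word not in d], once the loop form is clear.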

However, I think you have some fundamental confusion about what you're 
iterating over.

I recommend that you adopt better variable names, and more formally 
describe your data.

If "description" is actually a list of descriptions then give it a plural 
name like "descriptions". When you iterate over it, you can then use the 
singular form for each element, i.e. "description" instead of "text".

Instead of writing loops like:

  for i, v in enumerate(descriptions):

give "v" a better name, like "description". That way your code inside 
the loop is better described, and mistakes more obvious because the code 
will suddenly read badly in some way.

>The initial issues I see with the above is the popping of an element 
>from
>description list will cause a error.

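It often won't. Instead it will mangle your iteration because after the 
pop the index "i" no longer refers to what you expect; it now points one 
element further along.

With enumerate() you usually won't see an error at all: the loop just 
stops early, because the list has become shorter underneath it. You 
would get an IndexError towards the _end_ if you looped over 
range(len(description)) instead, once "i" starts to exceed the length 
of the list (because you've been shortening it).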
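
To see the skipping happen, here is a small sketch using three of your 
sample lines. The hard-coded "word" is an assumption for illustration. 
Both "Dog" lines ought to go, but one survives:

```python
# Assumption: sample data from the post; "word" is hard-coded.
descriptions = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
]
word = "Dog"

# Deliberately wrong: popping from the list we are iterating over.
for i, description in enumerate(descriptions):
    if word in description:
        descriptions.pop(i)

# The second "Dog" line slid down into index 0 after the first pop,
# but the loop had already moved on to index 1, so it was never tested.
print(descriptions)
```

No exception is raised in this run; the list is simply left in the wrong 
state, which is arguably worse than a traceback.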

>If I copy the description list into a
>new list. And use the new list for the outer loop. I will receive multiple
>occurrences of the same text. This could be addressed by a if test. But I am
>wondering if there is a better method.

The common idiom is to leave the original unchanged and copy into a new 
list as in my example above. But taking a copy and iterating over that 
is also reasonable.

You will still have issues with the popping, because the index "i" will 
no longer be aligned with the modified list.

If you really want to modify in place, avoid enumerate. Instead, make 
"i" an index into the list as you do, but maintain it yourself. Loop 
from left to right in the list until you come off the end:

  i = 0
  while i < len(description):
    if ... we want to pop the element ...:
      description.pop(i)
    else:
      i = i + 1

Here we _either_ discard from the list and _do not_ advance "i", or we 
advance "i". Either way "i" then points at the next element, in the 
former case because the next element has shuffled down one position and 
in the latter because "i" has moved forwards. Either way "i" gets closer 
to the end of the list. We leave the loop when "i" gets past the end.
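
A runnable version of that loop, as a sketch only: the data comes from 
your post, while the hard-coded "word" and the substring test are 
assumptions standing in for your real condition.

```python
# Assumptions: sample data from the post; "word" is hard-coded and the
# pop condition is a plain substring test.
descriptions = [
    "Dog food Pal",
    "Dog Food Pal qx1323",
    "Cat food kitty",
    "Fitness first",
]
word = "Dog"

i = 0
while i < len(descriptions):
    if word in descriptions[i]:
        # Discard and stay put: the next element has moved into slot i.
        descriptions.pop(i)
    else:
        # Keep this element and move on to the next one.
        i = i + 1

print(descriptions)
```

Note that len(descriptions) is evaluated afresh on every pass, which is 
what lets the loop cope with the list getting shorter.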

>2nd code example:
>
>description = load_files() # returns a list
>search_txt = description.copy() # I have not verify if this is the right
>syntax for the copy method.]

A quick way is:

  search_txt = description[:]

but lists have a .copy method which does the same thing.
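
A quick sketch to confirm that both forms give an independent list, using 
two of your sample lines as assumed data. Both are shallow copies, which 
is all you need for a list of strings:

```python
# Assumption: two sample lines from the post stand in for load_files().
description = ["Dog food Pal", "Cat food kitty"]

by_slice = description[:]        # copy via a full slice
by_method = description.copy()   # copy via the list method

# Modifying the original does not affect either copy.
description.pop(0)

print(description, by_slice, by_method)
```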

>for text in search_txt:
>    words = text.split()
>    for i in enumerate(words):
>        Word = ' '.join(words[:i])
>        print (word)
>        answer = input('Keep word (ynq)?')
>        if answer == 'n':
>            continue
>        elif answer = 'q':
>            break
>        for i, v in enumerate(description):
>            if word in description[i]:
>                description.pop[i]

The inner for loop still has all the same issues as before. The outer 
loop is now more robust because you're iterating over the copy.

Cheers,
Cameron Simpson <cs at cskk.id.au>

