[Tutor] List and dictionary comprehensions

Mon Sep 29 13:51:20 CEST 2014

On 28/09/14 03:36, Armindo Rodrigues wrote:

> have noted the beginning and end of the quotes list so you can easily skip
> and go straight to the code section. ***

It would probably have been better to just delete all but a few of the 
quotes. We don't need all of them to evaluate your code.

> import re
> from datetime import datetime
> import time
>
>
> ###################  DATA LIST STARTS HERE
>
> data_list=["And now here is my secret, a very simple secret: It is only
> with the heart that one can see rightly; what is essential is invisible to
> the eye.",
> "All grown-ups were once children... but only few of them remember it.",
...
> "If you love a flower that lives on a star, then it's good at night, to
> look up at the sky. All the stars are blossoming."]
>
>
> ################## CODE STARTS HERE
>
> #Create a list of words taken from each individual word in the datalist
> word_list = []
> for item in data_list:
>      for word in item.split(" "):
>          word = re.sub('^[^a-zA-z]*|[^a-zA-Z]*$','', word)

word.strip() would be better here. You can specify a string of chars to 
be stripped if it's not only whitespace. Consider regular expressions as 
a weapon of last resort.
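
As a rough sketch of the strip() idea (the sample words are made up, and 
using string.punctuation as the character set is my assumption; unlike 
your regex it leaves digits alone):

```python
import string

# Hypothetical sample words, not taken from the real data_list.
raw_words = ['"Hello,', "world!", "it's", "(star)..."]

# strip(chars) removes any of the given characters from both ends only,
# so the apostrophe inside "it's" survives.
cleaned = [word.strip(string.punctuation) for word in raw_words]
print(cleaned)
```

Only leading and trailing characters are removed, which is what the 
regular expression was trying to do.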

>          word_list.append(word)
> word_list = sorted(list(set(word_list))) #Remove repeated words

You don't need to convert the set into a list. sorted() works
with sets too.
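
For example, something like this (the word list is invented for 
illustration):

```python
# Hypothetical words with repeats.
words = ["star", "flower", "star", "heart", "flower"]

# sorted() accepts any iterable, including a set, and returns a list.
unique_sorted = sorted(set(words))
print(unique_sorted)
```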

> quotesDict = {}
> for word in word_list:
>      quotesDict.setdefault(word,[]) #Create a dictionary with keys based on
> each word in the word list

By putting the words in the dictionary you lose the sorting you did 
above. So the sorting was a waste of time.
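
If you do want sorted output, one option is to sort the keys at the point 
where you use them. A minimal sketch, with a made-up dictionary standing 
in for quotesDict:

```python
# Hypothetical index: word -> list of quote indexes.
quotes_index = {"star": [2], "heart": [0], "flower": [2, 5]}

# Iterating over sorted(dict) yields the keys in sorted order,
# independent of how the dictionary was built.
ordered = sorted(quotes_index)
for word in ordered:
    print(word, quotes_index[word])
```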

> for key, value in quotesDict.items():
>      indexofquote = 0
>      for quote in data_list:

You should use enumerate for this. It will automatically give you the 
index and quote and be less error prone than maintaining the index yourself.
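
Roughly like this (the shortened data_list and the two keys are stand-ins 
for your real data):

```python
# Shortened, hypothetical stand-ins for the real data.
data_list = [
    "All grown-ups were once children",
    "It is only with the heart that one can see rightly",
    "All the stars are blossoming",
]
quotesDict = {"All": [], "heart": []}

for key in quotesDict:
    # enumerate() yields (index, item) pairs, so no manual counter is needed.
    for index, quote in enumerate(data_list):
        if key in quote:  # substring test, as in the original code
            quotesDict[key].append(index)

print(quotesDict)
```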

>          if key in quote:
>              quotesDict[key].append(indexofquote) #Append the index of the
> found quotes to the dictionary key
>          indexofquote+=1
>
> query=input("query: ")
> query = query.strip(" ").split(" ")
> query = list(set(query))
>

I don't think you need the conversion to list here either.
You can just use the set.
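
A small sketch of what I mean. The query string is invented, and I have 
used split() with no argument (my choice, not your code), which also 
copes with leading, trailing and repeated spaces:

```python
# Hypothetical user input with stray spaces and a repeated word.
raw_query = "  star and flower and star "

# split() with no argument splits on any run of whitespace,
# and set() drops the repeated words.
query = set(raw_query.split())

# A set still supports membership tests, removal and iteration.
print("and" in query)
print(sorted(query))
```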

> start_time = time.time()
>
> FoundQuotes = []
>
> # Right now the OR search just prints out the index of the found quotes.
> if ("or" in query) and ("and" not in query):

The logic here can be simplified by testing for 'and' first:

if 'and' in query:
    remove 'or'
    process 'and'
elif 'or' in query:
    process 'or'
else:
    process simple query
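
In Python that structure might look something like the sketch below. The 
function name classify_query and its return values are hypothetical, and 
I am assuming the keywords themselves should be dropped from the search 
terms:

```python
def classify_query(terms):
    """Return (mode, search_terms) for an iterable of query words.

    Hypothetical helper: 'and' takes priority over 'or'.
    """
    terms = set(terms)  # work on a copy, leave the caller's data alone
    if "and" in terms:
        terms.discard("and")
        terms.discard("or")  # discard() does not fail if 'or' is absent
        return "and", terms
    elif "or" in terms:
        terms.discard("or")
        return "or", terms
    return "simple", terms


print(classify_query(["star", "and", "flower"]))
print(classify_query(["star"]))
```

Each branch then only has to do its own kind of search.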

>      query.remove("or")
>      print("Performing OR search for: ", query)
>      for item in query:
>          if (item in quotesDict):
>              print("FOUND ",len(quotesDict[item]),  " ", item, "QUOTES: ",
> quotesDict.get(item))
>      print("\n--- Execution ---\n", (time.time() - start_time) * 1000,
> "microseconds\n")
>
> else:
>      if "and" in query:
>          query.remove("and")
>      if "or" in query:
>          query.remove("or")
>      print("Performing AND search for: ", query)

This looks wrong. What about the case where neither 'and' nor 'or' is in 
the query?

>      for item in query:
>          if (item in quotesDict):
>              FoundQuotes = FoundQuotes + (quotesDict.get(item))
>      FoundQuotes = list(set([x for x in FoundQuotes if FoundQuotes.count(x)
>> 1]))

This doesn't look right either.
FoundQuotes is a list of indexes. The comprehension builds a list of all 
the indexes that appear more than once - what about a quote that was 
only found once?

It then eliminates all the duplicates (set()) and converts the result back 
to a list (why not leave it as a set?).

I'd have expected a simple conversion of FoundQuotes to a set would be 
what you wanted.
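
To see the difference, compare the two on a made-up list of indexes:

```python
# Hypothetical result list: index 3 was found twice, 0 and 7 once each.
FoundQuotes = [0, 3, 3, 7]

# What the comprehension keeps: only indexes that occur more than once.
more_than_once = {x for x in FoundQuotes if FoundQuotes.count(x) > 1}

# A plain set conversion keeps every index that was found, once each.
all_found = set(FoundQuotes)

print(more_than_once)
print(all_found)
```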

>      for x in FoundQuotes:
>          print(data_list[x])
>      print("\n--- Execution ---\n", (time.time() - start_time) * 1000,
> "microseconds\n")

The other problem is that you are searching the dictionary
several times, thus losing some of the speed advantage of
using a dictionary.

You would get more benefit from the dictionary if you adopt a try/except 
approach and just access the key directly. So, instead of:

>      for item in query:
>          if (item in quotesDict):
>              FoundQuotes = FoundQuotes + (quotesDict.get(item))

for item in query:
   try: FoundQuotes = FoundQuotes + quotesDict[item]
   except KeyError: pass

Or better still use the default value of get:

for item in query:
     FoundQuotes = FoundQuotes + quotesDict.get(item,[])
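
A quick check that both forms give the same result, using a made-up 
dictionary and query (the names with_try and with_get are just for this 
illustration):

```python
# Hypothetical data: "missing" is deliberately not a key.
quotesDict = {"star": [2, 5], "heart": [0]}
query = ["star", "missing", "heart"]

# Version 1: access the key directly and ignore missing words.
with_try = []
for item in query:
    try:
        with_try = with_try + quotesDict[item]
    except KeyError:
        pass

# Version 2: let get() supply an empty list for missing words.
with_get = []
for item in query:
    with_get = with_get + quotesDict.get(item, [])

print(with_try)
print(with_get)
```

Either way each word costs a single dictionary lookup.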

There are a few other things that could be tidied up but that should 
give you something to get started with.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.flickr.com/photos/alangauldphotos