[Chicago] Need advice on this project.

Lewit, Douglas d-lewit at neiu.edu
Tue Nov 10 12:20:33 EST 2015


This was amazingly helpful, thanks!  I'll check the denominator of my
correlation, but I'm pretty sure it's correct.  It won't hurt to
double-check it, though.  When I took a slice of my similarity matrix, all
the correlations were floats between -1 and +1, so that's a good sign that
my computation was correct, albeit very time-consuming.  The reason I
didn't use NumPy arrays is that the professor for this class doesn't know a
lot of Python, and he uses Microsoft Visual Studio to run my Python
programs.  I don't know whether NumPy is part of that installation.  NumPy
is not part of the standard Python installation, so if I submit a program
that uses anything from the NumPy library, he won't be able to run my code.
I emailed him and asked about his Python installation, but he didn't get
back to me.
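
In case it helps, one pattern I'm thinking about (just a sketch; HAVE_NUMPY
is my own placeholder name) is guarding the import so the same script still
runs on a plain CPython install with no NumPy:

try:
    import numpy as np      # use the fast array code when NumPy is available
    HAVE_NUMPY = True
except ImportError:
    np = None               # otherwise fall back to the pure-Python loops
    HAVE_NUMPY = False

That way the pure-Python version stays as the fallback and the NumPy version
only kicks in when the import succeeds.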

Thanks for the feedback!  Very much appreciated!!!

Best,

Douglas.


On Tue, Nov 10, 2015 at 8:52 AM, Sunhwan Jo <sunhwanj at gmail.com> wrote:

> 1. Your “correlation” function takes most of the execution time.
>
> def Correlation(p, q):
>       global PQ_Ratings
>       sum1 = 0
>       sum2 = 0
>       numeratorProduct = 1
>       denominatorProduct1 = 1
>       denominatorProduct2 = 1
>       for key in filter( lambda x: x[0] == p or x[0] == q, PQ_Ratings.keys( ) ):
>             if key[0] == p:
>                   sum1+= PQ_Ratings[key] - AverageRatingsOfItems[key[1]]
>             else:
>                   sum2+= PQ_Ratings[key] - AverageRatingsOfItems[key[1]]
>             numeratorProduct+= sum1*sum2
>             denominatorProduct1+= sum1**2
>             denominatorProduct2+= sum2**2
>       return numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
>
>
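> (If you want to confirm where the time goes yourself, something like the
> following should do it; BuildSimilarityMatrix is just a placeholder for
> whatever top-level call builds your matrix.)
>
> import cProfile, pstats
> cProfile.run('BuildSimilarityMatrix()', 'profile.out')   # placeholder entry point
> pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)
>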
> Changing sum1 and sum2 to list comprehensions can increase the execution
> speed by roughly 10x (a rough estimate using your code). In addition, the
> denominator is wrong: it should be the *sum of squared differences*, not
> the *square of the sum of differences*, but I’m not concerned with that yet.
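>
> For reference, the corrected formula would look roughly like this (just a
> sketch; diffs1 and diffs2 are stand-in names for the per-item differences
> for users p and q over their co-rated items):
>
> numerator = sum(d1*d2 for d1, d2 in zip(diffs1, diffs2))
> denominator = math.sqrt(sum(d**2 for d in diffs1)) * math.sqrt(sum(d**2 for d in diffs2))
> correlation = numerator/denominator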
>
> def Correlation(p, q):
>       global PQ_Ratings
>       sum1 = 0
>       sum2 = 0
>       numeratorProduct = 1
>       denominatorProduct1 = 1
>       denominatorProduct2 = 1
>       keys = [key for key in PQ_Ratings.keys() if key[0] == p or key[0] == q]
>       sum1 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == p])
>       sum2 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == q])
>       numeratorProduct+= sum1*sum2
>       denominatorProduct1+= sum1**2
>       denominatorProduct2+= sum2**2
>       return numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
>
>
> 2. You don’t have to recalculate sum1 each time. “sum1” only depends on
> “p”, so you can calculate it once in the outer loop and reuse it.
>
> keys = PQ_Ratings.keys()
> for i in range(1, len(SimilarityMatrix)):
>       sum1 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == i])
>
>       for j in range(i + 1, len(SimilarityMatrix)):
>             sum2 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == j])
>             numeratorProduct = sum1*sum2 + 1
>             denominatorProduct1 = sum1**2 + 1
>             denominatorProduct2 = sum2**2 + 1
>             SimilarityMatrix[i][j] = numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
>
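> Taking the same idea a step further (just a sketch, not tested against your
> data): you could group the differences by user once, so neither loop has to
> scan every key:
>
> from collections import defaultdict
>
> user_sums = defaultdict(float)
> for key in PQ_Ratings:
>       # key is (user, item); accumulate each user's sum of differences once
>       user_sums[key[0]] += PQ_Ratings[key] - AverageRatingsOfItems[key[1]]
>
> Then sum1 is just user_sums[i] and sum2 is user_sums[j] inside the loops.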
>
> This will again speed things up, but the total execution time is still
> about 200 minutes with 900+ users.
>
> 3. Is there any reason not to use NumPy arrays? Using NumPy, it finishes in
> a fraction of a minute. Notice I also fixed the bug in the numerator and
> the denominator.
>
> import numpy as np
> nitems = max(AverageRatingsOfItems.keys())
> nusers = max([key[0] for key in PQ_Ratings.keys()])
> avg_rating = np.zeros(nitems)
> pq_rating = np.zeros((nusers, nitems))
> keys = PQ_Ratings.keys()
> for key in keys:
>       pq_rating[key[0]-1, key[1]-1] = PQ_Ratings[key]
> keys = AverageRatingsOfItems.keys()
> for key in keys:
>       avg_rating[key-1] = AverageRatingsOfItems[key]
>
> startTime = time.time( )
>
> #### Let's finish building up our similarity matrix for this problem.
> keys = PQ_Ratings.keys()
> for i in range(1, len(SimilarityMatrix)):
>       #sum1 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == i])
>       diff1 = pq_rating[i-1] - avg_rating
>
>       for j in range(i + 1, len(SimilarityMatrix)):
>             #sum2 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == j])
>             diff2 = pq_rating[j-1] - avg_rating
>             numeratorProduct = np.sum(diff1*diff2)
>             denominatorProduct1 = np.sum(diff1**2)
>             denominatorProduct2 = np.sum(diff2**2)
>             SimilarityMatrix[i][j] = numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
>
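>
> If you want to go all the way with NumPy, the whole matrix can also be
> computed at once (a sketch, assuming pq_rating and avg_rating are built as
> above and no user's centered row is all zeros):
>
> diffs = pq_rating - avg_rating                # one centered rating row per user
> norms = np.sqrt(np.sum(diffs**2, axis=1))     # per-user denominator terms
> sims = np.dot(diffs, diffs.T)/np.outer(norms, norms)
> # sims[i-1, j-1] corresponds to SimilarityMatrix[i][j] for i < j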
>
>
>
> On Nov 9, 2015, at 7:44 PM, Lewit, Douglas <d-lewit at neiu.edu> wrote:
>
> Hey guys,
>
> I need some advice on this one.  I'm attaching the homework assignment so
> that you understand what I'm trying to do.  I went as far as the
> construction of the Similarity Matrix, which is a matrix of Pearson
> correlation coefficients.
>
> My problem is this.  u1.base (which is also attached) contains Users
> (first column), Items (second column), Ratings (third column), and the
> timestamp in the 4th and final column.  (Just discard the 4th column.
> We're not using it for anything.)
>
> It's taking HOURS for Python to build the similarity matrix.  So what I
> did was:
>
> *head -n 5000 u1.base > practice.base*
>
> and I also downloaded the PyPy interpreter for Python 3.  Then using PyPy
> (or pypy or whatever) I ran my program on those first 5,000 lines of
> data from u1.base stored in the new text file, practice.base.  Not a
> problem!!!  I still had to wait a couple of minutes, but not a couple of
> hours!!!
>
>
> Is there a way to make this program work for such a large set of data?  I
> know my program successfully constructs the Similarity Matrix (i.e.
> similarity between users) for 5,000, 10,000, 20,000 and even 25,000 lines
> of data.  But for 80,000 lines of data the program becomes very slow and
> overtaxes my CPU.  (The fan turns on and the bottom of my laptop starts to
> get very hot... a bad sign!)
>
> Does anyone have any recommendations?  (I'm supposed to meet with my prof
> on Tuesday.  I may just explain the problem to him and request a smaller
> data set to work with.  And unfortunately he knows very little about
> Python.  He's primarily a C++ and Java programmer.)
>
> I appreciate the feedback.  Thank you!!!
>
> Best,
>
> Douglas Lewit
>
>
> <Homework3_Revision2.py><u1.base><practice2.base><HW3.pdf>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> https://mail.python.org/mailman/listinfo/chicago
>
>