[Chicago] Need advice on this project.

Tue Nov 10 10:46:22 EST 2015

Hi Douglas,

You seem to post interesting homework assignments when I'm looking for a
fun problem, thanks.

The issue definitely isn't the performance of either Python (the language)
or CPython (the implementation). I did the assignment last night, and
calculating the matrix for "u1.base" took my code less than 10 seconds.

For readability in your Correlation function, try to avoid globals,
lambdas created inside loops, and indexing with constant keys (e.g. key[0])
where argument unpacking would do. It also helps to follow PEP 8 if you
want other Python programmers to be able to read your code easily.
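
To illustrate the unpacking point, here is a minimal sketch. The data and
function names are made up for the example and are not taken from your
code; the rows are assumed to be (user, item, rating) tuples with the
timestamp already dropped.

```python
# Hypothetical data: (user, item, rating) tuples, shaped like rows of
# u1.base with the timestamp column discarded.
ratings = [(1, 10, 5), (1, 20, 3), (2, 10, 4)]


def total_by_user_indexed(rows):
    """Harder to read: constant indexes hide what each field means."""
    totals = {}
    for row in rows:
        totals[row[0]] = totals.get(row[0], 0) + row[2]
    return totals


def total_by_user_unpacked(rows):
    """Easier to read: unpacking names each field at the top of the loop."""
    totals = {}
    for user, _item, rating in rows:
        totals[user] = totals.get(user, 0) + rating
    return totals
```

Both functions return the same totals; the second one just tells the
reader what each field is.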

You probably have an algorithmic error in there somewhere -- it's hard for
me to tell for sure because your code is difficult to follow. Read the
assignment carefully, and only do what it tells you. For performance, are
there different data structures you could use? Are there "batteries
included" in Python that could combine some of those individual arithmetic
operations? I don't want to be too specific here because implementing the
algorithm is the point of the assignment.
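
Without giving the algorithm away, here is a rough sketch of the kind of
thing I mean. It deliberately stops short of computing any correlation.
The sample rows, the function names, and the assumption that columns are
whitespace-separated are mine, for illustration only.

```python
from collections import defaultdict

# Hypothetical rows in the u1.base layout: user, item, rating, timestamp.
lines = [
    "1\t10\t5\t874965758",
    "1\t20\t3\t876893171",
    "2\t10\t4\t878542960",
    "2\t30\t2\t876893119",
]


def load_ratings(rows):
    """Build {user: {item: rating}}, discarding the timestamp column."""
    by_user = defaultdict(dict)
    for row in rows:
        user, item, rating, _timestamp = row.split()
        by_user[int(user)][int(item)] = int(rating)
    return by_user


def shared_items(by_user, a, b):
    """Items rated by both users, via set intersection of dict key views."""
    return by_user[a].keys() & by_user[b].keys()


by_user = load_ratings(lines)
common = shared_items(by_user, 1, 2)
# sum() over a generator replaces a hand-written accumulation loop.
total_a = sum(by_user[1][item] for item in common)
```

A dict keyed by user makes "all ratings for this user" a single lookup
instead of a scan over every line, and the key-view intersection finds the
items two users have in common without nested loops.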

It looks like you still have two weeks to complete the project, so I'd
recommend taking your time. Don't be afraid to start a new version --
it can help you break out of bad patterns you've fallen into in your
existing code.

Best,
Adam

On Mon, Nov 9, 2015 at 7:44 PM, Lewit, Douglas <d-lewit at neiu.edu> wrote:

> Hey guys,
>
> I need some advice on this one.  I'm attaching the homework assignment so
> that you understand what I'm trying to do.  I went as far as the
> construction of the Similarity Matrix, which is a matrix of Pearson
> correlation coefficients.
>
> My problem is this.  u1.base (which is also attached) contains Users
> (first column), Items (second column), Ratings (third column) and finally
> the time stamp in the 4th and final column.  (Just discard the 4th column.
> We're not using it for anything. )
>
> It's taking HOURS for Python to build the similarity matrix.  So what I
> did was:
>
> *head -n 5000 u1.base > practice.base*
>
> and I also downloaded the PyPy interpreter for Python 3.  Then using PyPy
> (or pypy or whatever) I ran my program on the first ten thousand lines of
> data from u1.base stored in the new text file, practice.base.  Not a
> problem!!!  I still had to wait a couple minutes, but not a couple hours!!!
>
>
> Is there a way to make this program work for such a large set of data?  I
> know my program successfully constructs the Similarity Matrix (i.e.
> similarity between users) for 5,000, 10,000, 20,000 and even 25,000 lines
> of data.  But for 80,000 lines of data the program becomes very slow and
> overtaxes my CPU.  (The fan turns on and the bottom of my laptop starts to
> get very hot.... a bad sign! )
>
> Does anyone have any recommendations?  ( I'm supposed to meet with my prof
> on Tuesday.  I may just explain the problem to him and request a smaller
> data set to work with.  And unfortunately he knows very little about
> Python.  He's primarily a C++ and Java programmer. )
>
> I appreciate the feedback.  Thank you!!!
>
> Best,
>
> Douglas Lewit