In Pandas, can you do groupby on two different datasets?

C W tmrsg11 at gmail.com
Sat Jun 9 14:24:57 EDT 2018


Dear all,
I want to find the average rating of each movie, grouped by movieId. Below is
the output of ratings.head() for the dataset.

> ratings.head()

   userId  movieId  rating   timestamp          parsed_time
0       1        2     3.5  1112486027  2005-04-02 23:53:47
1       1       29     3.5  1112484676  2005-04-02 23:31:16
2       1       32     3.5  1112484819  2005-04-02 23:33:39
3       1       47     3.5  1112484727  2005-04-02 23:32:07
4       1       50     3.5  1112484580  2005-04-02 23:29:40


I'm trying two methods:

Method 1 (makes sense)
> ratings[['movieId', 'rating']].groupby('movieId').mean()
This returns a DataFrame; it's the form I see most often.

Method 2 (confusing)
> ratings.rating.groupby(ratings.movieId).mean()

movieId
1    3.921240
2    3.211977
3    3.151040
4    2.861393
5    3.064592
Name: rating, dtype: float64



What is going on in method 2? It refers to the ratings dataset twice.
First, ratings.rating restricts the working data to the rating column only.
Then it groups by ratings.movieId, but how does it know about movieId?
Didn't we just restrict the data to the rating column?
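
For reference, here is a minimal reproduction of the two methods on a few
made-up rows (not the real ratings file), assuming movieId is the intended
grouping key in both. As far as I understand the pandas documentation, a
Series passed to groupby is aligned to the grouped data by index label and
its values are used as the group keys. That would explain why Method 2 works
even though ratings.rating itself has no movieId column.

```python
import pandas as pd

# Toy data, invented for illustration only
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [2, 29, 2, 29, 2],
    "rating":  [3.5, 4.0, 2.5, 5.0, 3.0],
})

# Method 1: select two columns, group by a column name, get a DataFrame back
method1 = ratings[["movieId", "rating"]].groupby("movieId").mean()

# Method 2: group the rating Series by a separate Series of keys.
# ratings["movieId"] shares the row index 0..4 with ratings["rating"],
# so each rating is paired with the movieId from the same row.
method2 = ratings["rating"].groupby(ratings["movieId"]).mean()

print(method1)
print(method2)
```

On this toy data both methods give the same averages; the only difference I
can see is that Method 1 returns a DataFrame and Method 2 returns a Series.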

Thanks in advance!


