Case Study: Movies And Books Recommendation System

Introduction

Over the last few decades, with the rise of YouTube, Amazon, Netflix and many other web services, recommender systems have come to play an ever larger role in our lives. From e-commerce (suggesting articles that may interest buyers) to online advertising (showing users content that matches their preferences), recommender systems are now an unavoidable part of our daily online journeys.

In very general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, texts to read, products to buy, or anything else, depending on the industry).

Recommender systems are critical in some industries: when they work well they can generate a large amount of income, and they can be a way to stand out from competitors. As evidence of their importance, a few years ago Netflix organised a challenge (the “Netflix Prize”) in which the goal was to produce a recommender system that performed better than its own algorithm, with a prize of 1 million dollars.

In this project, different types of recommender systems are implemented to recommend movies and books to users. For each of them, we observe how it works and describe its theoretical basis.

Dataset:

Review

Movie Recommendation System

Weighted Hybrid Technique

Import Libraries And Data

In [132]:
# Import libraries
import pandas as pd
import numpy as np
In [133]:
# Import datasets
movies_df = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

Data Exploration

In [134]:
# Check movies data
movies_df.head()
Out[134]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.438 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.000 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.200 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.083 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.000 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.900 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.377 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.000 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.300 4466
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.313 [{"name": "Legendary Pictures", "id": 923}, {"... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.000 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.600 9106
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.927 [{"name": "Walt Disney Pictures", "id": 2}] [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.000 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.100 2124
In [135]:
# Check credits data
credits.head()
Out[135]:
movie_id title cast crew
0 19995 Avatar [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 285 Pirates of the Caribbean: At World's End [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 206647 Spectre [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 49026 The Dark Knight Rises [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 49529 John Carter [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
In [136]:
# Check datasets dimension
print("Credits:", credits.shape)
print("Movies:", movies_df.shape)
Credits: (4803, 4)
Movies: (4803, 20)

Data Preprocessing

In [137]:
# Rename the key column so that both frames share the name "id"
credits_column_renamed = credits.rename(columns = {"movie_id": "id"})

# Merge column data
movies_df_merge = movies_df.merge(credits_column_renamed, on = 'id')

# Check data
movies_df_merge.head()
Out[137]:
budget genres homepage id keywords original_language original_title overview popularity production_companies ... runtime spoken_languages status tagline title_x vote_average vote_count title_y cast crew
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.438 [{"name": "Ingenious Film Partners", "id": 289... ... 162.000 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.200 11800 Avatar [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.083 [{"name": "Walt Disney Pictures", "id": 2}, {"... ... 169.000 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.900 4500 Pirates of the Caribbean: At World's End [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.377 [{"name": "Columbia Pictures", "id": 5}, {"nam... ... 148.000 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.300 4466 Spectre [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.313 [{"name": "Legendary Pictures", "id": 923}, {"... ... 165.000 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.600 9106 The Dark Knight Rises [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.927 [{"name": "Walt Disney Pictures", "id": 2}] ... 132.000 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.100 2124 John Carter [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...

5 rows × 23 columns
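
The merged frame contains title_x and title_y because both input frames carry a title column. The following is a minimal sketch on two toy frames (the rows and the names movies_toy and credits_toy are made up for illustration, not taken from the dataset) showing where those suffixes come from. It also uses the validate argument of merge, which should raise an error if an id appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (illustrative values only)
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "vote_average": [7.2, 6.9, 6.3],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["x", "y", "z"],
})

# Rename the key, then merge; the shared "title" column gets _x / _y suffixes
merged_toy = movies_toy.merge(
    credits_toy.rename(columns={"movie_id": "id"}),
    on="id",
    validate="one_to_one",  # fail loudly if ids are duplicated on either side
)

print(merged_toy.columns.tolist())
```

Since the two title columns hold the same information, dropping both and keeping original_title, as the next cell does, loses nothing needed for the ranking.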

In [138]:
# Drop unnecessary columns
movies_cleaned_df = movies_df_merge.drop(columns = ['homepage', 'title_x', 'title_y', 'status','production_countries'])

# Check data
movies_cleaned_df.head()
Out[138]:
budget genres id keywords original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages tagline vote_average vote_count cast crew
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.438 [{"name": "Ingenious Film Partners", "id": 289... 2009-12-10 2787965087 162.000 [{"iso_639_1": "en", "name": "English"}, {"iso... Enter the World of Pandora. 7.200 11800 [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.083 [{"name": "Walt Disney Pictures", "id": 2}, {"... 2007-05-19 961000000 169.000 [{"iso_639_1": "en", "name": "English"}] At the end of the world, the adventure begins. 6.900 4500 [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.377 [{"name": "Columbia Pictures", "id": 5}, {"nam... 2015-10-26 880674609 148.000 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... A Plan No One Escapes 6.300 4466 [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.313 [{"name": "Legendary Pictures", "id": 923}, {"... 2012-07-16 1084939099 165.000 [{"iso_639_1": "en", "name": "English"}] The Legend Ends 7.600 9106 [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.927 [{"name": "Walt Disney Pictures", "id": 2}] 2012-03-07 284139100 132.000 [{"iso_639_1": "en", "name": "English"}] Lost in our world, found in another. 6.100 2124 [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
In [139]:
# Check data info
movies_cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   release_date          4802 non-null   object 
 10  revenue               4803 non-null   int64  
 11  runtime               4801 non-null   float64
 12  spoken_languages      4803 non-null   object 
 13  tagline               3959 non-null   object 
 14  vote_average          4803 non-null   float64
 15  vote_count            4803 non-null   int64  
 16  cast                  4803 non-null   object 
 17  crew                  4803 non-null   object 
dtypes: float64(3), int64(4), object(11)
memory usage: 712.9+ KB
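
The info() output above reports fewer non-null entries than the 4803 rows for overview, release_date, runtime and tagline. As a small sketch of how such gaps can be counted directly, the snippet below uses a toy frame with made-up rows (info_toy is a hypothetical name) and counts missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries (illustrative values only)
info_toy = pd.DataFrame({
    "overview": ["text", None, "text"],
    "runtime": [120.0, np.nan, 95.0],
    "tagline": [None, None, "t"],
})

# isna() marks missing cells; sum() counts them column by column
missing_per_column = info_toy.isna().sum()
print(missing_per_column)
```

None of the columns used for the weighted ranking below (vote_average, vote_count, popularity) has missing values according to the info() output, so no imputation is needed for this technique.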
In [140]:
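# Components of the weighted rating WR = (R * v + C * m) / (v + m):
#   v = number of votes for the movie, R = average rating of the movie,
#   C = mean rating across all movies, m = vote-count threshold (70th percentile of vote_count)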
v = movies_cleaned_df['vote_count']
R = movies_cleaned_df['vote_average']
C = movies_cleaned_df['vote_average'].mean()
m = movies_cleaned_df['vote_count'].quantile(0.70)
In [141]:
# Add column of weighted average
movies_cleaned_df['weighted_average'] = ((R * v)+ (C * m)) / (v + m)
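
To see what the weighted average does, here is a minimal sketch that wraps the same expression in a function and applies it to a toy frame. The function name weighted_rating, the frame rating_toy and its rows are assumptions made for illustration. The point is that a title with a high average but very few votes is pulled towards the overall mean C, while a title with many votes stays close to its own average.

```python
import pandas as pd

def weighted_rating(df, quantile=0.70):
    """Return (R * v + C * m) / (v + m) for every row of df."""
    v = df["vote_count"]
    R = df["vote_average"]
    C = R.mean()             # mean rating across all rows
    m = v.quantile(quantile) # vote-count threshold
    return (R * v + C * m) / (v + m)

# Illustrative values only
rating_toy = pd.DataFrame({
    "title": ["many votes", "few votes", "mid votes", "low rated"],
    "vote_count": [10000, 5, 500, 50],
    "vote_average": [8.0, 9.5, 6.0, 4.0],
})
rating_toy["weighted_average"] = weighted_rating(rating_toy)

print(rating_toy.sort_values("weighted_average", ascending=False))
```

In this toy example the 9.5 average backed by only 5 votes ends up almost exactly at the overall mean, below the 8.0 average backed by 10000 votes.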
In [142]:
# Check data
movies_cleaned_df.head()
Out[142]:
budget genres id keywords original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages tagline vote_average vote_count cast crew weighted_average
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.438 [{"name": "Ingenious Film Partners", "id": 289... 2009-12-10 2787965087 162.000 [{"iso_639_1": "en", "name": "English"}, {"iso... Enter the World of Pandora. 7.200 11800 [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de... 7.148
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.083 [{"name": "Walt Disney Pictures", "id": 2}, {"... 2007-05-19 961000000 169.000 [{"iso_639_1": "en", "name": "English"}] At the end of the world, the adventure begins. 6.900 4500 [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de... 6.808
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.377 [{"name": "Columbia Pictures", "id": 5}, {"nam... 2015-10-26 880674609 148.000 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... A Plan No One Escapes 6.300 4466 [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de... 6.276
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.313 [{"name": "Legendary Pictures", "id": 923}, {"... 2012-07-16 1084939099 165.000 [{"iso_639_1": "en", "name": "English"}] The Legend Ends 7.600 9106 [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de... 7.510
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.927 [{"name": "Walt Disney Pictures", "id": 2}] 2012-03-07 284139100 132.000 [{"iso_639_1": "en", "name": "English"}] Lost in our world, found in another. 6.100 2124 [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de... 6.098
In [143]:
# Sort data according to weighted averages
movie_sorted_ranking = movies_cleaned_df.sort_values('weighted_average',ascending = False)

# Get only the necessary columns
movie_sorted_ranking[['original_title', 'vote_count', 'vote_average', 'weighted_average', 'popularity']].head(20)
Out[143]:
original_title vote_count vote_average weighted_average popularity
1881 The Shawshank Redemption 8205 8.500 8.341 136.748
3337 The Godfather 5893 8.400 8.193 143.660
662 Fight Club 9413 8.300 8.172 146.757
3232 Pulp Fiction 8428 8.300 8.158 121.463
65 The Dark Knight 12002 8.200 8.103 187.323
809 Forrest Gump 7927 8.200 8.056 138.133
1818 Schindler's List 4329 8.300 8.039 104.469
3865 Whiplash 4254 8.300 8.035 192.529
96 Inception 13752 8.100 8.019 167.584
1990 The Empire Strikes Back 5879 8.200 8.010 78.518
2294 千と千尋の神隠し 3840 8.300 8.010 118.969
95 Interstellar 10867 8.100 7.998 724.248
2731 The Godfather: Part II 3338 8.300 7.973 105.793
329 The Lord of the Rings: The Return of the King 8064 8.100 7.965 123.630
2912 Star Wars 6624 8.100 7.938 126.394
690 The Green Mile 4048 8.200 7.935 103.698
1553 Se7en 5765 8.100 7.916 79.580
262 The Lord of the Rings: The Fellowship of the Ring 8705 8.000 7.881 138.050
1847 GoodFellas 3128 8.200 7.870 63.654
2091 The Silence of the Lambs 4443 8.100 7.868 18.175

Data Visualization

In [144]:
# Import library
import matplotlib.pyplot as plt
import seaborn as sns

# Sort the dataset by weighted average, from highest to lowest
weight_average = movie_sorted_ranking.sort_values('weighted_average', ascending = False)

# Create barplot
plt.figure(figsize = (12,6))
axis1 = sns.barplot(x = weight_average['weighted_average'].head(10), y = weight_average['original_title'].head(10), data = weight_average)
plt.title('Best Movies by average votes', weight = 'bold')
plt.xlabel('Weighted Average Score', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.xlim(4, 10)
plt.show()
In [149]:
# Sorted dataset according to popularity
popularity = movie_sorted_ranking.sort_values('popularity',ascending = False)

# Check data
popularity.head()
Out[149]:
budget genres id keywords original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages tagline vote_average vote_count cast crew weighted_average
546 74000000 [{"id": 10751, "name": "Family"}, {"id": 16, "... 211672 [{"id": 3487, "name": "assistant"}, {"id": 179... en Minions Minions Stuart, Kevin and Bob are recruited by... 875.581 [{"name": "Universal Pictures", "id": 33}, {"n... 2015-06-17 1156730962 91.000 [{"iso_639_1": "en", "name": "English"}] Before Gru, they had a history of bad bosses 6.400 4571 [{"cast_id": 22, "character": "Scarlet Overkil... [{"credit_id": "5431b2b10e0a2656e20026c7", "de... 6.365
95 165000000 [{"id": 12, "name": "Adventure"}, {"id": 18, "... 157336 [{"id": 83, "name": "saving the world"}, {"id"... en Interstellar Interstellar chronicles the adventures of a gr... 724.248 [{"name": "Paramount Pictures", "id": 4}, {"na... 2014-11-05 675120017 169.000 [{"iso_639_1": "en", "name": "English"}] Mankind was born on Earth. It was never meant ... 8.100 10867 [{"cast_id": 9, "character": "Joseph Cooper", ... [{"credit_id": "52fe4bbf9251416c910e4801", "de... 7.998
788 58000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 293660 [{"id": 2095, "name": "anti hero"}, {"id": 307... en Deadpool Deadpool tells the origin story of former Spec... 514.570 [{"name": "Twentieth Century Fox Film Corporat... 2016-02-09 783112979 108.000 [{"iso_639_1": "en", "name": "English"}] Witness the beginning of a happy ending 7.400 10995 [{"cast_id": 99, "character": "Wade Wilson / D... [{"credit_id": "56c986b2925141172f0068b6", "de... 7.334
94 170000000 [{"id": 28, "name": "Action"}, {"id": 878, "na... 118340 [{"id": 8828, "name": "marvel comic"}, {"id": ... en Guardians of the Galaxy Light years from Earth, 26 years after being a... 481.099 [{"name": "Marvel Studios", "id": 420}, {"name... 2014-07-30 773328629 121.000 [{"iso_639_1": "en", "name": "English"}] All heroes start somewhere. 7.900 9742 [{"cast_id": 1, "character": "Peter Quill / St... [{"credit_id": "538ce329c3a3687155003358", "de... 7.798
127 150000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 76341 [{"id": 2964, "name": "future"}, {"id": 3713, ... en Mad Max: Fury Road An apocalyptic story set in the furthest reach... 434.279 [{"name": "Village Roadshow Pictures", "id": 7... 2015-05-13 378858340 120.000 [{"iso_639_1": "en", "name": "English"}] What a Lovely Day. 7.200 9427 [{"cast_id": 2, "character": "Max Rockatansky"... [{"credit_id": "577da370c3a36817f8003838", "de... 7.136
In [150]:
# Create barplot
plt.figure(figsize = (12,6))
ax = sns.barplot(x = popularity['popularity'].head(10), y = popularity['original_title'].head(10), data = popularity)
plt.title('Most Popular by Votes', weight = 'bold')
plt.xlabel('Score of Popularity', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.show()

Data Scaling

In [151]:
# Recommendation based on the scaled weighted average and the scaled popularity score (each weighted 50%)
# Import library
from sklearn.preprocessing import MinMaxScaler

# Scale the data
scaling = MinMaxScaler()
movie_scaled_df = scaling.fit_transform(movies_cleaned_df[['weighted_average','popularity']])

# Create dataframe
movie_normalized_df = pd.DataFrame(movie_scaled_df,columns = ['weighted_average','popularity'])

# Check data
movie_normalized_df.head()
Out[151]:
weighted_average popularity
0 0.674 0.172
1 0.581 0.159
2 0.436 0.123
3 0.773 0.128
4 0.388 0.050
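
With its default feature range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below reproduces that arithmetic by hand with NumPy on a few made-up popularity values, so the scaled numbers above are easier to interpret. The helper min_max_scale is a hypothetical name, and it assumes the column is not constant (max greater than min).

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # Assumes hi > lo; a constant column would divide by zero here
    return (values - lo) / (hi - lo)

# Illustrative popularity values only
popularity_toy = [0.0, 43.9, 150.4, 875.6]
scaled_toy = min_max_scale(popularity_toy)

print(scaled_toy)
```

Because the scaling is relative to the maximum, a single very popular title compresses every other title towards 0, which is why most normalized popularity values in this notebook are small.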
In [152]:
# Insert new columns
movies_cleaned_df[['normalized_weight_average','normalized_popularity']]= movie_normalized_df
In [153]:
# Check data
movies_cleaned_df.head()
Out[153]:
budget genres id keywords original_language original_title overview popularity production_companies release_date ... runtime spoken_languages tagline vote_average vote_count cast crew weighted_average normalized_weight_average normalized_popularity
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.438 [{"name": "Ingenious Film Partners", "id": 289... 2009-12-10 ... 162.000 [{"iso_639_1": "en", "name": "English"}, {"iso... Enter the World of Pandora. 7.200 11800 [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de... 7.148 0.674 0.172
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.083 [{"name": "Walt Disney Pictures", "id": 2}, {"... 2007-05-19 ... 169.000 [{"iso_639_1": "en", "name": "English"}] At the end of the world, the adventure begins. 6.900 4500 [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de... 6.808 0.581 0.159
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.377 [{"name": "Columbia Pictures", "id": 5}, {"nam... 2015-10-26 ... 148.000 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... A Plan No One Escapes 6.300 4466 [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de... 6.276 0.436 0.123
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.313 [{"name": "Legendary Pictures", "id": 923}, {"... 2012-07-16 ... 165.000 [{"iso_639_1": "en", "name": "English"}] The Legend Ends 7.600 9106 [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de... 7.510 0.773 0.128
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.927 [{"name": "Walt Disney Pictures", "id": 2}] 2012-03-07 ... 132.000 [{"iso_639_1": "en", "name": "English"}] Lost in our world, found in another. 6.100 2124 [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de... 6.098 0.388 0.050

5 rows × 21 columns

Apply Weighted Average Formula

In [154]:
# Create new column
movies_cleaned_df['score'] = (movies_cleaned_df['normalized_weight_average'] * 0.5) + (movies_cleaned_df['normalized_popularity'] * 0.5)

# Sort data according to score
movies_scored_df = movies_cleaned_df.sort_values(['score'], ascending = False)

# Check data
movies_scored_df[['original_title', 'normalized_weight_average', 'normalized_popularity', 'score']].head(20)
Out[154]:
original_title normalized_weight_average normalized_popularity score
95 Interstellar 0.906 0.827 0.867
546 Minions 0.461 1.000 0.730
94 Guardians of the Galaxy 0.852 0.549 0.701
788 Deadpool 0.725 0.588 0.656
127 Mad Max: Fury Road 0.671 0.496 0.583
1881 The Shawshank Redemption 1.000 0.156 0.578
65 The Dark Knight 0.935 0.214 0.574
3865 Whiplash 0.916 0.220 0.568
3337 The Godfather 0.960 0.164 0.562
662 Fight Club 0.954 0.168 0.561
96 Inception 0.912 0.191 0.552
3232 Pulp Fiction 0.950 0.139 0.544
809 Forrest Gump 0.922 0.158 0.540
199 Pirates of the Caribbean: The Curse of the Bla... 0.741 0.311 0.526
2294 千と千尋の神隠し 0.910 0.136 0.523
88 Big Hero 6 0.812 0.233 0.522
329 The Lord of the Rings: The Return of the King 0.897 0.141 0.519
1818 Schindler's List 0.918 0.119 0.518
2912 Star Wars 0.890 0.144 0.517
262 The Lord of the Rings: The Fellowship of the Ring 0.874 0.158 0.516
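
The 50/50 split is one choice among many. As a sketch of how the blend responds to different weights, the function below (hybrid_ranking is a hypothetical helper, not part of the notebook) recomputes the score for three titles using the normalized values shown in the table above.

```python
import pandas as pd

def hybrid_ranking(df, w_rating=0.5, w_popularity=0.5):
    """Blend the two normalized columns and return rows sorted by score."""
    scored = df.copy()
    scored["score"] = (
        w_rating * scored["normalized_weight_average"]
        + w_popularity * scored["normalized_popularity"]
    )
    return scored.sort_values("score", ascending=False)

# Normalized values copied from the output table above
sample = pd.DataFrame({
    "original_title": ["Interstellar", "Minions", "The Shawshank Redemption"],
    "normalized_weight_average": [0.906, 0.461, 1.000],
    "normalized_popularity": [0.827, 1.000, 0.156],
})

equal_blend = hybrid_ranking(sample)              # same weights as the notebook
rating_heavy = hybrid_ranking(sample, 0.8, 0.2)   # favour rating over popularity

print(equal_blend[["original_title", "score"]])
print(rating_heavy[["original_title", "score"]])
```

With equal weights Minions ranks above The Shawshank Redemption; with 80% of the weight on the rating the two swap places, while Interstellar stays first in both cases.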

Result

In [157]:
# Sorted data
scored_df = movies_cleaned_df.sort_values('score', ascending = False)

# Create barplot
plt.figure(figsize=(16,6))
ax = sns.barplot(x=scored_df['score'].head(10), y = scored_df['original_title'].head(10), data=scored_df, palette='deep')
plt.title('Best Rated & Most Popular Blend', weight = 'bold')
plt.xlabel('Score', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.show()

Content-Based Filtering

Import Libraries And Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
In [2]:
# Import datasets
movies_df = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")

Data Exploration

In [3]:
# Check movies data
movies_df.head()
Out[3]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124
In [5]:
# Check credits data
credits.head()
Out[5]:
movie_id title cast crew
0 19995 Avatar [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 285 Pirates of the Caribbean: At World's End [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 206647 Spectre [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 49026 The Dark Knight Rises [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 49529 John Carter [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
In [6]:
# Check datasets dimension
print("Credits:", credits.shape)
print("Movies:", movies_df.shape)
Credits: (4803, 4)
Movies: (4803, 20)

Data Preprocessing

In [7]:
# Rename the key column so that both frames share the name "id"
credits_column_renamed = credits.rename(columns = {"movie_id": "id"})

# Merge column data
movies_df_merge = movies_df.merge(credits_column_renamed, on = 'id')

# Check data
movies_df_merge.head()
Out[7]:
budget genres homepage id keywords original_language original_title overview popularity production_companies ... runtime spoken_languages status tagline title_x vote_average vote_count title_y cast crew
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... ... 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800 Avatar [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... ... 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500 Pirates of the Caribbean: At World's End [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... ... 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466 Spectre [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... ... 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106 The Dark Knight Rises [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] ... 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124 John Carter [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...

5 rows × 23 columns

In [8]:
# Drop unnecessary columns
movies_cleaned_df = movies_df_merge.drop(columns = ['homepage', 'title_x', 'title_y', 'status','production_countries'])

# Check data
movies_cleaned_df.head()
Out[8]:
budget genres id keywords original_language original_title overview popularity production_companies release_date revenue runtime spoken_languages tagline vote_average vote_count cast crew
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Enter the World of Pandora. 7.2 11800 [{"cast_id": 242, "character": "Jake Sully", "... [{"credit_id": "52fe48009251416c750aca23", "de...
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] At the end of the world, the adventure begins. 6.9 4500 [{"cast_id": 4, "character": "Captain Jack Spa... [{"credit_id": "52fe4232c3a36847f800b579", "de...
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... A Plan No One Escapes 6.3 4466 [{"cast_id": 1, "character": "James Bond", "cr... [{"credit_id": "54805967c3a36829b5002c41", "de...
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] The Legend Ends 7.6 9106 [{"cast_id": 2, "character": "Bruce Wayne / Ba... [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Lost in our world, found in another. 6.1 2124 [{"cast_id": 5, "character": "John Carter", "c... [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
In [9]:
# Check data info
movies_cleaned_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   release_date          4802 non-null   object 
 10  revenue               4803 non-null   int64  
 11  runtime               4801 non-null   float64
 12  spoken_languages      4803 non-null   object 
 13  tagline               3959 non-null   object 
 14  vote_average          4803 non-null   float64
 15  vote_count            4803 non-null   int64  
 16  cast                  4803 non-null   object 
 17  crew                  4803 non-null   object 
dtypes: float64(3), int64(4), object(11)
memory usage: 712.9+ KB
In [10]:
# Check data
movies_cleaned_df.head(1)['overview']
Out[10]:
0    In the 22nd century, a paraplegic Marine is di...
Name: overview, dtype: object

Apply TF-IDF Vectorizer

In [11]:
# Import library
from sklearn.feature_extraction.text import TfidfVectorizer

# Apply vectorizer
tfv = TfidfVectorizer(min_df = 3,  max_features = None, 
            strip_accents = 'unicode', analyzer = 'word', token_pattern = r'\w{1,}', # Tokens are runs of word characters, so punctuation such as !@#$%^&*() is dropped
            ngram_range = (1, 3), # Use unigrams, bigrams and trigrams
            stop_words = 'english') # Remove common English stop words

# Fill NaNs with empty string
movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')
In [12]:
# Fit the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])

# Sparse matrix
tfv_matrix
Out[12]:
<4803x10417 sparse matrix of type '<class 'numpy.float64'>'
	with 127220 stored elements in Compressed Sparse Row format>
In [13]:
# Check data dimension
tfv_matrix.shape
Out[13]:
(4803, 10417)
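
To make the numbers in this matrix less opaque, here is a minimal pure-Python sketch of TF-IDF weighting on three made-up overviews. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is how scikit-learn's default settings are understood to behave. The function `tfidf_rows` and the toy overviews are hypothetical, and the sketch uses plain whitespace tokenisation with no stop words, n-grams or `min_df` filtering.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy overviews, not taken from the TMDB data
toy_overviews = [
    "marine explores alien moon",
    "pirate captain sails ocean",
    "spy chases pirate",
]
rows = tfidf_rows(toy_overviews)
print(rows[1])
```

In the second overview, "pirate" also occurs in the third one, so it receives a lower weight than "captain", which is unique to that overview.
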
In [14]:
# Import library
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
In [16]:
# Check data relation score
sig[0]
Out[16]:
array([0.76163447, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])
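
Almost every score above equals 0.76159416, which is tanh(1). The sketch below suggests why, assuming `sigmoid_kernel` computes tanh(gamma * <x, y> + coef0) with its defaults gamma = 1 / n_features and coef0 = 1. The helper `sigmoid_kernel_value` and the two-element vectors are hypothetical stand-ins for L2-normalised TF-IDF rows.

```python
import math

def sigmoid_kernel_value(x, y, gamma, coef0=1.0):
    """tanh(gamma * <x, y> + coef0) for two dense vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)

n_features = 10417          # width of tfv_matrix in the notebook
gamma = 1.0 / n_features    # assumed default when gamma is not given

# Two unit vectors with no terms in common: dot product 0
unrelated = sigmoid_kernel_value([1.0, 0.0], [0.0, 1.0], gamma)
# A unit vector compared with itself: dot product 1
identical = sigmoid_kernel_value([1.0, 0.0], [1.0, 0.0], gamma)

print(unrelated)  # tanh(1), about 0.76159416
print(identical)  # only slightly larger, because gamma is tiny
```

Because gamma is so small, all scores are squeezed into a narrow band just above tanh(1). The ordering of the scores is still meaningful, which is all the recommender relies on.
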
In [17]:
# Map each movie title to its row index (drop_duplicates() compares the index values here, not the titles)
indices = pd.Series(movies_cleaned_df.index, index = movies_cleaned_df['original_title']).drop_duplicates()
In [18]:
# Check data
indices
Out[18]:
original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64
In [27]:
# Check data
indices['Newlyweds']
Out[27]:
4799
In [27]:
# Check data relation score
sig[4799]
Out[27]:
array([0.76159416, 0.76159416, 0.76159438, ..., 0.76159432, 0.76159416,
       0.76159478])
In [24]:
# Show the first 10 indices and their similarity scores
list(enumerate(sig[indices['Newlyweds']][:10]))
Out[24]:
[(0, 0.7615941559557649),
 (1, 0.7615941559557649),
 (2, 0.7615943791623508),
 (3, 0.7615945564232902),
 (4, 0.7615945779342557),
 (5, 0.7615943267971559),
 (6, 0.7615948190414071),
 (7, 0.761594346971664),
 (8, 0.7615943903358866),
 (9, 0.761594688255891)]
In [25]:
# Sort the first 10 indices by similarity score in descending order
sorted(list(enumerate(sig[indices['Newlyweds']][:10])), key = lambda x: x[1], reverse = True)
Out[25]:
[(6, 0.7615948190414071),
 (9, 0.761594688255891),
 (4, 0.7615945779342557),
 (3, 0.7615945564232902),
 (8, 0.7615943903358866),
 (2, 0.7615943791623508),
 (7, 0.761594346971664),
 (5, 0.7615943267971559),
 (0, 0.7615941559557649),
 (1, 0.7615941559557649)]
In [29]:
# Recommender function
def give_rec(title, sig = sig):
    
    # Get the index corresponding to original_title
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies by score, highest first
    sig_scores = sorted(sig_scores, key = lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies (entry 0 is skipped, assuming it is the movie itself)
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return movies_cleaned_df['original_title'].iloc[movie_indices]
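
The function above drops the first sorted entry on the assumption that the query movie always ranks highest against itself. A minimal sketch of an alternative that excludes the query by index instead is shown here. The name `top_n_similar` and the toy scores are hypothetical and not part of the notebook.

```python
import heapq

def top_n_similar(scores, query_idx, n=10):
    """Return (index, score) pairs for the n highest scores, excluding query_idx."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_idx)
    # nlargest avoids sorting the full list when only the top n are needed
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical similarity row for the movie at index 3
toy_scores = [0.7616, 0.7620, 0.7616, 0.7631, 0.7618]
top = top_n_similar(toy_scores, query_idx=3, n=2)
print(top)
```

Excluding by index keeps the result correct even if another movie happens to tie with the query for the highest score.
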

Result

In [35]:
# Test content-based recommendation system with the seminal film The Dark Knight Rises
give_rec('The Dark Knight Rises')
Out[35]:
299                              Batman Forever
65                              The Dark Knight
1359                                     Batman
428                              Batman Returns
2507                                  Slow Burn
119                               Batman Begins
1181                                        JFK
9            Batman v Superman: Dawn of Justice
3854    Batman: The Dark Knight Returns, Part 2
210                              Batman & Robin
Name: original_title, dtype: object

Book Recommendation System

Collaborative Filtering

Import Libraries And Data

In [37]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [38]:
# Import data (on_bad_lines = 'warn' needs pandas >= 1.3; it replaces error_bad_lines = False, which newer pandas no longer accepts)
books = pd.read_csv('BX-Books.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
users = pd.read_csv('BX-Users.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
ratings = pd.read_csv('BX-Book-Ratings.csv', sep = ';', on_bad_lines = 'warn', encoding = "latin-1")
b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
C:\Users\Joseff\miniconda3\envs\joseff\lib\site-packages\IPython\core\interactiveshell.py:3071: DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
In [52]:
# Books columns
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

# Users columns
users.columns = ['userID', 'Location', 'Age']

# Ratings columns
ratings.columns = ['userID', 'ISBN', 'bookRating']
In [56]:
# Check dataset dimensions
books.shape, users.shape, ratings.shape
Out[56]:
((271360, 8), (278858, 3), (1149780, 3))

Data Visualization

In [59]:
# Create bar plot for ratings distribution
plt.rc("font", size=15)
ratings['bookRating'].value_counts(sort = False).plot(kind ='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()
In [58]:
# Create histogram for age distribution
users['Age'].hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('system2.png', bbox_inches='tight')
plt.show()

Data Preprocessing

In [73]:
# Count ratings per user
counts1 = ratings['userID'].value_counts()

# Count how often each rating value (0-10) occurs
counts = ratings['bookRating'].value_counts()

# To ensure statistical significance, keep only users with at least 200 ratings.
# Note: the second filter acts on rating values, not on books. It keeps rating values that occur
# at least 100 times; rarely rated books are removed later via popularity_threshold.
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
ratings = ratings[ratings['bookRating'].isin(counts[counts >= 100].index)]
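
The comment in this cell refers to excluding rarely rated books. Doing that means counting rows per ISBN rather than per rating value. A minimal sketch of such a per-key count filter on toy data follows. The helper `filter_by_min_count` and the toy rows are hypothetical, and the threshold is lowered to 3 so the effect is visible.

```python
from collections import Counter

def filter_by_min_count(rows, key, min_count):
    """Keep only rows whose value for `key` occurs at least min_count times."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] >= min_count]

# Hypothetical ratings: book "A" has three ratings, book "B" only one
toy_ratings = [
    {"userID": 1, "ISBN": "A", "bookRating": 8},
    {"userID": 2, "ISBN": "A", "bookRating": 0},
    {"userID": 3, "ISBN": "A", "bookRating": 5},
    {"userID": 1, "ISBN": "B", "bookRating": 10},
]
popular = filter_by_min_count(toy_ratings, key="ISBN", min_count=3)
print(len(popular))
```

In pandas, the same idea would use `ratings['ISBN'].value_counts()` as the lookup. Applying it would change the row counts shown in the rest of this notebook.
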
In [74]:
# k-nearest neighbors (KNN) finds the items closest to a query item based on the ratings they share;
# predictions can then be made from the top-k nearest neighbors.
# Here the ratings are arranged in a matrix with one row for each item (book)
# and one column for each user.

# Merge data based on ISBN
combine_book_rating = pd.merge(ratings, books, on = 'ISBN')

# Columns to drop
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']

# Drop columns
combine_book_rating = combine_book_rating.drop(columns, axis = 1)

# Check data
combine_book_rating.head()
Out[74]:
userID ISBN bookRating bookTitle
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta...
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta...
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta...
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta...
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta...
In [75]:
# Drop missing values
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])

# Group by book titles and create a new column for total rating count.
book_ratingCount = (combine_book_rating.
                    groupby(by = ['bookTitle'])['bookRating'].
                    count().
                    reset_index().
                    rename(columns = {'bookRating': 'totalRatingCount'})
                    [['bookTitle', 'totalRatingCount']]
                    )

# Check data
book_ratingCount.head()
Out[75]:
bookTitle totalRatingCount
0 A Light in the Storm: The Civil War Diary of ... 2
1 Always Have Popsicles 1
2 Apple Magic (The Collector's series) 1
3 Beyond IBM: Leadership Marketing and Finance ... 1
4 Clifford Visita El Hospital (Clifford El Gran... 1
In [76]:
# Combine the rating data with the total rating count data
# This gives exactly what we need to find out which books are popular and filter out lesser-known books
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, left_on = 'bookTitle', right_on = 'bookTitle', how = 'left')

# Check data
rating_with_totalRatingCount.head()
Out[76]:
userID ISBN bookRating bookTitle totalRatingCount
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta... 82
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82
In [77]:
# Set decimal
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Check data statistics
print(book_ratingCount['totalRatingCount'].describe())
count   160576.000
mean         3.044
std          7.428
min          1.000
25%          1.000
50%          1.000
75%          2.000
max        365.000
Name: totalRatingCount, dtype: float64
In [78]:
# The median book has been rated only once. Let’s look at the top of the distribution
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))
0.900    5.000
0.910    6.000
0.920    7.000
0.930    7.000
0.940    8.000
0.950   10.000
0.960   11.000
0.970   14.000
0.980   19.000
0.990   31.000
Name: totalRatingCount, dtype: float64
In [79]:
# Set threshold
popularity_threshold = 50

# Satisfy threshold
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')

# Check data
rating_popular_book.head()
Out[79]:
userID ISBN bookRating bookTitle totalRatingCount
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82
2 11676 002542730X 6 Politically Correct Bedtime Stories: Modern Ta... 82
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82
In [80]:
# Check data dimension
rating_popular_book.shape
Out[80]:
(62149, 5)
In [81]:
# Merge column data
combined = rating_popular_book.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')

# Filter to users in US and Canada only
us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada")]

# Drop column
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)

# Check data
us_canada_user_rating.head()
Out[81]:
userID ISBN bookRating bookTitle totalRatingCount Location
0 277427 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82 gilbert, arizona, usa
1 3363 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82 knoxville, tennessee, usa
3 12538 002542730X 10 Politically Correct Bedtime Stories: Modern Ta... 82 byron, minnesota, usa
4 13552 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82 cordova, tennessee, usa
5 16795 002542730X 0 Politically Correct Bedtime Stories: Modern Ta... 82 mechanicsville, maryland, usa
In [83]:
# Import library
from scipy.sparse import csr_matrix

# Drop duplicates
us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userID', 'bookTitle'])

# Pivot dataframe
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index = 'bookTitle', columns = 'userID', values = 'bookRating').fillna(0)

# Create sparse matrix
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)
In [84]:
# Check dataframe
us_canada_user_rating_pivot
Out[84]:
userID 254 2276 2766 2977 3363 4017 4385 6242 6251 6323 ... 271448 271705 273979 274061 274308 274808 275970 277427 277639 278418
bookTitle
1984 9.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 10.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1st to Die: A Novel 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2nd Chance 0.000 10.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
4 Blondes 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
84 Charing Cross Road 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 10.000 0.000 0.000 0.000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Year of Wonders 0.000 0.000 0.000 7.000 0.000 0.000 0.000 7.000 0.000 0.000 ... 0.000 0.000 9.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
You Belong To Me 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Zoya 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
\O\" Is for Outlaw" 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

746 rows × 734 columns

Apply K-NN (Cosine Similarity)

In [85]:
# Import library
from sklearn.neighbors import NearestNeighbors

# Cosine similarity
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')

# Fit the model
model_knn.fit(us_canada_user_rating_matrix)
Out[85]:
NearestNeighbors(algorithm='brute', metric='cosine')
In [130]:
# Get a random sample
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])

# Show random sample
print('Index:', query_index, '=> Book:', us_canada_user_rating_pivot.index[query_index])

# Find the 6 nearest neighbours: the query book itself plus 5 others
distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)
Index: 353 => Book: Mother of Pearl

Result

In [131]:
# Print the query book followed by its 5 nearest neighbours
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
Recommendations for Mother of Pearl:

1: River, Cross My Heart, with distance of 0.6940202225362795:
2: Song of Solomon (Oprah's Book Club (Paperback)), with distance of 0.7215526286784837:
3: Watermelon, with distance of 0.7599137034779938:
4: We Were the Mulvaneys, with distance of 0.7752395067391988:
5: A Patchwork Planet, with distance of 0.7926458891181571:
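
The distances printed above are cosine distances, that is, 1 minus the cosine similarity between two rows of the book-by-user matrix. The following pure-Python sketch shows the calculation on a toy matrix. The names `cosine_distance` and `nearest_books` and the ratings are hypothetical, and treating an all-zero row as having distance 1 is an assumption of this sketch.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equally long rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an unrated book shares nothing with any other
    return 1.0 - dot / (norm_u * norm_v)

def nearest_books(matrix, query_title, k=2):
    """Return the k (distance, title) pairs closest to query_title."""
    query = matrix[query_title]
    others = [(cosine_distance(query, row), title)
              for title, row in matrix.items() if title != query_title]
    return sorted(others)[:k]

# Hypothetical matrix: one row per book, one column per user, 0 = no rating
toy_matrix = {
    "Book A": [10, 0, 8, 0],
    "Book B": [9, 0, 7, 0],
    "Book C": [0, 6, 0, 9],
    "Book D": [5, 5, 0, 0],
}
neighbours = nearest_books(toy_matrix, "Book A", k=2)
print(neighbours)
```

Books rated similarly by the same users end up with a distance near 0, while books with no raters in common have a distance of 1. This is why the real distances above, all around 0.7 to 0.8, indicate only modest overlap.
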