Over the last few decades, with the rise of YouTube, Amazon, Netflix and many other such web services, recommender systems have taken an increasingly important place in our lives. From e-commerce (suggesting articles that might interest a buyer) to online advertising (showing users content that matches their preferences), recommender systems are today unavoidable in our daily online journeys.
In very general terms, recommender systems are algorithms that suggest relevant items to users, where an item may be a movie to watch, a text to read, a product to buy or anything else depending on the industry.
Recommender systems are critical in some industries: an effective one can generate a huge amount of income and can be a way to stand out significantly from competitors. As proof of their importance, Netflix organised a challenge a few years ago (the "Netflix Prize") in which the goal was to produce a recommender system that performed better than its own algorithm, with a prize of one million dollars to win.
In this project, different types of recommender systems are implemented to produce good movie and book recommendations for users. For each of them, we describe the theoretical idea behind it and observe how it works in practice.
Dataset:
# Import libraries
import pandas as pd
import numpy as np
# Import datasets
movies_df = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")
# Check movies data
movies_df.head()
# Check credits data
credits.head()
# Check dataset dimensions
print("Credits:", credits.shape)
print("Movies:", movies_df.shape)
# Rename column
credits_column_renamed = credits.rename(index = str, columns = {"movie_id": "id"})
# Merge the credits data into the movies dataframe on id
movies_df_merge = movies_df.merge(credits_column_renamed, on = 'id')
# Check data
movies_df_merge.head()
# Drop unnecessary columns
movies_cleaned_df = movies_df_merge.drop(columns = ['homepage', 'title_x', 'title_y', 'status','production_countries'])
# Check data
movies_cleaned_df.head()
# Check data info
movies_cleaned_df.info()
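# The ranking below uses the IMDB weighted-rating formula:
#   Weighted Rating (WR) = (v / (v + m)) * R + (m / (v + m)) * C
# where v is the movie's vote count, m is the minimum vote count required to be listed
# (taken here as the 70th percentile of vote counts), R is the movie's average rating
# and C is the mean rating across all movies.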
# Calculate all the components based on the above formula
v = movies_cleaned_df['vote_count']
R = movies_cleaned_df['vote_average']
C = movies_cleaned_df['vote_average'].mean()
m = movies_cleaned_df['vote_count'].quantile(0.70)
# Add column of weighted average
movies_cleaned_df['weighted_average'] = ((R * v)+ (C * m)) / (v + m)
# Check data
movies_cleaned_df.head()
# Sort data according to weighted averages
movie_sorted_ranking = movies_cleaned_df.sort_values('weighted_average',ascending = False)
# Get only the necessary columns
movie_sorted_ranking[['original_title', 'vote_count', 'vote_average', 'weighted_average', 'popularity']].head(20)
# Import library
import matplotlib.pyplot as plt
import seaborn as sns
# Sort the dataset by weighted average, from highest to lowest
weight_average = movie_sorted_ranking.sort_values('weighted_average', ascending = False)
# Create barplot
plt.figure(figsize = (12,6))
axis1 = sns.barplot(x = weight_average['weighted_average'].head(10), y = weight_average['original_title'].head(10), data = weight_average)
plt.title('Best Movies by average votes', weight = 'bold')
plt.xlabel('Weighted Average Score', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.xlim(4, 10)
plt.show()
# Sorted dataset according to popularity
popularity = movie_sorted_ranking.sort_values('popularity',ascending = False)
# Check data
popularity.head()
# Create barplot
plt.figure(figsize = (12,6))
ax = sns.barplot(x = popularity['popularity'].head(10), y = popularity['original_title'].head(10), data = popularity)
plt.title('Most Popular by Votes', weight = 'bold')
plt.xlabel('Score of Popularity', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.show()
# Recommendation based on the scaled weighted average and popularity score (each given 50% weight)
# Import library
from sklearn.preprocessing import MinMaxScaler
# Scale the data
scaling = MinMaxScaler()
movie_scaled_df = scaling.fit_transform(movies_cleaned_df[['weighted_average','popularity']])
# Create dataframe
movie_normalized_df = pd.DataFrame(movie_scaled_df,columns = ['weighted_average','popularity'])
# Check data
movie_normalized_df.head()
# Insert new columns
movies_cleaned_df[['normalized_weight_average','normalized_popularity']]= movie_normalized_df
# Check data
movies_cleaned_df.head()
# Create new column
movies_cleaned_df['score'] = (movies_cleaned_df['normalized_weight_average'] * 0.5) + (movies_cleaned_df['normalized_popularity'] * 0.5)
# Sort data according to score
movies_scored_df = movies_cleaned_df.sort_values(['score'], ascending = False)
# Check data
movies_scored_df[['original_title', 'normalized_weight_average', 'normalized_popularity', 'score']].head(20)
# Sorted data
scored_df = movies_cleaned_df.sort_values('score', ascending = False)
# Create barplot
plt.figure(figsize=(16,6))
ax = sns.barplot(x=scored_df['score'].head(10), y = scored_df['original_title'].head(10), data=scored_df, palette='deep')
plt.title('Best Rated & Most Popular Blend', weight = 'bold')
plt.xlabel('Score', weight = 'bold')
plt.ylabel('Movie Title', weight = 'bold')
plt.show()
# Content-based recommender: reuse the cleaned TMDB dataset (movies_cleaned_df) prepared above
# Check data
movies_cleaned_df.head(1)['overview']
# Import library
from sklearn.feature_extraction.text import TfidfVectorizer
# Apply vectorizer
tfv = TfidfVectorizer(min_df = 3, max_features = None,
                      strip_accents = 'unicode', analyzer = 'word',
                      token_pattern = r'\w{1,}', # Tokenize on word characters, dropping punctuation such as !@#$%^&*()
                      ngram_range = (1, 3),      # Use combinations of 1 to 3 words
                      stop_words = 'english')    # Remove common English stop words
# Fill NaNs with empty string
movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')
# Fit the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
# Sparse matrix
tfv_matrix
# Check data dimension
tfv_matrix.shape
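# A quick toy illustration (made-up example sentences, not from the dataset) of what the
# TF-IDF matrix holds: each text becomes a sparse row of term weights over the learned
# vocabulary. min_df is lowered to 1 here only so the two toy texts produce features.
toy_tfv = TfidfVectorizer(min_df = 1, stop_words = 'english')
toy_matrix = toy_tfv.fit_transform(["a dark knight rises", "the dark knight returns to the city"])
print(toy_tfv.get_feature_names_out())  # vocabulary learned from the toy texts (scikit-learn >= 1.0)
print(toy_matrix.toarray())             # TF-IDF weight of each term in each text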
# Import library
from sklearn.metrics.pairwise import sigmoid_kernel
# Compute the sigmoid kernel between all pairs of overview vectors: sig[i, j] = tanh(gamma * <x_i, x_j> + coef0), used here as a similarity score between movies i and j
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
# Check data relation score
sig[0]
# Reverse mapping of indices and movie titles
indices = pd.Series(movies_cleaned_df.index, index = movies_cleaned_df['original_title']).drop_duplicates()
# Check data
indices
# Check data
indices['Newlyweds']
# Check data relation score
sig[4799]
# Show the first 10 indices and their relation scores for 'Newlyweds'
list(enumerate(sig[indices['Newlyweds']][:10]))
# Sort those 10 scores in descending order
sorted(list(enumerate(sig[indices['Newlyweds']][:10])), key = lambda x: x[1], reverse = True)
# Recommender function
def give_rec(title, sig = sig):
    # Get the index corresponding to original_title
    idx = indices[title]
    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))
    # Sort the movies by similarity score
    sig_scores = sorted(sig_scores, key = lambda x: x[1], reverse = True)
    # Scores of the 10 most similar movies (position 0 is the movie itself, so it is skipped)
    sig_scores = sig_scores[1:11]
    # Movie indices
    movie_indices = [i[0] for i in sig_scores]
    # Top 10 most similar movies
    return movies_cleaned_df['original_title'].iloc[movie_indices]
# Test content-based recommendation system with the seminal film The Dark Knight Rises
give_rec('The Dark Knight Rises')
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import data (error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines = 'skip' instead)
books = pd.read_csv('BX-Books.csv', sep = ';', error_bad_lines = False, encoding = "latin-1")
users = pd.read_csv('BX-Users.csv', sep = ';', error_bad_lines = False, encoding = "latin-1")
ratings = pd.read_csv('BX-Book-Ratings.csv', sep = ';', error_bad_lines = False, encoding = "latin-1")
# Books columns
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
# Users columns
users.columns = ['userID', 'Location', 'Age']
# Ratings columns
ratings.columns = ['userID', 'ISBN', 'bookRating']
# Check dataset dimensions
books.shape, users.shape, ratings.shape
# Create bar plot for ratings distribution
plt.rc("font", size=15)
ratings['bookRating'].value_counts(sort = False).plot(kind ='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()
# Create bar plot for age distribution
users['Age'].hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('system2.png', bbox_inches='tight')
plt.show()
# Count ratings per user
counts1 = ratings['userID'].value_counts()
# Count ratings per book
counts = ratings['ISBN'].value_counts()
# To keep the analysis statistically meaningful, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded.
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
ratings = ratings[ratings['ISBN'].isin(counts[counts >= 100].index)]
# kNN is used here to find books that are similar to one another based on common user ratings.
# Ratings are arranged in a matrix with one row per item (book) and one column per user,
# and a book's nearest neighbours are the books whose rating vectors are closest under the chosen metric.
# A rating prediction could be made from the average rating of the top-k neighbours; below, the neighbours themselves are shown as recommendations.
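# A minimal sketch (toy numbers, not from the dataset) of the item-based kNN idea used below:
# each book is represented by its vector of user ratings, and books whose rating vectors
# point in similar directions have a small cosine distance.
from sklearn.neighbors import NearestNeighbors
toy_ratings = np.array([[5, 0, 3, 0],   # book A, rated by users 1 and 3
                        [4, 0, 3, 1],   # book B, rated much like book A
                        [0, 5, 0, 4]])  # book C, rated by a different set of users
toy_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute').fit(toy_ratings)
toy_dist, toy_idx = toy_knn.kneighbors(toy_ratings[[0]], n_neighbors = 2)
print(toy_idx, toy_dist)  # book A's nearest neighbour (after itself) is book B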
# Merge data based on ISBN
combine_book_rating = pd.merge(ratings, books, on = 'ISBN')
# Data columns to be dropped
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']
# Drop columns
combine_book_rating = combine_book_rating.drop(columns, axis = 1)
# Check data
combine_book_rating.head()
# Drop missing values
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])
# Group by book titles and create a new column for total rating count.
book_ratingCount = (combine_book_rating.
groupby(by = ['bookTitle'])['bookRating'].
count().
reset_index().
rename(columns = {'bookRating': 'totalRatingCount'})
[['bookTitle', 'totalRatingCount']]
)
# Check data
book_ratingCount.head()
# Combine the rating data with the total rating count data
# This gives exactly what we need to find out which books are popular and filter out lesser-known books
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, left_on = 'bookTitle', right_on = 'bookTitle', how = 'left')
# Check data
rating_with_totalRatingCount.head()
# Display floats with 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Check data statistics
print(book_ratingCount['totalRatingCount'].describe())
# The median book has been rated only once. Let’s look at the top of the distribution
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))
# Set threshold
popularity_threshold = 50
# Satisfy threshold
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
# Check data
rating_popular_book.head()
# Check data dimension
rating_popular_book.shape
# Merge with the users data on userID
combined = rating_popular_book.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')
# Filter to users in US and Canada only
us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada")]
# Drop column
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)
# Check data
us_canada_user_rating.head()
# Import library
from scipy.sparse import csr_matrix
# Drop duplicates
us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userID', 'bookTitle'])
# Pivot dataframe
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index = 'bookTitle', columns = 'userID', values = 'bookRating').fillna(0)
# Create sparse matrix
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)
# Check dataframe
us_canada_user_rating_pivot
# Import library
from sklearn.neighbors import NearestNeighbors
# Use cosine distance as the neighbour metric
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
# Fit the model
model_knn.fit(us_canada_user_rating_matrix)
# Get a random sample
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
# Show random sample
print('Index:', query_index, '=> Book:', us_canada_user_rating_pivot.index[query_index])
# Set distances and indices
distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 6)
# Print the recommendations (nearest neighbours) for the sampled book
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))