Case Study: Sentiment Analysis of Yelp Review Ratings

Introduction

Yelp is an app that helps people connect with local businesses and write reviews of them. In this project, Natural Language Processing (NLP) techniques are used to analyze these reviews and classify each one as either a bad or a good review.

Problem:

  • To assess the public perception of restaurants on Yelp via exploratory data analysis
  • To build a machine learning model which accurately predicts the sentiment of reviews

Dataset:

  • The 'stars' column is the business rating given by a customer, ranging from 1 to 5
  • The 'cool', 'useful' and 'funny' columns count the cool, useful and funny votes that a review received from other Yelp users

Source:

  • This dataset is a subset of Yelp's businesses, reviews, and user data.

Libraries and Dataset Importation

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [2]:
# Import dataset
df = pd.read_csv('project_data/yelp.csv')
df.head()
Out[2]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0

Data Exploration

In [3]:
# Check the dataset shape
df.shape
Out[3]:
(10000, 10)
In [4]:
# Check the dataset description
df.describe()
Out[4]:
stars cool useful funny
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 3.777500 0.876800 1.409300 0.701300
std 1.214636 2.067861 2.336647 1.907942
min 1.000000 0.000000 0.000000 0.000000
25% 3.000000 0.000000 0.000000 0.000000
50% 4.000000 0.000000 1.000000 0.000000
75% 5.000000 1.000000 2.000000 1.000000
max 5.000000 77.000000 76.000000 57.000000
In [5]:
# Check more info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  10000 non-null  object
 1   date         10000 non-null  object
 2   review_id    10000 non-null  object
 3   stars        10000 non-null  int64 
 4   text         10000 non-null  object
 5   type         10000 non-null  object
 6   user_id      10000 non-null  object
 7   cool         10000 non-null  int64 
 8   useful       10000 non-null  int64 
 9   funny        10000 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 781.4+ KB
In [6]:
# Check the fraction of missing values in each column
df.isnull().mean()
Out[6]:
business_id    0.0
date           0.0
review_id      0.0
stars          0.0
text           0.0
type           0.0
user_id        0.0
cool           0.0
useful         0.0
funny          0.0
dtype: float64
In [7]:
# Check the first text comment
df['text'][0]
Out[7]:
'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'
In [8]:
# Check another review text
df['text'][500]
Out[8]:
"Hands down my favorite coffee shop in town. They used to be located inside the Conspire art collective but have now moved into the old Hood Rides digs up the block and it's awesome.\n\nPlenty of seating room at the bar and also a really cool little room next door  where you can pull up a seat and hang out. From what I understand there will be seating outside too.\n\nMost importantly however is the coffee. They use a special mix roasted specially by Cartel. I'll be the first to admit I'm no coffee connoisseur but I do know this, every drink I've had, no matter who's behind the bar has been perfect. Not burnt, not too sweet, and not a hint of pretentiousness.\n\nSooooo, why go to S-word Bucks when you can get a better cup of coffee up the street. Oh and for those of us who burn the midnight oil on more than a few occasions, they're open until 12am EVERY night."
In [9]:
# Check one more review text
df['text'][999]
Out[9]:
"IKEA is so much fun. It's a little bit of a walk up and down the store but with all the different items on display there is time to sit and relax on the chairs, couches, beds. I love walking around the store looking a new ideas for my own home. They have very good prices on all household items. When shopping for accessories for the house I highly recommend going to IKEA because you are bound to find something. Simple things such as Tupperware, trash cans, bathroom accessories can all be found at a really good price."

Data Visualization

In [10]:
# Add a length column (number of characters in each review) to the dataframe
df['length'] = df['text'].apply(len)
In [11]:
# Check dataframe
df.head(2)
Out[11]:
business_id date review_id stars text type user_id cool useful funny length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
In [12]:
# Check the longest text (idxmax avoids hard-coding the maximum length, 4997)
df.loc[df['length'].idxmax(), 'text']
Out[12]:
'In our continuing quest to identify cool, locally owned places to eat and/or drink, Caroline and I auditioned Vintage 95 last night. \n\nBefore I go further, understand that whenever I go out for eats or drinks, I have  in  mind a Platonic Ideal of the Bar/Pub/Eatery I most want to frequent. I\'m on a constant quest to find that  Ideal expressed in Real Life. \n\nVintage 95 isn\'t quite there, but it\'s damn close. If I ever give something Five Stars, you\'ll know it has pegged my Platonic Ideal. Anyway...\n\nThe plan last night was for drinks. No plans for food, just Adult Beverages and warm conversation. But it turned into more.\n\nThe location in downtown Chandler is terrific for us. The owners have created a very fine visual experience - leather, stone, dark woods, good lighting. And they don\'t have the music turned up so loud that you CANNOT HAVE A CONVERSATION. This is one of my pet peeves. If I want to stare at people move their mouths while enduring an aural assault, I\'ll stand on the deck of an aircraft carrier. When I go out with friends, I want to enjoy their company AND their conversation. Is that concept so difficult to grasp? [/rant off]\n\nThe atmosphere at Vintage 95 is very close to my Ideal. I\'d go back just to sit on the leather couches in front of the fireplace, and then go back another time to sit on the leather stools at the bar, and then go back about fourteen more times to sit out on the patio. Seriously - go check out the patio. It is EXACTLY what a Patio Hangout Bar should be. EXACTLY.\n\nCaroline and I told the hostesses we were only there for drinks, so we were seated in the bar area in some fabulous leather club chairs. It wasn\'t initmate, but we weren\'t looking for intimate. And speaking of the bar, even though V95 advertises itself as a wine bar, they DO have booze. I\'m not much of a wine drinker and was very pleased to see that they carried a pretty good selection of single malt scotches. 
Not an overwhelming selection, but well beyond the normal Glenfiddich /Glenlivit /GlenMorangie trio to which most places are limited. I had a couple of drums of Ardbeg, which is one of my new favorites and very reasonably priced at retail. (Scotch is never reasonably priced in restaurants, but I was celebrating so I didn\'t care.) Caroline had her normal "vodka martini extra dirty extra cold" which she judged to have "perfect dirtiness", (no wonder I love her!), perfect amount of olives and very cold. \n\nThe limited Happy Hour menu had some very interesting choices. We settled on the bruschetta and the smoked tomato bisque. The bruschetta was VERY nice and quite unusual. You get to select four of eight choices for your bruschetta platter; we picked: (1) white bean and pancetta, (2) gravlax, caper goat cheese and pickled onions, (3) fig chutney, ricotta and prosciutto, (4) brie, pear and onion jam. They were all served cold, in nice sized portions and the flavors were all nicely balanced and very interesting. Caroline would have preferred the bread to not be so crispy, but I really liked it. The tomato bisque  was creamy, smoky and had well-balanced flavor. Caroline said it was unique and I say it was just darn delicious. \n\nThings being as they are, drinks and appetizers turned into food. A friend had told us "you have to try the Vintage burger", so we did. It came served with a mixture of regular and sweet potato fries, all nicely cooked and nicely seasoned. Recommended. The burger was VERY tasty. They obviously use good beef, the bun was fresh, the fixin\'s were tasty. HIGHLY recommended.\n\nIn for a dime, in for a dollar, right? So we ordered dessert. Again, the dessert menu is short, but I\'m okay with that as long as they do it well. Chocolate torte with hazelnut gelato, apple pie with carmel sauce and creme fraiche gelato, and something else we couldn\'t remember. 
I\'m allergic to hazelnut and don\'t like sweet desserts, so we decided to try the apple pie.\n\nLike everything else we had sampled, the apple pie was unusual - you wouldn\'t find it anywhere else. It was served on a freshly baked puff pastry, cubed apples served on top and inside - tender but not mushy -  with lots of cinnamon and sugar, plate was swirled with salted dolce la leche. It was tasty, but instead of the expected creme fraiche gelato, we were served hazelnut gelato. I didn\'t realize it was hazelnut until I\'d had a couple of bites and my throat started to swell up.\n\nAt this point that the night could have turned into a disaster, but to their credit - it didn\'t. We told the waiter who told the manager, (Gavin - one of the owners), who immediately came and asked if I needed emergency assistance. I didn\'t, I\'m not THAT allergic.)  Frankly, their response was EXACTLY the kind of customer service you want to see. Anyone can make a mistake, so no harm, no foul. But I must give BIG Kudos to Gavin for his kindness, attention to detail and outstanding customer service.\n\nWe will DEFINTELY be back and I strongly recommend you put it on your list too.'
In [13]:
# Visualize the distribution of review lengths
df['length'].plot(bins = 100, kind = 'hist')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ee457c6130>
In [14]:
# Plot the number of reviews for each star rating
sns.countplot(y = 'stars', data = df)
plt.show()
In [15]:
# Create a FacetGrid of review length histograms, one panel per star rating
g = sns.FacetGrid(data = df, col = 'stars', col_wrap = 3)
g.map(plt.hist, 'length', bins = 20, color = 'r')
plt.show()
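The histograms above can be complemented with a numerical summary of review length per star rating. The sketch below uses a tiny made-up dataframe (`toy_df` and its contents are illustrative assumptions, not Yelp data); on the real data the same `groupby` call would presumably be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Yelp dataframe
toy_df = pd.DataFrame({
    'stars': [1, 1, 5, 5, 5],
    'text': ['bad', 'awful food', 'great', 'loved it', 'best place ever'],
})
toy_df['length'] = toy_df['text'].apply(len)

# Mean and median review length for each star rating
length_summary = toy_df.groupby('stars')['length'].agg(['mean', 'median'])
print(length_summary)
```

A table like this makes it easier to compare ratings than reading bar heights off five separate panels.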
In [16]:
# Check the 1 star reviews
df_1 = df[df['stars'] == 1]
df_1.head(2)
Out[16]:
business_id date review_id stars text type user_id cool useful funny length
23 IJ0o6b8bJFAbG6MjGfBebQ 2010-09-05 Dx9sfFU6Zn0GYOckijom-g 1 U can go there n check the car out. If u wanna... review zRlQEDYd_HKp0VS3hnAffA 0 1 1 594
31 vvA3fbps4F9nGlAEYKk_sA 2012-05-04 S9OVpXat8k5YwWCn6FAgXg 1 Disgusting! Had a Groupon so my daughter and ... review 8AMn6644NmBf96xGO3w6OA 0 1 0 361
In [17]:
# Check the 5 stars reviews
df_5 = df[df['stars'] == 5]
df_5.head(2)
Out[17]:
business_id date review_id stars text type user_id cool useful funny length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
In [18]:
# Join the two dataframes
df_15 = pd.concat([df_1, df_5])
df_15
Out[18]:
business_id date review_id stars text type user_id cool useful funny length
23 IJ0o6b8bJFAbG6MjGfBebQ 2010-09-05 Dx9sfFU6Zn0GYOckijom-g 1 U can go there n check the car out. If u wanna... review zRlQEDYd_HKp0VS3hnAffA 0 1 1 594
31 vvA3fbps4F9nGlAEYKk_sA 2012-05-04 S9OVpXat8k5YwWCn6FAgXg 1 Disgusting! Had a Groupon so my daughter and ... review 8AMn6644NmBf96xGO3w6OA 0 1 0 361
35 o1GIYYZJjM6nM03fQs_uEQ 2011-11-30 ApKbwpYJdnhhgP4NbjQw2Q 1 I've eaten here many times, but none as bad as... review iwUN95LIaEr75TZE_JC6bg 0 4 3 1198
61 l4vBbCL9QbGiwLuLKwD_bA 2011-11-22 DJVxOfj2Rw9zklC9tU3i1w 1 I have always been a fan of Burlington's deals... review EPROVap0M19Y6_4uf3eCmQ 0 0 0 569
64 CEswyP-9SsXRNLR9fFGKKw 2012-05-19 GXj4PNAi095-q9ynPYH3kg 1 Another night meeting friends here. I have to... review MjLAe48XNfYlTeFYca5gMw 0 1 2 498
... ... ... ... ... ... ... ... ... ... ... ...
9990 R8VwdLyvsp9iybNqRvm94g 2011-10-03 pcEeHdAJPoFNF23es0kKWg 5 Yes I do rock the hipster joints. I dig this ... review b92Y3tyWTQQZ5FLifex62Q 1 1 1 263
9991 WJ5mq4EiWYAA4Vif0xDfdg 2011-12-05 EuHX-39FR7tyyG1ElvN1Jw 5 Only 4 stars? \n\n(A few notes: The folks that... review hTau-iNZFwoNsPCaiIUTEA 1 1 0 908
9992 f96lWMIAUhYIYy9gOktivQ 2009-03-10 YF17z7HWlMj6aezZc-pVEw 5 I'm not normally one to jump at reviewing a ch... review W_QXYA7A0IhMrvbckz7eVg 2 3 2 1326
9994 L3BSpFvxcNf3T_teitgt6A 2012-03-19 0nxb1gIGFgk3WbC5zwhKZg 5 Let's see...what is there NOT to like about Su... review OzOZv-Knlw3oz9K5Kh5S6A 1 2 1 1968
9999 pF7uRzygyZsltbmVpjIyvw 2010-10-16 vWSmOhg2ID1MNZHaWapGbA 5 4-5 locations.. all 4.5 star average.. I think... review KSBFytcdjPKZgXKQnYQdkA 0 0 0 461

4086 rows × 11 columns

In [19]:
# Check more info
df_15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4086 entries, 23 to 9999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  4086 non-null   object
 1   date         4086 non-null   object
 2   review_id    4086 non-null   object
 3   stars        4086 non-null   int64 
 4   text         4086 non-null   object
 5   type         4086 non-null   object
 6   user_id      4086 non-null   object
 7   cool         4086 non-null   int64 
 8   useful       4086 non-null   int64 
 9   funny        4086 non-null   int64 
 10  length       4086 non-null   int64 
dtypes: int64(5), object(6)
memory usage: 383.1+ KB
In [20]:
# Check 1 star percentage
print('1 Star Review Percentage = ', (len(df_1) / len(df_15) ) *100, '%')
1 Star Review Percentage =  18.330885952031327 %
In [21]:
# Check 5 star percentage
print('5 Star Review Percentage = ', (len(df_5) / len(df_15) ) *100, '%')
5 Star Review Percentage =  81.66911404796868 %
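Both percentages can also be obtained in one step with `value_counts(normalize = True)`. The following is a minimal sketch on made-up ratings (`toy_stars` is a hypothetical stand-in for `df_15['stars']`).

```python
import pandas as pd

# Hypothetical ratings; in the notebook this would be df_15['stars']
toy_stars = pd.Series([1, 5, 5, 5, 1, 5, 5, 5, 5, 5], name='stars')

# Share of each class, expressed as a percentage
class_share = toy_stars.value_counts(normalize=True) * 100
print(class_share.round(1))
```

The imbalance is worth keeping in mind when reading the accuracy later: with roughly 82% of the reviews being 5-star, a model that always predicts 5 stars would already be right about 82% of the time.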
In [22]:
# Plot the class balance between 1-star and 5-star reviews
sns.countplot(x = 'stars', data = df_15)
plt.show()

Data Pre-processing

In [23]:
# Import string to check string punctuation
import string
string.punctuation
Out[23]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [24]:
# Create a test sample
Test = 'Hello joseff, I am so happy learning AI now!!'
In [25]:
# Remove unnecessary punctuation in the test sample
Test_punc_removed = [ char for char in Test if char not in string.punctuation]
In [66]:
# Check test sample
Test_punc_removed;
In [27]:
# Join the test sample
Test_punc_removed_join = ''.join(Test_punc_removed)
In [28]:
# Check the connected test sample
Test_punc_removed_join
Out[28]:
'Hello joseff I am so happy learning AI now'
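The character-by-character filter and join above can also be written with `str.translate`, which removes all punctuation in a single pass. This is an alternative sketch, not the approach used in the notebook; the names `punct_table` and `sample_no_punct` are introduced here for illustration.

```python
import string

# Translation table that maps every punctuation character to None (deletion)
punct_table = str.maketrans('', '', string.punctuation)

sample = 'Hello joseff, I am so happy learning AI now!!'
sample_no_punct = sample.translate(punct_table)
print(sample_no_punct)
```

It produces the same string as the list comprehension followed by `''.join(...)`.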
In [65]:
# Import stopwords for cleaning
from nltk.corpus import stopwords
stopwords.words('english');
In [30]:
# Remove stopwords to keep only the informative words
Test_punc_removed_join_clean = [ word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')] 
In [31]:
# Check the cleaned test sample
Test_punc_removed_join_clean
Out[31]:
['Hello', 'joseff', 'happy', 'learning', 'AI']
In [32]:
# Turn the words to numbers by CountVectorizer
sample_data = ['Hello joseff, I am so happy learning AI now!!', 'joseff is so cute', 'AI is love']

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)
In [33]:
# Get the vocabulary learned by the vectorizer
# (get_feature_names() was removed in newer scikit-learn releases)
print(vectorizer.get_feature_names_out().tolist())
['ai', 'am', 'cute', 'happy', 'hello', 'is', 'joseff', 'learning', 'love', 'now', 'so']
In [34]:
# Check the equivalent number value
print(X.toarray())
[[1 1 0 1 1 0 1 1 0 1 1]
 [0 0 1 0 0 1 1 0 0 0 1]
 [1 0 0 0 0 1 0 0 1 0 0]]
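To see where these numbers come from, the count matrix can be rebuilt by hand. The sketch below mimics what CountVectorizer does with its default settings as far as this example is concerned (lowercasing, keeping tokens of two or more word characters, sorting the vocabulary alphabetically); the helper `tokenize` is a name introduced here, not part of scikit-learn.

```python
import re

sample_data = ['Hello joseff, I am so happy learning AI now!!',
               'joseff is so cute', 'AI is love']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: sorted unique tokens across all documents
vocabulary = sorted({tok for doc in sample_data for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
count_matrix = [[tokenize(doc).count(word) for word in vocabulary]
                for doc in sample_data]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each column corresponds to one vocabulary word and each row to one document. The single-letter word 'I' is dropped because tokens need at least two characters, which is why it does not appear in the feature names.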

More Data Pre-processing

In [35]:
# Create preprocessing function
def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [ word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
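Note that `message_cleaning` calls `stopwords.words('english')` once for every word, which may become slow over thousands of reviews. One possible variant loads the stop words once into a set and reuses it. The sketch below assumes a deliberately tiny, hand-written stop-word set (`STOP_WORDS`) so that it runs without NLTK; in the notebook it would presumably be `set(stopwords.words('english'))`. The function name `message_cleaning_fast` is hypothetical.

```python
import string

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's list
STOP_WORDS = frozenset({'i', 'am', 'so', 'now', 'is', 'the', 'a', 'and'})
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def message_cleaning_fast(message, stop_words=STOP_WORDS):
    # Strip punctuation, then drop stop words using constant-time set lookups
    no_punct = message.translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

print(message_cleaning_fast('Hello joseff, I am so happy learning AI now!!'))
```

Because it has the same input and output shape as `message_cleaning`, a function like this could be passed to `CountVectorizer(analyzer = ...)` in the same way.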
In [36]:
# Check cleaned data
Test_punc_removed_join_clean
Out[36]:
['Hello', 'joseff', 'happy', 'learning', 'AI']
In [37]:
# Apply the cleaning function to every review
df_clean = df_15['text'].apply(message_cleaning)
In [38]:
# Original review
print(df_15['text'][0])
My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!
In [39]:
# Cleaned review
print(df_clean[0])
['wife', 'took', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'made', 'sitting', 'outside', 'overlooking', 'grounds', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrived', 'quickly', 'semibusy', 'Saturday', 'morning', 'looked', 'like', 'place', 'fills', 'pretty', 'quickly', 'earlier', 'get', 'better', 'favor', 'get', 'Bloody', 'Mary', 'phenomenal', 'simply', 'best', 'Ive', 'ever', 'Im', 'pretty', 'sure', 'use', 'ingredients', 'garden', 'blend', 'fresh', 'order', 'amazing', 'EVERYTHING', 'menu', 'looks', 'excellent', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'tasty', 'delicious', 'came', '2', 'pieces', 'griddled', 'bread', 'amazing', 'absolutely', 'made', 'meal', 'complete', 'best', 'toast', 'Ive', 'ever', 'Anyway', 'cant', 'wait', 'go', 'back']
In [40]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# fit CountVectorizer to the dataset
vectorizer = CountVectorizer( analyzer = message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df_15['text'])
In [63]:
# Check the vocabulary of the vectorized dataset
# (get_feature_names() was removed in newer scikit-learn releases)
vectorizer.get_feature_names_out();
In [42]:
# Check df_countvectorizer in array format
print(df_countvectorizer.toarray())
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
In [43]:
# df_countvectorizer dimension
df_countvectorizer.shape
Out[43]:
(4086, 26435)
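A 4086 × 26435 matrix has about 108 million cells, nearly all of them zero, which is why CountVectorizer returns a sparse matrix and why converting it with `.toarray()` can use a lot of memory. How sparse such a matrix is can be checked through its `nnz` attribute. The sketch below uses two illustrative documents (`toy_docs`) rather than the Yelp reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['joseff is so cute', 'AI is love']   # illustrative documents
toy_counts = CountVectorizer().fit_transform(toy_docs)

n_rows, n_cols = toy_counts.shape
# nnz is the number of explicitly stored (non-zero) entries
density = toy_counts.nnz / (n_rows * n_cols)
print(toy_counts.shape, round(density, 3))
```

On the real data the same expression applied to `df_countvectorizer` would give the fraction of non-zero cells.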

Model Creation

In [51]:
# Define the independent (X) and dependent (y) variables
X = df_countvectorizer
y = df_15['stars'].values
In [52]:
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 )
In [53]:
# Check the train dataset shape
X_train.shape, y_train.shape
Out[53]:
((3268, 26435), (3268,))
In [54]:
# Check the test dataset shape
X_test.shape, y_test.shape
Out[54]:
((818, 26435), (818,))
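Since the classes are imbalanced and the split above is random, the share of 1-star reviews in the test set can change from run to run. A possible refinement, sketched here on hypothetical labels (`toy_X` and `toy_y` are placeholders), is to pass `stratify` and `random_state` to `train_test_split` so that the split is reproducible and keeps the class ratio. The figures reported in this notebook come from the unstratified split above, not from this variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 one-star and 80 five-star reviews
toy_y = np.array([1] * 20 + [5] * 80)
toy_X = np.arange(100).reshape(-1, 1)   # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42)

# Both splits keep the 20/80 class ratio
print((y_tr == 1).mean(), (y_te == 1).mean())
```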
In [55]:
# Fit the model to the training data
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
Out[55]:
MultinomialNB()
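In this notebook the vectorizer is fitted on all of `df_15` before the split, so its vocabulary also contains words that occur only in test reviews. One way to keep vectorization and classification together, so that both are fitted on training text only, is a scikit-learn pipeline. The sketch below is an illustration on a hypothetical miniature training set (`toy_texts`, `toy_labels`), not a rerun of the model above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set (not Yelp data)
toy_texts = ['great food amazing service', 'terrible food awful service',
             'amazing place great staff', 'awful place terrible staff']
toy_labels = [5, 1, 5, 1]

# Vectorizer and classifier are fitted together and applied in sequence
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

toy_pred = toy_model.predict(['great amazing', 'terrible awful'])
print(toy_pred)
```

A pipeline also accepts raw strings at prediction time, which is convenient when scoring new reviews.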

Model Evaluation

In [56]:
# Import evaluation libraries
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the training dataset
y_predict_train = NB_classifier.predict(X_train)
y_predict_train
Out[56]:
array([5, 5, 5, ..., 5, 1, 5], dtype=int64)
In [57]:
# Check confusion matrix for training dataset
cm = confusion_matrix(y_train, y_predict_train)
# fmt = 'd' shows the counts as integers instead of scientific notation
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()
In [64]:
# Predict on the testing dataset
y_predict_test = NB_classifier.predict(X_test)
y_predict_test;
In [59]:
# Check confusion matrix for testing dataset
cm = confusion_matrix(y_test, y_predict_test)
# fmt = 'd' shows the counts as integers instead of scientific notation
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()
In [60]:
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

           1       0.84      0.69      0.76       149
           5       0.93      0.97      0.95       669

    accuracy                           0.92       818
   macro avg       0.89      0.83      0.85       818
weighted avg       0.92      0.92      0.92       818
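The precision, recall and F1 columns in the report can be derived directly from a confusion matrix. The sketch below shows the arithmetic on a hypothetical matrix (`toy_cm` holds illustrative counts, not the counts from this notebook), with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
# Class order: [1 star, 5 stars]. These counts are illustrative only.
toy_cm = np.array([[30, 10],
                   [5, 55]])

true_pos = np.diag(toy_cm)
precision = true_pos / toy_cm.sum(axis=0)   # correct / all predicted as the class
recall = true_pos / toy_cm.sum(axis=1)      # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / toy_cm.sum()

print(precision.round(2), recall.round(2), f1.round(2), round(accuracy, 2))
```

Read this way, the recall of 0.69 for class 1 in the report means that about 69% of the 149 actual 1-star test reviews were predicted as 1-star, with the rest misclassified as 5-star.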

Conclusion:

The Naive Bayes model achieved an accuracy of 92% on the test set, although recall for 1-star reviews (0.69) is noticeably lower than for 5-star reviews (0.97). As a business owner, it is reasonable to wonder which aspects of the services provided are viewed as negative or positive by customers. A model like this could sort incoming reviews by sentiment automatically, giving owners more time to focus on what matters most.