Yelp is an app that helps people connect with and write reviews of local businesses. In this project, Natural Language Processing (NLP) techniques are used to analyze the reviews and classify each one as good or bad.
Problem:
Dataset:
Source:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Import dataset
df = pd.read_csv('project_data/yelp.csv')
df.head()
# Check the dataset shape
df.shape
# Check the dataset description
df.describe()
# Check more info about the dataset
df.info()
# Check the fraction of missing values per column
df.isnull().mean()
# Check the first text comment
df['text'][0]
# Check another text comment
df['text'][500]
# Check one more comment
df['text'][999]
# Add a length column to the dataframe
df['length'] = df['text'].apply(len)
# Check dataframe
df.head(2)
# Check the longest text
df[df['length'] == df['length'].max()]['text'].iloc[0]
# Visualize the distribution of review lengths
df['length'].plot(bins = 100, kind = 'hist')
# Count plot of the stars column
sns.countplot(y = 'stars', data = df)
plt.show()
# Create a FacetGrid of length histograms per star rating
g = sns.FacetGrid(data = df, col = 'stars', col_wrap = 3)
g.map(plt.hist, 'length', bins = 20, color = 'r')
plt.show()
# Check the 1-star reviews
df_1 = df[df['stars'] == 1]
df_1.head(2)
# Check the 5-star reviews
df_5 = df[df['stars'] == 5]
df_5.head(2)
# Join the two dataframes
df_15 = pd.concat([df_1, df_5])
df_15
# Check more info
df_15.info()
# Check 1 star percentage
print('1 Star Review Percentage = ', (len(df_1) / len(df_15) ) *100, '%')
# Check 5 star percentage
print('5 Star Review Percentage = ', (len(df_5) / len(df_15) ) *100, '%')
sns.countplot(x = 'stars', data = df_15)
plt.show()
# Import string to check string punctuation
import string
string.punctuation
# Create a test sample
Test = 'Hello joseff, I am so happy learning AI now!!'
# Remove unnecessary punctuation in the test sample
Test_punc_removed = [ char for char in Test if char not in string.punctuation]
# The test sample is now a list of characters (output suppressed by the trailing semicolon)
Test_punc_removed;
# Join the test sample
Test_punc_removed_join = ''.join(Test_punc_removed)
# Check the connected test sample
Test_punc_removed_join
# Import stopwords for cleaning
from nltk.corpus import stopwords
stopwords.words('english');
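If the stopword corpus has not been downloaded yet, NLTK raises a LookupError; a one-time download (shown here as an optional step, not part of the original run) fixes it:
# One-time download of the stopword corpus (only needed on the first run)
import nltk
nltk.download('stopwords')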
# Remove stopwords to keep only the informative words
Test_punc_removed_join_clean = [ word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
# Check the cleaned test sample
Test_punc_removed_join_clean
# Turn words into numbers with CountVectorizer
sample_data = ['Hello joseff, I am so happy learning AI now!!', 'joseff is so cute', 'AI is love']
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)
# Get the learned vocabulary (get_feature_names_out requires scikit-learn >= 1.0; older versions used get_feature_names)
print(vectorizer.get_feature_names_out())
# Check the equivalent count matrix
print(X.toarray())
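To make the mapping between tokens and matrix columns easier to read, the counts can be viewed as a labeled DataFrame. This is an optional sketch added here, not part of the original notebook:
# Optional: view the token counts with their feature names, one row per sample document
pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())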
# Create a preprocessing function that removes punctuation and stopwords
def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
# Check cleaned data
Test_punc_removed_join_clean
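As an optional sanity check (added here, not in the original notebook), the helper should reproduce the step-by-step result above in a single call:
# Sanity check: punctuation and stopwords are dropped in one call
message_cleaning('Hello joseff, I am so happy learning AI now!!')
# Should return ['Hello', 'joseff', 'happy', 'learning', 'AI']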
# Apply the cleaning function to every review
df_clean = df_15['text'].apply(message_cleaning)
# Original review
print(df_15['text'].iloc[0])
# Cleaned review
print(df_clean.iloc[0])
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Fit CountVectorizer to the dataset, using the cleaning function as the analyzer
vectorizer = CountVectorizer( analyzer = message_cleaning)
df_countvectorizer = vectorizer.fit_transform(df_15['text'])
# Check the learned vocabulary (output suppressed)
vectorizer.get_feature_names_out();
# Check df_countvectorizer in array format
print(df_countvectorizer.toarray())
# df_countvectorizer dimension
df_countvectorizer.shape
# Label the independent and dependent variables
X = df_countvectorizer
y = df_15['stars'].values
# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 )
# Check the train dataset shape
X_train.shape, y_train.shape
# Check the test dataset shape
X_test.shape, y_test.shape
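Note that train_test_split shuffles at random, so exact results vary between runs. An optional variant (not what the run above used) passes random_state for reproducibility and stratify to preserve the 1-star/5-star ratio in both splits:
# Optional variant: reproducible, stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 42, stratify = y)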
# Fit a Multinomial Naive Bayes model to the training data
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
# Import evaluation libraries
from sklearn.metrics import classification_report, confusion_matrix
# Predict on the training dataset
y_predict_train = NB_classifier.predict(X_train)
y_predict_train
# Check the confusion matrix for the training dataset
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()
# Predict on the testing dataset
y_predict_test = NB_classifier.predict(X_test)
y_predict_test;
# Check the confusion matrix for the testing dataset
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True, fmt = 'd')
plt.show()
# Print the precision, recall, and F1-score for the test set
print(classification_report(y_test, y_predict_test))
The Naive Bayes model achieved an accuracy of 92%. Such a model could automatically sort the reviews a business receives, freeing owners to spend more time on what matters most. As a business owner, it would also be reasonable to ask which aspects of the service customers view as positive or negative.
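As a closing illustration, the fitted vectorizer and classifier can score unseen text end to end; the two sample reviews below are made up for demonstration:
# Hypothetical new reviews (not from the dataset) to illustrate inference
new_reviews = ['The food was amazing and the staff were so friendly!',
               'Terrible service, cold food, never coming back.']
# vectorizer was fitted on df_15['text'], so transform() reuses its vocabulary
new_vectorized = vectorizer.transform(new_reviews)
# Predicted star ratings (1 or 5) for each review
NB_classifier.predict(new_vectorized)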