Case Study: Netflix Stock Price Prediction

Introduction

Algorithmic trading has grown in popularity over the past few years, with traders increasingly relying on automated programs ("robots") to do the work for them. It executes orders using automated, pre-programmed trading instructions. This project attempts to predict the stock prices of Netflix and Facebook using machine learning algorithms.

Problem

  • Part 1: Predict the price of Netflix stock over the next thirty days.
  • Part 2: Predict the price of Facebook stock on a specific day.

Dataset

  • Historical data of Netflix
  • Historical data of Facebook

Source: Yahoo Finance

Libraries and Data Importation

In [32]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('bmh')
In [33]:
# Load the data
df = pd.read_csv('project_data/NFLX.csv')
df.head()
Out[33]:
Date Open High Low Close Adj Close Volume
0 2019-07-08 378.190002 378.250000 375.359985 376.160004 376.160004 3113400
1 2019-07-09 379.059998 384.760010 377.500000 379.929993 379.929993 6932800
2 2019-07-10 382.769989 384.339996 362.679993 381.000000 381.000000 5878800
3 2019-07-11 381.100006 384.540009 378.799988 379.500000 379.500000 4336300
4 2019-07-12 378.679993 379.739990 372.790009 373.250000 373.250000 6636900

Data Exploration

In [34]:
# Get the number of trading days (rows) and columns
df.shape
Out[34]:
(252, 7)
In [35]:
# Visualize the close price data
plt.figure(figsize=(16,8))
plt.title('NETFLIX')
plt.xlabel('Days')
plt.ylabel('Close Price (USD)')
plt.plot(df['Close'])
plt.show()
In [36]:
# Get the close price
df = df[['Close']]
df.head()
Out[36]:
Close
0 376.160004
1 379.929993
2 381.000000
3 379.500000
4 373.250000

Data Pre-processing

In [37]:
# Create a variable to predict 'x' days out in the future
future_days = 30

# Create a new column (target) shifted 'x' units/days up
df['Prediction'] = df[['Close']].shift(-future_days)
df.tail()
Out[37]:
Close Prediction
247 447.239990 NaN
248 455.040009 NaN
249 485.640015 NaN
250 476.890015 NaN
251 493.809998 NaN
In [38]:
# Create the feature dataset (X), convert it to a numpy array, and drop the last 'x' rows/days
X = np.array(df.drop(['Prediction'], axis = 1))[:-future_days]
X;
In [39]:
# Create the target dataset (y), convert it to a numpy array,
# and keep all of the target values except the last 'x' rows/days
y = np.array(df['Prediction'])[:-future_days]
y;
In [40]:
# Split the data into 75% training and 25% testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
In [41]:
# Get the last 'x' rows of the feature dataset 
x_future = df.drop(['Prediction'], axis = 1)[:-future_days]
x_future = x_future.tail(future_days)
x_future = np.array(x_future)
x_future;
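
Because `train_test_split` shuffles rows by default, test rows can fall between training rows in time. The following is a minimal sketch of a chronological hold-out, in which every test row is later than every training row. It assumes a synthetic price series in place of NFLX.csv; the series, the name `horizon` and the 75% cut point are illustrative assumptions, not the split used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic closing prices stand in for the real data (assumption).
rng = np.random.default_rng(0)
close = 400 + np.cumsum(rng.normal(0, 2, size=252))

horizon = 30  # predict the close 30 rows ahead, as in the notebook

# Features are today's close; targets are the close `horizon` rows later.
X_all = close[:-horizon].reshape(-1, 1)
y_all = close[horizon:]

# Chronological split: the first 75% of rows train, the remainder test.
cut = int(len(X_all) * 0.75)
X_tr, X_te = X_all[:cut], X_all[cut:]
y_tr, y_te = y_all[:cut], y_all[cut:]

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Mean absolute error on the held-out (later) rows.
mae = float(np.mean(np.abs(pred - y_te)))
print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}, MAE: {mae:.2f}")
```

With the real data, `close` would be replaced by `df['Close'].to_numpy()`.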

Model Creation

In [42]:
# Fitting linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x_train, y_train)

# Fitting SVR with a linear kernel to the dataset
from sklearn.svm import SVR
svrl = SVR(kernel = 'linear')
svrl.fit(x_train, y_train)

# Fitting SVR with a polynomial kernel to the dataset
svrp = SVR(kernel = 'poly')
svrp.fit(x_train, y_train)

# Fitting Decision Tree to the dataset
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
tree = tree.fit(x_train, y_train)

# Fitting Random Forest Regression to the dataset
# (squared error is the default criterion; the 'mse' alias was removed in newer scikit-learn releases)
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 20, random_state = 0)
forest.fit(x_train, y_train)

# Fitting XGBoost Regression to the dataset
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(x_train, y_train)
Out[42]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
In [43]:
# Show the model linear regression prediction
lr_prediction = lr.predict(x_future)
print('lr_prediction: ',lr_prediction)

# Show the model SVR linear regression prediction
svrl_prediction = svrl.predict(x_future)
print('svr1_prediction: ', svrl_prediction)

# Show the model SVR Poly prediction
svrp_prediction = svrp.predict(x_future)
print('svrp_prediction: ', svrp_prediction)

# Show the model tree prediction
tree_prediction = tree.predict(x_future)
print('tree_prediction: ', tree_prediction)

# Show the model Random Forest prediction
forest_prediction = forest.predict(x_future)
print('forest_prediction: ', forest_prediction)

# Show the XGBoost prediction
xgb_prediction = xgb.predict(x_future)
print('xgb_prediction: ', xgb_prediction)

print()
lr_prediction:  [383.60905272 407.45931151 422.89775557 435.0063595  446.39945659
 431.52971737 444.85834185 441.50095651 430.11706837 434.96050463
 433.39187127 430.08036833 413.98142713 421.37503304 428.67687325
 424.47554284 436.2905987  433.10750555 441.8954242  443.97772516
 443.07874378 447.6378134  439.65716639 445.57384869 448.94959873
 460.17755754 458.70066237 457.28801338 454.19665658 443.72087714]
svr1_prediction:  [385.53937099 408.52008749 423.39567055 435.06281415 446.04053713
 431.71292483 444.55560897 441.32062854 430.35177876 435.01863107
 433.50718758 430.31641674 414.8044172  421.92846406 428.96409085
 424.91593422 436.30023204 433.23318938 441.70071456 443.70709816
 442.8408922  447.23374543 439.54406005 445.24502958 448.49770517
 459.31631055 457.89326062 456.53211455 453.55346368 443.45961441]
svrp_prediction:  [376.34805205 403.61999459 423.29403109 439.88861991 456.46902433
 435.01701225 454.17041122 449.22358024 433.062319   439.82379981
 437.61552503 433.01172679 411.73230965 421.2804906  431.08416249
 425.39750669 441.71018596 437.21710075 449.80049142 452.86484832
 451.53796875 458.32885512 446.54215983 455.2354173  460.31143657
 477.81181912 475.45507524 473.21648639 468.37076629 452.48513678]
tree_prediction:  [429.320007 414.769989 419.890015 413.440002 419.730011 425.920013
 427.309998 421.970001 414.329987 419.600006 419.48999  434.049988
 414.769989 425.559998 418.070007 425.5      436.130005 447.769989
 449.869995 453.720001 468.040009 466.26001  421.970001 465.910004
 443.399994 447.23999  455.040009 485.640015 476.890015 453.720001]
forest_prediction:  [430.070007   404.11049045 421.87150915 418.68050205 440.9425081
 424.5000029  431.1699986  432.5420012  420.02648955 420.52850325
 424.5639906  430.4799898  416.90049295 423.8560032  421.0790022
 424.66150215 427.14050285 436.19998925 442.3069991  448.4380004
 459.6235048  455.0875063  430.7830016  446.610001   445.98399825
 451.109996   459.6000064  470.3100085  476.6035129  450.47450095]
xgb_prediction:  [429.92593 413.70047 422.9177  416.59482 422.2149  425.72382 430.69208
 424.7191  418.3787  419.94144 419.94144 430.15622 413.70047 423.7935
 420.79367 423.28537 433.30252 442.20663 449.66965 453.82437 465.28488
 464.38287 424.7191  463.25876 444.77448 447.58707 454.96893 483.44696
 476.75873 453.82437]

Model Evaluation

Linear Regression Prediction

In [44]:
# Visualize the data
Predictions = lr_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

SVR Linear Prediction

In [45]:
# Visualize the data
Predictions = svrl_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

SVR Poly Prediction

In [46]:
# Visualize the data
Predictions = svrp_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

Decision Tree Prediction

In [47]:
# Visualize the data
Predictions = tree_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

Random Forest Prediction

In [48]:
# Visualize the data
Predictions = forest_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

XGB Prediction

In [49]:
# Visualize the data
Predictions = xgb_prediction

# Copy the slice so that adding a column does not raise SettingWithCopyWarning
valid = df[X.shape[0]:].copy()
valid['Prediction'] = Predictions
plt.figure(figsize = (16,8))
plt.title('Model')
plt.xlabel('Days')
plt.ylabel('Close Price USD')
plt.plot(df['Close'])
plt.plot(valid[['Close', 'Prediction']])
plt.legend(['Orig', 'Val', 'Pred'])
plt.show()

Conclusion

It can be observed that the Linear Regression and Support Vector Regression predictions are far from the actual prices. The Decision Tree, Random Forest and XGBoost models miss the first and last days of the 30-day window, but in the middle days their predictions track the actual prices fairly closely. These results could be improved further by tuning hyperparameters and by trying other models such as reinforcement learning and artificial neural networks.
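
The comparison above is visual. One way to make it quantitative is to compute an error metric between each prediction array and the actual closing prices over the same days. The sketch below assumes two hypothetical helpers, `rmse` and `mae`, and uses made-up five-value arrays; the numbers are illustrative only and are not results from this notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Illustrative values only (assumption), not the notebook's data.
actual = [100.0, 102.0, 101.0, 105.0, 107.0]
predictions = {
    "model_a": [101.0, 101.0, 102.0, 104.0, 108.0],
    "model_b": [90.0, 95.0, 99.0, 110.0, 115.0],
}

# A lower score means the predictions sit closer to the actual prices.
for name, pred in predictions.items():
    print(f"{name}: RMSE={rmse(actual, pred):.3f}, MAE={mae(actual, pred):.3f}")
```

In this notebook, `valid['Close']` and each `*_prediction` array could be passed to the same helpers.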


Case Study: Facebook Stock Price Prediction

Data Importation

In [50]:
# Load the data
df = pd.read_csv('project_data/FB.csv')
df
Out[50]:
Date Open High Low Close Adj Close Volume
0 2020-06-01 224.589996 232.440002 223.500000 231.910004 231.910004 18223800
1 2020-06-02 230.940002 233.000000 226.559998 232.720001 232.720001 20919000
2 2020-06-03 232.110001 232.649994 228.529999 230.160004 230.160004 15380300
3 2020-06-04 229.559998 231.630005 224.610001 226.289993 226.289993 17041500
4 2020-06-05 226.710007 231.350006 225.309998 230.770004 230.770004 16750400
5 2020-06-08 229.029999 231.550003 227.410004 231.399994 231.399994 15466500
6 2020-06-09 231.520004 239.770004 230.410004 238.669998 238.669998 27462900
7 2020-06-10 240.960007 241.210007 235.279999 236.729996 236.729996 20720700
8 2020-06-11 229.940002 232.889999 223.550003 224.429993 224.429993 26708200
9 2020-06-12 229.899994 231.660004 224.500000 228.580002 228.580002 22071700
10 2020-06-15 225.089996 233.770004 224.800003 232.500000 232.500000 15340300
11 2020-06-16 237.139999 238.460007 233.000000 235.649994 235.649994 15236700
12 2020-06-17 235.000000 237.589996 231.729996 235.529999 235.529999 19552800
13 2020-06-18 234.990005 236.139999 232.149994 235.940002 235.940002 15782500
14 2020-06-19 237.789993 240.830002 235.550003 238.789993 238.789993 30081300
15 2020-06-22 238.559998 240.699997 236.910004 239.220001 239.220001 18917800
16 2020-06-23 241.279999 245.190002 239.860001 242.240005 242.240005 24017900
17 2020-06-24 241.199997 243.220001 232.679993 234.020004 234.020004 20834900
18 2020-06-25 234.619995 237.300003 232.740005 235.679993 235.679993 18704300
19 2020-06-26 232.639999 233.089996 215.399994 216.080002 216.080002 76343900
20 2020-06-29 209.750000 220.750000 207.110001 220.639999 220.639999 58514300
In [51]:
# Get the number of rows and columns
df.shape
Out[51]:
(21, 7)

Data Pre-processing

In [52]:
# Get and print the last row of data
actual_price = df.tail(1)
actual_price
Out[52]:
Date Open High Low Close Adj Close Volume
20 2020-06-29 209.75 220.75 207.110001 220.639999 220.639999 58514300
In [53]:
# Prepare the data for training the models:
# keep all the data except for the last row, which is held out as the actual price
df = df.head(len(df)-1)

# Print the new dataset
df
Out[53]:
Date Open High Low Close Adj Close Volume
0 2020-06-01 224.589996 232.440002 223.500000 231.910004 231.910004 18223800
1 2020-06-02 230.940002 233.000000 226.559998 232.720001 232.720001 20919000
2 2020-06-03 232.110001 232.649994 228.529999 230.160004 230.160004 15380300
3 2020-06-04 229.559998 231.630005 224.610001 226.289993 226.289993 17041500
4 2020-06-05 226.710007 231.350006 225.309998 230.770004 230.770004 16750400
5 2020-06-08 229.029999 231.550003 227.410004 231.399994 231.399994 15466500
6 2020-06-09 231.520004 239.770004 230.410004 238.669998 238.669998 27462900
7 2020-06-10 240.960007 241.210007 235.279999 236.729996 236.729996 20720700
8 2020-06-11 229.940002 232.889999 223.550003 224.429993 224.429993 26708200
9 2020-06-12 229.899994 231.660004 224.500000 228.580002 228.580002 22071700
10 2020-06-15 225.089996 233.770004 224.800003 232.500000 232.500000 15340300
11 2020-06-16 237.139999 238.460007 233.000000 235.649994 235.649994 15236700
12 2020-06-17 235.000000 237.589996 231.729996 235.529999 235.529999 19552800
13 2020-06-18 234.990005 236.139999 232.149994 235.940002 235.940002 15782500
14 2020-06-19 237.789993 240.830002 235.550003 238.789993 238.789993 30081300
15 2020-06-22 238.559998 240.699997 236.910004 239.220001 239.220001 18917800
16 2020-06-23 241.279999 245.190002 239.860001 242.240005 242.240005 24017900
17 2020-06-24 241.199997 243.220001 232.679993 234.020004 234.020004 20834900
18 2020-06-25 234.619995 237.300003 232.740005 235.679993 235.679993 18704300
19 2020-06-26 232.639999 233.089996 215.399994 216.080002 216.080002 76343900
In [54]:
# Create empty lists to store the independent and dependent data
days = list()
adj_close_prices = list()
In [55]:
# Get the date and adjusted close price
df_days = df.loc[:, 'Date']
df_adj_close = df.loc[:, 'Adj Close']
In [56]:
# Create the independent dataset (day of the month, taken from 'YYYY-MM-DD')
for day in df_days:
    days.append( [int(day.split('-')[2])] )
    
# Create the dependent dataset
for adj_close_price in df_adj_close:
    adj_close_prices.append( float(adj_close_price) )
In [57]:
# Print the days and the adjusted close prices
print(days)
print(adj_close_prices)
[[1], [2], [3], [4], [5], [8], [9], [10], [11], [12], [15], [16], [17], [18], [19], [22], [23], [24], [25], [26]]
[231.91000400000001, 232.72000099999997, 230.16000400000001, 226.289993, 230.770004, 231.399994, 238.669998, 236.729996, 224.42999300000002, 228.580002, 232.5, 235.649994, 235.52999900000003, 235.940002, 238.789993, 239.22000099999997, 242.24000499999997, 234.020004, 235.67999300000002, 216.080002]
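
The loops above take the day of the month by splitting each date string. An equivalent sketch using pandas datetime parsing is shown below. It assumes a small hand-made frame in place of FB.csv; the three rows are copied from the table above, and the name `sample` is hypothetical.

```python
import pandas as pd

# Three rows copied from the Facebook table; stands in for the CSV file.
sample = pd.DataFrame({
    "Date": ["2020-06-01", "2020-06-02", "2020-06-08"],
    "Adj Close": [231.910004, 232.720001, 231.399994],
})

# Parse the strings as dates, then take the day-of-month component.
day_of_month = pd.to_datetime(sample["Date"]).dt.day

# scikit-learn expects a 2-D feature array: one row per sample.
days = day_of_month.to_numpy().reshape(-1, 1)
adj_close_prices = sample["Adj Close"].to_numpy()

print(days.tolist())
```

Day of the month is only a usable time index while all rows fall within a single calendar month, as they do in this dataset.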

Model Creation

In [58]:
# Create models
from sklearn.svm import SVR

# Create and train an SVR model using a linear kernel
lin_svr = SVR(kernel = 'linear', C= 1000.0)
lin_svr.fit(days, adj_close_prices)

# Create and train an SVR model using a polynomial kernel
poly_svr = SVR(kernel = 'poly', degree = 2, C= 1000.0)
poly_svr.fit(days, adj_close_prices)

# Create and train an SVR model using an RBF kernel
rbf_svr = SVR(kernel = 'rbf', gamma = 0.15, C= 1000.0)
rbf_svr.fit(days, adj_close_prices)

# Fitting Decision Tree to the dataset
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
tree = tree.fit(days, adj_close_prices)

# Fitting Random Forest Regression to the dataset
# (squared error is the default criterion; the 'mse' alias was removed in newer scikit-learn releases)
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 20, random_state = 0)
forest.fit(days, adj_close_prices)
Out[58]:
RandomForestRegressor(n_estimators=20, random_state=0)

Model Visualization

In [59]:
# Plot the model on the graph to see which has the best fit on the original data
plt.figure(figsize=(16,8))
plt.scatter(days, adj_close_prices, color = 'red', label = 'data')
plt.plot(days, rbf_svr.predict(days), color = 'green', label = 'RBF Model')
plt.plot(days, poly_svr.predict(days), color = 'orange', label = 'Polynomial Model')
plt.plot(days, lin_svr.predict(days), color = 'blue', label = 'Linear  Model')
plt.legend()
plt.show()
In [60]:
# Plot the model on the graph to see which has the best fit on the original data
plt.figure(figsize=(16,8))
plt.scatter(days, adj_close_prices, color = 'red', label = 'data')
plt.plot(days, tree.predict(days), color = 'green', label = 'Decision Tree Model')
plt.plot(days, forest.predict(days), color = 'orange', label = 'Random Forest Model')
plt.legend()
plt.show()

Model Evaluation

In [61]:
# Show the predicted price for the given day
day = [[30]]

print('The Linear SVR predicted:', lin_svr.predict(day))
print('The Polynomial SVR predicted:', poly_svr.predict(day))
print('The RBF SVR predicted:', rbf_svr.predict(day))
print('The Decision Tree predicted:', tree.predict(day))
print('The Random Forest predicted:', forest.predict(day))
The Linear SVR predicted: [240.23070864]
The Polynomial SVR predicted: [237.51221858]
The RBF SVR predicted: [213.69266326]
The Decision Tree predicted: [216.080002]
The Random Forest predicted: [224.89600075]
In [62]:
# Print the actual price from the held-out row (index 20, dated 2020-06-29)
print('The actual price:', actual_price['Adj Close'][20])
The actual price: 220.63999900000002

Conclusion

The Decision Tree and Random Forest models came closest to the actual price. Machine learning can offer rough guidance on how much a stock may go up or down, but it cannot guarantee high accuracy or precision. Many factors need to be considered when predicting stock market prices.
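
The ranking can be checked with simple arithmetic on the values printed above. The sketch below hard-codes the five predictions and the actual price from the notebook output and computes each absolute error; the dictionary names are illustrative.

```python
# Values copied from the notebook output above.
actual = 220.639999
predicted = {
    "Linear SVR": 240.23070864,
    "Polynomial SVR": 237.51221858,
    "RBF SVR": 213.69266326,
    "Decision Tree": 216.080002,
    "Random Forest": 224.89600075,
}

# Absolute error of each model's single-day prediction.
errors = {name: abs(value - actual) for name, value in predicted.items()}

# Print the models from smallest to largest error.
for name, err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name:15s} {err:6.2f}")

closest = min(errors, key=errors.get)
print("Closest model:", closest)
```

On these numbers the Random Forest error is about 4.26 and the Decision Tree error about 4.56, while the linear and polynomial SVR predictions are off by roughly 17 to 20 dollars.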