Case Study: Apple Stock Prices Predicton¶

Introduction¶

This is an another attempt to predict the future stocks of a certain company. In this project, an artificial recurrent neural network called Long Short Term Memory (LSTM) will be now use to predict the closing stock price of a corporation (Apple Inc.).

The project goes through following steps: Data generation, Data pre-processing and feature selection, Baseline prediction, LSTM training and demonstration of results.

Problem:

Predict the stock price at a certain date

Dataset:

Historical data of APL Inc

Source: Yahoo Finance

Libaries and Data Importation¶

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For calculation and web scrape
import math
import pandas_datareader as web

# For preprocessing and model building
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Style settings
sns.set()
plt.style.use('fivethirtyeight')

# Get stack quote
df = web.DataReader('AAPL', data_source = 'yahoo', start = '2012-01-01', end = '2020-07-13')

Data Exploration¶

# Show the data
df.head()

# Check data dimension
df.shape

(2145, 6)

# Visualize the closing price
plt.figure(figsize = (16, 8))
plt.title('Close Price History')
plt.plot(df['Close'])
plt.xlabel('Date', fontsize = 18)
plt.ylabel('Close Price USD', fontsize = 18)
plt.show()

Data Pre-processing¶

# Create new dataframe with the 'Close' column
data = df.filter(['Close'])

# Convert dataframe into a numpy array
dataset = data.values

# Get the number of rows to train the model on # math.ceil -> Round up
training_data_len = math.ceil(len(dataset)* 0.8)

# Check the data length
training_data_len

1716

# Scale the data
scaler = MinMaxScaler(feature_range = (0,1))

# Fit and transform the data
scaled_data = scaler.fit_transform(dataset)

# Check the data
scaled_data;

# Create the training dataset
# Create the scaled training dataset
train_data = scaled_data[0: training_data_len, :] # Get all the data from 0 to training_data_len, all column

# Split the data
X_train = []
y_train = []

for i in range (60, len(train_data)):
    X_train.append(train_data[i - 60:i, 0])
    y_train.append(train_data[i, 0])
    if i <=61:
        print(X_train) # Past 60 days
        print(y_train) # target label
        print()

[array([0.0090187 , 0.00998157, 0.01198137, 0.01388532, 0.01359341,
       0.01425129, 0.01395067, 0.01344527, 0.01275689, 0.01488739,
       0.01680877, 0.01621624, 0.01297038, 0.0160681 , 0.0130183 ,
       0.02445506, 0.02357062, 0.02472519, 0.02722167, 0.02873349,
       0.02860715, 0.02814097, 0.0301277 , 0.03199678, 0.03411423,
       0.03753437, 0.04471883, 0.04482775, 0.04882736, 0.05181615,
       0.04667941, 0.04865743, 0.04861822, 0.05416451, 0.05337591,
       0.05483545, 0.05745828, 0.05891785, 0.06312222, 0.0661851 ,
       0.06706952, 0.06737887, 0.06214193, 0.06087842, 0.06106578,
       0.06598902, 0.06737452, 0.07035024, 0.07736479, 0.08672334,
       0.08497187, 0.08497623, 0.09174244, 0.09385987, 0.09235239,
       0.09097564, 0.08954223, 0.09430427, 0.09757193, 0.09893998])]
[0.09555904727802514]

[array([0.0090187 , 0.00998157, 0.01198137, 0.01388532, 0.01359341,
       0.01425129, 0.01395067, 0.01344527, 0.01275689, 0.01488739,
       0.01680877, 0.01621624, 0.01297038, 0.0160681 , 0.0130183 ,
       0.02445506, 0.02357062, 0.02472519, 0.02722167, 0.02873349,
       0.02860715, 0.02814097, 0.0301277 , 0.03199678, 0.03411423,
       0.03753437, 0.04471883, 0.04482775, 0.04882736, 0.05181615,
       0.04667941, 0.04865743, 0.04861822, 0.05416451, 0.05337591,
       0.05483545, 0.05745828, 0.05891785, 0.06312222, 0.0661851 ,
       0.06706952, 0.06737887, 0.06214193, 0.06087842, 0.06106578,
       0.06598902, 0.06737452, 0.07035024, 0.07736479, 0.08672334,
       0.08497187, 0.08497623, 0.09174244, 0.09385987, 0.09235239,
       0.09097564, 0.08954223, 0.09430427, 0.09757193, 0.09893998]), array([0.00998157, 0.01198137, 0.01388532, 0.01359341, 0.01425129,
       0.01395067, 0.01344527, 0.01275689, 0.01488739, 0.01680877,
       0.01621624, 0.01297038, 0.0160681 , 0.0130183 , 0.02445506,
       0.02357062, 0.02472519, 0.02722167, 0.02873349, 0.02860715,
       0.02814097, 0.0301277 , 0.03199678, 0.03411423, 0.03753437,
       0.04471883, 0.04482775, 0.04882736, 0.05181615, 0.04667941,
       0.04865743, 0.04861822, 0.05416451, 0.05337591, 0.05483545,
       0.05745828, 0.05891785, 0.06312222, 0.0661851 , 0.06706952,
       0.06737887, 0.06214193, 0.06087842, 0.06106578, 0.06598902,
       0.06737452, 0.07035024, 0.07736479, 0.08672334, 0.08497187,
       0.08497623, 0.09174244, 0.09385987, 0.09235239, 0.09097564,
       0.08954223, 0.09430427, 0.09757193, 0.09893998, 0.09555905])]
[0.09555904727802514, 0.0910671303202952]

# Convert the X_train and y_train to numpy arrays
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape the data # LSTM needs 3 dimensional data
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_train.shape

(1656, 60, 1)

Model Creation¶

# Build the LSTM model # X_train.shape[1] = 60
model = Sequential()

# First layer
model.add(LSTM(50, return_sequences = True, input_shape = (X_train.shape[1], 1)))

# Second layer
model.add(LSTM(50, return_sequences = False))

# Third layer
model.add(Dense(25))

# Last layer
model.add(Dense(1))

# Compile the model
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Train the model
model.fit(X_train, y_train, batch_size = 1, epochs = 1)

1656/1656 [==============================] - 144s 87ms/step - loss: 3.9337e-04

<tensorflow.python.keras.callbacks.History at 0x1c2cdef1940>

Model Evaluation¶

# Create the testing dataset
# Create a new array containing scaled values from index 1656 to 2145
test_data = scaled_data[training_data_len - 60:, :]

# Create the datasets X_test and y_test
X_test = []
y_test = dataset[training_data_len:, :] # 60 first actual values, not scaled

for i in range(60, len(test_data)):
    X_test.append(test_data[i - 60:i, 0]) # Past 60 values

# Convert the data to a numpy array to be able to use in LSTM model
X_test = np.array(X_test)

# Check the shape
X_test;

# Reshape the data for LSTM model. It needs 3 dimensional data
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Get the models predicted price values
predictions = model.predict(X_test)

# Get the inverse shape of the data
predictions = scaler.inverse_transform(predictions)

# Evaluate the model
# Get the root mean squared error (RMSE)
rmse = np.sqrt(np.mean(predictions - y_test)**2)

# Show the rmse
print('rmse:', rmse)

rmse: 6.095267876044853

# Plot the data
train = data[:training_data_len]
valid = data[training_data_len:]
valid['Predictions'] = predictions

# Visualize the data
plt.figure(figsize = (18, 8))
plt.title('Model')
plt.xlabel('Date')
plt.ylabel('Close Price USD', fontsize = 18)
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
plt.legend(['Train', 'Val', 'Predictions'], loc = 'lower right')
plt.show()

<ipython-input-80-17b94435c8f5>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  valid['Predictions'] = predictions

# Show the valid (Actual price) and predicted prices
valid

Another Prediction¶

# Get stack quote
apple_quote = web.DataReader('AAPL', data_source = 'yahoo', start = '2012-01-01', end = '2020-07-13')

# Create a new dataframe 
new_df = apple_quote.filter(['Close'])

# Get the last 60 days closing price values and convert the dataframe to an array
last_60_days  = new_df[-60:].values

# Scale the data to be values between 0 and 1
last_60_days_scaled = scaler.transform(last_60_days)

# Create an empty list
X_test = []

# Append the past 60 days
X_test.append(last_60_days_scaled)

# Convert the X_test dataset to a numpy array
X_test = np.array(X_test)

# Reshape the data
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Get the predicted scaled price
pred_price = model.predict(X_test)

# Undo the scaling
pred_price = scaler.inverse_transform(pred_price)
print('Predicted price at July 14, 2020 is', np.round(pred_price))

Predicted price at July 14, 2020 is [[368.]]

# Get stack quote
apple_quote_2 = web.DataReader('AAPL', data_source = 'yahoo', start = '2020-07-14', end = '2020-07-14')

# Show the actual price
print('Actual price at July 14, 2020 is', np.round(apple_quote_2['Close'][1]))

Actual price at July 14, 2020 is 383.0

Conclusion¶

The project concludes that LSTM neural network can be good alternative for stock market time series prediction among other machine learning methods.Further testing of different parameter and architecture settings are needed to improve the accuracy.

	High	Low	Open	Close	Volume	Adj Close
Date
2012-01-03	58.928570	58.428570	58.485714	58.747143	75555200.0	50.857235
2012-01-04	59.240002	58.468571	58.571430	59.062859	65005500.0	51.130558
2012-01-05	59.792858	58.952858	59.278572	59.718571	67817400.0	51.698215
2012-01-06	60.392857	59.888573	59.967144	60.342857	79573200.0	52.238651
2012-01-09	61.107143	60.192856	60.785713	60.247143	98506100.0	52.155792

	Close	Predictions
Date
2018-10-26	216.300003	215.546387
2018-10-29	212.240005	215.108917
2018-10-30	213.300003	214.077057
2018-10-31	218.860001	213.038071
2018-11-01	222.220001	212.820068
...	...	...
2020-07-07	372.690002	354.960175
2020-07-08	381.369995	356.941803
2020-07-09	383.010010	359.823456
2020-07-10	383.679993	362.886322
2020-07-13	381.910004	365.708496