Overview#
In this post, I’ll demonstrate how to build a neural network from scratch without relying on popular ML frameworks like PyTorch, TensorFlow, or Keras. Instead, I’ll use Python libraries such as numpy, pandas, and matplotlib to develop a model that classifies handwritten digits.
Why implement a Neural Net from scratch?#
Plenty of ML frameworks offer out-of-the-box functionality for building neural networks. However, implementing one from scratch is a valuable exercise. It helps you understand how neural networks work under the hood, which is essential for designing effective models.
Architecture#
The model will consist of an input layer of 784 neurons (one per input pixel), two hidden layers with 16 neurons each, and an output layer with 10 neurons (one per digit class). This is a very simple configuration by modern standards.
Both hidden layers use sigmoid as the activation function. The final layer goes through a softmax function. The cost function is categorical cross entropy.
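For reference, here is a minimal numpy sketch of these three pieces. The helper names sigmoid, softmax, and cross_entropy are my own; the training code later inlines these computations instead of calling functions.
import numpy as np

def sigmoid(z):
    # squashes each value into (0, 1)
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # subtracting the max leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y_pred, y_true_one_hot):
    # categorical cross entropy for a single example
    return -np.sum(y_true_one_hot * np.log(y_pred + 1e-9))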
The model uses the batch gradient descent algorithm to find the minimum of the cost function.
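In batch gradient descent, every parameter θ (any weight matrix or bias vector) is moved a small step against the gradient of the cost averaged over all m training examples, with learning rate α and per-example loss $\mathcal{L}^{(i)}$:

$$\theta \;\leftarrow\; \theta \;-\; \frac{\alpha}{m} \sum_{i=1}^{m} \nabla_{\theta}\, \mathcal{L}^{(i)}$$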
Implementation#
Setup#
As mentioned above, I will import the required Python libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Then, I will split the data into train and test sets, taking m = 30000 as the number of training examples.
data = pd.read_csv('/data.csv')
m = 30000                                        # number of training examples
train = data[:m]
test = data[m:]
X = train.drop(columns=['label']).transpose()    # training inputs, one column per example
Y = train['label']                               # training labels
X_t = test.drop(columns=['label']).transpose()   # test inputs
Y_t = test['label']                              # test labels
Now, I will one-hot encode the labels.
Y_one = np.zeros((m, 10))
for i in range(m):
    Y_one[i][Y[i]] = 1
Y_one = Y_one.T
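Equivalently, the loop can be replaced by a single vectorised numpy assignment. This is just a sketch of an alternative, assuming Y holds integer labels from 0 to 9.
Y_one = np.zeros((m, 10))
Y_one[np.arange(m), Y.to_numpy()] = 1   # set one entry per row using fancy indexing
Y_one = Y_one.T                         # shape (10, m), one column per example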
Next, I will initialise the weights and biases. np.random.rand() generates random values in [0, 1), so I’ll subtract 0.5 to centre them around zero for better performance.
W1 = np.random.rand(16, 784) - 0.5 # 16x784 weight matrix
B1 = np.random.rand(16) - 0.5      # bias vector of length 16
W2 = np.random.rand(16, 16) - 0.5  # 16x16 weight matrix
B2 = np.random.rand(16) - 0.5      # bias vector of length 16
W3 = np.random.rand(10, 16) - 0.5  # 10x16 weight matrix
B3 = np.random.rand(10) - 0.5      # bias vector of length 10
At first, these parameters are just random numbers, so the model’s predictions are garbage. As training progresses, the model tunes them towards values that produce good results.
Training#
Before going into the training process, I will discuss each part separately.
Forward Prop#
I will feed each training example to the input layer. The first hidden layer multiplies these inputs by its weights, adds its biases, and applies its activation function; its output then becomes the input to the next hidden layer, and the process continues through to the output layer.
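In equation form, with x the input vector, σ the sigmoid, and W^[l], b^[l] the weights and biases of layer l (W1/B1, W2/B2, W3/B3 in the code), the forward pass is:

$$z^{[1]} = W^{[1]}x + b^{[1]}, \qquad a^{[1]} = \sigma(z^{[1]})$$
$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}, \qquad a^{[2]} = \sigma(z^{[2]})$$
$$z^{[3]} = W^{[3]}a^{[2]} + b^{[3]}, \qquad \hat{y} = \mathrm{softmax}(z^{[3]})$$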
Z1 = np.dot(W1, X[i]) + B1
A1 = 1/(1+np.exp(-Z1+1e-5)) # sigmoid
Z2 = np.dot(W2, A1) + B2
A2 = 1/(1+np.exp(-Z2+1e-5)) # sigmoid
Z3 = np.dot(W3, A2) + B3
y = np.exp(Z3+1e-5)
y /= sum(y) # softmax
# I added a small value of 10^-5 to prevent the exponent function from vanishing
Back Prop#
By applying the chain rule of calculus, I will compute the partial derivatives for each node in the layers. However, numpy makes implementing backpropagation much simpler.
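Concretely, for a softmax output with categorical cross entropy the output-layer error reduces to ŷ − y, and each earlier layer’s error follows from the chain rule together with the sigmoid derivative σ′(z) = a(1 − a). These are the quantities the code below accumulates:

$$\delta^{[3]} = \hat{y} - y, \qquad \frac{\partial J}{\partial W^{[3]}} = \delta^{[3]}\,(a^{[2]})^{T}, \qquad \frac{\partial J}{\partial b^{[3]}} = \delta^{[3]}$$
$$\delta^{[2]} = (W^{[3]})^{T}\delta^{[3]} \odot a^{[2]}(1 - a^{[2]}), \qquad \delta^{[1]} = (W^{[2]})^{T}\delta^{[2]} \odot a^{[1]}(1 - a^{[1]})$$
$$\frac{\partial J}{\partial W^{[2]}} = \delta^{[2]}\,(a^{[1]})^{T}, \qquad \frac{\partial J}{\partial W^{[1]}} = \delta^{[1]}\,x^{T}, \qquad \frac{\partial J}{\partial b^{[l]}} = \delta^{[l]}$$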
dz3 = y - Y_one.T[i]       # output-layer error (softmax + cross entropy)
dw3 += np.outer(dz3, A2)   # accumulate gradient for W3
db3 += dz3
da2 = np.dot(W3.T, dz3)    # propagate the error back through W3
dz2 = A2*(1 - A2)*da2      # apply the sigmoid derivative
dw2 += np.outer(dz2, A1)
db2 += dz2
da1 = np.dot(W2.T, dz2)    # propagate the error back through W2
dz1 = A1*(1 - A1)*da1
dw1 += np.outer(dz1, X[i])
db1 += dz1
I accumulate these gradients for the weights and biases across all training examples and apply the update once at the end of each epoch.
Update Params#
I’ll adjust the parameters once I’ve gone through all the training examples.
W3 -= alpha*dw3/m
B3 -= alpha*db3/m
W2 -= alpha*dw2/m
B2 -= alpha*db2/m
W1 -= alpha*dw1/m
B1 -= alpha*db1/m
I will repeat this process over multiple iterations (or epochs) so that the cost approaches its minimum.
Putting it all together#
epoch = 500   # number of passes over the training set
alpha = 0.8   # learning rate

for run in range(epoch):
    dw1 = np.zeros((16, 784))
    db1 = np.zeros(16)
    dw2 = np.zeros((16, 16))
    db2 = np.zeros(16)
    dw3 = np.zeros((10, 16))
    db3 = np.zeros(10)
    for i in range(m):
        # Forward prop
        Z1 = np.dot(W1, X[i]) + B1
        A1 = 1/(1+np.exp(-Z1+1e-5))
        Z2 = np.dot(W2, A1) + B2
        A2 = 1/(1+np.exp(-Z2+1e-5))
        Z3 = np.dot(W3, A2) + B3
        y = np.exp(Z3+1e-5)
        y /= sum(y)
        # Back prop
        dz3 = y - Y_one.T[i]
        dw3 += np.outer(dz3, A2)
        db3 += dz3
        da2 = np.dot(W3.T, dz3)
        dz2 = A2*(1 - A2)*da2
        dw2 += np.outer(dz2, A1)
        db2 += dz2
        da1 = np.dot(W2.T, dz2)
        dz1 = A1*(1 - A1)*da1
        dw1 += np.outer(dz1, X[i])
        db1 += dz1
    # Update params at the end of the epoch
    W3 -= alpha*dw3/m
    B3 -= alpha*db3/m
    W2 -= alpha*dw2/m
    B2 -= alpha*db2/m
    W1 -= alpha*dw1/m
    B1 -= alpha*db1/m
Results#
I obtained the following results after training the model for epoch = 500 with a learning rate alpha = 0.8.
Training accuracy: 87.22 %
Test accuracy: 85.65 %
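The evaluation code is not shown above, but accuracy can be measured by running the forward pass with the trained parameters and comparing the index of the largest output with the true label. A rough sketch, assuming the X_t and Y_t test split from earlier (the predict helper is my own name, not part of the code above):
def predict(x):
    # forward pass with the trained parameters
    a1 = 1/(1 + np.exp(-(np.dot(W1, x) + B1)))
    a2 = 1/(1 + np.exp(-(np.dot(W2, a1) + B2)))
    z3 = np.dot(W3, a2) + B3
    return np.argmax(z3)  # softmax preserves the argmax, so it can be skipped here

correct = sum(predict(X_t[i]) == Y_t[i] for i in X_t.columns)
print('Test accuracy:', 100 * correct / len(X_t.columns), '%')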
The model generalises well to the test data. The cost v/s epoch plot shows that the cost function converged to a stable value.
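The plot is not reproduced here, but it can be generated with matplotlib.pyplot, assuming a costs list (my own addition, not in the code above) that records the average cross-entropy over the training set after each epoch:
import matplotlib.pyplot as plt

# 'costs' is assumed to be filled inside the training loop, e.g. by
# accumulating -np.log(y[Y[i]] + 1e-9) over the inner loop and dividing by m
plt.plot(costs)
plt.xlabel('epoch')
plt.ylabel('cost')
plt.title('cost v/s epoch')
plt.show()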
I executed the code in Google Colab. The link to that notebook is here.