
Coding a Neural Network from scratch


Overview

In this post, I’ll demonstrate how to build a neural network from scratch without relying on popular ML frameworks like PyTorch, TensorFlow, or Keras. Instead, I’ll use Python libraries such as numpy, pandas, and matplotlib to develop a model that classifies handwritten digits.

Why implement a Neural Net from scratch?

Plenty of ML frameworks offer out-of-the-box functionality for building neural networks. However, implementing one from scratch is a valuable exercise. It helps you understand how neural networks work under the hood, which is essential for designing effective models.

For a deep dive into neural networks, check out 3Blue1Brown’s series. Here, I’ll focus on the practical implementation.

Architecture

The model will consist of an input layer of 784 neurons (one per pixel of a 28×28 image), two hidden layers with 16 neurons each, and an output layer with 10 neurons, one per digit class. This is a very simple configuration by modern standards.
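For a sense of scale, the parameter count follows directly from these layer sizes (my arithmetic, not stated in the post):

$$(784 \times 16 + 16) + (16 \times 16 + 16) + (16 \times 10 + 10) = 13{,}002 \text{ trainable parameters.}$$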

Both hidden layers use the sigmoid activation function. The output of the final layer is passed through a softmax. The cost function is categorical cross-entropy.

The model uses the batch gradient descent algorithm to find a minimum of the cost function.
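For reference, these functions can be written out explicitly (a standard formulation; the notation below is mine, not taken from the original post):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L(y, \hat{y}) = -\sum_k y_k \log \hat{y}_k$$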

Implementation

Setup

As mentioned above, I will import the required Python libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Then, I will split the data into train and test sets, taking m = 30000 as the number of training examples.

data = pd.read_csv('/data.csv')

m = 30000            # number of training examples
train = data[:m]
test = data[m:]

X = train.drop(columns=['label']).transpose()   # inputs: one column per training example
Y = train['label']                               # labels

X_t = test.drop(columns=['label']).transpose()  # test inputs
Y_t = test['label']                              # test labels
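As a quick sanity check (my addition; it assumes data.csv has one row per image, with a label column plus 784 pixel columns), the shapes should come out as follows:

print(X.shape)               # (784, 30000): one column per training example
print(Y.shape)               # (30000,)
print(X_t.shape, Y_t.shape)  # the remaining rows form the test set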

Now, I will one-hot encode the labels:

Y_one = np.zeros((m, 10))   # one row per example, one column per digit class

for i in range(m):
    Y_one[i][Y[i]] = 1      # mark the column matching the label

Y_one = Y_one.T             # transpose so that each column is a one-hot label
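As an aside, the same (10, m) one-hot matrix can be built without the loop using numpy's fancy indexing (an equivalent alternative, not what the post uses):

Y_one = np.zeros((10, m))
Y_one[Y.to_numpy(), np.arange(m)] = 1  # row = label, column = example index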

Next, I initialise the weights and biases. np.random.rand() generates values uniformly in [0, 1), so I subtract 0.5 to centre the initial parameters around zero, which helps training.

W1 = np.random.rand(16, 784) - 0.5  # 16x784 weight matrix
B1 = np.random.rand(16) - 0.5       # bias vector of length 16

W2 = np.random.rand(16, 16) - 0.5   # 16x16 weight matrix
B2 = np.random.rand(16) - 0.5       # bias vector of length 16

W3 = np.random.rand(10, 16) - 0.5   # 10x16 weight matrix
B3 = np.random.rand(10) - 0.5       # bias vector of length 10
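A common refinement, not used here, is to also scale the initial weights by 1/sqrt(fan_in) so that the early pre-activations stay small enough for the sigmoids not to saturate immediately. For the first layer that would look like:

W1 = (np.random.rand(16, 784) - 0.5) / np.sqrt(784)  # alternative, scaled initialisation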

At first, these parameters are just random numbers and produce garbage predictions. As the model learns, it tunes them to values that yield good results.

Training

Before going into the training process, I will discuss each part separately.

Forward Prop

I will feed the training examples to the input layer, multiply them by the weights and add the bias values. This output then becomes the input to the first hidden layer, and the process continues through to the final output layer.

Z1 = np.dot(W1, X[i]) + B1
A1 = 1/(1+np.exp(-Z1+1e-5)) # sigmoid

Z2 = np.dot(W2, A1) + B2
A2 = 1/(1+np.exp(-Z2+1e-5)) # sigmoid

Z3 = np.dot(W3, A2) + B3
y = np.exp(Z3+1e-5)
y /= sum(y) # softmax

# A small value of 1e-5 is added to keep the exponential function from vanishing
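A common way to make the softmax step more robust, as an alternative to the version above (and not what the post uses), is to subtract the maximum logit before exponentiating; the ratios are unchanged, but large values of Z3 can no longer overflow:

def softmax(z):
    e = np.exp(z - np.max(z))  # shifting by the max leaves the ratios unchanged
    return e / e.sum()

y = softmax(Z3)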

Back Prop

By applying the chain rule of calculus, I will compute the partial derivatives of the cost with respect to each layer's weights and biases. Written element by element this would be tedious, but numpy's vector operations keep the backpropagation code short.
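Concretely, the per-example gradients that the code below accumulates are the standard softmax-with-cross-entropy and sigmoid derivatives (the notation is mine):

$$\frac{\partial L}{\partial Z_3} = y_{\text{pred}} - y_{\text{true}}, \qquad \frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial Z_3} A_2^{\top}, \qquad \frac{\partial L}{\partial B_3} = \frac{\partial L}{\partial Z_3}$$

$$\frac{\partial L}{\partial A_2} = W_3^{\top} \frac{\partial L}{\partial Z_3}, \qquad \frac{\partial L}{\partial Z_2} = A_2 \odot (1 - A_2) \odot \frac{\partial L}{\partial A_2}$$

with the same pattern repeated for the first hidden layer.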

dz3 = y - Y_one.T[i]        # output-layer error (softmax with cross-entropy)
dw3 += np.outer(dz3, A2)
db3 += dz3

da2 = np.dot(W3.T, dz3)     # propagate the error back through W3
dz2 = A2*(1 - A2)*da2       # multiply by the sigmoid derivative
dw2 += np.outer(dz2, A1)
db2 += dz2

da1 = np.dot(W2.T, dz2)     # propagate the error back through W2
dz1 = A1*(1 - A1)*da1       # multiply by the sigmoid derivative
dw1 += np.outer(dz1, X[i])
db1 += dz1

I accumulate these gradient contributions across all training examples and apply the update to the weights and biases once at the end of each epoch.

Update Params

I’ll adjust the parameters once I’ve gone through all the training examples.
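In equation form, each parameter moves against its averaged gradient; for the third layer (my notation):

$$W_3 \leftarrow W_3 - \frac{\alpha}{m}\, dW_3, \qquad B_3 \leftarrow B_3 - \frac{\alpha}{m}\, dB_3$$

where alpha is the learning rate and m is the number of training examples.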

W3 -= alpha*dw3/m
B3 -= alpha*db3/m
W2 -= alpha*dw2/m
B2 -= alpha*db2/m
W1 -= alpha*dw1/m
B1 -= alpha*db1/m

I will repeat this process for multiple iterations (epochs) so that the cost is driven toward a minimum.

Putting it all together

epoch = 500   # number of passes over the training set
alpha = 0.8   # learning rate

for run in range(epoch):

    dw1 = np.zeros((16, 784))
    db1 = np.zeros(16)
    dw2 = np.zeros((16, 16))
    db2 = np.zeros(16)
    dw3 = np.zeros((10, 16))
    db3 = np.zeros(10)

    for i in range(m):
        # Forward prop
        Z1 = np.dot(W1, X[i]) + B1
        A1 = 1/(1+np.exp(-Z1+1e-5))

        Z2 = np.dot(W2, A1) + B2
        A2 = 1/(1+np.exp(-Z2+1e-5))

        Z3 = np.dot(W3, A2) + B3
        y = np.exp(Z3+1e-5)
        y /= sum(y)

        # Back prop
        dz3 = y - Y_one.T[i]
        dw3 += np.outer(dz3, A2)
        db3 += dz3
        da2 = np.dot(W3.T, dz3)
        dz2 = A2*(1 - A2)*da2
        dw2 += np.outer(dz2, A1)
        db2 += dz2
        da1 = np.dot(W2.T, dz2)
        dz1 = A1*(1 - A1)*da1
        dw1 += np.outer(dz1, X[i])
        db1 += dz1

    # Update params
    W3 -= alpha*dw3/m
    B3 -= alpha*db3/m
    W2 -= alpha*dw2/m
    B2 -= alpha*db2/m
    W1 -= alpha*dw1/m
    B1 -= alpha*db1/m

Results

I obtained the following results after training the model with epoch = 500 and a learning rate of alpha = 0.8.

Training accuracy: 87.22 %
Test accuracy: 85.65 %

The model generalises well to the test data. The cost-vs-epoch plot shows that the cost function converged.
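The post does not show the evaluation code; below is a minimal sketch of how the test accuracy could be computed with the trained parameters, using the X_t and Y_t test split defined earlier (column i of X_t holds test example i):

X_arr = X_t.to_numpy()  # shape (784, number of test examples)
Y_arr = Y_t.to_numpy()
correct = 0

for i in range(X_arr.shape[1]):
    a1 = 1/(1 + np.exp(-(np.dot(W1, X_arr[:, i]) + B1)))
    a2 = 1/(1 + np.exp(-(np.dot(W2, a1) + B2)))
    z3 = np.dot(W3, a2) + B3
    correct += int(np.argmax(z3) == Y_arr[i])  # softmax is monotonic, so argmax of z3 is the predicted digit

print('Test accuracy:', 100 * correct / X_arr.shape[1], '%')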


I executed the code in Google Colab. The link to that notebook is here.