Using Logistic Regression to solve MNIST

In this blog post I show how to use logistic regression to classify images. Logistic regression can be regarded as the simplest form of a feed forward neural network (in the sense of “perceptron”). It’s therefore a good starting point for understanding the inner works of neural networks.


In a previous blog post I described linear regression. If you’re not familiar with linear regression read that post first. Let’s briefly recap:

Linear regression is a mathematical method for predicting the value of a continuous dependent variable, e.g. the price of a stock or of houses, based on one or several explanatory variables. To predict e.g. a house price based on its size, number of bedrooms, year built, etc.

Logistic regression works very similar to linear regression. Except that for classification problems such as image classification of MNIST, the target variable is binary not continuous. We’re interested in predicting whether or not a given image is a certain number, e.g. a “5”.

Similar to linear regression, the input pixels of the image are linearly combined (multiplied by indidividual weights and then summed) to derive an estimate, our hypothesis, of the target variable which is either 1 (the image is a “5”) or 0 (the image is not a “5”).

To squash all output values into an interval of [0..1] the sum of the weighted inputs is passed through the “logistic” (or Sigmoid) function. So our hyposthesis function changes from



The second difference is the cost function. For classication we cannot use the squared error cost function because the logistic function causes the output to be non-convex, i.e. have many local minima. In practice, the difference in the cost function does actually not matter because all we need is to calculate the derivative of this cost function to derive the gradient. The derivate and the calculation of the gradient is the same as for linear regression:

Mathematically, that’s all that we need. Now let’s dive into the code and see how this works in practice.

First, we load the images and their respective labels.

double  *train_images = read_MNIST_training_images();
uint8_t *train_labels = read_MNIST_training_labels();

double  *test_images  = read_MNIST_testing_images();
uint8_t *test_labels  = read_MNIST_testing_labels();

The helper function ‘read_MNIST_training_images’ also normalizes all inputs between 0 and 1 by dividing the original greyscale value by 255.

Then we add the ‘bias’ into the input matrix.

double *X      = add_bias(train_images, MNIST_MAX_TRAINING_IMAGES);
double *X_test = add_bias(test_images,  MNIST_MAX_TESTING_IMAGES);

Adding the bias means adding a column of 1s to the left of the matrix.

double *add_bias(double *imgs, const int img_count){

    const long num_features = MNIST_IMG_PIXELS + 1;

    double *x = malloc(num_features * img_count * sizeof(double));

    for (int i=0; i<img_count; i++){

        double *dest = x    + i*num_features;
        double *src  = imgs + i*MNIST_IMG_PIXELS;

        *dest = 1;      // add bias

        memcpy(dest+1, src, MNIST_IMG_PIXELS * sizeof(double));

    return x;

We also need to convert the MNIST labels into a binary matrix of 0s and 1s.

double *create_binary_labels(int m, uint8_t *labels){

    double *lbls = malloc(10 * m * sizeof(double));
    for (int i=0; i<10 * m; i++) *(lbls+i) = 0;

    for (int i=0; i<m; i++){
        double *p = lbls + i*10;

        if (labels[i]==0) *(p + 9) = 1;
        else *(p + labels[i] - 1) = 1;


    return lbls;

Written on December 26, 2017