Using logistic regression to classify images

In this blog post I show how to use logistic regression to classify images. Logistic regression is a statistical method for binary classification, i.e. for analyzing the dependency of a binary outcome on one or more independent variables.

In a previous blog post I described linear regression. If you’re not familiar with linear regression read that post first.

Let’s briefly recap:

Linear Regression

Linear regression is a statistical method for predicting the value of a continuous dependent variable based on one or several explanatory variables. With linear regression you can, e.g., try to predict housing prices based on the size of the house, its number of bedrooms, the year it was built, etc.

Given an ideally large amount of sample data sets, linear regression tries to find a function to map input variables to a predicted output. This function is called our hypothesis and can be written as:

$$ h_\theta (x) = \Theta^T_x $$

where x is our input data, and Theta are parameters that are randomly set in the beginning and then gradually adjusted based on the difference of the respective h with the actual target value provided with the data sets. This optimization algorithm was called gradient descent.

Logistic Regression

Logistic regression is strictly speaking not a regression. It’s a classification algorithm. But since it works very similar to linear regression, it’s referred to as logistic regression.

Discrete vs Continuous Output

For image classification problems such as MNIST the target variable is binary not continuous. We’re interested in predicting whether or not a given image is e.g. a “5”. Using a continuous target variable doesn’t make sense, since you can’t argue that a “6” is almost a “7”. They’re totally separate predictions.

The method, though, is very similar. Input pixels are still linearly combined, i.e. multiplied by individual parameters and then summed, to derive our hypothesis of the target variable which is either 1 (the image is a “5”) or 0 (the image is not a “5”).

But since we only want target values to be either 0 or 1, we squash all output values into an interval of [0..1]. This is achieved by passing the sum of the weighted inputs through the “logistic” (or Sigmoid) function. The mathematical stipulation of our hypothesis function hence becomes:

$$ h_\theta (x) = g(\Theta^T_x) $$

whereas

$$ g(z) = \frac{1}{1 + e^{-z}} $$

Maximum Likelihood Cost Function

The second difference of logistic regression to linear regression is the particular cost function that is used.

For classification we cannot use the squared error cost function because the logistic function causes its output to be non-convex, i.e. the function may have many local minima instead of a single global minimum.

Instead, we use a different cost function called: Maximum Likelihood, which ensures J(0) is convex:

$$ J(\Theta) = - \frac{1}{m} \sum_{i=1}^m [ \ y^{(i)} \text{ log } h_\Theta (x^{(i)}) + ( 1 - y^{(i)} ) \text{ log } ( 1 - h_\Theta (x^{(i)} ) ) \ ] $$

If you’ve read my previous post on linear regression and you tried to actually implement this in practice, you noticed that the cost function is only needed conceptionally.

In practice, what we need to optimize our parameters and thereby optimize our hypothesis is the derivative of this cost function to compute the gradient. And fortunately, the derivate and the calculation of the gradient of the above cost function are the same as for linear regression.

Logistic Regression and Neural Networks

Logistic regression and neural networks are closely related. Basically, we can think of logistic regression as a simple 1-layer neural network. That’s why I consider logistic regression a great starting point for understanding deep learning and the inner workings of neural networks.

Converting the math into code

That’s all that we need. Now let’s dive into the code and see how this works in practice:

First, we load the images and their respective labels via 4 MNIST helper functions:

1double  *train_images = read_MNIST_training_images();
2uint8_t *train_labels = read_MNIST_training_labels();
3
4double  *test_images  = read_MNIST_testing_images();
5uint8_t *test_labels  = read_MNIST_testing_labels();

The functions read MNIST training or testing images from the respective data files. Images are read as grey scale values 0..255 and automatically normalized to a range of [0..1] by dividing the original pixel value by 255. The respective variable is therefore a pointer of type ‘double’ instead of type int.

Add the Bias

The image pixels have been read sequentially into memory effectively creating a 60,000 * 784 matrix of [0..1] doubles. Each row of this matrix corresponds to a single image with 784 pixels.

When we design our hypothesis function, we specifically want to consider the case where all independent variables are 0. Hence we need to add an additional parameter

$$ \Theta $$

whose corresponding x equals 1. Therefore, we want to add a 785th column into this matrix with all values set to 1. This is called the bias.

1double *X      = add_bias(train_images, MNIST_MAX_TRAINING_IMAGES);
2double *X_test = add_bias(test_images,  MNIST_MAX_TESTING_IMAGES);

In practice, you normally add the bias as the 1st column not as the last. The following function does exactly that, adding a column of 1s to the left of the matrix.

 1double *add_bias(double *imgs, const int img_count){
 2
 3    const long num_features = MNIST_IMG_PIXELS + 1;
 4
 5    double *x = malloc(num_features * img_count * sizeof(double));
 6
 7    for (int i=0; i<img_count; i++){
 8
 9        double *dest = x    + i*num_features;
10        double *src  = imgs + i*MNIST_IMG_PIXELS;
11
12        *dest = 1;      // add bias
13
14        memcpy(dest+1, src, MNIST_IMG_PIXELS * sizeof(double));
15    }
16
17    return x;
18}

The resulting variables X and X_test now point to [60,000 * 785] and [10,000 * 785] matrices respectively. The former points to pixels of the training images, the latter points to the pixels of the testing images.

Create Binary Labels

The optimization algorithm Gradient Descent compares the computed output from the hypothesis function with the actual, correct value from the data set.

For the MNIST data set, correct values (aka labels) are provided as numbers from 0..9. Yet, to classify the images we need our labels to be either 0 or 1.

Therefore, every label is converted into a binary vector of size 10 in which only the correct value is set to 1 and all others are set to 1. So a label of “5” will be converted into a vector [0 0 0 0 0 1 0 0 0 0] if you count from 0 to 9 or into a vector [0 0 0 0 1 0 0 0 0 0] if you count from 1 to 10.

Below function creates a matrix of such binary vectors for all labels in the training or testing data set. (I chose to deviate from the normal 0-based indexing in C and used the [1..10] representation here in order to make results easier comparable to Octave and Matlab.)

 1double *create_binary_labels(int m, uint8_t *labels){
 2
 3    double *lbls = malloc(10 * m * sizeof(double));
 4    
 5    for (int i=0; i<10 * m; i++) *(lbls+i) = 0;
 6
 7    for (int i=0; i<m; i++){
 8        double *p = lbls + i*10;
 9
10        if (labels[i]==0) *(p + 9) = 1;
11        else *(p + labels[i] - 1) = 1;
12
13    }
14
15    return lbls;
16}

Classifying Images

Classifying the images is done by optimizing the hypothesis function via gradient descent.

In particular, there are 3 steps that need to be repeated over and over to slowly adjust randomly initialized parameters to values that correctly predict an image.

Compute Hypothesis

First, we compute the hypothesis based on the current parameters using a simple matrix multiplication:

1multiply_matrices(X, m, n, theta, n, 10, h);

Obviously, there are different ways to implement this matrix multiplication function. For now, without much consideration of performance, a simple loop-based implementation could look like this:

 1void multiply_matrices(double *matrix1, long row1, long col1, double *matrix2, long row2, long col2, double *result){
 2    
 3    assert(row1>0);
 4    assert(col1>0);
 5    assert(col1==row2);
 6    assert(col2>0);
 7    
 8    long num_rows = row1;
 9    long num_cols = col2;
10    long num_x    = col1;
11    
12    for (long i=0; i<row1*col2; i++) *(result+i)=0;
13    
14    for (long c=0; c<num_cols; c++){
15        for (long r=0; r<num_rows; r++){
16            for (long x=0; x<num_x; x++){
17                
18                *(result + (r*num_cols) + c) += *(matrix1 + (r*num_x)+x) * *(matrix2 + (x*num_cols)+c );
19                
20            }
21        }
22    }
23    
24}

A more efficient implementation using the cblas library would be:

1cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, 10, n, 1, X, n, theta, 10, 0, h, 10);

Remember from above that we now still need to pass the hypothesis values through the logistic function. We’ll create a simple function called ‘sigmoid’

1sigmoid(h, m*10);

which simply loops through all output values and applies the logistic function:

 1void sigmoid(double *outputs, long num_outputs){
 2    
 3    for (int i=0; i<num_outputs; i++){
 4        
 5        double *val = outputs + i;
 6        
 7        *val = 1 / (1 + (exp((double)-*val)) );
 8        
 9    }
10    
11}

Compute the Gradients

To compute the gradients we need to compute the derivative of the cost function, which is the difference of the hypothesis minus the actual label, multiplied by the transposed inputs:

1// compute the difference of hypothesis and label
2for (int i=0; i<10*m;i++) *(diff+i) = *(h+i) - *(y+i);
3
4// transpose the input matrix X
5transpose_matrix(X, m, n, X_transposed);
6
7// multiply transposed input with the difference of y-h
8multiply_matrices(X_transposed, n, m, diff, m, 10, grad);

Adjust Parameters

Then we update our parameters by subtracting the gradient multiplied by a learning rate:

1for (int i=0; i<10*n;i++) *(theta+i) -= ((double)learning_rate/m) * *(grad+i);

Assess Accuracy

Now that we’ve successfully updated the parameters we want to validate the parameters and measure the accuracy of the current model. To do so, we simply follow the same steps that we did above (for training), except that in the end we do not update the parameters. Instead, we simply infer the predicted classifications of the model for each test image.

All predictions will be values between 0-1. Since a correct label was represented by a 1 and incorrect lables by 0, we know the highest value (closest to 1) is the predicted value of the model.

We use a simple helper function get_maximum which simply implements an argmax function:

 1uint8_t get_maximum(double *dbls, int count){
 2    
 3    double max = *dbls;
 4    uint8_t max_pos = 1;
 5    
 6    
 7    for (int i=0; i<count;i++){
 8        if (*(dbls+i) > max) {
 9            max     = *(dbls+i);
10            max_pos = i+1;
11        }
12    }
13    
14    if (max_pos==10) max_pos = 0;
15    
16    return max_pos;
17}

And we use the variable correct_count to track how many images were classified correctly.

 1multiply_matrices(X_test, m_test, n, theta, n, 10, h_test);
 2
 3sigmoid(h_test, m_test*10);
 4        
 5int correct_count = 0;
 6
 7for (int i=0; i<m_test; i++){
 8    *(pred+i) = get_maximum(h_test+(i*10),10);
 9
10    if (*(pred+i) == *(test_labels+i)) correct_count++;
11}

To ouput the accuracy in percent we divide by the number of images in the test set:

1double accuracy = correct_count / (double)m_test;

That’s it! If you run above optimization process for multiple ‘epochs’ accuracy gradually increases.

In my testing I get to an accuracy above 80% after only 18 epochs, and to above 90% after about 180 epochs.

While 90% on MNIST is certainly not a spectecular result, nevertheless it’s surprisingly good for a simple tool like logistic regression.

Thanks for reading! Feel free to contact me for any questions or comments.

Happy hacking!