This project is just one of the many (image) classification projects I’ve worked on. The project was particularly interesting because it’s a Kaggle competition, where you can directly compare your model’s performance with that of other contestants. The competition on Kaggle is quite strong, so you really have to pull a rabbit out of your hat in order to perform well. I worked on this competition while taking Jeremy Howard’s very good course Deep Learning for Coders. The competition was also the basis of my capstone project for Udacity’s Machine Learning Engineer Nanodegree. The leaderboard was already closed when I was working on the competition, so I could still have my model evaluated, but my result wasn’t added to the leaderboard. I also have to write down the specifics from memory, because I lost the Linux partition I was using at the time (don’t be hasty when formatting a computer with identical hard drives...).

View competition site


The task was to build a model that is able to distinguish cat from dog images. The provided data contains a training set with 25,000 images, where the classes are balanced. The test set contains another 12,500 images. The training labels need to be inferred from the directory structure, while the whole test set is in one single directory. The evaluation metric is the log loss:
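For a binary problem this metric is usually written as follows, assuming $n$ test images, $\hat{y}_i$ the predicted probability that image $i$ is a dog, and $y_i$ the true label (1 for dog, 0 for cat; the label convention is an assumption here):

$$
\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]
$$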

To get the test set predictions evaluated, a text file with the predictions has to be created and then uploaded to Kaggle. Below are some random images from the test set.

Example images from the test set.
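Creating the prediction file mentioned above can be sketched in a few lines. This is only an illustration: the `id,label` header, the five-decimal formatting and the function name `write_submission` are assumptions on my part, not the exact code I used back then.

```python
import csv
import io


def write_submission(predictions, out):
    """Write one 'id,probability' row per test image, sorted by image id.

    The 'id,label' header is an assumed file layout for this sketch.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "label"])
    for image_id, prob in sorted(predictions.items()):
        writer.writerow([image_id, f"{prob:.5f}"])


# Hypothetical predictions: test image id -> predicted probability of "dog".
example = {2: 0.01234, 1: 0.98765, 3: 0.5}

buffer = io.StringIO()  # stands in for an open file in this sketch
write_submission(example, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

For a real submission, `out` would simply be a file opened with `open("submission.csv", "w", newline="")`.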



The preprocessing basically consisted of creating a training / validation split (either 80 / 20 or 90 / 10) and resizing the images to a common size, since the images come in all shapes and sizes. Below is an image of my data exploration where I investigated the shape distribution of the training set, which I then compared with the test set.

Other steps like data augmentation and mean centering were performed live during training.
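The label inference and the training / validation split described above can be sketched as follows. The directory layout (`train/cats`, `train/dogs`), the function names and the fixed seed are hypothetical; the split is a plain random one and not stratified, which is usually fine for balanced classes.

```python
import random
from pathlib import PurePosixPath


def label_from_path(path):
    """Use the name of the parent directory as the class label."""
    return PurePosixPath(path).parent.name


def train_valid_split(paths, valid_fraction=0.2, seed=42):
    """Shuffle the paths reproducibly and hold out valid_fraction of them."""
    shuffled = sorted(paths)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical layout: train/cats/<n>.jpg and train/dogs/<n>.jpg
paths = [f"train/{cls}/{i}.jpg" for cls in ("cats", "dogs") for i in range(50)]

train_paths, valid_paths = train_valid_split(paths, valid_fraction=0.2)
train_labels = [label_from_path(p) for p in train_paths]
print(len(train_paths), len(valid_paths))
```

Passing `valid_fraction=0.1` gives the 90 / 10 variant.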

Optimization Objective

For the cost function I used cross entropy with softmax, which boils down to the simple logistic regression loss in the case of just two classes. The softmax cross entropy is the generalization of the log loss to more than two classes and is shown below.
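In the notation of the Stanford tutorial referenced in this section, and assuming $m$ training examples $(x^{(i)}, y^{(i)})$, $K$ classes and one parameter vector $\theta^{(k)}$ per class, the softmax cost can be written as:

$$
J(\theta) = -\left[\sum_{i=1}^{m}\sum_{k=1}^{K} 1\{y^{(i)} = k\}\,\log\frac{\exp\!\left(\theta^{(k)\top} x^{(i)}\right)}{\sum_{j=1}^{K}\exp\!\left(\theta^{(j)\top} x^{(i)}\right)}\right]
$$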

This may become clearer when the log loss is written differently:
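Assuming binary labels $y^{(i)} \in \{0, 1\}$ and the logistic hypothesis $h_\theta$, a sketch of that rewriting (summed over the $m$ examples rather than averaged, which only rescales the loss) is:

$$
J(\theta) = -\left[\sum_{i=1}^{m} y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\!\left(1 - h_\theta(x^{(i)})\right)\right]
= -\left[\sum_{i=1}^{m}\sum_{k=0}^{1} 1\{y^{(i)} = k\}\,\log P\!\left(y^{(i)} = k \mid x^{(i)}; \theta\right)\right]
$$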

where the conditional probability can be formulated with the sigmoid in the binary case or the softmax in the multiclass case. A great explanation can be found in Stanford’s deep learning tutorial, which is the source of these equations.
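The claim that the two-class softmax cross entropy reduces to the logistic regression loss can be checked numerically. The sketch below uses made-up logits and hypothetical function names; the only thing it relies on is that a two-class softmax depends solely on the difference between the two logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, label):
    """Negative log probability that the softmax assigns to the true class."""
    return -math.log(softmax(logits)[label])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_log_loss(z, y):
    """Logistic regression loss for a single logit z and a label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Made-up two-class logits in the order (cat, dog).
logits = [0.3, 1.7]
z = logits[1] - logits[0]  # the sigmoid sees only the logit difference

for label in (0, 1):
    print(label, softmax_cross_entropy(logits, label), binary_log_loss(z, label))
```

Both columns agree up to floating point error for either label.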


All my approaches were based on convolutional neural networks (CNNs). For most of them, as far as I remember, I started out with weights pre-trained on ImageNet, although I also trained some models from scratch. I tried many different architectures such as VGG16, Inception and ResNet, and finally settled on ResNet50, or more precisely an ensemble of five ResNets. In addition to the ensemble, I used tricks like modifying the architecture and its hyperparameters (especially dropout), data augmentation, prediction clipping, pseudo labeling and experimenting with different input sizes.
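Two of these tricks, ensembling and prediction clipping, are easy to illustrate. The sketch below assumes a plain average over the models and clip bounds of 0.02 and 0.98; both are illustrative choices, not necessarily what I used, and the probabilities are made up. Clipping helps because the log loss punishes a confident wrong prediction very heavily.

```python
import math


def average_ensemble(per_model_probs):
    """Average the predicted probabilities of several models per image."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def clip(probs, low=0.02, high=0.98):
    """Keep predictions away from 0 and 1 (the bounds are assumed values)."""
    return [min(max(p, low), high) for p in probs]


def log_loss(labels, probs):
    """Mean binary log loss for labels in {0, 1}."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)


# Made-up example: the last image is a cat (0) that every model calls a dog.
labels = [1, 0, 1, 0]
model_probs = [
    [0.9999, 0.0001, 0.98, 0.9999],
    [0.9998, 0.0002, 0.95, 0.9997],
    [0.9999, 0.0003, 0.99, 0.9998],
]

ensemble = average_ensemble(model_probs)
clipped = clip(ensemble)
raw_loss = log_loss(labels, ensemble)
clipped_loss = log_loss(labels, clipped)
print(raw_loss, clipped_loss)
```

Clipping slightly worsens the loss on the correct predictions but caps the penalty for the single confident mistake, so the overall loss drops in this example.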


My best submission reached a log loss of 0.04326, which beats place 24 on the leaderboard. Out of the 1,314 teams on the leaderboard, this would have put me in the top 2%. Below is a screenshot of that submission and the leaderboard.