# Image-to-Recipe Translation with Deep Convolutional Neural Networks

Written by Muriz Serifovic ·13 min read

Hardly any other area affects human well-being to a similar extent as nutrition. Every day, countless food pictures are published by users on social networks; from the first home-made cake to the top Michelin dish, people share their joy whenever a dish turns out well.

It is a fact that no matter how different we might be from each other, good food is appreciated by everyone.

Advances in the classification of individual cooking ingredients are sparse. The problem is that there are almost no curated public datasets available. This work deals with the automated recognition of a photographed dish and the subsequent output of the appropriate recipe. What makes this problem harder than previous supervised classification problems is the large overlap between food dishes (high inter-class similarity): dishes from different categories may look very similar in terms of image information alone.

The largest German-language cooking community, currently with more than 350'000 recipes, is scraped and analyzed. This data feeds a re-ranking scheme consisting of dish recognition using Convolutional Neural Networks (CNNs for short) combined with the scores of the nearest neighbors (nearest-neighbor classification) in a dataset of over 800,000 images. The goal is an accurate recipe prediction for a new, unseen query image.

This combination makes it more likely that the correct recipe is found, as the top-k categories of the CNN are compared with the categories of the nearest neighbors using rank correlation. Rank-correlation approaches such as Kendall Tau essentially measure the probability of two items being in the same order in two ranked lists. Mathematically, Kendall Tau is computed as $$\tau=P(C)-P(D)=\frac{C}{N}-\frac{D}{N}=\frac{C-D}{N}$$ where $$N$$ denotes the total number of pairs, $$C$$ the number of concordant pairs and $$D$$ the number of discordant pairs.
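
To make the formula concrete, here is a minimal sketch that counts concordant and discordant pairs directly. The category names and both rankings are made up for illustration, and ties are assumed not to occur.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall Tau for two rankings of the same items (no ties assumed).

    rank_a and rank_b map each item to its position in the respective list.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order x and y the same way.
        direction = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical example: category order from the CNN vs. from the neighbors.
cnn_rank = {"pizza": 0, "pasta": 1, "cake": 2, "salad": 3}
knn_rank = {"pizza": 0, "cake": 1, "pasta": 2, "salad": 3}
tau = kendall_tau(cnn_rank, knn_rank)  # 5 concordant, 1 discordant, 6 pairs
```

In this toy case only the pair (pasta, cake) is ordered differently, so the result is (5 - 1) / 6, roughly 0.67.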

The exact pipeline looks like the following:

• For every recipe $$\mathbf{W}$$ there are $$\mathbf{K}$$ pictures. For each of these images, feature vectors are extracted from a Convolutional Neural Network pre-trained on the 1000 categories of the ILSVRC 2014 image recognition competition with millions of images. The feature vectors are the internal representation of the image in the last fully connected layer before the 1000-category softmax layer, which was removed beforehand. These feature vectors are then reduced by PCA (Principal Component Analysis) from an $$\mathbf{N} \times 4096$$ matrix to an $$\mathbf{N} \times 512$$ matrix, which shrinks the matrix while retaining $$\sim 99 \%$$ of the information (explained variance). Finally, the top 5 images with the smallest Euclidean distance to the input image are chosen (approximate nearest neighbor search), i.e. the 5 pictures that are visually most similar to the query image based on picture information alone.
• Furthermore, a CNN is trained on $$\mathbf{C}$$ categories with the pictures of $$\mathbf{W}$$ recipes. $$\mathbf{C}$$ was determined dynamically using topic modeling and semantic analysis of recipe names. As a result, we obtain for each category the probability that the input image belongs to it.
• The top-k categories from the CNN (second step) are compared with the categories of the top-k visually similar images (first step) using Kendall Tau correlation.
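
The neighbor lookup in the first step can be sketched with an exact brute-force search. This stands in for the approximate search used on the full data set and is only practical for small collections; the toy 2-D vectors below are invented and replace the 512-dimensional PCA features.

```python
import heapq
import math


def top_k_neighbors(query, features, k=5):
    """Return the indices of the k feature vectors closest to the query."""
    # Pair every Euclidean distance with its index so nsmallest can rank them.
    distances = ((math.dist(query, vec), idx) for idx, vec in enumerate(features))
    return [idx for _, idx in heapq.nsmallest(k, distances)]


# Toy stand-in for the reduced feature matrix (one row per image).
features = [[0, 0], [1, 1], [5, 5], [0.5, 0.5], [9, 9], [2, 2]]
nearest = top_k_neighbors([0, 0], features, k=3)  # -> [0, 3, 1]
```

The returned indices would then be mapped back to recipes and their categories.
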
The schema to visualize the method looks like this:

## Individual Parts

We will break this work up into smaller, more digestible chunks consisting of multiple parts:

Recognizing Food

• Data preparation: clearing data, data augmentation
• Data analysis and visualization, split data (Train, Valid, Test)
• Topic Modeling: Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization
• Feature Extraction: k-nearest neighbors, t-SNE visualization
• Transfer Learning: training pre-trained CNNs (Convolutional Neural Networks) such as AlexNet, VGG, ResNet, GoogLeNet
• Deploying with Flask on now.sh, a serverless application deployment

Each part contains Jupyter notebooks which you can view on the GitHub page.

## Scraping and preparing the data

To train a model at all, you need enough data (data augmentation and fine-tuning of pre-trained models can help when data is scarce). Only with a sufficient amount of data can the model generalize beyond the training set and reach high accuracy on a test set. The first part of this tutorial deals with data acquisition, analysis and visualization of features and their relationships.


We do not have better algorithms. We just have more data.
The quality and quantity of the data set always matter. That’s why Europe’s biggest cooking platform is scraped: all recipes, 316'756 in the end (as of December 2017), are downloaded with a total of 879'620 images. It is important not to proceed too fast when downloading and to protect the servers from too many queries, since otherwise a ban of your own IP address would make the data collection more difficult.

More data leads to more dimensions, but more dimensions do not necessarily lead to a better model or representation. Deviating patterns in the data set that disturb learning can be unintentionally amplified by more dimensions: generalization and learning become harder for the neural network, and the signal-to-noise ratio decreases.

All 300k recipes sorted by date: http://www.chefkoch.de/rs/s30o3/Rezepte.html

When scraping a website, it is important to respect the robots.txt file. Some administrators do not want bots to visit specific directories. The file at https://www.chefkoch.de/robots.txt lists the paths that bots should stay away from.

The directories listed there do not interest us, so you can confidently continue. Nevertheless, measures such as randomized request headers and sufficiently long pauses between the individual requests are recommended to avoid a possible ban from the website (I learned this the hard way while working on another project).
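
As a rough sketch of both precautions, the standard library's `urllib.robotparser` can check a URL against robots.txt rules, and a randomized sleep spreads out the requests. The rules and the `example.com` URLs below are hypothetical placeholders, not the real contents of the site's robots.txt, and the delay bounds are arbitrary.

```python
import random
import time
from urllib import robotparser

# Hypothetical rules for illustration; not the real robots.txt of any site.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that was fetched beforehand."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_pause(min_seconds=2.0, max_seconds=6.0):
    """Sleep for a random interval between requests and return the delay."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


parser = build_parser(EXAMPLE_RULES)
allowed = parser.can_fetch("*", "https://www.example.com/rs/s30o3/Rezepte.html")
blocked = parser.can_fetch("*", "https://www.example.com/private/page.html")
```

In a real crawler, `can_fetch` would be called before every request and `polite_pause` after it.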

Total pages: 10560

The next important step is feature selection, which filters out unimportant data. Preparing raw data for the neural net is commonplace in practice. In the first pass, the recipe name, the average rating of the recipe, the number of ratings, the difficulty level, the preparation time and the publication date are downloaded. In the second pass, the ingredient list, the recipe text, all images and the number of times the recipe has been printed follow. These features describe the data set very well and help to gain a strong understanding of it, which is important for selecting the algorithms.

Data such as the recipe name, rating, upload date, etc. are stored in a CSV file. If the recipe has an image, the thumbnail is placed in the search_thumbnails folder. We will make use of multiprocessing to shorten the download time. For further information, see Python’s documentation.
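
The parallel download and the CSV export could look roughly like the sketch below. Note the swap: it uses a thread pool from `concurrent.futures` rather than `multiprocessing`, which is often sufficient because downloads are I/O-bound. `fake_fetch`, the URLs and the column names are hypothetical stand-ins for the real request and parsing logic.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, workers=8):
    """Apply fetch to every URL in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def write_rows(rows, fieldnames, handle):
    """Write a list of dicts as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def fake_fetch(url):
    # Stand-in for a real HTTP request; returns a hypothetical metadata row.
    return {"url": url, "name": url.rsplit("/", 1)[-1]}


urls = ["https://example.com/a", "https://example.com/b"]
rows = download_all(urls, fake_fetch, workers=2)

# An in-memory buffer replaces the CSV file on disk for this illustration.
buffer = io.StringIO()
write_rows(rows, ["url", "name"], buffer)
```

For CPU-heavy post-processing, a process pool would be the closer match to the multiprocessing approach mentioned above.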

Next we need to scrape the list of ingredients, the preparation, the tags and all images of each recipe.

If everything went smoothly with the download, our data looks like this:

Data

• A total of 879'620 images (35 GB)
• 316'756 recipes
• 189'969 recipes contain one or more pictures, of which 107'052 contain more than 2 images
• 126'787 recipes contain no picture

## Data analysis and visualization

To get a first impression, we plot a correlation heatmap, which gives first insights into which features might be interesting.

The features votes and average_rating have the highest correlation. Figure 2 shows the pair plot in the 1st column, 2nd row, and it stands out that the higher the number of ratings, the better the rating of the recipe. Also interesting is the comparison between preparation time and number of ratings: most ratings go to recipes with a short preparation time. It seems that the ChefKoch community prefers easy recipes. Another idea is to compare the number of newly uploaded recipes per year.

A comparison of the curves (bottom graphic) suggests a correlation between rising prices worldwide and the supply of recipes. My hypothesis is that demand for recipes rose because people stayed at home and cooked for themselves and their families in order to save money and make ends meet.
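
A correlation heatmap is a colored view of pairwise correlation coefficients. The sketch below computes a Pearson correlation matrix in plain Python. The column values are invented toy numbers, and `prep_minutes` is a hypothetical column name; only `votes` and `average_rating` are taken from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented toy values, one entry per recipe.
columns = {
    "votes": [1, 5, 20, 50],
    "average_rating": [3.0, 3.5, 4.2, 4.6],
    "prep_minutes": [90, 60, 30, 20],
}
matrix = correlation_matrix(columns)
```

Each cell of the heatmap corresponds to one entry such as `matrix["votes"]["average_rating"]`.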

### Ingredients

Altogether, 316'755 recipes share 3'248'846 ingredients. After removing duplicates, 63'588 unique ingredients remain. The Apriori algorithm is used for the association analysis of the ingredients. It reports how often ingredients occur in total, alone and in combination with other ingredients.

The most common ingredient is salt, which appears in 60 percent of all recipes. In third place comes the first tuple, a combination of two ingredients: pepper and salt, which at just over 40 percent are by far the most common pair. The most common triplets, quadruplets and even quintuplets can be found in the corresponding Jupyter Notebook.
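
The idea behind Apriori can be sketched in a few lines: an itemset can only be frequent if all of its subsets are frequent, so candidates are grown level by level and pruned early. The recipes and the support threshold below are toy assumptions; this is an illustrative implementation, not the code from the notebook.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(recipes, min_support=0.5, max_size=3):
    """Return {itemset: support} for all itemsets up to max_size items."""
    n = len(recipes)
    baskets = [frozenset(r) for r in recipes]
    frequent = {}

    # Level 1: single ingredients that reach the support threshold.
    counts = Counter(item for basket in baskets for item in basket)
    current = set()
    for item, count in counts.items():
        if count / n >= min_support:
            current.add(frozenset([item]))
            frequent[frozenset([item])] = count / n

    size = 2
    while current and size <= max_size:
        # Join frequent (size-1)-itemsets into candidates with `size` items.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        counts = Counter(c for basket in baskets for c in candidates if c <= basket)
        current = {c for c in candidates if counts[c] / n >= min_support}
        for itemset in current:
            frequent[itemset] = counts[itemset] / n
        size += 1
    return frequent


# Toy recipes for illustration.
recipes = [
    ["salt", "pepper", "onion"],
    ["salt", "pepper"],
    ["salt", "sugar"],
    ["flour", "sugar", "butter"],
]
frequent = frequent_itemsets(recipes, min_support=0.5)
```

With these four toy recipes, salt reaches a support of 0.75 and the pair (salt, pepper) a support of 0.5, while (salt, sugar) falls below the threshold.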

## Topic Modelling

The goal of this procedure is to divide all recipe names into n categories. For a supervised classification problem, we have to provide the neural network with labeled images; only with these labels does learning become possible. The problem is that Chefkoch.de does not categorize its pictures, so we have to do this on our own. Possible procedures to split the 316'755 recipe names are shown below.

Take the following example:

• Pizza with mushrooms
• Stuffed peppers with peas and tuna
• Pizza with seafood
• Paprika with peas
The four recipe names above must be divided into n categories. Obviously, the 1st and 3rd recipes belong in the same category, pizza. The 2nd and 4th can be grouped into another category because of the peas. But how do you manage more than 300 thousand recipe names?

### Latent Dirichlet Allocation (LDA)

LDA is a probabilistic model which assumes that each name can be assigned to a topic. First, the corpus of names must be cleaned, i.e. stop words are removed and words are reduced to their stem. The cleaned vocabulary serves as input.
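
The cleaning step might look like the following sketch. The stop word list and the suffix rules are deliberately tiny, hypothetical placeholders; a real pipeline would more likely rely on an established stop word list and stemmer for the language of the recipe names.

```python
import re

# Hypothetical, tiny stop word list and suffix rules for illustration only.
STOP_WORDS = {"with", "and", "the", "a", "of"}
SUFFIXES = ("es", "s")


def naive_stem(word):
    """Strip one plural-like suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_name(name):
    """Lowercase, tokenize, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-zäöüß]+", name.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = clean_name("Pizza with mushrooms")  # -> ["pizza", "mushroom"]
```

The resulting token lists form the vocabulary that the topic model is trained on.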

For the sake of simplicity, the exact mathematical definition is not discussed. As a result, for each topic we get a list of words with probabilities that express how strongly the model associates each word with the topic. Example: 0.363*"scalloped" + 0.165*"spicy" + 0.124*"summer" + 0.006*"taboulé" + 0.004*"oatmeal biscuits". An interactive graph to browse through each of the 300 topics can be found at 04_01_topic_modeling.ipynb in the GitHub repo.

### Non-negative Matrix Factorization

The first step is to calculate the tf-idf (term frequency-inverse document frequency). It expresses the importance of a word in a recipe name relative to its importance in the whole text corpus. The most important words include:

• spaghetti (2429.36)
• torte (2196.21)
• cake (1970.08)
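
For intuition, here is a minimal tf-idf sketch in plain Python using the unsmoothed textbook definition. Libraries typically apply smoothing and normalization, so the absolute values will not match the scores listed above; the token lists are toy examples.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: non-empty token lists. Returns one {term: weight} per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy recipe names, already tokenized and cleaned.
documents = [["pizza", "mushroom"], ["pizza", "seafood"], ["pepper", "pea"]]
weights = tf_idf(documents)
```

Here "mushroom" gets a higher weight than "pizza" in the first name, because "pizza" also appears in another name and is therefore less distinctive.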

The NMF algorithm takes the tf-idf matrix as input and simultaneously performs dimension reduction and clustering. This approach gives good results, as shown below for the first 4 topics:

• Topic #0:
spaghetti carbonara alla olio aglio al sabo puttanesca di mare
• Topic #1:
• Topic #2:
noodles chinese asia mie asian wok udon basil black light
• Topic #3:
muffins blueberry hazelnut cranberry savory juicy sprinkles johannisbeer oatmeal chocolate
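
To show what the factorization does, the sketch below implements NMF with the classic multiplicative update rules on a tiny matrix. This is one common way to compute NMF and not necessarily the solver used for the results above; the matrix values, the number of iterations and the seed are arbitrary assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative v (docs x terms) into w (docs x topics), h (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy tf-idf-like matrix: two names share terms 0-1, two share terms 2-3.
v = [[1, 1, 0, 0], [1, 0.5, 0, 0], [0, 0, 1, 1], [0, 0, 0.8, 1]]
w, h = nmf(v, n_topics=2)
```

Each row of `h` weights the terms of one topic, and each row of `w` tells how strongly a recipe name belongs to each topic; the largest entry per row can serve as the cluster label.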

The result can be visualized using t-SNE, which reduces a data set with many dimensions to 2D and thus yields a coordinate for each recipe name.

## Feature Extraction

Loosely modeled on nature, neural networks mimic the way the human brain works. The idea is that the network learns from its mistakes, gradually adjusting the weights of its neurons to adapt to the data. With CNNs, the image information is first summarized to reduce the number of parameters. We assume that the first layers in a CNN recognize rough structures in the picture. The closer you get to the last softmax layer, the finer the learned features become. We can take advantage of this: we take pre-trained CNNs that have been trained on millions of pictures and remove the last layers in order to train them on our own data. This saves us from training millions of parameters and thus reduces computing time. The CNN chosen here is VGG-16, which was trained on 1000 categories for a 2014 classification competition.

If you remove the last layer, you get a feature extractor that outputs the second-to-last layer. Its output forms an n × 4096 matrix, where n is the number of input pictures.

We let the VGG-16 calculate the vector for every image we have. This vector is, so to speak, the fingerprint of the picture: an internal representation the neural network builds.

Now all we have to do for every new input image is to pass it through VGG-16, get the fingerprint vector and find its nearest neighbors with approximate nearest neighbor search. The library I will use for this is FALCONN, a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures.

Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.
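
To illustrate the idea behind hyperplane LSH, here is a small self-contained sketch. It is not FALCONN's API or implementation: it uses a single hash table without multi-probing, and the class name, parameters and toy vectors are made up for illustration.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Single-table hyperplane LSH for cosine similarity (illustrative only)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane through the origin contributes one hash bit.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def signature(self, vec):
        # One bit per hyperplane: on which side of the plane does vec lie?
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, key, vec):
        self.buckets[self.signature(vec)].append(key)

    def candidates(self, vec):
        """Keys stored in the same bucket as vec; exact distances come afterwards."""
        return list(self.buckets.get(self.signature(vec), []))


index = HyperplaneLSH(dim=3)
index.add("a", [1.0, 2.0, 3.0])
index.add("b", [-1.0, -2.0, -3.0])
# A vector pointing in the same direction as "a" lands in the same bucket.
matches = index.candidates([2.0, 4.0, 6.0])
```

Vectors with a small angle between them tend to share a signature, so only the candidates from one bucket need an exact distance computation instead of the whole data set.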

We conduct an experiment by passing an input image of a brownie into our system and examine the output.

As expected, we get food images related to our query image. We can even create a grid of images to see how the neural network interprets them. The following picture is only a small part of the whole image. You can see that dishes with similar features are closer together. The whole grid can be found here.

That’s a wrap for this tutorial! How to train your own neural network from scratch without pre-training and how to turn our system into a web application with Flask (Part V and Part VI) will be covered in the next tutorial.
