Facebook AI researchers have built a system that can analyze a photo of food and then create a recipe from scratch.
Snap a photo of a particular dish and, within seconds, the system can analyze the image and generate a recipe with a list of ingredients and steps needed to create the dish. It can’t look at a photo of a particular pie or pancake and determine the exact type of flour used or the skillet or oven temperature, but the system will come up with a recipe for a very credible (and tasty) approximation. While the system is only for research, it has proved to be an interesting challenge for the broader project of teaching machines to see and understand the world.
Their “inverse cooking” system uses computer vision, technology that extracts information from digital images and videos to give computers a high level of understanding of the visual world.
But this is no ordinary computer vision system: It uses not one but two neural networks, algorithms that are designed to recognize patterns in digital images, whether they are fern fronds, long muzzles or embossed characters. Michal Drozdzal, a research scientist at Facebook AI Research, explains that the inverse cooking system splits the image-to-recipe problem into two parts: One neural network identifies the ingredients that it sees in the dish, while the other devises a recipe from the list.
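The two-part split can be sketched in a few lines of Python. This is a minimal illustration of the structure only, not the researchers' code: the function names `inverse_cook`, `toy_predictor`, and `toy_generator` are hypothetical, and the two toy functions are hard-coded stand-ins for the trained neural networks.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical placeholder: a real system would pass image pixels or features.
Image = Sequence[float]


def inverse_cook(
    image: Image,
    predict_ingredients: Callable[[Image], List[str]],
    generate_recipe: Callable[[Image, List[str]], List[str]],
) -> Tuple[List[str], List[str]]:
    """Run the two stages in order: ingredients first, then the recipe."""
    ingredients = predict_ingredients(image)
    # The second stage sees both the image and the predicted ingredient list.
    steps = generate_recipe(image, ingredients)
    return ingredients, steps


def toy_predictor(image: Image) -> List[str]:
    # Stand-in for the first network (assumption: fixed output for the demo).
    return ["rice", "onion", "tomato"]


def toy_generator(image: Image, ingredients: List[str]) -> List[str]:
    # Stand-in for the second network: one templated step per ingredient.
    return [f"Prepare the {name}." for name in ingredients] + ["Combine and cook."]


ingredients, steps = inverse_cook([0.0], toy_predictor, toy_generator)
print(ingredients)
for step in steps:
    print(step)
```

The point of the sketch is the data flow: if the first stage misses an ingredient, the second stage never gets a chance to write it into the recipe.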
Drozdzal says this enhanced computer vision system is more effective than retrieval image-to-recipe techniques, which work to recognize the tasty treat in question and then search a database of preexisting recipes. “Our system outperformed the retrieval system both on ingredient predictions and on generating plausible recipes,” says Drozdzal.
Drozdzal and fellow Facebook AI Research scientist Adriana Romero, who met while studying for doctorates at the University of Barcelona, claim their system might even trot out a decent recipe for paella, a devilishly complicated Spanish rice dish.
That is no mean feat, because food recognition is one of the toughest areas of natural image understanding. This is why any system of visual ingredient detection and recipe generation benefits from some high-level reasoning and prior knowledge: A standard paella contains some quantity of chopped and fried onion, a cake will likely contain sugar and no more than a pinch of salt, and a croissant will presumably include butter.
The team trained its AI not only to predict the most plausible ingredients but also to recognize that certain ingredients often appear together, like cinnamon and sugar. That’s probably why the AI can predict ingredients even when they don’t seem to appear in a photo.
Naturally, the success of the retrieval method depends on the size and quality of the cookbook, the handiwork of both the photographer and the chef, and some pot luck. “It is hard to match if a recipe isn’t in the data set and the image or dish appearance are different to the data set,” says Drozdzal. In other words, the retrieval approach is like finding a needle in a haystack when the system doesn’t know what a needle looks like.
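To see why retrieval struggles with unfamiliar dishes, consider a rough sketch of the idea. The article does not say how the retrieval baseline matches a photo to a stored recipe, so the cosine-similarity lookup, the two-number "embeddings," and the names `cosine` and `retrieve_recipe` below are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_recipe(
    query: Sequence[float], database: List[Tuple[Sequence[float], str]]
) -> str:
    """Return the title of the stored recipe most similar to the query."""
    best_embedding, best_title = max(database, key=lambda entry: cosine(query, entry[0]))
    return best_title


# Hypothetical toy database: (image embedding, recipe title).
database = [([1.0, 0.0], "pancakes"), ([0.0, 1.0], "paella")]

# Retrieval always returns *some* stored recipe, even for a dish that
# resembles nothing in the database especially well.
print(retrieve_recipe([0.6, 0.8], database))
```

Whatever the photo shows, the lookup can only hand back something already in the cookbook, which is the limitation Drozdzal describes.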
Drozdzal and Romero were convinced there was a better method. They wondered what would happen if they built in an extra step to the recipe generation pipeline: a system that could predict the ingredients.
The ingredient-predicting network works more or less according to the problem-solving principle of Occam’s razor: that the most plausible-seeming explanation is probably correct. To keep the problem manageable, Drozdzal, Romero, and their team took the Recipe1M data set, which has nearly 17,000 ingredients, and whittled it down to 1,500. They also trained the model to predict that certain ingredients often appear together, like salt and pepper, cheese and tomato, and cinnamon and sugar.
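The notion that some ingredients "often appear together" can be made concrete by counting pairs across a set of recipes. This is only an illustration of the underlying statistic: the network learns such dependencies during training rather than from an explicit table, and the function name `cooccurrence_counts` and the three toy recipes are invented for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_counts(recipes: Iterable[List[str]]) -> Counter:
    """Count how many recipes contain each unordered pair of ingredients."""
    counts: Counter = Counter()
    for ingredients in recipes:
        # Sort and de-duplicate so each pair has one canonical key per recipe.
        for pair in combinations(sorted(set(ingredients)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data, not taken from Recipe1M.
recipes = [
    ["cinnamon", "sugar", "flour"],
    ["sugar", "cinnamon", "apple"],
    ["salt", "pepper", "tomato"],
]

counts = cooccurrence_counts(recipes)
pair: Tuple[str, str] = ("cinnamon", "sugar")
print(pair, counts[pair])
```

A model that has absorbed this kind of pattern can reasonably guess sugar when it spots cinnamon, even if the sugar itself is invisible in the photo.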
The recipe-generating network also works from the Recipe1M data set, which the team slimmed down from around 1 million recipes to approximately 350,000. Recipes that made the cut all contained images and had two or more ingredients or instructions. The data set furnishes the neural network with a vocabulary of nearly 25,000 unique words in addition to the information from the image and the ingredient list. The network also analyzes the interplay between image and ingredients for insights on how food was processed to produce the resulting dish.
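The slimming-down step amounts to a simple filter. The sketch below assumes the criterion means a recipe needs at least one image, at least two ingredients, and at least two instructions; the article's wording leaves the exact rule open. The field names `images`, `ingredients`, and `instructions` are hypothetical and do not reflect the actual Recipe1M format.

```python
from typing import Dict, List


def keep_recipe(recipe: Dict, min_items: int = 2) -> bool:
    """Assumed rule: has an image, and enough ingredients and instructions."""
    has_image = len(recipe.get("images", [])) > 0
    enough_ingredients = len(recipe.get("ingredients", [])) >= min_items
    enough_instructions = len(recipe.get("instructions", [])) >= min_items
    return has_image and enough_ingredients and enough_instructions


def filter_recipes(recipes: List[Dict]) -> List[Dict]:
    """Keep only the recipes that pass keep_recipe."""
    return [recipe for recipe in recipes if keep_recipe(recipe)]


# Hypothetical sample records for illustration.
sample = [
    {"title": "paella", "images": ["paella.jpg"],
     "ingredients": ["rice", "onion", "tomato"],
     "instructions": ["Fry the onion.", "Stir in the rice."]},
    {"title": "toast", "images": ["toast.jpg"],
     "ingredients": ["bread"],
     "instructions": ["Toast the bread."]},
    {"title": "stew", "images": [],
     "ingredients": ["beef", "carrot"],
     "instructions": ["Brown the beef.", "Simmer."]},
]

kept = filter_recipes(sample)
print([recipe["title"] for recipe in kept])
```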
In the case of Spanish paella, the first neural network might recognize rice, onions, tomatoes, and, depending on the generosity of the chef, some seafood. The second neural network starts generating a recipe from the inferred ingredients: Slice and fry the onion; stir in bomba rice; add chopped tomatoes and, finally, some prawns and mussels. The entire system brings its own high-level reasoning to bear on three sources of information: the image, the corresponding list of ingredients, and the system’s own prior knowledge. It makes well-educated guesses rather than turning recipe generation into a giant identity parade.
The inverse cooking project is already outperforming the retrieval approach. Drozdzal and Romero’s paper cites a recipe for an English muffin laden with cheese, broccoli and tomato. The inverse cooking system aced all the ingredients, while the retrieval system identified only the cheese and tomato. (The retrieval system also saw a cracker, some lettuce, and some Miracle Whip.) Around 55% of human evaluators judged the inverse cooking system’s recipes to be successful, compared with approximately 48% for the retrieval approach.
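One way to put a number on the English muffin comparison is set overlap between the true and predicted ingredients. The article does not say which metric the paper reports, so the choice of the Jaccard index (intersection over union) here is an assumption, and the ingredient sets include only the items named above.

```python
from typing import Set


def jaccard(truth: Set[str], predicted: Set[str]) -> float:
    """Intersection over union of two ingredient sets (1.0 = perfect match)."""
    union = truth | predicted
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(truth & predicted) / len(union)


# Ingredient sets as described in the article's English muffin example.
truth = {"cheese", "broccoli", "tomato"}
inverse_cooking = {"cheese", "broccoli", "tomato"}
retrieval = {"cheese", "tomato", "cracker", "lettuce", "miracle whip"}

print("inverse cooking:", jaccard(truth, inverse_cooking))
print("retrieval:", round(jaccard(truth, retrieval), 2))
```

On this single example the retrieval system is penalized twice: once for missing the broccoli and again for the three ingredients it imagined.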
The inverse cooking creators are continuing to fine-tune the system. “Sometimes it can’t predict an ingredient, which means that it won’t be present in the recipe,” says Drozdzal. They also want to train the system to deal with the problem of visually similar foods, whether they’re spaghetti and noodles, mayonnaise and sour cream, or tofu and paneer.
Romero adds that she and Drozdzal still haven’t taken the final, most important step in their inverse cooking system: “We haven’t got around to cooking yet.”