Humans are remarkably good at identifying the same face across illuminations, positions, deformations, and depths. The same face can even be identified through fences, glass, and water. The number of possible contexts in which a face can appear is infinite, yet we identify it instantaneously. For whatever reason, we are really good at identifying objects, but researchers have struggled to make computers even semi-competent at it. One of the more valiant efforts is Yann LeCun’s use of convolutional nets, but their successes have come primarily in controlled settings. Any reasonable person in the field would agree that any human can wipe the floor with even the best algorithm running on the best supercomputer (programmed by the best programmer in the best department in the best state in the best country!). So what gives?
A recent article from Pinto, Cox, and DiCarlo points to a fundamental flaw in the current approach: the metric. Most object recognition algorithms are built with the famed Caltech 101 or Caltech 256 databases in mind. Looking through the datasets, they seem to be perfectly natural tests for anything that purports to recognize objects. Particular objects, e.g. cockroaches, are presented in a panoply of contexts. So, if my algorithm can recognize a cockroach on a tree, on a piece of bark, and on a white background, it’s doing its job, right? Well, it turns out that using only natural images, a recent craze in image processing, allows algorithms to leverage statistics of the image that are not part of the object itself. That is fine in itself, and such statistics ought to be used. Yet Pinto et al. demonstrate that even a decidedly “stupid” algorithm can perform as well as the latest and greatest when Caltech 101 is the metric.
More precisely, they used a bank of linear filters to grossly approximate the function of V1, along with an off-the-shelf support vector machine library. For comparison, neuroscientists have implicated at least V1, V2, V4, IT, PFC, and perhaps parietal cortex in object recognition. What this demonstrates, among other things, is that the Caltech 101 database is not as hard as it seems. If this simple null model can perform on par with the state of the art, then either object recognition algorithms have gone nowhere or the metric is all wrong. I side with the latter view, as, I presume, do Pinto et al., though I would venture to guess that more than a few algorithms get by on the severely lacking metric. Still, Pinto et al.’s proposed dataset of precisely and parametrically varied ray-traced images helps to fill the gaps left by Caltech 101/256. Namely, an algorithm cannot “cheat” and use information from the environment or the particularly artistic lighting of the photographer. Instead, the algorithm must recognize an object across all views, rotations, scales, illuminations, and levels of noise. Thankfully, their null model completely chokes on this dataset, but I’d like to see how more trumped-up models fare.
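The spirit of that null model is easy to sketch: a bank of oriented Gabor-like filters (the textbook model of V1 simple cells) whose rectified, spatially pooled responses become a feature vector for a standard classifier. Here is a minimal NumPy sketch of the filtering stage; the filter sizes, frequency, and pooling scheme are my own simplifications, not the paper's exact parameters, and in a full pipeline the resulting features would feed an off-the-shelf SVM:

```python
import numpy as np

def gabor_kernel(size=9, theta=0.0, freq=0.25, sigma=2.5):
    """One oriented Gabor filter -- the textbook model of a V1 simple cell."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the filter's orientation
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)
    return g - g.mean()  # zero mean: a flat patch produces no response

def v1_features(img, n_orientations=4, size=9):
    """Rectified, spatially pooled responses of a small oriented filter bank."""
    feats = []
    for i in range(n_orientations):
        k = gabor_kernel(size, theta=np.pi * i / n_orientations)
        # 'valid' convolution written as explicit sliding windows
        windows = np.lib.stride_tricks.sliding_window_view(img, k.shape)
        resp = np.abs((windows * k).sum(axis=(-1, -2)))  # crude rectification
        feats.append(resp.mean())  # crude spatial pooling over the whole image
    return np.array(feats)

# Toy stimuli: gratings at the filters' preferred frequency (period = 4 pixels).
xs = np.arange(32)
vertical = np.tile(np.cos(2 * np.pi * 0.25 * xs), (32, 1))  # varies along x
horizontal = vertical.T                                     # varies along y
```

Even this toy version is orientation selective: the theta=0 filter responds strongly to the vertical grating and barely at all to the horizontal one, which is exactly the kind of low-level statistic such a model can exploit on a photographic dataset.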
PLoS Computational Biology, 2008. DOI: 10.1371/journal.pcbi.0040027
(Image from Flickr user Max Kiesler)