Dog and cat images share enough similarities to make identification hard for computer vision algorithms. But knowing that one image came from catlovers.com and the other from myfavoritedog.org makes identification easy. Images courtesy of Flickr users Drab Makyo and Tambako the Jaguar.
The Internet multimedia world is complicated, and several recent studies raise an interesting question: does content matter?
The content is the audio and image data that make us smile (or not). Two examples provide interesting food for thought.
The Netflix competition was a wildly successful attempt to motivate hundreds of the best machine-learning researchers to develop a better movie-recommendation algorithm. The winning team, with a large contribution from Yahoo researcher Yehuda Koren, combined hundreds of sources of information to make the best recommendations. But none of those sources related directly to the audio and video signals! You'd think that these multimedia signals, which are what we actually watch, would be the key information needed to recommend movies, but no. The team found useful information in, for example, the length of time between a movie's release and when a user rated it, but not in the content. For movies, the most important information is the users' ratings (or, more broadly, users' interactions with the movies, such as which ones they watched), which greatly exceed the importance of genre, actors, and directors. Reference: http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
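The winning team's key move, blending many simple non-content signals rather than analyzing the video itself, can be sketched in miniature. The features, weights, and numbers below are entirely hypothetical and are not taken from the actual BellKor solution:

```python
# Toy blend of non-content signals for predicting a movie rating.
# All features and weights here are illustrative, not BellKor's.

def predict_rating(user_mean, movie_mean, days_to_rating, global_mean=3.6):
    """Combine simple non-content signals into a rating estimate."""
    user_bias = user_mean - global_mean    # how generous this user is overall
    movie_bias = movie_mean - global_mean  # how well liked this movie is overall
    # A hypothetical time effect: ratings drift depending on how long
    # after release the user rated the movie.
    time_effect = 0.1 if days_to_rating > 365 else 0.0
    return global_mean + user_bias + movie_bias + time_effect

print(predict_rating(user_mean=3.2, movie_mean=4.1, days_to_rating=400))
```

Note that nothing here looks at the pixels or the soundtrack; every input is metadata about viewing and rating behavior.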
Similarly, a University of Alberta team co-led by Yahoo visitor Russ Greiner developed the best algorithm for recognizing children and young adults with attention-deficit hyperactivity disorder (ADHD) from fMRI brain images. But they didn't do it the way the organizers hoped. They found that the best signals for classifying the patients were in the metadata about each patient. They ignored the image data and instead looked at each individual's age, gender, handedness, IQ, and the institution where the fMRI scan was performed. Reference: http://www.talyarkoni.org/blog/2011/10/12/brain-based-prediction-of-adhd-now-with-100-fewer-brains/
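The flavor of such a metadata-only classifier can be sketched with per-site base rates. The sites, labels, and numbers below are invented for illustration and have nothing to do with the real competition data:

```python
# Toy illustration: predicting a diagnosis from metadata alone, by
# estimating a base rate per scan site. All data here is made up.
from collections import defaultdict

def fit_site_rates(records):
    """Estimate P(positive label | scan site) from (site, label) pairs."""
    counts = defaultdict(lambda: [0, 0])  # site -> [positives, total]
    for site, label in records:
        counts[site][0] += label
        counts[site][1] += 1
    return {site: pos / total for site, (pos, total) in counts.items()}

def predict(site, rates, threshold=0.5):
    """Predict positive if this site's base rate exceeds the threshold."""
    return rates.get(site, 0.0) >= threshold

train = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 0), ("B", 1)]
rates = fit_site_rates(train)
print(rates)
```

A model this crude never opens the brain images at all, yet if sites differ in their patient populations it can still beat an image-based classifier, which is exactly the uncomfortable lesson of the competition.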
None of this is to suggest that the people working on these problems are dummies. They are the best and the brightest in our field (and Netflix's $1M is a powerful motivator). Audio, image, and video signals are complicated and, so far, beyond our ability to truly understand. Academic competitions often provide simple metrics based on a single source of information, such as an image. These metrics are useful because they focus attention on one particular type of solution. But the real world demands better solutions, and sometimes the best solution is not the expected one.
Yahoo researchers asked our colleagues at Flickr whether they needed help applying the latest computer vision algorithms to their problem of detecting adult images. Flickr quickly said no, and described a better solution based on the network graph. They have labels for some images; photographers have friends, and those friends' images tend to carry the same labels. The resulting image graph is well contained. When displaying search results, Flickr is very conservative about which images it shows. Good images might sit in the wrong part of the graph, but there are so many images on so many different topics that it would be hard to tell that any one image is missing.
Yahoo researcher Dhruv Mahajan turned this idea into a formal solution using optimization theory. Images of cats are more likely to appear at catlovers.com, while images of cricket are found on Indian sites. This is a rather obvious statement, but simple changes to a problem description often have dramatic effects. Images always have a context: we know who took the picture, what other kinds of pictures the same photographer took, which Web page contains the image, and how that Web page is linked to the rest of the graph. Mahajan's system classified images and found that the best single source of information was the web context: not the text around the image or the image itself, but the labels of images nearby in the web graph. Of course, the best classifier used all three signals, but his results are a dramatic illustration. Yahoo researcher Olivier Chapelle described a similar solution for deciding whether a Web page is spam. References: http://dl.acm.org/citation.cfm?id=1874131 and https://s.yimg.com/ge/labs/v1/files/2008-001_abernethy.pdf
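One way to make the "labels of nearby images" idea concrete is simple label propagation over a graph. The sketch below is a generic textbook formulation under an assumed toy graph, not Mahajan's or Chapelle's actual optimization:

```python
# Minimal label-propagation sketch: spread a few known labels (say,
# adult = 1.0, safe = 0.0) over a graph of images linked by shared
# photographers or pages. The graph and labels are illustrative only.

def propagate(neighbors, seeds, iterations=20):
    """neighbors: node -> list of adjacent nodes.
    seeds: node -> fixed score in [0, 1] for the labeled nodes.
    Returns a score for every node; seed nodes stay clamped."""
    scores = {n: seeds.get(n, 0.5) for n in neighbors}
    for _ in range(iterations):
        new = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                new[node] = seeds[node]  # clamp known labels in place
            else:
                # Unlabeled nodes take the average of their neighbors.
                new[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = new
    return scores

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = propagate(graph, seeds={"a": 1.0, "d": 0.0})
print(scores)  # "b" ends up closer to the adult seed, "c" closer to safe
```

On this tiny chain the scores converge to 2/3 for "b" and 1/3 for "c", so an image's position in the graph alone yields a usable classification score, before any pixel is examined.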
Another promising approach is the work by Yahoo researcher Jia Li. Humans are very good at telling the difference between a house cat and a hungry tiger. We know how to interpret context, and we understand what the socks on the floor suggest. The ImageBank system combines many good-enough detectors for simple concepts like sky and person. These simple detectors, spanning a wide range of topics, then provide good information for the final high-level judgments (is this basketball?). Reference: http://vision.stanford.edu/lijiali/JiaLi_files/LiSuXingFeiFeiNIPS2010.pdf
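The idea of combining many weak concept detectors into one high-level decision can be sketched as a simple weighted vote. The concepts, scores, and weights below are invented for illustration and are not the actual ImageBank model, which learns the combination from data:

```python
# Toy "bank of detectors" sketch: each weak detector emits a score in
# [0, 1] for a simple concept, and a linear combination makes the
# high-level call. All concepts and weights here are hypothetical.

DETECTORS = ["sky", "person", "ball", "court", "grass"]

def high_level_score(detector_scores, weights):
    """Weighted vote over per-concept detector outputs (0 if absent)."""
    return sum(weights[c] * detector_scores.get(c, 0.0) for c in DETECTORS)

# Hypothetical weights for a "basketball scene" classifier: court and
# ball count strongly for it; sky and grass count against an indoor sport.
weights = {"sky": -0.5, "person": 0.4, "ball": 1.0, "court": 1.2, "grass": -0.8}

indoor_game = {"person": 0.9, "ball": 0.8, "court": 0.95}
meadow = {"sky": 0.9, "grass": 0.95, "person": 0.2}

print(high_level_score(indoor_game, weights))  # high score: likely basketball
print(high_level_score(meadow, weights))       # low score: likely not
```

No single detector has to be very accurate; with enough of them covering enough concepts, the combined evidence is strong.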
Thus, context is key. Of course content is king; it is what we pay to see and hear. But often the best source of information about a multimedia signal is its ancillary information. Now we need to find better ways to incorporate that information into our systems.