Yahoo! Research has teamed up with Yahoo! Mail engineering to develop state-of-the-art spam detection that has dramatically reduced the amount of spam mail that can leak through to the in-boxes of Yahoo! Mail users. The project originated in 2007 when the mail engineering team approached the research team about collaborating on an improved content classifier.
Yahoo! Researchers Raghu Ramakrishnan, Kilian Weinberger, Martin Zinkevich, Anirban Dasgupta, and Dan Kifer immediately took the matter to heart and formed the Sparta team. The name “Sparta” derives from the phrase “spam research task force.”
As a first step, the team learned all about the current efforts in spam filtering. They soon realized that the most important thing to do was to make significant improvements to the current content-based spam filter – to make it more noise tolerant and cutting edge. They put their minds together with Vish Tumkur Ramarao, Raghav Jeyaraman, Jay Pujara, and Sharat Narayan from the mail engineering team.
“The mail team and research teams complete each other,” said Weinberger. “We know a lot about machine learning and they know absolutely everything there is to know about spam.” It was the perfect marriage for such a significant endeavor.
Zinkevich saw this as a monumental opportunity to make an impact on one of the most visible Yahoo! properties in the world. “We are the largest email service provider in the world,” said Zinkevich. “Because of this, we are a constant target of spammers who spend their livelihoods keeping up with trends to circumvent the latest spam-fighting techniques.”
The teams quickly became aware that the challenges of developing a spam classifier for the largest mail service in the world are very different from those of document classification in academic settings. “Each email has to be handled in a couple of milliseconds,” said Dasgupta. “Some spammers try to trick our algorithm by labeling their own emails as non-spam so that they leak through.”
Currently implemented for a segment of premium English-language users, Sparta will soon expand in scope and be rolled out worldwide with internationalization. To keep the spam filter future-proof, it is explicitly designed to be extended and modified in many ways, so that it can adapt to new trends in spam emails and take advantage of improved classification algorithms in the coming years.
“In addition to making a dent on spam for premium users, Sparta, for the first time, provides us a medium to long term content classifier that can scale well even to the free email user segment,” said Ramarao.
Aside from dramatically reducing the volume of spam email that leaks into users’ in-boxes, Sparta also applies strict criteria to minimize the number of legitimate emails that slip into bulk email folders.
Sparta has been a major milestone for the research and engineering teams. The research team spent significant time on the intellectual aspects of the problem, focusing on building a cutting-edge classifier that can work in an adversarial setting. The engineering team, on the other hand, brought the research team up to speed on domain knowledge and possible pitfalls, and also made sure the code was scalable, bug-free, and lightning fast.
“I think the research and engineering engagement in this project was an excellent example of how both teams can work to bring sufficient thought in science and engineering to product development,” said Ramarao. “We hope to continue solving high-value business problems that are also challenging from a research perspective.”