Building Better Online Communities
The Web is teeming with communities, each with its own set of interests and personalities. There are communities for video game addicts, tennis enthusiasts, academic researchers, and coin collectors, to name just a few.
Today, members of these communities can retrieve unstructured information on the subject of their choice via keyword searches. For example, a graduate student focusing on database research can find web pages of individuals, academic departments, research groups, projects, papers, and conferences—all related to databases.
But what if you want to know who in your community is about to publish a new paper or give a big talk? Or which course recently cited a paper of interest to you? Or what the connections are between two particular database researchers? This kind of structured information is significantly harder to access.
Enter Cimple, a joint initiative between the University of Wisconsin and Yahoo! Research, focused on community information management (the CIM of Cimple). An example of Cimple in action is the DbLife prototype (http://dblife.wisc.edu), which illustrates the power of structured information to manage communities of database enthusiasts. DbLife allows members to aggregate community data, then query, monitor, and discover certain information about other members.
The limitation, however, is that DbLife provides extracted information for just one community of interest: the database community. What about other academic communities, such as statistics or sociology? Or other communities of interest, such as video game players and tennis buffs?
This is where a new Yahoo! Research project called Purple SOX (PSOX) enters the picture. PSOX seeks to enable “do-it-yourself” information extraction within certain general areas known as “templates.” “The purely algorithmic approaches will not suffice,” says Raghu Ramakrishnan, Chief Scientist, Audience and Research Fellow. “We have to empower engaged users such as community moderators to participate in and refine the information gathering process.”
In fact, the need to leverage engaged users is so critical that the “SOX” in PSOX stands for Socially Oriented eXtraction.
For the coin collecting community, for instance, members could suggest new sites with information about coins, coin collections, coin shows, and so on. Given this input, PSOX could crawl the web and populate the community site with extracted information about the coin world. Community members could then give feedback on quality and interest, so that the extracted information improves until only the latest and greatest information about the coin world is shown. (Of course, this extraction would only be done from publicly available web pages.)
PSOX has two parallel thrusts: one on building advanced and transferable extraction technologies, and the other on managing the extraction effort in a manner that facilitates community input.
On the first front, PSOX has the ability to efficiently and accurately extract structured information from raw data. PSOX leverages a Yahoo!-developed cross-vertical information extraction platform called Vertex, which was built by the Yahoo! Advanced Technology Group based in Bangalore. Vertex can scrape web pages such as product offers and extract information from them according to a pre-specified schema. The structured data is then stored in a database and can be easily indexed and searched, thus providing users with an enriched search and browse experience.
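To make the idea of extracting "according to a pre-specified schema" concrete, here is a minimal sketch in Python. It is not Vertex's actual interface, which the article does not describe; the schema format, the regular expressions, the sample page, and the `offers` table are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical schema: field name -> regex with one capture group.
# These patterns are illustrative only; they are not Vertex's rules.
PRODUCT_SCHEMA = {
    "title": r'<h1 class="title">(.*?)</h1>',
    "price": r'<span class="price">\$([0-9]+(?:\.[0-9]{2})?)</span>',
}

def extract_record(html, schema):
    """Return one dict with a value (or None) for every schema field."""
    record = {}
    for field, pattern in schema.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record

def store_records(records):
    """Load records into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE offers (title TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO offers (title, price) VALUES (:title, :price)", records
    )
    conn.commit()
    return conn

# A made-up product-offer page.
PAGE = (
    '<html><h1 class="title">1909 Lincoln Cent</h1>'
    '<span class="price">$12.50</span></html>'
)

record = extract_record(PAGE, PRODUCT_SCHEMA)
conn = store_records([record])

# Once the data is structured, it can be queried rather than keyword-searched.
rows = conn.execute("SELECT title FROM offers WHERE price < 20").fetchall()
```

The point of the sketch is the last line: after extraction, a question such as "which offers cost less than $20?" becomes an ordinary database query.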
PSOX team members are collaborating with the ATG to bring new information extraction technologies to bear. “A key goal of Purple SOX is to develop new learning methodologies for structured prediction that support adaptability across domains,” says Srujana Merugu, Research Scientist.
For example, a typical machine-learning operator can learn to extract web-based bibliographic information from a particular research domain, such as physics. But the operator would have to learn how to do that all over again for a new domain, like chemistry or biomedicine. With domain adaptation, PSOX could more quickly and efficiently move into new domains and communities.
On the second front of managing the extraction effort, PSOX aims to make extraction technology expandable and explainable. Making it expandable means that new extraction technologies can be easily added through clean, declarative interfaces, and automatically incorporated by the system into its extraction tasks.
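One way to picture a "clean, declarative interface" is a registry in which each extraction operator states only what it needs and what it produces, and the system decides when to run it. The sketch below assumes this design; the names `OperatorSpec`, `register`, and `run_pipeline` are hypothetical and do not come from PSOX.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class OperatorSpec:
    """Declarative description of an operator (hypothetical format)."""
    name: str
    inputs: tuple          # fields the operator needs
    output: str            # field the operator produces
    run: Callable[[dict], object]

REGISTRY: List[OperatorSpec] = []

def register(name, inputs, output):
    """Decorator that adds an operator to the registry."""
    def decorator(fn):
        REGISTRY.append(OperatorSpec(name, tuple(inputs), output, fn))
        return fn
    return decorator

# Registered first, but it depends on "year", so it will run second.
@register("century", inputs=["year"], output="century")
def century(fields):
    year = fields["year"]
    return None if year is None else (year - 1) // 100 + 1

@register("find_year", inputs=["text"], output="year")
def find_year(fields):
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", fields["text"])
    return int(match.group(1)) if match else None

def run_pipeline(initial_fields):
    """Run every registered operator as soon as its inputs are available."""
    fields = dict(initial_fields)
    pending = list(REGISTRY)
    progress = True
    while pending and progress:
        progress = False
        for op in list(pending):
            if all(key in fields for key in op.inputs):
                fields[op.output] = op.run(fields)
                pending.remove(op)
                progress = True
    return fields

result = run_pipeline({"text": "A 1909 Lincoln cent in fine condition"})
```

Adding a new operator here means writing one decorated function; `run_pipeline` does not change, which is the sense of "automatically incorporated" that this sketch tries to capture.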
Making it explainable means that the history of extraction is tracked by the system so that over time feedback from users can be used to improve the extraction process. For example, if information extracted about a particular coin collector is incorrect, the community could simply click a button to flag that data—ultimately allowing the system to assign greater confidence to higher quality extraction results.
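A rough sketch of how tracked history and user flags could feed into confidence follows. The article does not say how PSOX scores results, so the `Extraction` record, the smoothed scoring rule, and the example URLs and operator names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """One extracted fact plus the history needed to explain it."""
    value: str
    source_url: str   # page the value came from
    operator: str     # extractor that produced it
    confirms: int = 0
    flags: int = 0

    def confidence(self):
        # Assumed scoring rule: Laplace-smoothed share of positive feedback.
        return (self.confirms + 1) / (self.confirms + self.flags + 2)

def operator_confidence(extractions, operator):
    """Pool feedback across everything one operator produced."""
    own = [e for e in extractions if e.operator == operator]
    confirms = sum(e.confirms for e in own)
    flags = sum(e.flags for e in own)
    return (confirms + 1) / (confirms + flags + 2)

# Made-up facts with made-up provenance.
facts = [
    Extraction("Collector: J. Doe, Chicago", "http://example.com/a", "regex_bio"),
    Extraction("Collector: A. Roe, Austin", "http://example.com/b", "regex_bio"),
    Extraction("Show: Spring Coin Expo", "http://example.com/c", "table_parser"),
]

facts[0].flags += 2      # two members flag the first fact as incorrect
facts[2].confirms += 3   # three members confirm the third fact

# Show the most trusted facts first.
ranked = sorted(facts, key=lambda e: e.confidence(), reverse=True)
```

Because each fact records which operator produced it, a flag on one fact can also lower trust in that operator's other output, which is what `operator_confidence` illustrates.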
“The main goal of PSOX is to provide a platform that helps us scale information extraction to more domains with less expert activity and knowledge,” concludes Philip Bohannon, Principal Research Scientist, who works on the project along with Raghu Ramakrishnan, Vipul Agarwal, Arun Iyer, Vinay Kakade, Srujana Merugu, Bo Pang, Saathiya Keerthi, Cong Yu, Nilesh Dalvi, Srinivasan H Sengamedu, Krishna Prasad Chitrapura, and Yahoo! summer interns Pedro DeRose, Ashwin Machanavajjhala, and Warren Shen.