A Novel, Diverse Dataset for Automatic Video Summarization

Jun 4, 2015


 Thumbnail images of the 50 videos in the TVSum50 dataset.

By Yale Song

This year, US adults will spend an average of 1 hour, 16 minutes each day with video on digital devices.1 Yahoo serves over 1.5 billion content video streams across its various sites,2 and on YouTube over 300 hours of video are uploaded every minute.3 Now more than ever, it is crucial to help viewers use their time wisely and navigate all of their video options thoughtfully. Today, we are excited to aid that effort with the release of our new video summarization dataset, TVSum50.

In order to provide a best-in-class video user experience so that our audience can easily browse and find interesting content, the Computer Vision group at Yahoo Labs has been working tirelessly on automatic video summarization, where the goal is to convey the gist of a video by extracting its highlights and important moments. We believe that well-designed video summaries have great potential to improve many aspects of a user's video experience. By allowing users to glance through many videos in a short period of time, we can help them make quicker decisions on what content to watch next so as not to waste their time. To that end, we've developed new algorithms for automatic video summarization and published two research papers appearing in the proceedings of next week's IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in Boston.

Our papers describe two effective and scalable ways to automatically identify important moments in a video, regardless of genre, by leveraging the vast amount of images and videos across the Web. Our paper entitled “TVSum: Summarizing Web Videos Using Titles” proposes a system that uses title-based image search results to summarize Web videos. The approach is motivated by the observation that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of that topic.
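To make the intuition concrete, here is a minimal sketch (not the actual TVSum system) of scoring shots by visual similarity to title-based image search results. The feature vectors are random stand-ins for real visual descriptors, and the 15% summary budget is an illustrative choice, not a parameter from the paper.

```python
import numpy as np

# Toy stand-ins for real visual features: one descriptor per video shot,
# and one per image retrieved by searching the Web for the video's title.
rng = np.random.default_rng(0)
n_shots, n_title_images, dim = 20, 30, 128
shot_features = rng.normal(size=(n_shots, dim))
title_image_features = rng.normal(size=(n_title_images, dim))

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

shots = l2_normalize(shot_features)
refs = l2_normalize(title_image_features)

# Each shot's importance: mean cosine similarity to the title-search images.
importance = (shots @ refs.T).mean(axis=1)

# Keep the top 15% highest-scoring shots as the summary.
budget = max(1, int(0.15 * n_shots))
summary_shots = np.argsort(importance)[::-1][:budget]
print(sorted(summary_shots.tolist()))
```

The key design point this illustrates is that no manual labels are needed: the title-based image search results act as a free, topic-specific reference set against which every shot is scored.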

Our second paper, entitled “Video Co-summarization: Video Summarization by Co-occurrence,” proposes a system that exploits visual co-occurrence across multiple videos. Motivated by the observation that important visual concepts tend to appear repeatedly across videos of the same topic, the system summarizes a video by finding shots that co-occur most frequently across videos collected from the Web using topical keywords. In each paper, we describe a novel and interesting computational algorithm designed to deal with challenges unique to each scenario, and show their effectiveness by evaluating them on various genres of videos, including our TVSum50 dataset.
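A hedged sketch of the co-occurrence intuition follows; it is not the published algorithm. Shots from several same-topic videos are grouped into visual clusters (here crudely approximated by quantizing toy one-dimensional features), and a shot scores highly when its cluster appears in many different videos. All data below is synthetic.

```python
# Toy feature values: one number per shot, three videos on the same topic.
videos = {
    "vid_a": [0.1, 0.9, 0.5],
    "vid_b": [0.12, 0.48, 0.8],
    "vid_c": [0.11, 0.52, 0.9],
}

def cluster_of(feature, n_bins=4):
    # Crude stand-in for visual clustering: quantize the feature into bins.
    return min(int(feature * n_bins), n_bins - 1)

# For each visual cluster, record which distinct videos contain it.
videos_per_cluster = {}
for vid, feats in videos.items():
    for f in feats:
        videos_per_cluster.setdefault(cluster_of(f), set()).add(vid)

# A shot's co-occurrence score = number of videos sharing its visual cluster.
scores = {
    vid: [len(videos_per_cluster[cluster_of(f)]) for f in feats]
    for vid, feats in videos.items()
}
print(scores["vid_a"])
```

In this toy example the first and second shots of `vid_a` fall into clusters shared by all three videos, so they outscore the shot whose cluster appears in only two; the same ranking principle, applied to real visual features, drives the co-summarization idea.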

The TVSum50 dataset, available to the academic research community via our Webscope data-sharing program, contains 50 videos and their content importance scores annotated via crowdsourcing. The videos with their corresponding scores allow researchers who work in video summarization to rapidly evaluate their own algorithms and to iterate the development process multiple times, without having to conduct extensive user studies each time an evaluation is needed.

 Video 1: Full version of “Will a Cat Eat Dog Food?” by ElPerfecto.com. License: CC-BY 3.0.

 Video 2: Summary version of “Will a Cat Eat Dog Food?” video after applying a version of our TVSum video summarization algorithm.

The 50 videos represent various genres including news, interviews, how-to instructions, documentaries, and user-generated content (e.g., vlog, egocentric). The variety of genres helps ensure the generalizability of summarization techniques in various settings.

The content importance scores included in the dataset are annotated using our carefully designed annotation task user interface, which mimics the conventional video watching experience. The scores were collected for every two-second interval, on a scale of one to five, in order to indicate which parts of a video are more important than others, and thus “summary worthy.” We've also incorporated several novel concepts into the task interface so that the resulting annotation is of high quality (meaning, it has a high level of inter-rater reliability).
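The annotation format lends itself to simple programmatic evaluation. Below is an illustrative sketch, with made-up scores, of how per-interval ratings on a 1-to-5 scale could be averaged across annotators and thresholded into a reference summary; the 25% budget is a hypothetical choice for illustration.

```python
import numpy as np

# Made-up ratings: three annotators score eight two-second intervals (1-5).
annotator_scores = np.array([
    [1, 2, 5, 4, 1, 1, 3, 5],   # annotator 1
    [2, 2, 4, 5, 1, 2, 3, 4],   # annotator 2
    [1, 3, 5, 4, 2, 1, 2, 5],   # annotator 3
])

# Average across annotators to get one importance value per interval.
mean_importance = annotator_scores.mean(axis=0)

# Keep the top 25% of intervals as the "summary-worthy" segments.
n_keep = max(1, int(0.25 * mean_importance.size))
top_intervals = np.argsort(mean_importance)[::-1][:n_keep]
for i in sorted(top_intervals):
    start, end = 2 * i, 2 * i + 2
    print(f"interval {i}: {start}-{end}s, mean score {mean_importance[i]:.2f}")
```

Because the reference scores are precomputed, an algorithm's output can be compared against them directly, which is exactly what lets researchers skip a fresh user study on every evaluation run.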

There are only a few similar benchmark datasets for video summarization available in the community, and most of them are either smaller in scale, biased toward specific video genres (e.g., egocentric), or suffer from low inter-rater reliability. We hope the release of our TVSum50 dataset will give researchers a new, dynamic tool to evaluate their video summarization algorithms rapidly and with a significant variety of genres to choose from.

1 eMarketer, April 2015: US Adults Spend 5.5 Hours with Video Content Each Day
2 comScore VideoMetrix, April 2015, content video streams only for April 2015