Mail

Mind the Gap: Large-Scale Frequent Sequence Mining

Publication
Jun 22, 2013
[Work published prior to Yahoo]
Abstract

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a "projected database" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.

  • ACM SIGMOD International Conference on Management of Data (SIGMOD 2013)
  • Conference/Workshop Paper

BibTeX