That’s Not My Question: Learning to Weight Unmatched Terms in CQA Vertical Search

Jul 18, 2016

A fundamental task in Information Retrieval (IR) is term weighting. Early IR theory considered both the presence or absence of all terms in the lexicon for ranking and needed to weight them all. Yet, as the size of lexicons grew and models became too complex, common weighting models preferred to aggregate only the weights of the query terms that are matched in candidate documents. Thus, unmatched term contribution in these models is only considered indirectly, such as in probability smoothing with corpus distribution, or in weight normalization by document length. In this work we propose a novel term weighting model that directly assesses the weights of unmatched terms, and show its benefits. Specifically, we propose a Learning To Rank framework, in which features corresponding to matched terms are also "mirrored" in similar features that account only for unmatched terms. The relative importance of each feature is learned via a click-through query log. As a test case, we consider vertical search in Community-based Question Answering (CQA) sites from Web queries. Queries that result in viewing CQA content often contain fine grained information needs and benefit more from unmatched term weighting. We assess our model both via manual evaluation and via automatic evaluation over a clickthrough log. Our results show consistent improvement in retrieval when unmatched information is taken into account. This holds both when only identical terms are considered matched, and when related terms are matched via distributional similarity.

  • Special Interest Group on Information Retrieval (SIGIR 2016)
  • Conference/Workshop Paper