In a typical shared-nothing parallel text retrieval system, the inverted index is distributed across the processors. This distribution is usually achieved by term-based partitioning of the index: the responsibility for processing each term in the vocabulary is assigned to exactly one processor. The disadvantage of this approach is the high communication overhead incurred during parallel query processing. In this work, we propose a novel inverted index partitioning model for communication-efficient query processing on parallel text retrieval systems that adopt the term-based inverted index organization. The proposed model formulates the index partitioning problem as a hypergraph partitioning problem. It aims to balance the storage loads of the processors while trying to minimize the volume of communication during parallel query processing. We report performance results over a TREC document collection containing 210,157 documents.
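The two objectives named above, storage balance and communication volume, can be illustrated with a minimal sketch. It is not the model proposed in this work; it assumes, for illustration only, that each term is a vertex weighted by its posting list size, that each query acts as a net connecting its terms, and that the number of extra processors a query touches (the connectivity-minus-one metric common in hypergraph partitioning) serves as a proxy for communication. All names, sizes, and queries are hypothetical.

```python
def storage_loads(posting_sizes, assignment, num_procs):
    """Sum the posting list sizes of the terms assigned to each processor."""
    loads = [0] * num_procs
    for term, size in posting_sizes.items():
        loads[assignment[term]] += size
    return loads


def connectivity_cost(queries, assignment):
    """Assumed communication proxy: for each query, count the processors
    holding its terms beyond the first one (connectivity minus one)."""
    cost = 0
    for query in queries:
        procs = {assignment[term] for term in query}
        cost += max(len(procs) - 1, 0)
    return cost


# Hypothetical posting list sizes and query log.
posting_sizes = {"parallel": 40, "index": 30, "query": 35, "hypergraph": 35}
queries = [{"parallel", "index"}, {"index", "query"}, {"hypergraph", "query"}]

# Assignment that ignores which terms co-occur in queries.
naive = {"parallel": 0, "index": 1, "query": 0, "hypergraph": 1}
# Assignment that keeps frequently co-occurring terms together.
aware = {"parallel": 0, "index": 0, "query": 1, "hypergraph": 1}

for name, assignment in (("naive", naive), ("aware", aware)):
    print(name, storage_loads(posting_sizes, assignment, 2),
          connectivity_cost(queries, assignment))
```

In this toy instance the co-occurrence-aware assignment keeps the loads balanced while lowering the proxy cost, which is the kind of trade-off a hypergraph partitioner searches for at scale.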
ACM COPYRIGHT NOTICE. Copyright © 2012 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM, Inc., fax +1 (212) 869-0481, or email@example.com.