DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

Aug 3, 2017

In this work, document tagging refers to tagging news articles or blog posts with relevant tags from a collection of predefined ones. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. We propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of provided feature vectors, and in addition offers advantages such as learned tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.
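
The prediction step described in the abstract, a k-nearest neighbor search over tag embeddings in the joint vector space, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `predict_tags`, the use of cosine similarity, and the toy 2-dimensional vectors are all assumptions made for illustration, and the sketch assumes the embedding of the unseen document has already been obtained.

```python
import numpy as np

def predict_tags(doc_vec, tag_vecs, tag_names, k=2):
    """Return the k tags whose embeddings are nearest to doc_vec.

    Cosine similarity is an assumed choice of metric for this sketch.
    """
    # Normalize so that the dot product equals cosine similarity.
    doc = doc_vec / np.linalg.norm(doc_vec)
    tags = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
    sims = tags @ doc
    # Indices of the k most similar tags, most similar first.
    top = np.argsort(-sims)[:k]
    return [tag_names[i] for i in top]

# Toy embeddings (hypothetical values, not learned by any model).
tag_names = ["sports", "politics", "technology"]
tag_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
doc_vec = np.array([0.9, 0.2])

predicted = predict_tags(doc_vec, tag_vecs, tag_names, k=2)
print(predicted)
```

Because tags live in the same space as documents, a newly created tag only needs its own embedding row appended to `tag_vecs` in this sketch; the search itself does not change.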

  • The 2nd Workshop on Representation Learning for NLP 2017 (ACL Rep4NLP 2017)
  • Conference/Workshop Paper