Publications > Advertising Science > Hadoop-MTA: a system for Multi Data-center Trillion Concepts Auto-ML atop Hadoop

Hadoop-MTA: a system for Multi Data-center Trillion Concepts Auto-ML atop Hadoop

Publication

Dec 15, 2021

Abstract

The ever-growing computation capability dis- tributed infrastructure brings tremendous opportunities for mining and analysis of data that was impossible otherwise. Meanwhile, the inherent computation model of distributed system also brings unique and non-trivial challenges for traditional Auto- ML, including the explosion of data dimensions, the expected absence of features, and the heterogeneity of information. This is especially the case in modern Internet enterprises, where data in the scale of trillions are stored in multiple data centers, and the discovery of subtle signals could incur significant impact in revenue and welfare. How can we best harness the large scale distributed machine learning, but without keeping engineers constantly in the loop? In this work, we present Hadoop-MTA, a system for Multi Data-center, Trillion Concepts, Auto-ML on top of the Hadoop distributed computation environment that leverages sparsity aware heterogeneous knowledge graph representation and dimensionality agnostic parallel learning. Through multiple large scale experiments, we find that Hadoop- MTA significantly output-performs competitive state of the art distributed learning algorithms and scales well to trillion scale data-sets. Our model is rolled out to Hadoop serving infrastructure in Yahoo covering billions of unique identities and shows improvements 129.5% accuracy and 106.5 % weighted F1-score (more than 2x) on key targeting use cases.

Download

Venue:

IEEE International Conference On Big Data (BigData 2021)

Type:

Conference/Workshop Paper

Authors:

BibTeX

@inproceedings{ author = {Keqian Li and Yifan Hu and Manisha Verma and Fei Tan and Changwei Hu and Tejaswi Kasturi and Kevin Yen}, title = {Hadoop-MTA: a system for Multi Data-center Trillion Concepts Auto-ML atop Hadoop}, booktitle = {Proceedings of IEEE International Conference On Big Data}, year = {2021} }

- Help
- About our ads

Hadoop-MTA: a system for Multi Data-center Trillion Concepts Auto-ML atop Hadoop

Publication

Abstract

IEEE International Conference On Big Data (BigData 2021)

Conference/Workshop Paper

Keqian Li

Yifan Hu

Manisha Verma

Fei Tan

Changwei Hu

Tejaswi Kasturi

Kevin Yen

BibTeX