A semantic relatedness preserved subset extraction method for language corpora based on pseudo-Boolean optimization

THEORETICAL COMPUTER SCIENCE(2020)

引用 3|浏览16
暂无评分
摘要
As language corpora have been playing an increasingly important role in the field of Artificial Intelligence (AI) research, lots of extremely large corpora are created. However, a larger corpora size not only increases power and accuracy but also brings redundancy. Therefore, researchers began to emphasize the study of appropriate subset extraction methods. Due to the trade-off between data sufficiency and redundancy, a group of interesting and challenging problems are emerged that are studied in this paper: (1) How to make the resulting subset include as much data as possible under some necessary constraints? (2) How to preserve the potential useful semantic relatedness included in the original corpora while reducing the size of the corpora? For these two problems, existing work mainly focuses on the methods to construct particular subsets for special usage. These methods are limited in their focus. In this paper, we try to address the problems listed above. First, considering the cubic and binary semantic relatedness among tokens, we construct a general system model and formulate the mix problem as a cubic pseudo-Boolean optimization problem. Then, by analyzing the characteristics of the objective function, we transfer the problem into the maximum flow problem of a corresponding graph. Third, we propose a new algorithm by introducing discrete Lagrangian iteration method. We prove that the objective function is supermodular, which allows us to use fast minimum cut algorithms in each iteration step to propose another fast algorithm. Finally, we experimentally validate our new algorithms on several randomly created corpora. (C) 2020 Elsevier B.V. All rights reserved.
更多
查看译文
关键词
Semantic relatedness,Subset extraction,Language intelligence,PseudoBoolean optimization,Discrete Lagrangian method
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要