Joint Distributed Representation of Text and Structure of Semi-Structured Documents.

HT(2018)

Abstract
The majority of textual data on the web is in the form of semi-structured documents. The structural skeleton of such documents therefore plays an important role in determining the semantics of their content. The presence of structure sometimes allows us to write simple rules to extract such information, but this is not always possible because the structure is flexible and is altered frequently. In this paper, we propose a joint model of text and its associated structure to effectively capture the semantics of semi-structured documents. The model simultaneously learns dense continuous representations for word tokens and for the structure associated with them. We utilize the context of structures for projection such that similar structures containing semantically similar topics are close to each other in the vector space. We explore two semantic text mining tasks over web data to test the effectiveness of our representation, viz., document similarity and table semantic component identification. With traditional rule-based approaches, both of these tasks demand rich, domain-specific knowledge sources, a homogeneous schema for the documents, and rules that capture the semantics. Our approach, on the other hand, is unsupervised and resource-conscious in nature. Despite working without knowledge resources and large training data, it performs on par with state-of-the-art rule-based and other unsupervised approaches.
Keywords
Semantic Document Representation, Document structure, Text mining, Classification and Clustering