Automatic Pipeline for Information of Curve Graphs in Papers Based on Deep Learning

crossref(2024)

Abstract Extracting information from the vast body of scientific literature can help researchers quickly grasp the current state of a field. Literature carries data in multiple forms, yet most researchers attend only to the text. Curve graphs in particular contain a great deal of critical numerical information that is not expressed anywhere else. This paper proposes a method for mining information from curve graphs in the literature. With this method, the numerical values and the coordinate axis entities of curve graphs are extracted from the graphs and the text. First, curve graphs are cropped from the literature with YOLOv5s. Then, the title text corresponding to each curve graph is matched using Sentence-BERT. Once the title text is obtained, the X-axis and Y-axis names of the curve graph are extracted from the title with SciBERT. Meanwhile, techniques such as optical character recognition (OCR) are employed to automatically parse the numerical data shown on the charts. In addition, several principles are adopted to improve performance. We validate each step on a dataset of 6042 articles from Elsevier. With our principles, the accuracies of curve graph detection and title matching are 96.4% and 95.8%, respectively. Both exceed the accuracies of the initial models, demonstrating the effectiveness of our principles. Entity extraction and numerical data extraction achieve 76.3% and 28.2%, respectively. The experimental results show that our method can extract knowledge from curve graphs in the literature at large scale.
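The abstract does not describe how OCR output is turned into numerical data, so the following is only a minimal sketch of one common approach, not the authors' implementation. It assumes linear axes, that OCR has already read two tick labels per axis together with their pixel positions, and that the curve has already been traced as a list of pixel coordinates. The function names `pixel_to_value` and `pixels_to_data` and all sample numbers are hypothetical.

```python
from typing import List, Tuple

# (pixel_a, value_a, pixel_b, value_b): two OCR-read tick labels on one axis.
# This tuple layout is an assumption made for illustration.
AxisTicks = Tuple[float, float, float, float]


def pixel_to_value(pixel: float, ticks: AxisTicks) -> float:
    """Linearly interpolate a data value from a pixel position on one axis."""
    pixel_a, value_a, pixel_b, value_b = ticks
    if pixel_a == pixel_b:
        raise ValueError("calibration ticks must sit at different pixel positions")
    return value_a + (pixel - pixel_a) * (value_b - value_a) / (pixel_b - pixel_a)


def pixels_to_data(
    points: List[Tuple[float, float]],
    x_ticks: AxisTicks,
    y_ticks: AxisTicks,
) -> List[Tuple[float, float]]:
    """Convert traced curve pixels (column, row) into (x, y) data values."""
    return [
        (pixel_to_value(col, x_ticks), pixel_to_value(row, y_ticks))
        for col, row in points
    ]


if __name__ == "__main__":
    # Hypothetical calibration: x = 0 at column 100, x = 10 at column 500;
    # y = 0 at row 400, y = 100 at row 80 (image rows grow downward).
    x_ticks = (100.0, 0.0, 500.0, 10.0)
    y_ticks = (400.0, 0.0, 80.0, 100.0)
    curve_pixels = [(100.0, 400.0), (300.0, 240.0), (500.0, 80.0)]
    print(pixels_to_data(curve_pixels, x_ticks, y_ticks))
```

Because image rows increase downward, the Y-axis calibration simply uses a larger pixel position for the smaller data value; the same interpolation handles both axes. Logarithmic axes would need the interpolation applied to the logarithm of the tick values instead.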