Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop
Journal of Web Semantics, p. 100546, 2020.
Keywords:
Relation extraction, Web mining, Knowledge graph mining, Human-in-the-loop
Abstract:
The Semantic Web movement has produced a wealth of curated collections of entities and facts, often referred to as Knowledge Graphs. Creating and maintaining such Knowledge Graphs is far from being a solved problem: it is crucial to constantly extract new information from the vast amount of heterogeneous sources of data on the Web. In this w...
Introduction
- The vision of the Semantic Web (SW) is to make information on the Web machine understandable.
- Many organizations build their own Knowledge Graphs (KG) [3], i.e. curated collections of interlinked descriptions of entities and factual information in their business domain of interest.
- Maintenance of such graphs is crucial: whenever new data becomes available, it needs to be added to the KG.
- Knowledge graph population relies on extracting new entities – or new relations between entities – from different sources, which can be unstructured text or other existing knowledge graphs.
Highlights
- The vision of the Semantic Web (SW) is to make information on the Web machine understandable
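- The compared configurations are the web mining approach (W-Rex), the web mining approach with Human-in-the-Loop (W-Rex +HUML), and the web mining approach with Human-in-the-Loop combined with the knowledge graph mining approach (W-Rex +HUML+G-Rex)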
- In this work we show how adding a human-in-the-loop component to the extraction system nearly doubled the precision and increased the recall by nearly 50%
- In this work we focused on the dataset provided by the ISWC 2018 Semantic Web Challenge, with the objective of identifying supply-chain relations among organizations in the Thomson Reuters Knowledge Graph, achieving the best performance among all the competing systems
- Note that our system and processing pipeline are not limited to this domain and dataset, but can be used for any type of relation extraction from Web documents and Knowledge Graphs
- Each module in our pipeline is built independently of the domain and the application, which allows it to be used in many domains and applications for large-scale relation extraction
Results
- The authors evaluate the system using the official evaluation system provided by the challenge organizers. The evaluation uses standard metrics: Precision, the fraction of predicted relations that are correct; Recall, the fraction of correct relations that were predicted; and F-1, the harmonic mean of precision and recall.
- Using the G-Rex approach the authors were able to identify 30,183 relations, which combined with the W-Rex + HUML results gives a total of 352,774 relations.
- By combining the web mining approach, the human-in-the-loop filtering and the knowledge graph mining approach, the authors are able to improve the results by an additional 20%.
- This improvement is achieved by identifying an additional 32,183 relations, which is only 9.1% of the total number of relations.
- This confirms that the quality of the relations extracted from DBpedia is significantly higher than that of the relations extracted from Web documents.
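The metrics named above can be made concrete with a minimal sketch. It assumes relations are represented as (subject, predicate, object) triples and compared by exact match; the function name `precision_recall_f1` and the toy triples are hypothetical and are not taken from the challenge data or its evaluation tool. The last lines only redo the arithmetic behind the reported 9.1% share.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of relation triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted relations (illustrative only).
gold = {
    ("A", "supplies", "B"),
    ("A", "supplies", "C"),
    ("D", "supplies", "E"),
    ("F", "supplies", "G"),
}
predicted = {
    ("A", "supplies", "B"),
    ("D", "supplies", "E"),
    ("X", "supplies", "Y"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571

# Share of the combined relation set contributed by the additional relations,
# using the counts quoted in the results (32,183 out of 352,774).
kg_share = 32_183 / 352_774
print(f"{kg_share:.1%}")  # prints: 9.1%
```

Exact-match comparison on triples is the simplest possible choice; the official evaluation may normalise identifiers differently.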
Conclusion
- Extracting knowledge from heterogeneous data sources remains a fundamentally hard problem, especially when high accuracy is a requirement.
- In this work the authors show how adding a human-in-the-loop component to the extraction system nearly doubled the precision and increased the recall by nearly 50%.
- Adding a Knowledge Graph Mining component resulted in an "orthogonal" increase in accuracy, complementing the advantage of the human-in-the-loop approach.
- The authors note that the system and processing pipeline are not limited to this domain and dataset, but can be used for any type of relation extraction from Web documents and Knowledge Graphs.
- The system was reused in the pharmaceutical domain for extracting "drug-drug interaction" and "drug - adverse drug event" relations from a set of PDF documents [4].
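One way to picture the "orthogonal" contribution of the two sources is as set operations over relation pairs. The sketch below is an assumption-laden illustration, not the authors' pipeline: the function `combine_relations`, the idea that human review yields a set of rejected candidates, and all company names are hypothetical.

```python
def combine_relations(web_candidates, human_rejected, kg_relations):
    """Merge human-filtered web relations with knowledge-graph relations.

    Returns the combined set and the relations found only in the
    knowledge graph, i.e. the part the web pipeline did not recover.
    """
    # Assumption: human-in-the-loop review is modelled as a reject list.
    web_kept = set(web_candidates) - set(human_rejected)
    combined = web_kept | set(kg_relations)
    kg_only = set(kg_relations) - web_kept
    return combined, kg_only


# Hypothetical (supplier, customer) pairs.
web_candidates = {
    ("Acme Steel", "Globex Motors"),
    ("Initech", "Umbrella Corp"),
    ("Initech", "Acme Steel"),
}
human_rejected = {("Initech", "Acme Steel")}
kg_relations = {
    ("Initech", "Umbrella Corp"),
    ("Hooli", "Globex Motors"),
}

combined, kg_only = combine_relations(web_candidates, human_rejected, kg_relations)
print(len(combined), sorted(kg_only))
# prints: 3 [('Hooli', 'Globex Motors')]
```

The size of `kg_only` relative to `combined` is the kind of share the results report for the knowledge graph mining step.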
Tables
- Table 1: Results of the web mining approach (Web Mining), the web mining approach with Human-in-the-Loop (Web Mining + HUML) and the web mining approach with Human-in-the-Loop combined with the knowledge graph mining approach (Web Mining + HUML + KG Mining)
Related work
- 2.1. Relation extraction
The task of Relation Extraction has been extensively addressed in the literature. There is no one-size-fits-all model to solve the task, as much depends on the specific relation to extract and the data at hand. State-of-the-art systems range from early solutions based on SVMs and tree kernels [5,6,7,8,9] to more recent ones exploiting neural architectures [10,11,12]. Regardless of the model, one of the key hurdles, as in many machine learning tasks, is obtaining sufficient relevant training data. Distant supervision has been successfully used in the literature: the key idea is to exploit large knowledge bases to automatically label entities in text [13,14,15,16,17]. One of the main problems with distant supervision is poor coverage for tail entities [15]. One way to tackle the problem is to use targeted human annotations to expand the large pool of examples labeled with distant supervision [18]. This combined approach produced good results in the 2013 KBP English Slot Filling task. It was further shown that targeted training for use in the entity identification step of relation extraction can result in extremely fast learning [19]. We exploit this idea and focus our human-in-the-loop strategy on an entity expansion phase, which helps achieve top performance on the final relation extraction task.
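The distant supervision idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any cited system: the seed pairs, the sentences, the relation name `supplier_of` and the function `distant_labels` are all hypothetical, and entity mentions are found by plain substring matching.

```python
from itertools import permutations

# Hypothetical seed knowledge base: (supplier, customer) pairs known to hold.
KNOWN_SUPPLIER_PAIRS = {
    ("Acme Steel", "Globex Motors"),
    ("Initech", "Umbrella Corp"),
}

# Hypothetical unlabeled sentences.
SENTENCES = [
    "Acme Steel delivers sheet metal to Globex Motors every quarter.",
    "Initech and Globex Motors sponsored the same conference.",
    "Acme Steel and Globex Motors both reported record profits.",
    "Tiny Parts Ltd ships bolts to Globex Motors.",
]


def distant_labels(sentences, known_pairs, relation="supplier_of"):
    """Label every ordered pair of known entities that co-occur in a sentence."""
    entities = sorted({name for pair in known_pairs for name in pair})
    labeled = []
    for sentence in sentences:
        # Naive mention detection by substring match (an assumption).
        mentioned = [name for name in entities if name in sentence]
        for subj, obj in permutations(mentioned, 2):
            label = relation if (subj, obj) in known_pairs else "no_relation"
            labeled.append((sentence, subj, obj, label))
    return labeled


for sentence, subj, obj, label in distant_labels(SENTENCES, KNOWN_SUPPLIER_PAIRS):
    print(f"{label:12} {subj} -> {obj} | {sentence}")
```

The toy data shows two known weaknesses. The third sentence is labeled `supplier_of` although it does not express the relation, which is the label noise that noise-reduction methods target. The fourth sentence yields no training example at all because "Tiny Parts Ltd" is absent from the seed knowledge base, which mirrors the poor coverage of tail entities and motivates adding targeted human annotations.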
Reference
- H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semant. Web 8 (3) (2017) 489–508.
- P. Ristoski, H. Paulheim, Semantic web in data mining and knowledge discovery: A comprehensive survey, Web Semant.: Sci. Serv. Agents World Wide Web 36 (2016) 1–22.
- S. Auer, V. Kovtun, M. Prinz, A. Kasprzik, M. Stocker, M.E. Vidal, Towards a knowledge graph for science, in: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, in: WIMS ’18, ACM, New York, NY, USA, 2018, pp. 1:1–1:6, http://dx.doi.org/10.1145/3227609.3227689.
- A.L. Gentile, D. Gruhl, P. Ristoski, S. Welch, Personalized knowledge graphs for the pharmaceutical domain, in: International Semantic Web Conference, Springer, 2019.
- R.C. Bunescu, R.J. Mooney, A shortest path dependency kernel for relation extraction, in: HLT/EMNLP, ACL, 2005, pp. 724–731.
- A. Culotta, J. Sorensen, Dependency tree kernels for relation extraction, in: ACL, ACL, 2004, p. 423.
- R.J. Mooney, R.C. Bunescu, Subsequence kernels for relation extraction, in: NIPS, 2006, pp. 171–178.
- D. Zelenko, C. Aone, A. Richardella, Kernel methods for relation extraction, J. Mach. Learn. Res. 3 (2003) 1083–1106.
- S. Zhao, R. Grishman, Extracting relations with integrated information using kernel methods, in: ACL, ACL, 2005, pp. 419–426.
- T.H. Nguyen, R. Grishman, Relation extraction: Perspective from convolutional neural networks, in: VS@ HLT-NAACL, 2015, pp. 39–48.
- D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al., Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.
- N.T. Vu, H. Adel, P. Gupta, et al., Combining recurrent and convolutional neural networks for relation classification, in: NAACL-HLT, 2016, pp. 534–539.
- I. Augenstein, D. Maynard, F. Ciravegna, Distantly supervised web relation extraction for knowledge base population, Semant. Web 7 (4) (2016) 335–349.
- A.L. Gentile, Z. Zhang, I. Augenstein, F. Ciravegna, Unsupervised wrapper induction using linked data, in: K-CAP, ACM, 2013, pp. 41–48.
- G. Ji, K. Liu, S. He, J. Zhao, Distant supervision for relation extraction with sentence-level attention and entity descriptions, in: AAAI, 2017, pp. 3060–3066.
- A.J. Ratner, C.D. Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: NIPS, 2016, pp. 3567–3575.
- B. Roth, T. Barth, M. Wiegand, D. Klakow, A survey of noise reduction methods for distant supervision, in: AKBC, ACM, 2013, pp. 73–78.
- G. Angeli, J. Tibshirani, J. Wu, C.D. Manning, Combining distant and partial supervision for relation extraction, in: EMNLP, 2014, pp. 1556–1567.
- G. Stanovsky, D. Gruhl, P. Mendes, Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models, in: EACL, 2017, pp. 142–151.
- R. Reinanda, E. Meij, M. de Rijke, Document filtering for long-tail entities, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, in: CIKM ’16, ACM, New York, NY, USA, 2016, pp. 771–780, http://dx.doi.org/10.1145/2983323.2983728.
- M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, in: IJCAI’07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2670–2676, URL http://dl.acm.org/citation.cfm?id=1625275.1625705.
- O. Etzioni, A. Fader, J. Christensen, S. Soderland, M. Mausam, Open information extraction: The second generation, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One, in: IJCAI’11, AAAI Press, 2011, pp. 3–10, http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-012.
- V. Presutti, A.G. Nuzzolese, S. Consoli, A. Gangemi, D. Reforgiato Recupero, From hyperlinks to semantic web properties using open knowledge extraction, Semant. Web 7 (4) (2016) 351–378.
- H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semant. Web 8 (3) (2017) 489–508.
- G. Weikum, J. Hoffart, F. Suchanek, Ten years of knowledge harvesting: Lessons and challenges, Data Eng. 5 (2016) 41–50.
- R. Grishman, B. Sundheim, Message understanding conference-6: A brief history, in: Proceedings of the 16th Conference on Computational Linguistics - Vol. 1, in: COLING ’96, Association for Computational Linguistics, Stroudsburg, PA, USA, 1996, pp. 466–471, http://dx.doi.org/10.3115/992628.992709.
- E.F. Tjong Kim Sang, Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition, in: Proceedings of the 6th Conference on Natural Language Learning - Vol. 20, in: COLING-02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 1–4, http://dx.doi.org/10.3115/1118853.1118877.
- E.F. Tjong Kim Sang, F. De Meulder, Introduction to the conll-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning At HLT-NAACL 2003 - Vol. 4, in: CONLL ’03, Association for Computational Linguistics, Stroudsburg, PA, USA, 2003, pp. 142–147, http://dx.doi.org/10.3115/1119176.1119195.
- G.R. Doddington, A. Mitchell, M.A. Przybocki, L.A. Ramshaw, S. Strassel, R.M. Weischedel, The automatic content extraction (ACE) program-tasks, data, and evaluation, in: LREC, 2004.
- A.G. Nuzzolese, A.L. Gentile, V. Presutti, A. Gangemi, D. Garigliotti, R. Navigli, Open knowledge extraction challenge, in: F. Gandon, E. Cabrio, M. Stankovic, A. Zimmermann (Eds.), Semantic Web Evaluation Challenges Second SemWebEval Challenge At ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015, Revised Selected Papers, in: Communications in Computer and Information Science, vol. 548, Springer, 2015, pp. 3–15, http://dx.doi.org/10.1007/978-3-319-25518-7_1.
- A.G. Nuzzolese, A.L. Gentile, V. Presutti, A. Gangemi, R. Meusel, H. Paulheim, The second open knowledge extraction challenge, in: H. Sack, S. Dietze, A. Tordai, C. Lange (Eds.), Semantic Web Challenges - Third SemWebEval Challenge At ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers, in: Communications in Computer and Information Science, vol. 641, Springer, 2016, pp. 3–16, http://dx.doi.org/10.1007/978-3-319-46565-4_1.
- P. Ristoski, C. Bizer, H. Paulheim, Mining the web of linked data with rapidminer, Web Semant.: Sci. Serv. Agents World Wide Web 35 (2015) 142–151, http://dx.doi.org/10.1016/j.websem.2015.06.004, Semantic Web Challenge 2014, URL http://www.sciencedirect.com/science/article/pii/S1570826815000505.
- O. Lehmberg, D. Ritze, P. Ristoski, R. Meusel, H. Paulheim, C. Bizer, The mannheim search join engine, Web Semant. 35 (P3) (2015) 159–166, http://dx.doi.org/10.1016/j.websem.2015.05.001.
- V. Bryl, C. Bizer, H. Paulheim, Gathering alternative surface forms for dbpedia entities, in: NLP-DBPEDIA@ ISWC, 2015, pp. 13–24.
- A. Alba, D. Gruhl, P. Ristoski, S. Welch, Interactive dictionary expansion using neural language models, in: HumL18 at ISWC, 2018.
- A.L. Gentile, D. Gruhl, P. Ristoski, S. Welch, Explore and exploit. dictionary expansion with human-in-the-loop, in: P. Hitzler, M. Fernández, K. Janowicz, A. Zaveri, A.J. Gray, V. Lopez, A. Haller, K. Hammar (Eds.), The Semantic Web, Springer International Publishing, 2019, pp. 131–145.
- T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
- Y. Kim, Convolutional neural networks for sentence classification, 2014, arXiv preprint arXiv:1408.5882.
- J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semant. Web 6 (2) (2015) 167–195.
- P. Ristoski, J. Rosati, T. Di Noia, R. De Leone, H. Paulheim, Rdf2vec: Rdf graph embeddings and their applications, Semant. Web (Preprint) (2018) 1–32.
- I. Lourentzou, A.L. Gentile, D. Gruhl, J. Fortner, M. Freemon, K. Grande, Difficult relations: Extracting novel facts from text, in: ISWC18, 2018.