Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop

Journal of Web Semantics, pp. 100546, 2020.

Keywords:
Relation extraction, Web mining, Knowledge graph mining, Human-in-the-loop

Abstract:

The Semantic Web movement has produced a wealth of curated collections of entities and facts, often referred to as Knowledge Graphs. Creating and maintaining such Knowledge Graphs is far from being a solved problem: it is crucial to constantly extract new information from the vast amount of heterogeneous sources of data on the Web.

Introduction
  • The vision of the Semantic Web (SW) is to make information on the Web machine-understandable.
  • Many organizations build their own Knowledge Graphs (KG) [3], i.e. curated collections of interlinked descriptions of entities and factual information in their business domain of interest.
  • Maintenance of such graphs is crucial: whenever new data is available, it needs to be added to the KG.
  • Knowledge graph population relies on extracting new entities, or new relations between entities, from different sources, which can be unstructured text or other existing knowledge graphs.
Highlights
  • The vision of the Semantic Web (SW) is to make information on the Web machine-understandable.
  • Results are reported for the web mining approach (W-Rex), the web mining approach with Human-in-the-Loop (W-Rex + HUML) and the web mining approach with Human-in-the-Loop combined with the knowledge graph mining approach (W-Rex + HUML + G-Rex).
  • In this work we show how adding a human-in-the-loop component to the extraction system nearly doubled the precision and increased the recall by nearly 50%.
  • In this work we focused on the dataset provided by the ISWC 2018 Semantic Web Challenge, with the objective of identifying supply-chain relations among organizations in the Thomson Reuters Knowledge Graph, achieving the best performance among all the competing systems.
  • We note that our system and processing pipeline are not limited to this domain and dataset, but can be used for any type of relation extraction from Web Documents and Knowledge Graphs.
  • Each module in our pipeline is built independently of the domain and the application, which allows it to be used in many domains and applications for large-scale relation extraction.
Results
  • The authors evaluate the system using the official evaluation system provided by the challenge organizers. The evaluation uses standard metrics: Precision, the fraction of predicted relations that are correct; Recall, the fraction of correct relations that were predicted; and F1, the harmonic mean of precision and recall.
  • Using the G-Rex approach the authors were able to identify 30,183 relations, which combined with the W-Rex + HUML results gives a total of 352,774 relations.
  • Combining the web mining approach, the human-in-the-loop filtering and the knowledge graph mining approach, the authors are able to improve the results by an additional 20%.
  • This improvement is achieved by identifying an additional 32,183 relations, which is only 9.1% of the total number of relations.
  • This confirms that the quality of the relations extracted from DBpedia is significantly higher than that of the relations extracted from Web documents.
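The metrics named above can be sketched in a few lines of Python. This is a minimal illustration and not the challenge's official evaluation code: the function name `precision_recall_f1`, the relation label `supplier_of` and the toy triples are all hypothetical. The last lines only re-check the 9.1% share using the two counts quoted in the text.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over two collections of relation triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted relations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of correct (gold) relations that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy data, NOT taken from the challenge dataset.
gold = {
    ("A", "supplier_of", "B"),
    ("B", "supplier_of", "C"),
    ("C", "supplier_of", "D"),
}
predicted = {
    ("A", "supplier_of", "B"),
    ("B", "supplier_of", "C"),
    ("A", "supplier_of", "D"),
    ("D", "supplier_of", "E"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.500 recall=0.667 f1=0.571

# Share of the additional relations in the total, using the counts quoted in the text.
share = 32_183 / 352_774
print(f"share={share:.1%}")  # prints: share=9.1%
```

Treating relations as a set of triples makes duplicates irrelevant, which is one common convention; an official scorer may count them differently.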
Conclusion
  • Extracting knowledge from heterogeneous data sources remains a fundamentally hard problem, especially when high accuracy is a requirement.
  • In this work the authors show how adding a human-in-the-loop component to the extraction system nearly doubled the precision and increased the recall by nearly 50%.
  • Adding a Knowledge Graph Mining component resulted in an "orthogonal" increase in accuracy, complementing the advantage of the human-in-the-loop approach.
  • The authors note that the system and processing pipeline are not limited to this domain and dataset, but can be used for any type of relation extraction from Web Documents and Knowledge Graphs.
  • The system was reused in the pharmaceutical domain for extracting "drug-drug interaction" and "drug - adverse drug event" relations from a set of PDF documents [4].
Tables
  • Table 1: Results of the web mining approach (Web Mining), the web mining approach with Human-in-the-Loop (Web Mining + HUML) and the web mining approach with Human-in-the-Loop combined with the knowledge graph mining approach (Web Mining + HUML + KG Mining)
Related work
  • 2.1. Relation extraction

    The task of Relation Extraction has been very well addressed in the literature. There is no one-size-fits-all model for the task, as much depends on the specific relation to extract and the data at hand. State-of-the-art systems range from early solutions based on SVMs and tree kernels [5,6,7,8,9] to more recent ones exploiting neural architectures [10,11,12]. Regardless of the model, one of the key hurdles, as in many machine learning tasks, is obtaining sufficient relevant training data. Distant supervision has been used successfully in the literature: the key idea is to exploit large knowledge bases to automatically label entities in text [13,14,15,16,17]. One of the main problems with distant supervision is poor coverage for tail entities [15]. One way to tackle the problem is to use targeted human annotations to expand the large pool of examples labeled with distant supervision [18]. This combined approach produced good results in the 2013 KBP English Slot Filling task. It was further shown that targeted training for use in the entity identification step of relation extraction can result in extremely fast learning [19]. We exploit this idea and focus our human-in-the-loop strategy on an entity expansion phase, which helps achieve top performance on the final relation extraction task.
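    The distant supervision idea described above can be illustrated with a small sketch. It is not the authors' pipeline: the function `distant_label`, the organization names and the `supplier_of` relation are hypothetical, and entity matching is reduced to naive case-insensitive substring search, whereas real systems would rely on entity recognition and linking.

```python
def distant_label(sentences, kb_facts):
    """Label sentences with relations from a knowledge base (distant supervision).

    A sentence that mentions both the subject and the object of a known fact
    is taken as a (noisy) positive training example for that fact's relation.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for subj, relation, obj in kb_facts:
            # Assumption: plain substring matching stands in for entity linking.
            if subj.lower() in lowered and obj.lower() in lowered:
                labeled.append((sentence, subj, obj, relation))
    return labeled


# Hypothetical knowledge base fact and sentences, for illustration only.
kb_facts = [("Acme Corp", "supplier_of", "Globex")]
sentences = [
    "Acme Corp signed a deal to ship parts to Globex.",
    "Globex opened a new office.",
    "Initech praised Acme Corp.",
]

labeled = distant_label(sentences, kb_facts)
for sentence, subj, obj, relation in labeled:
    print(f"{relation}({subj}, {obj}) <- {sentence}")
```

    Only the first sentence mentions both entities, so it is the only one labeled. Labels obtained this way are noisy, and entities that are absent from the knowledge base never receive any, which is the tail-entity coverage problem that targeted human annotation is meant to address.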
Reference
  • H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semant. Web 8 (3) (2017) 489–508.
  • P. Ristoski, H. Paulheim, Semantic web in data mining and knowledge discovery: A comprehensive survey, Web Semant.: Sci. Serv. Agents World Wide Web 36 (2016) 1–22.
  • S. Auer, V. Kovtun, M. Prinz, A. Kasprzik, M. Stocker, M.E. Vidal, Towards a knowledge graph for science, in: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, in: WIMS ’18, ACM, New York, NY, USA, 2018, pp. 1:1–1:6, http://dx.doi.org/10.1145/3227609.3227689, http://doi.acm.org/10.1145/3227609.3227689.
  • A.L. Gentile, D. Gruhl, P. Ristoski, S. Welch, Personalized knowledge graphs for the pharmaceutical domain, in: International Semantic Web Conference, Springer, 2019.
  • R.C. Bunescu, R.J. Mooney, A shortest path dependency kernel for relation extraction, in: HLT/EMNLP, ACL, 2005, pp. 724–731.
  • A. Culotta, J. Sorensen, Dependency tree kernels for relation extraction, in: ACL, ACL, 2004, p. 423.
  • R.J. Mooney, R.C. Bunescu, Subsequence kernels for relation extraction, in: NIPS, 2006, pp. 171–178.
  • D. Zelenko, C. Aone, A. Richardella, Kernel methods for relation extraction, J. Mach. Learn. Res. 3 (2003) 1083–1106.
  • S. Zhao, R. Grishman, Extracting relations with integrated information using kernel methods, in: ACL, ACL, 2005, pp. 419–426.
  • T.H. Nguyen, R. Grishman, Relation extraction: Perspective from convolutional neural networks, in: VS@ HLT-NAACL, 2015, pp. 39–48.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al., Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.
  • N.T. Vu, H. Adel, P. Gupta, et al., Combining recurrent and convolutional neural networks for relation classification, in: NAACL-HLT, 2016, pp. 534–539.
  • I. Augenstein, D. Maynard, F. Ciravegna, Distantly supervised web relation extraction for knowledge base population, Semant. Web 7 (4) (2016) 335–349.
  • A.L. Gentile, Z. Zhang, I. Augenstein, F. Ciravegna, Unsupervised wrapper induction using linked data, in: K-CAP, ACM, 2013, pp. 41–48.
  • G. Ji, K. Liu, S. He, J. Zhao, Distant supervision for relation extraction with sentence-level attention and entity descriptions, in: AAAI, 2017, pp. 3060–3066.
  • A.J. Ratner, C.D. Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, in: NIPS, 2016, pp. 3567–3575.
  • B. Roth, T. Barth, M. Wiegand, D. Klakow, A survey of noise reduction methods for distant supervision, in: AKBC, ACM, 2013, pp. 73–78.
  • G. Angeli, J. Tibshirani, J. Wu, C.D. Manning, Combining distant and partial supervision for relation extraction, in: EMNLP, 2014, pp. 1556–1567.
  • G. Stanovsky, D. Gruhl, P. Mendes, Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models, in: EACL, 2017, pp. 142–151.
  • R. Reinanda, E. Meij, M. de Rijke, Document filtering for long-tail entities, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, in: CIKM ’16, ACM, New York, NY, USA, 2016, pp. 771–780, http://dx.doi.org/10.1145/2983323.2983728, URL http://doi.acm.org/10.1145/2983323.2983728.
  • M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, in: IJCAI’07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2670–2676, URL http://dl.acm.org/citation.cfm?id=1625275.1625705.
  • O. Etzioni, A. Fader, J. Christensen, S. Soderland, M. Mausam, Open information extraction: The second generation, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One, in: IJCAI’11, AAAI Press, 2011, pp. 3–10, http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-012.
  • V. Presutti, A.G. Nuzzolese, S. Consoli, A. Gangemi, D. Reforgiato Recupero, From hyperlinks to semantic web properties using open knowledge extraction, Semant. Web 7 (4) (2016) 351–378.
  • H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semant. Web 8 (3) (2017) 489–508.
  • G. Weikum, J. Hoffart, F. Suchanek, Ten years of knowledge harvesting: Lessons and challenges, Data Eng. 5 (2016) 41–50.
  • R. Grishman, B. Sundheim, Message understanding conference-6: A brief history, in: Proceedings of the 16th Conference on Computational Linguistics - Vol. 1, in: COLING ’96, Association for Computational Linguistics, Stroudsburg, PA, USA, 1996, pp. 466–471, http://dx.doi.org/10.3115/992628.992709.
  • E.F. Tjong Kim Sang, Introduction to the conll-2002 shared task: Language-independent named entity recognition, in: Proceedings of the 6th Conference on Natural Language Learning - Vol. 20, in: COLING-02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 1–4, http://dx.doi.org/10.3115/1118853.1118877.
  • E.F. Tjong Kim Sang, F. De Meulder, Introduction to the conll-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning At HLT-NAACL 2003 - Vol. 4, in: CONLL ’03, Association for Computational Linguistics, Stroudsburg, PA, USA, 2003, pp. 142–147, http://dx.doi.org/10.3115/1119176.1119195.
  • G.R. Doddington, A. Mitchell, M.A. Przybocki, L.A. Ramshaw, S. Strassel, R.M. Weischedel, The automatic content extraction (ACE) program-tasks, data, and evaluation, in: LREC, 2004.
  • A.G. Nuzzolese, A.L. Gentile, V. Presutti, A. Gangemi, D. Garigliotti, R. Navigli, Open knowledge extraction challenge, in: F. Gandon, E. Cabrio, M. Stankovic, A. Zimmermann (Eds.), Semantic Web Evaluation Challenges Second SemWebEval Challenge At ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015, Revised Selected Papers, in: Communications in Computer and Information Science, vol. 548, Springer, 2015, pp. 3–15, http://dx.doi.org/10.1007/978-3-319-25518-7_1.
  • A.G. Nuzzolese, A.L. Gentile, V. Presutti, A. Gangemi, R. Meusel, H. Paulheim, The second open knowledge extraction challenge, in: H. Sack, S. Dietze, A. Tordai, C. Lange (Eds.), Semantic Web Challenges - Third SemWebEval Challenge At ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers, in: Communications in Computer and Information Science, vol. 641, Springer, 2016, pp. 3–16, http://dx.doi.org/10.1007/978-3-319-46565-4_1.
  • P. Ristoski, C. Bizer, H. Paulheim, Mining the web of linked data with rapidminer, Web Semant.: Sci. Serv. Agents World Wide Web 35 (2015) 142–151, http://dx.doi.org/10.1016/j.websem.2015.06.004, Semantic Web Challenge 2014, URL http://www.sciencedirect.com/science/article/pii/S1570826815000505.
  • O. Lehmberg, D. Ritze, P. Ristoski, R. Meusel, H. Paulheim, C. Bizer, The mannheim search join engine, Web Semant. 35 (P3) (2015) 159–166, http://dx.doi.org/10.1016/j.websem.2015.05.001.
  • V. Bryl, C. Bizer, H. Paulheim, Gathering alternative surface forms for dbpedia entities, in: NLP-DBPEDIA@ ISWC, 2015, pp. 13–24.
  • A. Alba, D. Gruhl, P. Ristoski, S. Welch, Interactive dictionary expansion using neural language models, in: HumL18 at ISWC, 2018.
  • A.L. Gentile, D. Gruhl, P. Ristoski, S. Welch, Explore and exploit. dictionary expansion with human-in-the-loop, in: P. Hitzler, M. Fernández, K. Janowicz, A. Zaveri, A.J. Gray, V. Lopez, A. Haller, K. Hammar (Eds.), The Semantic Web, Springer International Publishing, 2019, pp. 131–145.
  • T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
  • Y. Kim, Convolutional neural networks for sentence classification, 2014, arXiv preprint arXiv:1408.5882.
  • J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semant. Web 6 (2) (2015) 167–195.
  • P. Ristoski, J. Rosati, T. Di Noia, R. De Leone, H. Paulheim, Rdf2vec: Rdf graph embeddings and their applications, Semant. Web (Preprint) (2018) 1–32.
  • I. Lourentzou, A.L. Gentile, D. Gruhl, J. Fortner, M. Freemon, K. Grande, Difficult relations: Extracting novel facts from text, in: ISWC18, 2018.