Improving Event Detection via Open-domain Trigger Knowledge

ACL, pp. 5887-5897, 2020.

Keywords:
training data, open-domain trigger knowledge, Enrichment Knowledge Distillation, event extraction, event detection

Abstract:

Event Detection (ED) is a fundamental task in automatically structuring texts. Due to the small scale of training data, previous methods perform poorly on unseen/sparsely labeled trigger words and are prone to overfitting densely labeled trigger words. To address the issue, we propose a novel Enrichment Knowledge Distillation (EKD) model ...
Introduction
  • Event Detection (ED) aims at detecting trigger words in sentences and classifying them into predefined event types, which benefits numerous applications such as summarization (Li et al., 2019) and reading comprehension (Huang et al., 2019).
  • In S1 of Figure 1, ED aims to identify the word "fire" as the event trigger and classify its event type as Attack; a minimal sketch of this input/output format follows this list.
  • It is crucial to identify trigger words correctly as the preliminary step.
  • Take the benchmark ACE2005 as an example: trigger words that occur fewer than 5 times account for 78.2% of the total trigger words.
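As a concrete illustration of the task's input and output, here is a minimal sketch; the sentence, data structure, and function below are illustrative stand-ins rather than the authors' code (the actual S1 from Figure 1 is not reproduced in this summary):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EventMention:
    trigger: str               # word that evokes the event
    offset: Tuple[int, int]    # (start, end) token span of the trigger
    event_type: str            # one of the 33 ACE2005 event types

def detect_events(tokens: List[str]) -> List[EventMention]:
    """Toy stand-in for an ED model: decide, per token, whether it triggers
    an event and of which type. A real model scores every token."""
    mentions = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "fire":  # hard-coded rule, for illustration only
            mentions.append(EventMention(tok, (i, i + 1), "Attack"))
    return mentions

tokens = "The troops opened fire on the crowd".split()
print(detect_events(tokens))
# -> [EventMention(trigger='fire', offset=(3, 4), event_type='Attack')]
```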
Highlights
  • Event Detection (ED) aims at detecting trigger words in sentences and classifying them into predefined event types, which benefits numerous applications such as summarization (Li et al., 2019) and reading comprehension (Huang et al., 2019)
  • We propose a novel teacher-student model (EKD) that can learn from both labeled and unlabeled data, so as to improve Event Detection performance by reducing the in-built biases in annotations
  • We introduce the proposed Enrichment Knowledge Distillation (EKD) model, which leverages open-domain trigger knowledge to improve Event Detection
  • A trigger is considered correct if both its type and offsets match the annotation
  • All reported results are averaged over ten runs
  • We leverage the wealth of open-domain trigger knowledge to address the long-tail issue in ACE2005
Methods
  • 2) For syntactic knowledge, the authors take the first-order neighbors of the trigger word on the dependency parse tree.
  • 3) For argument knowledge, the authors focus on the words that play the ARG0–ARG4 roles of the trigger in the AMR parse, following (Huang et al., 2017).
  • As trigger words are unknown on the unlabeled data, the authors use pseudo labels generated by a pre-trained BERT model instead.
  • The authors inject syntactic and argument knowledge into sentences with the same Marking Mechanism described in Section 3.2; a rough sketch of such marking follows this list.
  • The authors use this knowledge only in the training procedure.
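As a rough illustration of how trigger candidates might be marked in a sentence before encoding, here is a minimal sketch; the marker tokens and the pseudo-labeling step are assumptions for illustration, since this summary does not give the exact marking symbols or tagger the authors use:

```python
def mark_triggers(tokens, trigger_indices, left="<t>", right="</t>"):
    """Wrap candidate trigger words with marker tokens so the encoder can
    attend to them; the marker strings here are illustrative."""
    marked = []
    for i, tok in enumerate(tokens):
        marked.extend([left, tok, right] if i in trigger_indices else [tok])
    return marked

# On unlabeled sentences the trigger positions are unknown, so a pre-trained
# tagger (e.g., a BERT-based trigger identifier) would supply pseudo labels.
tokens = "Militants escalate their attacks in the region".split()
pseudo_trigger_indices = {3}  # pretend the tagger marked "attacks"
print(" ".join(mark_triggers(tokens, pseudo_trigger_indices)))
# -> Militants escalate their <t> attacks </t> in the region
```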
Results
  • The authors report precision, recall, and micro-averaged F1 scores (in percent) over all 33 event types.
  • The batch size for labeled data is 32, and the authors set the ratio of labeled to unlabeled data to 1:6.
  • For most of the experiments, the authors set the learning rate to 3e-5, the maximum sequence length to 128, and the joint-training weight λ to 1.
  • To balance performance and training efficiency, the authors use 40,236 unlabeled sentences for knowledge distillation unless otherwise stated; these settings are gathered in the sketch below.
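Gathering the reported settings into one place, a hypothetical configuration could look like the sketch below; the variable names are illustrative, not the authors' code:

```python
# Hyperparameters reported in this summary, collected into one (hypothetical) config.
config = {
    "labeled_batch_size": 32,       # batch size for labeled data
    "unlabeled_per_labeled": 6,     # labeled : unlabeled = 1 : 6
    "learning_rate": 3e-5,
    "max_seq_length": 128,
    "lambda_joint": 1.0,            # weight of the distillation term in the joint loss
    "num_unlabeled_sentences": 40_236,
}

# Each training step would then pair one labeled batch with a six-times-larger
# unlabeled batch:
unlabeled_batch_size = config["labeled_batch_size"] * config["unlabeled_per_labeled"]  # 192
```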
Conclusion
  • EKD forces the student model to learn open-domain trigger knowledge from the teacher model by mimicking the teacher's predictions. Example sentences:
  • S1: Mr Caste leaves at 5 A.M. for a train trek to manhatten and does not return utill 6 P.M.
  • S2: Militants in the region escalate their attacks in the weeks leading up to the inauguration of Nigeria’s president.
  • S3: Mr. Mason, who will be president of CBS radio, said that it would play to radio’s strengths in delivering local news.
Summary
  • Objectives:

    Given the labeled corpus $L = \{(S_i, Y_i)\}_{i=1}^{N_L}$ and an abundant unlabeled corpus $U = \{S_k\}_{k=N_L+1}^{N_T}$, where $N_T$ stands for the total number of sentences in the labeled and unlabeled data, the goal is to jointly optimize two objectives: 1) maximize the prediction probability $P(Y_i \mid S_i)$ on the labeled corpus $L$; 2) minimize the discrepancy between the teacher's predictions $P(Y_k \mid S_k^{+})$ and the student's predictions $P(Y_k \mid S_k^{-})$ on both $L$ and $U$.
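Written as a single joint loss, the two objectives above can be combined as in the sketch below; this is consistent with the notation in this summary, but the KL-divergence form of the consistency term and the weight $\lambda$ (set to 1 in the experiments) are assumptions rather than the authors' exact formulation:

```latex
\mathcal{J}(\theta)
  = -\sum_{i=1}^{N_L} \log P_{\theta}\!\left(Y_i \mid S_i\right)
  \;+\; \lambda \sum_{k=1}^{N_T}
    \mathrm{KL}\!\left(
      P_{\mathrm{teacher}}\!\left(Y_k \mid S_k^{+}\right)
      \,\middle\|\,
      P_{\mathrm{student}}\!\left(Y_k \mid S_k^{-}\right)
    \right)
```

The first term is the supervised loss on $L$; the second penalizes the prediction discrepancy between teacher and student on both $L$ and $U$, where $S_k^{+}$ and $S_k^{-}$ presumably denote the knowledge-enriched and knowledge-free versions of sentence $S_k$.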
Tables
  • Table 1: F1 score on unseen/sparsely and densely labeled triggers. DMBERT (Chen et al., 2015) refers to a supervised-only model with dynamic multi-pooling to capture contextual features; BOOTSTRAP (He and Sun, 2017) expands training data via bootstrapping; DGBERT expands training data with Freebase (Chen et al., 2017)
  • Table 2: Overall performance on the ACE2005 dataset (%). The results of the baselines are taken from their original papers
  • Table 3: Performance on the test set with or without open-domain trigger knowledge
  • Table 4: Performance on domain adaptation. We train our model on two source domains, bn and nw, and test it on three target domains, bc, cts and wl
  • Table 5: Performance of our method on trigger words with various labeling frequencies
  • Table 6: Error analysis: how and when does open-domain trigger knowledge improve ED? GT refers to the ground-truth labels. On the unlabeled data, we use a majority vote of three human annotators as the ground truth
  • Table 7: Knowledge-Agnostic
Related work
  • 2.1 Event Detection

    Traditional feature-based methods exploit both lexical and global features to detect events (Li et al., 2013). As neural networks became popular in NLP (Cao et al., 2018), data-driven methods adopted various superior models such as DMCNN, DLRNN and PLMEE (Duan et al., 2017; Nguyen and Grishman, 2018; Yang et al., 2019) for end-to-end event detection. Recently, weakly-supervised methods (Judea and Strube, 2016; Huang et al., 2017; Zeng et al., 2018; Yang et al., 2018) have been proposed to generate more labeled data. Gabbard et al. (2018) identify informative snippets of text to expand annotated data via curated training. Liao and Grishman (2010a) and Ferguson et al. (2018) rely on sophisticated pre-defined rules to bootstrap from parallel news streams. Wang et al. (2019a) limit the data range of adversarial learning to trigger words appearing in labeled data. Due to the long-tail issue of labeled data and the homogeneity of the generated data, previous methods perform poorly on unseen/sparsely labeled data and tend to overfit densely labeled data. With open-domain trigger knowledge, our model is able to perceive unseen/sparsely labeled trigger words from abundant unlabeled data, and thus successfully improves the recall of trigger words.

    2.2 Knowledge Distillation

    Knowledge Distillation, initially proposed by Hinton et al. (2015), has been widely adopted in NLP to distill external knowledge into a model (Laine and Aila, 2016; Saito et al., 2017; Ruder and Plank, 2018). The main idea is to let a student model learn from a robust pre-trained teacher model; a minimal sketch of this soft-target loss is given below. Lee et al. (2018) and Gong et al. (2018) reinforce the connection between the teacher and student models via singular value decomposition and Laplacian-regularized least squares, respectively. Tarvainen and Valpola (2017) and Huang et al. (2018) stabilize the teacher model with a lazily updated mechanism so that the student model is less susceptible to external disturbances. Liu et al. (2019) use an adversarial imitation approach to enhance the learning procedure. Unlike previous methods that rely on gold annotations, our method is able to learn from pseudo labels and effectively extract knowledge from both the labeled and unlabeled corpora.
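For readers unfamiliar with the idea, here is a minimal soft-target distillation sketch in the spirit of Hinton et al. (2015); the temperature value, array shapes, and function names are illustrative, and this is not the EKD training code:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    averaged over examples (the classic soft-target objective)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
    return kl.mean()

# Toy usage: 2 tokens, 4 candidate event-type classes each.
teacher = np.array([[2.0, 0.1, 0.1, 0.1], [0.1, 3.0, 0.1, 0.1]])
student = np.array([[1.0, 0.2, 0.2, 0.2], [0.3, 1.5, 0.2, 0.2]])
print(distillation_loss(student, teacher))
```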
Funding
  • This work is supported by the National Key Research and Development Program of China (2018YFB1005100 and 2018YFB1005101), NSFC Key Projects (U1736204, 61533018)
  • This research is supported by the National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative
Reference
  • Jun Araki and Teruko Mitamura. 2018. Open-domain event detection using distant supervision. In Proceedings of the 27th International Conference on Computational Linguistics, pages 878–891.
  • Tiberiu Boro, Stefan Daniel Dumitrescu, and Ruxandra Burtica. 2018. NLP-cube: End-to-end raw text processing with neural networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 171–179.
  • Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018. Neural collective entity linking. In COLING.
  • Yixin Cao, Zikun Hu, Tat-seng Chua, Zhiyuan Liu, and Heng Ji. 2019. Low-resource name tagging learned with weakly labeled data. In EMNLP.
  • Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–419.
  • Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multipooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 167–176.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Shaoyang Duan, Ruifang He, and Wenli Zhao. 2017. Exploiting document level information to improve event detection via recurrent neural networks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 352–361.
  • Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. A language-independent neural network for event detection. Science China Information Sciences, 61(9):092106.
  • James Ferguson, Colin Lockard, Daniel Weld, and Hannaneh Hajishirzi. 2018. Semi-supervised event extraction with paraphrase clusters. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 359–364, New Orleans, Louisiana. Association for Computational Linguistics.
  • Ruan Gabbard, Jay DeYoung, and Marjorie Freedman. 2018. Events beyond ace: Curated training for events. arXiv preprint arXiv:1809.05576.
  • Chen Gong, Xiaojun Chang, Meng Fang, and Jian Yang. 2018. Teaching semi-supervised classifier via generalized distillation. In IJCAI, pages 2156–2162.
  • Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
  • Hangfeng He and Xu Sun. 2017. A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277.
  • Lifu Huang, Heng Ji, Kyunghyun Cho, and Clare R Voss. 2017. Zero-shot transfer learning for event extraction. arXiv preprint arXiv:1707.01066.
  • Mingkun Huang, Yongbin You, Zhehuai Chen, Yanmin Qian, and Kai Yu. 2018. Knowledge distillation for sequence model. In Interspeech, pages 3703–3707.
  • Alex Judea and Michael Strube. 2016. Incremental global event extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2279– 2289.
  • Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • Seung Hyun Lee, Dae Ha Kim, and Byung Cheol Song. 2018. Self-supervised knowledge distillation using singular value decomposition. In European Conference on Computer Vision, pages 339–354. Springer.
  • Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 73–82.
  • Wei Li, Dezhi Cheng, Lei He, Yuanzhuo Wang, and Xiaolong Jin. 2019. Joint event extraction based on hierarchical event schemas from framenet. IEEE Access, 7:25001–25015.
  • Shasha Liao and Ralph Grishman. 2010a. Filtered ranking for bootstrapping in event extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 680–688. Association for Computational Linguistics.
  • Jian Liu, Yubo Chen, and Kang Liu. 2019. Exploiting the ground-truth: An adversarial imitation based knowledge distillation approach for event detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6754–6761.
  • Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2018a. Event detection via gated multilingual attention mechanism. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Shaobo Liu, Rui Cheng, Xiaoming Yu, and Xueqi Cheng. 2018b. Exploiting contextual information via dynamic memory network for event detection. arXiv preprint arXiv:1810.03449.
  • Shulin Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2017. Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1789–1798.
  • Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. 2019. Distilling discrimination and generalization knowledge for event detection via deltarepresentation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4366–4376.
  • Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
  • David McClosky, Mihai Surdeanu, and Christopher D Manning. 2011. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1626–1635. Association for Computational Linguistics.
  • George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. 1990. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235– 244.
  • Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309.
  • Shasha Liao and Ralph Grishman. 2010b. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 789–797, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 68–74.
  • Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371, Beijing, China. Association for Computational Linguistics.
  • Thien Huu Nguyen and Ralph Grishman. 2018. Graph convolutional networks with argument-aware pooling for event detection. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1498–1507.
  • Sebastian Ruder and Barbara Plank. 2018. Strong baselines for neural semi-supervised learning under domain shift. arXiv preprint arXiv:1804.09530.
  • Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Asymmetric tri-training for unsupervised domain adaptation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2988–2997. JMLR. org.
  • Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia.
  • Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. arXiv preprint arXiv:1906.03158.
  • Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204.
  • Meihan Tong, Shuai Wang, Yixin Cao, Bin Xu, Juaizi Li, Lei Hou, and Tat-Seng Chua. 2020. Image enhanced event detection in news articles.
  • Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015. A transition-based algorithm for AMR parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 366–375, Denver, Colorado. Association for Computational Linguistics.
  • Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019b. Adversarial training for weakly supervised event detection. In NAACL.
  • Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
  • Hang Yang, Yubo Chen, Kang Liu, Yang Xiao, and Jun Zhao. 2018. Dcfee: A document-level chinese financial event extraction system based on automatically labeled training data. In Proceedings of ACL 2018, System Demonstrations, pages 50–55.
  • Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, and Dongsheng Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5284– 5294.
  • Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2018. Scale up event extraction learning via automatic training data generation. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Kuo Zhang, Juan Zi, and Li Gang Wu. 2007. New event detection based on indexing-tree and named entity. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 215–222. ACM.
  • Tongtao Zhang, Heng Ji, and Avirup Sil. 2019. Joint entity and event extraction with generative adversarial imitation learning. Data Intelligence, 1(2):99– 120.
  • Yue Zhao, Xiaolong Jin, Yuanzhuo Wang, and Xueqi Cheng. 2018. Document embedding enhanced event detection with hierarchical and supervised attention. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 414–419.
  • Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83, Uppsala, Sweden. Association for Computational Linguistics.
  • Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019a. Adversarial training for weakly supervised event detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 998–1008.