This paper introduces a combinatorial "balls-and-urns" model that computes the impact of sample size, redundancy, and corroboration from multiple distinct extraction rules on the probability that an extraction is correct.

A probabilistic model of redundancy in information extraction

IJCAI, pp. 1034-1041, 2005

Cited by: 248 | Views: 163

Abstract

Unsupervised Information Extraction (UIE) is the task of extracting knowledge from text without using hand-tagged training examples. A fundamental problem for both UIE and supervised IE is assessing the probability that extracted information is correct. In massive corpora such as the Web, the same extraction is found repeatedly in different documents ...

Introduction
  • Information Extraction (IE) is the task of automatically extracting knowledge from text.
  • Unsupervised IE (UIE) is IE in the absence of hand-tagged training data.
  • A fundamental problem for both supervised IE and UIE is assessing the probability that extracted information is correct.
  • As explained in Section 5, previous IE work has used a variety of techniques to address this problem, but has yet to provide an adequate formal model of the impact of redundancy—repeatedly obtaining the same extraction from different documents—on the probability of correctness.
  • In massive corpora such as the Web, redundancy is one of the main sources of confidence in extractions.
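
To make the redundancy intuition concrete, here is a minimal sketch of a single-urn estimate in the spirit of the paper's balls-and-urns model: labels are drawn with replacement from an urn containing correct and erroneous labels, and the probability that a label seen k times in n draws is correct follows from a binomial likelihood. The uniform repetition rates, the counts num_c and num_e, and all numbers below are illustrative assumptions; the paper fits frequency distributions over labels (its num(C) and num(E)) rather than using fixed rates.

```python
from math import log, exp

def p_correct(k, n, num_c, num_e, rate_c, rate_e):
    """Probability that a label seen k times in n draws came from the target
    set, for a single urn whose num_c correct labels all share repetition
    rate rate_c and whose num_e error labels all share rate_e (a uniform
    special case, not the paper's full estimator)."""
    def log_binom(r):
        # Binomial log-likelihood of k hits in n draws at per-draw rate r;
        # the C(n, k) factor cancels between numerator and denominator.
        return k * log(r) + (n - k) * log(1.0 - r)

    log_target = log(num_c) + log_binom(rate_c)
    log_error = log(num_e) + log_binom(rate_e)
    m = max(log_target, log_error)  # normalize in log space for stability
    return exp(log_target - m) / (exp(log_target - m) + exp(log_error - m))

# Illustrative numbers only: 10,000 correct labels, each drawn five times as
# often as any of 100,000 error labels.  A label seen once in 50,000 draws
# looks like noise; one seen five times is almost certainly correct.
for k in (1, 2, 5):
    print(k, round(p_correct(k, n=50_000, num_c=10_000, num_e=100_000,
                             rate_c=5e-5, rate_e=1e-5), 3))
```

Note that the estimate depends on the sample size n as well as the repetition count k, matching the summary's point that the model captures the joint impact of sample size and redundancy.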
Highlights
  • Information Extraction (IE) is the task of automatically extracting knowledge from text
  • We describe methods for estimating the model's parameters in practice and demonstrate experimentally that, for Unsupervised Information Extraction, the model's log likelihoods are 15 times better, on average, than those obtained by Pointwise Mutual Information (PMI) and the noisy-or model used in previous work (a noisy-or baseline is sketched after this list)
  • For Unsupervised Information Extraction, our model is a factor of 15 closer to the correct log likelihood than the noisy-or model used in previous work; the model is 20 times closer than KNOWITALL's Pointwise Mutual Information (PMI) method [Etzioni et al., 2004], which is based on Turney's PMI-IR algorithm [Turney, 2001]
  • The extraction patterns were partitioned into urns based on the name they employed for their target relation (e.g. “country” or “nation”) and whether they were left-handed (e.g. “countries including x”) or right-handed (e.g. “x and other countries”)
  • In the Unsupervised Information Extraction experiments, we evaluate our algorithms on all 1000 examples, and in the supervised Information Extraction experiments we perform 10-fold cross-validation
  • Pointwise Mutual Information computed over search engine hit counts has been used to determine synonymy [Turney, 2001], and for question answering [Magnini et al., 2002]
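
For comparison, below is a minimal sketch of a noisy-or baseline of the kind referenced above: each appearance of an extraction is treated as an independent chance that the extraction is correct, so it is judged wrong only if every appearance misfires. The per-rule confidences and counts in the example are made-up values, not parameters reported in the paper.

```python
def noisy_or(counts, confidences):
    """Noisy-or score: k_i appearances from rule i, each an independent
    chance (confidence p_i) of the extraction being correct; the extraction
    is judged wrong only if every appearance misfires."""
    p_wrong = 1.0
    for k_i, p_i in zip(counts, confidences):
        p_wrong *= (1.0 - p_i) ** k_i
    return 1.0 - p_wrong

# Hypothetical rule confidences: an extraction seen 3 times by a 0.8 rule
# and once by a 0.5 rule scores 1 - (0.2**3 * 0.5) = 0.996.
print(noisy_or(counts=[3, 1], confidences=[0.8, 0.5]))
```

Unlike the single-urn sketch earlier, this score depends only on the appearance counts and ignores the overall sample size n.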
Results
  • This section describes the experimental results under two settings: unsupervised and supervised.
  • The authors evaluated the algorithms on extraction sets for the relations City(x), Film(x), Country(x), and MayorOf(x,y), taken from experiments performed in [Etzioni et al., 2005].
  • The sample size n was 64,581 for City, 134,912 for Film, 51,313 for Country and 46,129 for MayorOf. The extraction patterns were partitioned into urns based on the name they employed for their target relation (e.g. "country" or "nation") and whether they were left-handed (e.g. "countries including x") or right-handed (e.g. "x and other countries"); a minimal version of this partitioning is sketched after this list.
  • In the UIE experiments, the authors evaluate the algorithms on all 1000 examples, and in the supervised IE experiments the authors perform 10-fold cross-validation
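
The urn partitioning described above can be pictured as a small routine that maps each extraction pattern to a key of (relation name, handedness). The plain-string pattern format and the placeholder token "x" are assumptions of this sketch, not the paper's actual pattern representation.

```python
def urn_key(pattern, relation_names):
    """Assign an extraction pattern to an urn keyed by (relation name,
    handedness): a left-handed pattern puts its text before the extracted
    slot ("countries including x"), a right-handed one puts the slot first
    ("x and other countries")."""
    tokens = pattern.lower().split()
    name = next((n for n in relation_names if n in tokens), None)
    handedness = "right" if tokens[0] == "x" else "left"
    return (name, handedness)

for p in ["countries including x", "x and other countries", "nations such as x"]:
    print(p, "->", urn_key(p, relation_names=["countries", "nations"]))
```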
Conclusion
  • Discussion of UIE Results

    The results of the unsupervised experiments are shown in Figure 2.
  • URNS is substantially more efficient as shown in Table 1.
  • URNS computes probabilities directly from the set of extractions, requiring no additional queries; this cuts KNOWITALL's queries by factors ranging from 1.9 to 17.
  • PMI computed over search engine hit counts has been used to determine synonymy [Turney, 2001], and for question answering [Magnini et al., 2002]; a rough sketch of this hit-count score follows this list.
  • Comparing URNS with PMI on these tasks is a topic for future work
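
As an illustration of the PMI-over-hit-counts idea mentioned above, the sketch below scores a candidate instance by the fraction of its hits that also match a discriminator phrase, in the spirit of Turney's PMI-IR. The hits function, phrase templates, and toy counts are placeholders for a real search-engine hit-count API, not KNOWITALL's actual queries.

```python
def pmi_score(instance, discriminator, hits):
    """Fraction of pages mentioning the instance that also match the
    discriminator phrase, computed from (placeholder) hit counts."""
    joint = hits(f'"{discriminator} {instance}"')
    alone = hits(f'"{instance}"')
    return joint / alone if alone else 0.0

# Toy hit counts standing in for a search-engine API.
fake_counts = {'"countries such as France"': 120_000,
               '"France"': 90_000_000,
               '"countries such as Pluto"': 3,
               '"Pluto"': 5_000_000}
hits = lambda q: fake_counts.get(q, 0)
print(pmi_score("France", "countries such as", hits))  # comparatively high
print(pmi_score("Pluto", "countries such as", hits))   # near zero
```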
Tables
  • Table 1: Improved Efficiency Due to URNS. The top row reports the number of search engine queries made by KNOWITALL using PMI divided by the number of queries for KNOWITALL using URNS. The bottom row shows that PMI's queries increase with k, the average number of distinct labels for each relation. Thus, speedup tends to vary inversely with the average number of times each label is drawn
  • Table 2: Supervised IE experiments. Deviation from the ideal log likelihood for each method and each relation (lower is better). The overall performance differences are small, with URNS 19% closer to the ideal than noisy-or, on average, and 10% closer than logistic regression. The overall performance of SVM is close to that of URNS
Related work
  • In contrast to the bulk of previous IE work, our focus is on unsupervised IE (UIE) where URNS substantially outperforms previous methods (Figure 2).

    In addition to the noisy-or models we compare against in our experiments, the IE literature contains a variety of heuristics using repetition as an indication of the veracity of extracted information. For example, Riloff and Jones [Riloff and Jones, 1999] rank extractions by the number of distinct patterns generating them, plus a factor for the reliability of the patterns. Our work is intended to formalize these heuristic techniques, and unlike the noisy-or models, we explicitly model the distribution of the target and error sets (our num(C) and num(E)), which is shown to be important for good performance in Section 4.1. The accuracy of the probability estimates produced by the heuristic and noisy-or methods is rarely evaluated explicitly in the IE literature, although most systems make implicit use of such estimates. For example, bootstrap-learning systems start with a set of seed instances of a given relation, which are used to identify extraction patterns for the relation; these patterns are in turn used to extract further instances (e.g. [Riloff and Jones, 1999; Lin et al., 2003; Agichtein and Gravano, 2000]). As this process iterates, random extraction errors result in overly general extraction patterns, leading the system to extract further erroneous instances. The more accurate estimates of extraction probabilities produced by URNS would help prevent this "concept drift."
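
To make the contrast concrete, a minimal version of the repetition heuristic described above might rank each extraction by the number of distinct patterns that produced it plus a small reliability bonus. The weight w and the reliability scores below are assumed for illustration, not values from Riloff and Jones.

```python
from collections import defaultdict

def rank_extractions(matches, reliability, w=0.01):
    """Heuristic ranking in the spirit of Riloff and Jones [1999]: score an
    extraction by the number of distinct patterns that produced it, plus a
    small bonus (weight w, an assumed constant) for pattern reliability."""
    patterns_for = defaultdict(set)
    for extraction, pattern in matches:
        patterns_for[extraction].add(pattern)
    scores = {e: len(ps) + w * sum(reliability.get(p, 0.0) for p in ps)
              for e, ps in patterns_for.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

matches = [("France", "countries such as x"),
           ("France", "x and other countries"),
           ("Paris", "countries such as x")]
reliability = {"countries such as x": 2.0, "x and other countries": 1.5}
print(rank_extractions(matches, reliability))
```

The output is only a ranking; the scores are not calibrated probabilities, which is the gap the formal urns model is meant to close.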
Funding
  • This research was supported in part by NSF grant IIS0312988, DARPA contract NBCHD030010, ONR grant N00014-02-1-0324, and a gift from Google
References
  • [Agichtein and Gravano, 2000] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proc. of the 5th ACM Intl. Conf. on Digital Libraries, 2000.
  • [Chang and Lin, 2001] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001.
  • [Culotta and McCallum, 2004] A. Culotta and A. McCallum. Confidence estimation for information extraction. In HLT-NAACL, 2004.
  • [Etzioni et al., 2004] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-scale information extraction in KnowItAll (preliminary results). In WWW, 2004.
  • [Etzioni et al., 2005] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence (to appear), 2005.
  • [Gale and Sampson, 1995] W. A. Gale and G. Sampson. Good-turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3):217–237, 1995.