# A theory of term weighting based on exploratory data analysis

SIGIR, pp.11-19, (1998)

Abstract

Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between the document frequency and the weight of evidence is investigated. A correlation between document ...

Introduction

- In 1972, Sparck Jones demonstrated that document frequency can be used effectively for the weighting of query terms [23].
- In this paper a theory of why inverse document frequency has been so effective is developed.
- Both the approach taken and the conclusions drawn differ from theories previously put forth.
- The result is an explanatory theory of inverse document frequency, idf, derived from observed statistical regularities of extensive retrieval data.

Highlights

- In 1972, Sparck Jones demonstrated that document frequency can be used effectively for the weighting of query terms [23]
- We have shown that by accepting some empirically motivated assumptions concerning query terms, the quantity woe can be approximated by $MI(occ; rel)$. By further assuming that $MI(occ; rel)$ is roughly linear in $\log O(occ)$, we showed that traditional idf formulations should perform well
- If we accept the hypothesis that the plot of Figure 6 is representative of the general behavior of query terms for the types of queries and collections we study, we should expect improved retrieval performance from a term weighting formula that accounts for the observed flattening
- We have shown strong empirical support for concluding that $MI(occ; rel)$ as a function of $\log O(occ)$ is roughly linear, with a slope of the order of magnitude of 21; and that this can be used to explain why inverse document frequency has been found to be so useful for term weighting
- Previous probabilistic explanations have started from plausible a priori assumptions, in particular assumptions concerning the probability of a query term occurring in a relevant document
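
The approximation of woe by $MI(occ; rel)$ mentioned in the highlights can be illustrated numerically. The sketch below assumes Good's definition of weight of evidence [12], $woe = \log \frac{O(rel \mid occ)}{O(rel)}$, and the pointwise form $MI(occ; rel) = \log \frac{p(rel \mid occ)}{p(rel)}$. The function names and the probabilities used are hypothetical and are not taken from the paper's data.

```python
import math


def log_odds(p: float) -> float:
    """Return log(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def weight_of_evidence(p_rel_given_occ: float, p_rel: float) -> float:
    """Assumed definition: log O(rel | occ) - log O(rel)."""
    return log_odds(p_rel_given_occ) - log_odds(p_rel)


def pointwise_mi(p_rel_given_occ: float, p_rel: float) -> float:
    """Assumed definition: log p(rel | occ) - log p(rel)."""
    return math.log(p_rel_given_occ / p_rel)


# Hypothetical values: relevance is rare both overall and given occurrence.
p_rel = 1e-4
p_rel_given_occ = 1e-3
woe = weight_of_evidence(p_rel_given_occ, p_rel)
mi = pointwise_mi(p_rel_given_occ, p_rel)
print(f"woe = {woe:.4f}, MI = {mi:.4f}, difference = {woe - mi:.6f}")
```

Under these definitions the two quantities differ by $\log \frac{1 - p(rel)}{1 - p(rel \mid occ)}$, which is close to zero whenever both probabilities are small.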

Results

- A value close to 1 for $-\log p(occ \mid rel)$ is achieved by only a small percentage of query terms: those which appear in more than 25% of all documents.

Conclusion

- The authors have shown strong empirical support for concluding that $MI(occ; rel)$ as a function of $\log O(occ)$ is roughly linear, with a slope of the order of magnitude of 21; and that this can be used to explain why inverse document frequency has been found to be so useful for term weighting.
- With the availability of large numbers of conscientiously formulated queries, systematically judged against diverse, voluminous document collections, pertinent information becomes accessible.
- Inspection of this data supplies them with sufficient reason for assigning unequal probabilities for $p(occ \mid rel)$ based on a term's document frequency.

Related work

- In 1972, Sparck Jones convincingly demonstrated that the weighting of query terms can significantly improve retrieval performance compared to unweighted coordination match ranking [23]. The weighting formula she proposed was an approximation of:

$$w_{sj} = \log \frac{N}{n} \tag{3}$$

where $n$ is the document frequency of the term (the number of documents in which the term appears), and $N$ is the number of documents in the entire collection.
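
As a minimal sketch of equation (3), the function below computes $w_{sj} = \log(N/n)$ with the natural logarithm. The function name and the collection size are hypothetical, chosen only to show how the weight falls as document frequency rises.

```python
import math


def idf_weight(num_docs: int, doc_freq: int) -> float:
    """Return log(N / n), where N is the collection size and n the document frequency."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must satisfy 0 < doc_freq <= num_docs")
    return math.log(num_docs / doc_freq)


# Hypothetical collection of one million documents.
N = 1_000_000
for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: w_sj = {idf_weight(N, n):.3f}")
```

A term that appears in every document receives a weight of zero, and rarer terms receive progressively larger weights.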

3.1 Probabilistic Explanations

In a letter to the Journal of Documentation, Robertson pointed out that, viewed as a function of the probability of term occurrence, the sum of weights could be interpreted as the probability of mutual occurrence of multiple query terms [17], thus providing theoretical arguments for the use of $w_{sj}$. Together, in 1976, Robertson and Sparck Jones presented the Binary Independence Model [18], in which terms are weighted by:

$$w_{rsj} = \log \frac{p(occ \mid rel)\,\bigl(1 - p(occ \mid \overline{rel})\bigr)}{p(occ \mid \overline{rel})\,\bigl(1 - p(occ \mid rel)\bigr)} \tag{4}$$

where $p(occ \mid rel)$ is the probability of the term occurring in relevant documents, and $p(occ \mid \overline{rel})$ is the corresponding probability for non-relevant documents. Use of the model depends on the availability of relevance feedback information, on which estimates of the two conditional probabilities can be based.

Applying the probabilistic approach of Robertson and Sparck Jones, Croft and Harper [5] work with an equivalent formulation of $w_{rsj}$:

$$w_{rsj} = \log \frac{p(occ \mid rel)}{1 - p(occ \mid rel)} - \log \frac{p(occ \mid \overline{rel})}{1 - p(occ \mid \overline{rel})} \tag{5}$$
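
The equivalence of equations (4) and (5) can be checked numerically. The sketch below implements both forms of the weight. The function names and the probabilities passed in are hypothetical and do not come from relevance feedback data.

```python
import math


def _check(p: float) -> None:
    """Reject probabilities for which the log-odds are undefined."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")


def rsj_weight_ratio(p_rel: float, p_nonrel: float) -> float:
    """Equation (4): the log of a single ratio of probabilities."""
    _check(p_rel)
    _check(p_nonrel)
    return math.log((p_rel * (1.0 - p_nonrel)) / (p_nonrel * (1.0 - p_rel)))


def rsj_weight_log_odds(p_rel: float, p_nonrel: float) -> float:
    """Equation (5): the difference of two log-odds terms."""
    _check(p_rel)
    _check(p_nonrel)
    return math.log(p_rel / (1.0 - p_rel)) - math.log(p_nonrel / (1.0 - p_nonrel))


# Hypothetical estimates of p(occ | rel) and p(occ | non-rel).
p_rel, p_nonrel = 0.3, 0.01
print(rsj_weight_ratio(p_rel, p_nonrel))
print(rsj_weight_log_odds(p_rel, p_nonrel))
```

The two functions agree up to floating-point rounding, since (5) simply splits the logarithm of the ratio in (4) into a relevant-document term and a non-relevant-document term.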

Their goal is the development of a probabilistically justified weighting formula that can be used in a retrieval setting in the absence of, or prior to, relevance feedback. They make two assumptions: 1) there "is no information about the relevant documents and we could therefore assume that all the query terms had equal probabilities of occurring in the relevant documents" [5, p. 287]; and 2) the probability, $p(occ \mid \overline{rel})$, of a term occurring in a non-relevant document can be estimated by $n/N$, the proportion of documents in the entire collection that contain the term. With these two assumptions, they derive the combination match formula:

$$w_{ch} = k + \log \frac{N - n}{n} \tag{6}$$
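
A minimal sketch of how the two assumptions lead to equation (6): if $p(occ \mid rel)$ is the same constant for every query term, the first log-odds term of equation (5) becomes a constant $k$, and substituting $n/N$ for $p(occ \mid \overline{rel})$ turns the second term into $\log \frac{N - n}{n}$. The function name and the default value of 0.5 for the shared probability are assumptions made for illustration only.

```python
import math


def croft_harper_weight(num_docs: int, doc_freq: int, p_rel: float = 0.5) -> float:
    """Return k + log((N - n) / n), with k = log(p_rel / (1 - p_rel)).

    p_rel is the probability, assumed equal for all query terms, of a term
    occurring in a relevant document; 0.5 is an illustrative default.
    """
    if not 0 < doc_freq < num_docs:
        raise ValueError("doc_freq must satisfy 0 < doc_freq < num_docs")
    if not 0.0 < p_rel < 1.0:
        raise ValueError("p_rel must lie strictly between 0 and 1")
    k = math.log(p_rel / (1.0 - p_rel))
    return k + math.log((num_docs - doc_freq) / doc_freq)


# Hypothetical collection of 1,000 documents and a term found in 10 of them.
N, n = 1_000, 10
direct = croft_harper_weight(N, n)

# The same value obtained from the log-odds form with p(occ | non-rel) = n / N.
p_nonrel = n / N
via_log_odds = math.log(0.5 / 0.5) - math.log(p_nonrel / (1.0 - p_nonrel))
print(direct, via_log_odds)
```

With the shared probability set to 0.5, $k$ is zero and the weight reduces to $\log \frac{N - n}{n}$, which for rare terms is close to the $\log \frac{N}{n}$ of equation (3).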

Funding

- This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, and also supported in part by United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235.

Reference

- [1] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78-83, 1992.
- [2] Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using statistics in lexical analysis. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 115-164, Hillsdale, NJ, 1991. Lawrence Erlbaum Associates.
- [3] W. S. Cooper, D. Dabney, and F. Gey. Probabilistic retrieval based on staged logistic regression. In Nicholas Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 198-210, Copenhagen, Denmark, June 1992.
- [4] Wm. S. Cooper, Aitao Chen, and Fredric C. Gey. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In D. K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 57-66, Gaithersburg, Md., March 1994. NIST Special Publication 500-215.
- [5] W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, December 1979.
- [6] W. B. Croft and Jinxi Xu. Corpus-specific stemming using word form co-occurrence. In Proceedings for the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 147-159, Las Vegas, Nevada, April 1995.
- [7] Robert M. Fano. Transmission of Information; a Statistical Theory of Communications. MIT Press, Cambridge, MA, 1961.
- [8] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183-204, 1989.
- [9] Norbert Fuhr and Chris Buckley. Probabilistic document indexing from relevance feedback data. ACM Transactions on Information Systems, 9(2):45-61, 1991.
- [10] Fredric C. Gey. Inferring probability of relevance using the method of logistic regression. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 222-231, Dublin, Ireland, July 1994.
- [11] I. J. Good. Probability and the Weighing of Evidence. Charles Griffin, London, 1950.
- [12] I. J. Good. Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics 2, pages 249-269. North-Holland, Amsterdam, 1983.
- [13] Donna Harman. Overview of the first Text REtrieval Conference (TREC-1). In D. K. Harman, editor, The First Text REtrieval Conference (TREC-1), pages 1-20, Gaithersburg, Md., February 1993. NIST Special Publication 500-207.
- [14] Donna Harman. Overview of the fifth Text REtrieval Conference (TREC-5). In E. M. Voorhees and D. K. Harman, editors, The Fifth Text REtrieval Conference (TREC-5), pages 1-28, Gaithersburg, Md., November 1997. NIST Special Publication 500-238.
- [15] D. J. Harper and C. J. van Rijsbergen. An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3):189-216, September 1978.
- [16] Frederick Hartwig and Brian E. Dearing. Exploratory Data Analysis. Sage Publications, 1979.
- [17] S. E. Robertson. Term specificity. Journal of Documentation, 28(2):164-165, 1972. Letter to the editor, with response by K. Sparck Jones.
- [18] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129-146, 1977.
- [19] S. E. Robertson and S. Walker. On relevance weights with little relevance information. In Nicholas J. Belkin, A. Desai Narasimhalu, and Peter Willett, editors, Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 16-24, Philadelphia, Pennsylvania, July 1997.
- [20] G. Salton, A. Wong, and C. T. Yu. Automatic indexing using term discrimination and term precision measurements. Information Processing & Management, 12:43-51, 1976.
- [21] G. Salton, H. Wu, and C. Y. Yu. The measurement of term importance in automatic indexing. Journal of the American Society for Information Science, 32:175-186, 1981.
- [22] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
- [23] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972.
- [24] John W. Tukey. Exploratory Data Analysis. Addison-Wesley Publishing Company, Reading, MA, 1977.
- [25] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
- [26] S. K. M. Wong and Y. Y. Yao. An information-theoretic measure of term specificity. Journal of the American Society for Information Science, 43(1):54-61, 1992.
- [27] C. T. Yu, K. Lam, and G. Salton. Term weighting in information retrieval using the term precision model. Journal of the ACM, 29(1):152-170, January 1982.
- [28] Clement T. Yu and Hirotaka Mizuno. Two learning schemes in information retrieval. In Yves Chiaramella, editor, Proceedings of the 11th International Conference on Research and Development in Information Retrieval, pages 201-215, Grenoble, France, June 1988.
