
A theory of term weighting based on exploratory data analysis

SIGIR, pp. 11-19 (1998)

Cited by: 121 | Views: 102

Abstract

Techniques of exploratory data analysis are used to study the weight of evidence that the occurrence of a query term provides in support of the hypothesis that a document is relevant to an information need. In particular, the relationship between the document frequency and the weight of evidence is investigated. A correlation between document ...


Introduction
  • In 1972, Sparck Jones demonstrated that document frequency can be used effectively for the weighting of query terms [23].
  • In this paper a theory of why inverse document frequency has been so effective is developed.
  • Both the approach taken and the conclusions drawn differ from theories previously put forth.
  • The result is an explanatory theory of inverse document frequency, idf, derived from observed statistical regularities of extensive retrieval data.
Highlights
  • In 1972, Sparck Jones demonstrated that document frequency can be used effectively for the weighting of query terms [23].
  • We have shown that, by accepting some empirically motivated assumptions concerning query terms, the quantity woe can be approximated by MI(occ; rel). By further assuming that MI(occ; rel) is roughly linear in log O(occ), we showed that traditional idf formulations should perform well.
  • If we accept the hypothesis that the plot of figure 6 is representative of the general behavior of query terms for the types of queries and collections we study, we should expect improved retrieval performance from a term weighting formula that accounts for the observed flattening.
  • We have shown strong empirical support for concluding that MI(occ; rel) as a function of log O(occ) is roughly linear, and that this, together with the order of magnitude of the slope, can be used to explain why inverse document frequency has been found to be so useful for term weighting.
  • Previous probabilistic explanations have started from plausible a priori assumptions, in particular assumptions concerning the probability of a query term occurring in a relevant document.
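
To make the quantities named in these highlights concrete, here is a minimal Python sketch; it is not the authors' code. It assumes the usual pointwise definitions, which this summary does not spell out: MI(occ; rel) = log[p(occ | rel) / p(occ)], O(occ) = p(occ) / (1 - p(occ)), and Good's weight of evidence woe = log[p(occ | rel) / p(occ | non-rel)]. The function name `term_statistics` and its count arguments are hypothetical.

```python
import math


def term_statistics(N, R, n, r):
    """Estimate per-term quantities from raw counts (illustrative only).

    N: documents in the collection, R: relevant documents,
    n: documents containing the term, r: relevant documents containing it.
    Assumes 0 < r < R and 0 < n - r < N - R so every log is defined.
    """
    p_occ = n / N                        # p(occ)
    p_occ_rel = r / R                    # p(occ | rel)
    p_occ_nonrel = (n - r) / (N - R)     # p(occ | non-rel)

    log_odds_occ = math.log(p_occ / (1 - p_occ))   # log O(occ)
    mi = math.log(p_occ_rel / p_occ)               # MI(occ; rel), assumed pointwise form
    woe = math.log(p_occ_rel / p_occ_nonrel)       # weight of evidence, assumed form
    return {"log_odds_occ": log_odds_occ, "mi": mi, "woe": woe}


# Hypothetical counts: a term in 1% of the collection and 40% of relevant documents.
stats = term_statistics(N=100_000, R=100, n=1_000, r=40)
```

When relevant documents are a tiny fraction of the collection, p(occ | non-rel) is close to p(occ), so `woe` and `mi` come out nearly equal. That is consistent with the approximation of woe by MI(occ; rel) described in the second highlight.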
Results
  • A value close to 1 for -log p(occ | rel) is achieved by only a small percentage of query terms: those which appear in more than 25% of all documents.
Conclusion
  • The authors have shown strong empirical support for concluding that MI(occ; rel) as a function of log O(occ) is roughly linear, and that this, together with the order of magnitude of the slope, can be used to explain why inverse document frequency has been found to be so useful for term weighting.
  • With the availability of large numbers of conscientiously formulated queries, systematically judged against diverse, voluminous document collections, pertinent information becomes accessible.
  • Inspection of this data supplies them with sufficient reason for assigning unequal probabilities for p(occ | rel) based on a term's document frequency.
Related work
  • In 1972, Sparck Jones convincingly demonstrated that the weighting of query terms can significantly improve retrieval performance compared to unweighted coordination match ranking [23]. The weighting formula she proposed was an approximation of:

    $$ w_{sj} = \log \frac{N}{n} \qquad (3) $$

    where n is the document frequency of the term (the number of documents in which the term appears), and N is the number of documents in the entire collection.

    3.1 Probabilistic Explanations

    In a letter to the Journal of Documentation, Robertson pointed out that, viewed as a function of the probability of term occurrence, the sum of weights could be interpreted as the probability of mutual occurrence of multiple query terms [17], thus providing theoretical arguments for the use of $w_{sj}$. Together, in 1976, Robertson and Sparck Jones presented the Binary Independence Model [18], in which terms are weighted by:

    $$ w_{rsj} = \log \frac{p(occ \mid rel)\,\bigl(1 - p(occ \mid \overline{rel})\bigr)}{p(occ \mid \overline{rel})\,\bigl(1 - p(occ \mid rel)\bigr)} \qquad (4) $$

    where $p(occ \mid rel)$ is the probability of the term occurring in relevant documents, and $p(occ \mid \overline{rel})$ is the corresponding probability for non-relevant documents. Use of the model depends on the availability of relevance feedback information, on which estimates of the two conditional probabilities can be based.

    Applying the probabilistic approach of Robertson and Sparck Jones, Croft and Harper [5] work with an equivalent formulation of $w_{rsj}$:

    $$ w_{rsj} = \log \frac{p(occ \mid rel)}{1 - p(occ \mid rel)} - \log \frac{p(occ \mid \overline{rel})}{1 - p(occ \mid \overline{rel})} \qquad (5) $$

    Their goal is the development of a probabilistically justified weighting formula that can be used in a retrieval setting in the absence of, or prior to, relevance feedback. They make two assumptions: 1) there "is no information about the relevant documents and we could therefore assume that all the query terms had equal probabilities of occurring in the relevant documents" [5, p. 287]; and 2) the probability, $p(occ \mid \overline{rel})$, of a term occurring in non-relevant documents can be estimated by n/N, the proportion of documents that contain the term in the entire collection. With these two assumptions, the combination match formula:

    $$ w_{ch} = k + \log \frac{N - n}{n} \qquad (6) $$
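
    As a rough illustration of equations (3), (4) and (6), here is a minimal Python sketch; it is not code from the paper. It assumes natural logarithms, and the function names and the shared value `p_rel` are hypothetical. Substituting Croft and Harper's two assumptions into (5), a constant p(occ | rel) = p for every query term and p(occ | non-rel) = n/N, gives log[p / (1 - p)] + log[(N - n) / n], so the constant k in (6) is taken here to be log[p / (1 - p)].

```python
import math


def w_sj(N, n):
    """Equation (3): idf-style weight log(N / n)."""
    return math.log(N / n)


def w_rsj(p_rel, p_nonrel):
    """Equation (4): log odds ratio of term occurrence in relevant
    versus non-relevant documents."""
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))


def w_ch(N, n, k=0.0):
    """Equation (6): k + log((N - n) / n)."""
    return k + math.log((N - n) / n)


N, n = 100_000, 250      # hypothetical collection size and document frequency
p_rel = 0.5              # assumed equal p(occ | rel) for all query terms
k = math.log(p_rel / (1 - p_rel))

# With p(occ | non-rel) estimated by n / N, (4) and (6) coincide.
assert math.isclose(w_rsj(p_rel, n / N), w_ch(N, n, k))
```

    For rare terms (n much smaller than N), `w_ch(N, n)` and `w_sj(N, n)` differ only slightly, since log((N - n) / n) = log(N / n) + log(1 - n / N).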
Funding
  • This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, and also supported in part by United States Patent and Trademark Office and Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235.
Reference
  • [1] J. P. Callan, W. B. Croft, and S. M. Harding. The inquery retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78-83, 1992.
  • [2] Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using statistics in lexical analysis. In Uri Zernik, editor, Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pages 115-164, Hillsdale, NJ, 1991. Lawrence Erlbaum Associates.
  • [3] W. S. Cooper, D. Dabney, and F. Gey. Probabilistic retrieval based on staged logistic regression. In Nicholas Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 198-210, Copenhagen, Denmark, June 1992.
  • [4] Wm. S. Cooper, Aitao Chen, and Fredric C. Gey. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In D. K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 57-66, Gaithersburg, Md., March 1994. NIST Special Publication 500-215.
  • [5] W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, December 1979.
  • [6] W. B. Croft and Jinxi Xu. Corpus-specific stemming using word form co-occurrence. In Proceedings for the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 147-159, Las Vegas, Nevada, April 1995.
  • [7] Robert M. Fano. Transmission of Information; a Statistical Theory of Communications. MIT Press, Cambridge, MA, 1961.
  • [8] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183-204, 1989.
  • [9] Norbert Fuhr and Chris Buckley. Probabilistic document indexing from relevance feedback data. ACM Transactions on Information Systems, 9(2):45-61, 1991.
  • [10] Fredric C. Gey. Inferring probability of relevance using the method of logistic regression. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 222-231, Dublin, Ireland, July 1994.
  • [11] I. J. Good. Probability and the Weighing of Evidence. Charles Griffin, London, 1950.
  • [12] I. J. Good. Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics 2, pages 249-269. North-Holland, Amsterdam, 1983.
  • [13] Donna Harman. Overview of the first Text REtrieval Conference (TREC-1). In D. K. Harman, editor, The First Text REtrieval Conference (TREC-1), pages 1-20, Gaithersburg, Md., February 1993. NIST Special Publication 500-207.
  • [14] Donna Harman. Overview of the fifth Text REtrieval Conference (TREC-5). In E. M. Voorhees and D. K. Harman, editors, The Fifth Text REtrieval Conference (TREC-5), pages 1-28, Gaithersburg, Md., November 1997. NIST Special Publication 500-238.
  • [15] D. J. Harper and C. J. van Rijsbergen. An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3):189-216, September 1978.
  • [16] Frederick Hartwig and Brian E. Dearing. Exploratory Data Analysis. Sage Publications, 1979.
  • [17] S. E. Robertson. Term specificity. Journal of Documentation, 28(2):164-165, 1972. Letter to the editor, with response by K. Sparck Jones.
  • [18] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129-146, 1977.
  • [19] S. E. Robertson and S. Walker. On relevance weights with little relevance information. In Nicholas J. Belkin, A. Desai Narasimhalu, and Peter Willett, editors, Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 16-24, Philadelphia, Pennsylvania, July 1997.
  • [20] G. Salton, A. Wong, and C. T. Yu. Automatic indexing using term discrimination and term precision measurements. Information Processing & Management, 12:43-51, 1976.
  • [21] G. Salton, H. Wu, and C. Y. Yu. The measurement of term importance in automatic indexing. Journal of the American Society for Information Science, 32:175-186, 1981.
  • [22] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
  • [23] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972.
  • [24] John W. Tukey. Exploratory Data Analysis. Addison-Wesley Publishing Company, Reading, MA, 1977.
  • [25] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
  • [26] S. K. M. Wong and Y. Y. Yao. An information-theoretic measure of term specificity. Journal of the American Society for Information Science, 43(1):54-61, 1992.
  • [27] C. T. Yu, K. Lam, and G. Salton. Term weighting in information retrieval using the term precision model. Journal of the ACM, 29(1):152-170, January 1982.
  • [28] Clement T. Yu and Hirotaka Mizuno. Two learning schemes in information retrieval. In Yves Chiaramella, editor, Proceedings of the 11th International Conference on Research and Development in Information Retrieval, pages 201-215, Grenoble, France, June 1988.