
A unified approach to ranking in probabilistic databases

The VLDB Journal — The International Journal on Very Large Data Bases, no. 2 (2011): 249-275


Abstract

The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as...

Introduction
  • Recent years have seen a dramatic increase in the number of application domains that naturally generate uncertain data and that demand support for executing complex decision support queries over them
  • These include information retrieval [21], data integration and cleaning [2, 18], text analytics [25, 31], social network analysis [1], sensor data management [12, 17], financial applications, biological and scientific data management, etc.
  • Use of automated tools in data integration and information extraction can introduce significant uncertainty in the output
Highlights
  • Recent years have seen a dramatic increase in the number of application domains that naturally generate uncertain data and that demand support for executing complex decision support queries over them
  • We develop novel algorithms based on generating functions to efficiently rank the tuples in a probabilistic dataset using any parameterized ranking function (PRF)
  • Considering the complex interplay between probabilities and scores, instead of proposing a specific ranking function, we propose using two parameterized ranking functions, called PRFω and PRFe, which allow the user to control the tuples that appear in the top-k answers
  • We developed novel algorithms for evaluating these ranking functions over large, possibly correlated, probabilistic datasets
  • We developed an approach for approximating a ranking function using a linear combination of PRFe functions enabling highly efficient, albeit approximate computation, and for learning a ranking function from user preferences
  • The issues of ranking have been studied for many years in disciplines ranging from economics to information retrieval; better understanding the connections between that work and ranking in probabilistic databases remains a fruitful direction for further research
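
The generating-function idea in the highlights can be illustrated with a minimal sketch. It assumes independent tuples already sorted by decreasing score, and it assumes a rank-weighted form for the ranking functions: the value of a tuple is the sum over ranks j of ω(j) · Pr(tuple is ranked j-th), with PRFe taking ω(j) = α^j. These definitions are not spelled out on this page, so treat them, and the function names below, as assumptions made for illustration only.

```python
def rank_distributions(probs):
    """probs[i] is the existence probability of the i-th tuple (tuples are
    sorted by decreasing score and assumed independent).
    Returns dist with dist[i][j] = Pr(tuple i exists and is ranked j-th);
    index 0 of each row is unused and always 0."""
    prefix = [1.0]  # coefficients of the product of (1 - p + p*x) over earlier tuples
    dist = []
    for p in probs:
        # Generating function of tuple i: prefix(x) * p * x
        # (shift the prefix coefficients by one and scale by p).
        dist.append([0.0] + [p * c for c in prefix])
        # Fold this tuple's factor (1 - p + p*x) into the prefix product.
        nxt = [0.0] * (len(prefix) + 1)
        for j, c in enumerate(prefix):
            nxt[j] += c * (1 - p)
            nxt[j + 1] += c * p
        prefix = nxt
    return dist


def prf_omega(probs, weight):
    """Rank-weighted value of every tuple for an arbitrary weight(j)."""
    return [
        sum(weight(j) * q for j, q in enumerate(row) if j > 0)
        for row in rank_distributions(probs)
    ]


def prf_e(probs, alpha):
    """Special case weight(j) = alpha**j: the generating function is simply
    evaluated at x = alpha, so one running product suffices."""
    values, running = [], 1.0
    for p in probs:
        values.append(running * p * alpha)
        running *= 1 - p + p * alpha
    return values
```

Under these assumptions, `prf_omega(probs, lambda j: alpha ** j)` and `prf_e(probs, alpha)` agree up to floating-point error, while the second avoids materializing the full rank distribution.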
Methods
  • First, the authors note that the naive method gives an O(n^2) time algorithm, by a simple counting argument.
  • The time to multiply P_i and P_{i+1} is O(d(P_i) · d(P_{i+1})), where d(P) denotes the degree of the polynomial P.
  • Summing this cost over all the multiplications gives the total time complexity, which matches the O(n^2) bound above.
  • Divide-and-conquer: the authors show how to use divide-and-conquer and the fast Fourier transform (FFT) to achieve an O(n log^2 n) time algorithm.
  • It is well known that the multiplication of two polynomials of degree O(n) can be done in O(n log n) time using FFT.
  • The divide-and-conquer algorithm is as follows: If there exists any
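
As a rough illustration of the divide-and-conquer idea, the sketch below multiplies a list of polynomials by recursively splitting the list in half and combining the two partial products with an FFT-based multiplication. The description of the authors' algorithm is truncated on this page, so this is a generic version of the technique, not their exact procedure; the function names and the simple radix-2 FFT are choices made here for illustration.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    The inverse transform is left unscaled (the caller divides by n)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def multiply(p, q):
    """Multiply two real coefficient lists (lowest degree first) via FFT."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:
        n *= 2
    fp = fft([complex(c) for c in p] + [0j] * (n - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (n - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [prod[i].real / n for i in range(size)]


def product(polys):
    """Product of many polynomials: split the list in half, recurse on each
    half, then combine the two partial products with one FFT multiplication."""
    if not polys:
        return [1.0]
    if len(polys) == 1:
        return list(polys[0])
    mid = len(polys) // 2
    return multiply(product(polys[:mid]), product(polys[mid:]))
```

For example, `product([[1 - p, p] for p in probs])` yields the coefficients of the product of the linear factors (1 - p) + p·x. With n such factors, each level of the recursion costs O(n log n) and there are O(log n) levels, which is where an O(n log^2 n) bound comes from.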
Conclusion
  • In this article the authors presented a unified framework for ranking over probabilistic databases, and presented several novel and highly efficient algorithms for answering top-k queries.
  • The authors developed an approach for approximating a ranking function using a linear combination of PRFe functions enabling highly efficient, albeit approximate computation, and for learning a ranking function from user preferences.
  • Understanding the behavior of various ranking functions and their relationships across probabilistic databases with diverse uncertainties and correlation structures remains an important open problem in this area.
  • The issues of ranking have been studied for many years in disciplines ranging from economics to information retrieval; better understanding the connections between that work and ranking in probabilistic databases remains a fruitful direction for further research
Tables
  • Table 1: Normalized Kendall distance between top-k answers according to various ranking functions for two datasets. ... the second dataset (by looking into the results, it shares fewer than 15 tuples with the Top-100 answers of the others). We observed similar behavior for other datasets, and for datasets with correlations
  • Table 2: Notation
  • Table 3: Summary of the running times. n is the number of tuples. d_i is the depth of tuple t_i in the and/xor tree
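
To make the distance measure in Table 1 concrete, here is a small sketch of a Kendall-style distance between two top-k lists. The page does not say which variant the paper uses, so this follows the optimistic variant from Fagin, Kumar, and Sivakumar's "Comparing top-k lists" (a pair that appears in only one list and is absent from the other costs nothing) and normalizes by k^2 so that two disjoint lists score 1. Both choices are assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def normalized_kendall_topk(list1, list2):
    """Kendall-style distance between two top-k lists of equal length k,
    normalized by k*k (assumed normalization)."""
    k = len(list1)
    if k == 0 or len(list2) != k:
        raise ValueError("both lists must be non-empty and of equal length")
    pos1 = {item: rank for rank, item in enumerate(list1)}
    pos2 = {item: rank for rank, item in enumerate(list2)}
    items = list(dict.fromkeys(list(list1) + list(list2)))  # union, order kept
    penalty = 0
    for i, j in combinations(items, 2):
        in1 = (i in pos1, j in pos1)
        in2 = (i in pos2, j in pos2)
        if all(in1) and all(in2):
            # Both lists rank both items: penalize opposite orders.
            if (pos1[i] < pos1[j]) != (pos2[i] < pos2[j]):
                penalty += 1
        elif all(in1) and any(in2):
            # list2 contains exactly one of the two, so it implicitly ranks
            # that item ahead of the missing one.
            present, missing = (i, j) if in2[0] else (j, i)
            if pos1[missing] < pos1[present]:
                penalty += 1
        elif all(in2) and any(in1):
            present, missing = (i, j) if in1[0] else (j, i)
            if pos2[missing] < pos2[present]:
                penalty += 1
        elif all(in1) or all(in2):
            # Both items in one list, neither in the other: optimistic cost 0.
            pass
        else:
            # One item appears only in list1, the other only in list2.
            penalty += 1
    return penalty / (k * k)
```

Identical lists give 0.0 and fully disjoint lists give 1.0; intermediate values reflect both reordered pairs and items that appear in only one of the two answers.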
Related work
  • There has been much work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see, e.g., [12, 14, 21, 24, 38, 41, 53]). The work in this area has spanned a range of issues from theoretical development of data models and data languages to practical implementation issues such as indexing techniques; several research efforts are underway to build systems to manage uncertain data (e.g., MYSTIQ [14], Trio [53], ORION [12], MayBMS [38], PrDB [49]). The approaches can be differentiated based on whether they support tuple-level uncertainty where “existence” probabilities are attached to the tuples of the database, or attribute-level uncertainty where (possibly continuous) probability distributions are attached to the attributes, or both. The proposed approaches differ further based on whether they consider correlations or not. Most work in probabilistic databases has either assumed independence [14, 21] or has restricted the correlations that can be modeled [2, 41, 48]. More recently, several approaches have been presented that allow representation of arbitrary correlations and querying over correlated databases [24, 39, 49].
Reference
  • E. Adar and C. Re. Managing uncertainty in social networks. IEEE Data Eng. Bull., 2007.
  • P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.
  • Y. Azar and I. Gamzu. Ranking with submodular valuations. arXiv preprint arXiv:1007.2503, 2010.
  • Y. Azar, I. Gamzu, and X. Yin. Multiple intents re-ranking. In STOC, pages 669–678, 2009.
  • N. Bansal, K. Jain, A. Kazeykina, and J. Naor. Approximation algorithms for diversified search ranking. In ICALP, pages 273–284, 2010.
  • G. Beskales, M. Soliman, and I. Ilyas. Efficient search for the top-k probable nearest neighbors in uncertain databases. In VLDB, 2008.
  • G. Beylkin and L. Monzon. On approximation of functions by exponential sums. Applied and Computational Harmonic Analysis, 19:17–48, 2005.
  • A. Bjorck and V. Pereyra. Solution of Vandermonde systems of equations. Mathematics of Computation, 24(112):893–903, 1970.
  • C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
  • R. Cheng, J. Chen, M. Mokbel, and C. Chow. Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data. In ICDE, 2008.
  • R. Cheng, L. Chen, J. Chen, and X. Xie. Evaluating probability threshold k-nearest-neighbor queries over uncertain data. In EDBT, 2009.
  • R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
  • G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, 2009.
  • N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
  • N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, 2007.
  • O. Dekel, C. Manning, and Y. Singer. Log-linear models for label-ranking. In NIPS 16, 2004.
  • A. Deshpande, C. Guestrin, and S. Madden. Using probabilistic models for data management in acquisitional environments. In CIDR, 2005.
  • X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.
  • C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, 2001.
  • R. Fagin, R. Kumar, and D. Sivakumar. Comparing top-k lists. In SODA, 2003.
  • N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. on Info. Syst., 1997.
  • T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: On score distribution and typical answers. In SIGMOD, pages 375–388, 2009.
  • T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
  • T. Green and V. Tannen. Models for incomplete and probabilistic information. In EDBT, 2006.
  • R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006.
  • J. F. Hauer, C. J. Demeure, and L. L. Scharf. Initial results in Prony analysis of power system response signals. IEEE Transactions on Power Systems, 5(1):80–89, 1990.
  • R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning preference relations for information retrieval. In ICML-98 Workshop: Text Categorization and Machine Learning, pages 80–84, 1998.
  • M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In SIGMOD, 2008.
  • I. Ilyas, G. Beskales, and M. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 2008.
  • K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4), 2002.
  • T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., 29(1), 2006.
  • F. Jensen and F. Jensen. Optimal junction trees. In UAI, pages 360–366, 1994.
  • C. Jin, K. Yi, L. Chen, J. Xu Yu, and X. Lin. Sliding-window top-k queries on uncertain streams. In VLDB, 2008.
  • T. Joachims. Optimizing search engines using click-through data. In Proc. SIGKDD, pages 133–142, 2002.
  • B. Kanagal and A. Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In ICDE, 2009.
  • B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD, 2009.
  • B. Kimelfeld and C. Re. Transducing Markov sequences. In PODS, pages 15–26, 2010.
  • C. Koch. MayBMS: A system for managing large uncertain and probabilistic databases. In Managing and Mining Uncertain Data, Charu Aggarwal ed., 2009.
  • C. Koch and D. Olteanu. Conditioning probabilistic databases. PVLDB, 1(1):313–325, 2008.
  • H.P. Kriegel, P. Kunath, and M. Renz. Probabilistic nearest-neighbor query on uncertain objects. In DASFAA, 2007.
  • L. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. TODS, 1997.
  • J. Li and A. Deshpande. Consensus answers for queries over probabilistic databases. In PODS, 2009.
  • J. Li and A. Deshpande. Ranking continuous probabilistic datasets. In VLDB, 2010.
  • T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
  • X. Liu, M. Ye, J. Xu, Y. Tian, and W. Lee. k-selection query over uncertain data. In DASFAA (1), pages 444–459, 2010.
  • C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
  • C. Re, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, 2008.
  • A. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
  • P. Sen, A. Deshpande, and L. Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J., 18(5):1065–1090, 2009.
  • M. Soliman, I. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
  • M. Soliman and I. Ilyas. Ranking with uncertain scores. In ICDE, pages 317–328, 2009.
  • P. Talukdar, M. Jacob, M. Mehmood, K. Crammer, Z. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785–796, 2008.
  • J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.
  • K. Yi, F. Li, D. Srivastava, and G. Kollios. Efficient processing of top-k queries in uncertain databases. In ICDE, 2008.
  • X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In DBRank, 2008.
  • O. Zuk, L. Ein-Dor, and E. Domany. Ranking under uncertainty. In UAI, pages 466–473, 2007.
  • 2. The sum of two expressions, or
  • 3. The product of two expressions.
  • 2. Evaluate the polynomial at these points, i.e., compute f(x_i). It is easy to see that each evaluation takes linear time (bottom-up over the tree). So this step takes O(n^2) time in total.
  • 3. Use any O(n^2) polynomial interpolation algorithm to find the coefficients. In fact, the interpolation reduces to finding a solution for the following linear system:
  • 2. Use (10) to compute the coefficients. This again takes O(n^2) time.