# A unified approach to ranking in probabilistic databases

The VLDB Journal — The International Journal on Very Large Data Bases, no. 2 (2011): 249-275

Abstract

The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as...

Introduction

- Recent years have seen a dramatic increase in the number of application domains that naturally generate uncertain data and that demand support for executing complex decision support queries over them
- These include information retrieval [21], data integration and cleaning [2, 18], text analytics [25, 31], social network analysis [1], sensor data management [12, 17], financial applications, biological and scientific data management, etc.
- Use of automated tools in data integration and information extraction can introduce significant uncertainty in the output

Highlights

- Recent years have seen a dramatic increase in the number of application domains that naturally generate uncertain data and that demand support for executing complex decision support queries over them
- We develop novel algorithms based on generating functions to efficiently rank the tuples in a probabilistic dataset using any parameterized ranking function (PRF)
- Considering the complex interplay between probabilities and scores, instead of proposing a specific ranking function, we propose using two parameterized ranking functions, called PRF^ω and PRF^e, which allow the user to control the tuples that appear in the top-k answers
- We developed novel algorithms for evaluating these ranking functions over large, possibly correlated, probabilistic datasets
- We developed an approach for approximating a ranking function using a linear combination of PRF^e functions, enabling highly efficient, albeit approximate, computation, and for learning a ranking function from user preferences
- The issues of ranking have been studied for many years in disciplines ranging from economics to information retrieval; better understanding the connections between that work and ranking in probabilistic databases remains a fruitful direction for further research

Methods

- First, the authors note that the naive method gives an O(n^2) time algorithm by a simple counting argument.
- The time to multiply P_i and P_{i+1} is O(d(P_i) · d(P_{i+1})).
- The total time complexity is then a sum of these pairwise multiplication costs, with indices running up to k−1.
- Divide-and-Conquer: the authors show how to use divide-and-conquer and the FFT (Fast Fourier Transform) to achieve an O(n log^2 n) time algorithm.
- It is well known that the multiplication of two polynomials of degree O(n) can be done in O(n log n) time using FFT.
- The divide-and-conquer algorithm is as follows: If there exists any
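The divide-and-conquer idea above can be illustrated with a minimal sketch. It is not the authors' exact algorithm, whose description is cut off in this summary. It only shows the general technique of multiplying many low-degree polynomials by splitting the list in half and combining the halves with FFT-based multiplication. The function names and the example factors of the form (1 − p) + p·x are assumptions made for illustration.

```python
import numpy as np

def fft_multiply(a, b):
    """Multiply two coefficient arrays (lowest degree first) using the FFT."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:          # pad to a power of two
        size *= 2
    fa = np.fft.rfft(a, size)
    fb = np.fft.rfft(b, size)
    return np.fft.irfft(fa * fb, size)[:out_len]

def product_of_polys(polys):
    """Divide-and-conquer product of a list of polynomials."""
    if not polys:
        return np.array([1.0])     # empty product is the constant 1
    if len(polys) == 1:
        return np.asarray(polys[0], dtype=float)
    mid = len(polys) // 2
    left = product_of_polys(polys[:mid])
    right = product_of_polys(polys[mid:])
    return fft_multiply(left, right)

def naive_product(polys):
    """Multiply the polynomials one at a time (quadratic overall)."""
    result = np.array([1.0])
    for p in polys:
        result = np.convolve(result, p)
    return result

# Hypothetical example: one linear factor (1 - p) + p*x per tuple probability p.
probs = [0.5, 0.25, 0.9]
factors = [[1.0 - p, p] for p in probs]
coeffs = product_of_polys(factors)
print(np.allclose(coeffs, naive_product(factors)))
```

With n linear factors, the recursion has O(log n) levels and each level costs O(n log n) with FFT multiplication, which is where an O(n log^2 n) bound of the kind stated above comes from.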

Conclusion

- In this article the authors presented a unified framework for ranking over probabilistic databases, and presented several novel and highly efficient algorithms for answering top-k queries.
- The authors developed an approach for approximating a ranking function using a linear combination of PRF^e functions, enabling highly efficient, albeit approximate, computation, and for learning a ranking function from user preferences.
- Understanding the behavior of various ranking functions and their relationships across probabilistic databases with diverse uncertainties and correlation structures remains an important open problem in this area.
- The issues of ranking have been studied for many years in disciplines ranging from economics to information retrieval; better understanding the connections between that work and ranking in probabilistic databases remains a fruitful direction for further research

- Table 1: Normalized Kendall distance between top-k answers according to various ranking functions for two datasets. [...] the second dataset (by looking into the results, it shares fewer than 15 tuples with the Top-100 answers of the others). We observed similar behavior for other datasets, and for datasets with correlations.
- Table 2: Notation.
- Table 3: Summary of the running times. n is the number of tuples. d_i is the depth of tuple t_i in the and/xor tree.

Related work

- There has been much work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see, e.g., [12, 14, 21, 24, 38, 41, 53]). The work in this area has spanned a range of issues from theoretical development of data models and data languages to practical implementation issues such as indexing techniques; several research efforts are underway to build systems to manage uncertain data (e.g., MYSTIQ [14], Trio [53], ORION [12], MayBMS [38], PrDB [49]). The approaches can be differentiated based on whether they support tuple-level uncertainty where “existence” probabilities are attached to the tuples of the database, or attribute-level uncertainty where (possibly continuous) probability distributions are attached to the attributes, or both. The proposed approaches differ further based on whether they consider correlations or not. Most work in probabilistic databases has either assumed independence [14, 21] or has restricted the correlations that can be modeled [2, 41, 48]. More recently, several approaches have been presented that allow representation of arbitrary correlations and querying over correlated databases [24, 39, 49].

References

- E. Adar and C. Re. Managing uncertainty in social networks. IEEE Data Eng. Bull., 2007.
- P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.
- Y. Azar and I. Gamzu. Ranking with Submodular Valuations. Arxiv preprint arXiv:1007.2503, 2010.
- Y. Azar, I. Gamzu, and X. Yin. Multiple intents re-ranking. In STOC, pages 669–678, 2009.
- N. Bansal, K. Jain, A. Kazeykina, and J. Naor. Approximation Algorithms for Diversified Search Ranking. ICALP, pages 273–284, 2010.
- G. Beskales, M. Soliman, and I. Ilyas. Efficient search for the top-k probable nearest neighbors in uncertain databases. VLDB, 2008.
- G. Beylkin and L. Monzon. On approximation of functions by exponential sums. Applied and Computational Harmonic Analysis, 19:17–48, 2005.
- A. Bjorck and V. Pereyra. Solution of Vandermonde systems of equations. Mathematics of Computation, 24(112):893–903, 1970.
- C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
- R. Cheng, J. Chen, M. Mokbel, and C. Chow. Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data. In ICDE, 2008.
- R. Cheng, L. Chen, J. Chen, and X. Xie. Evaluating probability threshold k-nearest-neighbor queries over uncertain data. In EDBT, 2009.
- R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
- G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, 2009.
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
- N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, 2007.
- O. Dekel, C. Manning, and Y. Singer. Log-linear models for label-ranking. In NIPS 16, 2004.
- A. Deshpande, C. Guestrin, and S. Madden. Using probabilistic models for data management in acquisitional environments. In CIDR, 2005.
- X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.
- C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, 2001.
- R. Fagin, R. Kumar, and D. Sivakumar. Comparing top-k lists. In SODA, 2003.
- N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. on Info. Syst., 1997.
- T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: On score distribution and typical answers. In SIGMOD, pages 375–388, 2009.
- T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
- T. Green and V. Tannen. Models for incomplete and probabilistic information. In EDBT, 2006.
- R. Gupta and S. Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006.
- J. F. Hauer, C. J. Demeure, and L. L. Scharf. Initial results in Prony analysis of power system response signals. IEEE Transactions on Power Systems, 5(1):80–89, 1990.
- R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning preference relations for information retrieval. In ICML-98 Workshop: Text Categorization and Machine Learning, pages 80–84, 1998.
- M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In SIGMOD, 2008.
- I. Ilyas, G. Beskales, and M. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 2008.
- K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4), 2002.
- T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., 29(1), 2006.
- F. Jensen and F. Jensen. Optimal junction trees. In UAI, pages 360–366, 1994.
- C. Jin, K. Yi, L. Chen, J. Xu Yu, and X. Lin. Sliding-window top-k queries on uncertain streams. In VLDB, 2008.
- T. Joachims. Optimizing search engines using click-through data. In Proc. SIGKDD, pages 133–142, 2002.
- B. Kanagal and A. Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In ICDE, 2009.
- B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD, 2009.
- B. Kimelfeld and C. Re. Transducing Markov sequences. In PODS, pages 15–26, 2010.
- C. Koch. MayBMS: A System for Managing Large Uncertain and Probabilistic Databases. Managing and Mining Uncertain Data. Charu Aggarwal ed., 2009.
- C. Koch and D. Olteanu. Conditioning probabilistic databases. PVLDB, 1(1):313–325, 2008.
- H. P. Kriegel, P. Kunath, and M. Renz. Probabilistic nearest-neighbor query on uncertain objects. In DASFAA, 2007.
- L. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. Probview: a flexible probabilistic database system. TODS, 1997.
- J. Li and A. Deshpande. Consensus answers for queries over probabilistic databases. PODS, 2009.
- J. Li and A. Deshpande. Ranking continuous probabilistic datasets. In VLDB, 2010.
- T. Y. Liu. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
- X. Liu, M. Ye, J. Xu, Y. Tian, and W. Lee. k-selection query over uncertain data. In DASFAA (1), pages 444–459, 2010.
- C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
- C. Re, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, 2008.
- A. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
- P. Sen, A. Deshpande, and L. Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J., 18(5):1065–1090, 2009.
- M. Soliman, I. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
- M. Soliman and I. Ilyas. Ranking with uncertain scores. In ICDE, pages 317–328, 2009.
- P. Talukdar, M. Jacob, M. Mehmood, K. Crammer, Z. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785–796, 2008.
- J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.
- K. Yi, F. Li, D. Srivastava, and G. Kollios. Efficient processing of top-k queries in uncertain databases. In ICDE, 2008.
- X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In DBRank, 2008.
- O. Zuk, L. Ein-Dor, and E. Domany. Ranking under uncertainty. In UAI, pages 466–473, 2007.
- 2. The sum of two expressions, or
- 3. The product of two expressions.
- 2. Evaluate the polynomial at these points, i.e., compute f(x_i). It is easy to see that each evaluation takes linear time (bottom-up over the tree). So this step takes O(n^2) time in total.
- 3. Use any O(n^2) polynomial interpolation algorithm to find the coefficients. In fact, the interpolation reduces to finding a solution for the following linear system:
- 2. Use (10) to compute the coefficients. This again takes O(n^2) time.
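
The evaluate-then-interpolate steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation. The tuple-based expression encoding (`"const"`, `"var"`, `"add"`, `"mul"`) is hypothetical, the sample points 0, 1, ..., n are an arbitrary choice, and Newton's divided differences stand in for "any O(n^2) polynomial interpolation algorithm". Exact `Fraction` arithmetic is used to sidestep rounding issues.

```python
from fractions import Fraction

def evaluate(expr, x):
    """Evaluate an expression tree bottom-up at the point x."""
    kind = expr[0]
    if kind == "const":
        return Fraction(expr[1])
    if kind == "var":
        return x
    left, right = evaluate(expr[1], x), evaluate(expr[2], x)
    return left + right if kind == "add" else left * right

def interpolate(xs, ys):
    """Newton divided differences, then conversion to monomial coefficients."""
    n = len(xs)
    coef = list(ys)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    # Horner-style expansion of the Newton form, lowest degree first.
    poly = [coef[n - 1]]
    for k in range(n - 2, -1, -1):
        shifted = [Fraction(0)] + poly                    # x * poly
        scaled = [xs[k] * c for c in poly] + [Fraction(0)]  # x_k * poly
        poly = [s - t for s, t in zip(shifted, scaled)]
        poly[0] += coef[k]
    return poly

def expand(expr, degree_bound):
    """Recover the coefficients of expr, given an upper bound on its degree."""
    xs = [Fraction(i) for i in range(degree_bound + 1)]
    ys = [evaluate(expr, x) for x in xs]
    return interpolate(xs, ys)

# Hypothetical example: (1/2 + 1/2 x) * (3/4 + 1/4 x)
half, x = ("const", Fraction(1, 2)), ("var",)
first = ("add", half, ("mul", half, x))
second = ("add", ("const", Fraction(3, 4)), ("mul", ("const", Fraction(1, 4)), x))
print(expand(("mul", first, second), 2))
```

Each evaluation visits every node of the tree once, so evaluating at n+1 points costs O(n^2) for a tree of size O(n), and both loops in `interpolate` are quadratic in the number of points.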
