Integrating text search and relational databases: functionality and performance (2006)
Abstract
Applications increasingly involve a mix of free-text documents and traditional relational tables [46]. Commercial relational database management systems (RDBMSs) store both types of data and support access through keyword search, traditional relational operators in SQL, or mixed queries that combine both. However, application developers lack tools for addressing functionality and performance concerns when integrating keyword search in an RDBMS, even though such tools are available for traditional scalar data. With regard to functionality, this thesis proposes TextViews as a fully declarative way to specify virtual collections of virtual documents for use with keyword search. For performance, this thesis proposes TEXTURE, a benchmark for comparing RDBMSs given a workload of mixed queries.

Current RDBMSs store a document as a single attribute value and a collection as a single table. TextViews are an adaptation of relational views for defining documents that are composed of multiple stored documents, possibly from multiple tables. Such documents are grouped into a collection and ranked using keyword search. Keyword search can be evaluated either by materializing the TextView and then searching it, or by using inverted indexes built on the base tables. Inverted indexes do not take advantage of the scalar attributes used in the selection and grouping operations specified in TextView definitions. Consequently, we propose several alternative indexes, for which we demonstrate an order-of-magnitude improvement in keyword search response time with a modest increase in storage compared to inverted indexes.

The TEXTURE benchmark [28] compares RDBMSs by measuring the response time needed to evaluate a workload of mixed queries. A micro-benchmark design is used to allow fine-grained control over the query workload and data set. To support database scale-up experiments, TextGen, a novel synthetic text generator, was developed and evaluated. TextGen is unique in that it can accurately scale up an input "seed" text collection while preserving important data characteristics. The TEXTURE benchmark was used to evaluate three commercial RDBMSs, demonstrating large differences between them for a variety of workloads.
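The two baseline ways of evaluating keyword search over a TextView mentioned above (materialize the view and then search, or consult an inverted index on the base table) can be illustrated with a small sketch. This is not the thesis's implementation: the `ROWS` table, the grouping by `author`, all function names, and the raw term-frequency scoring are assumptions made purely for illustration, and the alternative indexes proposed in the thesis are not shown.

```python
from collections import defaultdict

# Hypothetical base table: (row_id, author, text). Illustrative data only.
ROWS = [
    (1, "ann", "keyword search in relational databases"),
    (2, "ann", "benchmark design for mixed queries"),
    (3, "bob", "inverted index storage cost"),
    (4, "bob", "keyword search with an inverted index"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def materialize_view(rows):
    """Group rows by author and concatenate their text into one virtual document."""
    docs = defaultdict(list)
    for _row_id, author, text in rows:
        docs[author].extend(tokenize(text))
    return dict(docs)


def search_materialized(rows, term):
    """Strategy 1: materialize the view, then scan it, scoring by term frequency."""
    view = materialize_view(rows)
    scores = {key: tokens.count(term) for key, tokens in view.items()}
    return {key: score for key, score in scores.items() if score > 0}


def build_inverted_index(rows):
    """Inverted index on the base table: term -> {row_id: term frequency}."""
    index = defaultdict(dict)
    for row_id, _author, text in rows:
        for token in tokenize(text):
            index[token][row_id] = index[token].get(row_id, 0) + 1
    return index


def search_with_index(rows, index, term):
    """Strategy 2: look up postings on the base table, then regroup by author."""
    author_of = {row_id: author for row_id, author, _text in rows}
    scores = defaultdict(int)
    for row_id, tf in index.get(term, {}).items():
        scores[author_of[row_id]] += tf
    return dict(scores)


if __name__ == "__main__":
    idx = build_inverted_index(ROWS)
    print(search_materialized(ROWS, "index"))
    print(search_with_index(ROWS, idx, "index"))
```

In this sketch the inverted index knows nothing about the grouping attribute, so the second strategy must map every posting back to its group after the lookup. That extra regrouping step is the kind of work an index aware of the view's scalar attributes could avoid.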
Keywords
mixed query,response time,keyword search,Current RDBMSs,relational view,important data characteristic,relational databases,commercial RDBMSs,Integrating text search,TEXTURE benchmark,commercial relational database management,inverted index