Trustworthy keyword search for regulatory-compliant records retention
VLDB, pp.1001-1012, (2006)
Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able ...
- Documents such as electronic mail, financial statements, meeting memos, drug development logs, and quality assurance documents are valuable assets.
- Key decisions in business operations and other critical activities are based on information in these documents, so they must be maintained in a trustworthy fashion: safe from improper destruction or modification, and readily accessible.
- Businesses increasingly store these documents electronically, making them relatively easy to delete and modify without leaving much of a trace.
- The US alone has over 10,000 regulations that mandate how records should be managed.
- Many of those focus on ensuring that records are trustworthy (e.g., Securities and Exchange Commission (SEC) Rule 17a-4 and the Sarbanes-Oxley Act).
- Through extensive simulations and experiments with an IBM intranet search engine, we demonstrate that the scheme achieves online update speed while maintaining good query performance
- We present and evaluate jump indexes, a novel trustworthy and efficient index for join operations on posting lists for multi-keyword queries
- We have presented a threat model for trustworthy recordkeeping for legislative compliance, and identified key requirements for trustworthy indexes for this environment. One such requirement is the need for immediate indexing of newly inserted documents, which renders inapplicable the traditional approach of logging new posting list entries and periodically rebuilding the posting lists from scratch
- We proposed a scheme based on judicious merging of posting lists and the optional use of jump indexes for faster conjunctive query processing
- Through extensive simulations and experiments with an IBM intranet search engine and a workload of IBM intranet queries and documents, we demonstrated that merged posting lists and jump indexes offer excellent performance
- Earlier indexing methods for WORM storage (e.g., [1, 7, 20, 24]) were motivated by the desire to store data on optical disks, which once had an advantage over magnetic disks in terms of cost and storage capacity.
- These methods were designed for minimizing storage overhead and maximizing performance, and do not provide trustworthy recordkeeping.
- If there is a cache miss, the least recently used cache block is written out, and the needed block is read
- The key jump index parameters are L, p, and B.
- At the end of each posting list block, the authors leave space to store jump pointers.
- The number of pointers per index block, (B − 1) · log_B(N), depends on N, the largest document ID expected to be indexed.
- To store more than N documents, the authors can chain additional blocks of jump pointers off the end of the posting list block.
- Figure 8(a) shows the space overhead of a jump index, computed as the ratio of the space allocated for pointers to the space occupied by actual posting elements.
- The authors have presented a threat model for trustworthy recordkeeping for legislative compliance, and identified key requirements for trustworthy indexes for this environment
- One such requirement is the need for immediate indexing of newly inserted documents, which renders inapplicable the traditional approach of logging new posting list entries and periodically rebuilding the posting lists from scratch.
- For a disjunctive keyword query workload, merged posting lists without jump indexes are only 14% slower than the baseline approach, and merged posting lists with jump indexes (32-way branching) are only 26% slower than the baseline approach (due to the
- Search engines typically use inverted indexes to support keyword search. As shown in Figure 1(a), an inverted index comprises a dictionary of keywords and associated posting lists of document identifiers (with additional metadata such as keyword frequency, type, position) for each keyword.
In a trustworthy index, the posting list entries for a document must be durable, and the path to each entry must also be durable. This can be achieved by keeping each posting list in an append-only file in WORM storage. The index can be updated when a new document is added, by appending its document ID to the posting lists of all the keywords it contains. Unfortunately, this operation can be prohibitively slow, as each file append will require a random I/O. For example, in the data set used in our experiments, each document contains almost 500 keywords on average. If each append incurs a 2 msec random I/O, it would take 1 second to index a document.
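The update path and cost estimate described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name AppendOnlyInvertedIndex, the helper indexing_time_seconds, and the in-memory lists standing in for append-only WORM files are all assumptions made for illustration. It shows only that indexing one document costs one append per distinct keyword, and that about 500 appends at 2 msec each come to roughly 1 second.

```python
from collections import defaultdict


class AppendOnlyInvertedIndex:
    """Toy stand-in for posting lists kept in append-only files.

    Hypothetical illustration: real WORM files are replaced by
    in-memory lists that are only ever appended to.
    """

    def __init__(self):
        self._postings = defaultdict(list)  # keyword -> list of document IDs
        self._last_doc_id = None            # most recently indexed document ID
        self.appends = 0                    # appends performed (one random I/O each)

    def add_document(self, doc_id, keywords):
        # Assumption: document IDs are assigned in increasing order, so a
        # plain append keeps every posting list sorted.
        if self._last_doc_id is not None and doc_id <= self._last_doc_id:
            raise ValueError("document IDs must be strictly increasing")
        for keyword in set(keywords):
            self._postings[keyword].append(doc_id)  # append only, never rewrite
            self.appends += 1
        self._last_doc_id = doc_id

    def lookup(self, keyword):
        # Return an immutable copy of the posting list for this keyword.
        return tuple(self._postings.get(keyword, ()))


def indexing_time_seconds(keywords_per_doc, io_msec):
    """Estimated time to index one document when each append is one random I/O."""
    return keywords_per_doc * io_msec / 1000.0


index = AppendOnlyInvertedIndex()
index.add_document(1, ["worm", "storage", "index"])
index.add_document(2, ["worm", "query"])
print(index.lookup("worm"))            # (1, 2)
print(indexing_time_seconds(500, 2))   # 1.0
```

Because the text gives the keyword count as "almost 500", the 1-second figure is an approximation rather than an exact measurement.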
- ∗This research was partially supported by an IBM internship. †This research was supported by NSF under grants IIS-0331707, CNS-0325951, and CNS-0524695.
- B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An Asymptotically Optimal Multiversion B-tree. VLDB Journal, 5:264–275, 1996.
- K. Blibech and A. Gabillon. Chronos: an authenticated dictionary based on skip lists for timestamping systems. In Workshop on Secure Web Services, 2005.
- E. Brown, J. Callan, and W. Croft. Fast incremental indexing for full-text information retrieval. In VLDB, 1994.
- E. W. Brown, J. P. Callan, W. B. Croft, and J. E. B. Moss. Supporting full-text information retrieval with a persistent object store. In EDBT, 1994.
- P. Crescenzi and V. Kann. A compendium of NP optimization problems. Available at http://www.nada.kth.se/.
- D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In SIGIR, 1990.
- M. C. Easton. Key-Sequence Data Sets on Indelible Storage. IBM J. Research & Development, May 1986.
- EMC Corp. EMC Centera Content Addressed Storage System, 2003. www.emc.com/products/systems/centera_ce.jsp.
- C. Faloutsos. Access methods for text. ACM Computing Surveys, vol. 17, pp. 49-74, 1985.
- C. Faloutsos and H. V. Jagadish. On B-tree indices for skewed distributions. In VLDB, 1992.
- M. F. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and J. Zien. High performance index build algorithms for intranet search engines. In VLDB, 2004.
- E. Goh, H. Shacham, N. Modadugu, and D. Boneh. Sirius: Securing remote untrusted storage. In NDSS, 2003.
- M. Goodrich, R. Tamassia, and A. Schwerin. Implementation of an authenticated dictionary with skip lists and commutative hashing. In DISCEX II, 2001.
- H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice-Hall, 2000.
- H. Hacigumus, B. R. Iyer, and S. Mehrotra. Providing database as a service. In ICDE, 2002.
- S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. J. Am. Soc. for Info. Sci. & Tech., 54:8, Jun. 2003.
- L. Huang, W. Hsu, and F. Zheng. CIS: Content Immutable Storage for Trustworthy Record Keeping. In NASA MSST, 2006.
- IBM Corp. IBM TotalStorage DR550, 2004. Available at http://www-1.ibm.com/servers/storage/disk/dr.
- B. Klimt and Y. Yang. Introducing the Enron Corpus. In CEAS, 2004.
- T. Krijnen and L. G. L. T. Meertens. Making B-Trees Work for B. IW 219/83. The Mathematical Centre, Amsterdam, 1983.
- N. Lester, J. Zobel, and H. E. Williams. In-place versus re-build versus re-merge: index maintenance strategies for text retrieval systems. In Conf. on Australasian Computer Science, 2004.
- E. L. Miller, W. E. Freeman, D. D. E. Long, and B. C. Reed. Strong security for network-attached storage. In FAST, 2002.
- Network Appliance, Inc. SnapLock™ Compliance and SnapLock Enterprise Software, 2003. Available at http://www.netapp.com/products/filer/snaplock.html.
- P. Rathmann. Dynamic Data Structures on Optical Disks. In ICDE, 1984.
- S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. TREC, 1992.
- C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
- A. Tomasic, H. García-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In VLDB, 1994.
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 1999.
- Q. Zhu and W. Hsu. Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. In ACM SIGMOD Conference, June 2005.
- G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, 1949.