Trustworthy keyword search for regulatory-compliant records retention

VLDB, pp. 1001-1012 (2006)


Abstract

Recent litigation and intense regulatory focus on secure retention of electronic records have spurred a rush to introduce Write-Once-Read-Many (WORM) storage devices for retaining business records such as electronic mail. However, simply storing records in WORM storage is insufficient to ensure that the records are trustworthy, i.e., able ...

Introduction
  • Documents such as electronic mail, financial statements, meeting memos, drug development logs, and quality assurance documents are valuable assets.
  • Key decisions in business operations and other critical activities are based on information in these documents, so they must be maintained in a trustworthy fashion: safe from improper destruction or modification, and readily accessible.
  • Businesses increasingly store these documents electronically, making them relatively easy to delete and modify without leaving much of a trace.
  • The US alone has over 10,000 regulations that mandate how records should be managed.
  • Many of those focus on ensuring that records are trustworthy (e.g., Securities and Exchange Commission (SEC) Rule 17a-4 and the Sarbanes-Oxley Act).
Highlights
  • Through extensive simulations and experiments with an IBM intranet search engine, we demonstrate that the scheme achieves online update speed while maintaining good query performance
  • We present and evaluate jump indexes, a novel trustworthy and efficient index for join operations on posting lists for multi-keyword queries
  • We have presented a threat model for trustworthy recordkeeping for legislative compliance, and identified key requirements for trustworthy indexes for this environment. One such requirement is the need for immediate indexing of newly inserted documents, which renders inapplicable the traditional approach of logging new posting list entries and periodically rebuilding the posting lists from scratch
  • We proposed a scheme based on judicious merging of posting lists and the optional use of jump indexes for faster conjunctive query processing
  • Through extensive simulations and experiments with an IBM intranet search engine and a workload of IBM intranet queries and documents, we demonstrated that merged posting lists and jump indexes offer excellent performance
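The highlights above refer to join operations on posting lists for multi-keyword (conjunctive) queries. As a minimal sketch of the underlying idea, and not the paper's jump-index algorithm, a conjunctive query can be answered by intersecting posting lists that are sorted by document ID. The function names `intersect_postings` and `conjunctive_query` and the sample document IDs are hypothetical, chosen only for illustration.

```python
from functools import reduce
from typing import List


def intersect_postings(a: List[int], b: List[int]) -> List[int]:
    """Merge-join two posting lists sorted by ascending document ID."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # document contains both keywords
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return out


def conjunctive_query(posting_lists: List[List[int]]) -> List[int]:
    """Return the IDs of documents that appear in every posting list."""
    if not posting_lists:
        return []
    # Starting with the shortest list keeps intermediate results small.
    ordered = sorted(posting_lists, key=len)
    return reduce(intersect_postings, ordered)


if __name__ == "__main__":
    print(conjunctive_query([[1, 3, 5, 9], [3, 4, 9, 10], [9]]))
```

The linear scan in `intersect_postings` visits every entry of both lists. An index that lets the scan skip ahead to the next candidate document ID, which is the role the source assigns to jump indexes, would reduce that work for lists of very different lengths.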
Methods
  • Earlier indexing methods for WORM storage (e.g., [1, 7, 20, 24]) were motivated by the desire to store data on optical disks, which once had an advantage over magnetic disks in terms of cost and storage capacity.
  • These methods were designed for minimizing storage overhead and maximizing performance, and do not provide trustworthy recordkeeping.
  • If there is a cache miss, the least recently used cache block is written out, and the needed block is read
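The cache behaviour described in the last bullet can be sketched as a small least-recently-used (LRU) block cache. This is an illustrative sketch under stated assumptions: the class name `LRUBlockCache`, the dictionary standing in for the storage device, and the miss counter are all hypothetical and are not taken from the paper.

```python
from collections import OrderedDict
from typing import Dict, List


class LRUBlockCache:
    """Fixed-capacity cache of blocks with least-recently-used eviction."""

    def __init__(self, capacity: int, storage: Dict[str, List[int]]):
        self.capacity = capacity
        self.storage = storage  # stand-in for the storage device (assumption)
        self.blocks: "OrderedDict[str, List[int]]" = OrderedDict()
        self.misses = 0

    def get(self, block_id: str) -> List[int]:
        if block_id in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: write out the least recently used block if full,
        # then read the needed block.
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            victim_id, victim = self.blocks.popitem(last=False)
            self.storage[victim_id] = victim
        block = list(self.storage.get(block_id, []))
        self.blocks[block_id] = block
        return block


if __name__ == "__main__":
    disk: Dict[str, List[int]] = {}
    cache = LRUBlockCache(capacity=2, storage=disk)
    cache.get("a").append(1)  # miss; block "a" is modified in the cache
    cache.get("b")            # miss
    cache.get("c")            # miss; evicts "a" and writes it to disk
    print(disk, cache.misses)
```

Under this model, each miss on a full cache costs one write and one read, which is why the number of misses is the quantity worth counting when comparing index layouts.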
Results
  • The key jump index parameters are L, p, and B.
  • At the end of each posting list block, the authors leave space to store jump pointers.
  • The number of pointers per index block, (B − 1) · log_B(N), depends on N, the largest document ID expected to be indexed.
  • To store more than N documents, the authors can chain additional blocks of jump pointers off the end of the posting list block.
  • Figure 8(a) shows the space overhead of a jump index, computed as the ratio of the space allocated for pointers to the space occupied by actual posting elements.
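To make the pointer-count expression (B − 1) · log_B(N) concrete, the following sketch evaluates it for a few branching factors. It assumes that log_B(N) is rounded up to a whole number of levels, which the summary does not state; the function names are hypothetical.

```python
def levels_needed(branching: int, max_doc_id: int) -> int:
    """Smallest k such that branching**k >= max_doc_id (integer arithmetic)."""
    if branching < 2:
        raise ValueError("branching factor B must be at least 2")
    levels, reach = 0, 1
    while reach < max_doc_id:
        reach *= branching
        levels += 1
    return levels


def jump_pointers_per_block(branching: int, max_doc_id: int) -> int:
    """Evaluate (B - 1) * log_B(N), with the logarithm rounded up (assumption)."""
    return (branching - 1) * levels_needed(branching, max_doc_id)


if __name__ == "__main__":
    n = 2 ** 32  # example upper bound on document IDs (assumption)
    for b in (2, 4, 32):
        print(b, jump_pointers_per_block(b, n))
```

With N = 2^32, this gives 32 pointers for B = 2, 48 for B = 4, and 217 for B = 32: a larger branching factor needs fewer levels but more pointers per level, so the reserved pointer space per block grows.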
Conclusion
  • The authors have presented a threat model for trustworthy recordkeeping for legislative compliance, and identified key requirements for trustworthy indexes for this environment
  • One such requirement is the need for immediate indexing of newly inserted documents, which renders inapplicable the traditional approach of logging new posting list entries and periodically rebuilding the posting lists from scratch.
  • For a disjunctive keyword query workload, merged posting lists without jump indexes are only 14% slower than the baseline approach, and merged posting lists with jump indexes (32-way branching) are only 26% slower than the baseline approach (due to the
Related work
  • Search engines typically use inverted indexes to support keyword search [9]. As shown in Figure 1(a), an inverted index comprises a dictionary of keywords and associated posting lists of document identifiers (with additional metadata such as keyword frequency, type, position) for each keyword.

    In a trustworthy index, the posting list entries for a document must be durable, and the path to each entry must also be durable. This can be achieved by keeping each posting list in an append-only file in WORM storage. The index can be updated when a new document is added, by appending its document ID to the posting lists of all the keywords it contains. Unfortunately, this operation can be prohibitively slow, as each file append will require a random I/O. For example, in the data set used in our experiments, each document contains almost 500 keywords on average. If each append incurs a 2 msec random I/O, it would take 1 second to index a document.
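The append-only update and its cost can be sketched as follows. This is a minimal in-memory model, not the paper's implementation: the class `AppendOnlyIndex`, the whitespace tokenizer, and the helper `indexing_seconds` are hypothetical, and each list append stands in for one file append on WORM storage.

```python
from collections import defaultdict
from typing import Dict, List


class AppendOnlyIndex:
    """Inverted index whose posting lists only ever grow by appends."""

    def __init__(self) -> None:
        self.postings: Dict[str, List[int]] = defaultdict(list)
        self.last_doc_id = 0
        self.appends = 0  # one append models one random I/O

    def add_document(self, doc_id: int, text: str) -> None:
        if doc_id <= self.last_doc_id:
            raise ValueError("document IDs must be strictly increasing")
        self.last_doc_id = doc_id
        # Append the document ID to the posting list of each distinct keyword.
        for keyword in set(text.lower().split()):
            self.postings[keyword].append(doc_id)
            self.appends += 1

    def lookup(self, keyword: str) -> List[int]:
        return list(self.postings.get(keyword.lower(), []))


def indexing_seconds(keywords_per_doc: int, io_msec: float) -> float:
    """Time to index one document if every keyword costs one random I/O."""
    return keywords_per_doc * io_msec / 1000.0


if __name__ == "__main__":
    index = AppendOnlyIndex()
    index.add_document(1, "budget memo")
    index.add_document(2, "memo draft")
    print(index.lookup("memo"), index.appends)
    print(indexing_seconds(500, 2))  # the figures quoted in the text
```

Plugging in the figures from the text (about 500 keywords per document, 2 msec per random I/O) reproduces the 1 second per document estimate, which is the cost that motivates merging posting lists.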
Funding
  • This research was partially supported by an IBM internship. This research was supported by NSF under grants IIS-0331707, CNS-0325951, and CNS-0524695.
References
  • B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An Asymptotically Optimal Multiversion B-tree. VLDB Journal, 5:264–275, 1996.
  • K. Blibech and A. Gabillon. Chronos: an authenticated dictionary based on skip lists for timestamping systems. In Workshop on Secure Web Services, 2005.
  • E. Brown, J. Callan, and W. Croft. Fast incremental indexing for full-text information retrieval. In VLDB, 1994.
  • E. W. Brown, J. P. Callan, W. B. Croft, and J. E. B. Moss. Supporting full-text information retrieval with a persistent object store. In EDBT, 1994.
  • P. Crescenzi and V. Kann. A compendium of NP optimization problems. Available at http://www.nada.kth.se/.
  • D. Cutting and J. Pedersen. Optimization for dynamic inverted index maintenance. In SIGIR, 1990.
  • M. C. Easton. Key-Sequence Data Sets on Indelible Storage. IBM J. Research & Development, May 1986.
  • EMC Corp. EMC Centera Content Addressed Storage System, 2003. www.emc.com/products/ systems/centera ce.jsp.
  • C. Faloutsos. Access methods for text. ACM Computing Surveys, vol. 17, pp. 49-74, 1985.
  • C. Faloutsos and H. V. Jagadish. On B-tree indices for skewed distributions. In VLDB, 1992.
  • M. F. Fontoura, A. Neumann, S. Rajagopalan, E. Shekita, and J. Zien. High performance index build algorithms for intranet search engines. In VLDB, 2004.
  • E. Goh, H. Shacham, N. Modadugu, and D. Boneh. Sirius: Securing remote untrusted storage. In NDSS, 2003.
  • M. Goodrich, R. Tamassia, and A. Schwerin. Implementation of an authenticated dictionary with skip lists and commutative hashing. In DISCEX II, 2001.
  • H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice-Hall, 2000.
  • H. Hacigumus, B. R. Iyer, and S. Mehrotra. Providing database as a service. In ICDE, 2002.
  • S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. J. Am. Soc. for Info. Sci. & Tech., 54:8, Jun. 2003.
  • L. Huang, W. Hsu, and F. Zheng. CIS: Content Immutable Storage for Trustworthy Record Keeping. In NASA MSST, 2006.
  • IBM Corp. IBM TotalStorage DR550, 2004. Available at http://www-1.ibm.com/servers/storage/disk/dr.
  • B. Klimt and Y. Yang. Introducing the Enron Corpus. In CEAS, 2004.
  • T. Krijnen and L. G. L. T. Meertens. Making B-Trees Work for B. IW 219/83. The Mathematical Centre, Amsterdam, 1983.
  • N. Lester, J. Zobel, and H. E. Williams. In-place versus re-build versus re-merge: index maintenance strategies for text retrieval systems. In Conf. on Australasian Computer Science, 2004.
  • E. L. Miller, W. E. Freeman, D. D. E. Long, and B. C. Reed. Strong security for network-attached storage. In FAST, 2002.
  • Network Appliance, Inc. SnapLock™ Compliance and SnapLock Enterprise Software, 2003. Available at http://www.netapp.com/products/filer/snaplock.html.
  • P. Rathmann. Dynamic Data Structures on Optical Disks. In ICDE, 1984.
  • S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. TREC, 1992.
  • C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
  • A. Tomasic, H. García-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In VLDB, 1994.
  • I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 1999.
  • Q. Zhu and W. Hsu. Fossilized Index: The Linchpin of Trustworthy Non-Alterable Electronic Records. In ACM SIGMOD Conference, June 2005.
  • G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, 1949.