Successful Scalability Techniques for Illinois Web Archive Search

msra(2007)

Abstract
The Capturing Electronic Publications (CEP) web archive, assembled since 2002 by the Electronic Archive Project group of the Graduate School of Library and Information Science (GSLIS) at the University of Illinois at Urbana-Champaign (UIUC) for the Illinois State Library (ISL), currently contains over 37 million files and is growing by over 900,000 files per month. For ISL to use this collection effectively in identifying, selecting, and migrating specific documents to permanent storage, some form of search mechanism had to be provided. However, the file inventory far exceeded the capacity of open-source or freeware search tools. Detecting the files that had not changed between harvests allowed search surrogate generation to be suppressed for those files. With that substantial reduction in search surrogate count accomplished, the existing provision in the SWISH-E open-source search engine for using multiple search databases sequentially did not impose noticeable delays on search engine users. Combined, these approaches enable SWISH-E search across the entire collection, despite an assumed initial design limit of one million files.
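The abstract does not say how unchanged files were detected between harvests. As a minimal sketch of one plausible approach, the Python below compares content digests from the previous harvest with the current one and returns only the files that would still need a search surrogate. The use of SHA-256 and the names `content_digest` and `files_needing_surrogates` are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a hex SHA-256 digest of a file's bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def files_needing_surrogates(previous: dict[str, str],
                             current: dict[str, bytes]) -> list[str]:
    """Return the keys of files that are new or changed since the previous harvest.

    previous maps a file identifier (for example a URL) to the digest recorded
    at the last harvest; current maps identifiers to the bytes just harvested.
    Files whose digest is unchanged are skipped, so no new surrogate is built.
    """
    changed = []
    for key, data in current.items():
        if previous.get(key) != content_digest(data):
            changed.append(key)
    return sorted(changed)


# Hypothetical example data: "a" is unchanged, "b" changed, "c" is new.
previous_harvest = {
    "a": content_digest(b"report v1"),
    "b": content_digest(b"minutes v1"),
}
current_harvest = {
    "a": b"report v1",
    "b": b"minutes v2",
    "c": b"new brochure",
}
to_index = files_needing_surrogates(previous_harvest, current_harvest)
```

In this sketch only `b` and `c` would be passed on for surrogate generation. The reduction in index size comes entirely from skipping the unchanged files.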