A Study of Practical Deduplication

ACM Transactions on Storage 7, no. 4 (2012)

Cited by: 518

Abstract

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images.

Introduction
  • File systems often contain redundant copies of information: identical files or subfile regions, possibly stored on a single host, on a shared storage cluster, or backed-up to secondary storage.
  • Deduplication systems decrease storage consumption by identifying distinct chunks of data with identical content.
  • They store a single copy of the chunk along with metadata about how to reconstruct the original files from the chunks.
  • For content-defined (Rabin) chunking, the authors set the minimum and maximum chunk sizes to 4K and 128K, respectively, and varied the expected chunk size from 8K to 64K by powers of two (a chunking sketch follows this list).
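
The chunking step described above can be illustrated with a short sketch. This is not the paper's implementation: it uses a Rabin-Karp style rolling hash as a stand-in for true Rabin fingerprinting, and the window width, base, and modulus below are assumptions; only the 4K/128K limits and the ~8K expected chunk size come from the parameters in the list.

    MIN_CHUNK = 4 * 1024    # minimum chunk size from the study
    MAX_CHUNK = 128 * 1024  # maximum chunk size from the study
    MASK = 8 * 1024 - 1     # low-bit mask giving an expected chunk size of ~8K
    WINDOW = 48             # rolling-hash window width (an assumption)
    BASE = 257              # rolling-hash base (an assumption)
    MOD = 1 << 61           # hash modulus (an assumption)

    def chunks(data: bytes):
        """Yield content-defined chunks: a boundary is declared when the low
        bits of the rolling hash are zero, never before MIN_CHUNK bytes or
        after MAX_CHUNK bytes."""
        top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
        start, h = 0, 0
        for i, b in enumerate(data):
            if i - start >= WINDOW:       # window full: drop the oldest byte
                h = (h - data[i - WINDOW] * top) % MOD
            h = (h * BASE + b) % MOD
            size = i - start + 1
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0       # restart the hash in the next chunk
        if start < len(data):
            yield data[start:]            # final partial chunk

Because boundaries depend only on nearby bytes, an insertion early in a file shifts only a few chunk boundaries, which is why content-defined chunking finds more duplicates than fixed-size blocks when file contents shift.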
Highlights
  • File systems often contain redundant copies of information: identical files or subfile regions, possibly stored on a single host, on a shared storage cluster, or backed-up to secondary storage
  • We find that while block-based deduplication of our dataset can lower storage consumption to as little as 32% of its original requirements, nearly three quarters of the improvement observed could be captured through whole-file deduplication and sparseness
  • Whole-file deduplication is an obvious alternative to block-based deduplication because it is lightweight and, as we have shown, nearly as effective at reclaiming space (a whole-file sketch follows this list)
  • We studied file system data, metadata, and layout on nearly 1000 Windows file systems in a commercial environment
  • We find that whole-file deduplication together with sparseness is a highly efficient means of lowering storage consumption, even in a backup scenario
  • The ten most popular file extensions account for less than 45% of the total files, compared with over 50% in 2000
  • The environment we studied, despite being homogeneous, shows a large diversity in file systems and file sizes
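
The whole-file alternative is lightweight because it needs only one hash per file and a map from content digest to a single stored copy. Below is a minimal accounting sketch under assumed names and an assumed SHA-256 digest; the paper's setting relied on mechanisms such as Windows Single Instance Storage [Bolosky et al. 2000] rather than code like this.

    import hashlib
    import os

    def whole_file_savings(paths):
        """Return (total_bytes, bytes_kept) when at most one copy of each
        distinct whole-file content is stored. Illustrative accounting only:
        a real system also keeps link metadata and handles writes with
        copy-on-write."""
        total = 0
        first_copy = {}  # content digest -> size of the single kept copy
        for path in paths:
            size = os.path.getsize(path)
            total += size
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)  # stream the whole file through the hash
            first_copy.setdefault(h.digest(), size)
        return total, sum(first_copy.values())

Applied to a scanned dataset, (total_bytes - bytes_kept) / total_bytes gives the whole-file savings fraction of the kind reported above.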
Methods
  • Potential participants were selected randomly from Microsoft employees. Each was contacted with an offer to install a file system scanner on their work computer(s) in exchange for a chance to win a prize.
  • The authors contacted 10,500 people in this manner to reach the target study size of about 1000 users.
  • This represents a participation rate of roughly 10%, which is lower than the 22% rate seen in similar prior studies [Agrawal et al. 2007; Douceur and Bolosky 1999].
  • Many potential participants declined explicitly because the scanning process was quite invasive.
Results
  • The standard deviation of the results was less than 2% for each of the data points, so the authors do not believe that running more trials would have yielded much more precision.
  • The most aggressive chunking algorithm (8K Rabin) reclaimed between 18% and 20% more of the total file size than did whole file deduplication.
  • The 8K fixed-block algorithm reclaimed between 10% and 11% more space than whole-file deduplication (an accounting sketch follows this list).
  • Mean utilization is 43%, only somewhat less than the 53% found in 2000.
  • The CDF shows a nearly linear relationship, with 50% of users having drives no more than 40% full; 70% at less than 60% utilization; and 90% at less than 80%.
  • The ten most popular file extensions account for less than 45% of the total files, compared with over 50% in 2000.
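
The comparisons above reduce to simple accounting: cut each file into chunks, hash every chunk, keep one copy per distinct digest, and compare the bytes kept across chunking policies. A minimal sketch for the fixed-block case follows (names are hypothetical, and metadata overhead is ignored); the whole-file and Rabin variants differ only in how the byte stream is cut.

    import hashlib

    BLOCK = 8 * 1024  # 8K fixed blocks, matching the comparison above

    def fraction_reclaimed(files):
        """Fraction of total bytes reclaimed by fixed-block deduplication,
        where `files` maps a file name to its content bytes."""
        total = 0
        kept = {}  # block digest -> size of the single kept copy
        for data in files.values():
            total += len(data)
            for off in range(0, len(data), BLOCK):
                block = data[off:off + BLOCK]
                kept.setdefault(hashlib.sha256(block).digest(), len(block))
        return 1 - sum(kept.values()) / total if total else 0.0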
Conclusion
  • The authors studied file system data, metadata, and layout on nearly 1000 Windows file systems in a commercial environment.
  • The authors find that whole-file deduplication together with sparseness is a highly efficient means of lowering storage consumption, even in a backup scenario.
  • It approaches the effectiveness of conventional deduplication at a much lower cost in performance and complexity.
  • The environment the authors studied, despite being homogeneous, shows a large diversity in file systems and file sizes.
  • At least one problem, that of file fragmentation, appears to be solved, provided that a machine has periods of inactivity in which defragmentation can be run.
Tables
  • Table 1: Whole File Duplicates by Extension
  • Table 2: Non-Whole-File, Nonzero Duplicate Data as a Fraction of File System Size by File Extension (8K Chunks)
Related work
  • Studies of live deployed system behavior and usage have long been a key component of storage systems research. Workload studies [Vogels 1999] are helpful in determining what file systems do in a given slice of time, but provide little guidance as to the long-term contents of files or file systems. Prior file system content studies [Agrawal et al. 2007; Douceur and Bolosky 1999] have considered collections of machines similar to those observed here. The most recent such study uses 7-year-old data, while data from the study before it is 11 years old, which we believe justifies the file system portion of this work. However, this research also captures relevant results that the previous work does not.

    Policroniades and Pratt [2004] studied duplication rates using various chunking strategies on a dataset about 0.1% of the size of ours, finding little whole-file duplication and a modest difference between fixed-block and content-based chunking. Kulkarni et al. [2004] found combining compression, eliminating duplicate identical-sized chunks, and delta-encoding across multiple datasets to be effective. Their corpus was about 8 GB.
Funding
  • We thank the hundreds of Microsoft employees who were willing to allow us to install software that read the entire contents of their disks; Richard Draves for helping us with the Microsoft bureaucracy; Microsoft as a whole for funding and enabling this kind of research; our program committee shepherd, Keith Smith, and the anonymous reviewers for their guidance as well as detailed and helpful comments; and Fred Douglis for some truly last-minute comments and proofreading.
References
  • AGRAWAL, N., BOLOSKY, W., DOUCEUR, J., AND LORCH, J. 2007. A five-year study of file-system metadata. In Proceedings of the 5th USENIX Conference on File and Storage Technologies.
  • BACKUPREAD. 2010. Microsoft Corp. BackupRead function. MSDN. http://msdn.microsoft.com/en-us/library/aa362509(VS.85).aspx
  • BHADKAMKAR, M., GUERRA, J., USECHE, L., BURNETT, S., LIPTAK, J., RANGASWAMI, R., AND HRISTIDIS, V. 2009. Borg: Block-reorganization for self-optimizing storage systems. In Proceedings of the 7th USENIX Conference on File and Storage Technologies.
  • BHAGWAT, D., ESHGHI, K., LONG, D., AND LILLIBRIDGE, M. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. IEEE, Los Alamitos, CA.
  • BLOOM, B. 1970. Space/time trade-offs in hash coding with allowable errors. Comm. ACM 13, 7, 422–426.
  • BOLOSKY, W., CORBIN, S., GOEBEL, D., AND DOUCEUR, J. 2000. Single instance storage in Windows 2000. In Proceedings of the 4th USENIX Windows Systems Symposium.
  • CLEMENTS, A., AHMAD, I., VILAYANNUR, M., AND LI, J. 2009. Decentralized deduplication in SAN cluster file systems. In Proceedings of the USENIX Annual Technical Conference.
  • DONG, W., DOUGLIS, F., LI, K., PATTERSON, H., REDDY, S., AND SHILANE, P. 2011. Tradeoffs in scalable data routing for deduplication clusters. In Proceedings of the 9th USENIX Conference on File and Storage Technologies.
  • QUINLAN, S. AND DORWARD, S. 2002. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies.
  • DOUCEUR, J. AND BOLOSKY, W. 1999. A large-scale study of file-system contents. In Proceedings of the ACM SIGMETRICS Conference.
  • MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., AND VIVIER, L. 2007. The new ext4 filesystem: Current status and future plans. In Proceedings of the Linux Symposium.
  • MS ATIME. 2010. Microsoft Corp. Disabling last access time in Windows Vista to improve NTFS performance. The Storage Team Blog. http://blogs.technet.com/b/filecab/archive/2006/11/07/disabling-last-access-time-in-windows-vista-to-improve-ntfs-performance.aspx.
  • MS FILESYSTEM. 2010. Microsoft Corp. File systems. Microsoft TechNet. http://technet.microsoft.com/en-us/library/cc938929.aspx.
  • VSS. 2010. Microsoft Corp. Volume Shadow Copy Service. MSDN. http://msdn.microsoft.com/en-us/library/bb968832(VS.85).aspx.
  • MILLER, D. R. 2009. Storage economics: Four principles for reducing total cost of ownership. Hitachi Corporate Web Site. http://www.hds.com/assets/pdf/four-principles-for-reducing-total-cost-of-ownership.pdf.
  • MURPHY, N. AND SELTZER, M. 2009. Hierarchical file systems are dead. In Proceedings of the 12th Workshop on Hot Topics in Operating Systems.
  • NAGAR, R. 1997. Windows NT File System Internals. O’Reilly.
  • POLICRONIADES, C. AND PRATT, I. 2004. Alternatives for detecting redundancy in storage systems. In Proceedings of the USENIX Annual Technical Conference.
  • RABIN, M. 1981. Fingerprinting by random polynomials. Tech. rep. TR-CSE-03-01. Harvard University Center for Research in Computing Technology.
  • RIVEST, R. 1992. The MD5 message-digest algorithm. http://tools.ietf.org/rfc/rfc1321.txt.
  • SATYANARAYANAN, M. 1981. A study of file sizes and functional lifetimes. In Proceedings of the 8th ACM Symposium on Operating Systems Principles.
  • SCHEDULED TASKS. 2010. Microsoft Corp. Description of the scheduled tasks in Windows Vista. Microsoft Support. http://support.microsoft.com/kb/939039.
  • SELTZER, M. AND SMITH, K. 1997. File system aging: Increasing the relevance of file system benchmarks. In Proceedings of the 1997 ACM SIGMETRICS. ACM, New York.
  • SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. 1996. Scalability in the XFS file system. In Proceedings of the USENIX Annual Technical Conference.
  • VOGELS, W. 1999. File system usage in Windows NT 4.0. In Proceedings of the 17th ACM Symposium on Operating Systems Principles. ACM, New York.
  • UNGUREANU, C., ATKIN, B., ARANYA, A., GOKHALE, S., RAGO, S., CALKOWSKI, G., DUBNICKI, C., AND BOHRA, A. 2010. HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system. In Proceedings of the 8th USENIX Conference on File and Storage Technologies.
  • KRUUS, E., UNGUREANU, C., AND DUBNICKI, C. 2010. Bimodal content defined chunking for backup streams. In Proceedings of the 8th USENIX Conference on File and Storage Technologies.
  • ZHU, B., LI, K., AND PATTERSON, H. 2008. Avoiding the disk bottleneck in the Data Domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, 1–14.
Received September 2011; accepted September 2011