Continuous profiling: where have all the cycles gone?

Special Interest Group on Operating Systems, no. 4 (1997): 1-14


Abstract

This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system ke...

Introduction
  • The performance of programs running on modern high-performance computer systems is often hard to understand.
  • When a single program or an entire system does not perform as well as desired or expected, it can be difficult to pinpoint the reasons.
  • The system consists of two parts, each with novel features: (1) a data collection subsystem that samples program counters and records them in an on-disk database and (2) a suite of analysis tools that analyze the stored profile information at several levels, from the fraction of CPU time consumed by each program to the number of stall cycles for each individual instruction.
  • The information produced by the analysis tools guides users to time-critical sections of code and explains in detail the static and dynamic delays incurred by each instruction.
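As a rough illustration of the data collection idea described above, the sketch below aggregates program-counter samples into per-location counts, the kind of record a sampling profiler could later store on disk. The names `Sample` and `aggregate_samples`, the (image, offset, event) key, and all sample values are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    """One hypothetical program-counter sample."""
    image: str   # executable, shared library, or kernel image containing the PC
    offset: int  # offset of the sampled instruction within that image
    event: str   # event that triggered the sample, e.g. "cycles"


def aggregate_samples(samples: Iterable[Sample]) -> Counter:
    """Count how many times each (image, offset, event) triple was sampled."""
    counts: Counter = Counter()
    for s in samples:
        counts[(s.image, s.offset, s.event)] += 1
    return counts


# Made-up samples, for illustration only.
samples = [
    Sample("/vmunix", 0x1000, "cycles"),
    Sample("/usr/bin/app", 0x20, "cycles"),
    Sample("/vmunix", 0x1000, "cycles"),
    Sample("/usr/bin/app", 0x24, "imiss"),
]
profile = aggregate_samples(samples)
```

Keeping only counts per location, rather than a log of every sample, is one plausible way to keep the stored profile small; whether the real system does exactly this is not stated in this summary.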
Highlights
  • We evaluated our profiling system’s performance under three different configurations: cycles, in which the system monitors only cycles; default, in which the system monitors both cycles and instruction cache misses; and mux, in which the system monitors cycles with one performance counter and uses multiplexing to monitor instruction cache misses, data cache misses, and branch mispredictions with another counter
  • We summarize the cycles spent in the procedure, showing how many have gone to I-cache misses, how many to D-cache misses, etc., by aggregating instruction-level data
  • High-rate sampling reduces the amount of time a user must gather profiles before using analysis tools
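The aggregation of instruction-level data into a per-procedure summary can be sketched as follows. This is a minimal sketch, assuming each instruction record already carries an estimated cycle count and a stall cause; the record layout, the function name `summarize_by_procedure`, and the numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record: (procedure name, stall cause, estimated cycles).
InstructionRecord = Tuple[str, str, float]


def summarize_by_procedure(
    records: List[InstructionRecord],
) -> Dict[str, Dict[str, float]]:
    """Sum estimated cycles per stall cause within each procedure."""
    summary: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for procedure, cause, cycles in records:
        summary[procedure][cause] += cycles
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {proc: dict(causes) for proc, causes in summary.items()}


# Made-up instruction-level data, for illustration only.
records = [
    ("copy_loop", "D-cache miss", 30.0),
    ("copy_loop", "I-cache miss", 5.0),
    ("copy_loop", "D-cache miss", 10.0),
    ("init", "execution", 8.0),
]
summary = summarize_by_procedure(records)
```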
Methods
  • The dcpiprof tool provides a high-level view of the performance of a workload.
  • It reads the profile data gathered by the system and displays a listing of the number of samples per procedure, sorted by decreasing number of samples.
  • The ffb8ZeroPolyArc routine accounts for 33.87% of the cycles for this workload.
  • Notice that this profile includes code in the kernel (/vmunix) as well as code in shared libraries.
  • The figure has columns for the cumulative percent of cycle samples consumed by the procedure and all those preceding it in the listing, as well as information about the total number and fraction of instruction cache miss samples that occurred in each procedure.
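The kind of listing described above (samples per procedure, sorted by decreasing count, with a cumulative percentage column) can be sketched in a few lines. This is not dcpiprof's implementation or output format; the function name `procedure_listing`, the procedure names, and the counts are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def procedure_listing(
    samples_per_procedure: Counter,
) -> List[Tuple[str, int, float, float]]:
    """Return (procedure, samples, percent, cumulative percent) rows,
    sorted by decreasing number of samples."""
    total = sum(samples_per_procedure.values())
    rows = []
    cumulative = 0.0
    for name, count in samples_per_procedure.most_common():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((name, count, percent, cumulative))
    return rows


# Hypothetical sample counts, for illustration only.
counts = Counter({"draw_arc": 50, "memcpy": 30, "sched": 20})
rows = procedure_listing(counts)
for name, count, percent, cumulative in rows:
    print(f"{count:8d} {percent:6.2f}% {cumulative:6.2f}%  {name}")
```

The cumulative column makes it easy to see how few procedures account for most of the samples.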
Results
  • The authors' system has been used to analyze and improve the performance of a wide range of complex commercial applications, including graphics systems, databases, industry benchmark suites, and compilers.
  • The interrupt handler has to be fast; for example, if the interrupt handler takes 1000 cycles, it will consume more than 1.5% of the CPU.
  • Most samples whose estimates are off by more than 15% are marked low confidence.
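The interrupt-handler figure above follows from dividing the handler cost by the number of cycles between interrupts. The sketch below works through that arithmetic; the sampling period of 65,536 cycles is an assumption chosen for illustration, since this summary does not state the period, and `handler_overhead` is a hypothetical helper name.

```python
def handler_overhead(handler_cycles: int, sampling_period_cycles: int) -> float:
    """Fraction of CPU cycles spent in the interrupt handler."""
    return handler_cycles / sampling_period_cycles


# ASSUMPTION: one sampling interrupt every 65,536 cycles (not stated in this summary).
overhead = handler_overhead(1000, 65_536)
print(f"{overhead:.2%}")  # prints "1.53%"
```

Under that assumed period, a 1000-cycle handler costs roughly 1.5% of the CPU, which is consistent with the claim that the handler has to be fast.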
Conclusion
  • The Digital Continuous Profiling Infrastructure transparently collects complete, detailed profiles of entire systems.
  • The authors' system demonstrates that it is possible to collect profile samples at a high rate and with low overhead.
  • High-rate sampling reduces the amount of time a user must gather profiles before using analysis tools.
  • This is especially important when using tools that require samples at the granularity of individual instructions rather than just basic blocks or procedures.
  • Low overhead is important because it reduces the amount of time required to gather samples and improves the accuracy of the samples by minimizing the perturbation of the profiled code.
Tables
  • Table 1: Profiling Systems
  • Table 2: Overall Slowdown for Multiprocessor Workloads (in percent)
  • Table 3: Description of Uniprocessor Workloads
  • Table 4: Overall Slowdown for Uniprocessor Workloads (in percent)
  • Table 5: Time Overhead Components
  • Table 6: Description of Multiprocessor Workloads
  • Table 7: Daemon Space Overhead
Related work
  • Few other profiling systems can monitor complete system activity with high-frequency sampling and low overhead; only ours and Morph [Zhang et al. 1997] are designed to run continuously for long periods on production systems, something that is essential for obtaining useful profiles of large complex applications such as databases. In addition, we know of no other system that can analyze time-biased samples to produce accurate fine-grained information about the number of cycles taken by each instruction and the reasons for stalls; the only other tools that can produce similar information use simulators, at much higher cost.

    Table I compares several profiling systems. The overhead column describes how much profiling slows down the target program; low overhead is defined arbitrarily as less than 20%. The scope column shows whether the profiling system is restricted to a single application (app) or can measure full system activity (sys). The grain column indicates the range over which an individual measurement applies. For example, gprof counts procedure executions, whereas pixie can count executions of each instruction; prof
References
  • ANDERSON, T. E. AND LAZOWSKA, E. D. 1990. Quartz: A tool for tuning parallel program performance. In Proceedings of the ACM SIGMETRICS 1990 Conference on Measurement and Modeling of Computer Systems. ACM, New York, 115–125.
  • BALL, T. AND LARUS, J. 1994. Optimally profiling and tracing programs. ACM Trans. Program. Lang. Syst. 16, 4 (July), 1319–1360.
  • BLICKSTEIN, D., CRAIG, P., DAVIDSON, C., FAIMAN, R., GLOSSOP, K., GROVE, R., HOBBS, S., AND NOYCE, W. 1992. The GEM optimizing compiler system. Digital Tech. J. 4, 4.
  • CARTA, D. 1990. Two fast implementations of the “minimal standard” random number generator. Commun. ACM 33, 1 (Jan.), 87–88.
  • COHN, R. AND LOWNEY, P. G. 1996. Hot cold optimization of large Windows/NT applications. In 29th Annual International Symposium on Microarchitecture (Micro-29) (Paris, France, Dec.).
  • COHN, R., GOODWIN, D., LOWNEY, P. G., AND RUBIN, N. 1997. Spike: An optimizer for Alpha/NT executables. In USENIX Windows NT Workshop. USENIX Assoc., Berkeley, Calif.
  • DIGITAL. 1995a. Alpha 21164 microprocessor hardware reference manual. Digital Equipment Corp., Maynard, Mass.
  • DIGITAL. 1995b. DECchip 21064 and DECchip 21064A Alpha AXP microprocessors hardware reference manual. Digital Equipment Corp., Maynard, Mass.
  • GOLDBERG, A. J. AND HENNESSY, J. L. 1993. MTOOL: An integrated system for performance debugging shared memory multiprocessor applications. IEEE Trans. Parallel Distrib. Syst. 28–40.
  • GRAHAM, S., KESSLER, P., AND MCKUSICK, M. 1982. gprof: A call graph execution profiler. SIGPLAN Not. 17, 6 (June), 120–126.
  • HALL, M., ANDERSON, J., AMARASINGHE, S., MURPHY, B., LIAO, S.-W., BUGNION, E., AND LAM, M. 1996. Maximizing multiprocessor performance with the SUIF compiler. IEEE Comput. 29, 12 (Dec.), 84–89.
  • JOHNSON, R., PEARSON, D., AND PINGALI, K. 1994. The program structure tree: Computing control regions in linear time. In Proceedings of the ACM SIGPLAN ’94 Conference on Programming Language Design and Implementation. ACM, New York, 171–185.
  • MCCALPIN, J. D. 1995. Memory bandwidth and machine balance in high performance computers. IEEE Tech. Comm. Comput. Arch. Newslett. See also http://www.cs.virginia.edu/stream.
  • MIPS. 1990. UMIPS-V reference manual (pixie and pixstats). MIPS Computer Systems, Sunnyvale, Calif.
  • REISER, J. F. AND SKUDLAREK, J. P. 1994. Program profiling problems, and a solution via machine language rewriting. SIGPLAN Not. 29, 1 (Jan.), 37–45.
  • ROSENBLUM, M., HERROD, S., WITCHEL, E., AND GUPTA, A. 1995. Complete computer simulation: The SimOS approach. IEEE Parallel Distrib. Tech. 3, 3 (Fall).
  • SITES, R. AND WITEK, R. 1995. Alpha AXP architecture reference manual. Digital Press, Newton, Mass.
  • ZAGHA, M., LARSON, B., TURNER, S., AND ITZKOWITZ, M. 1996. Performance analysis using the MIPS R10000 performance counters. In Proceedings of Supercomputing.
  • ZHANG, X., WANG, Z., GLOY, N., CHEN, J. B., AND SMITH, M. D. 1997. Operating system support for automated profiling and optimization. In Proceedings of the 16th ACM Symposium on Operating Systems Principles. ACM, New York.

Jennifer M. Anderson et al. Received July 1997; revised September 1997; accepted September 1997.