Fault-tolerant typed assembly language

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, no. 6 (2007): 42-53

被引用58|浏览160
EI
下载 PDF 全文
引用
微博一下

摘要

A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in th...更多

代码

数据

0
简介
  • A transient fault or soft error is a temporary hardware failure that alters a signal transfer, a register value, or some other processor component.
  • While transient faults are temporary, they corrupt computations and have led to costly failures in high-end systems in recent years.
  • To republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
  • Copyright c 2007 ACM 978-1-59593-633-2/07/0006. . . $5.00
重点内容
  • A transient fault or soft error is a temporary hardware failure that alters a signal transfer, a register value, or some other processor component
  • In order to prove properties of our type system, we extend our soifnngletr-asntespititorannsscitoinotnaiΣni1ng−e→xaskctΣly2kfrfoamultsSeΣc1ti−o−nn→2 sktoΣa2 sequence, where n is greater than or equal to zero, and k is still either 0 or 1
  • A machine state that is well-typed under the empty zap tag can take a non-faulty step to another ordinary, non-faulty machine state
  • The type system acts as a sound proof technique for verifying reliability properties of programs
  • Our two main formal results show that a single fault affecting observable behavior in a welltype program will always be detected, and that the system will not claim to have detected a fault when none has occurred
  • Despite the fact that well-typed programs essentially duplicate all computation, we provide simulation results showing a performance overhead of 1.34x
结果
  • Progress states that well-typed states can take a step.
  • A machine state that is well-typed under the empty zap tag can take a non-faulty step to another ordinary, non-faulty machine state.
  • A machine state that is well-typed under a zap tag of color c can take a step, but the result of that step may either be another ordinary machine state or the fault state.
  • If Σ Σ −→s0 Σ and Σ = fault.
结论
  • Transient faults are already a significant cause for concern at major semiconductor manufacturers and threaten to be more so in the coming years and decades.
  • This paper takes one step forward for the science of fault tolerance by presenting a principled and practical hybrid software-hardware scheme for detecting transient faults.
  • The authors identify four general principles for verifying correctness of fault tolerant systems and capture these in an assembly language type system.
  • The type system acts as a sound proof technique for verifying reliability properties of programs.
  • It can be used as a debugging aid within a compiler, strictly dominating any conventional testing technique.
  • Despite the fact that well-typed programs essentially duplicate all computation, the authors provide simulation results showing a performance overhead of 1.34x
相关工作
  • Fault tolerance based on software replication is a well-populated field with decades of history. TALFT differs from previous approaches in that it provides a type-theoretic framework for obtaining strong guarantees about the reliability of machine code.

    Most closely related to TALFT is our previous work on λzap, a highly abstract type-theoretic model for studying the basic principles of fault tolerance in the lambda calculus [26]. There are two important distinctions between TALFT and λzap. First, λzap, working at the level of the lambda calculus, is very far removed from real machine code. For instance, it lacks a program counter, a register file, memory, and load or store instructions. Memory references in particular constitute a key challenge in the current technical work. Second, the properties of the λzap type system are relatively weak compared with the properties of the current type system. The “end-to-end” fault tolerance property proven for λzap depends not only on the type system but also the nature of the translation from the ordinary simply-typed lambda calculus. In contrast, the type system of TALFT is much stronger, capable of ensuring a strong fault tolerance property independently of the process that compiles the code.
基金
  • This research is funded in part by NSF awards CNS-0627650, CNS-0615250, and CCF 0633268
引用论文
  • R. C. Baumann. Soft errors in advanced semiconductor devices-part I: the three radiation sources. IEEE Transactions on Device and Materials Reliability, 1(1):17–22, March 2001.
    Google ScholarLocate open access versionFindings
  • R. C. Baumann. Soft errors in commercial semiconductor technology: Overview and scaling trends. In IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pages 121 01.1 – 121 01.14, April 2002.
    Google ScholarLocate open access versionFindings
  • S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. In IEEE Micro, volume 25, pages 10–16, December 2005.
    Google ScholarLocate open access versionFindings
  • M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th annual international symposium on Computer architecture, pages 98–109. ACM Press, 2003.
    Google ScholarLocate open access versionFindings
  • R. W. Horst, R. L. Harris, and R. L. Jardine. Multiple instruction issue in the NonStop Cyclone processor. In Proceedings of the 17th International Symposium on Computer Architecture, pages 216–226, May 1990.
    Google ScholarLocate open access versionFindings
  • A. Mahmood and E. J. McCluskey. Concurrent error detection using watchdog processors-a survey. IEEE Transactions on Computers, 37(2):160–174, 1988.
    Google ScholarLocate open access versionFindings
  • S. E. Michalak, K. W. Harris, N. W. Hengartner, B. E. Takala, and S. A. Wender. Predicting the number of fatal soft errors in Los Alamos National Labratory’s ASC Q computer. IEEE Transactions on Device and Materials Reliability, 5(3):329–335, September 2005.
    Google ScholarLocate open access versionFindings
  • G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to Typed Assembly Language. ACM Transactions on Programming Languages and Systems, 3(21):528–569, May 1999.
    Google ScholarLocate open access versionFindings
  • S. S. Mukherjee, M. Kontz, and S. K. Reinhardt. Detailed design and evaluation of redundant multithreading alternatives. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 99–110. IEEE Computer Society, 2002.
    Google ScholarLocate open access versionFindings
  • G. C. Necula. Compiling with Proofs. PhD thesis, Carnegie Mellon University, 1998.
    Google ScholarFindings
  • T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, I. C. J. Montrose, H. W. Curtis, and J. L. Walsh. Field testing for cosmic ray soft errors in semiconductor memories. In IBM Journal of Research and Development, pages 41–49, January 1996.
    Google ScholarLocate open access versionFindings
  • N. Oh, P. P. Shirvani, and E. J. McCluskey. Control-flow checking by software signatures. In IEEE Transactions on Reliability, volume 51, pages 111–122, March 2002.
    Google ScholarLocate open access versionFindings
  • N. Oh, P. P. Shirvani, and E. J. McCluskey. Error detection by duplicated instructions in super-scalar processors. In IEEE Transactions on Reliability, volume 51, pages 63–75, March 2002.
    Google ScholarLocate open access versionFindings
  • J. Ohlsson and M. Rimen. Implicit signature checking. In International Conference on Fault-Tolerant Computing, June 1995.
    Google ScholarLocate open access versionFindings
  • F. Perry, L. Mackey, G. A. Reis, J. Ligatti, D. I. August, and D. Walker. Fault-tolerant typed assembly language. Technical Report TR-77607, Princeton University, 2007.
    Google ScholarFindings
  • S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 25–36. ACM Press, 2000.
    Google ScholarLocate open access versionFindings
  • G. A. Reis, J. Chang, and D. I. August. Automatic instructionlevel software-only recovery methods. In IEEE Micro Top Picks, volume 27, January 2007.
    Google ScholarLocate open access versionFindings
  • G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. SWIFT: Software implemented fault tolerance. In Proceedings of the 3rd International Symposium on Code Generation and Optimization, March 2005.
    Google ScholarLocate open access versionFindings
  • G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee. Design and evaluation of hybrid fault-detection systems. In Proceedings of the 32th Annual International Symposium on Computer Architecture, pages 148–159, June 2005.
    Google ScholarLocate open access versionFindings
  • P. P. Shirvani, N. Saxena, and E. J. McCluskey. Softwareimplemented EDAC protection against SEUs. In IEEE Transactions on Reliability, volume 49, pages 273–284, 2000.
    Google ScholarLocate open access versionFindings
  • P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 389–399, June 2002.
    Google ScholarLocate open access versionFindings
  • T. J. Slegel, R. M. Averill III, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li, J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum, and C. F. Webb. IBM’s S/390 G5 Microprocessor design. In IEEE Micro, volume 19, pages 12–23, March 1999.
    Google ScholarLocate open access versionFindings
  • S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, pages 61–71, June 2006.
    Google ScholarLocate open access versionFindings
  • R. Venkatasubramanian, J. P. Hayes, and B. T. Murray. Low-cost on-line fault detection using control flow assertions. In Proceedings of the 9th IEEE International On-Line Testing Symposium, pages 137–143, July 2003.
    Google ScholarLocate open access versionFindings
  • T. N. Vijaykumar, I. Pomeranz, and K. Cheng. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 87–98. IEEE Computer Society, 2002.
    Google ScholarLocate open access versionFindings
  • D. Walker, L. Mackey, J. Ligatti, G. Reis, and D. I. August. Static typing for a faulty lambda calculus. In ACM International Conference on Functional Programming, Portland, Oregon, Sept. 2006.
    Google ScholarLocate open access versionFindings
  • Y. Yeh. Triple-triple redundant 777 primary flight computer. In Proceedings of the 1996 IEEE Aerospace Applications Conference, volume 1, pages 293–307, February 1996.
    Google ScholarLocate open access versionFindings
  • J. F. Ziegler and H. Puchner. SER - History, Trends, and Challenges: A Guide for Designing with Memory ICs. 2004.
    Google ScholarFindings
您的评分 :
0

 

标签
评论
数据免责声明
页面数据均来自互联网公开来源、合作出版商和通过AI技术自动分析结果,我们不对页面数据的有效性、准确性、正确性、可靠性、完整性和及时性做出任何承诺和保证。若有疑问,可以通过电子邮件方式联系我们:report@aminer.cn
小科