Recovering the toolchain provenance of binary code.

ISSTA (2011)

Abstract
Program binaries are an artifact of a production process that begins with source code and ends with a string of bytes representing executable code. There are many reasons to want to know the specifics of this process for a given binary (forensic investigation of malware, diagnosing the role of the compiler in crashes or performance problems, reverse engineering and decompilation), but binaries are not generally annotated with such provenance details. Intuitively, the binary code should exhibit properties specific to the process that produced it, but it is not at all clear how to find such properties and map them to specific elements of that process. In this paper, we present an automatic technique to recover toolchain provenance: those details, such as the source language, the compiler, and the compilation options, that define the transformation process through which the binary was produced. We approach provenance recovery as a classification problem, discovering characteristics of binary code that are strongly associated with particular toolchain components and developing models that can infer the likely provenance of program binaries. Our experiments show that toolchain provenance can be recovered with high accuracy, approaching 100% for some components and yielding good results (90% accuracy) even when the binaries emitted by different components appear to be very similar.
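To make the "provenance recovery as classification" framing concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not specify the features or models used, so this sketch assumes byte-bigram counts as features and a nearest-centroid classifier with cosine similarity. The labels `toolchain_a` and `toolchain_b` and the training byte strings are hypothetical toy data, loosely modeled on x86 function prologues.

```python
import math
from collections import Counter


def ngram_profile(code: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a blob of machine code."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Sum the n-gram profiles of all samples sharing a label.

    samples: iterable of (label, code_bytes) pairs.
    Returns a dict mapping each label to its aggregate profile.
    """
    centroids = {}
    for label, code in samples:
        centroids.setdefault(label, Counter()).update(ngram_profile(code))
    return centroids


def classify(centroids, code: bytes) -> str:
    """Return the label whose aggregate profile is most similar to `code`."""
    profile = ngram_profile(code)
    return max(centroids, key=lambda label: cosine(centroids[label], profile))


# Hypothetical training data: two made-up toolchains that encode the same
# prologue with different opcode bytes (assumption, for illustration only).
TRAINING = [
    ("toolchain_a", bytes.fromhex("5589e583ec10") * 4),
    ("toolchain_b", bytes.fromhex("558bec83ec10") * 4),
]

centroids = train(TRAINING)
guess_a = classify(centroids, bytes.fromhex("5589e583ec18c3"))
guess_b = classify(centroids, bytes.fromhex("558bec5dc3"))
print(guess_a, guess_b)
```

The two toy toolchains differ only in how one instruction is encoded, which mirrors the abstract's point that binaries from different components can look very similar yet still carry distinguishing characteristics. A realistic system would need labeled corpora of real binaries and a stronger feature-selection and modeling pipeline.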