Modeling and Discovering Vulnerabilities with Code Property Graphs by Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck.
Abstract:
The vast majority of security breaches encountered today are a direct result of insecure code. Consequently, the protection of computer systems critically depends on the rigorous identification of vulnerabilities in software, a tedious and error-prone process requiring significant expertise. Unfortunately, a single flaw suffices to undermine the security of a system and thus the sheer amount of code to audit plays into the attacker’s cards. In this paper, we present a method to effectively mine large amounts of source code for vulnerabilities. To this end, we introduce a novel representation of source code called a code property graph that merges concepts of classic program analysis, namely abstract syntax trees, control flow graphs and program dependence graphs, into a joint data structure. This comprehensive representation enables us to elegantly model templates for common vulnerabilities with graph traversals that, for instance, can identify buffer overflows, integer overflows, format string vulnerabilities, or memory disclosures. We implement our approach using a popular graph database and demonstrate its efficacy by identifying 18 previously unknown vulnerabilities in the source code of the Linux kernel.
I was running down references in the documentation for joern when I discovered this paper.
A recent SSH bug in Apple's iOS is used to demonstrate a code property graph, which combines the perspectives of Abstract Syntax Trees, Control Flow Graphs, and Program Dependence Graphs.
In topic map lingo we would call those “universes of discourse,” but the essential fact to remember is that combining different perspectives (are you listening, NSA?) is where a code property graph derives its power.
Note that I said “combining” (different perspectives are preserved) not “sanitizing” (different perspectives are lost).
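To make the "combining, not sanitizing" point concrete, here is a minimal sketch of a single graph that holds all three perspectives at once. It is a toy, not the authors' implementation: the PropertyGraph class, the node ids, the two-statement fragment, and the plain "AST", "CFG" and "PDG" edge labels are all assumptions made for illustration.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy property graph: nodes carry key/value properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}               # node id -> dict of properties
        self.out = defaultdict(list)  # node id -> list of (label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.out[src].append((label, dst))

    def successors(self, node_id, label):
        """Follow only edges of one kind, i.e. look through one perspective."""
        return [dst for (lbl, dst) in self.out[node_id] if lbl == label]


def build_example():
    # Hypothetical fragment:  n = read_len(); buf = malloc(n);
    g = PropertyGraph()
    g.add_node("s1", type="statement", code="n = read_len()")
    g.add_node("s2", type="statement", code="buf = malloc(n)")
    g.add_node("call", type="call", code="malloc(n)")
    g.add_node("arg", type="identifier", code="n")

    g.add_edge("s2", "AST", "call")   # syntax: the statement contains the call
    g.add_edge("call", "AST", "arg")  # syntax: the call has the argument n
    g.add_edge("s1", "CFG", "s2")     # control flow: s1 executes before s2
    g.add_edge("s1", "PDG", "s2")     # data dependence: s2 uses n defined in s1
    return g


g = build_example()
print(g.successors("s2", "AST"))  # ['call']
print(g.successors("s1", "CFG"))  # ['s2']
print(g.successors("s1", "PDG"))  # ['s2']
```

The same nodes serve every perspective. Only the edge label decides which view a query walks, so nothing is thrown away when the views are merged.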
Using Neo4j, the authors built a code property graph of the Linux kernel with 52 million nodes and 87 million edges. Analyzing that graph, they found 18 previously unknown vulnerabilities.
Important: Patterns discovered in a code property graph can be used to identify vulnerabilities in other source code. Searching for bugs in source code can become cumulative and less episodic.
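A rough sketch of what "cumulative" could look like: write a pattern once as a function over a graph, then run it unchanged against any code base that has been turned into the same kind of graph. Everything here is hypothetical, including the tainted_sinks pattern, the recv_len source, the "DATA" and "FLOW" edge labels, and the two tiny projects. It is not the traversal language used in the paper.

```python
def reaches(edges, start, label):
    """Return all nodes reachable from start along edges with the given label."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for (src, lbl, dst) in edges:
            if src == node and lbl == label and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def tainted_sinks(nodes, edges, source, sink):
    """Pattern: a statement calling `sink` that is data-dependent (directly or
    transitively) on a statement calling `source`. Returns matching node ids."""
    hits = set()
    for node_id, code in nodes.items():
        if source + "(" in code:
            for reached in reaches(edges, node_id, "DATA"):
                if sink + "(" in nodes[reached]:
                    hits.add(reached)
    return sorted(hits)


# Two hypothetical code bases, already converted to (nodes, edges).
project_a = (
    {"a1": "len = recv_len(sock)", "a2": "memcpy(dst, src, len)"},
    [("a1", "DATA", "a2"), ("a1", "FLOW", "a2")],
)
project_b = (
    {
        "b1": "n = recv_len(fd)",
        "b2": "n = n + 4",
        "b3": "memcpy(out, in, n)",
        "b4": "memcpy(out, in, 16)",  # constant length, not data-dependent on b1
    },
    [
        ("b1", "DATA", "b2"), ("b2", "DATA", "b3"),
        ("b1", "FLOW", "b2"), ("b2", "FLOW", "b3"), ("b3", "FLOW", "b4"),
    ],
)

# The same pattern is applied, unchanged, to both projects.
for name, (nodes, edges) in (("project_a", project_a), ("project_b", project_b)):
    print(name, tainted_sinks(nodes, edges, source="recv_len", sink="memcpy"))
# project_a ['a2']
# project_b ['b3']
```

A real pattern would also need to rule out sanitizing checks between source and sink, which this sketch does not attempt. The point is only that the pattern lives apart from any one code base and can be kept, shared, and rerun.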
Comparing the source and bug histories of the Linux kernel, Apache HTTP Server, Sendmail, etc. should surface some of the common graph patterns that signal vulnerabilities in source code.
Will the white hat or the black hat community be the first to build a public repository of graph patterns showing source code vulnerabilities?
Hiding security information hasn’t worked so far, but I think you know the most likely result.