Archive for the ‘Compilers’ Category

Apache Spark as a Compiler:… [This is wicked cool!]

Tuesday, May 24th, 2016

Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop by Sameer Agarwal, Davies Liu and Reynold Xin.

From the post:

When our team at Databricks planned our contributions to the upcoming Apache Spark 2.0 release, we set out with an ambitious goal by asking ourselves: Apache Spark is already pretty fast, but can we make it 10x faster?

This question led us to fundamentally rethink the way we built Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), a majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading or writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted in this useless work has been a long-time focus of modern compilers.

Apache Spark 2.0 will ship with the second generation Tungsten engine. Built upon ideas from modern compilers and MPP databases and applied to data processing queries, Tungsten emits (SPARK-12795) optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. As a result of this streamlined strategy, called “whole-stage code generation,” we significantly improve CPU efficiency and gain performance.

(emphasis in original)

How much better you ask?

cost per row (in nanoseconds, single thread)

primitive Spark 1.6 Spark 2.0
filter 15 ns 1.1 ns
sum w/o group 14 ns 0.9 ns
sum w/ group 79 ns 10.7 ns
hash join 115 ns 4.0 ns
sort (8-bit entropy) 620 ns 5.3 ns
sort (64-bit entropy) 620 ns 40 ns
sort-merge join 750 ns 700 ns
Parquet decoding (single int column) 120 ns 13 ns

Don’t just stare at the numbers:

Try the whole-stage code generation notebook in Databricks Community Edition

What’s the matter?

Haven’t you ever seen a 1 billion record join in 0.8 seconds? (Down from 61.7 seconds.)

If all that weren’t impressive enough, the post walks you through the dominate (currently) query evaluation strategy as a setup to Spark 2.0 and then into why “whole-stage code generation is so powerful.”

A must read!

Rosetta’s Way Back to the Source

Friday, May 22nd, 2015

Rosetta’s Way Back to the Source – Towards Reverse Engineering of Complex Software by Herman Bos.

From the webpage:

The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software sthat is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).

In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.

Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used

Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.
… (emphasis added)

Herman gives a great thumbnail sketch of the difficulties and potential for this project.

Looking forward to news of a demonstration that “irreversibility of computation” is false.

One important use case being verification that software that claims to have used prevention of buffer overflow techniques has in fact done so. Not the sort of thing I would entrust to statements in marketing materials.

Decompiling Clojure

Thursday, April 24th, 2014

Guillermo Winkler has started a series of posts on decompiling Clojure.

Thus far, Decompiling Clojure I and Decompiling Clojure II, The Compiler have been posted.

From the first post:

This is the first in a series of articles about decompiling Clojure, that is, going from JVM bytecode created by the Clojure compiler, to some kind of higher level language, not necessarily Clojure.

This article was written in the scope of a larger project, building a better Clojure debugger, which I’ll probably blog about in the future.

These articles are going to build form the ground up, so you may skip forward if you find some of the stuff obvious.

Just in case you want to read something more challenging than the current FCC and/or security news.

Stanford Spring Offerings

Sunday, February 16th, 2014

Just a quick heads up on two Stanford Online courses that may be of interest:

Compilers, Alex Aiken, Monday, March 17, 2014

From the course page:

This course will discuss the major ideas used today in the implementation of programming language compilers, including lexical analysis, parsing, syntax-directed translation, abstract syntax trees, types and type checking, intermediate languages, dataflow analysis, program optimization, code generation, and runtime systems. As a result, you will learn how a program written in a high-level language designed for humans is systematically translated into a program written in low-level assembly more suited to machines. Along the way we will also touch on how programming languages are designed, programming language semantics, and why there are so many different kinds of programming languages.

The course lectures will be presented in short videos. To help you master the material, there will be in-lecture questions to answer, quizzes, and two exams: a midterm and a final. There will also be homework in the form of exercises that ask you to show a sequence of logical steps needed to derive a specific result, such as the sequence of steps a type checker would perform to type check a piece of code, or the sequence of steps a parser would perform to parse an input string. This checking technology is the result of ongoing research at Stanford into developing innovative tools for education, and we’re excited to be the first course ever to make it available to students.

An optional course project is to write a complete compiler for COOL, the Classroom Object Oriented Language. COOL has the essential features of a realistic programming language, but is small and simple enough that it can be implemented in a few thousand lines of code. Students who choose to do the project can implement it in either C++ or Java.

I hope you enjoy the course!

Machine Learning, Andrew Ng, Monday, March 3, 2014.

From the course page:

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.


Automata [Starts 4 Nov. 2013]

Tuesday, October 8th, 2013

Automata by Jeff Ullman.

From the course description:

Why Study Automata Theory?

This subject is not just for those planning to enter the field of complexity theory, although it is a good place to start if that is your goal. Rather, the course will emphasize those aspects of the theory that people really use in practice. Finite automata, regular expressions, and context-free grammars are ideas that have stood the test of time. They are essential tools for compilers. But more importantly, they are used in many systems that require input that is less general than a full programming language yet more complex than “push this button.”

The concepts of undecidable problems and intractable problems serve a different purpose. Undecidable problems are those for which no computer solution can ever exist, while intractable problems are those for which there is strong evidence that, although they can be solved by a computer, they cannot be solved sufficiently fast that the solution is truly useful in practice. Understanding this theory, and in particular being able to prove that a problem you are facing belongs to one of these classes, allows you to justify taking another approach — simplifying the problem or writing code to approximate the solution, for example.

During the course, I’m going to prove a number of things. The purpose of these proofs is not to torture you or confuse you. Neither are the proofs there because I doubt you would believe me were I merely to state some well-known fact. Rather, understanding how these proofs, especially inductive proofs, work, lets you think more clearly about your own work. I do not advocate proofs that programs are correct, but whenever you attempt something a bit complex, it is good to have in mind the inductive proofs that would be needed to guarantee that what you are doing really works in all cases.

Recommended Background

You should have had a second course in Computer Science — one that covers basic data structures (e.g., lists, trees, hashing), and basic algorithms (e.g., tree traversals, recursive programming, big-oh running time). In addition, a course in discrete mathematics covering propositional logic, graphs, and inductive proofs is valuable background.

If you need to review or learn some of these topics, there is a free on-line textbook Foundations of Computer Science, written by Al Aho and me, available at Recommended chapters include 2 (Recursion and Induction), 3 (Running Time of Programs), 5 (Trees), 6 (Lists), 7 (Sets), 9 (Graphs), and 12 (Propositional Logic). You will also find introductions to finite automata, regular expressions, and context-free grammars in Chapters 10 and 11. Reading Chapter 10 would be good preparation for the first week of the course.

The course includes two programming exercises for which a knowledge of Java is required. However, these exercises are optional. You will receive automated feedback, but the results will not be recorded or used to grade the course. So if you are not familiar with Java, you can still take the course without concern for prerequisites.

All of “Foundations of Computer Science” is worth reading but for this course:

Chapter 2 Iteration, Induction, and Recursion
Chapter 3 The Running Time of Programs
Chapter 5 The Tree Data Model
Chapter 6 The List Data Model
Chapter 7 The Set Data Model
Chapter 9 The Graph Data Model
Chapter 10 Patterns, Automata, and Regular Expressions
Chapter 11 Recursive Description of Patterns
Chapter 12 Propositional Logic

Six very intensive weeks but on the bright side, you will be done before the holiday season. 😉

Parsing with Pictures

Wednesday, October 17th, 2012

Parsing with Pictures by Keshav Pingali and Gianfranco Bilardi. (PDF file)

From an email that Keshav sent to the email list:

Gianfranco Bilardi and I have developed a new approach to parsing context-free languages that we call “Parsing with pictures”. It provides an alternative (and, we believe, easier to understand) approach to context-free language parsing than the standard presentations using derivations or pushdown automata. It also unifies Earley, SLL, LL, SLR, and LR parsers among others.

Parsing problems are formulated as path problems in a graph called the grammar flow graph (GFG) that is easily constructed from a given grammar. Intuitively, the GFG is to context-free grammars what NFAs are to regular languages. Among other things, the paper has :

(i) an elementary derivation of Earley’s algorithm for parsing general context-free grammars, showing that it is an easy generalization of the well-known reachability-based NFA simulation algorithm,

(ii) a presentation of look-ahead that is independent of particular parsing strategies, and is based on a simple inter-procedural dataflow analysis,

(iii) GFG structural characterizations of LL and LR grammars that are simpler to understand than the standard definitions, and bring out a symmetry between these grammar classes,

(iv) derivations of recursive-descent and shift-reduce parsers for LL and LR grammars by optimizing the Earley parser to exploit this structure, and

(v) a connection between GFGs and NFAs for regular grammars based on the continuation-passing style (CPS) optimization.

Or if you prefer the more formal abstract:

The development of elegant and practical algorithms for parsing context-free languages is one of the major accomplishments of 20th century Computer Science. These algorithms are presented in the literature using string rewriting systems or abstract machines like pushdown automata, but the resulting descriptions are unsatisfactory for several reasons. First, even a basic understanding of parsing algorithms for some grammar classes such as LR(k) grammars requires mastering a formidable number of difficult concepts and terminology. Second, parsing algorithms for different grammar classes are often presented using entirely different formalisms, so the relationships between these grammar classes are obscured. Finally, these algorithms seem unrelated to algorithms for regular language recognition even though regular languages are a subset of context-free languages.

In this paper, we show that these problems are avoided if parsing is reformulated as the problem of finding certain kinds of paths in a graph called the Grammar Flow Graph (GFG) that is easily constructed from a context-free grammar. Intuitively, GFG’s permit parsing problems for context-free grammars to be formulated as path problems in graphs in the same way that non-deterministic finite-state automata do for regular grammars. We show that the GFG enables a unified treatment of Earley’s parser for general context-free grammars, recursive-descent parsers for LL(k) and SLL(k) grammars, and shift-reduce parsers for LR(k) and SLR(k) grammars. Computation of look-ahead sets becomes a simple interprocedural dataflow analysis. These results suggest that the GFG can be a new foundation for the study of context-free languages.

Odd as it may sound, some people want to be understood.

If you think being understood isn’t all that weird, do a slow read on this paper and provide feedback to the authors.

Martin Odersky: Reflection and Compilers

Saturday, April 14th, 2012

Martin Odersky: Reflection and Compilers

From the description:

Reflection and compilers do tantalizing similar things. Yet, in mainstream, statically typed languages the two have been only loosely coupled, and generally share very little code. In this talk I explore what happens if one sets out to overcome their separation.

The first half of the talk addresses the challenge how reflection libraries can share core data structures and algorithms with the language’s compiler without having compiler internals leaking into the standard library API. It turns out that a component system based on abstract types and path-dependent types is a good tool to solve this challenge. I’ll explain how the “multiple cake pattern” can be fruitfully applied to expose the right kind of information.

The second half of the talk explores what one can do when strong, mirror-based reflection is a standard tool. In particular, the compiler itself can use reflection, leading to a particular system of low-level macros that rewrite syntax trees. One core property of these macros is that they can express staging, by rewriting a tree at one stage to code that produces the same tree at the next stage. Staging lets us implement type reification and general LINQ-like functionality. What’s more, staging can also be applied to the macro system itself, with the consequence that a simple low-level macro system can produce a high-level hygienic one, without any extra effort from the language or compiler.

Ignore the comments about the quality of the sound and video. It looks like substantial improvements have been made or I am less sensitive to those issues. Give it a try and see what you think.

Strikes me as being very close to Newcomb’s thoughts on subject identity being composed of other subject identities.

Such that you could have subject representatives that “merge” together and then themselves form the basis for merging other subject representatives.

Suggestions of literature on reflection, its issues and implementations? (Donated books welcome as well. Contact for physical delivery address.)

The Heterogeneous Programming Jungle

Saturday, March 24th, 2012

The Heterogeneous Programming Jungle by Michael Wolfe.

Michael starts off with one definition of “heterogeneous:”

The heterogeneous systems of interest to HPC use an attached coprocessor or accelerator that is optimized for certain types of computation.These devices typically exhibit internal parallelism, and execute asynchronously and concurrently with the host processor. Programming a heterogeneous system is then even more complex than “traditional” parallel programming (if any parallel programming can be called traditional), because in addition to the complexity of parallel programming on the attached device, the program must manage the concurrent activities between the host and device, and manage data locality between the host and device.

And while he returns to that definition in the end, another form of heterogeneity is lurking not far behind:

Given the similarities among system designs, one might think it should be obvious how to come up with a programming strategy that would preserve portability and performance across all these devices. What we want is a method that allows the application writer to write a program once, and let the compiler or runtime optimize for each target. Is that too much to ask?

Let me reflect momentarily on the two gold standards in this arena. The first is high level programming languages in general. After 50 years of programming using Algol, Pascal, Fortran, C, C++, Java, and many, many other languages, we tend to forget how wonderful and important it is that we can write a single program, compile it, run it, and get the same results on any number of different processors and operating systems.

So there is the heterogeneity of attached coprocessor and, just as importantly, of the processors with coprocessors.

His post concludes with:

Grab your Machete and Pith Helmet

If parallel programming is hard, heterogeneous programming is that hard, squared. Defining and building a productive, performance-portable heterogeneous programming system is hard. There are several current programming strategies that attempt to solve this problem, including OpenCL, Microsoft C++AMP, Google Renderscript, Intel’s proposed offload directives (see slide 24), and the recent OpenACC specification. We might also learn something from embedded system programming, which has had to deal with heterogeneous systems for many years. My next article will whack through the underbrush to expose each of these programming strategies in turn, presenting advantages and disadvantages relative to the goal.

These are languages that share common subjects (think of their target architectures) and so are ripe for a topic map that co-locates their approaches to a particular architecture. Being able to incorporate official and non-official documentation, tests, sample code, etc., might enable faster progress in this area.

The future of HPC processors is almost upon us. It will not do to be tardy.

A Distributed C Compiler System on MapReduce: Mrcc

Thursday, March 15th, 2012

A Distributed C Compiler System on MapReduce: Mrcc

Alex Popescu of myNoSQL points to software and a paper on distributed C code for compilation.

Changing to distributed architectures may uncover undocumented decisions made long ago and far away. Decisions that we may choose to make differently this time. Hopefully we will do a better job of documenting them. (Not that it will happen but there is no law against hoping.)