Tuesday, 29 November 2011

Scheme as an embedded DSL in Clojure

If you give someone Fortran, he has Fortran.
If you give someone Lisp, he has any language he pleases.
-- Guy Steele
Replace Fortran with whatever language you are currently using, and the quote still holds true today. Lisp has been around for a long time, and it's built in flexibility is still unmatched by other languages. In this post we will look at key Lisp concepts such as code-is-data and powerful macro semantics.

When you write programs in Lisp, you tend to solve problems very differently from how you would solve them with OO languages, and also different how you would in other functional languages. Where in ML you would write a set of types and functions to operate (match) on them, in Lisp and Clojure specifically you are more likely to stick to the core data types and write functions and macros that make up a Domain Specific Language (DSL) for the problem at hand. More specifically, an internal/embedded DSL using the Lisp syntax but with new functionality that makes the solution or logic simple and clear. The ability to transform code in Lisp is very powerful indeed, and makes you think of code in a different way.

One big benefit of internal DSLs are the speed of execution. Since we map to Clojures native constructs, the examples below will run at the same speed as one defined directly in Clojure (they are infact the same!). In future posts, when we look at external DSLs using interpreters which are much slower.

So here's an example of how (a subset of) Scheme can be written as an internal DSL in Clojure. The full code is available on github.

As you might suspect, this is pretty simple since a lot of functions are exactly the same in Scheme and Clojure.
Some Scheme functions simply has a different name in Clojure, and can be bound to a new var like so;
Scheme's define form is called def in Clojure, and since def is a special form, we can't use the same strategy as with display, rather we need a macro to transform define code to use def;
This uses syntax quotes to get a new list and unquote slicing to get back to the argument list that def requires. Note that this example ignores the fact that in Scheme "define" is used for simple var bindings and function definitions (see full source code for a macro that handles both cases).

A slightly more involved example is Scheme's cond form which uses a extra pair of parens for each case;
In Clojure cond is a macro that transforms the code to a list of nested if statements. So we can write a macro for the Scheme-cond that loops over the list of cases and transforms them directly to nested ifs.
As you can see we also replace the "else" symbol with a Clojure keyword (which actually can be any keyword since they are all "true", :else is used to clarity).

Clojure provides a simple but powerful "debug" functionality for macros, macroexpand. It returns the expanded code for each step in the recursion like so;
Here we can observe the recursive nature of the macro, any errors in the macro will be clear.

One note on this cond macro in particular is that replacing build-in function in the "host language" is a bad idea for an internal DSL. The user of the DSL expects to use the power of that host language and the extra functionality provided by the DSL. Replacing Clojure's cond can be very confusing!

Not everything is a macro, if we look at the cons funcion, there is a subtle difference between Scheme and Clojure;
The second parameter in clojure is a sequence. A function is more suited than a macro to translate this;
We put the second parameter in a vector unless it's already a collection. The trick here is the recursive nature of cons; (cons 1 (cons 2 3)).
This will call our new cons function twice, from the "inside out". The number 3 will be put in a vector, but the result of the nested cons will not.

Finally the read-eval-print-loop is very simple;
Since our DSL have name clashes with Clojure, we need to exclude those when defining the repl namespace. The REPL itself is a simple recursive loop that reads a line, evals it and prints the result. That's it!

Sunday, 20 November 2011

Tail Calls in F#, Clojure and Scala

I recently looked into Tail Call Optimisation/Elimination (TCO) and the implications for 3 modern languages, namely F#, Clojure and Scala. In this post I share my my findings. If you're new to the subject or just looking into some of these languages I hope this post can be of some use to you. I will mix code snippets in the 3 languages freely (and without warning! :)

TCO is a well documented topic in books and articles about functional programming and the TCO in .NET and the lack their of in the JVM has been debated "to death" on various programmer's boards. I don't indend to add any fuel to the fire here, rather some background and practical implications.

Recursion is a fundamental corner stone of programming and is particularly emphasised in functional programming. It is the idiomatic way to loop over sequences in languages like Clojure. Here's a classic example of a function calculating the sum of all natural number less or equal to n.
This implementation is both very easy to understand and correct, so what's the problem? Well, this implementation is not "tail recursive". A tail recursive function has the recursive call at it's tail and noting else (immediately returning the result of the call). In this case the result of the recursive function is used in an addition, and the result of the addition is returned. The practical implication of this is that during execution we are building up a chain of call-stacks, which cannot be freed until we reach n=0 and the results "bubble up".

When n is large this will lead to a "Stack Overflow" exception.
fsi> sum 1000000;;
Process is terminated due to StackOverflowException.
Every functional programmer have two handy tools in his toolbox to solve this problem; re-write using accumulators (folding) and continuation passing style.

If we look at the sum function's loop, and think how we would implement this using imperative programming, we would probably write a for loop like so;
The recursive variant of this will look like this;
By passing around the result we have a tail recursive solution.

Continuation passing style is a bit more tricky to understand, but basically instead of passing the result as parameter we are passing a calculation (or continuation). This continuation will be a chain of closures which can be completely resolved when we reach the end of the recursion. Here is an example;
If we call it with the id function as shown above, result of the "closure-chain" will be the result we are looking for. As you can see, this is also a tail recursive solution.

Clever compilers
We have now converted our simple example to a tail recursive version, so now it should run with very big n-s without any problem right? Well, not always. To understand why we need to dig into how our compiler and runtimes work (in this case .NET and JVM).

If we look at what the F# compiler produces for a accumulator-tail-recursive sum function above we'll see this (de-compiled into C# with ILSpy).
That's is great, the compiler has realised that the tail recursion can be converted to a while loop and removed any recursive calls. The Scala compiler does the same (de-compiled to java with Java Decompiler)
However, Clojure does not! The clojure compiler require an explicit form to convert "mundane" recursion into a non-recurisve loop (Scala also support explicit tail call checking with the @tailrec annotation). This is the loop/recur form;
Awesome, problem solved, what's all this fuss about TCO then?

Let's say we have two functions a and b calling each other recursively;
This is called mutual recursion, and are commonly used in functional implementations of finite state machines. Even though the a+b functions are tail recursive, we cannot convert to while loops but must proceed with the recursive calls. Quite naturally this will blow up with stack overflow quite quickly. Steele and Sussman realised in their famous lambda papers back in the 70's, that a tail-recursive functions' stack resources can be freed as soon as the call is made, there is not point of keeping that stack frame around.

For a tail recursive sum example, with TCO we don't build up a chain of stacks and can thus handle summations of any depth (value of n).

The "dealloc-on-call" functionality is something the runtime have to support, which .NET does. If we look at the byte code for the function "b" above (compiled with tailcall-optimization on) we see;
 IL_0000: nop
 IL_0001: newobj instance void Program/b@12::.ctor()
 IL_0006: ldarg.0
 IL_0007: ldc.i4.1
 IL_0008: add
 IL_0009: tail.
 IL_000b: call int32 Program::a(class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2, int32)
 IL_0010: ret
Please note the "tail." op code, that's the secret sauce! The F# compiler has found a tail call and inserted the "tail." op code. This tells the .NET runtime to destroy the caller's resources and proceed "in" the callee's stack. This allows the a/b example above to run indefinitely without any stack overflows.

So what about the JVM? The bad news is that there is no "tail." java byte code (even if experimental implementations do exist). Here is a what the Clojure compiler produced for the b function (the invokeinterface is the recursive call to a);
 0 getstatic #27 
 3 invokevirtual #54 
 6 checkcast #56 
 9 getstatic #31 
12 invokevirtual #54 
15 aload_1
16 aconst_null
17 astore_1
18 lconst_1
19 invokestatic #62 
22 invokeinterface #65  count 3
27 areturn
Clojure solves the mutual recursion problem with a "trampoline". The idea is that instead of a and b calling each other directly, they return a closure containing that call. The trampoline will then run those closures in it's own stack-frame eliminating the stack build up.
Similar examples exist for Scala, a trampoline is infact trivial to implement.

General TCO is always best, so F# and .NET has the upper hand here. However Clojure and Scala are still fit for use, even if you follow a "strict" functional paradigm with lots of recursion. You have to be more explicit in the JVM languages and be careful to remember trampolines in cases of mutual recursion (this is especially true for FSMs that change state slowly and can act as time bombs for your program). Being explicit about tail calls is not necessarily a bad thing, it shows the programmer have thought about his code, and highlighted the behaviour.

Update: I should point that using Continuation Passing Style in Clojure, although being tail recursive, still suffers from the TCO problem. You will get Stack Overflows when the "closure chain" gets too big.

Update2: A clever compiler can convert mutual recursion into while loops. The a/b example above can be transformed into something like this; However, I don't know of any compiler that does this!

Thursday, 10 November 2011

Applied Symbolic Execution with KLEE/LLVM

This is a follow up article to my previous post on symbolic execution. Here we look at KLEE and LLVM in more detail, and present a potential practical application for a symbolic executor. We also discuss some of the limitations and drawbacks with this approach.

Our changes for KLEE and LLVM can be found on github

One limitation of symbolic execution (and dynamic code analysis in general) is that the code under analysis needs to be buildable and linkable. Thus, its harder to analyse sub systems (or snippets) than with a lint tool. Another complication is that the symbolic executor’s virtual machine also needs to understand / model the system calls that the code uses. This makes the tool OS dependant, since you have to emulate all calls that "escapes" the executor. Cadar, Dunbar and Engler explains how this can be done for Linux analysing GNU coreutils in [1].


KLEE is built on the LLVM Compiler infrastructure . LLVM defines a language independent intermediate code representation called LLVM-IR. KLEE contains a LLVM-IR interpreter (executor) capable of executing any program in LLVM-IR format. Further, KLEE allows you to mark certain areas of memory as symbolic, altering the execution to cover previously untouched code. At a very high level, KLEE creates an internal state for each instance of execution that runs a unique path. The points at which new states are created (forked) are typically branches where the condition is symbolic.

At any given time, KLEE can calculate concrete values for the symbolic memory that forces the code down a particular path. So, this technology can be used to generate a set for test vectors to drive full test coverage of any given piece of code.

In its current implementation, KLEE also checks for erroneous memory references and division by zero defects.  Whenever a defect like this is found, KLEE will generate the concrete values that caused this defect. This means that in the set of defects KLEE finds, there will be no false positives.
As you can imagine, for a big program, a potential huge amount of states can be created, due to the CFG path explosion problem. This is one of the limitations of a symbolic executor.

KLEE makes heavy use of a component called STP [2], a purpose built constraint solver to evaluate the accumulated path constraints for the symbolic data. Solving the path constraints (and the Boolean SAT that they are converted to) is in fact a NP-complete problem, so the time it takes to find a solution in unbound. Fortunately the area of theorem provers and SAT solvers are under heavy research and progress is steady. STP is a combination of best in class SAT solver and many heuristics and optimizations tailored for the kind of constrains you see from executing code. It has been proven to be very effective.

Finally, while executing a program and building up a big set of states, KLEE has to determine which state to “schedule” next. Using an optimized searcher (KLEE terminology) is crucial to find the correct path and solving the problem you are interested in. KLEE provides a wide range of different searchers, optimized for different use cases like accomplishing maximum code coverage etc.

The first step in analysing code with KLEE is to generate a single LLVM-IR executable containing the code you want to test, and the libraries it depends on. At the time of writing, there are 2 compilers available of generating LLVM-IR; llvm-gcc  and clang . With llvm-gcc you can use all gcc front-ends (i.e. C, C++ etc) whereas clang offers support for C and objective-C (however, support for C++ is being worked on).
When you have compiled all your code (and supporting libraries) into LLVM-IR, you need to link them together into a single file (the command llvm-link can be used for this). KLEE can now execute and analyze the code in this binary.

However, in most cases this will only result in a normal execution of your code, since KLEE has to be told what parts of memory to treat as symbolic. It is also very likely that the code you are interested in is not reachable for the default main() function in this binary. More often than not, you will need to edit your code to remedy these issues. Fortunately, marking variables, arrays etc as symbolic is very easy, and only requires one added line to your code. Making the code of interest reachable is potentially harder. Basically you will need to call into the APIs/functions etc of the code of interest. If you have an existing test-suite, that’s a very good starting point, if not you need to write at least parts of it. It is also worthwhile to note that once a test suite exists, KLEE is a very powerful tool to be part of the regression testing, in order to automatically cover and look for certain defects in new code.

These manually aspects of setting up code for analysis is one of the drawbacks of symbolic executors like KLEE, limiting the level of automation you can apply. However, certain aspects of this manual work can be simplified [6, 7].

It is important that you compile all libraries into LLVM-IR alongside the code you want to test. The KLEE interpreter will resort to calling into “the environment” (i.e. the runtime / operating system) for all unresolved symbols in the LLVM-IR binary. This is fine for normal execution; however, KLEE cannot make calls outside the LLVM-IR binary with symbolic arguments. That will result in the termination of the execution state.

As a general rule, you only want to resort to calling outside the LLVM-IR environment at as a low level as possible (i.e. system calls). KLEE contains a model of the 40 most common linux systems calls, and can cope with calls to them since this model understand the semantics of the desired action well enough to generate the desired constraints. By creating your model at this level, you also limit the size of the model, and can keep the rest of the library and runtimes executed as normal by the KLEE interpreter.

How we altered KLEE and LLVM
We mainly focused on the CFG path explosion problem during our study. The basic problem is to decide (using a searcher) what execution state to pick in order to reach the part of the code you are interested in. Picking a state by random or doing depth first search is very likely to “get lost” in the combinatorial explosion of possible paths in a large piece of code.

Depending of what your objective of running KLEE is, several approaches can he taken. If the goal is to maximize the code coverage, picking an execution state that will most likely lead to new code being covered can be taken. KLEE comes with a searcher that picks a state that is “closest” to uncovered code.
During our study we used the assumption that we would already know (several) areas of interests in the code under test. Let’s say we have an “Oracle” pin-pointing out areas of interest in code. This Oracle could for instance be a Static Code Analysis tool like Coverity, CodeScanner, Lint or even manual code review.
In this case, KLEE can be used to verify the findings of these Oracles, removing false positives and generating test cases for the true defects. It can also help answer the question under which circumstances a particular part of the code is executed.

Given a set of areas of interest (in the form of file names and line numbers) and a LLVM-IR binary of the code, we created a LLVM analysis pass that does the following;
  1. Translates the file name, line number tuples to LLVM-IR basic blocks.
    Explanation; Given a number of potential problems given by an Oracale (in the form of filename and line number), the LLVM pass tries to map this to a specific basic block in the IR.
    The goal for the Klee searcher is now to "hit" that block and see whether it contains a defect.
  2. Generates at set of N unique paths between the entry point the basic block from 1.
    Explanation; The set of basic blocks from 1/ is not enough information for the searcher to find a path. It also needs hints on potential basic blocks leading towards it. We used a naive approach using graph theory to generate a large number of paths connecting the root block to the block of interest. Most of these paths will probably be impossible, so it is still a rough approximation.
Dependant of the complexity of the code, this operation can take a long time.

We wrote a new searcher to be used when executing the same LLVM-IR binary in KLEE. This searcher selects states by matching them to the set of pre-generated paths and terminates when the all basic blocks of interest have been covered.

Areas of improvement
A particularly hard problem for symbolic executors is how to reason with symbolic pointer dereferences. At its current implementation KLEE does an exhaustive search for each symbolic pointer deference. This implies checking if any solution of the pointer’s path constraint lies outside any allocated memory area. Though being correct, this is very expensive, and will lead to massive increase of states in a big program. There are suggested solutions to this problem; one that looks particularly promising is described in [4].

The annotation of the code to mark memory areas as symbolic (i.e. mechanically inserting the klee_make_symbolic) can be automated. See DART [6] for API analysis, or KleeNet [7] for ANTLR solution.

For supporting big real-world programs, more aggressive pruning of execution paths must be done. One very good way to do this is the record actual execution paths during program execution. This is done in both GodeFroid Dart [3] and Bitblaze. They both use a emulated environment for non-symbolic execution (to track actual execution paths) which can then be fed into the symbolic executor for further analysis. Bitblaze uses QEMU for this purpose, and is a nice practical hybrid  of VMs and Valgrind VEX-IR. Avalanche is a simpler solution relying solely on Valgrind.

  1. Cadar, Dunbar, Engler 2008
    KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
  2. Cadar, Ganesh, Pawlowski, Dill, Engler 2006 EXE: Automatically Generating Inputs of Death
  3. GodeFroid, Nori, Rajamani, Tetali 2010
    Compositional May-Must Program Analysis: Unleashing The Power of Alternation
  4. Godefroid, Elkarablieh, Levin 2009
    Precise Pointer Reasoning for Dynamic Test Generation
  5. Godefroid, Levin, Molnar 2008
    Automated Whitebox Fuzz Testing
  6. Godefroid, Klarlund, Sen 2005
    DART: Directed Automated Random Testing
  7. Sasnauskas, Link, Alizai, Wehrle 2008
    Bug Hunting in Sensor Network Applications

Tuesday, 8 November 2011

Is LLVM the beginning of the end for GNU (as we know it)?

GNU and Richard Stallman was a real catalyst for the open source movement and it's crown jewel; the Linux kernel. Not only did Mr Torvalds early Linux releases have near 100% GNU "user-land", he also decided to use release it under the GNU Public License; GPL. GNU and Stallman is forever linked with the birth and popularization of open source, and innovated both technically and legally by turning copyright laws on it's head with the copy-left licenses. The Free Software Foundation, the custodians of the GPL, is a constant source of spicy statements about the state of the software industry.

As the years have progressed, the "GNU percentage" of the popular Linux distributions has dwindled, quite naturally if you ask me. GNU focuses on core *ix command line and developer tools. The most important project is the GNU Compiler Collection (or GCC), used by pretty much everybody in the Linux community (and other platforms) for compiling C/C++ code. Since GCC (with binutils) can generate code for pretty much any CPU, the term "Linux community" in this context includes smartphones (Android, Linaro, Meego etc), and countless other embedded systems.

When Apple transformed NeXT (mach micro-kernel and BSD user-land) into OSX, the whole tools suite was based on GCC and it's debugger GDB (targetting PowerPC). They made significant contributions to GDB to make it sing in the XCode IDE.

Even though there is a plethora of GNU projects, it's GCC (and perhaps Emacs) that kept GNU relevant moving into the 21st century.

However, there's a new kid on the open source compiler block, and it's called LLVM and Clang. LLVM is real breath of fresh air in the compiler space. It features a beautiful modular architecture making it ideal for research and innovation. It already outperforms GCC in both compiler speed, and more importantly, quality of the generated code. The GCC code base has always been known to be almost impossible to work on, and LLVM has really called that bluff. LLVM/Clang is moving ahead in lightning speed, already supporting C/C++/Obj-C, features a debugger and having it's own c++ std library.

Apple is the main benefactor of the LLVM project, and they have already replaced GCC as the default compiler in XCode4. GDB to being swapped over pretty soon as well. With Apple "gone", how long will it take for the Linux community to switch over? The technical inertia should be small, since the Clang boys has gone for drop-in GCC replacement capability. If the technical benefits are there, I think the switch should be pretty rapid.

GNU without GCC is less relevant, the "GNU percentage" will drop even further. I believe we are witnessing the beginning of the end for GNU, at least in the form we know it. The free software foundation still has a role to play, as opinion makers, lobbyists and evangelists of the GPL.

Saturday, 5 November 2011

Javascript and the future of code in browsers

Here's an observation; Javascript (the lingua franca for code in browsers) is transforming from a programming language to a specification for code generators, an IL if you will.

Popular tools like CoffeScript, and more experimental ones like ClojureScript and Google Dart are just some examples. That's great right? Well, yes kind of. Developers gets better tools and can be more productive / write more correct programs. However, this can be seen as another form of obfuscation because the output is anything but readable. The output is certainly not small, leading to slower websites. The famous Dart "hello world" example being some 17000 lines of javascript.

Javascript is not well suited as an IL. We can do much better. What if the had a proper managed runtime in common for all browsers? Think JVM or CLR. Something actually designed to be an IL, and not a crippled language taken by surprise once again. This would mean that, with the right compiler, you could use any language you wanted for your web development. Heck, you could use any combination of languages you wanted. I really like the sound of that!

One obvious objection to this idea is that we don't want binary blobs in our HTML pages, it should be readable. But how readable is obfuscated Javascript anyway, especially if it has been code generated? The statement that IL isn't readable doesn't really hold true either, check out what you can do with reflector or java decompilers. That stuff is pretty readable if you ask me;

So how would the HTML with this code look like? Well, the obvious way is to put the link in there;
Or we could use the byte codes directly;

Another example is to put the blob in there;
With good APIs, we could merge stuff like WebGL into this thing. And given that it's a proper IL, we could make the runtime run really fast, using JIT compilation etc. This would eliminate the need for "inventions" like Google NaCl, which is a bad fit for the web if you ask me. Also, any Java or Silverlight plugins are not needed anymore.

I can't shake the feeling that Google missed a great opportunity when they announced Dart. They were thinking to small. Dart should have been an proper IL instead of a new language and a new language specific runtime.

Thursday, 3 November 2011

Why F# needs Mono (and really should be a JVM language)

When people think about .NET development, they think of C#. Sure there are other languages (VB, ASP.NET etc) but .NET and C# are very tightly linked (just drop an .NET assembly in reflector for technical proof). If you're writing a new Windows application (and it's not a high performant game), chances are you are reading WPF books right now.

One of the promises of .NET when it was released was "the great language independent" runtime, making all these languages interoperate in joyful blizz. Technically this still holds, but in practice it's all about C#.

F# is still considered a new .NET language, but the fact is that it's been around since 2007. Let me tell you, it's an absolute gem of a language, and it's a fully supported in Visual Studio 2010. F# is however not .NET the same way C# is. Drawing from it's OCaml roots, it doesn't feels like typical Microsoft technology (it's just too darn good :) It seems more at home in the open source world (the F# compiler is in fact open source).

In the recent Microsoft 2011 build conference, where Windows8, WinRT, future of Silverlight, new Visual Studio etc was announced. F# wasn't mentioned in the headlining presentations. How much mind-share does F# have internally at Microsoft?

The approach taken by the F# team is to educate the C# crowd, and win them over on technical excellence. It's a valid strategy, but it's going to be a very long process. Force feeding a new language and a new programming paradigm to the OO crowd is a very hard sell. What needs to happen in the .NET community for F# to gain ground is pull rather than push. A couple of high profile Microsoft apps showing off F# to the world (just see what Rails did for Ruby) and get the C#-ers' attention.

.NET being Microsoft technology, it's Windows only. Microsoft have opened the runtime standard, but it's a mine field of patents, and Microsoft has an army of lawyers. If you're a non-windows user your only hope is Mono, a open-source .NET runtime and C# compiler with a limp. In my experience, it's not ready for running production code. Some of Mono's core technologies, like it's garbage collector is not up to scratch. I can still easily get the runtime to crash using the experimental "generational gc". Some really core .NET libraries is missing, like WPF, and it doesn't support tail call elimination.

F#'s creator Dom Syme is flirting heavily with Mono, and he understands the importance of getting the universities and hobbyists on board. There is no way they are going to fork out for a Windows and Visual Studio license. F# is included in the standard Mono release, but it's still missing from the MonoDevelop IDE. Tomas Petricek has made a F# plugin for MonoDevelop, but not yet officially included in MonoDevelop, and doesn't work for version 2.8. A big part of the potential F# community is in the open source world, so F# needs Mono. Too bad Mono isn't good enough.

F# is perceived more of a "server room" language, leaving the UI code for C#. Technically, this is false, but that is a bias that will be near impossible to break. Server rooms are the domain of *UX, thus once again F# needs Mono.

So, F# have a tough battle of survival against C# on Windows and a weak story in the open source world. F# sorely needs a good open runtime, it deserves it. Mono's main contributing company Xamarin seems to have it's focus elsewhere (iPhone apps written in C#) instead of fixing the basics. It's worth noting that there are other promising projects on the horizon however, vmkit might come along and save the day? Unfortunately it's going to be a long wait.

I've thought many times how awesome F# would be if it targeted the JVM. Scala would be completely redundant and F# would have a much bigger and more eager user base. There is a F#-shaped hole in the JVM language space, and given it's open source heritage, I am sure it would be big hit. The server-room developers are now starting to use Scala and Clojure, just imagine what they could do with active patterns and asynchronous workflows! The JVM world needs solid ML language, there have been some valiant attempts, but none of them have taken off.

Mr Syme; just think how much easier this would all be if you were in the JVM camp? :-)

Ps. Here's a workshop presentation by Richard Minerich showing F# in Mono and MonoDevelop.

Update: After talking to some people in the F# / Mono community, my lacklustre attitude of Mono has slightly changed. The TCO problems is fixed, which I can verify. And even if we all agree that the Mono + F# tooling situation is bad, the runtime seems to in a better shape than I feared, and are indeed used in production. I stand by my absolute conviction that F#'s fate is tightly linked to Mono's, and thus deserves the full attention from the F# community.

Wednesday, 2 November 2011

Scheming in F#

Given the fact that I worship at the SICP altar, it should come as no surprise that I follow the recipe outlined in chapter 4 of said book; implementing a Scheme interpreter in every language I am trying to learn. Over the years it has turned out to be a very useful exercise, since the problem is just "big enough" for to force me to drill into what the language have to offer.

I'll post the source of the interpreters on github in this and future posts and highlight some of my findings in more detail in the posts. I am not going to write and explain too much about the languages themselves, there are plenty of books and tutorials for that purpose, just highlights :)

F# is part of the ML family and largely compatible with OCaml. It's one of the new hybrid functional / OO languages (like Scala, Clojure etc) that the kids are raving about these days. This means it can expose and interact with .NET libraries and objets code seamlessly. It also have a whole host of other functionality like active patterns, asynchronous workflows and (soon) type providers that I will get back to in future posts.

Let's start with discriminated unions which is a very powerful way of concisely describing (in this case) the syntax of the language;
This is means what it reads, "a scheme expression is either empty or a list of expressions or a list of listtypes or ...". It really can't be any clearer than that now can it?
ML's pattern matching is just fantastic when writing parsers, well any code really. This is how the evaluator of Expressions look like;
This is the famous recursive eval/appy loop as described in SICP. Of all the languages I've written this in, nothing is as concise and readable as ML. The main match statement is basically a switch, but there are a few subtleties here. For instance the Combination matches have a nested pattern separating the empty ([]) case and list (head::tail) case.
The interactive Read-Eval-Print-Loop (REPL) also deserves a shout out, this is using the F# pipe operator that passes the result of a function to the (last) parameter of another. So that try/catch is basically saying;
  1. read a line from the console
  2. convert it to a list
  3. parse that list
  4. take the head (first element) of the list (this is the expression)
  5. evaluate it
  6. call itself recursively
Please note that the new environment map is passed as parameter in the loop, meaning it can be immutable!
All the code can be found here, there are few bugs (see failing tests) that I might fix later, or maybe you are up for it! :)

Symbolic Execution

A while back I worked with a colleague (Philippe Gabriel) on a research project looking into automating defect finding and improving over-all test coverage. The main defect of concern at that particular time was null pointer dereferences, which could cause system wide crashes. We looked at many different strategies and tools (both free and commercial). What really spiked my interest was a field of research called "Symbolic Execution". Here's the elevator pitch; what if you had a tool that automatically found "nasty bugs" by analysing your source code, with very little or no false positives, and produced the input stimuli to provoke that bug?

Dawson Engler at Stanford along with Patrice Godefroid at Microsoft Research have made great contributions to this field of research. Their detailed papers are a thrilling read.

This article contains some background information about what symbolic execution is, and in future posts I'll explain how this can be applied in practice (using frameworks such as LLVM and klee).

Static Code Analysis has made a lot of progress in recent years, and can now be applied to large real-world code bases and generate valid results. One of the key metrics is the ratio between actual defects to the false positives. Optimizing this ratio is a very hard problem to solve in includes a lot of heuristics, optimization and tweaking. One of the biggest benefits of Static Code Analysis is that it is done by analyzing the source code itself. There is no requirement to build, link and run the code in question, analyzing “snippets” of code perfectly valid.

Unlike a Static Code Analysis tools, a symbolic executor executes the code under test in a “sandbox” or virtual machine. The main difference with normal execution is that some input are treated as symbolic (allowed to be “anything”), and for the code dependent on symbolic input the symbolic executor builds up boolean expressions (path constraints) used to force the code down previously unexplored paths.

Symbolic Execution
With Symbolic Execution, the program is "executed" in an abstracted manner. Let’s describe this process, with the following code fragment:
Say we want to insure that func always returns a valid (non-null) pointer.

Control Flow Graph
The first step in symbolic execution is to generate a Control Flow Graph or CFG. A CFG is an abstracted representation of the code in the form of a directed graph. Each node is a "basic block" terminated in a conditional (here an if statement). Each edge is a boolean "truth value" for the condition.

Once the code is expressed in this way it is easier to see the paths of execution. Note that CFGs are not specific to Symbolic Execution; any compiler generates a CFG when building the code, as this representation lends itself well to optimization techniques.

Analysing the CFG with symbolic values
With Symbolic execution we are interested in finding feasible paths which result in "bad things" happening. In this case we are interested in finding path(s) that lead to returning p without being allocated. We are interested to understand how the value of p changes, so taking the same CFG we state the value of p at each node. Or more precisely, we state the symbolic value of p, we do not actually care about which specific memory location p points to. Instead we only care whether p is null or non-null.

Note that we also draw implicit code path, as they lead to different symbolic value of p. Now with the CFG rewritten like this, it is trivial to see that there is indeed a path that leads to returning p without prior initialization. This simple example illustrates in practice 2 basic techniques;
  • Transformation to CFG
  • Reasoning on symbolic values

Scalability and Accuracy
Anyone designing a "symbolic tool" is faced with many hard problems. At a very high level, the problems can be put in one of these 2 categories;
  • Scalability / speed of analysis, because any non trivial function has a large number of paths.
    Ideally we want the tool to finish before the end of the universe.
  • Accuracy, because not all these paths are actually feasible and any defect reported along an infeasible path is a false positive
If you look a much simpler "lint" tool, it's usability is greatly reduced by the cheer amount of "garbage" it produces. An ideal tool will only report "real" errors, and as you will see further on, test vectors that triggers the error.

Scalability and Inter-Procedural analysis
The scalability problem is compounded by the fact that for any non trivial analysis, we cannot limit the analysis to the paths contained within a single function. For instance, in the previous example, it doesn't matter that func returns a null pointer if any code that calls func actually checks for this. For a meaningful analysis we need to follow the path until the pointer is actually used. In order to check that, we need to analyse paths going in and out of the function (inter-procedural analysis). The upside of doing inter-procedural analysis is that it increases the accuracy. The downside is that it exponentially increases the number of paths to analyse. Taking into account inter-procedural analysis, a real life a CFG is more likely to look like:

A symbolic execution engine that would naively enumerate paths exhaustively through this CFG would never complete. Knowing how to prune the CFG and identifying the small number of "interesting" paths is a very lively domain of academic research and a large bibliography exists on the topic. Heuristics to limit the path explosion problem come from either graph theory or are based on inferring more information from the context of execution.

An added complication that comes from having to analyze long path for actually finding anything interesting is that the "Path Conditions" along the paths are non satisfiable. Let's examine this fact in greater detail, as this leads us to introducing another big idea that underlines symbolic execution.

Path Condition
Let's revisit our previous example and explain what path conditions are. We now rewrite the CFG with the truth value for all the conditions that we encountered along given paths. The path condition is the boolean equation that synthesises the truth values for conditional expressions we encountered at each node.

Here for the path of interest (leading to a null p), this expression is: x*y==0 && y!=0 Hence we can now associate a set of Boolean equations with each edges along a given path. In the example above, we can now state, x*y==0 && y!=0 implies that p==0

Inter-Procedural analysis and Path Condition
Now that we know what path conditions are, let's see how we use these to solve the accuracy problem. We need to find out whether the path that is feasible in the narrow context of this function is actually valid in the larger context of a program, (i.e. when you perform interprocedural analysis). For instance, we might find that on all the paths containing call to this function, prior to the call there is a ASSERT statement that checks: x!=0;
The path condition has become: x*y!=0 && y!=0 && x!=0
It is clear to see that this Boolean equation cannot be solved, or according to the lingo there is no assignment of Boolean terms that satisfy this equation. This is otherwise known as the Boolean Satisfiability Problem.

Boolean SAT applied to Symbolic Execution
Now, there's good news and bad news...
The bad news is that this is an NP Complete problem, or in other words a fiendishly complex problem to solve, with not upper limit to it's running complexity. This is not really apparent in this simple example, but any non-trivial path will be summarized by a Path Condition with 100s of terms, which means that simply iterating through possible assignment for Boolean variables would never complete.
The good news is that there are solutions for it. Most notably, solving Boolean equation with large number of terms is something that the VLSI industry has been doing for years and have developed fairly efficient software along the way, aka SAT solvers. In fact there is a surprisingly large number of people constantly refining algorithms for their SAT solvers to solve ever larger boolean equation ever faster who race them every year.
As a result all modern symbolic execution engines can now incorporate them. A boolean SAT solver would tell you in no time that x*y!=0 && y!=0 && x!=0 is indeed unsatisfiable and that there is no point reporting such a defect which is clearly a false positive.