Lecture 6: Lambda Calculus
For the last few lectures, we've been studying the semantics of IMP,
a fairly simple imperative language we created because it roughly emulated what we think a "normal"
language looks like—it has arithmetic, conditionals and loops baked in.
We claimed (but did not prove) that IMP is Turing complete,
which gave us some more confidence that it was a "real" language.
But it wasn't the most convenient language to study;
it led to relatively complex semantics
like the pairs of small- and big-step rules for if
and while
or the confusing "backwards" assignment rule for Hoare logic.
Was all of this complexity really necessary? As "PL people" we're often in search of minimalism, and so we should ask if there is a simpler way to get "real" computation. One potential answer is the lambda calculus, which comes to us from Alonzo Church in the 1930s, via John McCarthy in the 1950s and Peter Landin in the 1960s (McCarthy is most well known for Lisp, and Landin for his paper The next 700 programming languages). As the name suggests, it's a core calculus for computation. It's dead simple—just three syntactic forms and three small-step rules or two big-step rules. And yet despite this simplicity, the lambda calculus is Turing complete!
In this lecture we'll study the untyped (or "pure") lambda calculus. We have two motivations here. First, the lambda calculus distills the essence of programming, and in particular of functional programming, which we've seen a lot of this semester through Coq and Racket. Second, the untyped lambda calculus prepares us (as the name suggests) to study typed lambda calculi, which will be our vehicle for studying type systems for the remainder of this semester.
Untyped lambda calculus
Just like any programming language we define in this class, we need to define two things about the lambda calculus: its syntax and its semantics.
The syntax is simple enough: a term t
in the lambda calculus is one of just three forms:
t := x
| λx. t
| t t
Here, x
is a variable, which for our purposes we can just consider to be a string
(like the var
constructor we've had in our expression languages before).
λx. t
is an anonymous function that takes as input one argument and, when called,
replaces x
in t
with that argument.
We sometimes call these anonymous functions (lambda) abstractions.
(Aside: now you know why languages like Python, Java, and Racket call anonymous functions "lambdas",
lest you thought this class was too abstract).
Abstractions always take exactly one argument,
though you can easily build 0-argument or 2+-argument functions from there.
Finally, t t
is a function call,
where the first term is an abstraction and the second is the argument.
We'll mostly call these applications instead of calls.
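Since we've been writing Racket this semester, it's worth noticing that these three forms map directly onto a small subset of Racket. Here's a minimal sketch (the name id is ours, purely for illustration):

#lang racket
;; The three forms of the lambda calculus, transcribed into Racket.
;; Racket's `lambda` is exactly the abstraction form λx. t.
(define id (lambda (x) x))   ; the abstraction λx. x
(id (lambda (y) y))          ; the application (λx. x) (λy. y), which evaluates to λy. y
;; A bare variable like x only makes sense inside an abstraction body,
;; matching the lambda calculus view of variables as mere symbols.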
Small-step semantics and subtleties
The syntax of the untyped lambda calculus is simple,
and the description of it is almost enough to just write down the small-step semantics directly.
We'll define a small-step semantics of the shape t → t', considering each of the three syntactic forms in turn.
The variable constructor x
can't step at all. Again, think of variables as really just being symbols, rather than having a value assigned to them.
The abstraction constructor λx. t
also can't step. We just say it's a function or abstraction.
We say that these two constructors that can't step are values: the final results of evaluating a term in the lambda calculus.
These are the terms for which evaluation is "done".
Thinking back to the small-step semantics for IMP, the only value was skip, as all other terms could take a step.
The only way to step a lambda calculus term is via application.
Let's think about how we want this to work.
Given a term t1 t2, if t1 is an abstraction λx. t, we step it by "doing the function call". Roughly, we want the call to return t, but after replacing all occurrences of x in t with t2.
You might remember from Lecture 5 that we called this idea "substitution" and had a syntax for it: t[x := t2] is t with every occurrence of x replaced by t2. With that syntax, the rule for stepping an application is:

(λx. t) t2 → t[x := t2]     (β-reduction)
Unfortunately, this β-reduction rule is deceptively complex. It's hiding two important subtleties that we'll need to think about in order to finish defining our semantics:
- The rule is non-deterministic—does the order in which we apply it to terms matter?
- Substitution is tricky in the presence of abstractions, and our notion from previous lectures doesn't quite work.
Let's tackle each of these in turn.
Evaluation strategies
The β-reduction rule is non-deterministic.
To see why, let's look at a term:

(λx. x) ((λy. y) (λz. z))

Here are two different ways we could reduce this term by repeated application of the β-reduction rule. First, reducing the inner application before the outer one:

(λx. x) ((λy. y) (λz. z))
→ (λx. x) (λz. z)
→ λz. z

And second, reducing the outermost application first:

(λx. x) ((λy. y) (λz. z))
→ (λy. y) (λz. z)
→ λz. z
These two strategies are different—we call them call by value and call by name, respectively. Call by value only applies the β-reduction rule once the right-hand side has reduced to a value (in other words, once it cannot step any further). This means that we have to reduce the inner terms before we can reduce the outermost application. In contrast, call by name applies β-reduction as soon as possible, from the outside in. For this example, that means reducing the outermost application first.
In both strategies, we stop reducing as soon as the outermost term is a value (an abstraction or variable).
We do not allow reductions inside top-level abstractions, so, for example, a term like λx. ((λy. y) x) is a value and does not step, even though its body contains a β-reducible application.
Notice that both cases result in the same final term,
even though the evaluation order was different.
This isn't a coincidence.
The untyped lambda calculus enjoys a property called confluence
(sometimes also called the Church–Rosser theorem),
which says that the order of applications of β-reduction does not affect the final result.
More formally, if t →* t1 and t →* t2, then there exists some t' such that t1 →* t' and t2 →* t'.
Most programming languages use a call-by-value strategy for function application (for many reasons, one being that it's generally simpler to implement). But there are a few exceptions. Haskell uses a variant of call-by-name, which is what gives Haskell its laziness property: arguments to functions are only evaluated if they are actually needed during function evaluation, rather than being eagerly evaluated.
We'll stick to call-by-value for our lambda calculus, though,
which means that our actual small-step rules will look like this, where v ranges over values:

t1 → t1'
────────────────
t1 t2 → t1' t2

t2 → t2'
────────────────
v1 t2 → v1 t2'

(λx. t) v2 → t[x := v2]     (β-reduction)
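To make these rules concrete, here's a sketch of a single-step call-by-value reducer in Racket. The term representation is our own invention (a symbol is a variable, `(lam ,x ,t) is λx. t, and `(app ,t1 ,t2) is an application), and the subst helper already stops at abstractions that rebind the same variable, a subtlety we discuss in the next section:

#lang racket
;; One call-by-value step, following the three rules above.
;; Terms (a hypothetical encoding of our own):
;;   symbol          — a variable x
;;   `(lam ,x ,t)    — an abstraction λx. t
;;   `(app ,t1 ,t2)  — an application t1 t2
(define (value? t)
  (match t [`(lam ,_ ,_) #t] [(? symbol?) #t] [_ #f]))  ; abstractions and variables

;; t[x := s], stopping at abstractions that rebind x (see next section)
(define (subst t x s)
  (match t
    [(? symbol? y) (if (eq? y x) s y)]
    [`(lam ,y ,body) (if (eq? y x) t `(lam ,y ,(subst body x s)))]
    [`(app ,t1 ,t2) `(app ,(subst t1 x s) ,(subst t2 x s))]))

(define (step t)
  (match t
    ;; β-reduction: (λx. t) v → t[x := v]
    [`(app (lam ,x ,body) ,(? value? v)) (subst body x v)]
    ;; once the left side is a value, step the right side
    [`(app ,(? value? v) ,t2) `(app ,v ,(step t2))]
    ;; otherwise, step the left side first
    [`(app ,t1 ,t2) `(app ,(step t1) ,t2)]
    [_ (error "no step applies:" t)]))

(step '(app (lam x x) (lam y y)))  ; => '(lam y y)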
Substitution and scope
The second problem with our semantics is the substitution part of the β-reduction rule.
Informally, we said that t[x := t'] means "replace every occurrence of x in t with t'". To see why that's not quite right, consider this term:

(λx. λx. x) (λz. z)

Our substitution says this steps to λx. (λz. z). But the inner λx should "shadow" the outer one—the x in the body belongs to the inner abstraction, so the right answer is λx. x. What we are tripping over here is a notion of the scope of a variable.
You've probably thought about scoping in programming before—we often talk about which variables are "in scope".
To be more formal,
we say that an occurrence of a variable x is bound if it appears inside the body of an abstraction λx. t (we call that λx its binder), and free otherwise.
Now we have the words for what went wrong in our substitution: in the body of the inner abstraction, x is bound by the inner binder, so substitution should have left it alone—substitution should only replace free occurrences. We can repair the definition by having substitution stop when it reaches an abstraction that rebinds the same variable: (λx. t)[x := t'] = λx. t.
Capture-avoiding substitution
This new definition almost works, and it will be the definition we'll use for the rest of this class.
However, there's still one problem with it. Consider this term:

(λx. λy. x) y

β-reduction substitutes y for x inside λy. x, producing λy. y. But that's wrong: the y we substituted in was a free variable, and it has now been "captured" by the λy binder, turning a constant function into the identity function.
What we need here is a notion of capture-avoiding substitution.
In essence, we want to be able to somehow rename binders (in this case, the inner λy) so that they cannot capture the free variables of the term we're substituting in.
Capture-avoiding substitution is easy to write down on paper as a slight tweak to our latest substitution definition.
The idea is that in the case (λy. t)[x := t'], if y appears free in t', we first rename the binder y (and all of its bound occurrences in t) to a fresh variable that appears nowhere else, and only then substitute. Making "fresh" precise, and threading it through definitions and proofs, turns out to be a surprising amount of work.
Instead of dealing with that mess,
in this class we'll work around the problem a different way.
Variable capture is only an issue when reducing what we call open terms,
which are terms that contain free variables;
a term that contains no free variables is a closed term.
In the example above, the y being substituted was a free variable, so the term we were reducing was open. If we only ever reduce closed terms, capture can never arise: β-reduction on a closed term only ever substitutes closed subterms, which have no free variables to capture.
Restricting to closed terms has another advantage: it simplifies our notion of "values" to be only abstractions, rather than abstractions and variables.
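Here's what "free" and "closed" look like in code, continuing the hypothetical Racket term encoding from the stepper sketch above:

#lang racket
;; Free variables of a term; a term is closed when it has none.
(define (free-vars t)
  (match t
    [(? symbol? x) (set x)]
    [`(lam ,x ,body) (set-remove (free-vars body) x)]  ; the binder removes x
    [`(app ,t1 ,t2) (set-union (free-vars t1) (free-vars t2))]))

(define (closed? t) (set-empty? (free-vars t)))

(closed? '(lam x x))           ; => #t — λx. x is closed
(closed? '(lam x (app x y)))   ; => #f — y occurs free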
Programming in the untyped lambda calculus
The lambda calculus is Turing complete, a fact originally proved by Turing himself in the 1930s. That's pretty surprising—all we have are functions and function applications! Rather than try to prove this ourselves, which would involve the somewhat tedious act of building a Turing machine in the lambda calculus, let's handwave it a bit by adding some "real" programming language features to the lambda calculus. We'll add three things: booleans and conditionals; natural numbers; and recursion. Together these should get us close enough to (at least informally) declare the lambda calculus a real language.
Before we start, I want to point to a super useful tool called λab for interactively playing with terms in the untyped lambda calculus. It's way nicer than manipulating these terms on paper, which quickly gets out of hand.
Booleans and conditionals
First, we'll define values true
and false
as terms:
true := λt. λf. t
false := λt. λf. f
In other words, true
is effectively a two-argument function that returns its first argument,
and false
a two-argument function that returns its second argument.
Here, we're just giving names to otherwise-anonymous lambda terms.
We sometimes call terms like true
and false
combinators;
technically, any closed abstraction term is a combinator,
but named terms like these are "building blocks" for larger terms and that's usually what we mean by "combinator".
The obvious question at this point is: why these definitions? To start to see it, let's implement some boolean logic:
not := λb. b false true
Let's see this definition in action; here is the step-by-step reduction under the call-by-value strategy:

not true
= (λb. b false true) (λt. λf. t)
→ (λt. λf. t) false true
→ (λf. false) true
→ false

As an exercise, try writing the other basic boolean operators—the and
and or
.
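These encodings also run directly in Racket. A small sketch follows; we spell the names tru, fls, and not_ (our own choices) because true, false, and not are already taken in Racket:

#lang racket
(define tru (lambda (t) (lambda (f) t)))   ; true := λt. λf. t
(define fls (lambda (t) (lambda (f) f)))   ; false := λt. λf. f
(define not_ (lambda (b) ((b fls) tru)))   ; not := λb. b false true

(eq? (not_ tru) fls)  ; => #t — not true evaluates to the fls closure
(eq? (not_ fls) tru)  ; => #t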
Now let's try our hand at conditionals:
if := λc. λt. λe. c t e
Here, we kind of see how the pieces will fit together:
c
will be a boolean, a two-argument function that returns its first argument if true or its second if false.
Then we pass in the then and else terms to c.
If it's true it returns the then term, otherwise it returns the else term.
For example, where v and w are values, if true v w reduces to v, while if false v w reduces to w.
This definition is almost right, but it does have one problem:
it's not short-circuiting,
and under the call-by-value strategy,
it will evaluate both sides of the if
expression before returning the correct one
(see this example on λab).
That isn't going to matter much for our purposes,
but it's worth pondering a little.
Under call-by-name semantics,
that wouldn't happen
(see this example).
None of this is a violation of the confluence principle we discussed earlier:
both evaluation strategies end up in the same final result,
they just take different paths to get there.
The problem is that in most languages
the path we take actually matters,
perhaps because of side effects of evaluating a term,
and so we really want the short-circuiting effect.
But lambda calculus doesn't have side effects!
In other words, most other languages don't enjoy confluence, because the evaluation strategy changes which side effects happen.
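We can watch this non-short-circuiting behaviour directly in Racket, which is call-by-value: both branches are evaluated before the Church-encoded if picks one. (A sketch with our own names, as before.)

#lang racket
(define tru (lambda (t) (lambda (f) t)))
(define fls (lambda (t) (lambda (f) f)))
(define if_ (lambda (c) (lambda (t) (lambda (e) ((c t) e)))))  ; if := λc. λt. λe. c t e

;; Racket evaluates both arguments before if_ ever runs, so both
;; messages print even though only 'then is returned.
(((if_ tru)
  (begin (displayln "evaluating then branch") 'then))
 (begin (displayln "evaluating else branch") 'else))
;; prints both lines, then evaluates to 'then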
Natural numbers
We can also add the natural numbers to the lambda calculus through the Church encoding. We've actually already seen something very similar to this when we defined the natural numbers as an inductive set in Lecture 1. We define the natural numbers like this:
0 := λf. λx. x
1 := λf. λx. f x
2 := λf. λx. f (f x)
3 := λf. λx. f (f (f x))
...
One way to explain this definition is to say that a number n is an abstraction that takes as input two arguments, a function f and a base term x, and applies f to x n times. In some sense, a number is a term that applies a function that many times to some base term.
A more useful way to see this is to go back to how we defined the natural numbers earlier, by defining a successor function:
succ := λn. λf. λx. f (n f x)
Let's see that in practice:

succ 1
= (λn. λf. λx. f (n f x)) (λf. λx. f x)
→ λf. λx. f ((λf. λx. f x) f x)

The result reads as "apply f to x one times, and then apply f one more time", which is equivalent to two.
But if we're being precise,
we're actually stuck in the reduction here,
because call-by-value stops reducing once the head term is an abstraction.
This is fine—once we go to actually apply the result succ 1 to something, we'll be able to continue, and get the same result as if we had applied 2.
If we go beyond call-by-value and allow ourselves to reduce inside the abstraction,
we can see where we end up without needing more arguments:

λf. λx. f ((λf. λx. f x) f x)
→ λf. λx. f ((λx. f x) x)
→ λf. λx. f (f x)

which is exactly the definition of 2.
Finally, addition is just applying succ
multiple times:
add := λm. λn. n succ m
We can also define all the other usual arithmetic stuff, like mul, pred, and isZero.
This is a very tedious encoding, but it is a correct encoding. In fact, there's a proof in Section 11.2 of Formal Reasoning About Programs that this encoding is equivalent to the inductive version we've been using in Coq.
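As a sanity check, the numeral encoding transliterates straight into Racket. The church->nat helper below is our own: it reads a numeral back by applying add1 to 0 that many times:

#lang racket
(define zero (lambda (f) (lambda (x) x)))                           ; 0 := λf. λx. x
(define succ (lambda (n) (lambda (f) (lambda (x) (f ((n f) x))严)))) ; succ := λn. λf. λx. f (n f x)
(define add (lambda (m) (lambda (n) ((n succ) m))))                 ; add := λm. λn. n succ m

;; read a Church numeral back as an ordinary number
(define (church->nat n) ((n add1) 0))

(church->nat (succ (succ zero)))                      ; => 2
(church->nat ((add (succ zero)) (succ (succ zero))))  ; => 3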
Recursion and loops
Finally, we'd expect a Turing-complete language to have some notion of (potentially infinite) recursion. Recursive functions are a bit tricky in the lambda calculus—we saw in both Racket and Coq that recursive functions needed to be able to refer to their own name, but there is no notion of names inside the lambda calculus. Instead, to define recursion we use combinators (actually, the "macros" or "names" we've been using so far are just combinators too).
First, can we write a program in the lambda calculus that loops forever? Here's the omega combinator:
omega := (λx. x x) (λx. x x)
Watch what happens when we β-reduce it once:

(λx. x x) (λx. x x)
→ (λx. x x) (λx. x x)

It steps to itself! omega reduces forever without ever making progress—an infinite loop.
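The same term is legal Racket, and it behaves the same way—evaluating it never terminates, which is why it's commented out here:

; ((lambda (x) (x x)) (lambda (x) (x x)))  ; loops forever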
The omega combinator's not very useful, though. What we really want is a way to write a recursive abstraction that refers to itself. We get that by means of a fixed-point combinator. There are multiple ways to write such a combinator, but one popular one is called the (call-by-value) Y-combinator. (Aside: now you know why that orange website is called what it is). The version for call-by-name evaluation is simpler, but the ugly call-by-value version looks like this:
Y := λF. (λf. (λx. f (λv. x x v)) (λx. f (λv. x x v))) F
This is intricate and not really possible to understand just by looking at its definition.
Let's attack it with a couple of examples.
First, why is Y called a "fixed-point" combinator? Because of the way it evaluates:

Y F
→ (λx. F (λv. x x v)) (λx. F (λv. x x v))
→ F (λv. (λx. F (λv. x x v)) (λx. F (λv. x x v)) v)
≈ F (Y F)

There are two bits of handwaving here: first, we're abbreviating the repeated term back to Y F on the last line, and second we're handwaving how function application works since we don't yet have a v to apply onto. The point is that this is computing a fixed point of F: Y F reduces to F applied to (something equivalent to) Y F.
Now, how do we use a fixed-point combinator like Y?
The idea is that we can now write a function like this one, for the factorial function:
g := λfact'. λn. if (isZero n) 1 (mul n (fact' (pred n)))
factorial := λn. Y g n
What we've managed to do is write a recursive function that refers to itself, fact', by hiding the recursive reference behind an abstraction g and then invoking the Y combinator on g.
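This one is fun to run. Here's the same call-by-value Y combinator transcribed into Racket, with the factorial example; to keep it readable, this sketch uses Racket's built-in numbers and if inside g rather than the Church encodings:

#lang racket
;; Y := λF. (λf. (λx. f (λv. x x v)) (λx. f (λv. x x v))) F
(define Y
  (lambda (F)
    ((lambda (f)
       ((lambda (x) (f (lambda (v) ((x x) v))))
        (lambda (x) (f (lambda (v) ((x x) v))))))
     F)))

;; g := λfact'. λn. if (isZero n) 1 (mul n (fact' (pred n)))
(define g
  (lambda (fact*)
    (lambda (n)
      (if (zero? n) 1 (* n (fact* (- n 1)))))))

(define factorial (Y g))
(factorial 5)  ; => 120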
This combinator stuff hurts my head a whole bunch, and it's not super critical to understand exactly how it works (although it's fun to play around with!). The takeaway is that we can implement recursive functions in the pure lambda calculus. Once we add booleans and numbers to the mix, hopefully you're convinced that this looks a lot like any other programming language, just a bit more inconvenient.
Baking things in and getting stuck
Everything we've just done "works", in the sense that we can encode any Turing machine using these primitives. But the encodings are incredibly tedious to work with, and so in practice functional languages aren't implemented that way. Instead, the richer features are "baked into" the language as extensions of the syntax and semantics. Let's see a simple example by extending our lambda calculus with booleans.
As always, we'll need both a syntactic and semantic extension.
The syntax just adds true
, false
, and conditionals:
t := x
| λx. t
| t t
| true
| false
| if t then t else t
When doing β-reduction, we had a notion of "values" that said when we were done,
and said that an application can only β-reduce once its right-hand side is a value.
In this new extension, we'd like true
and false
to also be values.
The semantics is not too tricky, either.
It looks basically the same as it did for IMP,
by adding three rules:
t1 → t1'
──────────────────────────────────────────────
if t1 then t2 else t3 → if t1' then t2 else t3

if true then t2 else t3 → t2

if false then t2 else t3 → t3

The first rule steps the condition until it becomes either true or false, and then the second and third rules choose which side of the expression to run.
This all seems reasonable enough, but we've just accidentally created a huge problem in our language. Look at these two terms:
if (λx. x) then true else false
true (λx. x)
They can't step at all! None of the if
rules we just added apply to the first term,
and although the second term has the shape of an application,
the left-hand side isn't an abstraction,
so we can't β-reduce it further.
But neither of these terms are values.
They are stuck!
It was impossible to end up in this predicament with closed terms in our original lambda calculus,
so something has changed here.
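Here's what "stuck" looks like if we extend the earlier Racket stepper sketch with the new rules (the application clauses are elided since they're unchanged from before):

#lang racket
;; The stepper from earlier, extended with booleans and conditionals.
;; New values: the symbols 'true and 'false. New form: `(if ,c ,t ,e).
(define (value? t)
  (match t [`(lam ,_ ,_) #t] ['true #t] ['false #t] [_ #f]))

(define (step t)
  (match t
    [`(if true ,t2 ,t3) t2]                       ; choose the then branch
    [`(if false ,t2 ,t3) t3]                      ; choose the else branch
    [`(if ,t1 ,t2 ,t3) `(if ,(step t1) ,t2 ,t3)]  ; step the condition
    ;; application clauses from the earlier sketch go here
    [_ (error "stuck!" t)]))

(step '(if true (lam x x) (lam y y)))   ; => '(lam x x)
;; (step '(if (lam x x) true false))    ; => error: stuck!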
You might recognize problems like this one from real programming languages as a "type error". The idea is that some programs, although they are syntactically well-formed, are not actually valid programs.
Ruling out these invalid programs is the role of a type system, and in the next lecture, we'll see how to build a type system for this extended lambda calculus and prove that it prevents programs from getting stuck.