Major Section: PARALLELISM
In this topic we introduce the ACL2 parallelism primitives using the example of a doubly-recursive Fibonacci function, whose basic definition is as follows. See for a very high-level summary of the parallelism capability described here, and see parallelism-build for how to build an executable image that supports parallel evaluation. Here, we assume that such an executable is being used.
Serial Fibonacci
(defun fib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (+ (fib (- x 1)) (fib (- x 2))))))
Introducing Pargs
A simple way to introduce parallelism into this function definition is to
wrap the addition expression with a call of pargs
, and the arguments to
the addition will be computed in parallel whenever resources are available.
As a result, we end up with a very similar and thus intuitive function
definition. Note that we replaced +
by binary-+
, since
pargs
expects a function call, not a macro call.
(defun pfib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (pargs (binary-+ (pfib (- x 1)) (pfib (- x 2)))))))
Introducing the Granularity Problem
After you submit the above two versions of the Fibonacci function, test them with the following forms.
(time$ (fib 10)) (time$ (pfib 10))Now increase the argument by increments of 5 to 10 until you find your curiosity satisfied or your patience wearing thin. You can interrupt evaluation if necessary and return to the ACL2 loop. You will immediately notice that you have not increased execution speed, at least not by much, by introducing parallelism.
First, consider the computation of (pfib 4)
. Assuming resources are
available, (pfib 4)
will create a thread for computing (pfib 3)
and
another thread for (pfib 2)
. It is easy to imagine that setting up each
thread takes much longer than the entire computation of (fib 4)
.
Second, we must realize that if we have two threads available for computing
(fib 10)
, then the evaluation of (fib 8)
will probably complete
before the evaluation of (fib 9)
. Once (fib 8)
finishes, parallelism
resources will become available for the next recursive call made on behalf of
(fib 9)
. If for example that call is (fib 3)
, we will waste a lot of
cycles just handing work to the thread that does this relatively small
computation. We need a way to ensure that parallelism resources are only
used on problems of a "large" size. Ensuring that only "large" problems
are spawned is called the ``granularity problem.''
In summary: We want to tell ACL2 that it can evaluate the arguments of
pargs
in parallel only when the parameter of pfib
is greater
than some threshold. Our tests using OpenMCL have suggested that 27 is a
reasonable threshold.
Explicit Programming for the Granularity Problem
One way to avoid the granularity problem is to duplicate code as follows.
(defun pfib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (if (> x 27) ; the granularity test (pargs (binary-+ (pfib (- x 1)) (pfib (- x 2)))) (binary-+ (pfib (- x 1)) (pfib (- x 2)))))))Duplicating code is fundamentally a bad design principle, because it can double the work for future maintenance. A ``granularity form'' is an expression
(declare (granularity <form>))that can allow threads to be spawned (without duplicating code) whenever the evaluation of
<form>
results in a non-nil
value. It may be placed
inside any call of a parallelism primitive, in a position documentated
separately for each primitive. Here is a definition of pfib
using this
feature for a call of the parallelism primitive pargs
.
(defun pfib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (pargs (declare (granularity (> x 27))) (binary-+ (pfib (- x 1)) (pfib (- x 2)))))))
Test each version as follows (or substitute your own natural number for 33).
(time$ (fib 33)) (time$ (pfib 33))
Another Granularity Issue Related to Thread Limitations
Our implementation of parallelism primitives has the property that once a thread is assigned a computation, that assignment stays in effect until the computation is complete. In particular, if a thread encounters a parallelism primitive that spawns child threads, the parent thread stays assigned, waiting until the child computations complete before it can continue its own computation. In the meantime, the parent thread reduces the number of additional threads that Lisp can provide by 1, rather than being reassigned to do other work.
How can this lack of reassignment affect the user? Consider, for example,
the application of a recursive function to a long list. Imagine that this
function is written so that the body contains a recursive call, for example
as (pargs (process (car x)) (recur (cdr x)))
. Each such pargs
call
that spawns child work must wait for its children, one of which is the work
of evaluating (recur (cdr x))
, to complete. There is an ACL2 limit on
how much pending work can be in the system, limiting the number of
pargs
calls that can result in parallel evaluation. For example, if
the ACL2 limit is k and each call of pargs
actually spawns threads for
evaluating its arguments, then
after k cdr
s there will be no further parallel evaluation.
Possible solutions may include reworking of algorithms (for example to be tree-based rather than list-based) or using appropriate granularity forms. We hope that future implementations will allow thread ``re-deployment'' in order to mitigate this problem.
Introducing Plet
We can use a let
binding to compute the recursive calls of fib
and
then add the bound variables together, as follows.
(defun fib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (let ((a (fib (- x 1))) (b (fib (- x 2)))) (+ a b)))))
By using the parallelism primitive plet
, we can introduce parallelism
much as we did using pargs
, with an optional granularity form, as
follows.
(defun pfib (x) (declare (xargs :guard (natp x))) (cond ((or (zp x) (<= x 0)) 0) ((= x 1) 1) (t (plet (declare (granularity (> x 27))) ((a (pfib (- x 1))) (b (pfib (- x 2)))) (+ a b)))))Notice that this time, we were able to use
+
rather than being forced to
use binary-+
. Unlike pargs
, which expects a function call (not a
macro call), plet
allows macros at the top level of its body.
Introducing Pand
(By Way of a Tree Validation Example)
Consider ``genetic trees'' that contains leaves of DNA elements, in the sense
that each tip is one of the symbols A
, G
, C
, or T
. First we
define the function valid-tip
which recognizes whether a tip contains one
of these symbols.
(defun valid-tip (tip) (declare (xargs :guard t)) (or (eq tip 'A) (eq tip 'G) (eq tip 'C) (eq tip 'T)))
Now we define a function that traverses the tree, checking that every tip is valid.
(defun valid-tree-serial (tree) (declare (xargs :guard t)) (if (atom tree) (valid-tip tree) (and (valid-tree-serial (car tree)) (valid-tree-serial (cdr tree)))))
We also define a parallel version.
(defun valid-tree-parallel (tree) (declare (xargs :guard t)) (if (atom tree) (valid-tip tree) (pand (valid-tree-parallel (car tree)) (valid-tree-parallel (cdr tree)))))
Before we can time the functions, we need to create test trees. We have found that test trees need to be approximately 1 gigabyte before we clearly see speedup, and we make them asymmetric to demonstrate the ability of our implementation to adapt to asymmetric data. We can create the trees with the execution of the following forms.
(defun make-valid-binary-tree (x) (declare (xargs :mode :program)) (if (< x 0) (cons (cons 'C 'G) (cons 'A 'T)) (cons (make-valid-binary-tree (- x 2)) ; 2 to make asymmetric (make-valid-binary-tree (- x 1)))))(defconst *valid-tree* (make-valid-binary-tree 30)) ; takes awhile (defconst *invalid-tree* (cons *valid-tree* nil)) ; nil is invalid tip
We can time our functions with the forms:
(time$ (valid-tree-serial *valid-tree*)) (time$ (valid-tree-parallel *valid-tree*))Unfortunately, the serial version runs faster than the parallelism version; however, there is more to this story.
Demonstrating Early Termination with an Invalid Tree Now observe this magic:
(time$ (valid-tree-serial *invalid-tree*)) (time$ (valid-tree-parallel *invalid-tree*))The serial version took awhile, while the parallel version finished almost immediately. The test for validation was split into testing the
car
and the cdr
of the *invalid-tree*
root, and since the cdr
was
equal to nil
, its test returned immediately. This immediate return
quickly interrupted the computation associated with the car
, and returned
the result.
Granularity with Pand
We can also define a parallel version with a granularity form:
(defun valid-tree-parallel (tree depth) (declare (xargs :guard (natp depth))) (if (atom tree) (valid-tip tree) (pand (declare (granularity (< depth 5))) (valid-tree-parallel (car tree) (1+ depth)) (valid-tree-parallel (cdr tree) (1+ depth)))))
We can test this form by executing our previous forms. You will probably
find some speedup on a machine with several cores available, but more speedup
can likely be obtained with an expensive test on the leaves in place of
valid-tip
.
(time$ (valid-tree-serial *valid-tree*)) (time$ (valid-tree-parallel *valid-tree* 0))
Introducing Por
Por
can be used in the same way as pand
, but with early
termination occurring when an argument evaluates to a non-nil
value, in
which case the value returned is t
. Finally, por
also supports the
use of a granularity form.