(Advanced) how to distribute ACL2 book building over a cluster of machines.
Warning: getting a cluster set up and running smoothly is a significant undertaking. Aside from hardware costs, it may take significant energy to install and administer the system, and you will need to learn how to effectively use the queuing system. You'll probably also need to be ready to do some scripting to work around dumb problems. Think of this topic as: some hints that may help you, not a usable guide to setting up a cluster.
At Centaur,
For one,
These comments allow you to say, e.g., how much memory a job is going to take, so that if a job takes more than its allotted memory, the clustering software may choose to kill it. The clustering software also uses this memory limit to ensure that when it allocates a job to a machine, the machine will have enough physical machine to run the job.
This is really very useful. If you let a machine start swapping into the gigabytes, at worst you will need to physically reset it, because it dies a special kind of horrible death where its load average is 50 and you can't even "kill" anything. In a slightly better case, you may run into the Linux overcommit and OOM killer features, which are also really awful. My favorite article on the topic, from back before we had the cluster and were running into this frequently, is here.
At any rate, when cert.pl writes out the scripts to certify books, it
includes some PBS commands that say how much memory the book is expected to
take. This is done by a stupid heuristic: we search for
Besides this support for PBS directives,
$STARTJOB -c "acl2 < certify-commands &> foo.cert.out"
So, given a suitable
A suitable
The build system scans books for lines containing the pattern
; cert-env: (varname1=value1, varname2, ...)
Each of these varname/value pairs are set in the environment before calling STARTJOB to build the book; these can therefore be set in such a way as to give information to the queuing system about how the book needs to be built. If a varname is included without a corresponding value, the value defaults to 1.
The build system also scan for some particular patterns to help define how much memory and time the book should need to be allocated. The following kinds of set-max-mem forms are recognized and used to generate the environment variable CERT_MAX_MEM:
(set-max-mem (expt 2 k)) (set-max-mem (* n (expt 2 30))) ;; N gigabytes (set-max-mem (* (expt 2 30) n))
Additionally, the following pattern is scanned to set the environment variable CERT_MAX_TIME. Note this isn't a real ACL2 event, so it should occur in a comment:
; (set-max-time N)
We originally found that our builds would often "fail" due to the following scenario:
To avoid this,