Hierarchical Ordering of Sequential Processes
by Prof.dr. Edsger W. Dijkstra
The processing unit of a working computer performs in a short period of time a sequence of millions of instructions and as far as the processing unit is concerned this sequence is extremely monotonous: it just performs instructions one after the other. And if we dare to interpret the output, if we dare to regard the whole happening as "meaningful", we do so because we have mentally grouped sequences of instructions in such a way that we can distinguish a structure in the whole happening. Similar considerations apply to the store: high speed stores contain typically millions of bits stored in a monotonous sequence of consecutively numbered but otherwise equivalent storage locations. And again, if we dare to attach a meaning to such a vast amount of bits, we can only do so by grouping them in such a way that we can distinguish some sort of structure in the vast amount of information. In both cases the structure is our invention and not an inherent property of the equipment: with respect to the structure mentioned the equipment itself is absolutely neutral. It might even be argued that this "neutrality" is vital for its flexibility. On the other hand, it then follows that it is the programmer's obligation to structure "what is happening where" in a useful way. It is with this obligation that we shall concern ourselves. And it is in view of this obligation that we intend to start with a rather machine-bound, historical introduction: this gives us the unordered environment in which we have to create order, to invent structure adequate for our purposes.
In the very old days, machines were strictly sequential, they were controlled by what was called "a program" but could be called very adequately "a sequential program". Characteristic for such machines is that when the same program is executed twice —with the same input data, if any— both times the same sequence of actions will be evoked. In particular: transport of information to or from peripherals was performed as a program-controlled activity of the central processor.
With the advent of higher electronic speeds the discrepancy in speed between the central processor on the one hand and the peripheral devices on the other became more pronounced. As a result there came, for instance, a strong economic pressure to arrange matters in such a way that two or more peripherals could be running simultaneously.
In the old arrangement one could write a program reading information from a paper tape, say at a maximum speed of 50 char/sec. In that case the progress through that piece of program would be synchronized with the actual movement of the paper tape through the reader. Similarly one could write a program punching a paper tape, say at a maximum speed of 30 char/sec. To have both peripherals running simultaneously and both close to their maximum speed would require a tricky piece of program specifically designed for this mix of activities. This was clearly too unattractive and other technical solutions have been found. Channels were invented; a channel is a piece of hardware dedicated to the task of regulating the information traffic between the store and the peripheral to which it is attached, and doing this synchronized to the natural speed of the peripheral device, thus doing away with the implicit mutual synchronization of the peripheral devices that would be caused if both were controlled by the same sequential program execution.
The introduction of channels created two problems, a microscopic and a macroscopic one. The microscopic problem has to do with access to the store. In the old arrangement only the central processor required access to the store and when the central processor required access to the store it could get it. In the new arrangement, with the channels added —channels that can be regarded as "special purpose processors"— a number of processors can be competing with each other as regards access to the store, because such accesses from different processors very often exclude each other in time (for technical or logical reasons). This microscopic problem has been solved by the invention of the "switch", granting the competing processors access to the store according to some priority rule. Usually the channels have a lower traffic density and a higher priority than the central processor: the processor works at full speed until a channel requests access to the store, an arrangement which is called "cycle stealing". We draw attention to the fact that the unit of information in which this interleaving takes place —usually "a word"— is somewhat arbitrary; in a few moments we shall encounter a similar arbitrariness.
The macroscopic problem has to do with the coordination of central processor activity and channel activity. The central processor issues a command to a channel and from that moment onwards two activities are going on simultaneously and —macroscopically speaking— independently of each other: the central processor goes on computing and the channel transports information. How does the central processor discover when the execution of the channel command has been completed? The answer to this has been the "interrupt". Upon completion of a channel command the channel sets an interrupt flip-flop; at the earliest convenient moment (but never sooner than after completion of the current instruction), the central processor interrupts the execution of the current program (in such a neat way that the interrupted computation can be resumed at a later moment as if nothing had happened) and starts executing an interrupt program instead, under control of which all now appropriate actions will be taken. From the point of view of the central processor it interleaves the various program executions, the unit of interleaving being —similarly arbitrarily— "the instruction".
The above scheme can be recognized in all larger, modern computers that I ever studied. It has been embellished in many directions but we don't need to consider those embellishments now. We go immediately to the next questions: given a piece of equipment constructed along the lines just sketched, what are the problems when we try to use it and in what direction should we look for their solution?
What are the problems? Well, the main point is that from the point of view of program control such a piece of equipment must be regarded as a non-deterministic machine. Measured in a grain of time appropriate for the description of the activity of the central processing unit —clock pulse or instruction execution time— the time taken by a peripheral transport must be regarded as undefined. If completion of such a peripheral transport is signalled to the central processor by means of an interrupt, this means that we must regard the moment when the interrupt will take place (or more precisely: the point of progress where the computation will be interrupted) as unpredictable. The problem is that in spite of this indeterminacy of the basic hardware, we must make a more or less deterministic automaton out of this equipment: from the outside world the machine will be confronted with a well-defined computational task and it has to produce a well-defined result in a microscopically unpredictable way!
Let me give a simple example to explain what I mean by "a more or less deterministic automaton". Suppose that offering a program to the machine consists of loading a pack of cards into a card reader (and pushing some button on the reader in order to signal that it has been loaded). Suppose now that we have a machine with two readers and that we want to load it with two programs, A and B, and that we can do this by loading both card readers and pressing both buttons. We assume that the two card readers are not mutually synchronized, i.e. we regard both speeds as unpredictable. To what extent will the total configuration be a deterministic automaton? It will be fully deterministic in the sense that eventually it will produce both output A and output B. If these outputs are to be produced by the same printer, they will be produced in some order and the system may be such that the order in which the respective outputs appear on the printer does depend on the relative speeds of the two readers. As far as the operator is concerned, who has to take the output from the printer and to dispatch it to the customers, the installation is non-deterministic: what he has to do depends on the unpredictable speed ratio of the two readers, which may cause output A to precede or to follow output B. For both cases the operator has his instructions such that in both cases all output is dispatched to the proper customer. The "computation centre" —i.e. installation and operator together— is deterministic. We can regard the operator's activity as an outer layer, "wrapping up the installation", shielding from the outside world a level of interior indeterminacy.
Now, even if the operator is aware of not having a fully deterministic machine, we should recognize that he has only to deal with two cases —output A before output B or the other way round— while the number of possible sequences of occurrences at cycle time level is quite fantastic. In other words, by far the major part of the "shielding of indeterminacy" is done by the installation itself. We call the resulting installation "more or less deterministic" because, as the case may be, a few limited degrees of freedom —here one boolean degree of freedom— may be left unpredictable.
We have called the operator's activity "an outer layer", shielding a level of indeterminacy, and of course we did so on purpose. At the other end we may distinguish an inner layer, viz. in the channel signalling (via an interrupt signal) that the next card has been read: it tells the central processor that the next card image is available in core, regardless of which storage cycles have been stolen to get it there. The terms "inner layer" and "outer layer" have been chosen in order to suggest that in the total organization we shall be able to distinguish many layers in between. But an important remark is immediately appropriate: I assume that with the card read command an area in core has been designated to receive this card image: the remark that the interrupt signalled the completed transfer of the card image irrespective of which cycles had been stolen to transport its constituents is only true provided that no other access to the designated core area took place in the period of time ranging from the moment the command was given up to the moment that the completion was signalled! Obvious but vital.
It draws our attention to an element of structure that must be displayed by the remaining programs if we wish to make the total organization insensitive to the exact identity of the cycles stolen by the channel. And from the above it is clear that this insensitivity must be one of our dearest goals. At the next levels (of software) we shall have to invent similar elements of structure, making the total organization insensitive (or "as insensitive as possible") to the exact moment when interrupts are honoured. Again it is clear that this must be one of our dearest goals. And on a next level we must make our organization insensitive (or "as insensitive as possible") to the exact number of cards put into the readers for programs A and B, and so on. This "layered insensitivity" is, in two words, our grand plan.
I have used the term "layer" on purpose, because it has seemed to provide an attractive terminology in terms of which to talk about operating systems and their total task. We can regard an operating system as the basic software that "rebuilds" a given piece of hardware into a (hopefully) more attractive machine. An operating system can then be regarded as a sequence of layers, built on top of each other and each of them implementing a given "improvement". Before going on, let me digress for a moment and let me try to explain why I consider such an approach of ordered layers a fruitful one.
There is an alternative approach, which I would like to call the approach via unordered modules. There one makes a long list of all the functions of the operating system to be performed, for each function a module is programmed and finally all these modules are glued together in the fervent hope that they will cooperate correctly and will not interfere disastrously with each other's activity. It is such an approach which has given rise to the assumed law of nature that complexity grows as the square of the number of program components, i.e. of the number of "functions".
In the layered approach we start at the bottom with a given hardware machine A0, we add our bottom layer of software rebuilding A0 into the slightly more attractive machine A1, for which the next layer of software is programmed rebuilding it into the still more attractive machine A2, etc. As the machines in the sequence A0, A1, A2, ... get more and more attractive, adding a further layer gets easier and easier. This is in sharp contrast to the approach via unordered modules, where adding new functions seems to get progressively worse!
So much in favour of a layered approach in general. When one wishes to design an operating system, however, one is immediately faced with the burning question, which "improvement" is the most suitable candidate to be implemented in the bottom layer.
For the purpose of this discussion I will choose a very modest bottom layer. I do so for two reasons. Firstly, it is a choice with which for historical reasons I myself am most familiar. Secondly, as a bottom layer it is very modest and neutral, so neutral in fact that it provides us with a mental platform from where we can discuss various alternatives for the structure of what is going to be built on top of it. As a bottom layer it seems close to the choice of minimal commitment. The fact that this bottom layer is chosen as a starting point for our discussion is by no means to be interpreted as the suggestion that this is the best possible choice: on the contrary, one of the later purposes of this discussion is the consideration of alternatives.
With the hardware taking care of the cycle stealing we felt that the software's first responsibility was to take care of the interrupts, or, to put it a little more strongly, to do away with the interrupt, to abstract from its existence. (Besides all rational arguments this decision was also inspired by fear, based on the earlier experience that, due to the irreproducibility of the interrupt moments, a program bug could present itself misleadingly like an incidental machine malfunctioning.) What does it mean "to do away with the interrupt"? Well, without interrupt the central processor continues the execution of the current sequential process, while it is the function of the interrupt to make the central processor available for the continuation of another sequential process. We would not need interrupt signals if each sequential process had its own dedicated processor. And here the function of the bottom layer emerged: to create a virtual machine, able to execute a number of sequential programs in parallel as if each sequential program had its own private processor. The bottom layer has to abstract from the existence of the interrupt or, what amounts to the same thing, it has to abstract from the identity of the single hardware processor. If this abstraction is carried out rigorously it implies that everything built on top of this bottom layer will be equally applicable to a multiprocessor installation, provided that all processors are logically equivalent (i.e. have the same access to main memory etc.). The remaining part of the operating system and user programs together then emerges as a set of harmoniously cooperating sequential processes.
The fact that these sequential processes out of the family have to cooperate harmoniously implies that they must have the means of doing so; in particular, they must be able to communicate with each other and they must be able to synchronize their activities with respect to each other. For reasons which, in retrospect, are not very convincing, we have separated these two obligations. The argument was that we wished to keep the bottom layer as modest as possible, giving it only the duty of processor allocation; in particular it would leave the "neutral, monotonous memory" as it stood, it would not rebuild that part of the machine, and immediately above the bottom layer the processes could communicate with each other via the still available, commonly accessible memory.
The mutual synchronization, however, is a point of concern. Closely related to this is the question: given the bottom layer, what will be known about the speed ratios with which the different sequential processes progress? Again we have made the most modest assumption we could think of, viz. that they would proceed with speed ratios unknown but for the fact that the speed ratios would differ from zero. I.e. each process (when logically allowed to proceed, see below) is guaranteed to proceed with some unknown, but finite speed. In actual fact we can say more about the way in which the bottom layer grants processor time to the various candidates: it does it "fairly" in the sense that in the long run a number of identical processes will proceed at the same macroscopic speed. But we don't tell how "long" this run is, and the said fairness has hardly a logical function.
This assumption about the relative speeds is a very "thin" one, but as such it has great advantages. From the point of view of the bottom layer, we remark that it is easy to implement: to prevent a running program from monopolizing the processor, an interrupting clock is all that is necessary. From the point of view of the structure built on top of it, it is also extremely attractive: the absence of any knowledge about speed ratios forces the designer to code all synchronization measures explicitly. When he has done so he has made a system that is very robust in more than one sense.
Firstly he has made a system that will continue to operate correctly when an actual change in speed ratios is caused, and this may happen in a variety of ways. The actual strategy for processor allocation as implemented by the bottom layer, may be changed. In a multiprocessor installation the number of active processors may change. A peripheral may temporarily work with speed zero, e.g. when it requires operator attention. In our case the original line printer was actually replaced by a faster model. But under all those changes the system will continue to operate correctly (although perhaps not optimally, but that is quite another matter).
Secondly —and we shall return to this in greater detail— the system is robust thanks to the relative simplicity of the arguments that can convince us of its proper operation. Nothing being guaranteed about speed ratios means that in our understanding of the structure built on top of the bottom layer we have to rely on discrete reasoning and there will be no place for analog arguments, for other purposes than overall justification of chosen strategies. I trust that the strength of this remark will become apparent as we proceed.
Let us now focus our attention upon the synchronization. Here a key problem is the so-called "mutual exclusion problem". Given a number of cyclic processes of the form
cycle begin enter;
critical section;
exit;
remainder of cycle
end
program "enter" and "exit" in such a way that at any moment at most one of the processes is engaged in its critical section. The solution must satisfy the following requirements:
a) The solution must be symmetrical between the processes; as a result we are not allowed to introduce a static priority.
b) Nothing may be assumed about the ratio of the finite speeds of the processes; we may not even assume their speeds to be constant in time.
c) If any of the processes is stopped somewhere in "remainder of cycle", this is not allowed to lead to potential blocking of any of the others.
d) If more than one process is about to enter its critical section, it must be impossible to devise for them such finite speeds, that the decision to determine which of them will enter its critical section first is postponed until eternity. In other words, constructions in which "After you"-"After you"-blocking, although improbable, is still possible, are not to be regarded as valid solutions.
I called the mutual exclusion problem "a key problem". We have met something similar in the situation of programs A and B producing their output in one of the two possible orders via the same printer: obviously those two printing processes have to exclude each other mutually in time. But this is a mutual exclusion on a rather macroscopic scale and in all probability it is not acceptable that the decision to grant the printer to either one of the two activities will be taken on account of the requirement of mutual exclusion alone: in all probability considerations of efficiency or of smoothness of service require a more sophisticated printer granting strategy. The explanation why mutual exclusion must be regarded as a key problem must be found at the microscopic end of the scale. The switch granting access to the store on a word basis provides a built-in mutual exclusion, but only on a small, fixed and rather arbitrary scale. The same applies to the single processor installation which can honour interrupts in between single instructions: this is a rather arbitrary grain of activity. The problem arises when more complicated operations on common data have to take place. Suppose that we want to count the number of times something has happened in a family of parallel processes. Each time such an occurrence has taken place, the program could try to count it via
n:= n+1
If in actual fact such a statement is coded by three instructions
R:= n;
R:= R+1;
n:= R
then one of the increases may get lost when two such sequences are executed, interleaved on a single-instruction basis. The desire to compound such (and more complicated) operators on common variables is equivalent to the desire to have more explicit control over the degree of interleaving than provided by the neutral, standard hardware. This more explicit control is provided by a solution to the mutual exclusion problem.
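By way of illustration, the following small C program (no part of the original argument; the thread library and the constant STEPS are merely convenient choices) lets two threads perform exactly such unprotected increments concurrently; on virtually every run some of the increments are lost, as described above.

/* Two threads increment a common counter without any mutual exclusion;     */
/* the load/increment/store sequences interleave and updates get lost.      */
#include <pthread.h>
#include <stdio.h>

#define STEPS 1000000

static volatile long n = 0;        /* "volatile" only keeps the separate     */
                                   /* load, add and store visible            */
static void *count(void *arg)
{
    (void)arg;
    for (long i = 0; i < STEPS; i++)
        n = n + 1;                 /* in effect: R := n; R := R + 1; n := R  */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, count, NULL);
    pthread_create(&b, NULL, count, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("n = %ld, expected %d\n", n, 2 * STEPS);  /* almost always less   */
    return 0;
}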
We still have to solve it. Our solution depends critically on the communication facilities available between the individual processes and the common store. We can assume that the only mutual exclusion provided by the hardware is to exclude a write instruction or a read instruction, writing or reading a single word. Under that assumption the problem has been solved for two processes by T.J. Dekker in the early sixties. It has been solved by me for N processes in 1965 (C.A.C.M., 1965, Vol. 8, nr. 9, p. 569). The solution for two processes was complicated, the solution for N processes was terribly complicated. (The program pieces for "enter" and "exit" are quite small, but they are by far the most difficult pieces of program I ever made. The solution is only of historical interest.)
It has been suggested that the problem could be solved when the individual processes had at their disposal an indivisible "add to store" which would leave the value thus created in one of the private process registers as well, so that this value is available for inspection if so desired. Indicating this indivisible operation with braces, the suggested form of the parallel programs was:
cycle begin while {x:= x+1} ≠ 1 do {x:= x-1};
critical section;
{x:= x-1};
remainder of cycle
end
where the "add to store" operation is performed on the common variable "x", which is initialized with the value zero before the parallel programs are started.
As far as a single process is concerned, the cumulative Δx as effected by this process since its start is = 0 or = 1; in particular, when a process is in its critical section, its cumulative Δx = 1. As a result we conclude that at any moment when N processes are in their critical sections simultaneously, x ≥ N will hold.
A necessary and sufficient condition for a process to enter its critical section is that this process effectuates for x the transition from 0 to 1. As long as at least one process is engaged in its critical section (N ≥ 1), x ≥ 1 will hold. This excludes the possibility of the transition from 0 to 1 taking place and therefore no other process can enter its critical section. We conclude that mutual exclusion is indeed guaranteed. Yet the solution must be rejected: it is not difficult to see that even with two processes (after at least one successful execution of a critical section) "After you"-"After you" blocking may occur (with the value of x oscillating between 1 and 2).
A correct solution exists when we assume the existence of an indivisible operation "swap" which causes a common variable (x) and a private variable (loc) to exchange their values. With initially x = 0 the structure of the parallel programs is:
begin integer loc; loc:= 1;
cycle begin repeat swap (x, loc)
until loc = 0;
critical section;
swap (x, loc);
remainder of cycle
end
end
The invariant relation is that of the N+1 variables (i.e. the N loc's and the single x) always exactly one will be = 0, the others being = 1. A process is in its critical section if and only if its own loc = 0; as a result at most one process can be engaged in its critical section. When none of the processes is in its critical section, x = 0 and "After you"-"After you" blocking is impossible. So this is a correct solution.
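In present-day hardware the indivisible "swap" survives as an atomic exchange instruction, and the program above is then recognizable as what is nowadays called a spin lock. The following C fragment (a sketch, with illustrative names only) expresses the same protocol: x = 0 means that the critical section is free, and exchanging a 1 into x plays the role of swap (x, loc).

/* The swap-based protocol with an atomic exchange; x is the common variable. */
#include <stdatomic.h>

static atomic_int x = 0;           /* initially 0: critical section is free   */

void enter(void)
{
    /* repeat swap (x, loc) until loc = 0, with loc permanently equal to 1    */
    while (atomic_exchange(&x, 1) != 0)
        ;                          /* busy form of waiting, discussed below    */
}

void exit_section(void)
{
    atomic_exchange(&x, 0);        /* swap (x, loc) with loc = 0: release      */
}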
In a multiprogramming environment, however, the correct solutions referred to or shown have a great drawback: the program section called "enter" contains a loop in which the process will cycle when it cannot enter its critical section. This so-called "busy form of waiting" is expensive in terms of processing power, because in a multiprogramming environment (with more parallel processes than processing units) there is a fair chance that there will be a more productive way of spending processing power than giving it to a process that, to all intents and purposes, could go to sleep for the time being.
If we want to do away with the busy form of waiting we need some sort of synchronizing primitives by means of which we can indicate those program points where —depending on the circumstances— a process may be put to sleep. Similarly we must be able to indicate that potential sleepers may have to be woken up. What form of primitives?
Suppose that process 1 is in its critical section and that process 2 will be the next one to enter it. Now there are two possible cases.
a) process 1 will have done "exit" before process 2 has tried to "enter"; in that case no sleeping occurs;
b) process 2 tries to "enter" before process 1 has done "exit"; in that case process 2 has to go to sleep temporarily until it is woken up as a side-effect of the "exit" done by process 1.
When both occurrences have taken place, i.e. when process 2 has successfully entered its critical section, it is no longer material whether we had case a) or case b). In that sense we are looking for primitives (for "enter" and "exit") that are commutative. What are the simplest commutative operations on common variables that we can think of? The simplest operation is inversion of a common boolean, but that is too simple for our purpose: then we have only one operation at our disposal and lack the possibility of distinguishing between "enter" and "exit". The next simplest commutative operations are addition to (and subtraction from) a common integer. Furthermore we observe that "enter" and "exit" have to compensate each other: if only one process passes its critical section, the common state before its "enter" equals the common state after its "exit" as far as the mutual exclusion is concerned. The simplest set of operations we can think of are increasing and decreasing a common variable by 1 and we introduce the special synchronizing primitives
P(s) :  s := s - 1
and
V(s) :  s := s + 1
special in the sense that they are "indivisible" operations: if a number of P- and V-operations on the same common variable are performed "simultaneously" the net effect of them is as if the increases and decreases are done "in some order".
Now we are very close to a solution: we have still to decide how we wish to characterize that a process may go to sleep. We can do this by making the P- and V-operations operate not on just a common variable, but on a special purpose integer variable, a so-called semaphore, whose value is by definition non-negative, i.e. s ≥ 0.
With that restriction, the V-operation can always be performed: unsynchronized execution of the P-operation, however, could violate it.
We therefore postulate that whenever a process initiates a P-operation on a semaphore whose current value equals zero, the process in question will go to sleep until (another) process has performed a V-operation on that very same semaphore. A little more precisely: if a semaphore value equals zero, one or more processes may be blocked by it, eager to perform a P-operation on it. If a V-operation is performed on a semaphore blocking a number of processes, one of them is woken up, i.e. will perform its now admissible P-operation and proceed. The choice of this latter process is such that no process will be blocked indefinitely long. A way to implement this is to decide that no two processes will initiate the blocking P-operation simultaneously and that they will be treated on the basis "first come, first served" (but it need not be done that way, see below).
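The following C sketch shows one conceivable realization of these blocking P- and V-operations on top of a mutex and a condition variable; it is meant only to make the above postulates concrete (the type name semaphore and the field names are illustrative), and it does not implement the "first come, first served" refinement mentioned above.

/* A blocking semaphore: value >= 0 always holds; P may put the caller to    */
/* sleep, V wakes at most one sleeper.                                        */
#include <pthread.h>

typedef struct {
    int             value;                 /* the semaphore value, >= 0       */
    pthread_mutex_t lock;                  /* makes P and V indivisible       */
    pthread_cond_t  nonzero;               /* where blocked processes sleep   */
} semaphore;                               /* initialize with the desired     */
                                           /* value and the PTHREAD_*_INITIALIZERs */
void P(semaphore *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)                  /* value zero: go to sleep         */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->value -= 1;                         /* the now admissible P-operation  */
    pthread_mutex_unlock(&s->lock);
}

void V(semaphore *s)
{
    pthread_mutex_lock(&s->lock);
    s->value += 1;
    pthread_cond_signal(&s->nonzero);      /* wake one sleeper, if any        */
    pthread_mutex_unlock(&s->lock);
}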
With the aid of these two primitives the mutual exclusion problem is solved very easily. We introduce a semaphore "mutex", say, with the initial value
mutex = 1,
after which the parallel processes controlled by the program
cycle begin P(mutex);
critical section;
V(mutex);
remainder of cycle
end
are started.
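For completeness, here is the same scheme written out as a runnable C sketch with POSIX semaphores, where sem_wait plays the role of P and sem_post the role of V; the shared counter inside merely checks that at most one process is ever inside its critical section (all names are, again, only illustrative).

/* Four parallel processes protected by the semaphore "mutex", initially 1.  */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t mutex;
static int   inside = 0;       /* number of processes in a critical section   */

static void *process(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);                  /* P(mutex)                        */
        inside++;                          /* critical section                */
        if (inside != 1)
            printf("mutual exclusion violated\n");
        inside--;
        sem_post(&mutex);                  /* V(mutex)                        */
        /* remainder of cycle */
    }
    return NULL;
}

int main(void)
{
    pthread_t p[4];
    sem_init(&mutex, 0, 1);                /* initial value mutex = 1         */
    for (int i = 0; i < 4; i++) pthread_create(&p[i], NULL, process, NULL);
    for (int i = 0; i < 4; i++) pthread_join(p[i], NULL);
    return 0;
}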
Before proceeding with the discussion I would like to insert a remark. In languages specifically designed for process control I have met two other primitives, called "wait" and "cause", operating on an "event variable", which is a (possibly empty) queue of waiting processes. Whenever a process executes a "wait" it attaches itself to the queue until the next "cause" for the same event, which empties the queue and signals to all processes in the queue that they should proceed. Experience has shown that such primitives are very hard to use. The reason for this is quite simple: a "wait" in one process and a "cause" in another are non-commutative operations, their net effect depends on the order in which they take place, and at the level where we need the synchronizing primitives we must assume that we do not yet have effective control over this ordering. The limited usefulness of such "wait" and "cause" primitives could have been deduced a priori.
As a next interlude I am going to prove the correctness of our solution. One may ask "Why bother about such a proof, for the solution is obviously correct". Well, in due time we shall have to prove the correctness of the implementation of more sophisticated rules of synchronization and the proof structure of this simple case may then act as a source of inspiration.
With each process "j" we introduce a state variable "Cj", characterizing the progress of the process.
Cj = 0 :  process j is in the "remainder of cycle";
Cj = 1 :  process j is in its "critical section".
When process j performs (i.e. completes) the operation P(mutex), the transition Cj = 0 → Cj = 1 takes place; when it performs the operation V(mutex), the transition Cj = 1 → Cj = 0 takes place. (Note that the Cj are not variables occurring in the program, they are more like functions defined on the current values of the order counters.) In terms of the Cj the number of processes engaged in their critical sections equals
(Σj : 1 ≤ j ≤ N : Cj)
In order to prove that this number will be at most = 1, we follow the life history of the quantity
K = mutex + (Σj : 1 ≤ j ≤ N : Cj)
The quantity K will remain constant as long as its constituents are constant: the only operations changing its constituents are the 2N mutually exclusive primitive actions P(mutex)i and V(mutex)i (for 1 ≤ i ≤ N) .
We have as a result of
P(mutex)i :  ΔK = Δmutex + Δ(Σj : 1 ≤ j ≤ N : Cj) = Δmutex + ΔCi = -1 + 1 = 0
and similarly, as a result of
V(mutex)i :  ΔK = Δmutex + ΔCi = +1 - 1 = 0
As these 2N operations are the only ones affecting K's constituents, we conclude that K is constant, in particular, that it is constantly equal to its initial value,
K = 1 + (Σj : 1 ≤ j ≤ N : 0) = 1
As a result
(Σj : 1 ≤ j ≤ N : Cj) = 1 - mutex.
Because mutex is a semaphore, we have