Over the last few years, we have written a number of papers concerning the efficient implementation of collective communication operations on parallel architectures []. As part of that research, we have noticed that, given efficient implementations of scatter, gather, collect, and distributed reduction, one can build efficient implementations of broadcast, reduce-to-one, and reduce-to-all by making the following observations:
- Broadcast: equivalent to a scatter followed by a collect.
- Reduce-to-one: equivalent to a distributed reduction followed by a gather.
- Reduce-to-all: equivalent to a distributed reduction followed by a collect.

Indeed, given optimal implementations of scatter, gather, collect, and distributed reduction, implementing the other operations as described can be shown to be asymptotically (for long vectors of data) within a factor of two of optimal, or even optimal.
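
The three compositions listed above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: each "node" is modeled as an entry in a Python list, no real message passing takes place, the reduction operator is assumed to be elementwise summation, and all function names are hypothetical rather than taken from any particular library. It shows only that the compositions produce the right results, not that they are efficient.

```python
# Sequential simulation of p nodes; index i in a list stands for node i.
# All names are illustrative, and the reduction operator is assumed to be a sum.

def scatter(x, p):
    """Split the root's vector x into p contiguous pieces; node i gets piece i."""
    n = len(x)
    bounds = [i * n // p for i in range(p + 1)]
    return [x[bounds[i]:bounds[i + 1]] for i in range(p)]

def gather(pieces):
    """Concatenate the per-node pieces into one vector held by a single node."""
    return [value for piece in pieces for value in piece]

def collect(pieces):
    """Like gather, but every node ends up with the full concatenated vector."""
    full = gather(pieces)
    return [list(full) for _ in pieces]

def distributed_reduce(vectors):
    """Sum the nodes' vectors elementwise; node i keeps only piece i of the result."""
    total = [sum(column) for column in zip(*vectors)]
    return scatter(total, len(vectors))

def broadcast(x, p):
    """Broadcast = scatter followed by collect."""
    return collect(scatter(x, p))

def reduce_to_one(vectors):
    """Reduce-to-one = distributed reduction followed by gather."""
    return gather(distributed_reduce(vectors))

def reduce_to_all(vectors):
    """Reduce-to-all = distributed reduction followed by collect."""
    return collect(distributed_reduce(vectors))
```

For example, `broadcast([1, 2, 3, 4, 5, 6, 7], 3)` scatters the vector into three pieces and then collects them, so that each of the three simulated nodes holds the complete vector.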