There may be considerable inefficiency associated with achieving all parallelism through calls to the PLAPACK global level-3 BLAS. Sources of inefficiency include:
We now show how the left-looking level-3 BLAS based algorithm given above can be implemented at a lower level, essentially in-lining the call to PLA_Syrk in the implementation given in Figure 8.4.
- The number of message sent may increase, since there may be an opportunity to combine messages that are performed within separate global BLAS calls.
- Each call to a global BLAS routine will require creation of temporary work buffers (duplicated objects) to hold duplicated data and contributions to results.
Consider the code in Figure 8.5. At the top of the loop, we changed the partitioning of the matrix so that is guaranteed to exist on one node. We do so by determining the size to be the minimum of the algorithmic blocking size, the split size at the top and left of the current matrix . Next, the update of
is explicitly exposed by the following additions to the code:
Before entering the loop we create objects to hold duplicated versions of and local contributions to . By taking views into these objects, we can reuse these objects during the execution of the loop. When comparing the two left-looking level-3 BLAS based implementations, we notice that the second implementation is clearly more complex, but that this complexity is still manageable.
- a call to PLA_Copy to spread the contents of within columns,
- calls to compute local contributions to and , and
- an explicit reduce of the local contributions to and .
It will not always be the case that this more complex implementation will outperform the first. Notice that the implementation that uses only the global BLAS does not limit the algorithmic blocking size by the distribution blocking size. Thus, it may be possible to achieve better load balance by taking the distribution blocking size to be much smaller than the algorithmic blocking size. But then, we could further rewrite chol_left_blas3_alt to decouple the algorithmic and distribution blockings sizes explicitly within the code.