In order to take advantage of the higher performance of sequential level-3 BLAS, we rewrite the above algorithm in terms of matrix-matrix operations. Partition
where and are sub-matrices. Now,
This in turn yields the equations
We thus conclude that the following steps will implement the Cholesky factorization, overwriting the lower triangular portion of A with L :
Notice that to compute , a level-2 implementation of the Cholesky factorization can be used. The bulk of the computation is in the operation , which now uses a matrix-matrix operation, thereby achieving better performance.
- , where Cholesky factor overwrites the lower triangular part of .
- compute the Cholesky factorization of updated recursively, yielding .
The PLAPACK code for this level-3 BLAS based implementation is given in Figure 8.2. Notice that the block size b must be determined. This block size should equal the width of that makes the the symmetric rank-k update: most efficient. The call
PLA_Environ_nb_alg( PLA_OP_SYM_PAN_PAN, template, &nb_alg );queries the infrastructure for the optimal block size for symmetric rank-k (panel-panel) update. A second optimization, which could also have been used for the level-2 BLAS right-looking variant, recognizes that computation naturally flows to the right and down: notice that the columns are completed from left to right, and similarly the order in which rows no longer participate is from top to bottom. The calls
PLA_Temp_get_comm_dir( template, PLA_DIR_TEMP_ROW, &old_dir_row );query the current directions of the flow of computation, sets them as appropriate for the Cholesky factorization, and resets them after the Cholesky factorization has completed.
PLA_Temp_get_comm_dir( template, PLA_DIR_TEMP_COL, &old_dir_col );
PLA_Temp_set_comm_dir( template, PLA_DIR_TEMP_ROW, PLA_DIR_RIGHT );
PLA_Temp_set_comm_dir( template, PLA_DIR_TEMP_COL, PLA_DIR_DOWN );
PLA_Temp_set_comm_dir( template, PLA_DIR_TEMP_ROW, old_dir_row );
PLA_Temp_set_comm_dir( template, PLA_DIR_TEMP_COL, old_dir_col );