Quarterly Status Report
Performance Modeling
An Environment For End-to-End Performance Design of
Large Scale Parallel Adaptive Computer/Communications Systems
for the period August 1, 1998 - October 31, 1998
Contract N66001-97-C-8533
CDRL A001
1.0 Purpose of Report
This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas at Austin team in support of Performance Modeling on Contract N66001-97-C-8533.
2.0 Project Members
University of Texas: 960 hours
Sub-contractor (Purdue): 405 hours
Sub-contractor (UT-El Paso): 134 hours
Sub-contractor (UCLA): 54 hours
Sub-contractor (Rice): 200 hours
Sub-contractor (Wisconsin): 474 hours
Sub-contractor (Los Alamos): 0 hours
3.0 Project Description (last modified 07/97)
3.1 Objective
The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical scaleable systems.
3.2 Approach
The project will combine innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize the goals. First, we will develop a specification language based on a general model of parallel computation, with specializations for representing workload, hardware, and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.
Second, we will experimentally and incrementally develop and validate scaleable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via the use of adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm—analytical, simulation, or the software or hardware system itself—that is most appropriate with respect to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.
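As a purely illustrative sketch of the multi-paradigm idea (the class and method names below are hypothetical and are not part of the POEMS specification language or any of its implementations), a component can expose one evaluation interface and delegate to whichever solution method the analyst selects:

    # Illustrative sketch only: a component evaluated either analytically or by
    # simulation. Names (Component, evaluate, paradigm) are hypothetical.

    class Component:
        """A model component that can be evaluated analytically or by simulation."""

        def __init__(self, analytic_fn, simulate_fn, paradigm="analytic"):
            self.analytic_fn = analytic_fn    # fast closed-form estimate
            self.simulate_fn = simulate_fn    # stand-in for a detailed simulation
            self.paradigm = paradigm          # chosen per component, per study

        def evaluate(self, workload):
            if self.paradigm == "analytic":
                return self.analytic_fn(workload)
            return self.simulate_fn(workload)

    # Example: a network link modeled two ways for the same workload description.
    link = Component(
        analytic_fn=lambda w: w["bytes"] / 100e6,         # simple bandwidth model
        simulate_fn=lambda w: w["bytes"] / 100e6 * 1.15,  # placeholder "simulator"
        paradigm="analytic",
    )
    print(link.evaluate({"bytes": 4096}))

The point of the sketch is only that the choice of evaluation paradigm is a per-component, per-study decision hidden behind a stable interface.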
Third, we will provide a library of models, at multiple levels of granularity, for modeling scaleable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.
Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.
4.0 Performance Against Plan
4.1 Spending
Spending has about caught up to plan. All of the subcontracts except for LANL are in place. Staffing at each participating institution has been completed. It has taken 12 months to complete the organizational effort, but we are now fully in motion. After this quarter, the spending rate for the project will run at about the planned rate.
4.2 Task Completion
A summary of the completion status of each task in the SOW is given below. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty. Assessments of task completion by the participating institutions are given in their individual progress reports.
Task 1 - 85% Complete - Methodology development is an iterative process. One develops a version of the methodology, applies it and revises the methodology according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with completion of Task 7 when validation of the methodology on the first end-to-end performance model has been completed.
Task 2 - Complete
Task 3 - 90% Complete - Specification languages for all three domains have been proposed and are in various states of completion.
Task 4 - 35% Complete - Adaptation of the dHPF compiler to generate task graphs is now well under way.
Task 5 - 85% Complete - The compiler for the specification language now generates an extended version of CODE as output. The extensions to the CODE translator necessary to generate executable code from the POEMS specification language are under way.
Task 6 - 60% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)
Task 7 - 35% Complete - Subtask or Phase 1 of this task is about 80% complete. (See the progress reports from UCLA and Wisconsin for details.)
Task 8 - 30% Complete - This task has been carried only through conceptual design.
Task 9 – Task 9 has been partitioned into two subtasks. The subtasks are defined in the Project Plan. Subtask 9.1 is 65% complete and Subtask 9.2 is 45% complete.
Task 10 - 0% Complete
Task 11 - 0% Complete
5.0 Major Accomplishments to Date
a) The POEMS team stayed an extra day at the August PI Meeting to coordinate project plans for the upcoming year. The Project Plan submitted in June was updated and completion dates for each task confirmed.
b) POEMS continues to have weekly conference telephone calls.
The POEMS group has now generated or is close to completing first versions of several major components of POEMS. The time has come to focus on integrating these system components into a coherent system. The issues are: Database Design, Interfaces, and Interactions of Models. The next year of the project will focus on integration of technologies and creation of software systems.
5.2 Technical Accomplishments
The technical accomplishments are listed below, with the responsible parties given in parentheses at the end of each description.
a. The methodology and necessary software tools for the collection of LLNL-SP/2 Sweep3D task execution times are in place. The data collection process has commenced. (UTEP)
b. The SimpleScalar Tool Set of the University of Wisconsin-Madison and Intel Corporation was evaluated and adopted as the POEMS processor/memory subsystem component simulator. The profile of a Power604e was defined for SimpleScalar. (UTEP)
c. Sweep3D was ported to SimpleScalar and, using SimpleScalar, a study of Sweep3D CPU stalls has commenced. (UTEP)
d. The MPI interface for SimpleScalar has been designed and is in the process of being implemented. (UTEP/UCLA)
e. The lmbench benchmark suite was ported to SimpleScalar. (UTEP)
f. The preliminary specification of the POEMS database schema was proposed. (UTEP)
g. The detailed specification of the task graph application representation was completed and a draft document giving an overall description was produced. (Rice)
h. A working prototype for the automatic synthesis of task graphs was designed and developed as part of the dHPF compiler system. This includes three levels of support, namely: static task graph generation, condensing (or collapsing) the task graph, and dynamic task graph instantiation. An overview of the technical details has been given in a paper submitted to PPoPP '99. (Rice)
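For illustration only, the following Python sketch shows the three levels of support in highly simplified form; the data structures are hypothetical and are not dHPF's internal representation or its task graph format:

    # Illustrative sketch of static task graphs, a condensing pass, and dynamic
    # instantiation. Hypothetical structures, not dHPF's representation.

    from dataclasses import dataclass, field

    @dataclass
    class TaskNode:
        name: str                                   # e.g. "compute", "send", "recv"
        kind: str                                   # "compute" or "comm"
        succ: list = field(default_factory=list)    # names of successor nodes

    # Static task graph: one node per source-level task; loop bounds stay symbolic.
    static_graph = {
        "compute_a": TaskNode("compute_a", "compute", ["compute_b"]),
        "compute_b": TaskNode("compute_b", "compute", ["send_east"]),
        "send_east": TaskNode("send_east", "comm",    ["compute_c"]),
        "compute_c": TaskNode("compute_c", "compute", []),
    }

    def condense(graph):
        """Collapse chains of adjacent compute nodes into single nodes."""
        merged, skip = {}, set()
        for name, node in graph.items():
            if name in skip:
                continue
            while (node.kind == "compute" and len(node.succ) == 1
                   and graph[node.succ[0]].kind == "compute"):
                nxt = graph[node.succ[0]]
                skip.add(nxt.name)
                node = TaskNode(node.name + "+" + nxt.name, "compute", list(nxt.succ))
            merged[node.name] = node
        return merged

    def instantiate(graph, iterations):
        """Dynamic instantiation: replicate nodes once per loop iteration.
        (Successor edges are left symbolic here for brevity.)"""
        return {f"{n}@{i}": node for i in range(iterations) for n, node in graph.items()}

    print(list(condense(static_graph)))    # ['compute_a+compute_b', 'send_east', 'compute_c']
    print(list(instantiate(condense(static_graph), iterations=2)))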
i. Progress was made in resolving inconsistencies between the compiler’s internal representation and the information needed for the generation of task graphs. (Rice)
j. Concluded the study of Sweep3D on the IBM SP. We have studied the performance of large configurations of Sweep3D. Two problem sizes of interest to LANL were 20 million and 1 billion cells. We investigated the runtime of the application for these problem sizes as a function of the number of available processors. We found that even with large numbers of processors (upwards of a thousand), it will still take days to simulate the large-size, full version of the application on machines with the characteristics of the current IBM SP. (UCLA/Wisconsin)
k. Support for Clusters of SMP nodes.
We have added support for the simulation of applications running on clusters of SMPs. In our model, we assume that all the processors of the machine are identical and that each SMP node contains n processors, which share the node's main memory. The SMP nodes communicate with each other through an interconnection network. When an application such as Sweep3D, written using the MPI communication library, performs communication, intra-node communication is expected to be significantly faster than inter-node communication. (UCLA)
We have designed our model to capture the behavior of the IBM SP at Lawrence Livermore National Laboratory, which has 4-processor SMP nodes connected by the high-performance switch. Although the hardware supports fast intra-node communication, the communication software, and specifically MPI, does not yet take advantage of it. The next MPI implementation is expected to use shared memory for intra-node communication, and MPI-SIM can now simulate the behavior of the machine with this next-generation MPI.
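For illustration only, a minimal Python sketch of such a two-level communication cost model follows. The node size mirrors the 4-processor SMP nodes described above, but the latency and bandwidth constants are placeholders, not measured values for the LLNL machine or for MPI-SIM:

    # Two-level communication cost sketch for a cluster of SMP nodes.
    # All constants are placeholders; node size of 4 is taken from the text.

    NODE_SIZE = 4                         # processors per SMP node

    INTRA_LAT, INTRA_BW = 5e-6, 400e6     # shared-memory path inside a node (s, B/s)
    INTER_LAT, INTER_BW = 40e-6, 100e6    # switch path between nodes (s, B/s)

    def same_node(src, dst, node_size=NODE_SIZE):
        return src // node_size == dst // node_size

    def msg_time(src, dst, nbytes):
        """Point-to-point message time under the two-level model."""
        if same_node(src, dst):
            return INTRA_LAT + nbytes / INTRA_BW
        return INTER_LAT + nbytes / INTER_BW

    # Ranks 1 and 2 share a node; ranks 1 and 6 do not.
    print(msg_time(1, 2, 8192))           # intra-node
    print(msg_time(1, 6, 8192))           # inter-node

A real simulator also models contention and protocol costs; the sketch captures only the intra-node versus inter-node distinction that the next-generation MPI is expected to exploit.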
l. Hybrid Model Development
Our capability to simulate the new SMP cluster architecture has been used in a hybrid model that combines simulation and analytical models. The new hybrid model uses simulation to provide inputs to the analytical model; the hybrid model can then predict the performance of the new SMP cluster architecture for large machine configurations. More on the hybrid model development is included in the Wisconsin section. (UCLA/Wisconsin)
m. The LogGP application level component model of Sweep3D has been validated and prepared for inclusion in the POEMS system. Model input parameters are measured on 4-node parallel runs of the Sweep3D code, and the model accurately projects execution time to 128-node parallel runs of the code, for both fixed total problem size and fixed problem size per node, for the SP/2 system at LLNL. (Wisconsin)
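For illustration only, the Python sketch below shows the basic LogGP point-to-point cost that underlies this style of application-level model. The parameter values are placeholders rather than the measured SP/2 values, the gap parameter g is omitted because only a single isolated message is costed, and the actual Sweep3D model additionally captures the wavefront synchronization structure:

    # Minimal LogGP building block: cost of one k-byte point-to-point message.
    # Parameter values below are placeholders, not measured SP/2 values.

    def loggp_msg_time(k_bytes, L, o, G):
        """Sender overhead + per-byte gap + network latency + receiver overhead."""
        return o + (k_bytes - 1) * G + L + o

    # Placeholder parameters (seconds and seconds/byte), of the kind one might
    # measure on small 4-node runs and reuse when projecting larger runs.
    L, o, G = 30e-6, 8e-6, 1.0 / 100e6

    print(loggp_msg_time(4096, L, o, G))

Measured on small runs, parameters of this kind are then reused unchanged when the application model is evaluated for much larger processor counts.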
n. Together with the LANL and UCLA teams, we completed the design of a set of scalability experiments for Sweep3D. (Wisconsin/LANL/UCLA)
o. We successfully modified the LogGP component model of Sweep3D to represent execution of the application using fast intra-node MPI communication primitives that are projected to be implemented in the future on the SP/2. Together with the UCLA team, we used this new component model to create a simple hybrid analytic/simulation (LogGP/MPI-sim) model that can predict the scalability of Sweep3D with the projected new communication primitives. The hybrid model validated extremely well against the pure MPI-sim simulation model, and projects Sweep3D performance to much larger numbers of processors than is possible with MPI-sim alone. We submitted this work to the ‘99 Sigmetrics conference. (Wisconsin/UCLA)
p. Together with Prof. Derek Eager (U. of Saskatchewan) and Dan Sorin (U. of Wisconsin), we developed new AMVA techniques for modeling (1) mean wait at a server with highly bursty service times, and (2) the mean wait at "downstream" system resources due to the bursty arrival process. These new techniques are needed in the AMVA component model of the Origin 2000 memory system (to model highly bursty memory requests). The techniques are also generally applicable for modeling highly bursty behavior, and are thus a significant contribution to performance modeling methodology. Validations against exact solution of a range of two-queue networks and against the detailed RSIM simulation of a shared memory multiprocessor memory system show that the new AMVA techniques work extremely well, whereas a previous widely used AMVA technique performs poorly. During these validation experiments we also discovered that an AMVA interpolation technique invented for the prototype Origin 2000 memory system model (reported in the ‘98 ISCA) was only successful because it had two errors that cancelled each other. The new techniques have been incorporated in an improved prototype Origin 2000 memory system component model. (Wisconsin)
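For readers unfamiliar with AMVA, the Python sketch below shows a standard Schweitzer-style approximate MVA iteration for a closed network of FIFO queues. It is included only to illustrate the style of computation; the new techniques described above for bursty service and bursty downstream arrivals are extensions to this kind of iteration and are not reproduced here:

    # Baseline approximate MVA (Schweitzer/Bard style) for a closed queueing
    # network. Demands and population below are illustrative numbers only.

    def approx_mva(demands, n_customers, tol=1e-9, max_iter=10000):
        """demands: mean service demand per visit at each queue (seconds)."""
        K = len(demands)
        q = [n_customers / K] * K                 # initial guess: even spread
        for _ in range(max_iter):
            # estimated residence time at each queue
            r = [d * (1.0 + qk * (n_customers - 1) / n_customers)
                 for d, qk in zip(demands, q)]
            x = n_customers / sum(r)              # system throughput
            q_new = [x * rk for rk in r]          # queue lengths (Little's law)
            if max(abs(a - b) for a, b in zip(q, q_new)) > tol:
                q = q_new
            else:
                return x, q_new
        return x, q

    throughput, queue_lengths = approx_mva([0.010, 0.004], n_customers=8)
    print(throughput, queue_lengths)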
q. Collaboratively with Eager and Sorin we developed a hierarchical AMVA model that estimates lock contention as well as memory system performance. We submitted preliminary validations of this model and the new AMVA techniques for modeling bursty behavior to Sigmetrics ‘99. Further validations of the lock contention model are in progress. (Wisconsin)
r. In joint work with Sorin and Eager, we improved the prototype AMVA model of the Origin 2000 memory system in several ways, including: 1) modifying the model to represent the SimOS architecture (similar to the Stanford Flash architecture, which is very similar to the Origin 2000), 2) incorporating the new AMVA techniques for modeling bursty memory requests and for modeling lock contention, and 3) extending the model so that heterogeneous processor behavior can be represented. We validated this improved model against several Splash II applications executed by the detailed SimOS simulator. The improved model predicts heterogeneous per-processor execution efficiency extremely well. One surprising result was that, for Single-Program Multiple-Data (SPMD) applications that might otherwise be assumed to have homogeneous processor behavior, a great deal of heterogeneity is observed in the memory system behavior and performance of the processors. We are currently investigating the causes of the heterogeneity. We submitted the preliminary model validations and observed heterogeneity to ISCA ‘99. (Wisconsin)
s. Los Alamos has developed an instruction-level workload characterization technique using performance counters. A paper has been accepted to the Workshop on Workload Characterization at Micro-31. This technique has the advantage of collecting instruction-level characteristics in a few runs with virtually no overhead or slowdown. Based on the microprocessor architectural constraints and a set of derived abstract workload parameters, the architectural performance bottleneck for a specific application can be estimated. The analyzed results can provide some insight into why only a small percentage of processor peak performance is achieved even for many cache-friendly codes, an effect identified as a critical performance drawback in the Sweep3D scalability study. Further work has been performed on bound estimation for CPI0 and CPI, based on the same instruction-level abstract parameters and architectural constraints. The CPI0 bound estimate has been validated by running synthetic codes on a real machine.
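For illustration only, the Python sketch below shows how CPI and a crude lower bound on it can be derived from counter-style instruction-level parameters. The counter values, instruction mix, and machine parameters are invented for the example, and CPI0 is read here simply as the ideal no-stall CPI; the sketch does not reproduce the Los Alamos technique itself:

    # Deriving CPI and a crude ideal-CPI lower bound from instruction-level
    # parameters. All numbers are invented; this is not the LANL method.

    def measured_cpi(cycles, instructions):
        return cycles / instructions

    def cpi0_lower_bound(mix, issue_width, fp_units=1, mem_ports=1):
        """Resource bound: the machine cannot retire work faster than its
        narrowest relevant resource allows. mix maps classes to fractions."""
        return max(
            1.0 / issue_width,               # overall issue/retire width
            mix.get("fp", 0.0) / fp_units,   # floating-point unit throughput
            mix.get("mem", 0.0) / mem_ports, # load/store port throughput
        )

    counters = {"cycles": 2.4e9, "instructions": 1.5e9}   # illustrative only
    mix = {"fp": 0.35, "mem": 0.40, "int": 0.25}          # illustrative only

    cpi = measured_cpi(counters["cycles"], counters["instructions"])
    cpi0 = cpi0_lower_bound(mix, issue_width=4)
    print(f"CPI = {cpi:.2f}, ideal-CPI lower bound = {cpi0:.2f}, "
          f"stall CPI >= {cpi - cpi0:.2f}")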
t. Los Alamos has completed a LoPC model for wave-front scalability performance. Although this work is conducted as an independent project, Los Alamos is providing the results to the POEMS group as a collaboration contribution. Several papers and talks on this work have been made public, and a well-received tutorial based on this work was given at Supercomputing '98.
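For illustration only, the Python sketch below gives a generic wavefront pipelining estimate of the kind such a scalability model addresses; it is not the LoPC model itself, and all parameters are placeholders:

    # Generic wavefront pipelining estimate for a sweep over a px-by-py
    # processor grid. Not the LANL LoPC model; all parameters are placeholders.

    def wavefront_sweep_time(px, py, blocks_per_proc, t_block, t_comm):
        """One sweep direction: the pipeline takes (px + py - 2) steps to fill,
        then one step per block of work on each processor."""
        steps = (px + py - 2) + blocks_per_proc
        return steps * (t_block + t_comm)

    for px, py in [(8, 8), (32, 32)]:
        print(px * py, "processors:",
              wavefront_sweep_time(px, py, blocks_per_proc=16,
                                   t_block=2e-3, t_comm=3e-4),
              "seconds per sweep")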
u. The translator from the POEMS specification language to an extended specification language for the CODE parallel programming system was completed. (UT-Austin)
6.0 Artifacts Developed During the Past Quarter
b. Technical paper: Eager, D. L., D. Sorin, and M. K. Vernon, "Analytic Modeling of Burstiness and Synchronization Using Approximate MVA," submitted to Sigmetrics '99.
c. Technical paper: Sorin, D., J. Lemon, D. L. Eager, and M. K. Vernon, "Analytic Evaluation of Shared-Memory Systems with Heterogeneous Applications", submitted to ISCA ‘99.
7.0 Issues
7.1 Open Issues with no plan for resolution
a) Simulating Fortran programs within the SimpleScalar Tool Set. (UTEP)
b) We are looking at the possibility of integrating MPI-SIM with the Simple Scalar processor and memory simulator used by UTEP. The interface between the simulators needs to be defined and developed. (UTEP/UCLA)
7.2 Open issues with plan for resolution:
a) Detailed symbolic analysis for optimizing the condensed (or collapsed) task graph. (Rice)
b) Resolving inconsistencies between the compiler’s internal representation and the information needed for the generation of task graphs. (Rice)
c) The causes of the observed heterogeneity in parallel SPMD programs executed by the detailed SimOS simulator. Possible causes include uneven assignment of the shared data to the multiple memory modules and execution of operating system daemons.
d) How to integrate the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory for more precise end-to-end modeling of the Splash benchmarks.
e) Definition of the interfaces across application, operating system/runtime environment, and hardware modeling domains.
f) Interfacing of SimpleScalar and MPI-SIM.
g) Experiment design for completing the experimental evaluation of Sweep3D.
h) Sweep3D memory hierarchy study.
7.3 Issues resolved:
a) Selection of a processor/memory subsystem component model/simulator.
b) Definition of a plan for calibration of processor/memory subsystem component model/simulator and physical subsystem.
c) Instantiation of the dynamic task graph through the compiler.
d) The issue of simulating a cluster of SMPs has been resolved.
e) AMVA model of highly bursty behavior.
f) Experiment design for the experimental evaluation of Sweep3D.
8.0 Near-term Plan
The Near-Term Plan is given as a set of topics for research, with the principal organization for each task given in parentheses and the collaborations given in the text.
Task Graph Compiler (Rice)- We aim to improve the functionality of the existing prototype. We plan to add a module to translate the current task graph output into an input description that can be read by the CODE graphical programming environment.
Component Models (UTEP) - Submit a new draft of the "Hardware Domain Component Model Specification" document, which will address feedback from the UCLA and LANL teams and will include the specification of the LLNL-SP/2 Power604e and a description of SimpleScalar.
Integration of Simulators (UTEP/UCLA) - Interface SimpleScalar with MPI and with MPI-SIM, and submit a document that describes the design and implementation of the interface with respect to MPI-SIM; collaboration with UCLA.
Model Validation (UTEP/LANL) -Calibrate the SimpleScalar-modeled Power604e with the LLNL-SP/2 Power604e—possible collaboration with LLNL.
Modeling Sweep3D CPU stalls (UTEP/LANL/Wisconsin)- possible collaboration with University of Texas at Austin, University of Wisconsin-Madison, and LANL.
Memory hierarchy study of Sweep3D (UTEP/Wisconsin) -- collaboration with University of Wisconsin-Madison.
Component Model Validation (Wisconsin) - Complete the validations of the improved Origin 2000 memory system model against the Splash II applications simulated on the SimOS architecture.
Component Model Validation (Wisconsin)- Determine the causes of the observed heterogeneity in the Splash II benchmarks.
Integration of Component Model Evaluation Methods (Wisconsin/UTEP/UCLA) - Experiment with further hybrid analytic/simulation models of Sweep3D, including models with simulated alternative memory hierarchies.
Code Generation for the POEMS Specification Language (UT-Austin)
9.0 Completed Travel
Adve, Bagrodia, Browne, Deelman, Rice, Sakellariou, Oliver, Teller and Vernon attended and gave presentations
Adve, Bagrodia, Browne, Deelman, and Vernon attended and gave presentations on POEMS topics.
c) WOSP '98, the 1998 Workshop on Software and Performance in Santa Fe, NM
Browne and Teller attended. Browne delivered a POEMS presentation and both Browne and Teller participated in a panel.
Sakellariou was a session chair at EuroPar’98 (September 1-4, Southampton, UK). No funds for foreign travel were charged to POEMS; conference registration was charged.
10.0 Equipment
N/A
11.0 Summary of Activity
11.1 Work Focus:
The focus of work is given by institution, with collaborations noted in the descriptions.
Rice
A prototype module for the automatic generation of task graphs was implemented as part of the dHPF system.
UTEP
a) Hardware domain component model specification and implementation, in particular, implementation of the processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e.
b) LLNL-SP/2 Sweep3D task execution time collection for performance database.
c) Specification of POEMS database schema—collaboration with Vikram Adve, Rajive Bagrodia, Jim Browne, John Rice, Pat Teller, and Richard Oliver, i.e., Rice, UCLA, UT-Austin, Purdue, and UT-El Paso.
d) Interfacing of processor/memory subsystem simulation and MPI.
e) Preparatory research and experimentation for calibration of processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e.
f) Study of Sweep3D CPU stalls—collaboration with LANL.
UCLA
a) Designing and performing the scalability study for Sweep3D on the IBM SP.
b) Adding support for SMP cluster simulation to MPI-SIM.
c) Designing the hybrid model that combines analytical and simulation models.
Wisconsin
This past quarter we focused on:
UT-Austin
The focus of work was development of the compiler for the POEMS Specification Language.
11.2 Significant Events:
c) The paper "Are All Scientific Workloads Equal?" by R. Oliver and P. J. Teller was accepted for publication in the Proceedings of the 1999 IEEE International Performance, Computing and Communications Conference (IPCCC '99), February 1999. (UTEP)
d) We have provided support for simulation of clusters of SMPs. (UCLA)
FINANCIAL INFORMATION:
Contract #: N66001-97-C-8533
Contract Period of Performance: 7/24/97-7/23/00
Ceiling Value: $2,013,588
Reporting Period: 8/1/98-10/31/98
Actual Vouchered (all costs to be reported as fully burdened; do not report overhead, G&A, and fee separately):
Current Period
Prime Contractor                     Hours       Cost ($)
Labor                                  960      24,795.00
ODC's                                           38,435.42
Sub-contractor 1 (Purdue)              405      20,017.61
Sub-contractor 2 (UT-El Paso)          134      12,963.58
Sub-contractor 3 (UCLA)                 54       4,815.61
Sub-contractor 4 (Rice)                200      24,043.97
Sub-contractor 5 (Wisconsin)           474      33,872.08
Sub-contractor 6 (Los Alamos)            0           0.00
TOTAL:                               2,227     167,943.27
Cumulative to date:
Prime Contractor                     Hours       Cost ($)
Labor                                5,960     177,578.01
ODC's                                          232,730.65
Sub-contractor 1 (Purdue)              881      52,550.28
Sub-contractor 2 (UT-El Paso)          942      69,984.91
Sub-contractor 3 (UCLA)                230       7,877.72
Sub-contractor 4 (Rice)                676      51,195.56
Sub-contractor 5 (Wisconsin)         1,070      69,466.94
Sub-contractor 6 (Los Alamos)            0           0.00
TOTAL:                               9,759     661,384.07