Quarterly Status Report
Performance Modeling
An Environment For End-to-End Performance Design of
Large Scale Parallel Adaptive Computer/Communications Systems
for the period May 1st, 1999 to July 31st, 1999,
Contract N66001-97-C-8533
CDRL A001
1.0 Purpose of Report
This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.
2.0 Project Members
University of Texas, spent: 1,065 hours
sub-contractor (Purdue), spent: 80 hours
sub-contractor (UT-El Paso), spent: 936 hours
sub-contractor (UCLA), spent: 456 hours
sub-contractor (Rice), spent: 433 hours
sub-contractor (Wisconsin), spent: 0 hours
sub-contractor (Los Alamos), spent: 0 hours
3.0 Project Description (last modified 07/97)
3.1 Objective
The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical parallel and distributed systems.
3.2 Approach
The POEMS project combines innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize the goals. First, we will develop a specification language based on a general model of parallel computation with specializations to representation of workload, hardware and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.
Second, we will experimentally and incrementally develop and validate scaleable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via the use of adaptive module interfaces, supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm—analytical, simulation, or the software or hardware system itself—that is most appropriate with respect to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.
Third, we will provide a library of models, at multiple levels of granularity, for modeling scaleable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.
Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.
4.0 Performance Against Plan
4.1 Spending – Spending has caught up with plan. All of the subcontracts except for LANL are in place. After this quarter, the spending rate for the project will run at about the planned rate.
4.2 Task Completion - A summary of the completion status of each task in the SOW follows. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty. Assessments of task completion by the participating institutions are given in the progress reports from each institution.
Task 1 - 95% Complete - Methodology development is an iterative process. One develops a version of the methodology, applies it and revises the methodology according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with completion of Task 7 when validation of the methodology on the first end-to-end performance model has been completed.
Task 2 - Complete
Task 3 - 95% Complete - Specification languages for all three domains have been proposed and are in various states of completion.
Task 4 - 85% Complete - Task graphs can now be developed for most HPF programs and work on MPI programs is well underway.
Task 5 - 85% Complete - The compiler for the specification language is well into development. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.
Task 6 - 65% Complete - The initial library of components has been specified and instantiation has begun. (See the progress reports from UTEP and Wisconsin for details.)
Task 7 - 50% Complete - Subtask or Phase 1 of this task is about 50% complete. (See the progress reports from UCLA and Wisconsin for details.)
Task 8 - 55% Complete
Task 9 – Task 9 has been partitioned into seven subtasks. Subtasks 9.1, 9.2 and 9.3 are complete. Subtask 9.4 is 50% complete, 9.5 is 35% complete and 9.6 is 20% complete. Subtask 9.7 was initiated this quarter.
Task 10 - 0% Complete
Task 11 - 0% Complete
5.0 Major Accomplishments to Date
a) Long Term Workplan
POEMS has generated the framework for end-to-end performance modeling and has developed initial versions of several major components. This year has been designated the "Year of Integration." The long-term goal for this year (1999/2000) is integration of POEMS components into the framework. This will enable POEMS to spend the bulk of the third year of the project in application to further example systems.
a. Knowledge Base
* Pythia system made ready to accept data from performance testing and modeling
* Started work on Task 9.7
* Ifestos methodology validated on data from two case studies previously published and analyzed by a completely different methodology.
b. Tool Interfacing and Integration
Integration of Compiler-Generated Task Execution Times into MPI-Sim
Rice and UCLA have been collaborating to develop hybrid models that integrate compiler-based optimizations, which describe task execution times using analytical formulas and measurements, into the MPI-Sim simulator. For example, if a task consists of a loop with bounds from 0 to n-1, the task execution time would be n * (measured execution time of the code segment inside the loop). This integration can facilitate simulation of systems with thousands of processors and of the realistic problem sizes expected for such large systems. In the previous quarter we manually modified a few MPI programs to evaluate the benefits of this approach, with very promising results.
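The loop example above can be written out as a small calculation. The following is only an illustrative sketch in Python, not code from MPI-Sim or dHPF: the function names and the sample numbers are assumptions made for this example. It derives a per-iteration cost from a small measured run and scales it to a larger loop bound n.

```python
def per_iteration_time(measured_total_seconds: float, measured_iterations: int) -> float:
    """Average time of one loop body, taken from a small instrumented run."""
    if measured_iterations <= 0:
        raise ValueError("measured_iterations must be positive")
    return measured_total_seconds / measured_iterations


def estimate_loop_task_time(n: int, body_seconds: float) -> float:
    """Analytical estimate for a loop running from 0 to n-1:
    n * (measured execution time of the loop body)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * body_seconds


# Hypothetical numbers: a 1024-iteration run measured at 0.25 s,
# scaled to a loop of 4096 iterations.
body = per_iteration_time(0.25, 1024)
estimate = estimate_loop_task_time(4096, body)
```

The estimate is linear in n, so it is only as good as the assumption that the loop body costs the same at large scale as in the small measured run.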
The MPI-Sim simulator was extended to accept the analytical descriptions of the task execution times and to incorporate them into its estimate of the total execution time. The communication tasks are still simulated in detail by MPI-Sim.
The first step in developing the hybrid model is to derive a task graph from the application source code. The task graph has two purposes: the first is to expose the tasks so that task execution times can be measured, and the second is to derive analytical models of task execution time.
Once the tasks are exposed, the application is run and the task execution times are measured. For the purpose of large-scale simulation, the measurements are performed on a small problem size and a small number of processors.
To accomplish this generation of task graphs the Rice dHPF compiler has been extended as follows:
(1) The compiler uses program slicing to identify those subsets of the
computational tasks whose *results do not affect the performance* of
the program. We call these "redundant computations" (redundant from the
viewpoint of program performance).
(2) The compiler computes analytical (but symbolic) estimates for the
execution time of these redundant computations.
(3) The compiler modifies the generated message-passing program to replace
the redundant computations with calls to a special MPI-Sim function,
passing in the symbolic performance estimate as a parameter.
(4) The compiler also generates a second new version of the message-passing
program with instrumentation inserted to measure parameter values for
the symbolic task performance estimates. This version is executed to
measure these parameter values and these values are also provided as a
separate input to the simulator.
(These compiler extensions directly exploit information in the static
task graph, which was developed as an artifact of Rice University's effort on
Task 4 of this project.)
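The effect of steps (1) through (4) on a simulation run can be sketched with a toy model. This Python sketch rests on several assumptions: the slicing result is supplied as a flag rather than computed, `ToySimulator.delay` is a hypothetical stand-in for the special MPI-Sim function (whose actual name and signature are not given in this report), and retained tasks are charged their estimate instead of being executed and timed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


@dataclass
class Task:
    name: str
    redundant: bool                      # step (1): slicing result, assumed given
    estimate: Callable[[Params], float]  # step (2): symbolic time estimate


class ToySimulator:
    """Toy stand-in for a simulator; not the MPI-Sim interface."""

    def __init__(self) -> None:
        self.clock = 0.0
        self.executed: List[str] = []

    def delay(self, seconds: float) -> None:
        # Step (3): advance simulated time without running the computation.
        self.clock += seconds

    def run(self, task: Task, params: Params) -> None:
        # Retained task: recorded as executed and charged its estimate
        # (a simplification made for this sketch).
        self.executed.append(task.name)
        self.clock += task.estimate(params)


def simulate(tasks: List[Task], params: Params) -> Tuple[float, List[str]]:
    sim = ToySimulator()
    for task in tasks:
        if task.redundant:
            sim.delay(task.estimate(params))
        else:
            sim.run(task, params)
    return sim.clock, sim.executed


# Step (4): parameter values as they might come from an instrumented run
# (hypothetical numbers: n iterations, w seconds per iteration).
params = {"n": 1000.0, "w": 0.001}
tasks = [
    Task("compute_bounds", False, lambda p: 10 * p["w"]),
    Task("update_grid", True, lambda p: p["n"] * p["w"]),
]
total_time, executed = simulate(tasks, params)
```

In this sketch only `compute_bounds` is executed; `update_grid` contributes to the simulated clock through its estimate alone, which is the source of the simulation-time savings described above.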
Integration of MPI-Sim and SimpleScalar
UTEP finalized the interface between MPI and SimpleScalar, tested it, and distributed a document describing this effort and the use of this tool; the document is entitled "SimpleScalar 3.0a Modifications to Run Under MPI".
The interfacing of MPI and SimpleScalar has made it possible to run an MPI
program, in particular the MPI version of Sweep3D, on multiple instantiations
of SimpleScalar that communicate using MPI commands. The referenced document
can serve as a template for interfacing SimpleScalar to the POEMS platform.
Interfacing of Task Graph and POEMS Specification Language Models
Rice and UT-Austin are working to interface the static and dynamic task graphs generated by dHPF to the CODE environment by mapping the task graphs to the POEMS Specification Language (PSL). The Rice investigators (Vikram Adve and Rizos Sakellariou) visited UT Austin for a day on May 17, 1999. The major results of the meeting were a resolution of the key technical issues to be faced in interfacing the two systems and a work plan to achieve this goal. Subsequently, Rizos Sakellariou provided UT Austin with a detailed task graph for an example MPI program that illustrates the features required to map to PSL.
c. Methodology Definition
The WOSP and TSE papers define and illustrate the methodology.
d. Specification Languages
A complete example in the POEMS Specification Language was given in the TSE paper.
e. Model Development and Validation
Hardware Domain Component Library
performance counters. If accurate, these counters could be used for modeling
purposes.
SWEEP3D Models
6.0 Artifacts Developed
Artifacts include technical papers, software and models.
a. "Compiler-Supported Simulation of Very Large Parallel Applications,"
Vikram Adve, Rajive Bagrodia, Ewa Deelman, Thomas Phan, and
Rizos Sakellariou. To appear in the Proceedings of the ACM/IEEE SC99
Conference on High Performance Networking and Computing.
b. "Analytic Evaluation of Shared Memory Architectures with Heterogeneous Applications," D. Eager, D. Sorin and M. Vernon. (Submitted to HPCA 2000.)
7.0 Issues
None
a. Integration of Performance Data into the Knowledge System
How can the insertion of performance data into the knowledge system be automated, at least partially? This should not be a formidable technical problem, but there are many details to be coordinated among the project participants.
UCLA
b. Integration of PARSEC runtime into MPI-Sim.
We are considering porting the MPI-Sim simulator to the PARSEC runtime
simulation system. This will allow us to utilize the latest synchronization
algorithms present in PARSEC.
c. Integration of LogGP with AMVA
Integration of the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory for more precise end-to-end modeling of the Splash benchmarks.
None
8.0 Near-term Plan
"Near-term" refers to the next one or two quarters.
linear algebra solvers.
b. Models and Model Evaluation
which will address feedback from UCLA and LANL and will include
the specification of the LLNL-SP/2 Power604e and a description of SimpleScalar.
of the block of work of Sweep3D.
performance counters.
9.0 Completed Travel
Portland, Oregon. She is Student Volunteers Chair. This trip was not paid for by funds from this grant.
1999, Washington, DC. This trip was not paid for by funds from this grant.
Paso, Rice University and UCLA
10.0 Equipment
None Acquired
11.0 Summary of Activity
11.1 Work Focus:
The two foci for continuing work for the 1999/2000 year are integration of tools and component model library development.
c) Models and Model Evaluation
in particular, implementation of the processor/memory subsystem
component model/simulator of the LLNL-SP/2 Power604e and the SGI O2K
MIPS R10000.
performance analysis.
11.2 Significant events
literature.
from group into the Ifestos framework
May 4-6, 1999. The titles of the two papers were "Performance Prediction of Large Parallel Applications Using Parallel Simulations" and "Predictive Analysis of a Wavefront Application Using LogGP."
FINANCIAL INFORMATION:
Contract #: N66001-97-C-8533
Contract Period of Performance: 7/24/97-7/23/00
Ceiling Value: $1,839,517
Reporting Period: 5/1/99-7/31/99
Actual Vouchered (all costs to be reported as fully burdened, do not report
overhead, GA and fee separately):
Current Period
                                    Hours        Cost ($)
Prime Contractor
  Labor                             1,065       44,440.25
  ODC's                                         35,821.90
Sub-contractor 1 (Purdue)              80       12,414.90
Sub-contractor 2 (UT-El Paso)         936       20,086.47
Sub-contractor 3 (UCLA)               456       23,260.53
Sub-contractor 4 (Rice)               433       22,533.81
Sub-contractor 5 (Wisconsin)            0          216.07
Sub-contractor 6 (Los Alamos)           0            0.00
TOTAL:                              2,970      158,773.93
Cumulative to date:
                                    Hours        Cost ($)
Prime Contractor
  Labor                             8,225      254,165.26
  ODC's                                        311,962.02
Sub-contractor 1 (Purdue)           1,208       95,232.81
Sub-contractor 2 (UT-El Paso)       3,346      148,415.06
Sub-contractor 3 (UCLA)             2,676      135,309.47
Sub-contractor 4 (Rice)             2,848      164,482.46
Sub-contractor 5 (Wisconsin)        1,658       86,058.40
Sub-contractor 6 (Los Alamos)           0            0.00
TOTAL:                             19,961    1,195,625.48