From: John Rice <jrr@cs

Quarterly Status Report

Performance Modeling

An Environment For End-to-End Performance Design of

Large Scale Parallel Adaptive Computer/Communications Systems

for the period November 1st, 1998 to January 31st, 1999,

Contract N66001-97-C-8533

CDRL A001

1.0 Purpose of Report

This status report is the quarterly contract deliverable (CDRL A001), which summarizes the effort expended by the University of Texas, Austin team in support of Performance Modeling on Contract N66001-97-C-8533.

2.0 Project Members

University of Texas, spent: 0 hours

sub-contractor (Purdue), spent: 119 hours

sub-contractor (UT-El Paso), spent: 996 hours

sub-contractor (UCLA), spent: 1,280 hours

sub-contractor (Rice), spent: 1,095 hours

sub-contractor (Wisconsin), spent: 288 hours

sub-contractor (Los Alamos), spent: 0 hours

3.0 Project Description (last modified 07/97)

3.1 Objective

The goals of this project are: (1) to develop a comprehensive environment (POEMS) for end-to-end performance analysis of large, heterogeneous, adaptive, parallel/distributed computer and communication systems, and (2) to demonstrate the use of the environment in analyzing and improving the performance of defense-critical parallel and distributed systems.

3.2 Approach

The project combines innovations from a number of domains (communication, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE) to realize the goals. First, we will develop a specification language based on a general model of parallel computation, with specializations for representing workload, hardware, and software. To enable direct use of programs as workload specifications, compilation environments such as dHPF will be adapted to generate executable models of parallel computation at specified levels of abstraction.

Second, we will experimentally and incrementally develop and validate scalable models. This will involve using multi-scale models, multi-paradigm models, and parallel model execution in complementary ways. Multi-scale models will allow different components of a system to be modeled at varying levels of detail via adaptive module interfaces supported by the specification language. Multi-paradigm models will allow an analyst to use the modeling paradigm (analytical, simulation, or the software or hardware system itself) that is most appropriate to the goals of the performance study. Integration of an associative model of communication with data mediation methods to provide adaptive component interfaces will allow us to compose disparate models in a common modeling framework. To handle computationally expensive simulations of critical subsystems in a complex system, we will incorporate parallel simulation technology based on the Maisie language.

Third, we will provide a library of models, at multiple levels of granularity, for modeling scalable systems like those envisaged under the DOE ASCI program, and for modeling complex adaptive systems like those envisaged under the GloMo and Quorum programs.

Finally, we will provide a knowledge base of performance data that can be used to predict the performance properties of standard algorithms as a function of architectural characteristics.

4.0 Performance Against Plan

4.1 Spending - Spending is now catching up with plan. All of the subcontracts except the one for LANL are in place. It has been determined that a subcontract between the University of Texas and LANL cannot be established, so the tasks and funding assigned to LANL are being shifted to other participants. Staffing at each participating institution has been completed. After this quarter, the project's spending will run at about the planned rate.

4.2 Task Completion - The completion status of each task in the SOW is summarized below. Because several participants are involved in most tasks, the completion estimates for tasks in progress carry some uncertainty.

Task 1 - 90% Complete - Methodology development is an iterative process. One develops a version of the methodology, applies it and revises the methodology according to the success attained in the application. Evaluation of the methodology is in progress with the analysis of the performance of Sweep3D on the SP2 family of architectures. Closure will come with completion of Task 7 when validation of the methodology on the first end-to-end performance model has been completed.

Task 2 - Complete

Task 3 - 90% Complete - Specification languages for all three domains have been proposed and are in various states of completion.

Task 4 - 50% Complete

Task 5 - 85% Complete - A feasibility demonstration compiler is now available. Use of the compilation methods developed for the CODE parallel programming system at UT-Austin has accelerated this task.

Task 6 - 50% Complete - The initial library of components for Sweep3D on the IBM SP2 is complete, and enhanced libraries are now in development.

Task 7 - 35% Complete - Subtask or Phase 1 of this task is about 50% complete.

Task 8 - 50% Complete

Task 9 - Task 9 has been partitioned into seven subtasks. Subtasks 9.1 and 9.2 are complete. Subtask 9.3 is 80% complete. Subtasks 9.4 and 9.5 are 10% complete. Subtasks 9.6 and 9.7 have not yet been initiated.

Task 10 - 0% Complete

Task 11 - 0% Complete

5.0 Major Accomplishments to Date

5.1 Project Management

a) Long Term Workplan

POEMS has generated the framework for end-to-end performance modeling and has developed initial versions of several major components. This year has been designated the "Year of Integration." The long term goal for this year is integration of POEMS components into the framework. This will enable POEMS to spend the bulk of the third year of the project applying the POEMS framework to further example systems.

5.2 Technical Accomplishments

a. Knowledge-Based System

* Completed Tasks 9.1 and 9.2

* Made substantial enhancements to the Pythia II (formerly called IFESTOS) system

* Started work on Tasks 9.4 and 9.5

b. Models, Model Evaluation and Modeling

* Developed a synthetic benchmark SAMPLE (Synthetic Application for Message-Passing Library Environments). This C program allows us to measure the performance of a hardware platform as a function of application characteristics. Specifically, SAMPLE performs precisely changeable amounts of calculation and message-passing inter-process communications. SAMPLE executes message-passing via calls that can be targeted to either MPI-SIM or MPI. SAMPLE supports several communication patterns representative of a wide variety of applications. These patterns include: wavefront, nearest neighbor, ring, one-to-all, all-to-all. The frequency and size of messages, the communication patterns and the computational granularity can be varied to achieve a range of application behaviors.

* Developed models for the MPI communications on the SGI Origin 2000. We have experimentally derived the appropriate message-passing latencies to use as inputs for the MPI-SIM simulator. We will assess the model accuracy in the next quarter.

* In joint work with Sorin (Wisconsin) and Eager (U. Saskatchewan), we completed the validation of the prototype AMVA model of the Origin 2000 memory system against Splash II applications executed by the detailed SimOS simulator. The model predicts heterogeneous per-processor execution efficiency extremely well for all applications examined. We also examined why processors behave heterogeneously when running homogeneous applications and found that the cause is operating system effects: the operating system places unequal processing load on the processors when measured over the time frame of a single barrier in the applications, which results in heterogeneous processor efficiencies on this time scale.

* The data collection of Sweep3D task execution times on the LLNL-SP/2 (in accordance with the document "Measured Data for Sweep3D Task Times") is complete. Because of faulty runs, the collected data covers fewer problem sizes and processor grid sizes than planned. At this time, continuing the data collection does not appear to be worth the time and cost.

* The UTEP Poets have gained familiarity with the SimpleScalar Tool Set of the University of Wisconsin-Madison and have implemented an interface between it and MPI. The new tool, called MPI_SS, runs multiple copies of SimpleScalar under MPI, so that the MPI version of Sweep3D can execute within the SimpleScalar simulation environment. Testing and documentation of the tool are under way. This is the first step in interfacing SimpleScalar to MPI-SIM.

* Additional work was done on using narrow spectrum benchmarks to calibrate processor/memory subsystem component models.
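The communication patterns listed for SAMPLE above can be illustrated with a small sketch. The Python fragment below is not the SAMPLE source (SAMPLE is a C program); it only enumerates, for each named pattern, the ranks a given process would send to. The helper name `destinations`, the row-major 2-D process grid used for the wavefront and nearest-neighbor cases, and the choice of rank 0 as the root for one-to-all are assumptions made for illustration.

```python
def destinations(pattern, rank, nprocs, cols=None):
    """Return the sorted list of ranks that `rank` sends to under `pattern`.

    Hypothetical helper for illustration only; not part of SAMPLE.
    Grid patterns lay ranks out row-major on a (nprocs // cols) x cols grid.
    """
    if pattern == "ring":
        return [(rank + 1) % nprocs]
    if pattern == "one-to-all":
        # Assumption: rank 0 is the root; all other ranks only receive.
        return [r for r in range(nprocs) if r != 0] if rank == 0 else []
    if pattern == "all-to-all":
        return [r for r in range(nprocs) if r != rank]
    if pattern in ("nearest-neighbor", "wavefront"):
        if cols is None or nprocs % cols != 0:
            raise ValueError("grid patterns need cols dividing nprocs")
        rows = nprocs // cols
        i, j = divmod(rank, cols)
        if pattern == "wavefront":
            # Sweep starting at the top-left corner: send east and south only.
            steps = [(0, 1), (1, 0)]
        else:
            # Non-periodic 4-point stencil: north, south, west, east.
            steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        out = []
        for di, dj in steps:
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                out.append(ni * cols + nj)
        return sorted(out)
    raise ValueError(f"unknown pattern: {pattern}")
```

On a 2 x 3 grid (`nprocs=6, cols=3`), for example, `destinations("wavefront", 0, 6, cols=3)` yields ranks 1 and 3, while the corner rank 5 sends to no one. Message size, message frequency, and computational granularity, which SAMPLE also varies, are deliberately left out of this sketch.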

c. Task Graph Compiler

Development continued on compiler support for generating application task graphs. Three major extensions were added to the compiler in this quarter. First, we extended the task graph construction to an interprocedural version that can support multi-procedure programs. Second, we initiated work on condensing the task graph interprocedurally, so that a sequence of computational tasks crossing procedure boundaries can be integrated into a single node. Third, we added support to compute a symbolic task scaling function that describes how the execution time of each computational task (or condensed task representing dynamic sequences of computational tasks) varies with program input parameters and system size.
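The idea of condensing a sequence of computational tasks into one node that carries a combined scaling function can be sketched as follows. This is a simplified illustration, not the dHPF implementation: it assumes a purely linear sequence of tasks, ignores procedure boundaries, and represents a scaling function as a Python callable of a problem size `n` and processor count `p`. The names `Task` and `condense` and the example cost functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# A scaling function maps (problem size n, processor count p) to a cost estimate.
Scaling = Callable[[int, int], float]


@dataclass
class Task:
    name: str
    kind: str       # "comp" (computation) or "comm" (communication)
    cost: Scaling


def condense(tasks: List[Task]) -> List[Task]:
    """Merge each maximal run of consecutive "comp" tasks into one node.

    The merged node's scaling function is the sum of its members' functions.
    Communication tasks are kept as separate nodes.
    """
    out: List[Task] = []
    run: List[Task] = []

    def flush() -> None:
        if not run:
            return
        members = list(run)  # copy so the lambda keeps its own member list
        out.append(Task(
            name="+".join(t.name for t in members),
            kind="comp",
            cost=lambda n, p, ms=members: sum(t.cost(n, p) for t in ms),
        ))
        run.clear()

    for t in tasks:
        if t.kind == "comp":
            run.append(t)
        else:
            flush()
            out.append(t)
    flush()
    return out


# Illustrative chain with made-up cost functions.
chain = [
    Task("init", "comp", lambda n, p: n / p),
    Task("sweep", "comp", lambda n, p: 3 * n / p),
    Task("send", "comm", lambda n, p: 1.0),
    Task("update", "comp", lambda n, p: 2 * n / p),
]
condensed = condense(chain)
```

Here the four-task chain condenses to three nodes (`init+sweep`, `send`, `update`), and the merged node's cost for a given `n` and `p` is the sum of the costs of `init` and `sweep`.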

d. Framework – Methodology, Specification Language and Compiler

* Continued refinement of the specification language.

* Continued development of the POEMS specification language compiler.

6.0 Artifacts Developed

a. Technical Papers

[1] "A Knowledge Discovery Methodology for the Performance of Scientific Software," V.S. Verykios, E.N. Houstis, and J.R. Rice (accepted for publication in the Journal of Knowledge and Information Systems).

[2] "A Knowledge Discovery Methodology for the Performance of Scientific Software," V.S. Verykios, E.N. Houstis, and J.R. Rice (accepted for presentation at the International Conference on Computational Science, May 1999).

[3] "Asynchronous Parallel Simulation of Parallel Programs," R. Bagrodia and E. Deelman (submitted to IEEE Transactions on Software Engineering). The paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task and data parallel programs. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis.

[4] "Parallel Simulation of Large Scale Parallel Applications," R. Bagrodia, E. Deelman, and T. Phan (submitted to the International Journal of High-Performance and Scientific Applications, by invitation).

[5] "Application Representations for Multi-Paradigm Performance Modeling of Large-Scale Parallel Scientific Codes," Vikram Adve and Rizos Sakellariou (submitted to the International Journal of High-Performance and Scientific Applications, by invitation).

[6] "Compositional Development of Performance Models," J. C. Browne and A. Dube (submitted to the International Journal of High-Performance and Scientific Applications, by invitation).

b. Software

The new tool MPI_SS is described in Section 5.2 b.

7.0 Issues

7.1 Open Issues with no Plan for Resolution

a. Simulating Fortran programs within the SimpleScalar Tool Set.

7.2 Open Issues with Plan for Resolution

a. Entry of Performance Data - How to get performance data from other sites inserted into the database.

b. Integration of LogGP and AMVA Models - How to integrate the LogGP specification of application synchronization structure with the AMVA model of the Origin 2000-like memory for more precise end-to-end modeling of the Splash benchmarks.

c. Cross-Domain Model Integration - Definition of the interfaces across application, operating system/runtime environment, and hardware modeling domains.

d. Execution of Sweep3D memory hierarchy study.

e. Benchmarks for Calibration - Understanding the output of the narrow spectrum benchmarks being used for calibration.
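For reference on issue b, the standard LogGP cost of a single point-to-point message, which any integration with the AMVA memory model would build on, can be written as a short function. This is a minimal sketch of the published LogGP formula only; it does not represent the project's model of application synchronization structure, and the function name and any parameter values passed to it are placeholders rather than measurements from this project.

```python
def loggp_message_time(k_bytes, L, o, G):
    """End-to-end time for one k-byte message under the LogGP model.

    L: network latency
    o: per-message overhead, paid once by the sender and once by the receiver
    G: gap per byte for long messages
    All values must be in the same time unit chosen by the caller.
    """
    if k_bytes < 1:
        raise ValueError("message must contain at least one byte")
    # Sender overhead + per-byte gap for the remaining bytes
    # + network latency + receiver overhead.
    return o + (k_bytes - 1) * G + L + o
```

For a one-byte message the per-byte term vanishes and the cost reduces to `L + 2 * o`, the LogP cost of a short message.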

7.3 Issues Resolved

MPI-SIM on the SGI Origin - The port of MPI-SIM to the Origin 2000 has been completed.

Processor Utilization Anomaly - The causes of heterogeneous processor execution behavior for homogeneous applications have been identified.

Integration of SimpleScalar and MPI-SIM - A methodology for interfacing SimpleScalar and MPI-SIM has been developed.

8.0 Near-term Plan

The near term plan focuses on completing components of POEMS and getting them ready for integration.

a. Knowledge-Based System

* Complete Task 9.3 (Initial rule set generation and test)

* Initiate Task 9.6 (Enhance the knowledge base system)

b. Models, Model Evaluation and Modeling

* Validate Sweep3D, SAMPLE, and other benchmarks on the SGI Origin 2000.

* Experiment with further hybrid analytic/simulation models of Sweep3D, including models with simulated alternative memory hierarchies (with the UCLA and UTEP teams).

c. Task Graph Generation

* Work will continue on Task 4. We aim to improve the functionality of the existing prototype.

* Interface the task graph to an execution-driven MPI simulator, MPI-SIM, to improve the efficiency of simulation of MPI programs.

d. Framework, Specification Language and Compiler

* Mapping of task graphs to POEMS Specification Language.

* Development of benchmarks for POEMS Specification Language compiler.

9.0 Completed Travel

Purdue

V.S. Verykios attended the POEMS Planning Meeting in Houston, Texas, on January 15th, 1999, and gave a presentation.

UCLA

Dr. Bagrodia attended the 1998 Winter Simulation Conference, held in December in Washington, D.C.

Dr. Deelman attended the POEMS Planning Meeting held in Houston on January 15th, 1999.

University of Texas at Austin

Browne attended the POEMS Planning Meeting in Houston, Texas, on January 15th, 1999.

University of Texas at El Paso

Teller attended Supercomputing '98, November 7-13, 1998, in Orlando, Florida, where she was a panelist (in Jim Browne's place) on the panel entitled "Using Performance Measures to Design Systems."

University of Wisconsin

Vernon attended the POEMS Planning Meeting in Houston, Texas, on January 15th, 1999.

10.0 Equipment

None Acquired

11.0 Summary of Activity

The main foci of collective activity have been completion of the measurements and modeling for Sweep3D on the IBM SP2, completion of POEMS components, and preparation for integration of these components into the POEMS framework.

Each participating institution has been working on its responsibilities for Tasks 1, 2, 3, 4, 6, 7 and 9.

11.1 Work Focus

The foci for activities are broken out by topic.

a. Knowledge Base

* Develop the POEMS knowledge base system (renamed PYTHIA II)

* Expand the performance data set in the knowledge base

b. Models, Model Evaluation and Modeling

* Porting MPI-SIM to the Origin 2000.

* Validations of the prototype component model of the Origin 2000 memory system.

* Hardware domain component model specification and implementation. In particular, implementation of the processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e and the SGI O2K MIPS R10000.

* Collection of LLNL-SP/2 Sweep3D task execution time for performance database.

* Interfacing of processor/memory subsystem simulation and MPI.

* Preparatory research and experimentation for calibration of processor/memory subsystem component model/simulator of the LLNL-SP/2 Power604e and SGI O2K MIPS R10000.

c. Task Graph Generation

Extending the dHPF compiler implementation of automatic synthesis of static, condensed static, and dynamic task graphs.

d. Framework – Methodology, Specification Language and Compiler

* Evaluation of the methodology and the specification language was the main effort under this topic. Several simple examples were generated in the POEMS specification language.

11.2 Significant Events

a) Development of a synthetic benchmark that can validate MPI-SIM for a wide range of programs with varying communication patterns and computation/communication ratios.

b) Porting of MPI-SIM and adaptation of its communication models to the SGI Origin 2000.

c) Explanation of the heterogeneous behavior of the SPMD Splash II benchmarks executing on the simulated SimOS architecture.

FINANCIAL INFORMATION:

Contract #: N66001-97-C-8533

Contract Period of Performance: 7/24/97-7/23/00

Ceiling Value: $1,839,517

Reporting Period: 11/01/98-1/31/99

Actual Vouchered (all costs to be reported as fully burdened, do not report overhead, GA and fee separately):

Current Period

Prime Contractor                   Hours      Cost ($)
Labor                                  0          0.00
ODC’s                                        14,544.31
Sub-contractor 1 (Purdue)            119     18,320.56
Sub-contractor 2 (UT-El Paso)        996     40,182.07
Sub-contractor 3 (UCLA)            1,280     68,467.14
Sub-contractor 4 (Rice)            1,095     57,810.65
Sub-contractor 5 (Wisconsin)         288     10,853.62
Sub-contractor 6 (Los Alamos)          0          0.00
TOTAL:                             3,778    210,178.35

Cumulative to date:

Prime Contractor                   Hours      Cost ($)
Labor                              5,960    177,578.01
ODC’s                                       247,279.96
Sub-contractor 1 (Purdue)          1,000     70,870.84
Sub-contractor 2 (UT-El Paso)      1,938    110,166.98
Sub-contractor 3 (UCLA)            1,510     76,344.86
Sub-contractor 4 (Rice)            1,771    109,006.21
Sub-contractor 5 (Wisconsin)       1,358     80,320.56
Sub-contractor 6 (Los Alamos)          0          0.00
TOTAL:                            13,537    871,562.42