

Allreduce on the AlphaServer SC and IBM SP



From: "Patrick H. Worley" <worleyph@ornl.gov>
Subject: Allreduce on the AlphaServer SC and IBM SP
To: perfdata@mailhub.ornl.gov

Allreduce is an important collective communication operator
for a couple of codes that I care about. Historically, I have
only considered allreduce performance implicitly, as part of
larger kernels. This approach has its limitations, and last
March I started developing a test code that focuses
solely on allreduce performance. This test code has a few
quirks that are not found in current MPI test suites:

1) It tests a number of parallel implementations,
    not just MPI_ALLREDUCE. (MPI_ALLREDUCE
    is not always implemented as efficiently as it might
    be.)
2) The code includes tests for a range of vector sizes, up
    to 2MB currently. At least one of my application codes
    depends on the reduction of long vectors.
3) The code includes an option to test multiple allreduces
    over disjoint groups. For example, in a 2D domain
    decomposition, the allreduce may operate in only one
    direction, with each allreduce restricted to a subset of
    the processors. This has interesting implications on SMP
    cluster architectures, or on clusters where performance
    depends on whether physically contiguous nodes are
    participating in an allreduce. (A rough sketch of this
    kind of test follows the list.)
4) The code supports testing a number of different assumptions
    about the state of the cache.
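
For anyone not familiar with this kind of test, here is a minimal
sketch of the sort of measurement described above. It is not the
actual test code: the number of groups, the vector lengths, the
trial count, and the reduce-plus-broadcast variant are illustrative
stand-ins, not the real settings or the real set of alternative
implementations.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NGROUPS 4        /* assumed number of disjoint subgroups  */
#define MAXLEN  262144   /* 262144 doubles = 2 MB, largest vector */
#define NTRIAL  20       /* trials per vector length              */

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Split the processes into disjoint subgroups (e.g. the rows of
     * a 2D decomposition), so each allreduce involves only
     * nproc/NGROUPS processes. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank % NGROUPS, rank, &group);

    double *in  = malloc(MAXLEN * sizeof(double));
    double *out = malloc(MAXLEN * sizeof(double));
    for (int i = 0; i < MAXLEN; i++) in[i] = (double)rank;

    for (int len = 1; len <= MAXLEN; len *= 4) {
        /* Library allreduce over the subgroup. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int t = 0; t < NTRIAL; t++)
            MPI_Allreduce(in, out, len, MPI_DOUBLE, MPI_SUM, group);
        double t_lib = (MPI_Wtime() - t0) / NTRIAL;

        /* One hand-rolled alternative: reduce to the subgroup root,
         * then broadcast the result back out. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int t = 0; t < NTRIAL; t++) {
            MPI_Reduce(in, out, len, MPI_DOUBLE, MPI_SUM, 0, group);
            MPI_Bcast(out, len, MPI_DOUBLE, 0, group);
        }
        double t_rb = (MPI_Wtime() - t0) / NTRIAL;

        if (rank == 0)
            printf("len %7d  allreduce %.6f s  reduce+bcast %.6f s\n",
                   len, t_lib, t_rb);
    }

    free(in); free(out);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}

The MPI_Comm_split call is what makes the disjoint-group tests of
item 3 possible: each subgroup reduces independently, so any
interaction between the groups (or with the node layout) shows up
directly in the timings.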

Over the summer, I have been using the code to look at two
performance questions:

a) What is the performance impact of Version 2.0 of the
AlphaServer SC software? (The new software supposedly makes
better use of hardware support in the Quadrics switch for MPI
collective communications, and does a better job with the
shared-memory MPI within a node.)

b) What are the scaling characteristics of allreduce on
the Winterhawk II and Nighthawk II IBM SP systems?
I had heard that allreduce (and other collectives?) was
scaling poorly on the Nighthawk II-based systems, and I
wondered what the nature of the problem was and whether the
Winterhawk II systems suffered from similar problems.

My initial results (and preliminary analyses) can be viewed at

http://www.csm.ornl.gov/evaluation/ALLREDUCE

I have lots of data, and am still working out how best to view
and present it. I may also need to collect some more, but there
are some interesting results already.

As far as I can tell, the scaling issue is not a problem with the
collective communication operator per se. Rather, any (partially)
synchronous operation that lasts long enough will run into
unrelated system interrupts that perturb the performance. If the
duration of the operation is very short, the likelihood of an
interrupt (within the operation) is low. If the duration is long,
then the impact of the interrupt appears to be negligible (for
the allreduce experiments on these systems, at least). The
"medium grain" events are the ones that appear to be most
sensitive.

If the system interrupts afflict only the node, and not
individual processes, then shouldn't underloading the SMP nodes
solve the problem? Is there another solution that does not waste
processors?

I know that others (who own Nighthawk II
systems) have been very concerned with this issue. Am I missing
something here?

Pat Worley
