

Allreduce on the AlphaServer SC and IBM SP



From: "Patrick H. Worley" <worleyph@ornl.gov>
Subject: Allreduce on the AlphaServer SC and IBM SP
To: perfdata@mailhub.ornl.gov

Allreduce is an important collective communication operator
for a couple of codes that I care about. Historically, I have
only considered allreduce performance implicitly, as part of
larger kernels. This approach has its limitations, and last
March I started developing a test code that focuses
solely on allreduce performance. This test code has a few
quirks that are not found in current MPI test suites:

1) It tests a number of parallel implementations,
    not just MPI_ALLREDUCE. (MPI_ALLREDUCE
    is not always implemented as efficiently as it might
    be.)
2) The code includes tests for a range of vector sizes, up
    to 2MB currently. At least one of my application codes
    depends on the reduction of long vectors.
3) The code includes an option to test multiple allreduces
    over disjoint groups. For example, in a 2D domain
    decomposition, the allreduce may operate in only one
    direction, with each allreduce restricted to a subset of
    the processors. This has interesting implications on SMP
    cluster architectures, or on clusters where performance
    depends on whether physically contiguous nodes are
    participating in an allreduce. (A rough sketch of this
    kind of test follows the list.)
4) The code supports testing a number of different assumptions
    about the state of the cache.
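
For anyone not familiar with this kind of test, here is a minimal
sketch of the sort of measurement described above. It is not the
actual test code: the number of groups, the vector lengths, the
trial count, and the reduce-plus-broadcast variant are illustrative
stand-ins, not the real settings or the real set of alternative
implementations.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NGROUPS 4        /* assumed number of disjoint subgroups  */
#define MAXLEN  262144   /* 262144 doubles = 2 MB, largest vector */
#define NTRIAL  20       /* trials per vector length              */

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Split the processes into disjoint subgroups (e.g. the rows of
     * a 2D decomposition), so each allreduce involves only
     * nproc/NGROUPS processes. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank % NGROUPS, rank, &group);

    double *in  = malloc(MAXLEN * sizeof(double));
    double *out = malloc(MAXLEN * sizeof(double));
    for (int i = 0; i < MAXLEN; i++) in[i] = (double)rank;

    for (int len = 1; len <= MAXLEN; len *= 4) {
        /* Library allreduce over the subgroup. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int t = 0; t < NTRIAL; t++)
            MPI_Allreduce(in, out, len, MPI_DOUBLE, MPI_SUM, group);
        double t_lib = (MPI_Wtime() - t0) / NTRIAL;

        /* One hand-rolled alternative: reduce to the subgroup root,
         * then broadcast the result back out. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int t = 0; t < NTRIAL; t++) {
            MPI_Reduce(in, out, len, MPI_DOUBLE, MPI_SUM, 0, group);
            MPI_Bcast(out, len, MPI_DOUBLE, 0, group);
        }
        double t_rb = (MPI_Wtime() - t0) / NTRIAL;

        if (rank == 0)
            printf("len %7d  allreduce %.6f s  reduce+bcast %.6f s\n",
                   len, t_lib, t_rb);
    }

    free(in); free(out);
    MPI_Comm_free(&group);
    MPI_Finalize();
    return 0;
}

The MPI_Comm_split call is what makes the disjoint-group tests of
item 3 possible: each subgroup reduces independently, so any
interaction between the groups (or with the node layout) shows up
directly in the timings.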

Over the summer, I have been using the code to look at two
performance questions:

a) What is the performance impact of Version 2.0 of the
AlphaServer SC software? (The new software supposedly makes
better use of hardware support in the Quadrics switch for MPI
collective communications, and does a better job with the
shared-memory MPI within a node.)

b) What are the scaling characteristics of allreduce on
the Winterhawk II and Nighthawk II IBM SP systems?
I had heard that allreduce (and other collectives?) was
scaling poorly on the Nighthawk II-based systems, and I
wondered what the nature of the problem was and whether the
Winterhawk II systems suffered from similar problems.

My initial results (and preliminary analyses) can be viewed at

http://www.csm.ornl.gov/evaluation/ALLREDUCE

I have lots of data, and am still working out how best to view
and present it. I may also need to collect some more, but there
are some interesting results already.

As far as I can tell, the scaling issue is not a problem with the
collective communication operator per se. Rather, any (partially)
synchronous operation that lasts long enough will run into
unrelated system interrupts that perturb the performance. If the
duration of the operation is very short, the likelihood of an
interrupt (within the operation) is low. If the duration is long,
then the impact of the interrupt appears to be negligible (for
the allreduce experiments on these systems, at least). The
"medium grain" events are the ones that appear to be most
sensitive.

If the system interrupts afflict only the node, and not
individual processes, then shouldn't underloading the SMP nodes
solve the problem? Is there another solution that does not waste
processors?

I know that others (who own Nighthawk II
systems) have been very concerned with this issue. Am I missing
something here?

Pat Worley
