
Re: MPI Calculation Strategy with Lanczos Algorithm



Very nice. As for distributing the work across processors... this 
takes us into the general topic of load balancing, which we have 
avoided thus far by using homogeneous symmetric clusters and an 
algorithm whose computational time is the same at every energy. 
I think a simple way to do this is to work out, even if only roughly 
and empirically, how the number of iterations scales with energy and 
then give the processors doing the higher-energy part of the spectrum 
more energy points to calculate. If we can get a rough idea of the 
scaling, this slight adjustment of how we parcel out the work will be 
easy to implement. If we get fancier, the "worker" machines can each 
do a single energy and then come back and say "thank you sir, may I 
have another" to the master machine, which then assigns another 
energy point to that node.
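
If we go the fancier route, the master/worker idea is just the standard 
MPI self-scheduling pattern. A rough sketch in C is below (the same logic 
carries over to Fortran); do_energy(), self_schedule(), and the TAG_* 
constants are placeholder names, not anything in our code:

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

extern void do_energy(int ie);   /* stands in for the per-energy calculation */

void self_schedule(int nex)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: hand out one energy at a time */
        int next = 0, stopped = 0, dummy;
        MPI_Status st;
        while (stopped < size - 1) {
            /* wait for any worker to report in ("may I have another") */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < nex) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                         /* worker: ask, compute, repeat */
        int ie = -1;
        MPI_Status st;
        for (;;) {
            MPI_Send(&ie, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&ie, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            do_energy(ie);
        }
    }
}

The downside is that rank 0 does no energy points itself, but with the slow 
low-energy points dominating, the dynamic assignment should more than make 
up for losing one worker.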

If we get a 5-fold increase in speed, this will be really terrific. 
Right now we are at a point with the MPI calculation where the 
asymptotic speedup is about a factor of 33 over a single processor, 
because about 3% of the runtime is in calculations that are still 
executed sequentially. The Lanczos method will NOT change that 
asymptotic limit, but it means we could get to the limit with 5 
times fewer processors, which would be a pretty small cluster of 
10-12 machines.
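
(That factor is just Amdahl's law: with a sequential fraction s = 0.03, the 
speedup on N processors is 1/(s + (1-s)/N), which approaches 1/s = 1/0.03, 
i.e. about 33, as N grows, no matter how fast the parallel part becomes.)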

Jim, Howard and I are looking into the residual 3% of the runtime 
that is still sequential to see if we can find any other hot spots to 
get rid of. No final results on this yet, but we're looking.
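
On the reordering question in the message below: I don't think we need a 
special MPE_DECOMP1D variant for that at all; the cyclic (round-robin) 
assignment is just a strided loop. Roughly, in C (again, do_energy() is a 
placeholder for whatever each energy point requires):

#include <mpi.h>

extern void do_energy(int ie);   /* placeholder for the per-energy work */

void cyclic_loop(int nex)
{
    int myrank, numproc, ie;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numproc);

    /* rank i handles energies i, i+numproc, i+2*numproc, ...
       instead of a contiguous block of nloc = nex/numproc points */
    for (ie = myrank; ie < nex; ie += numproc)
        do_energy(ie);
}

That interleaves the expensive low-energy points across all the ranks, so no 
single processor is stuck with the whole slow end of the spectrum.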



>   The tests that Alex has been doing with Lanczos methods indicate that a
>van der Vorst scheme called BiCGSTAB (i.e. "stabilized bi-conjugate
>gradient", I think), which is essentially a Lanczos-type procedure used in
>place of LU, is both efficient and highly stable. Overall the method is
>typically about 5 times faster than LU.
>
>   Interestingly, the number of iterations required for convergence can be
>interpreted roughly as the order of the MS expansion used. For a 600 atom Si
>run, this number starts at about 50 iterations at low energies and then
>drops to about 5-6 at the highest energies. In path expansion terms, this
>means one needs paths with up to 50 legs near threshold for XANES but only
>5-leg paths in EXAFS. This shows how the calculation crosses over from full
>multiple scattering to the path expansion with increasing energy.
>
>   This also suggests a way of revising the MPI strategy. Presently we
>assign the first processor the first nloc = nex/numproc energies, e.g.,
>0 - nloc, the next nloc+1, ..., 2*nloc, etc. This works fine with the
>default LU method, when all processors do exactly the same N**3 steps of
>exact matrix inversion.
>
>However, with the Lanczos approach, the last processor will be sitting
>idle most of the time, having completed its work in 5 steps, while the
>1st processor will still be groaning away. Thus it's preferable to reorder
>the work, so that the processor of rank i does energy points i, i+numproc,
>i+2*numproc, ... and the work is more evenly spread out. Is there a version
>of MPE_DECOMP1D that can do this kind of ordering?
>
>Let me know if you have other comments on this idea.
>
>   This change does not have to be done immediately, of course. But it will
>likely be very useful in the future, especially when the big rotation matrix
>drix is reconfigured.
>
>    J. Rehr