Parallel Algorithms for Solving Linear Equations in VLSI Circuit
Simulation
Analog circuit simulation is indispensable for the design
of integrated circuits. Simulation complexity and expenditure grow with
increasing packing density of the circuits. Most of the computing time
is required for the recurrent direct solution of large systems of linear
equations Ax=b. Matrix A has a very sparse and irregular pattern of nonzero
coefficients, with changing values during the simulation process. To minimize
the number of fill-ins an empirical reordering method (Markowitz) is used.
In this contribution parallel algorithms are presented, which reduce simulation
time using multiprocessor machines with shared memory. Experimental results
rest upon efficient multithreaded methods. Elimination trees and elimination
graphs embody the dependence of data during the elimination process. Therefore,
they are important means for estimating and controlling parallel solution
algorithms. At the so-called coarse sparse-matrix granularity level, for
each variable a quantitative measure of the computational effort of the
corresponding elimination step is introduced yielding an estimate of the
available speed-up. Applying these instruments and a modification of Liu's
rotation method, the elimination tree is restructured such that the potential
parallelism is increased. This restructuring corresponds to reordering
the equations. Using an empirical algorithm, which reduces the critical
path length (CPL), the total elimination task is subdivided into a number
of partitions. In most cases the employed rules prevent disadvantageous
partitions, but often the result is unbalanced. This problem is solved
by increasing the number of partitions (pool of tasks) and using a corresponding
scheduling mechanism. With the CPL as relevant measure, the quality of
the partitioning is estimated comparing the elimination and partition tree.
Furthermore, experimental results are brought into comparison with theoretical
speed-ups, which are calculated in a symbolic scheduling process, where
any overhead like that for synchronization is neglected. For some examples
the expectations are exceeded. For the other examples the difference between
theoretical and experimental results, and the saturation behaviour is insignificant.
Despite the rotation method the speed-up is typically limited to a factor
of four. Therefore it is necessary to exploit the potential parallelism
based on medium sparse-matrix granularity to further increase speed-up.
Because of the synchronization overhead, the use of such a method is successful
in special matrix areas only. Such areas are detected by a cost analysis.
Static and dynamic multigranular methods, which are based on such a cost
analysis, are presented.
|