All-Pairs Shortest Paths
Instead of finding the shortest paths from a single vertex v to every other vertex, we are sometimes interested in finding the shortest paths between all pairs of vertices. Formally, given a weighted graph G = (V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices v_{i}, v_{j} ∈ V such that i ≠ j. For a graph with n vertices, the output of an all-pairs shortest paths algorithm is an n × n matrix D = (d_{i,j}) such that d_{i,j} is the cost of the shortest path from vertex v_{i} to vertex v_{j}.
The following sections present two algorithms to solve the all-pairs shortest paths problem. The first algorithm uses Dijkstra’s single-source shortest paths algorithm, and the second uses Floyd’s algorithm. Dijkstra’s algorithm requires nonnegative edge weights (Problem 10.4), whereas Floyd’s algorithm works with graphs having negative-weight edges provided they contain no negative-weight cycles.
10.4.1 Dijkstra’s Algorithm
In Section 10.3 we presented Dijkstra’s algorithm for finding the shortest paths from a vertex v to all the other vertices in a graph. This algorithm can also be used to solve the all-pairs shortest paths problem by executing the single-source algorithm once for each vertex v. We refer to this algorithm as Dijkstra’s all-pairs shortest paths algorithm. Since the complexity of Dijkstra’s single-source algorithm is Θ(n^2), the complexity of the all-pairs algorithm is Θ(n^3).
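The repeated-single-source idea can be sketched in a few lines. The sketch below uses a binary-heap Dijkstra (so its cost per source is O((n + m) log n) rather than the Θ(n^2) array-based version the analysis assumes); the function names and the dict-of-adjacency-lists representation are our own choices, not the book’s.

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest path distances from src.

    adj maps each vertex to a list of (neighbor, weight) pairs with
    nonnegative weights, as Dijkstra's algorithm requires.
    """
    dist = {v: float("inf") for v in adj}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; a shorter path was already found
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def all_pairs_dijkstra(adj):
    """All-pairs shortest paths: run the single-source algorithm
    once per vertex (n runs in total)."""
    return {src: dijkstra(adj, src) for src in adj}
```

Unreachable pairs come back as infinity, matching the convention that a missing edge has infinite weight.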
Parallel Formulations
Dijkstra’s all-pairs shortest paths problem can be parallelized in two distinct ways. One approach partitions the vertices among different processes and has each process compute the single-source shortest paths for all vertices assigned to it. We refer to this approach as the source-partitioned formulation. Another approach assigns each vertex to a set of processes and uses the parallel formulation of the single-source algorithm (Section 10.3) to solve the problem on each set of processes. We refer to this approach as the source-parallel formulation. The following sections discuss and analyze these two approaches.
Source-Partitioned Formulation The source-partitioned parallel formulation of Dijkstra’s algorithm uses n processes. Each process P_{i} finds the shortest paths from vertex v_{i} to all other vertices by executing Dijkstra’s sequential single-source shortest paths algorithm. It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes). Thus, the parallel run time of this formulation is given by

T_P = Θ(n^2)
Since the sequential run time is W = Θ(n^3), the speedup and efficiency are as follows:
Equation 10.2

S = Θ(n^3)/Θ(n^2) = Θ(n)
E = S/n = Θ(1)
It might seem that, due to the absence of communication, this is an excellent parallel formulation. However, that is not entirely true. The algorithm can use at most n processes. Therefore, the isoefficiency function due to concurrency is Θ(p^3), which is the overall isoefficiency function of the algorithm. If the number of processes available for solving the problem is small (that is, n = Θ(p)), then this algorithm has good performance. However, if the number of processes is greater than n, other algorithms will eventually outperform this algorithm because of its poor scalability.
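The source-partitioned scheme can be sketched as follows. This is an illustration of the partitioning idea only: it uses Python threads (which, because of the global interpreter lock, share one interpreter and do not give true parallel speedup for CPU-bound work), and the function names and round-robin partitioning are our own assumptions. The key property from the text is visible, though: each worker operates on the shared (effectively replicated) adjacency structure and needs no communication with the others.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def dijkstra(adj, src):
    """Sequential single-source shortest paths (nonnegative weights)."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def source_partitioned_apsp(adj, num_workers=4):
    """Source-partitioned sketch: split the source vertices among
    workers; each worker solves its single-source problems
    independently, with no inter-worker communication."""
    vertices = list(adj)
    # Round-robin partition of source vertices among the workers.
    parts = [vertices[i::num_workers] for i in range(num_workers)]

    def solve_part(sources):
        return {s: dijkstra(adj, s) for s in sources}

    result = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(solve_part, parts):
            result.update(partial)
    return result
```

With at most n sources to hand out, no more than n workers can ever be kept busy, which is exactly the concurrency limit the analysis above identifies.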
Source-Parallel Formulation The major problem with the source-partitioned parallel formulation is that it can keep only n processes busy doing useful work. Performance can be improved if the parallel formulation of Dijkstra’s single-source algorithm (Section 10.3) is used to solve the problem for each vertex v. The source-parallel formulation is similar to the source-partitioned formulation, except that the single-source algorithm runs on disjoint subsets of processes.
Specifically, p processes are divided into n partitions, each with p/n processes (this formulation is of interest only if p > n). Each of the n single-source shortest paths problems is solved by one of the n partitions. In other words, we first parallelize the all-pairs shortest paths problem by assigning each vertex to a separate set of processes, and then parallelize the single-source algorithm by using the set of p/n processes to solve it. The total number of processes that can be used efficiently by this formulation is O(n^2).
The analysis presented in Section 10.3 can be used to derive the performance of this formulation of Dijkstra’s all-pairs algorithm. Assume that we have a p-process message-passing computer such that p is a multiple of n. The p processes are partitioned into n groups of size p/n each. If the single-source algorithm is executed on each p/n process group, the parallel run time is
Equation 10.3

T_P = Θ(n^3/p) + Θ(n log p)
Notice the similarities between Equations 10.3 and 10.2. These similarities are not surprising because each set of p/n processes forms a different group and carries out the computation independently. Thus, the time required by each set of p/n processes to solve the single-source problem determines the overall run time. Since the sequential run time is W = Θ(n^3), the speedup and efficiency are as follows:
Equation 10.4

S = Θ(n^3) / (Θ(n^3/p) + Θ(n log p))
E = 1 / (1 + Θ((p log p)/n^2))
From Equation 10.4 we see that for a cost-optimal formulation (p log p)/n^2 = O(1). Hence, this formulation can use up to O(n^2/log n) processes efficiently. Equation 10.4 also shows that the isoefficiency function due to communication is Θ((p log p)^{1.5}). The isoefficiency function due to concurrency is Θ(p^{1.5}). Thus, the overall isoefficiency function is Θ((p log p)^{1.5}).
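The isoefficiency claim follows directly from the efficiency expression above; a short derivation (restating Equation 10.4, not an addition to it) makes the step explicit:

```latex
% Isoefficiency of the source-parallel formulation (sketch).
% Holding efficiency constant in
%   E = 1 / (1 + \Theta((p \log p)/n^2))
% requires the overhead ratio to stay bounded:
\frac{p \log p}{n^{2}} = O(1)
  \quad\Longrightarrow\quad
n = \Omega\!\bigl(\sqrt{p \log p}\bigr)
  \quad\Longrightarrow\quad
W = \Theta(n^{3}) = \Omega\!\bigl((p \log p)^{3/2}\bigr),
% i.e., the isoefficiency function due to communication is
% \Theta((p \log p)^{1.5}), as stated in the text.
```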
Comparing the two parallel formulations of Dijkstra’s all-pairs algorithm, we see that the source-partitioned formulation performs no communication, can use no more than n processes, and solves the problem in time Θ(n^2). In contrast, the source-parallel formulation uses up to n^2/log n processes, has some communication overhead, and solves the problem in time Θ(n log n) when n^2/log n processes are used. Thus, the source-parallel formulation exploits more parallelism than does the source-partitioned formulation.
10.4.2 Floyd’s Algorithm
Floyd’s algorithm for solving the all-pairs shortest paths problem is based on the following observation. Let G = (V, E, w) be the weighted graph, and let V = {v_{1}, v_{2}, ..., v_{n}} be the vertices of G. Consider a subset {v_{1}, v_{2}, ..., v_{k}} of vertices for some k where k ≤ n. For any pair of vertices v_{i}, v_{j} ∈ V, consider all paths from v_{i} to v_{j} whose intermediate vertices belong to the set {v_{1}, v_{2}, ..., v_{k}}. Let p_{i,j}^{(k)} be the minimum-weight path among them, and let d_{i,j}^{(k)} be the weight of p_{i,j}^{(k)}. If vertex v_{k} is not in the shortest path from v_{i} to v_{j}, then p_{i,j}^{(k)} is the same as p_{i,j}^{(k-1)}. However, if v_{k} is in p_{i,j}^{(k)}, then we can break p_{i,j}^{(k)} into two paths: one from v_{i} to v_{k} and one from v_{k} to v_{j}. Each of these paths uses vertices from {v_{1}, v_{2}, ..., v_{k-1}}. Thus, d_{i,j}^{(k)} = d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)}. These observations are expressed in the following recurrence equation:
Equation 10.5

d_{i,j}^{(k)} = w(v_{i}, v_{j})                                          if k = 0
d_{i,j}^{(k)} = min { d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} }   if k ≥ 1
The length of the shortest path from v_{i} to v_{j} is given by d_{i,j}^{(n)}. In general, the solution is a matrix D^{(n)} = (d_{i,j}^{(n)}).
Floyd’s algorithm solves Equation 10.5 bottom-up in the order of increasing values of k. Algorithm 10.3 shows Floyd’s all-pairs algorithm. The run time of Floyd’s algorithm is determined by the triple-nested for loops in lines 4-7. Each execution of line 7 takes time Θ(1); thus, the complexity of the algorithm is Θ(n^3). Algorithm 10.3 seems to imply that we must store n matrices of size n × n. However, when computing matrix D^{(k)}, only matrix D^{(k-1)} is needed. Consequently, at most two n × n matrices must be stored. Therefore, the overall space complexity is Θ(n^2). Furthermore, the algorithm works correctly even when only one copy of D is used (Problem 10.6).
Algorithm 10.3 Floyd’s all-pairs shortest paths algorithm. This program computes the all-pairs shortest paths of the graph G = (V, E) with adjacency matrix A.
1.   procedure FLOYD_ALL_PAIRS_SP(A)
2.   begin
3.      D^{(0)} := A;
4.      for k := 1 to n do
5.          for i := 1 to n do
6.              for j := 1 to n do
7.                  d_{i,j}^{(k)} := min(d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)});
8.   end FLOYD_ALL_PAIRS_SP
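Algorithm 10.3 translates almost line for line into Python. The sketch below uses the single-matrix, in-place variant that the text notes is still correct (Problem 10.6), so only Θ(n^2) space is used; the function name and the use of `float("inf")` for missing edges are our conventions.

```python
INF = float("inf")  # weight of a missing edge

def floyd_all_pairs_sp(A):
    """Floyd's all-pairs shortest paths (after Algorithm 10.3).

    A is the n x n weight matrix: A[i][j] is w(v_i, v_j), INF when
    there is no edge, and A[i][i] == 0.  A single matrix D is updated
    in place, which remains correct because row k and column k do not
    change during iteration k.
    """
    n = len(A)
    D = [row[:] for row in A]          # D^(0) = A
    for k in range(n):                 # triple-nested loops: Theta(n^3)
        for i in range(n):
            for j in range(n):
                # Recurrence 10.5: allow v_k as an intermediate vertex.
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D
```

Each execution of the innermost update is Θ(1), so the total work is Θ(n^3), matching the analysis above.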
Parallel Formulation
A generic parallel formulation of Floyd’s algorithm assigns the task of computing matrix D^{(k)} for each value of k to a set of processes. Let p be the number of processes available. Matrix D^{(k)} is partitioned into p parts, and each part is assigned to a process. Each process computes the D^{(k)} values of its partition. To accomplish this, a process must access the corresponding segments of the k^{th} row and column of matrix D^{(k-1)}. The following section describes one technique for partitioning matrix D^{(k)}. Another technique is considered in Problem 10.8.
2D Block Mapping One way to partition matrix D^{(k)} is to use the 2D block mapping (Section 3.4.1). Specifically, matrix D^{(k)} is divided into p blocks of size (n/√p) × (n/√p), and each block is assigned to one of the p processes. It is helpful to think of the p processes as arranged in a logical grid of size √p × √p. Note that this is only a conceptual layout and does not necessarily reflect the actual interconnection network. We refer to the process on the i^{th} row and j^{th} column as P_{i,j}. Process P_{i,j} is assigned a subblock of D^{(k)} whose upper-left corner is ((i - 1)n/√p + 1, (j - 1)n/√p + 1) and whose lower-right corner is (in/√p, jn/√p). Each process updates its part of the matrix during each iteration. Figure 10.7(a) illustrates the 2D block mapping technique.
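The corner formulas above can be made concrete with a small helper. This is a hypothetical utility of our own (names and signature are not from the text); it uses 1-based matrix indices as the text does and assumes p is a perfect square with √p dividing n.

```python
import math

def subblock(i, j, n, p):
    """Corners of the subblock of D^(k) owned by process P_{i,j}
    under the 2D block mapping (1-based indices, as in the text).

    Assumes p is a perfect square and sqrt(p) divides n.
    """
    q = math.isqrt(p)          # processes form a q x q logical grid
    assert q * q == p and n % q == 0
    b = n // q                 # each block is b x b, with b = n / sqrt(p)
    upper_left = ((i - 1) * b + 1, (j - 1) * b + 1)
    lower_right = (i * b, j * b)
    return upper_left, lower_right
```

For example, with n = 8 and p = 16 the grid is 4 × 4 with 2 × 2 blocks, and P_{3,2} owns rows 5-6 and columns 3-4.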
Figure 10.7. (a) Matrix D^{(k)} distributed by 2D block mapping into subblocks, and (b) the subblock of D^{(k)} assigned to process P_{i,j}.
During the k^{th} iteration of the algorithm, each process P_{i,j} needs certain segments of the k^{th} row and k^{th} column of the D^{(k-1)} matrix. For example, to compute d_{l,r}^{(k)} it must get d_{l,k}^{(k-1)} and d_{k,r}^{(k-1)}. As Figure 10.8 illustrates, d_{l,k}^{(k-1)} resides on a process along the same row, and d_{k,r}^{(k-1)} resides on a process along the same column as the process that computes d_{l,r}^{(k)}. Segments are transferred as follows. During the k^{th} iteration of the algorithm, each of the processes containing part of the k^{th} row sends it to the processes in the same column. Similarly, each of the processes containing part of the k^{th} column sends it to the processes in the same row.
Figure 10.8. (a) Communication patterns used in the 2D block mapping. When computing d_{l,r}^{(k)}, information must be sent to the highlighted process from two other processes along the same row and column. (b) The row and column of processes that contain the k^{th} row and column send them along process columns and rows.
Algorithm 10.4 shows the parallel formulation of Floyd’s algorithm using the 2D block mapping. We analyze the performance of this algorithm on a p-process message-passing computer with a bisection bandwidth of Θ(p). During each iteration of the algorithm, the k^{th} row and k^{th} column of processes perform a one-to-all broadcast along a row or a column of processes. Each such process has n/√p elements of the k^{th} row or column, so it sends n/√p elements. This broadcast requires time Θ((n log p)/√p). The synchronization step on line 7 requires time Θ(log p). Since each process is assigned n^2/p elements of the D^{(k)} matrix, the time to compute corresponding D^{(k)} values is Θ(n^2/p). Therefore, over all n iterations, the parallel run time of the 2D block mapping formulation of Floyd’s algorithm is

T_P = Θ(n^3/p) + Θ((n^2 log p)/√p)
Since the sequential run time is W = Θ(n^3), the speedup and efficiency are as follows:
Equation 10.6

S = Θ(n^3) / (Θ(n^3/p) + Θ((n^2 log p)/√p))
E = 1 / (1 + Θ((√p log p)/n))
From Equation 10.6 we see that for a cost-optimal formulation (√p log p)/n = O(1); thus, 2D block mapping can efficiently use up to O(n^2/log^2 n) processes. Equation 10.6 can also be used to derive the isoefficiency function due to communication, which is Θ(p^{1.5} log^3 p). The isoefficiency function due to concurrency is Θ(p^{1.5}). Thus, the overall isoefficiency function is Θ(p^{1.5} log^3 p).
Speeding Things Up In the 2D block mapping formulation of Floyd’s algorithm, a synchronization step ensures that all processes have the appropriate segments of matrix D^{(k-1)} before computing elements of matrix D^{(k)} (line 7 in Algorithm 10.4). In other words, the k^{th} iteration starts only when the (k-1)^{th} iteration has completed and the relevant parts of matrix D^{(k-1)} have been transmitted to all processes. The synchronization step can be removed without affecting the correctness of the algorithm. To accomplish this, a process starts working on the k^{th} iteration as soon as it has computed the (k-1)^{th} iteration and has the relevant parts of the D^{(k-1)} matrix. This formulation is called pipelined 2D block mapping. A similar technique is used in Section 8.3 to improve the performance of Gaussian elimination.
Algorithm 10.4 Floyd’s parallel formulation using the 2D block mapping. P_{*,j} denotes all the processes in the j^{th} column, and P_{i,*} denotes all the processes in the i^{th} row. The matrix D^{(0)} is the adjacency matrix.
1.   procedure FLOYD_2DBLOCK(D^{(0)})
2.   begin
3.      for k := 1 to n do
4.      begin
5.         each process P_{i,j} that has a segment of the k^{th} row of D^{(k-1)} broadcasts it to the P_{*,j} processes;
6.         each process P_{i,j} that has a segment of the k^{th} column of D^{(k-1)} broadcasts it to the P_{i,*} processes;
7.         each process waits to receive the needed segments;
8.         each process P_{i,j} computes its part of the D^{(k)} matrix;
9.      end
10.  end FLOYD_2DBLOCK
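The data-dependence structure behind Algorithm 10.4 can be checked with a serial simulation: each "process" owns one block, and in iteration k every block is updated using only the broadcast k^{th} row and column of the previous matrix, never another process's block. This is a sketch of our own (a simulation, not a message-passing implementation); the function name and the assumption that √p divides n are ours.

```python
INF = float("inf")

def floyd_2d_block_sim(A, q):
    """Serial simulation of the 2D block mapping (after Algorithm 10.4).

    The n x n matrix is split into a q x q grid of blocks, one per
    simulated process (p = q*q).  In iteration k each block is updated
    using only the k-th row and k-th column of the previous matrix --
    exactly the data the row and column broadcasts would deliver.
    """
    n = len(A)
    assert n % q == 0, "this sketch assumes sqrt(p) divides n"
    b = n // q                         # block side, n / sqrt(p)
    D = [row[:] for row in A]          # D^(0) = adjacency matrix
    for k in range(n):
        # "Broadcast": snapshot the k-th row and column of D^(k-1).
        row_k = D[k][:]
        col_k = [D[i][k] for i in range(n)]
        # Each process P_{I,J} updates its own b x b block independently.
        for I in range(q):
            for J in range(q):
                for i in range(I * b, (I + 1) * b):
                    for j in range(J * b, (J + 1) * b):
                        if col_k[i] + row_k[j] < D[i][j]:
                            D[i][j] = col_k[i] + row_k[j]
    return D
```

Because row k and column k of D do not change during iteration k, the snapshot taken before the block updates equals what the in-place sequential algorithm would read, so the simulation computes the same D^{(n)} as Algorithm 10.3.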
Consider a p-process system arranged in a two-dimensional topology. Assume that process P_{i,j} starts working on the k^{th} iteration as soon as it has finished the (k-1)^{th} iteration and has received the relevant parts of the D^{(k-1)} matrix. When process P_{i,j} has elements of the k^{th} row and has finished the (k-1)^{th} iteration, it sends the part of matrix D^{(k-1)} stored locally to processes P_{i-1,j} and P_{i+1,j}. It does this because that part of the D^{(k-1)} matrix is used to compute the D^{(k)} matrix. Similarly, when process P_{i,j} has elements of the k^{th} column and has finished the (k-1)^{th} iteration, it sends the part of matrix D^{(k-1)} stored locally to processes P_{i,j-1} and P_{i,j+1}. When process P_{i,j} receives elements of matrix D^{(k-1)} from a process along its row in the logical mesh, it stores them locally and forwards them to the process on the side opposite from where it received them. The columns follow a similar communication protocol. Elements of matrix D^{(k-1)} are not forwarded when they reach a mesh boundary. Figure 10.9 illustrates this communication and termination protocol for processes within a row (or a column).
Figure 10.9. Communication protocol followed in the pipelined 2D block mapping formulation of Floyd’s algorithm. Assume that process 4 at time t has just computed a segment of the k^{th} column of the D^{(k-1)} matrix. It sends the segment to processes 3 and 5. These processes receive the segment at time t + 1 (where the time unit is the time it takes for a matrix segment to travel over the communication link between adjacent processes). Similarly, processes farther away from process 4 receive the segment later. Process 1 (at the boundary) does not forward the segment after receiving it.
Consider the movement of values in the first iteration. In each step, elements of the first row are sent from process P_{i,j} to process P_{i+1,j}. Similarly, elements of the first column are sent from process P_{i,j} to process P_{i,j+1}. Each such step takes time Θ(n/√p). After Θ(√p) steps, process P_{√p,√p} gets the relevant elements of the first row and first column in time Θ(n). The values of successive rows and columns follow after time Θ(n^2/p) in a pipelined mode. Hence, process P_{√p,√p} finishes its share of the shortest path computation in time Θ(n^3/p) + Θ(n). When process P_{√p,√p} has finished the (n-1)^{th} iteration, it sends the relevant values of the n^{th} row and column to the other processes. These values reach process P_{1,1} in time Θ(n). The overall parallel run time of this formulation is

T_P = Θ(n^3/p) + Θ(n)
Since the sequential run time is W = Θ(n^3), the speedup and efficiency are as follows:
Equation 10.7

S = Θ(n^3) / (Θ(n^3/p) + Θ(n))
E = 1 / (1 + Θ(p/n^2))
Table 10.1. The performance and scalability of the all-pairs shortest paths algorithms on various architectures with O(p) bisection bandwidth. Similar run times apply to all k-d cube architectures, provided that processes are properly mapped to the underlying processors.

Formulation                 | Maximum Number of Processes for E = Θ(1) | Corresponding Parallel Run Time | Isoefficiency Function
Dijkstra source-partitioned | Θ(n)                                     | Θ(n^2)                          | Θ(p^3)
Dijkstra source-parallel    | Θ(n^2/log n)                             | Θ(n log n)                      | Θ((p log p)^{1.5})
Floyd 1D block              | Θ(n/log n)                               | Θ(n^2 log n)                    | Θ((p log p)^3)
Floyd 2D block              | Θ(n^2/log^2 n)                           | Θ(n log^2 n)                    | Θ(p^{1.5} log^3 p)
Floyd pipelined 2D block    | Θ(n^2)                                   | Θ(n)                            | Θ(p^{1.5})
From Equation 10.7 we see that for a cost-optimal formulation p/n^2 = O(1). Thus, the pipelined formulation of Floyd’s algorithm uses up to O(n^2) processes efficiently. Also from Equation 10.7, we can derive the isoefficiency function due to communication, which is Θ(p^{1.5}). This is the overall isoefficiency function as well. Comparing the pipelined formulation to the synchronized 2D block mapping formulation, we see that the former is significantly faster.
10.4.3 Performance Comparisons
The performance of the all-pairs shortest paths algorithms previously presented is summarized in Table 10.1 for a parallel architecture with O(p) bisection bandwidth. Floyd’s pipelined formulation is the most scalable and can use up to Θ(n^2) processes to solve the problem in time Θ(n). Moreover, this parallel formulation performs equally well even on architectures with bisection bandwidth O(√p), such as a mesh-connected computer. Furthermore, its performance is independent of the type of routing (store-and-forward or cut-through).