# Issues in Sorting on Parallel Computers

Parallelizing a sequential sorting algorithm involves distributing the elements to be sorted onto the available processes. This process raises a number of issues that we must address in order to make the presentation of parallel sorting algorithms clearer.

Contents

- 1 9.1.1 Where the Input and Output Sequences are Stored
- 2 9.1.2 How Comparisons are Performed
- 2.1 One Element Per Process
- 2.2 Figure 9.1. A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai, aj}, and Pj keeps max{ai , aj}.
- 2.3 More than One Element Per Process
- 2.4 Figure 9.2. A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.

#### 9.1.1 Where the Input and Output Sequences are Stored

In sequential sorting algorithms, the input and the sorted sequences are stored in the process’s memory. However, in parallel sorting there are two places where these sequences can reside. They may be stored on only one of the processes, or they may be distributed among the processes. The latter approach is particularly useful if sorting is an intermediate step in another algorithm. In this chapter, we assume that the input and sorted sequences are distributed among the processes.

Consider the precise distribution of the sorted output sequence among the processes. A general method of distribution is to enumerate the processes and use this enumeration to specify a global ordering for the sorted sequence. In other words, the sequence will be sorted with respect to this process enumeration. For instance, if P_{i} comes before P_{j} in the enumeration, all the elements stored in P_{i} will be smaller than those stored in P_{j} . We can enumerate the processes in many ways. For certain parallel algorithms and interconnection networks, some enumerations lead to more efficient parallel formulations than others.

#### 9.1.2 How Comparisons are Performed

A sequential sorting algorithm can easily perform a compare-exchange on two elements because they are stored locally in the process’s memory. In parallel sorting algorithms, this step is not so easy. If the elements reside on the same process, the comparison can be done easily. But if the elements reside on different processes, the situation becomes more complicated.

##### One Element Per Process

Consider the case in which each process holds only one element of the sequence to be sorted. At some point in the execution of the algorithm, a pair of processes (P_{i}, P_{j}) may need to compare their elements, a_{i} and a_{j}. After the comparison, P_{i} will hold the smaller and P_{j} the larger of {a_{i}, a_{j}}. We can perform comparison by having both processes send their elements to each other. Each process compares the received element with its own and retains the appropriate element. In our example, P_{i} will keep the smaller and P_{j} will keep the larger of {a_{i}, a_{j}}. As in the sequential case, we refer to this operation as compare-exchange. As Figure 9.1illustrates, each compare-exchange operation requires one comparison step and one communication step.

##### Figure 9.1. A parallel compare-exchange operation. Processes P_{i} and P_{j} send their elements to each other. Process P_{i} keeps min{a_{i}, a_{j}}, and P_{j} keeps max{a_{i} , a_{j}}.

If we assume that processes P_{i} and P_{j} are neighbors, and the communication channels are bidirectional, then the communication cost of a compare-exchange step is (t_{s} + t_{w}), where t_{s} and t_{w} are message-startup time and per-word transfer time, respectively. In commercially available message-passing computers, t_{s} is significantly larger than t_{w}, so the communication time is dominated by t_{s}. Note that in today’s parallel computers it takes more time to send an element from one process to another than it takes to compare the elements. Consequently, any parallel sorting formulation that uses as many processes as elements to be sorted will deliver very poor performance because the overall parallel run time will be dominated by interprocess communication.

##### More than One Element Per Process

A general-purpose parallel sorting algorithm must be able to sort a large sequence with a relatively small number of processes. Let p be the number of processes P_{0}, P_{1}, …, P_{p}_{-1}, and let n be the number of elements to be sorted. Each process is assigned a block of n/p elements, and all the processes cooperate to sort the sequence. Let A_{0}, A_{1}, … A _{p}_{-1} be the blocks assigned to processes P_{0}, P_{1}, … P_{p}_{-1}, respectively. We say that A_{i} A_{j} if every element of A_{i} is less than or equal to every element in A_{j}. When the sorting algorithm finishes, each process P_{i} holds a set such that for i j, and .

As in the one-element-per-process case, two processes P_{i} and P_{j} may have to redistribute their blocks of n/p elements so that one of them will get the smaller n/p elements and the other will get the larger n/pelements. Let A_{i} and A_{j} be the blocks stored in processes P_{i} and P_{j}. If the block of n/p elements at each process is already sorted, the redistribution can be done efficiently as follows. Each process sends its block to the other process. Now, each process merges the two sorted blocks and retains only the appropriate half of the merged block. We refer to this operation of comparing and splitting two sorted blocks as compare-split. The compare-split operation is illustrated in Figure 9.2.

##### Figure 9.2. A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process P_{i} retains the smaller elements and process P_{j} retains the larger elements.

If we assume that processes P_{i} and P_{j} are neighbors and that the communication channels are bidirectional, then the communication cost of a compare-split operation is (t_{s} + t_{w}n/p). As the block size increases, the significance of t_{s} decreases, and for sufficiently large blocks it can be ignored. Thus, the time required to merge two sorted blocks of n/p elements is Q(n/p).