
NTHU 2021 Parallel Programming HW1 Odd Even Sort implemented with MPI to Speed up The Performance of MergeSort and Quicksort


Howeng98/Odd-Even-Sort


1. Implementation

In this homework, the goal is to implement the odd-even sort algorithm with the MPI interface. The whole process can be separated into six parts:

  • allocate memory
  • read input_data
  • MPI process communication
  • local sorting
  • reduce
  • write output_data

Load Balancing

In the load-balancing part, the number of data elements (n) is always greater than the number of processes (size), so most processes need to handle:

n / size

elements, but some processes need to handle one extra:

n / size + 1 if rank < n % size
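The rule above can be sketched as two small helpers (a minimal sketch; the function names `local_count` and `local_offset` are illustrative, not from the repo):

```c
/* Number of elements assigned to `rank` when n elements are split
 * across `size` ranks: the first n % size ranks get one extra. */
static int local_count(int n, int size, int rank) {
    return n / size + (rank < n % size ? 1 : 0);
}

/* Starting offset of `rank`'s block in the global array: `rank` full
 * base-sized blocks, plus one extra slot per earlier oversized rank. */
static int local_offset(int n, int size, int rank) {
    int base = n / size, extra = n % size;
    return rank * base + (rank < extra ? rank : extra);
}
```

For example, with n = 10 and size = 3 the blocks are of sizes 4, 3, 3 starting at offsets 0, 4, 7, so every element is covered exactly once.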

Data Preprocessing

I prepare three memory spaces (data, temp, buf) for each rank. data stores the currently loaded input_data, or the sorted_data received from another rank; after all processes have finished their assigned tasks, data holds each rank's output. temp is scratch space for the merge-sort phase. buf is the receive buffer used with MPI_Recv when receiving sorted data from another rank.

Odd-Even Sort Phase

Let's explain how the odd-even phases work. In the even stage, only the ranks for which rank % 2 == 0 (equivalently, rank & 1 is false) receive data from their odd-ranked neighbor (rank - 1) via MPI_Send and MPI_Recv. A receiving rank first saves the incoming data in buf, merges the local data and the received buf into temp, and sends the partner's half back after sorting. The odd stage does the same thing with the pairing shifted by one; for more details please check my report documents.
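The merge in each exchange only needs to keep one half of the combined data: the lower-ranked partner keeps the n smallest values. A serial sketch of that helper (the name `merge_keep_low` is my own, not from the repo):

```c
#include <stddef.h>

/* Merge two sorted arrays `a` (n elements) and `b` (m elements) and
 * keep only the n smallest values, written to `out`. The lower-ranked
 * process keeps this half; its partner keeps the largest values via
 * the mirror of this loop, scanning from the array ends. */
static void merge_keep_low(const int *a, size_t n,
                           const int *b, size_t m,
                           int *out) {
    size_t i = 0, j = 0, k = 0;
    while (k < n) {
        if (j >= m || (i < n && a[i] <= b[j]))
            out[k++] = a[i++];   /* take from the local block */
        else
            out[k++] = b[j++];   /* take from the received buffer */
    }
}
```

Because both inputs are already sorted, each exchange costs O(n + m) instead of a full re-sort of the combined block.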

Reduce

The even and odd stages are executed in a while loop, which breaks when the MPI_Reduce result under the MPI_LOR operator is false: if no swap happened anywhere, every rank's has_swap flag (set in the merge-sort part) is false, so the data is globally sorted.
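A serial analogue of this termination check: keep alternating even/odd passes while any pass swaps. In the MPI version, each rank's local has_swap would instead be combined across ranks with an MPI_LOR reduction (a sketch under that assumption; serial for clarity):

```c
#include <stdbool.h>
#include <stddef.h>

/* Serial odd-even transposition sort. The loop stops when a full
 * even + odd round performs no swap, mirroring the MPI_LOR reduction
 * over each rank's has_swap flag in the parallel version. */
static void odd_even_sort(int *a, size_t n) {
    bool has_swap = true;
    while (has_swap) {
        has_swap = false;
        for (size_t phase = 0; phase < 2; ++phase)   /* even, then odd */
            for (size_t i = phase; i + 1 < n; i += 2)
                if (a[i] > a[i + 1]) {
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                    has_swap = true;
                }
    }
}
```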

2. Optimization

  1. Use & instead of %

For power-of-two divisors, a bitwise AND can replace the modulo and avoids a division instruction; modern compilers often perform this rewrite themselves, but writing it explicitly makes the intent clear.
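For example, the parity test used to pick even/odd stages: for non-negative integers, i % 2 and i & 1 agree.

```c
/* Parity check: (i & 1) equals (i % 2) for any unsigned i. */
static int is_odd(unsigned i) { return i & 1u; }
```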

  2. Avoid the ternary operator (e.g. a ? b : c)

In hot loops the ternary operator can compile to a branch, which is relatively costly, so I try to avoid it on the critical path.

  3. Reassign pointer targets instead of copying array values when swapping array memory

This is important: relocating the array pointers instead of copying array elements one by one saves a lot of time (O(1) instead of O(n) per swap).
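Concretely, after merging into temp, the buffers can trade roles by exchanging pointers (a sketch; the variable names follow the buffer names used above):

```c
/* Exchange two int-array pointers in O(1). After merging `data` and
 * `buf` into `temp`, calling swap_ptr(&data, &temp) adopts the merged
 * result as the new `data` without copying any elements. */
static void swap_ptr(int **a, int **b) {
    int *t = *a;
    *a = *b;
    *b = t;
}
```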

  4. ++i > i++ > i += 1 > i = i + 1

A code-writing hint; for plain integers compilers usually generate identical code for all four, but the habit costs nothing.

  5. Pick a suitable sorting algorithm

For the local sort before the odd-even phases begin, in this repo's case: spreadsort > qsort (C library) > self-defined quicksort. qsort is well optimized inside the library, so be careful when you choose a sorting algorithm.
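For reference, the C library qsort used for the local sort takes a comparison callback (standard API; the comparator name is my own):

```c
#include <stdlib.h>

/* Comparison callback for qsort over ints, ascending.
 * Returns <0, 0, >0 without risking overflow from subtraction. */
static int cmp_int(const void *p, const void *q) {
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Usage: qsort(data, n, sizeof(int), cmp_int); */
```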

  6. Dynamic allocation, load balancing

Distribute the task load to each MPI rank dynamically, as described in the load-balancing section.

  7. Avoid repeated memory allocation inside functions
  8. Avoid redundant, duplicated computation; hoist constant values

3. Experiments

System Spec

  • 4 nodes, 12 CPUs each (with 1 thread per CPU, at most 48 threads run at the same time)
  • OS: Arch Linux
  • Compilers: GCC 10.2.0, Clang 11.0.1
  • MPI: Intel MPI Library, Version 2019 Update 8
  • Scheduler: Slurm 20.02.5
  • Network: Infiniband

Strong Scalability (Speedup Factor)

(Figures: single-node and multi-node speedup plots.)

Time Profile

(Figures: single-node and multi-node time-profile plots.)

Discussion

The experiment results above were measured on testcase34, which contains 546869888 data elements. I picked this testcase because its large size makes the execution times, and therefore the differences between settings, easier to distinguish in the result plots.

From my experiment results, in both the single-node and multi-node cases the speedup grows only slowly. Sometimes the performance drops sharply, especially at 12 processes on a single node and at node 4 in the multi-node runs. I think the cause is that the last node can only communicate with its left neighbor, so its processes gain little efficiency and the sort step costs a lot of time.

4. Conclusion

From this homework I learned how to use MPI to let different processes communicate with each other. I think the bottleneck in this repo is the communication time (plus the MPI_Wtime() measurement overhead); in future work we can put more effort into these parts to make the program more efficient.
