Multithreading : Multicore programming and false-sharing benchmark

1. Introduction : Multicore programming is essential to benefit from the power of the hardware, as it allows us to truly run our code on different CPU cores. However, anyone doing multithreaded programming has to understand the underlying hardware to fully utilize it.

In most processors, the L1 and L2 caches are private, which means they are per core. One exception is that L1 can be shared between 2 virtual processors if hyperthreading is enabled (see the Hyper-threading article on Wikipedia : https://en.wikipedia.org/wiki/Hyper-threading). L3 caches, on the other hand, are shared by different CPU cores.

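Because each core keeps its own copy of a cache line in its private caches, the hardware has to keep those copies consistent, and this introduces the problem known as false sharing. In the simplest example, if 2 variables sit in the same cache line and are used by 2 different threads on different cores, then whenever one of the threads updates its own variable, the copy of that cache line held by the other core is invalidated and has to be fetched again, even though the other thread never touches the updated variable :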

[Figure: false_sharing]
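To make the layout problem concrete, here is a minimal sketch that checks whether two members of a struct would land in the same cache line. The names `Counters`, `kCacheLineSize` and `same_cache_line` are hypothetical and used only for illustration, and the 64-byte line size is an assumption that holds on many x86-64 CPUs but not everywhere.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size (64 bytes is common on x86-64, but not universal).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical pair of counters, each meant to be updated by a different thread.
struct Counters {
    std::int32_t a;  // updated by thread 1
    std::int32_t b;  // updated by thread 2
};

// Returns true if two byte offsets, measured from a cache-line-aligned base
// address, fall into the same cache line.
constexpr bool same_cache_line(std::size_t offset1, std::size_t offset2) {
    return offset1 / kCacheLineSize == offset2 / kCacheLineSize;
}

// Both members are only 4 bytes apart, so they share one line:
// this is exactly the layout that can lead to false sharing.
static_assert(same_cache_line(offsetof(Counters, a), offsetof(Counters, b)),
              "a and b are expected to share a cache line");
```

The check assumes the struct itself starts on a cache line boundary. Without that guarantee the two members could also straddle two lines, but with plain 4-byte integers they will share a line in the vast majority of placements.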

This can be a significant performance penalty if many cache lines are shared this way. Therefore it is very good practice to align your data to cache lines, so that data used by different threads does not end up in the same line. You can easily do this with the alignas specifier introduced in C++11. In the following sections, I will show how false sharing affects the execution time.
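As a small illustration of alignas, the sketch below gives each member of a struct its own cache line. The struct name `AlignedCounters` and the constant `kCacheLineSize` are hypothetical, and the 64-byte value is an assumption that should be checked against the target CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; verify it for your own hardware.
constexpr std::size_t kCacheLineSize = 64;

// Every member starts on its own 64-byte boundary, so no two members
// can share a cache line.
struct AlignedCounters {
    alignas(kCacheLineSize) std::int32_t a;
    alignas(kCacheLineSize) std::int32_t b;
    alignas(kCacheLineSize) std::int32_t c;
};

// The compiler pads the struct: 3 members, one full line each.
static_assert(alignof(AlignedCounters) == kCacheLineSize,
              "struct is aligned to a cache line");
static_assert(offsetof(AlignedCounters, b) == kCacheLineSize,
              "b starts on the second cache line");
static_assert(sizeof(AlignedCounters) == 3 * kCacheLineSize,
              "struct occupies three cache lines");
```

The trade-off is memory: three 4-byte integers now occupy 192 bytes instead of 12, which is usually acceptable for a handful of hot variables but worth keeping in mind for large arrays.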

2. Benchmark code : In the benchmark code, we have a static struct variable which has 3 32-bit integers. (The benchmark will be executed on a 64-bit processor with a cache line size of 64 bytes.) We fire 3 threads, each of which operates on a different member of that static struct variable. To control the benchmark, I added macros that enable or disable alignment and that set the CPU core IDs our 3 threads will be bound to. The full code is available from the repository linked at the end of this post.
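The following is only a rough sketch of what a benchmark with this shape might look like, not the original listing. It assumes Linux with glibc (for `pthread_setaffinity_np`), and the names `USE_ALIGNMENT`, `MEMBER_ALIGNMENT`, `Data`, `g_data`, `pin_current_thread`, `worker` and `run_benchmark` are all hypothetical. The members are volatile only so that the compiler does not optimize the loop away.

```cpp
#include <pthread.h>
#include <sched.h>

#include <array>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical switch: compile with -DUSE_ALIGNMENT to give each member
// its own (assumed 64-byte) cache line.
#ifdef USE_ALIGNMENT
#define MEMBER_ALIGNMENT alignas(64)
#else
#define MEMBER_ALIGNMENT
#endif

struct Data {
    MEMBER_ALIGNMENT volatile std::int32_t a;
    MEMBER_ALIGNMENT volatile std::int32_t b;
    MEMBER_ALIGNMENT volatile std::int32_t c;
};

static Data g_data;  // static storage, zero-initialized

// Binds the calling thread to one core (Linux-specific).
// Returns false if the affinity could not be set.
inline bool pin_current_thread(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Each thread repeatedly updates only its own member.
inline void worker(volatile std::int32_t* target, int core_id,
                   std::int32_t iterations) {
    pin_current_thread(core_id);  // best effort; failure is ignored here
    for (std::int32_t i = 0; i < iterations; ++i) {
        *target = *target + 1;
    }
}

// Runs 3 threads on the given core IDs and returns the elapsed seconds.
inline double run_benchmark(const std::array<int, 3>& cores,
                            std::int32_t iterations) {
    assert(iterations >= 0);
    g_data.a = 0;
    g_data.b = 0;
    g_data.c = 0;
    const auto start = std::chrono::steady_clock::now();
    std::thread t1(worker, &g_data.a, cores[0], iterations);
    std::thread t2(worker, &g_data.b, cores[1], iterations);
    std::thread t3(worker, &g_data.c, cores[2], iterations);
    t1.join();
    t2.join();
    t3.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

With this sketch, a call such as `run_benchmark({{0, 2, 4}}, 100000000)` would correspond to spreading the threads over cores 0, 2 and 4, while `{{0, 0, 0}}` keeps them all on one core. The timings it produces will differ from the ones reported in this post, since the loop body and iteration count are guesses.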

3. Benchmark : I ran the benchmark on a 64-bit Ubuntu system with an Intel i7 Haswell CPU that reports 8 logical cores and a 64-byte cache line :

root@akhin-GS60-2PC-Ghost:/home/akhin/Desktop/aligned# cat /proc/cpuinfo |grep cache
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64

As mentioned in the code section, I controlled alignment and thread CPU affinities by using macros. Here are the results of the test steps :

1. No alignment, all threads on the same core : 0.746 seconds

2. No alignment, thread 1 on core 0, thread 2 on core 2, thread 3 on core 4 : 1.0295 seconds

3. Alignment, thread 1 on core 0, thread 2 on core 2, thread 3 on core 4 : 0.2903 seconds

As you can see from the results, when we used 3 threads on 3 cores without alignment, we got a slower execution time than with 3 threads on a single core, rather than a performance improvement. However, as soon as I enabled alignment, I got the best result: about 3.5 times faster than the unaligned run on the same 3 cores.

You can download this code from : https://github.com/akhin/benchmarks/tree/master/false_sharing
