Multithreading : Lock contention and fine-grained vs coarse-grained benchmark

1. Introduction : In concurrent programming, lock contention can become a bottleneck in your software when one thread holds a lock for a long time while other threads wait on that same lock. It can also be described as a competition between threads to acquire a lock. Two related terms are "coarse-grained" locking and "fine-grained" locking. With fine-grained locking you may end up using more locks, but each one protects a smaller portion of the data; the trade-off is that the implementation is harder to get right. With coarse-grained locking, on the other hand, you lock a larger portion of the data, which makes it easier to implement and safer to use.

2. Why fine-grained locking and what we will benchmark : As mentioned, you can easily use a coarse-grained lock to make your code thread safe. However, notice that you are only ensuring thread safety, not benefiting from concurrency for speed. In this blog article, I will first show a coarse-grained MPMC (multiple producer multiple consumer) unbounded queue. After that, I will show the implementation of a fine-grained one. Finally, I will compare the results and use Intel VTune to identify the contention.

3. Coarse-grained MPMC unbounded queue : Below you can see the implementation. Basically we use a single mutex to protect an std::queue, and a condition variable to notify the pop/dequeue calls that the queue is not empty :
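A minimal sketch of such a queue could look like the following (class and member names are mine, not necessarily those of the linked repository): one mutex guards the whole std::queue, and a condition variable wakes dequeue calls blocked on an empty queue.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Coarse-grained MPMC queue: a single mutex serializes every operation
// on the underlying std::queue, so all producers and consumers contend
// on the same lock.
template <typename T>
class CoarseGrainedQueue
{
public:
    void enqueue(T value)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(value));
        }
        m_notEmpty.notify_one();   // wake one blocked consumer
    }

    T dequeue()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // The predicate overload handles spurious wakeups for us.
        m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
        T value = std::move(m_queue.front());
        m_queue.pop();
        return value;
    }

private:
    std::queue<T>           m_queue;
    std::mutex              m_mutex;
    std::condition_variable m_notEmpty;
};
```

Note that notify_one is called after the lock is released, so the woken consumer does not immediately block again on the mutex.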

4. Fine-grained MPMC unbounded queue : For this implementation we will not use an std::queue; instead we will implement a simple linked list so that the tail and the head of the queue can be protected separately. Also, to avoid accessing the head in an enqueue call (to check whether the head is null and to update the head pointer when adding the first element), we will start the queue with a pre-allocated dummy node. This implementation is based on : https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html#nbq
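Following the linked two-lock pseudocode (Michael &amp; Scott), a minimal sketch could look like this (names are mine; the repository version may differ): the head mutex serializes dequeuers, the tail mutex serializes enqueuers, and the dummy node keeps the two ends from touching each other's pointers.

```cpp
#include <mutex>
#include <utility>

// Two-lock queue: enqueue only touches the tail under the tail mutex,
// dequeue only touches the head under the head mutex. A pre-allocated
// dummy node means head == tail only when the queue is empty.
// (As in the original pseudocode, Node::next is not atomic here;
// production code would want std::atomic for it.)
template <typename T>
class FineGrainedQueue
{
    struct Node
    {
        T     data{};
        Node* next{nullptr};
    };

public:
    FineGrainedQueue()
    {
        Node* dummy = new Node;      // dummy node, never holds live data
        m_head = m_tail = dummy;
    }

    ~FineGrainedQueue()
    {
        while (m_head)               // free remaining nodes
        {
            Node* next = m_head->next;
            delete m_head;
            m_head = next;
        }
    }

    void enqueue(T value)
    {
        Node* node = new Node;
        node->data = std::move(value);
        std::lock_guard<std::mutex> lock(m_tailMutex);  // tail lock only
        m_tail->next = node;
        m_tail = node;
    }

    bool dequeue(T& out)
    {
        std::lock_guard<std::mutex> lock(m_headMutex);  // head lock only
        Node* first = m_head->next;  // m_head itself is the dummy
        if (first == nullptr)
            return false;            // queue is empty
        out = std::move(first->data);
        Node* oldDummy = m_head;
        m_head = first;              // first node becomes the new dummy
        delete oldDummy;
        return true;
    }

private:
    Node*      m_head;
    Node*      m_tail;
    std::mutex m_headMutex;
    std::mutex m_tailMutex;
};
```

Because producers and consumers take different mutexes, an enqueue and a dequeue can proceed in parallel, which is exactly where the contention reduction comes from.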

5. Benchmark : I ran the benchmark on a 64-bit Ubuntu system with an Intel i7 Haswell CPU (64-byte cache line). The benchmark code created 50 writer threads and 50 reader threads, and each thread did 1000 iterations. Both executables were benchmarked over 100 iterations. You can find the source code at :

https://github.com/akhin/benchmarks/tree/master/fine_grained_vs_coarse_grained
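As a rough illustration of the setup described above (this is my own sketch, not the actual code from the repository), the driver spawns the writer and reader threads and times the whole run in milliseconds; the stand-in queue here is the coarse-grained variant, and the thread and iteration counts are parameters.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Stand-in queue so this sketch compiles on its own; in the benchmark
// this slot is filled by either the coarse- or the fine-grained queue.
class Queue
{
public:
    void enqueue(int v)
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push(v); }
        m_notEmpty.notify_one();
    }
    int dequeue()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_queue.empty(); });
        int v = m_queue.front();
        m_queue.pop();
        return v;
    }
private:
    std::queue<int>         m_queue;
    std::mutex              m_mutex;
    std::condition_variable m_notEmpty;
};

// Spawns threadsPerSide writers and threadsPerSide readers, each doing
// `iterations` operations (50 and 1000 in the article's benchmark),
// and returns the elapsed wall time in milliseconds.
long long runBenchmark(std::size_t threadsPerSide, std::size_t iterations)
{
    Queue queue;
    std::vector<std::thread> threads;
    const auto start = std::chrono::steady_clock::now();

    for (std::size_t i = 0; i < threadsPerSide; ++i)
    {
        threads.emplace_back([&] {
            for (std::size_t j = 0; j < iterations; ++j)
                queue.enqueue(static_cast<int>(j));
        });
        threads.emplace_back([&] {
            for (std::size_t j = 0; j < iterations; ++j)
                queue.dequeue();    // blocks until an item arrives
        });
    }
    for (auto& t : threads)
        t.join();

    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

Since the total number of enqueues equals the total number of dequeues, every blocking dequeue is eventually satisfied and all threads join cleanly.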

I used a simple bash script to run both executables N times and collect the maximum, minimum and average times : https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca

Program executed : ./fine_grained_queue

Iteration times : 100

Maximum time : 34 milliseconds

Minimum time : 18 milliseconds

Average time : 20.81000 milliseconds

And below you see the coarse grained benchmark :

Program executed : ./coarse_grained_queue

Iteration times : 100

Maximum time : 60 milliseconds

Minimum time : 45 milliseconds

Average time : 47.23000 milliseconds

As you can see, for 50 reader and 50 writer threads the average difference is about 26 milliseconds per run (47.23 vs 20.81) across 100 program iterations of 1000 thread iterations each; the fine-grained queue is more than twice as fast on average.

6. Intel VTune and spotting the lock contention : I used Intel VTune on Ubuntu to produce a report for both executables :

[Figure: VTune lock analysis of both executables (vtune_locks)]

As you can notice, the right-hand side shows the fine-grained statistics, whereas the left-hand side shows the coarse-grained ones. In the fine-grained run you will notice two concentrated pieces -> one mutex for the tail and one mutex for the head, whereas in the coarse-grained run there is a single busy-looking mutex. The time figures show the time all threads spent waiting on these locks; the time waited on the single lock in the coarse-grained example is more than 3 times the total time waited on the 2 mutexes in the fine-grained sample.

Note : Please note that the example classes are only for demonstrating the technique; they are not production ready, as they do not even free the memory they allocate. For a more complete example, please see :

https://github.com/akhin/multithreaded_order_matching_engine/tree/master/source/core/concurrency


4 thoughts on “Multithreading : Lock contention and fine-grained vs coarse-grained benchmark”

  1. Great explanation! I haven’t seen that implementation for a concurrent queue before but it makes a ton of sense. Often times I’ve found myself using either multiple single-producer single-consumer lock free queues with a load balancer, or using the coarse grained queue implementation on multi-producer multi-consumer queues. Cool to know there is a better way, thanks!

  2. You are a rockstar and I am your fan. Could you please also make suggestions & advise best practices with these samples. That would be very much helpful. I think you should write a book on low latency development using C++
