1. Introduction : In concurrent programming, lock contention can become a bottleneck in your software when a thread holds a lock for a long time while other threads wait on it. It can also be described as competition between threads to acquire a lock. Two related terms are "coarse-grained" locking and "fine-grained" locking. With fine-grained locking, you end up using more locks, each protecting a smaller portion of the data; note that the implementation is harder to get right. With coarse-grained locking, on the other hand, a single lock protects a larger portion of the data, which makes it easier to implement and safer to use.
2. Why fine grained locking and what we will benchmark : As mentioned, you can easily use a coarse-grained lock to make your code thread safe. However, you are only ensuring thread safety, not benefiting from concurrency for speed. In this blog article, I will first show a coarse-grained MPMC (multiple producer, multiple consumer) unbounded queue. After that I will show the implementation of a fine-grained one. Finally I will compare the results and use Intel VTune to identify the contention.
3. Coarse-grained MPMC unbounded queue : Below you can see the implementation. Basically we use a mutex to protect an std::queue, and we also use a condition variable to notify pop/dequeue calls that the queue is not empty :
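A minimal sketch of this approach looks like the following. The class and method names (CoarseGrainedQueue, enqueue, dequeue) are illustrative and not necessarily those of the original source:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// One mutex guards the entire std::queue; a condition variable wakes
// consumers blocked in dequeue() once an item becomes available.
template <typename T>
class CoarseGrainedQueue
{
public:
    void enqueue(const T& value)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(value);
        }
        m_not_empty.notify_one(); // wake one blocked consumer
    }

    T dequeue()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // The predicate guards against spurious wakeups: we only
        // proceed once the queue is actually non-empty.
        m_not_empty.wait(lock, [this] { return !m_queue.empty(); });
        T value = m_queue.front();
        m_queue.pop();
        return value;
    }

private:
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_not_empty;
};
```

Note that every enqueue and every dequeue competes for the same mutex, which is exactly the contention the fine-grained version below tries to reduce.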
4. Fine grained MPMC unbounded queue : For this implementation we will not use an std::queue; instead we will implement a simple linked list in order to separate the tail and the head of the queue. Also, in order to avoid accessing the head in an enqueue call (to check whether the head is null and to update the head pointer when adding the first element), we will start the queue with a pre-allocated dummy node. This implementation is based on : https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html#nbq
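A sketch of the two-lock queue from the linked Michael & Scott pseudocode is below. Enqueue only touches the tail under its own mutex, and dequeue only touches the head under a separate mutex; the pre-allocated dummy node ensures the two never need to be held together. Following the pseudocode, dequeue here is non-blocking and returns false when the queue is empty; the `next` pointer is atomic so that a concurrent enqueue and dequeue on the dummy node do not race. All identifier names are illustrative:

```cpp
#include <atomic>
#include <mutex>

// Two-lock queue: a singly linked list that starts with a dummy node,
// so head and tail operations are protected by independent mutexes.
template <typename T>
class FineGrainedQueue
{
    struct Node
    {
        T value{};
        std::atomic<Node*> next{nullptr};
    };

public:
    FineGrainedQueue() : m_head(new Node), m_tail(m_head) {}

    ~FineGrainedQueue()
    {
        Node* node = m_head;
        while (node)
        {
            Node* next = node->next.load();
            delete node;
            node = next;
        }
    }

    void enqueue(const T& value)
    {
        Node* node = new Node;
        node->value = value;
        std::lock_guard<std::mutex> lock(m_tail_mutex);
        m_tail->next.store(node); // link at the end of the list
        m_tail = node;            // swing tail to the new node
    }

    bool try_dequeue(T& out)
    {
        std::lock_guard<std::mutex> lock(m_head_mutex);
        Node* first = m_head->next.load(); // first real element
        if (first == nullptr)
            return false;                  // queue was empty
        out = first->value;
        Node* old_dummy = m_head;
        m_head = first;                    // dequeued node becomes the new dummy
        delete old_dummy;
        return true;
    }

private:
    Node* m_head; // guarded by m_head_mutex
    Node* m_tail; // guarded by m_tail_mutex
    std::mutex m_head_mutex;
    std::mutex m_tail_mutex;
};
```

Because producers contend only on the tail mutex and consumers only on the head mutex, the single hot lock of the coarse-grained version is split into two less contended ones.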
5. Benchmark : I ran the benchmark on a 64-bit Ubuntu system with an Intel i7 Haswell CPU (64-byte cache line). The benchmark code created 50 writer threads and 50 reader threads, and each thread performed 1000 iterations. Both executables were benchmarked over 100 runs. You can find the source code at :
I used a simple bash script to run both executables N times and collect the maximum, minimum, and average times : https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca
Program executed : ./fine_grained_queue
Iteration times : 100
Maximum time : 34 milliseconds
Minimum time : 18 milliseconds
Average time : 20.81000 milliseconds
And below you see the coarse-grained benchmark results :
Program executed : ./coarse_grained_queue
Iteration times : 100
Maximum time : 60 milliseconds
Minimum time : 45 milliseconds
Average time : 47.23000 milliseconds
As you can see, with 50 reader and 50 writer threads the fine-grained queue averages 20.81 milliseconds per run versus 47.23 milliseconds for the coarse-grained one: a difference of roughly 26 milliseconds, more than a 2x speedup, over 100 (program iterations) * 1000 (thread iterations).
6. Intel VTune and spotting the lock contention : I used Intel VTune on Ubuntu to generate a report for both benchmark executables :
As you can see, the right-hand side shows the fine-grained statistics, whereas the left-hand side shows the coarse-grained ones. In the fine-grained report you will notice two concentrated hot spots -> one mutex for the tail and one mutex for the head, whereas the coarse-grained report shows a single very busy mutex. The time figures show the total time all threads waited on these locks; the time waited on the single lock in the coarse-grained example is more than three times the time waited on the two mutexes in the fine-grained sample combined.
Note : Please note that the example classes are only for demonstrating the technique; they are not production ready, as they do not even free the memory they allocate. For a more complete example, please see :