Multithreading : Memory orderings and fine grained SPSC lockless queue benchmark

1. Introduction : In a previous article, I wrote about SPSC lockfree queue and compared it against lock based SPSC queue :

In this article, I will improve it by changing the memory ordering. You can find the source code on

2. Memory orderings and C++ Standard Memory Model : Both the processor and the compilers can do reordering of operations in memory.  That is mainly intended for optimisations.  Developers had to know how to use hardware and also compiler specific instructions in order to have atomic variables and to have multicore synchronisation by using memory fences :

A nice example for compiler reordering :

A nice example for CPU reordering :

And here is Wikipedia`s definition of a memory model :


In computing, a memory model describes the interactions of threads through memory and their shared use of the data.

C++ has a standard memory model since C++11. This model tells compilers how to access memory for threads. And it also helps to program accordingly considering the memory orderings. You can see the link below for the details :

3. Sequential Consistency: By default atomic operations have sequential consistency. So that basically means all operations such as load and store will happen as they appear in the source code and that applies to all threads. This is the strongest guarantee and quite useful. Also in C++11 standard library, this is the default memory ordering type. In the previous SPSC ring buffer implementation, the atomic variables had sequential consistency and it was enough for us to provide synchronisation between one producer and one consumer.

4. Acquire / Release : Acquire/release ordering on the other hand can provide us a more “relaxed” ordering. In short in this one, if thread A loads a variable with acquire ordering , and if thread B  stores to the same shared variable with release ordering , it is guaranteed that the store operation will happen before the load. Below you can see an example  which is taken from page. This example is equivalent of mutual exclusion by using acquire release memory ordering from C++11 standard model  :

5. Implementation : So what we will do is in the previous lockless SPSC code is that we will specify stores with release ordering and load operations with acquire ordering. Therefore by leaving sequantial consistency, we will be “relaxing” the memory ordering. Rather than providing synchronisation for all source as they appear, we will be providing synchronisation for only shared variables between threads :

6. Sequential vs Acquire/Release benchmark : I ran the benchmark on a 64 bit Ubuntu system with 8 core and 64 byte cache line Intel I7 Haswell CPU. For the benchmark I used a bash script which you can find at

And here is the results for 5 million iterations :


Program executed : ./lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 118 milliseconds

Minimum time : 72 milliseconds

Average time : 85.40000 milliseconds

Program executed : ./fine_grained_lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 46 milliseconds

Minimum time : 34 milliseconds

Average time : 38.80000 milliseconds


As you can see we managed to get a better result by relaxing the memory order.

Note : Please note that the example classes are only for displaying features, they are not production ready as they are not even freeing memory allocated. For a more complete example , please see


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s