Context switches : Epoll vs multithreaded IO benchmark

1. Introduction and multiplexed IO
In the classic design for a server that needs to handle multiple clients simultaneously, you implement a thread-per-client solution : each connection gets its own thread that keeps calling blocking socket APIs.

However, when using multiplexed IO, you wait for events instead of polling each socket. Another advantage it brings is that you can implement such a server using only one thread, which is nice for avoiding context switches. This post's purpose is measuring the different IO mechanisms which are available in user space. The implementations use TCP, however all tests were done on the same machine via the loopback adapter, in order to avoid network effects and focus on the kernel.

2. Select, poll and epoll

Linux provides select, poll and epoll for multiplexed io :

http://man7.org/linux/man-pages/man2/select.2.html

http://man7.org/linux/man-pages/man2/poll.2.html

http://man7.org/linux/man-pages/man7/epoll.7.html

We will be using epoll in this post, as its biggest advantage is that you traverse ready events rather than all registered file descriptors when looking for events. Therefore you don't need to loop over idle file descriptors.

Another note about epoll is that it provides two modes :

Level-triggered mode : you will keep getting an event notification as long as there is data left to process. That means that if you did not drain the buffer, you will continue to get the same event.

Edge-triggered mode : you will get a notification only once, when new data arrives, regardless of whether you processed the buffer or not.
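
As a minimal illustration (this helper is just a sketch, not code from the benchmark), the mode is selected with a flag passed to epoll_ctl :

#include <sys/epoll.h>

// Register a socket with an epoll instance; pass edge_triggered = true for EPOLLET.
void add_socket(int epoll_fd, int fd, bool edge_triggered)
{
    struct epoll_event ev{};
    ev.data.fd = fd;
    ev.events = edge_triggered ? (EPOLLIN | EPOLLET) : EPOLLIN; // level-triggered is the default
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}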

 3. About IO Patterns : Reactor and Proactor
Select, poll and epoll allow you to implement the reactor IO pattern, in which you wait for readiness events : https://en.wikipedia.org/wiki/Reactor_pattern

Another similar IO pattern is called proactor : https://en.wikipedia.org/wiki/Proactor_pattern

In the proactor pattern, you instead wait for the completion of a read on a descriptor such as a socket. Having researched the proactor pattern for a while, I don't think it is truly possible to implement on Linux, as the kernel does not provide such a mechanism : https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports

On the other hand it is possible to implement it in Windows using IO completion ports. You can see an example implementation here : https://xania.org/200807/iocp

Note that Boost.Asio actually uses epoll underneath in order to implement its proactor, therefore we can't say it is a true proactor on Linux.

4. Thread per client implementation

The thread-per-client implementation has an always-running thread that accepts new connections and spawns a new thread per connection. It uses an std::mutex only when a new connection arrives, in order to synchronise the bookkeeping of the connected clients.

The base TCPServer class, which is used by both the thread-per-client and the reactor implementations to manage the connected peers, can be found in the repository linked in the benchmark section below.

The thread-per-client server itself is derived from TCPServer; a sketch of the idea follows.
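
Since the full sources are in the repository, below is only a minimal sketch of the pattern (an echo server with error handling and peer bookkeeping omitted; the port number is made up) :

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <thread>

// Echo back whatever the client sends until it disconnects.
static void handle_client(int client_fd)
{
    char buffer[1024];
    ssize_t received = 0;
    while ((received = recv(client_fd, buffer, sizeof(buffer), 0)) > 0)
    {
        send(client_fd, buffer, received, 0);
    }
    close(client_fd);
}

int main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in address{};
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(5555); // hypothetical port
    bind(listener, reinterpret_cast<sockaddr*>(&address), sizeof(address));
    listen(listener, SOMAXCONN);

    while (true)
    {
        int client_fd = accept(listener, nullptr, nullptr); // always-running accept loop
        std::thread(handle_client, client_fd).detach();     // one thread per connection
    }
}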

5. Reactor ( Epoll ) server implementation 

The reactor implementation accepts new connections and handles client events all on the same thread, therefore it does not require any synchronisation. It uses level-triggered epoll for simplicity.

The epoll wrapper, io_event_listener_epoll.cpp, and server_reactor.cpp, which uses that wrapper to implement the reactor server, are also in the repository; a sketch of the core loop follows.
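
Again as a minimal sketch of the pattern rather than the actual server_reactor.cpp (echo server, error handling omitted, port number made up), a single-threaded level-triggered epoll loop looks roughly like this :

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in address{};
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(5555); // hypothetical port
    bind(listener, reinterpret_cast<sockaddr*>(&address), sizeof(address));
    listen(listener, SOMAXCONN);

    int epoll_fd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN; // level-triggered
    ev.data.fd = listener;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener, &ev);

    epoll_event events[1024];
    char buffer[1024];
    while (true)
    {
        int ready = epoll_wait(epoll_fd, events, 1024, -1); // wait for events, no polling
        for (int i = 0; i < ready; ++i)
        {
            int fd = events[i].data.fd;
            if (fd == listener) // new connection
            {
                int client_fd = accept(listener, nullptr, nullptr);
                epoll_event client_ev{};
                client_ev.events = EPOLLIN;
                client_ev.data.fd = client_fd;
                epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &client_ev);
            }
            else // data from an existing client
            {
                ssize_t received = recv(fd, buffer, sizeof(buffer), 0);
                if (received <= 0)
                {
                    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, nullptr);
                    close(fd);
                }
                else
                {
                    send(fd, buffer, received, 0); // echo back
                }
            }
        }
    }
}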

6. Dropped connections and socket buffer sizes 

I observed disconnection issues with a high number of sockets and threads, for example 1024 sockets and threads in the client automation and the same in the thread-per-client server, even when using the loopback adapter on the same machine. All cases had the same symptom : the client automation program got socket error code 104 (Connection reset by peer), whereas I could not spot any socket error on the server side. However, one thing I noticed is that increasing the socket receive and send buffer sizes helped. In order to set the socket send and receive buffer sizes system-wide :

echo 'net.ipv4.tcp_wmem = 10240 1024000 12582912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 10240 1024000 12582912' >> /etc/sysctl.conf

And eventually type "sysctl -p" so that the system picks the changes up. I tried different system-wide default socket receive and send buffer sizes such as 1024000, 87380, 10240 and 128 bytes and observed similar results, except that I observed a high number of disconnections while benchmarking the thread-per-client server with 1024 clients/threads when the socket buffer sizes were only 128 bytes.

7. Benchmark

As I am measuring the IO performance of different kernel mechanisms from user space, I benchmarked on a single machine. That is also useful for avoiding any network effect, as I am mainly interested only in IO and context switches.

You can specify the number of clients and the number of messages when using the client automation which I wrote for benchmarking. A thread will be spawned for each client, and each thread will send the specified number of messages to the connected server. Each thread then expects one response per message for the automation to end.
At the end, the client automation will show you the total elapsed time and the average RTT (round trip time). It will also report the number of disconnections, which gives an idea about the accuracy of the results.
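
Each automation thread is essentially a blocking ping-pong loop; a simplified sketch of the idea (not the actual client automation code, port number made up) :

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

// One client thread : connect, then send a message and wait for the reply N times.
static void client_thread(int message_count)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in server{};
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    server.sin_port = htons(5555); // hypothetical port
    connect(fd, reinterpret_cast<sockaddr*>(&server), sizeof(server));

    char message[64] = "ping";
    char response[64];
    for (int i = 0; i < message_count; ++i)
    {
        send(fd, message, sizeof(message), 0);
        recv(fd, response, sizeof(response), 0); // one response expected per message
    }
    close(fd);
}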

You can find all the source code of the servers and the client automation at : https://github.com/akhin/low_latency_experiments/tree/master/epoll_vs_multithreaded_io

During the benchmark system wide TCP buffer sizes were as below :

net.ipv4.tcp_wmem = 4096 87380 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

In all benchmarks, I used 100 ping-pongs between the client automation and the server, and I changed the number of clients (threads) in each benchmark. For 100 ping-pongs :

Client number    Epoll RTT            Thread-per-client RTT
4                20 microseconds      62.5 microseconds
128              23 microseconds      95 microseconds
1024             30 microseconds      148 microseconds

8. Measuring context switches per thread using SystemTap

I wanted to display the context switches per thread. Therefore, I first named the threads in the server implementations using pthread_setname_np :

http://man7.org/linux/man-pages/man3/pthread_setname_np.3.html

That allowed me to give an OS-level name to each thread (basically to each process, since threads are light-weight processes : https://en.wikipedia.org/wiki/Light-weight_process ).
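
Naming a thread is a one-liner at the start of the thread function; for example (illustrative only, the thread name is made up) :

#include <pthread.h>
#include <thread>

void start_named_worker()
{
    std::thread worker([]
    {
        // On Linux the name is limited to 15 characters plus the null terminator.
        pthread_setname_np(pthread_self(), "server_worker");
        // ... thread work ...
    });
    worker.detach();
}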

After that, I prepared a short SystemTap ( https://sourceware.org/systemtap/ ) script to measure context switches via the Linux kernel's sched_switch event :

In order to run the script above for a specific program :

stap context_switch.stp -c program_name

When you run this SystemTap script, it will report the number of context switches per thread. You can easily notice the much higher total number of context switches in the thread-per-client implementation compared to the epoll/reactor implementation.

SystemTap probes work system-wide and therefore slow the system down, so I captured outputs for 32 clients from the thread-per-client server and the epoll server.

Thread per client server context switch counts per thread :

[screenshot : SystemTap output for the thread-per-client server]

 

Epoll/Reactor server context switch count for the single epolling thread :

 

[screenshot : SystemTap output for the epoll/reactor server]


GDB Debugging Automation with Python : Implementing a memory leak detector

1. Introduction : In this post, I will talk about how to automate debugging with GDB, and as an example project I will show a memory leak detector. The final leak detector can be seen in action here :

2. About GDB automation : I previously posted a short write-up about WinDbg: https://nativecoding.wordpress.com/2016/01/10/automate-attach-to-process-on-windows-with-windbg/

Similarly to WinDbg, GDB supports script files, which allow you to save a batch of commands. You can also put those in a GDB init file so that they are loaded every time you start GDB. Additionally, the GDB CLI has a batch mode, therefore you can also automate GDB commands with an external Bash script. Below you can see a simple GDB script that dumps information about malloc calls :

To load a script like the one above, you just type "source malloc_dumper.txt" at the GDB prompt.

However, debugging automation becomes really powerful when you use GDB's Python API. Starting from GDB 7, GDB comes with an embedded Python interpreter and also exposes a module named "gdb" for use from Python.

3. What can be done with GDB Python API : Firstly, note that the "gdb" module for Python will only be available when your script is executed from GDB, as GDB dynamically injects it into its interpreter. Therefore you will initially need to work directly at the GDB prompt to explore it. Here is the list of APIs :

https://sourceware.org/gdb/onlinedocs/gdb/Python-API.html#Python-API

To summarise what can be done generally :

  • You can do anything you could do with GDB scripts simply by calling gdb.execute("gdb_command_you_want_to_enter")
  • You can create new GDB commands or  functions
  • You can create pretty printers. ( See the final Links section for an STL pretty printer example )
  • You can access breakpoints, frames, blocks, symbols, processes, threads, exceptions, values and more, and all of these are provided as classes, which makes automation much more convenient compared to plain GDB scripting

4. Example project "memory leak detector" : I coded a small Python GDB extension script which dumps information about malloc, realloc, calloc and free calls from the GNU libc runtime. Briefly, how it works :

  • Places breakpoints on the GNU libc runtime memory functions. It also places breakpoints on the main and exit functions to detect the start and the end of the session.
  • When a memory-function breakpoint is hit, it takes control, captures the arguments passed to the function, captures the callstack, executes until the end of the frame in order to capture the return value, and then continues debugging.
  • I also created a small Bash script which makes it easy to use the "memdump" extension. It basically executes GDB in batch mode, loads the Python script into GDB and executes it.

Note : As a prerequisite, you will need to install the debug version of the GNU libc runtime. On Ubuntu :

sudo apt-get install libc6-dbg

And on CentOS :

yum install yum-utils
debuginfo-install glibc

And here you can see the Python implementation :

You can use the command below in order to start a GDB session by loading memdump.py :

gdb -batch -ex "source memdump.py" -ex "memdump" -ex "r" <debuggee_executable>

5. Analysing dump output : The previous GDB extension creates a text file with information about all memory operations. I also implemented a separate Python script that parses the GDB extension's output and finds the leaks. Basically, this is how it works :

  • For each calloc and malloc, we add the memory event to a hash table, using the memory address as the key
  • For each realloc, we remove the entry for the previously allocated address and add a new entry with the new memory address
  • For each free operation, we remove the entry from the dictionary.
  • Specifically for the GNU libc runtime, we ignore memory operations which belong to GNU libc's own internal functions
  • Finally, each entry remaining in the dictionary gives us a leak.

Here is the analyser script :

6. Links : Here you can find a list of nice resources for the topic :

A presentation about GDB Python extensions : https://dmalcolm.fedorapeople.org/presentations/PyCon-US-2011/GdbPythonPresentation/GdbPython.html#1

A pretty printing example : http://hgad.net/posts/object-inspection-in-gdb/

A Python extension to make deadlock analysis easier : http://www.linuxjournal.com/article/11027?page=0,0

Another deadlock detector :

https://github.com/xmementoit/gdb-automatic-deadlock-detector/blob/master/gdbDisplayLockedThreads.py

 

C++ exceptions with stack traces

  1. Introduction : In this post, I will share a simple single-header file "pretty_exception.h". Basically it is for throwing exceptions with much more information: the file, the line number and the function/method that is throwing the exception. Furthermore, it also adds callstack information, can produce coloured console output and traces for syslog/DbgView, and even comes with a simple message box (Windows only). Below you can see outputs from Linux and Windows consoles :

And Windows console output :

If you enable tracing,  you can see exception trace in syslog on Linux:

[root@localhost ~]# tail -f /var/log/messages
Jul 24 22:36:58 localhost dbus[633]: [system] Successfully activated service ‘org.freedesktop.hostname1’
Jul 24 22:36:58 localhost systemd: Started Hostname Service.
Jul 24 22:40:01 localhost systemd: Starting Session 3 of user root.
Jul 24 22:40:01 localhost systemd: Started Session 3 of user root.
Jul 24 22:40:32 localhost chronyd[636]: Selected source 94.125.132.7
Jul 24 22:45:49 localhost systemd: Starting Cleanup of Temporary Directories…
Jul 24 22:45:50 localhost systemd: Started Cleanup of Temporary Directories.
Jul 24 22:48:38 localhost systemd: Starting Session 4 of user root.
Jul 24 22:48:38 localhost systemd: Started Session 4 of user root.
Jul 24 22:48:38 localhost systemd-logind: New session 4 of user root.
Jul 24 22:48:44 localhost slog[4589]: Exception type : std::runtime_error

Message : AAA

File : main.cpp
Line : 5

Callstack :

5 : ./pretty() [0x4021a4]
4 : ./pretty() [0x40197a]
3 : ./pretty() [0x401bad]
2 : /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6ca71bdaf5]
1 : ./pretty() [0x4016b9]

And you can use Microsoft's DbgView utility to see the exception trace on Windows :

 

Additionally, on Windows you can also have message boxes if you enable them in the header file :

2. Implementation notes :

  • File name, line number, function name : The code uses the __LINE__ and __FILE__ macros. As for the function name, I initially intended to use C99 __func__, however it is currently not used since the callstack information already provides it.

 

 

 

  • Macro expansion : In order to have __FILE__ and __LINE__, I had to define the throw functionality as macros, since __FILE__ and __LINE__ have to be expanded at the call site by the preprocessor. However, I needed to concatenate these predefined macros with my own macros, therefore I had to use the technique described perfectly on this page : http://stackoverflow.com/questions/19343205/c-concatenating-file-and-line-macros
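
The technique boils down to a two-level stringification macro; roughly like this (the macro names here are illustrative, not the exact ones in pretty_exception.h) :

#define PRETTY_STRINGIZE_DETAIL(x) #x
#define PRETTY_STRINGIZE(x) PRETTY_STRINGIZE_DETAIL(x)
// __FILE__ is already a string literal; __LINE__ needs the two-level expansion
// so that it is expanded to its number before being stringified.
#define PRETTY_WHERE __FILE__ " Line : " PRETTY_STRINGIZE(__LINE__)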

 

 

  • Supporting string literals & std::string : In order to support both string literals and std::string as the input message, we define a template convertToStdString function and a const char* overload of it :

inline std::string convertToStdString(const char* str)
{
    return std::string(str);
}

template <typename T>
T convertToStdString(T const& t)
{
    return t;
}

 

3. Source code and usage : The target platforms are Linux with GCC (tested on CentOS 7 with GCC 4.8) and Windows with MSVC (tested with MSVC 2013 on Windows 8). The code first checks predefined macros to see whether it is running on a supported system; for other platforms and compilers, the changes should be straightforward. Currently you can throw 4 different std::exception types : std::runtime_error, std::invalid_argument, std::length_error and std::logic_error. In order to use it, just include the header file and call one of the throw macros :

#include "pretty_exception.h"

void foo()
{
    THROW_PRETTY_RUNTIME_EXCEPTION("AAA")
}

And finally here is the source code :

Undefined, implementation defined and unspecified behaviors in C++

Introduction :

The C++ standard defines how your code should behave. However, if you look at it, you will also notice undefined behaviours. In practice, that means that how your code behaves in certain situations is determined not only by the standard specification but also by the implementation of the compiler you are using :


Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Undefined behaviors exist to allow C and C++ compilers to optimise the generated assembly. You can read about the advantages on the LLVM blog :

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

And another explanation here : https://monoinfinito.wordpress.com/2016/06/02/c-why-is-undefinedness-important/

The most typical examples are uninitialised variables, dereferencing a null pointer, integer overflows, various invalid type casts and more. As for a general list of undefined behaviors, here is the best collection I could find :

http://stackoverflow.com/a/367662

Undefined behavior vs Unspecified Behavior vs Implementation Defined Behavior

If you look at the standard documentation further, you will also notice that there are 2 more terms : unspecified behavior and implementation-defined behavior. Here you can see a Stack Overflow question and its answers regarding the differences :

http://stackoverflow.com/questions/2397984/undefined-unspecified-and-implementation-defined-behavior

Here is the definition of implementation specified behavior :

behavior, for a well-formed program construct and correct data, that depends on the implementation and that each implementation documents.

The very last part of the standard documentation has a list of implementation-defined behaviors. And here is the definition of unspecified behavior :

behavior, for a well-formed program construct and correct data, that depends on the implementation
Note: The implementation is not required to document which behavior occurs. The range of possible behaviors is usually delineated by this International Standard.

Briefly, we can say that code that has undefined behavior has errors. On the other hand, implementation-defined behaviors are behaviors chosen by compiler vendors, such as sizeof(int), and they also have to be documented by the compiler and standard library vendor. Finally, unspecified behavior is like implementation-defined behavior, but there is no need to document it; an example is the order in which the arguments to a function are evaluated.

An undefined behaviour example which is handled differently by MSVC and GCC : In the example below, we first point a base class pointer at a derived object. That pointer is then assigned to a derived class pointer, and a method of the derived class is called successfully. However, after that, we make the base pointer point to a base object and re-assign it to the derived pointer. Calling a derived class method via a derived class pointer that actually points to a base object is undefined behaviour. If you try the example below, it will appear to work fine with MSVC, however if you compile it with GCC it will cause a segmentation fault :
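
The original snippet is not embedded here, so below is a reconstruction of the scenario described above (a sketch, not the exact code; what actually happens depends entirely on the compiler, which is the point of undefined behaviour) :

#include <iostream>

struct Base
{
    virtual ~Base() = default;
};

struct Derived : public Base
{
    virtual void print() { std::cout << m_value << std::endl; }
    int m_value = 42;
};

int main()
{
    Derived derivedObject;
    Base* basePointer = &derivedObject;                            // upcast, fine
    Derived* derivedPointer = static_cast<Derived*>(basePointer); // still points to a real Derived
    derivedPointer->print();                                       // ok

    Base baseObject;
    basePointer = &baseObject;                                     // now points to a plain Base
    derivedPointer = static_cast<Derived*>(basePointer);          // downcast to the wrong type
    derivedPointer->print();                                       // undefined behaviour
    return 0;
}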

Interesting consequences of undefined behaviors

An interesting example is that a variable can appear to be both true and false due to an uninitialised variable, which is a quite common undefined behavior :

http://markshroyer.com/2012/06/c-both-true-and-false/

Another interesting example is from Linux kernel : https://isc.sans.edu/diary/A+new+fascinating+Linux+kernel+vulnerability/6820

What happens in that one is :

1. First, the variable sk is initialised by dereferencing tun, even though tun may be NULL at this point :

struct sock *sk = tun->sk; // initialize sk with tun->sk

2. Then there is a NULL check, and if tun is NULL the code returns an error code :

if (!tun)
    return POLLERR; // if tun is NULL, return an error

3. However, the compiler optimised the code : since tun has already been dereferenced above, it assumes tun cannot be NULL and removes the NULL check.

4. This can lead the kernel to read/write data through address 0x00000000, which can be mapped to user land memory.

5. And here you can see how it can be exploited :

Ubsan : Undefined behavior sanitizer

GCC (starting from 4.9) and Clang have "ubsan" (the undefined behavior sanitizer). If you build with the -fsanitize=undefined flag, undefined behaviors will be reported at runtime.

Here is the simplest example, dereferencing a null pointer; when you compile and execute it, you will get a segmentation fault / memory access violation :

http://coliru.stacked-crooked.com/a/001178588381592e

Eventually if you add -fsanitize=undefined : http://coliru.stacked-crooked.com/a/d4c9040456f31156

This time you will also get :

"main.cpp:7:14: runtime error: load of null pointer of type 'int'"

In another example at http://coliru.stacked-crooked.com/a/fe6bcd958be237ac , we have a function that doesn't return a value. Ubsan reports it as :

main.cpp:1:5: runtime error: execution reached the end of a value-returning function without returning a value

As for an example use case, here you can see how it is used for testing Clang's libc++ :

https://cplusplusmusings.wordpress.com/2013/03/26/testing-libc-with-fsanitizeundefined/

FURTHER RESOURCES

A nice stackoverflow answer regarding types of UBs : https://stackoverflow.com/questions/367633/what-are-all-the-common-undefined-behaviours-that-a-c-programmer-should-know-a

Here is a one and a half hour video from BoostCon about undefined behavior : https://www.youtube.com/watch?v=uHCLkb1vKaY&index=75&list=UU5e__RG9K3cHrPotPABnrwg

A CPPCon video : https://www.youtube.com/watch?v=g7entxbQOCc

A CPPCon video presented by Chandler Carruth : https://www.youtube.com/watch?v=yG1OZ69H_-o

A great article about UB in C++ : https://blog.regehr.org/archives/1520?utm_source=newsletter_mailer&utm_medium=email&utm_campaign=weekly

Debugging/profiling tools in AAA game industry

Introduction : This post was initially about a CppCon 2015 presentation about memory debugging at Electronic Arts. After noticing more CppCon presentations from Ubisoft, I decided to add them to this post as well. Finally, I added links to the tools of Unreal Engine and CryEngine.

Memory debugging at Electronic Arts : 

There is a really nice one-hour CppCon presentation in which Scott Wardle talks about memory debugging at EA. You can watch it here :

And the PowerPoint presentation can be downloaded from :

https://github.com/CppCon/CppCon2015/blob/master/Presentations/Memory%20and%20C++%20debugging%20at%20EA/Memory%20and%20C++%20debugging%20at%20EA%20-%20Scott%20Wardle%20-%20CppCon%202015.pptx

Initially, the presentation starts with how memory management in the game industry used to be. Before Xbox 360 and PS3, it was more like embedded systems, since there was neither an OS nor virtual memory, and there was mostly a single CPU core. With Xbox 360 and PS3, both have light OSs called system software and both are multicore : the Xbox 360 PowerPC processor had 3 cores, each with hyperthreading, while PS3 had 1 PPU, which is the main core, and 6 programmable 128-bit SPUs. And finally, with PS4 and Xbox One there is a 64-bit address space and HDD capability for swapping. As he talks about these changes in target game hardware, he shows how their debugging utilities evolved and improved.

Since game projects are quite large, memory leaks are obviously not the only problem. He mentioned that they had to attack other problems such as fragmentation and heap corruption. Fragmentation slows things down, and heap corruption can cause people to blame, for example, the graphics programmers if there is something wrong with the shadows, while later you might find out that it was actually another team that exceeded their own buffer and wrote into the shadow buffer. Fragmentation also matters because you can easily find yourself in an out-of-memory situation even when there is enough total memory.

Notes
– In their custom heaps, they have the concept of "arena"s, particularly to avoid lock contention, as in jemalloc. You can find out from which arena a pointer came. In their terminology, a heap is an arena plus an allocator for that arena.

[screenshot : EA arenas]

– One of the interesting things is that they pass a const char* argument to the new operator, which is the allocation category. Later they also added the concept of scopes to allocators.

– As a debug utility, they log all allocations and frees in debug mode, therefore they have tools that process those logs. The tool shows all allocations and frees over time, shows memory areas for the different categories, and looks particularly useful for fragmentation as it shows a coloured visualisation of the memory.

[screenshots : EA memory snapshots and memory tool visualisation]

– He mentions that memory leaks are not the hardest things to find; the more painful thing is tracking down circular references. That could be sorted out with a garbage collector, however then the problem is that you end up consuming more memory. Obviously, you want to lower your memory consumption as it helps you suffer fewer cache misses. He also mentions that on the previous generation of consoles they did everything to reduce memory consumption, including reducing the code size. This is also mentioned about EASTL : specifically, they want to avoid the code bloat that comes with templates. Regarding memory usage, there is an example in the EASTL documentation at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html#Appendix_16 ; in the temporary and permanent memory section, it mentions that some games consume the system memory, so they definitely needed memory usage optimisation.

– One of the problems they had to solve is polymorphism, in order to be able to use one allocator for all the different classes.

– Their DeltaViewer in TurboTuner stores various data in tables, not only memory allocations but also threads, IO and more, and is able to provide different views. It also has a memory leak investigator. As for memory leak handling, they capture allocations and frees, however they are not only interested in whether all allocations are freed, but also in when they are freed. An allocation made for a particular level should be freed at the end of that level, not at the beginning or middle of the next level, as higher memory usage leads to lower performance.

[screenshot : EA leak investigation]

– They can also categorise memory usage. This feature is quite useful, as they can see memory usage level by level.

[screenshot : EA memory categorisation]

Performance cost of these debug mode features : He mentioned he was surprised by the performance cost, as it was only about 3-4 frames per second.

 

Ubisoft tools : Ubisoft Montreal delivered some quite nice talks. Like EA, they also do memory tagging, i.e. memory allocations with names. Additionally, they have their "break"s which trigger breakpoints in the VS debugger on certain events :

Here is CPPCon 2014 presentation :

 

And at CppCon 2016 they gave another presentation in which they showed their profiling tools, which also help with issues such as deadlocks and memory leaks. Here is the YouTube video "Rainbow Six : Quest for performance" : https://www.youtube.com/watch?v=tD4xRNB0M_Q

  • They mention similar things about the cost of memory allocations and finding issues in games. Another thing in common with the EA presentation is the tagging of events, which obviously helps them greatly in identifying issues. It is mentioned that Rainbow Six Siege runs at 70 FPS, and with memory allocation profiling turned on it only drops to 20 FPS.
  • Differently from EA's presentation, they showed their array analyzer tool, which gives statistics about their array implementation such as the number of instances, reserve sizes, peak buffer sizes, linear traversals and copy counts. [screenshot : Ubisoft array analyzer]
  • Another nice-looking tool is their lock analyzer, which shows them lock contention. [screenshot : Ubisoft lock analyzer]

Unreal engine : Unreal Engine 4 is one of the most used game engines in the world, and of course it comes with its own toolchain. You can actually use Unreal Engine freely and explore some of these tools :

https://docs.unrealengine.com/latest/INT/Engine/UI/ClassViewer/index.html Lets you explore all classes in the Unreal engine

https://docs.unrealengine.com/latest/INT/Gameplay/Tools/VisualLogger/ That is the log watching utility. The feature I really like about it is that it is timeline based, which is usually the case in game industry tools.

https://docs.unrealengine.com/latest/INT/Engine/Performance/ Performance and profiling tools

https://docs.unrealengine.com/latest/INT/Gameplay/Tools/NetworkProfiler/ Network profiler

Cryengine : Another common AAA game engine which you can also try for free. Some of the tools it has :

http://docs.cryengine.com/display/SDKDOC2/Debugging+and+Profiling+Tools  Profiling tools; here is a screenshot of memory profiling :

[screenshot : CryEngine memory profiling]

RadGameTools Telemetry : RAD Game Tools is a game middleware company which is famous for their Bink video codec. They also offer a profiling tool designed for games which is similar to the tools mentioned above :

http://www.radgametools.com/telemetry.htm

It can also show thread utilization, lock contention and even context switches. You can watch its video here :


Multithreading : Memory orderings and fine grained SPSC lockless queue benchmark

1. Introduction : In a previous article, I wrote about an SPSC lockfree queue and compared it against a lock-based SPSC queue : https://nativecoding.wordpress.com/2015/06/17/multithreading-lockless-thread-safe-spsc-ring-buffer-queue/

In this article, I will improve it by changing the memory ordering. You can find the source code on https://github.com/akhin/benchmarks/tree/master/lockless_spsc_sequantial_vs_acquirerelease

2. Memory orderings and the C++ standard memory model : Both the processor and the compiler can reorder memory operations, mainly for optimisation purposes. Developers used to have to rely on hardware-specific and compiler-specific instructions in order to get atomic variables and multicore synchronisation with memory fences :

A nice example for compiler reordering : http://preshing.com/20120625/memory-ordering-at-compile-time/

A nice example for CPU reordering : http://preshing.com/20120515/memory-reordering-caught-in-the-act/

And here is Wikipedia`s definition of a memory model :

 

In computing, a memory model describes the interactions of threads through memory and their shared use of the data.

C++ has had a standard memory model since C++11. This model tells compilers how threads' accesses to memory must behave, and it also helps you program accordingly by considering the memory orderings. You can see the link below for the details :

http://en.cppreference.com/w/cpp/language/memory_model

3. Sequential Consistency : By default, atomic operations have sequential consistency. That basically means all operations such as loads and stores appear to happen in the order they appear in the source code, and all threads observe a single total order of them. This is the strongest guarantee and quite useful; it is also the default memory ordering in the C++11 standard library. In the previous SPSC ring buffer implementation, the atomic variables had sequential consistency, and that was enough to provide synchronisation between one producer and one consumer.

4. Acquire / Release : Acquire/release ordering, on the other hand, provides a more "relaxed" ordering. In short, if thread B stores to a shared variable with release ordering and thread A loads the same variable with acquire ordering and sees the stored value, then everything thread B wrote before the release store is guaranteed to be visible to thread A after the acquire load. On the http://en.cppreference.com/w/cpp/atomic/memory_order page you can see an example which is the equivalent of mutual exclusion implemented with acquire/release memory ordering from the C++11 standard model.

5. Implementation : So what we will do in the previous lockless SPSC code is mark the stores with release ordering and the load operations with acquire ordering. By leaving sequential consistency, we will be "relaxing" the memory ordering : rather than ordering all operations as they appear in the source, we will only be synchronising the variables shared between the two threads :
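
The benchmarked queue is in the repository above; as a condensed sketch of the pattern (a fixed-size ring of ints, simplified, not the actual class), the loads become acquire and the stores become release :

#include <atomic>
#include <cstddef>

template <std::size_t Capacity>
class spsc_ring_buffer
{
public:
    bool push(int value) // called only by the producer thread
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == m_tail.load(std::memory_order_acquire)) // is the queue full?
            return false;
        m_buffer[head] = value;
        m_head.store(next, std::memory_order_release); // publish the write
        return true;
    }

    bool pop(int& value) // called only by the consumer thread
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        if (tail == m_head.load(std::memory_order_acquire)) // is the queue empty?
            return false;
        value = m_buffer[tail];
        m_tail.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    int m_buffer[Capacity];
    std::atomic<std::size_t> m_head{0};
    std::atomic<std::size_t> m_tail{0};
};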

6. Sequential vs Acquire/Release benchmark : I ran the benchmark on a 64-bit Ubuntu system with an 8-core Intel i7 Haswell CPU with 64-byte cache lines. For the benchmark I used a Bash script which you can find at https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca

And here are the results for 5 million iterations :

———————————————-

Program executed : ./lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 118 milliseconds

Minimum time : 72 milliseconds

Average time : 85.40000 milliseconds
———————————————-

Program executed : ./fine_grained_lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 46 milliseconds

Minimum time : 34 milliseconds

Average time : 38.80000 milliseconds

———————————————-

As you can see, we managed to get a better result by relaxing the memory ordering.

Note : Please note that the example classes are only for displaying the features; they are not production ready, as they do not even free the memory they allocate. For a more complete example, please see

https://github.com/akhin/cpp_multithreaded_order_matching_engine/tree/master/source/concurrent

Multithreading : Lock contention and fine-grained vs coarse-grained benchmark

1. Introduction : In concurrent programming, lock contention can become a bottleneck in your software if a thread holds a lock for a while and the other threads wait on that lock. It can also be described as competition between threads to acquire a lock. There are related terms such as "coarse-grained" locking and "fine-grained" locking. With fine-grained locking, you might end up using more locks, but each one protects a smaller portion of the data; note that the implementation will be harder. With coarse-grained locking, on the other hand, you lock larger pieces of data, and because of this it is easier to implement and safer to use.

2. Why fine-grained locking and what we will benchmark : As mentioned, you can easily use a coarse-grained lock to make sure that your code is thread safe, however you may notice that you are only ensuring thread safety and not benefiting from concurrency for speed. In this blog article, I will first show a coarse-grained MPMC (multiple producer multiple consumer) unbounded queue. After that I will show the implementation of a fine-grained one. Finally, I will compare the results and use Intel VTune to identify the contention.

3. Coarse-grained MPMC unbounded queue : Below you can see the idea of the implementation. Basically, we use one mutex to protect an std::queue, and we also use a condition variable in order to notify the pop/dequeue calls that the queue is not empty :
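
(The class used in the benchmark is in the repository linked below; the following is only a minimal sketch of the same idea.)

#include <queue>
#include <mutex>
#include <condition_variable>

template <typename T>
class coarse_grained_queue
{
public:
    void enqueue(const T& value)
    {
        std::lock_guard<std::mutex> lock(m_mutex); // one lock protects everything
        m_queue.push(value);
        m_not_empty.notify_one();
    }

    T dequeue()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_not_empty.wait(lock, [this] { return !m_queue.empty(); });
        T value = m_queue.front();
        m_queue.pop();
        return value;
    }

private:
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_not_empty;
};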

4. Fine-grained MPMC unbounded queue : For this implementation, we will not use an std::queue; instead we will implement a simple linked list in order to separate the tail and the head of the queue. Also, in order to avoid touching the head element in an enqueue call (to check whether the head is null and to update the head pointer when adding the first element), we will start the queue with a pre-allocated dummy node. This implementation is based on : https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html#nbq
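
Again only as a sketch of the two-lock idea rather than the exact benchmarked class (here the empty case is handled with a non-blocking try_dequeue, and the node's next pointer is atomic because an enqueuer and a dequeuer can touch the same dummy node when the queue is empty) :

#include <atomic>
#include <mutex>

template <typename T>
class fine_grained_queue
{
    struct node
    {
        T value{};
        std::atomic<node*> next{nullptr};
    };

public:
    fine_grained_queue()
    {
        node* dummy = new node; // pre-allocated dummy node
        m_head = dummy;
        m_tail = dummy;
    }

    void enqueue(const T& value)
    {
        node* new_node = new node;
        new_node->value = value;
        std::lock_guard<std::mutex> lock(m_tail_mutex); // producers only contend on the tail
        m_tail->next.store(new_node);
        m_tail = new_node;
    }

    bool try_dequeue(T& value)
    {
        std::lock_guard<std::mutex> lock(m_head_mutex); // consumers only contend on the head
        node* first = m_head->next.load();
        if (first == nullptr)
            return false; // queue is empty
        value = first->value;
        delete m_head;    // discard the old dummy
        m_head = first;   // the dequeued node becomes the new dummy
        return true;
    }

private:
    node* m_head;
    node* m_tail;
    std::mutex m_head_mutex;
    std::mutex m_tail_mutex;
};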

5. Benchmark : I ran the benchmark on a 64-bit Ubuntu system with an Intel i7 Haswell CPU with 64-byte cache lines. The benchmark code created 50 writer threads and 50 reader threads, and each thread did 1000 iterations. We also ran both executables for 100 iterations of the benchmark. You can find the source code at :

https://github.com/akhin/benchmarks/tree/master/fine_grained_vs_coarse_grained

I used a simple Bash script for the benchmark to run both executables for N iterations and collect the max, min and average times : https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca

Program executed : ./fine_grained_queue

Iteration times : 100

Maximum time : 34 milliseconds

Minimum time : 18 milliseconds

Average time : 20.81000 milliseconds

And below you see the coarse grained benchmark :

Program executed : ./coarse_grained_queue

Iteration times : 100

Maximum time : 60 milliseconds

Minimum time : 45 milliseconds

Average time : 47.23000 milliseconds

As you can see for 50 reader and 50 writer threads, the average difference is 27 milliseconds in 100( program iteration ) * 1000 ( thread iteration ).

6. Intel VTune and spotting the lock contention : I used Intel VTune on Ubuntu to get a report for both benchmark executables :

[screenshot : Intel VTune lock contention comparison]

As you can see, the right-hand side shows the fine-grained statistics, whereas the left-hand side shows the coarse-grained ones. In the fine-grained one you will notice 2 concentrated pieces -> 1 mutex for the tail and 1 mutex for the head, whereas in the coarse-grained one there is a single busy-looking mutex. The time figures show the time waited on these locks by all threads; the time waited on the single lock in the coarse-grained example is more than 3 times the time waited on the 2 mutexes in the fine-grained sample.

Note : Please note that the example classes are only for displaying the features; they are not production ready, as they do not even free the memory they allocate. For a more complete example, please see :

https://github.com/akhin/multithreaded_order_matching_engine/tree/master/source/core/concurrency