Context switches : Epoll vs multithreaded IO benchmark

1. Introduction and multiplexed IO
A classic server that needs to handle multiple clients simultaneously is implemented as a thread-per-client solution, in which each connection is served by its own thread that constantly blocks on socket API calls.

With multiplexed IO, on the other hand, you wait for events instead of polling each socket. Another advantage is that such a server can be implemented with a single thread, which is nice for avoiding context switches. The purpose of this post is to measure the different IO mechanisms available from user space. The implementations use TCP, but all tests are run on the same machine over the loopback adapter in order to avoid network effects and focus on the kernel.

2. Select, poll and epoll

Linux provides select, poll and epoll for multiplexed IO :

http://man7.org/linux/man-pages/man2/select.2.html

http://man7.org/linux/man-pages/man2/poll.2.html

http://man7.org/linux/man-pages/man7/epoll.7.html

We will be using epoll in this post, as its biggest advantage is that you traverse only the descriptors that have events rather than the whole descriptor set. Therefore you don't need to loop over idle file descriptors.

Another note about epoll is that it provides two trigger modes :

Level triggered mode : you keep getting an event notification as long as there is data to process, so if you did not drain the buffer you will keep seeing the same event on the next wait.

Edge triggered mode : you get a notification only once per state change, regardless of whether you processed the buffer or not.
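A rough illustration ( not from the benchmark code ) of how the mode is chosen when registering a descriptor, and of the drain loop that edge triggered mode requires :

#include <sys/epoll.h>
#include <sys/socket.h>

// Illustrative snippet : the trigger mode is selected with the EPOLLET flag at registration time.
void add_descriptor(int epoll_descriptor, int fd, bool edge_triggered)
{
    epoll_event event{};
    event.data.fd = fd;
    event.events = EPOLLIN;
    if (edge_triggered)
    {
        event.events |= EPOLLET;
    }
    epoll_ctl(epoll_descriptor, EPOLL_CTL_ADD, fd, &event);
}

// With edge triggered mode the socket has to be drained until recv() reports no more data,
// otherwise the remaining bytes will not trigger another notification.
void drain(int fd)
{
    char buffer[4096];
    while (true)
    {
        ssize_t received = recv(fd, buffer, sizeof(buffer), 0);
        if (received <= 0)
        {
            break; // 0 means the peer closed, -1 with EAGAIN/EWOULDBLOCK means we are drained
        }
        // process "received" bytes here
    }
}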

3. About IO Patterns : Reactor and Proactor
Select, poll and epoll allow you to implement the reactor IO pattern, in which you wait for readiness events : https://en.wikipedia.org/wiki/Reactor_pattern

Another similar IO pattern is called proactor : https://en.wikipedia.org/wiki/Proactor_pattern

In the proactor pattern , you instead wait for the completion of a read from a descriptor such as a socket. Having searched around proactor implementations for a while , I don't think it is truly possible to implement one in Linux, as the kernel does not provide such a mechanism : https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports

On the other hand it is possible to implement it in Windows using IO completion ports. You can see an example implementation here : https://xania.org/200807/iocp

Note that Boost.ASIO actually uses epoll underneath in order to implement a proactor, therefore we can't say it is a true proactor on Linux.

4. Thread per client implementation

The thread-per-client implementation has an always running thread that accepts new connections and spawns a new thread per connection. It uses std::mutex only when a connection is opened or closed, in order to synchronise the bookkeeping of connected clients.

The implementation of the base TCPServer class, which is used by both the thread-per-client and reactor implementations to manage the connected peers :
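The actual class is in the repository linked below; a minimal, hypothetical sketch of such a base class ( the names are illustrative ) could look like this :

#include <mutex>
#include <vector>
#include <algorithm>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

// Sketch of a TCP server base class : it owns the listener socket and the
// mutex-protected bookkeeping of connected peer sockets.
class TCPServer
{
    public:
        virtual ~TCPServer() { if (m_listener >= 0) ::close(m_listener); }

        bool start(int port)
        {
            m_listener = ::socket(AF_INET, SOCK_STREAM, 0);
            if (m_listener < 0) return false;

            sockaddr_in address{};
            address.sin_family = AF_INET;
            address.sin_addr.s_addr = INADDR_ANY;
            address.sin_port = htons(port);

            if (::bind(m_listener, reinterpret_cast<sockaddr*>(&address), sizeof(address)) < 0) return false;
            return ::listen(m_listener, SOMAXCONN) == 0;
        }

        void addPeer(int socket)
        {
            std::lock_guard<std::mutex> lock(m_peersLock);
            m_peers.push_back(socket);
        }

        void removePeer(int socket)
        {
            std::lock_guard<std::mutex> lock(m_peersLock);
            m_peers.erase(std::remove(m_peers.begin(), m_peers.end(), socket), m_peers.end());
            ::close(socket);
        }

    protected:
        int m_listener = -1;
        std::mutex m_peersLock;
        std::vector<int> m_peers;
};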

The implementation of the thread per client server, which is derived from TCPServer :
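Again an illustrative sketch rather than the repository code, building on the TCPServer sketch above :

#include <thread>
#include <sys/socket.h>

// Sketch : one always running accept loop, one detached thread per client.
// start() from the TCPServer sketch is assumed to have been called already.
class ThreadPerClientServer : public TCPServer
{
    public:
        void run()
        {
            while (true)
            {
                int peer = ::accept(m_listener, nullptr, nullptr);
                if (peer < 0) continue;

                addPeer(peer); // the only place where the mutex is needed

                std::thread([this, peer]()
                {
                    char buffer[1024];
                    while (true)
                    {
                        ssize_t received = ::recv(peer, buffer, sizeof(buffer), 0);
                        if (received <= 0) break;
                        ::send(peer, buffer, static_cast<size_t>(received), 0); // echo back
                    }
                    removePeer(peer);
                }).detach();
            }
        }
};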

5. Reactor ( Epoll ) server implementation 

The reactor implementation accepts new connections and handles client events all on the same thread, therefore it does not require any synchronisation. It uses level triggered epoll for simplicity.

First, the implementation of io_event_listener_epoll.cpp, which is an epoll wrapper :
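The wrapper itself is in the linked repository; a minimal level triggered epoll wrapper along the same lines ( illustrative names, error handling omitted ) :

#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Minimal level triggered epoll wrapper sketch.
class IOEventListenerEpoll
{
    public:
        IOEventListenerEpoll() : m_epollDescriptor(::epoll_create1(0)), m_events(1024) {}
        ~IOEventListenerEpoll() { if (m_epollDescriptor >= 0) ::close(m_epollDescriptor); }

        bool addDescriptor(int fd)
        {
            epoll_event event{};
            event.data.fd = fd;
            event.events = EPOLLIN;
            return ::epoll_ctl(m_epollDescriptor, EPOLL_CTL_ADD, fd, &event) == 0;
        }

        void removeDescriptor(int fd)
        {
            ::epoll_ctl(m_epollDescriptor, EPOLL_CTL_DEL, fd, nullptr);
        }

        // Blocks until at least one descriptor is ready and returns only the ready ones :
        // we traverse the returned events, never the full set of registered descriptors.
        std::vector<int> waitForEvents()
        {
            std::vector<int> ready;
            int count = ::epoll_wait(m_epollDescriptor, m_events.data(), static_cast<int>(m_events.size()), -1);
            for (int i = 0; i < count; ++i)
            {
                ready.push_back(m_events[i].data.fd);
            }
            return ready;
        }

    private:
        int m_epollDescriptor;
        std::vector<epoll_event> m_events;
};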

And you can see server_reactor.cpp which uses the epoll wrapper to implement a reactor server :
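And a sketch of the reactor loop built on the two sketches above ( not the actual server_reactor.cpp ) :

#include <sys/socket.h>

// Single threaded reactor sketch : accept and client IO on the same thread,
// so no synchronisation is needed. start() is assumed to have been called.
class ReactorServer : public TCPServer
{
    public:
        void run()
        {
            m_eventListener.addDescriptor(m_listener);

            while (true)
            {
                for (int fd : m_eventListener.waitForEvents())
                {
                    if (fd == m_listener)
                    {
                        int peer = ::accept(m_listener, nullptr, nullptr);
                        if (peer >= 0)
                        {
                            addPeer(peer);
                            m_eventListener.addDescriptor(peer);
                        }
                    }
                    else
                    {
                        char buffer[1024];
                        ssize_t received = ::recv(fd, buffer, sizeof(buffer), 0);
                        if (received <= 0)
                        {
                            m_eventListener.removeDescriptor(fd);
                            removePeer(fd);
                        }
                        else
                        {
                            ::send(fd, buffer, static_cast<size_t>(received), 0); // echo back
                        }
                    }
                }
            }
        }

    private:
        IOEventListenerEpoll m_eventListener;
};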

6. Dropped connections and socket buffer sizes 

I observed disconnection issues with a high number of sockets and threads : for example, 1024 sockets and threads in the client automation and the same number in the thread-per-client server, even over the loopback adapter on the same machine. All cases had the same symptom : the client automation got socket error 104 ( Connection reset by peer ), whereas no socket error could be spotted on the server side. One thing I noticed, however, is that increasing the socket receive and send buffer sizes helped. To set the socket send and receive buffer sizes system-wide :

echo 'net.ipv4.tcp_wmem = 10240 1024000 12582912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 10240 1024000 12582912' >> /etc/sysctl.conf

And finally run "sysctl -p" so that the system picks up the changes. I tried different system-wide default receive and send buffer sizes such as 1024000, 87380, 10240 and 128 bytes and observed similar results, except that the thread-per-client server showed a high number of disconnections with 1024 clients/threads when the socket buffer sizes were only 128 bytes.

7. Benchmark

As I am measuring the IO performance of different kernel mechanisms from user space, I benchmarked on a single machine. That also avoids any network effects, since we are mainly interested in IO and context switches.

You can specify the number of clients and the number of messages when using the client automation I wrote for benchmarking. A thread is spawned for each client, and each thread sends the specified number of messages to the connected server, expecting one response per message before the automation ends. At the end, the client automation shows the total elapsed time and the average RTT ( round trip time ). It also reports the number of disconnections, which gives an idea about the accuracy of the results.
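As an illustration of the measurement only ( not the automation's code ), a single ping-pong round trip can be timed like this :

#include <chrono>
#include <sys/socket.h>

// Sketch of a single ping-pong round trip measurement on an already connected socket.
long long pingPongMicroseconds(int socket)
{
    const char ping[] = "ping";
    char pong[64];

    auto start = std::chrono::steady_clock::now();

    ::send(socket, ping, sizeof(ping), 0);
    ::recv(socket, pong, sizeof(pong), 0); // wait for the server's response

    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}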

You can find all the source code of the servers and the client automation at : https://github.com/akhin/low_latency_experiments/tree/master/epoll_vs_multithreaded_io

During the benchmark system wide TCP buffer sizes were as below :

net.ipv4.tcp_wmem = 4096 87380 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

In all benchmarks I used 100 ping-pongs between the client automation and the server, and changed the number of clients ( threads ) in each benchmark. For 100 ping-pongs :

Client number    Epoll RTT          Thread per client RTT
4                20 microseconds    62.5 microseconds
128              23 microseconds    95 microseconds
1024             30 microseconds    148 microseconds

8. Measuring context switches per thread using SystemTap

I wanted to display the context switches per thread. Therefore I first named the threads in the server implementations using pthread_setname_np :

http://man7.org/linux/man-pages/man3/pthread_setname_np.3.html

That allowed me to give an OS-level name to each thread ( essentially to each process, as threads are light-weight processes : https://en.wikipedia.org/wiki/Light-weight_process )
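For example ( a generic snippet, not the server code ), a std::thread can be named through its native handle :

#include <pthread.h>
#include <thread>

// Give an OS-level name ( at most 15 characters plus the null terminator on Linux )
// to a std::thread so that tools such as SystemTap, top or htop can display it.
// pthread_setname_np is a GNU extension; it may require _GNU_SOURCE depending on the toolchain.
void nameThread(std::thread& worker, const char* name)
{
    pthread_setname_np(worker.native_handle(), name);
}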

After that I prepared a short SystemTap ( https://sourceware.org/systemtap/ ) script that counts context switches via the Linux kernel's sched_switch event :

In order to run the script above for a specific program :

stap context_switch.stp -c program_name

When you run this SystemTap script, it reports the number of context switches per thread. You can easily notice the much higher total number of context switches in the thread-per-client implementation compared to the epoll/reactor implementation.

SystemTap probes work system-wide and therefore slow the system down, so I captured outputs for 32 clients from both the thread-per-client server and the epoll server.

Thread per client server context switch counts per thread :

[image: systemtap_thread_per_client]

Epoll/Reactor server context switch count for the single epolling thread :

[image: systemtap_epoll]


Multithreading : Memory orderings and fine grained SPSC lockless queue benchmark

1. Introduction : In a previous article, I wrote about a lockfree SPSC queue and compared it against a lock based SPSC queue : https://nativecoding.wordpress.com/2015/06/17/multithreading-lockless-thread-safe-spsc-ring-buffer-queue/

In this article, I will improve it by changing the memory ordering. You can find the source code on https://github.com/akhin/benchmarks/tree/master/lockless_spsc_sequantial_vs_acquirerelease

2. Memory orderings and C++ Standard Memory Model : Both the processor and the compiler can reorder operations in memory, mainly for optimisation purposes. Before a standard model existed, developers had to know hardware and compiler specific instructions in order to get atomic variables and multicore synchronisation via memory fences :

A nice example for compiler reordering : http://preshing.com/20120625/memory-ordering-at-compile-time/

A nice example for CPU reordering : http://preshing.com/20120515/memory-reordering-caught-in-the-act/

And here is Wikipedia's definition of a memory model :

 

In computing, a memory model describes the interactions of threads through memory and their shared use of the data.

C++ has had a standard memory model since C++11. This model tells compilers how threads access memory, and it lets developers reason about and program against well-defined memory orderings. You can see the link below for the details :

http://en.cppreference.com/w/cpp/language/memory_model

3. Sequential Consistency : By default, atomic operations have sequential consistency. That basically means all atomic loads and stores appear to happen in a single total order consistent with the order they appear in the source code, and that order is the same for all threads. This is the strongest guarantee and quite useful, and it is the default memory ordering in the C++11 standard library. In the previous SPSC ring buffer implementation, the atomic variables used sequential consistency, and it was enough to provide synchronisation between one producer and one consumer.

4. Acquire / Release : Acquire/release ordering, on the other hand, provides a more "relaxed" ordering. In short, if thread B stores to a shared variable with release ordering and thread A loads the same variable with acquire ordering and observes the stored value, then everything thread B wrote before the release store is guaranteed to be visible to thread A after the acquire load. Below you can see an example in the spirit of the one on the http://en.cppreference.com/w/cpp/atomic/memory_order page, which achieves mutual exclusion using acquire/release memory ordering from the C++11 standard model :
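A minimal sketch written for this post ( not the cppreference listing itself ) : a tiny spinlock whose lock() is an acquire exchange and whose unlock() is a release store :

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Tiny spinlock sketch : lock() acquires, unlock() releases, so the critical
// section is ordered between them just like with a mutex.
class SpinLock
{
    public:
        void lock()
        {
            while (m_flag.exchange(true, std::memory_order_acquire))
            {
                // busy wait until the previous owner releases the lock
            }
        }

        void unlock()
        {
            m_flag.store(false, std::memory_order_release);
        }

    private:
        std::atomic<bool> m_flag{false};
};

int main()
{
    SpinLock lock;
    long long counter = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
    {
        threads.emplace_back([&]()
        {
            for (int j = 0; j < 100000; ++j)
            {
                lock.lock();
                ++counter;      // protected by the acquire/release ordering
                lock.unlock();
            }
        });
    }

    for (auto& worker : threads) worker.join();
    std::cout << counter << std::endl; // always 400000
}

Because unlock() is a release and the next successful lock() is an acquire, everything written inside the critical section by one thread is visible to the next thread that takes the lock.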

5. Implementation : What we will do in the previous lockless SPSC code is specify the stores with release ordering and the loads with acquire ordering. By leaving sequential consistency behind, we are "relaxing" the memory ordering : rather than keeping a single global order for all operations as they appear in the source, we only synchronise on the variables shared between the two threads :
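The real class is in the linked repository; the essence of the change, sketched on a simplified try_push/try_pop pair with illustrative member names :

#include <atomic>
#include <cstddef>
#include <vector>

// Simplified sketch of the relaxed hot path : every load uses acquire and every
// store uses release instead of the default sequential consistency.
class spsc_queue_sketch
{
    public:
        explicit spsc_queue_sketch(std::size_t capacity) : m_buffer(capacity) {}

        bool try_push(int value)
        {
            const std::size_t head = m_head.load(std::memory_order_acquire); // was seq_cst before
            const std::size_t tail = m_tail.load(std::memory_order_acquire); // was seq_cst before

            if (head - tail == m_buffer.size())
            {
                return false; // the head would catch the tail from behind
            }

            m_buffer[head % m_buffer.size()] = value;
            m_head.store(head + 1, std::memory_order_release);               // was seq_cst before
            return true;
        }

        bool try_pop(int& value)
        {
            const std::size_t tail = m_tail.load(std::memory_order_acquire); // was seq_cst before
            const std::size_t head = m_head.load(std::memory_order_acquire); // was seq_cst before

            if (tail == head)
            {
                return false; // empty
            }

            value = m_buffer[tail % m_buffer.size()];
            m_tail.store(tail + 1, std::memory_order_release);               // was seq_cst before
            return true;
        }

    private:
        std::vector<int> m_buffer;
        std::atomic<std::size_t> m_head{0}; // producer / write index
        std::atomic<std::size_t> m_tail{0}; // consumer / read index
};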

6. Sequential vs Acquire/Release benchmark : I ran the benchmark on a 64 bit Ubuntu system with an 8-core Intel i7 Haswell CPU with a 64 byte cache line. For the benchmark I used a bash script which you can find at https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca

And here are the results for 5 million iterations :

———————————————-

Program executed : ./lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 118 milliseconds

Minimum time : 72 milliseconds

Average time : 85.40000 milliseconds
———————————————-

Program executed : ./fine_grained_lockless_ring_buffer_spsc

Iteration times : 5

Maximum time : 46 milliseconds

Minimum time : 34 milliseconds

Average time : 38.80000 milliseconds

———————————————-

As you can see we managed to get a better result by relaxing the memory order.

Note : Please note that the example classes are only for demonstrating the features; they are not production ready as they do not even free the memory they allocate. For a more complete example, please see

https://github.com/akhin/cpp_multithreaded_order_matching_engine/tree/master/source/concurrent

Multithreading : Lock contention and fine-grained vs coarse-grained benchmark

1. Introduction : In concurrent programming, lock contention can become a bottleneck in your software if a thread holds a lock for a while and other threads wait on that lock. It can also be described as competition between threads to acquire a lock. Related terms are "coarse-grained" locking and "fine-grained" locking. With fine grained locking, you might end up using more locks, but each lock protects a smaller portion of the data; note that the implementation will be harder. With coarse grained locking, on the other hand, you lock larger pieces of data, which makes it easier to implement and safer to use.

2. Why fine grained locking and what we will benchmark : As mentioned, you can easily use a coarse-grained lock to make your code thread safe; however, you are then only ensuring thread safety, not benefiting from concurrency for speed. In this blog article, I will first show a coarse-grained MPMC ( multiple producer multiple consumer ) unbounded queue. After that I will show the implementation of a fine grained one. Finally I will compare the results and use Intel VTune to identify the contention.

3. Coarse-grained MPMC unbounded queue : Below you can see the implementation. Basically we use a mutex to protect an std::queue, and we also use a condition variable to notify the pop/dequeue calls that the queue is not empty :
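The repository version is not reproduced here; a minimal sketch of the idea :

#include <condition_variable>
#include <mutex>
#include <queue>

// Coarse-grained MPMC unbounded queue sketch : one mutex protects the whole queue,
// a condition variable wakes up consumers when data arrives.
template <typename T>
class CoarseGrainedQueue
{
    public:
        void enqueue(T value)
        {
            {
                std::lock_guard<std::mutex> lock(m_lock);
                m_queue.push(std::move(value));
            }
            m_notEmpty.notify_one();
        }

        T dequeue()
        {
            std::unique_lock<std::mutex> lock(m_lock);
            m_notEmpty.wait(lock, [this]() { return !m_queue.empty(); });
            T value = std::move(m_queue.front());
            m_queue.pop();
            return value;
        }

    private:
        std::mutex m_lock;
        std::condition_variable m_notEmpty;
        std::queue<T> m_queue;
};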

4. Fine grained MPMC unbounded queue : For this implementation we will not use an std::queue; instead we will implement a simple linked list in order to separate the tail and the head of the queue. Also, to avoid touching the head in an enqueue call ( to check whether the head is null and update the head pointer when adding the first element ), we start the queue with a pre-allocated dummy node. This implementation is based on : https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html#nbq
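A sketch of that two-lock queue ( simplified and slightly adapted : dequeue here is a non-blocking try_dequeue, and the next pointer is atomic so that the empty case is well defined under the C++ memory model ) :

#include <atomic>
#include <mutex>
#include <utility>

// Fine grained MPMC unbounded queue sketch : one mutex guards the tail ( enqueue side ),
// another guards the head ( dequeue side ), and a pre-allocated dummy node keeps the two
// sides apart. T is assumed to be default constructible for the dummy node.
template <typename T>
class FineGrainedQueue
{
    public:
        FineGrainedQueue() { m_head = m_tail = new Node(); }

        ~FineGrainedQueue()
        {
            while (m_head)
            {
                Node* next = m_head->next.load();
                delete m_head;
                m_head = next;
            }
        }

        void enqueue(T value)
        {
            Node* node = new Node(std::move(value));
            std::lock_guard<std::mutex> lock(m_tailLock);
            m_tail->next.store(node, std::memory_order_release);
            m_tail = node;
        }

        bool try_dequeue(T& value)
        {
            std::lock_guard<std::mutex> lock(m_headLock);
            Node* first = m_head->next.load(std::memory_order_acquire);
            if (first == nullptr)
            {
                return false; // empty
            }
            value = std::move(first->value);
            delete m_head;     // the old dummy node
            m_head = first;    // the dequeued node becomes the new dummy
            return true;
        }

    private:
        struct Node
        {
            Node() : next(nullptr) {}
            explicit Node(T v) : value(std::move(v)), next(nullptr) {}
            T value;
            std::atomic<Node*> next;
        };

        Node* m_head;
        Node* m_tail;
        std::mutex m_headLock;
        std::mutex m_tailLock;
};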

5. Benchmark : I ran the benchmark on a 64 bit Ubuntu system with an Intel i7 Haswell CPU with a 64 byte cache line. The benchmark code created 50 writer threads and 50 reader threads, and each thread did 1000 iterations. Both executables were benchmarked over 100 runs. You can find the source code at :

https://github.com/akhin/benchmarks/tree/master/fine_grained_vs_coarse_grained

I used a simple bash script for the benchmark to run both executables in N iterations and collect max,min and average times : https://gist.github.com/akhin/40b8afabe1ae5a82c7ece633b5d26eca

Program executed : ./fine_grained_queue

Iteration times : 100

Maximum time : 34 milliseconds

Minimum time : 18 milliseconds

Average time : 20.81000 milliseconds

And below you see the coarse grained benchmark :

Program executed : ./coarse_grained_queue

Iteration times : 100

Maximum time : 60 milliseconds

Minimum time : 45 milliseconds

Average time : 47.23000 milliseconds

As you can see, for 50 reader and 50 writer threads the average difference is about 27 milliseconds per run, measured over 100 program iterations with 1000 iterations per thread.

6. Intel VTune and spotting the lock contention : I used Intel VTune on Ubuntu to see a report of both executable benchmarks :

[image: vtune_locks]

As you can see, the right hand side shows the fine grained statistics whereas the left hand side shows the coarse grained one. In the fine grained one you will notice two concentrated pieces : one mutex for the tail and one mutex for the head, whereas in the coarse grained one there is a single busy looking mutex lock. The time figures show the time waited on these locks by all threads; the time waited on the single lock in the coarse-grained example is more than three times the time waited on the two mutexes in the fine grained sample.

Note : Please note that the example classes are only for demonstrating the features; they are not production ready as they do not even free the memory they allocate. For a more complete example, please see :

https://github.com/akhin/multithreaded_order_matching_engine/tree/master/source/core/concurrency

Multithreading : Multicore programming and false-sharing benchmark

1. Introduction : Multicore programming is essential to benefit from the power of the hardware, as it allows us to truly run our code on different CPU cores. However, anyone doing multithreaded programming has to understand the underlying hardware to fully utilize it.

In most processors, the L1 and L2 caches are private, meaning they are per core. One exception is that the L1 cache can be shared between two virtual processors when hyper-threading is enabled; see the hyper-threading article on Wikipedia : https://en.wikipedia.org/wiki/Hyper-threading. On the other hand, the L3 cache is shared by the different CPU cores.

Sharing cache lines between cores introduces the problem known as false sharing. In the simplest example, if two variables in one cache line are used by two different threads on different cores, then whenever one of the threads updates its own variable, the whole cache line is invalidated and has to be updated since it is shared :

[image: false_sharing]

This can be a performance penalty if you have many falsely shared cache lines. In the following sections, I will show how false sharing affects the execution time. It is therefore a very good practice to align your data to the cache line size. You can easily do this by using the alignas specifier in C++11.

2. Benchmark code : In the benchmark code, we have a static struct variable holding three 32-bit integers ( the benchmark is executed on a 64 bit processor with a 64 byte cache line size ). We fire three threads, each operating on a different member of that static struct. To control the benchmark, I added macros that enable/disable alignment and set the CPU core IDs that the three threads are bound to :
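The actual benchmark is in the repository linked at the end; a reduced sketch of the idea ( the macro name, the core numbers and the use of atomic counters, which keep the compiler from collapsing the loops, are choices of this sketch ) :

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Reduced false sharing benchmark sketch, compile with : g++ -std=c++11 -O2 -pthread
// Three counters are hammered by three threads : without alignment they share one
// cache line, with alignas(64) each counter gets its own 64 byte line.
#define ENABLE_ALIGNMENT 1

struct Counters
{
#if ENABLE_ALIGNMENT
    alignas(64) std::atomic<uint32_t> a{0};
    alignas(64) std::atomic<uint32_t> b{0};
    alignas(64) std::atomic<uint32_t> c{0};
#else
    std::atomic<uint32_t> a{0};
    std::atomic<uint32_t> b{0};
    std::atomic<uint32_t> c{0};
#endif
};

static Counters counters;

void pin_to_core(std::thread& worker, int core)
{
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(core, &cpu_set);
    pthread_setaffinity_np(worker.native_handle(), sizeof(cpu_set), &cpu_set);
}

int main()
{
    auto work = [](std::atomic<uint32_t>& counter)
    {
        for (uint32_t i = 0; i < 100000000; ++i)
        {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    auto start = std::chrono::steady_clock::now();

    std::thread t1(work, std::ref(counters.a));
    std::thread t2(work, std::ref(counters.b));
    std::thread t3(work, std::ref(counters.c));

    pin_to_core(t1, 0);
    pin_to_core(t2, 2);
    pin_to_core(t3, 4);

    t1.join();
    t2.join();
    t3.join();

    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(end - start).count() << " seconds" << std::endl;
    return 0;
}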

3. Benchmark : I ran the benchmark on a 64 bit Ubuntu system with an 8-core Intel i7 Haswell CPU with a 64 byte cache line :

root@akhin-GS60-2PC-Ghost:/home/akhin/Desktop/aligned# cat /proc/cpuinfo |grep cache
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64
cache size : 6144 KB
cache_alignment : 64

As mentioned in the code section, I controlled alignment and thread CPU affinities using macros. Here are the results of the test steps :

1. No alignment , all threads on the same core : 0.746 seconds.

2. No alignment , thread 1 on core 0 , thread 2 on core 2, thread 3 on core 4 : 1.0295 seconds

3. Alignment , thread 1 on core 0 , thread 2 on core 2, thread 3 on core 4 : 0.2903 seconds

As you can see from the results, when we used three threads on three cores without alignment, we got a slower execution time than three threads on a single core rather than an improvement. However, as soon as I enabled alignment, I got the best result.

You can download this code from : https://github.com/akhin/benchmarks/tree/master/false_sharing

Multithreading : Lockless thread safe SPSC ring-buffer queue

1. Introduction : We need different thread safe data structures/containers in multithreaded programming. Even though many applications use MPMC ( multi producer multi consumer ) , sometimes we need single producer single consumer solutions.

In this article , I will show an SPSC thread safe queue implementation without using mutual exclusion/critical sections, based on the ring buffer data structure. This implementation is based on a video I watched at this address : https://skillsmatter.com/skillscasts/6163-high-performance-single-producer-single-consumer-in-memory-queue

And as described in the video , the original implementation is based on the LMAX Disruptor project , an open source high performance library used in the LMAX trading system : https://lmax-exchange.github.io/disruptor/

You can find source code for this article on :

https://github.com/akhin/benchmarks/tree/master/concurrency_spsc_ringbuffer_queue

2. What is a ring buffer ? : A ring buffer is a cyclic/circular buffer. It can easily be used as the underlying data structure for FIFO ( First in First out ) queue implementations.

Below you can see its structure :

[image: ring_buffer_structure]

Below you can see how it is implemented on memory :

[image: ring_buffer_memory]

For the implementation you need two indices ( or pointers ) for the head and the tail, and you advance them as you enqueue and dequeue. Below you can see a non-thread-safe ring buffer operating on 64 bit integers :
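The original listing is in the linked repository; a minimal sketch of such a non-thread-safe ring buffer :

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal non-thread-safe ring buffer of 64 bit integers : push advances the write
// index, pop advances the read index, both wrap around the fixed-size buffer.
// Note that push silently overwrites the oldest data when the buffer is full.
class RingBuffer
{
    public:
        explicit RingBuffer(std::size_t capacity) : m_buffer(capacity) {}

        void push(int64_t value)
        {
            m_buffer[m_writeIndex % m_buffer.size()] = value;
            ++m_writeIndex;
        }

        int64_t pop()
        {
            int64_t value = m_buffer[m_readIndex % m_buffer.size()];
            ++m_readIndex;
            return value;
        }

        std::size_t count() const { return m_writeIndex - m_readIndex; }

    private:
        std::vector<int64_t> m_buffer;
        std::size_t m_writeIndex = 0;
        std::size_t m_readIndex = 0;
};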

It is quite advantageous under memory constraints, as you don't need to allocate any more memory after construction. On the other hand, one disadvantage is that you have to know your maximum queue size up front; exceeding it leads to data being overwritten.

3. Lock based thread safe queue implementation : For a simple lock based thread safe implementation, we can use mutual exclusion by adding locks to the push and pop functions :
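Again a sketch rather than the original listing, reusing the layout of the ring buffer above :

#include <cstdint>
#include <mutex>
#include <vector>

// Lock based thread safe ring buffer sketch : every push and pop takes the same
// mutex, so the producer and the consumer serialise on it.
class LockedRingBuffer
{
    public:
        explicit LockedRingBuffer(std::size_t capacity) : m_buffer(capacity) {}

        bool push(int64_t value)
        {
            std::lock_guard<std::mutex> lock(m_lock);
            if (m_writeIndex - m_readIndex == m_buffer.size())
            {
                return false; // full
            }
            m_buffer[m_writeIndex % m_buffer.size()] = value;
            ++m_writeIndex;
            return true;
        }

        bool pop(int64_t& value)
        {
            std::lock_guard<std::mutex> lock(m_lock);
            if (m_writeIndex == m_readIndex)
            {
                return false; // empty
            }
            value = m_buffer[m_readIndex % m_buffer.size()];
            ++m_readIndex;
            return true;
        }

    private:
        std::mutex m_lock;
        std::vector<int64_t> m_buffer;
        std::size_t m_writeIndex = 0;
        std::size_t m_readIndex = 0;
};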


4. Lockless thread safe queue implementation : Rather than using locks, we will get help from atomic variables. To make it thread safe, we need to ensure that the head pointer never catches the tail pointer from behind, or vice versa. For the implementation we introduce two more helper functions called try_push and try_pop :
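The full class is in the linked repository; the essence of try_push and try_pop, sketched with atomic indices using the default sequential consistency :

#include <atomic>
#include <cstdint>
#include <vector>

// Lockless SPSC ring buffer sketch : try_push fails instead of letting the write
// index lap the read index, try_pop fails when the queue looks empty.
// All atomic accesses below use the default sequential consistency.
class LocklessRingBuffer
{
    public:
        explicit LocklessRingBuffer(std::size_t capacity) : m_buffer(capacity) {}

        bool try_push(int64_t value)
        {
            const std::size_t write_index = m_writeIndex.load();
            if (write_index - m_readIndex.load() == m_buffer.size())
            {
                return false; // the head would catch the tail from behind
            }
            m_buffer[write_index % m_buffer.size()] = value;
            m_writeIndex.store(write_index + 1);
            return true;
        }

        bool try_pop(int64_t& value)
        {
            const std::size_t read_index = m_readIndex.load();
            if (read_index == m_writeIndex.load())
            {
                return false; // empty
            }
            value = m_buffer[read_index % m_buffer.size()];
            m_readIndex.store(read_index + 1);
            return true;
        }

    private:
        std::vector<int64_t> m_buffer;
        std::atomic<std::size_t> m_writeIndex{0}; // head, owned by the producer
        std::atomic<std::size_t> m_readIndex{0};  // tail, owned by the consumer
};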

As you can see above , try_push checks whether the head ( write index ) is about to catch the tail from behind, and that comparison uses an atomic load for synchronisation. The same logic applies to try_pop. By doing this we introduce a virtual limit to the ring buffer queue : instead of silently overwriting data, it behaves as a bounded queue that rejects pushes when it is full.

5. Benchmark code :

For the benchmark we create only 1 producer and 1 consumer thread. However, we can do as many operations as we want. For benchmarking I specified the ring buffer size as 1000 and each thread makes 1 million operations :
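An outline of such a driver ( illustrative, using the lockless sketch above ) :

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    LocklessRingBuffer queue(1000);
    const int64_t operationCount = 1000000;

    auto start = std::chrono::steady_clock::now();

    std::thread producer([&]()
    {
        for (int64_t i = 0; i < operationCount; ++i)
        {
            while (!queue.try_push(i)) {} // spin until there is room
        }
    });

    std::thread consumer([&]()
    {
        int64_t value = 0;
        for (int64_t i = 0; i < operationCount; ++i)
        {
            while (!queue.try_pop(value)) {} // spin until there is data
        }
    });

    producer.join();
    consumer.join();

    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " milliseconds" << std::endl;
    return 0;
}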

 

6. Benchmark result – Lockless vs Locked queue : Here is my test system :

uname -a
Linux akhin-GS60-2PC-Ghost 3.16.0-40-generic #54~14.04.1-Ubuntu SMP Wed Jun 10 17:30:45 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/cpuinfo

model name : Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz

For benchmarking, I used "perf stat" on my Ubuntu 14.04 LTS system :

Lock based queue  :  0.437855869 seconds time elapsed

Lockless queue : 0.109763995 seconds time elapsed

Note : Please note that the example classes are only for demonstrating the features; they are not production ready as they do not even free the memory they allocate. For a more complete example, please see :

https://github.com/akhin/cpp_multithreaded_order_matching_engine/tree/master/source/concurrent