Context switches: Epoll vs multithreaded IO benchmark

1. Introduction and multiplexed IO
The classic way to implement a server that needs to handle multiple clients simultaneously is a thread-per-client solution, where each thread constantly calls the blocking socket APIs for its own connection.

However, when using multiplexed IO, you wait for events instead of polling every descriptor yourself. Another advantage is that you can implement such a server using only one thread, which is nice for avoiding context switches. The purpose of this post is to measure the different IO mechanisms available in user space. The implementations use TCP; however, all tests were done on the same machine via the loopback adapter in order to avoid network effects and focus on the kernel.

2. Select, poll and epoll

Linux provides select, poll and epoll for multiplexed IO:

http://man7.org/linux/man-pages/man2/select.2.html

http://man7.org/linux/man-pages/man2/poll.2.html

http://man7.org/linux/man-pages/man7/epoll.7.html

We will be using epoll in this post, as its biggest advantage is that you traverse ready events rather than scanning all file descriptors to look for events. Therefore you don't need to loop over idle file descriptors.

Another note about epoll is that it provides two notification modes:

Level-triggered mode: You will get an event notification as long as there is data to process. This means you will keep getting the same event if you haven't drained the buffer yet.

Edge-triggered mode: You will get the notification only once, regardless of whether you processed the buffer or not.
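As a minimal sketch of how the two modes are selected when registering a descriptor (the epoll calls are the real API; the surrounding function and descriptor names are illustrative):

#include <sys/epoll.h>

// Assume fd is an already created, non-blocking socket descriptor.
int register_fd(int epoll_fd, int fd, bool edge_triggered)
{
    epoll_event event{};
    event.data.fd = fd;
    // EPOLLIN alone gives level-triggered behaviour: epoll_wait keeps
    // reporting the descriptor while unread data remains in its buffer.
    event.events = EPOLLIN;
    if (edge_triggered)
    {
        // EPOLLET switches to edge-triggered: only a state change (new data
        // arriving) produces a notification, so the handler must drain the
        // socket until it would block.
        event.events |= EPOLLET;
    }
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
}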

3. About IO patterns: Reactor and Proactor
Select, poll and epoll allow you to implement the reactor IO pattern, in which you wait for readiness events: https://en.wikipedia.org/wiki/Reactor_pattern

Another, similar IO pattern is called the proactor: https://en.wikipedia.org/wiki/Proactor_pattern

In the proactor pattern, you instead wait for the completion of a read from a descriptor such as a socket. After searching around for a while, I don't think it is truly possible to implement on Linux, as the kernel does not provide such a mechanism: https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports

On the other hand, it is possible to implement it on Windows using IO completion ports. You can see an example implementation here: https://xania.org/200807/iocp

Note that Boost.ASIO actually uses epoll underneath to implement the proactor, therefore we can't say it is a true proactor on Linux.

4. Thread per client implementation

The thread-per-client implementation has an always-running thread to accept new connections and spawns a new thread per connection. It uses std::mutex only when a new connection happens, in order to synchronise the book-keeping of connected clients.

The base TCPServer class is used by both the thread-per-client and reactor implementations to manage the connected peers; its full implementation is in the repository linked in the benchmark section below.
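As a rough sketch of the shape such a base class can take (the names and members below are illustrative, not the repository code):

#include <mutex>
#include <set>

// Illustrative sketch only - the real TCPServer lives in the linked repository.
class TCPServer
{
public:
    virtual ~TCPServer() = default;
    virtual bool start(int port) = 0;   // derived classes decide how they accept and poll
    virtual void stop() = 0;

protected:
    // Book-keeping of connected peers, keyed by socket descriptor.
    // The mutex only matters for implementations touching this from several threads.
    void addPeer(int fd)
    {
        std::lock_guard<std::mutex> lock(m_peersLock);
        m_peers.insert(fd);
    }

    void removePeer(int fd)
    {
        std::lock_guard<std::mutex> lock(m_peersLock);
        m_peers.erase(fd);
    }

    std::mutex m_peersLock;
    std::set<int> m_peers;
};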

The thread-per-client server, derived from TCPServer, is implemented in the same repository.
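Again only an illustrative sketch of the idea, an always-running acceptor plus one worker thread per client, rather than the repository code (it builds on the TCPServer sketch above; shutdown and error handling are simplified):

#include <thread>
#include <vector>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

class ThreadPerClientServer : public TCPServer
{
public:
    bool start(int port) override
    {
        m_listenFd = ::socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(port);
        if (::bind(m_listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) return false;
        if (::listen(m_listenFd, SOMAXCONN) < 0) return false;

        // One always-running acceptor thread; one worker thread per accepted client.
        m_acceptor = std::thread([this]
        {
            for (;;)
            {
                int clientFd = ::accept(m_listenFd, nullptr, nullptr);
                if (clientFd < 0) break;
                addPeer(clientFd);                        // mutex-protected book-keeping
                m_workers.emplace_back([this, clientFd]   // dedicated thread per client
                {
                    char buffer[1024];
                    ssize_t n;
                    while ((n = ::recv(clientFd, buffer, sizeof(buffer), 0)) > 0)
                        ::send(clientFd, buffer, n, 0);   // echo-style ping-pong
                    removePeer(clientFd);
                    ::close(clientFd);
                });
            }
        });
        return true;
    }

    void stop() override
    {
        ::close(m_listenFd);                              // makes accept() fail, ends the acceptor
        if (m_acceptor.joinable()) m_acceptor.join();
        for (auto& t : m_workers) if (t.joinable()) t.join();
    }

private:
    int m_listenFd = -1;
    std::thread m_acceptor;
    std::vector<std::thread> m_workers;
};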

5. Reactor (epoll) server implementation

The reactor implementation accepts new connections and handles client events all on the same thread, therefore it does not require any synchronisation. It uses level-triggered epoll for simplicity.

Firstly, io_event_listener_epoll.cpp is an epoll wrapper; its full source is in the repository.
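A sketch of what such a level-triggered wrapper boils down to (illustrative names, not the actual file):

#include <sys/epoll.h>
#include <unistd.h>
#include <vector>
#include <functional>

// Illustrative sketch of a level-triggered epoll wrapper, not the repository file.
class IOEventListenerEpoll
{
public:
    bool create(int maxEvents = 1024)
    {
        m_events.resize(maxEvents);
        m_epollFd = ::epoll_create1(0);
        return m_epollFd >= 0;
    }

    bool addDescriptor(int fd)
    {
        epoll_event event{};
        event.data.fd = fd;
        event.events = EPOLLIN;   // level-triggered by default
        return ::epoll_ctl(m_epollFd, EPOLL_CTL_ADD, fd, &event) == 0;
    }

    void removeDescriptor(int fd)
    {
        ::epoll_ctl(m_epollFd, EPOLL_CTL_DEL, fd, nullptr);
    }

    // Blocks until at least one descriptor is ready, then invokes the callback
    // once per ready descriptor - only ready descriptors are traversed.
    void waitAndDispatch(const std::function<void(int fd)>& onReadable)
    {
        int ready = ::epoll_wait(m_epollFd, m_events.data(), static_cast<int>(m_events.size()), -1);
        for (int i = 0; i < ready; ++i)
            onReadable(m_events[i].data.fd);
    }

    void close() { ::close(m_epollFd); }

private:
    int m_epollFd = -1;
    std::vector<epoll_event> m_events;
};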

server_reactor.cpp then uses the epoll wrapper to implement the reactor server.
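And a sketch of how a reactor server can combine the wrapper above with the accept logic on a single thread (again illustrative, building on the TCPServer and wrapper sketches; start() simply blocks and runs the event loop here):

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

class ReactorServer : public TCPServer
{
public:
    bool start(int port) override
    {
        m_listenFd = ::socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(port);
        if (::bind(m_listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) return false;
        if (::listen(m_listenFd, SOMAXCONN) < 0) return false;

        m_listener.create();
        m_listener.addDescriptor(m_listenFd);   // the listening socket is just another event source

        // Everything - accepting and client IO - happens on this one thread,
        // so no mutex is needed around the peer book-keeping.
        m_running = true;
        while (m_running)
        {
            m_listener.waitAndDispatch([this](int fd)
            {
                if (fd == m_listenFd)
                {
                    int clientFd = ::accept(m_listenFd, nullptr, nullptr);
                    if (clientFd >= 0) { addPeer(clientFd); m_listener.addDescriptor(clientFd); }
                }
                else
                {
                    char buffer[1024];
                    ssize_t n = ::recv(fd, buffer, sizeof(buffer), 0);
                    if (n > 0) ::send(fd, buffer, n, 0);   // echo-style ping-pong
                    else { m_listener.removeDescriptor(fd); removePeer(fd); ::close(fd); }
                }
            });
        }
        return true;
    }

    void stop() override { m_running = false; ::close(m_listenFd); m_listener.close(); }

private:
    bool m_running = false;
    int m_listenFd = -1;
    IOEventListenerEpoll m_listener;
};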

6. Dropped connections and socket buffer sizes 

I observed disconnection issues with a high number of sockets and threads, for example 1024 sockets and threads on the client automation side and the same on the thread-per-client server side, even over the loopback adapter on the same machine. All cases had the same symptom: the client automation program got socket error 104 (connection reset by peer), whereas I could not spot any socket error on the server side. However, one thing I noticed is that increasing the socket receive and send buffer sizes helped. To set the socket send and receive buffer sizes system-wide:

echo 'net.ipv4.tcp_wmem= 10240 1024000 12582912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem= 10240 1024000 12582912' >> /etc/sysctl.conf

Finally, run "sysctl -p" so that the system picks the changes up. I tried different system-wide default receive and send buffer sizes such as 1024000, 87380, 10240 and 128 bytes and observed similar benchmark results, but I did see a high number of disconnections while benchmarking the thread-per-client server with 1024 clients/threads when the socket buffer sizes were only 128 bytes.

7. Benchmark

As I am measuring the IO performance of different kernel mechanisms from user space, I benchmarked on a single machine. That also avoids any network effects, as I am mainly interested in IO and context switches.

You can specify the number of clients and the number of messages when using the client automation which I wrote for benchmarking. A thread is spawned for each client, and each thread sends the specified number of messages to the connected server. Each thread then expects one response per message before the automation can finish.
At the end, the client automation shows the total elapsed time and the average RTT (round trip time). It also reports the number of disconnections, which gives an idea about the accuracy of the results.
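As an illustration of the per-client loop such an automation performs (hypothetical helper, not the repository code):

#include <chrono>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative per-client ping-pong loop: each client thread runs this on its
// already connected socket, and the automation averages the elapsed times.
std::chrono::microseconds runPingPong(int connectedFd, int messageCount)
{
    const std::string ping = "ping";
    char reply[64];
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < messageCount; ++i)
    {
        if (::send(connectedFd, ping.data(), ping.size(), 0) < 0) break;
        if (::recv(connectedFd, reply, sizeof(reply), 0) <= 0) break;   // one response expected per message
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start);
}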

You can find all the source code of the servers and the client automation at: https://github.com/akhin/low_latency_experiments/tree/master/epoll_vs_multithreaded_io

During the benchmark, the system-wide TCP buffer sizes were as below:

net.ipv4.tcp_wmem = 4096 87380 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216

In all benchmarks I used 100 ping-pongs between the client automation and the server, and changed the number of clients (threads) in each benchmark. For 100 ping-pongs:

Client number    Epoll RTT          Thread-per-client RTT
4                20 microseconds    62.5 microseconds
128              23 microseconds    95 microseconds
1024             30 microseconds    148 microseconds

8. Measuring context switches per thread using SystemTap

I wanted to display the context switches per thread. Therefore, I first named the threads in the server implementations using pthread_setname_np:

http://man7.org/linux/man-pages/man3/pthread_setname_np.3.html

That allowed me to give an OS-level name to each thread (basically a process, as threads are light-weight processes: https://en.wikipedia.org/wiki/Light-weight_process).
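For example, naming a worker thread could look like this (pthread_setname_np is a GNU-specific API; the thread name itself is illustrative):

#include <pthread.h>
#include <thread>

void spawnNamedWorker()
{
    std::thread worker([] { /* per-client handling loop */ });
    // Linux limits the name to 15 characters plus the null terminator.
    pthread_setname_np(worker.native_handle(), "client_worker");
    worker.detach();
}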

After that, I prepared a short SystemTap (https://sourceware.org/systemtap/) script to measure context switches via the Linux kernel sched_switch event:
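The original script is in the repository; a sketch along these lines, counting switches per thread name via the SystemTap scheduler tapset, captures the idea (treat it as an approximation, not the exact script):

# context_switch.stp - count context switches per thread name (sketch)
global switches

probe scheduler.ctxswitch
{
    # next_task_name / next_tid come from the scheduler tapset
    switches[next_task_name, next_tid]++
}

probe end
{
    foreach ([name, tid] in switches-)
        printf("%-20s tid:%d switches:%d\n", name, tid, switches[name, tid])
}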

In order to run the script above for a specific program:

stap context_switch.stp -c program_name

When you run this SystemTap script, it will report the number of context switches per thread. You can easily notice the high number of total context switches in the thread-per-client implementation compared to the epoll/reactor implementation.

SystemTap probes work system-wide and therefore slow down the system, so I captured outputs for 32 clients from both the thread-per-client server and the epoll server.

Thread-per-client server context switch counts per thread:

[screenshot: systemtap_thread_per_client]

Epoll/reactor server context switch count for the single epolling thread:

[screenshot: systemtap_epoll]
