1. Introduction and multiplexed IO
The classic implementation of a server that needs to handle multiple clients simultaneously is thread-per-client, in which each connection gets its own thread blocking on socket APIs.
With multiplexed IO, on the other hand, you wait for events instead of polling each socket every time. Another advantage it brings is that you can implement such a server using only one thread, which is nice for avoiding context switches. This post's purpose is measuring the different IO mechanisms available from user space. The implementations use TCP, however all tests were done on the same machine via the loopback adapter in order to avoid network effects and focus on the kernel.
2. Select, poll and epoll
Linux provides select, poll and epoll for multiplexed IO :
We will be using epoll in this post, as its biggest advantage is that you traverse ready events rather than file descriptors when looking for events. Therefore you don't need to loop over idle file descriptors.
Another note about epoll is that it provides two notification modes :
Level-triggered mode : you get an event notification as long as there is data to process. That means you will keep getting the same event if you didn't process the buffer.
Edge-triggered mode : you get a notification only once per state change, regardless of whether you processed the buffer or not.
3. About IO Patterns : Reactor and Proactor
Select, poll and epoll allow you to implement the reactor IO pattern, in which you wait for readiness events : https://en.wikipedia.org/wiki/Reactor_pattern
Another similar IO pattern is called proactor : https://en.wikipedia.org/wiki/Proactor_pattern
In the proactor pattern, you instead wait for the completion of an operation, such as a read from a descriptor like a socket. Having researched the proactor pattern for a while, I don't think it is truly possible to implement on Linux, as no such kernel mechanism is provided : https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports
On the other hand, it is possible to implement it on Windows using IO completion ports. You can see an example implementation here : https://xania.org/200807/iocp
Note that Boost.ASIO actually uses epoll underneath to implement the proactor, therefore we can't say it is a true proactor on Linux.
4. Thread per client implementation
The thread-per-client implementation has an always-running thread that accepts new connections and spawns a new thread per connection. It uses std::mutex only when a new connection happens, in order to synchronise the bookkeeping of connected clients.
The implementation of the base TCPServer class, which is used by both the thread-per-client and reactor implementations to manage the connected peers :
The implementation of the thread-per-client server, which is derived from TCPServer :
5. Reactor ( Epoll ) server implementation
The reactor implementation accepts new connections and handles client events all on the same thread, therefore it does not require any synchronisation. It uses level-triggered epoll for simplicity.
Firstly, the implementation of io_event_listener_epoll.cpp, which is an epoll wrapper :
And server_reactor.cpp, which uses the epoll wrapper to implement a reactor server :
6. Dropped connections and socket buffer sizes
I observed disconnection issues with a high number of sockets and threads: for example, 1024 sockets and threads on the client automation side and the same on the thread-per-client server side, even when using the loopback adapter on the same machine. All runs had the same symptom: the client automation program got socket error 104 ( Connection reset by peer ), whereas I could not spot any socket error on the server side. However, one thing I noticed is that increasing the socket receive and send buffer sizes helped. In order to set the socket send and receive buffer sizes system-wide :
echo 'net.ipv4.tcp_wmem = 10240 1024000 12582912' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 10240 1024000 12582912' >> /etc/sysctl.conf
And eventually type "sysctl -p" so that the system picks the changes up. I tried different defaults and observed similar results for different system-wide socket receive and send default buffer sizes such as 1024000, 87380, 10240 and 128 bytes, though I observed a high number of disconnections while benchmarking the thread-per-client server with 1024 clients/threads when the socket buffer sizes were only 128 bytes.
7. Benchmarks
As I am measuring the IO performance of different kernel mechanisms from user space, I benchmarked on a single machine. That is also useful to avoid any network effect, as I am mainly interested in IO and context switches.
You can specify the number of clients and the number of messages when using the client automation I wrote for benchmarking. A thread will be spawned for each client, and each thread will send the specified number of messages to the connected server, expecting a response per message before the automation ends.
At the end, the client automation will show you the total elapsed time and the average RTT ( round trip time ). It will also report the number of disconnections, which gives an idea about the accuracy of the results.
You can find all the source code of the servers and the client automation on : https://github.com/akhin/low_latency_experiments/tree/master/epoll_vs_multithreaded_io
During the benchmarks, the system-wide TCP buffer sizes were as below :
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
In all benchmarks, I used 100 ping-pongs between the client automation and the server, and changed the number of clients ( threads ) in each benchmark. For 100 ping-pongs :
Client number    Epoll RTT          Thread per client RTT
4                20 microseconds    62.5 microseconds
128              23 microseconds    95 microseconds
1024             30 microseconds    148 microseconds
8. Measuring context switches per thread using SystemTap
I wanted to display the context switch count per thread. Therefore, first, I used named threads in the server implementations via pthread_setname_np :
That allowed me to give an OS-level name to each thread ( essentially to each process, as threads are light-weight processes : https://en.wikipedia.org/wiki/Light-weight_process )
After that, I prepared a short SystemTap ( https://sourceware.org/systemtap/ ) script to measure context switches via the Linux kernel's sched_switch event :
In order to run the script above for a specific program :
stap context_switch.stp -c program_name
When you run this SystemTap script, it will report the number of context switches per thread. You can easily notice the high number of total context switches in the thread-per-client implementation compared to the epoll/reactor implementation.
SystemTap probes work system-wide and therefore slow the system down, so I captured outputs for 32 clients from both the thread-per-client server and the epoll server.
Thread per client server context switch counts per thread :
Epoll/Reactor server context switch count for the single epolling thread :