1. Introduction: In this article, I will compare virtual methods versus static polymorphism ( Curiously Recurring Template Pattern) Regarding CRTP , you can look at my previous blog entry about C++ idioms : https://nativecoding.wordpress.com/2015/01/11/important-c-idioms/
Basically unlike virtual methods invokes, you don`t have vtables but using templates to mimic “polymorphism” behaviour.
2. My expectation before benchmark : As you know, even though C++ standard doesn`t specify the implementation details of virtual methods such as vtable and vptr, modern compilers implement it this way. For reference , see the topic “What happens in the hardware when we call a virtual function” at https://isocpp.org/wiki/faq/virtual-functions Therefore it is not possible to inline virtual methods as long as you use pointers or references. ( And obviously you will want to use them to benefit from polymorphism) As inlining helps you to avoid function call costs, static polymorphism might help in big loops.
3. Test environment and the code : As OS, I used CentOS7:
Linux localhost.localdomain 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Here is some of CPU details :
cat /proc/cpuinfo processor :
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i3-2370M CPU @ 2.40GHz
stepping : 7
microcode : 0x29
cpu MHz : 799.968
cache size : 3072 KB
And finally the compiler I`ve used :
gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
You can find a link to Github which has full source , makefiles and assembly outputs. Here is the important bits :
a) Virtual methods test loop :
b) CRTP test loop :
We will pass test_run parameter as 25000 which means there will be 25000*25000 iterations. Finally as for the benchmark tools , we will use gprof and “perf stats”.
4. Benchmark 1 , No-inlining case : In compilation settings, we initially disable compiler`s inlining and don`t specify any optimisations :
CFLAGS= -I. -std=c++11 -c -fno-inline
Virtual method results ( 3.901376310 seconds time elapsed ) :
CRTP results ( 7.564107505 seconds time elapsed ) :
As for the initial benchmark with no inlining and no optimisations specified, and we clearly see that virtual methods are faster than CRTP. Even though virtual methods can`t be inlined and there is an indirection due to query to the vtable and virtual methods runtime suffered less than CRTP when it came to function calls. In gprof we can see only VirtualDerived::tick , whereas we see more than one function call of CRTP implementation. By looking at perf results , it looks like a non-inlined CRTP returns us as more CPU cycles, more branches created in CPU pipeline.
We see that virtual methods implementation is almost 2 (1,93 ) times faster than CRTP implementation
5. Benchmark 2 , 03 and no-inlining : In this step, we keep inlining as disabled and as an addition, we use 03 flag :
CFLAGS= -I. -std=c++11 -c -O3 -fno-inline
Virtual methods results ( 1.873269419 seconds time elapsed ) :
CRTP results ( 3.716805969 seconds time elapsed ) :
As we introduced O3 optimisation, we notice less instructions and clock cycles and branches for virtual methods compared to benchmark1. However interestingly that wasn`t the case for CRTP as GCC 03 introduced more clock cycles, instructions and branches.
We see that virtual methods implementation is almost 2 (1.98) times faster than CRTP implementation
6. Benchmark 3, 03 and inlining : Finally we unleash the compiler`s power by enabling O3 and inlining at the same time :
CFLAGS= -I. -std=c++11 -c -O3
Virtual methods results ( 1.611001145 seconds time elapsed ) :
CRTP results ( 0.245045202 seconds time elapsed ) :
Finally , we enable inlining along with 03. We see an incredible performance boost in CRTP, where all parameters especially branches decreased greatly.
We see that CRTP implementation is 6.5 times faster than virtual methods implementation
7. Let`s look at assembly output from benchmark 3 : First let`s see virtual method output :
And here is CRTP output :
The first thing to notice is that virtual method assembly output shows us that it goes through vtable and therefore there is an extra jump ( Notice the star operator next to callq instruction). On the other hand there is no specific jump in CRTP assembly. Another important thing about CRTP code, you will notice SIMD instructions ( movdqa ). It looks like we can get support of vectorisation by enabling inlining.
8. Conclusion :
- We can verify that CRTP when inlining enabled can be much faster than traditional virtual methods.
- By looking at gprof for function calls and perf stat , inlining also helps us to be more CPU pipeline friendly by producing less instructions and branches. Therefore this also leads to less branch misses.
- By looking at the results of benchmark2 ( o3 and no inline ) and benchmark 3 ( 03 and inline ) , we have seen an incredible performance increase. Therefore we can say that inlining looks like allowing us much further optimisation options such as SIMD instructions.
9. Github Link for source files, makefile and assembly outputs :