Introduction

Single instruction, multiple data (SIMD) instructions, also known as multimedia extensions, have been available for many years. They are designed to significantly accelerate code execution; however, they require expertise to be used correctly and depend on non-uniform compiler support, low-level intrinsics, or vendor-specific libraries.

bSIMD is a C++ library which aims to simplify the error-prone process of developing applications using SIMD instruction sets. bSIMD is designed to integrate seamlessly into existing projects so that you can quickly and easily start developing high-performance, portable and future-proof software.

Why use bSIMD?


bSIMD standardizes and simplifies the use of SIMD instructions across hardware by freeing you from verbose, low-level SIMD intrinsics. Furthermore, the portability of bSIMD eliminates the need to re-write cumbersome code for each revision of each target architecture, accounting for each vendor's API as well as architecture-dependent implementation details. This greatly reduces the design complexity and maintenance burden of SIMD code, significantly decreasing the time required to develop, test and deploy software, as well as the scope for introducing bugs.

bSIMD allows you to focus on the important part of your work: the development of new features and functionality. We take care of all of the architecture- and compiler-specific details, and we provide updates when new architectures are released by manufacturers. All you have to do is re-compile your code each time you wish to target a new architecture. bSIMD does this by providing the following components:

  • a proper value-semantic wrapper for SIMD registers
  • over 300 vectorized mathematical functions (sketched briefly after this list)
  • an automatic system to detect and exploit architecture-specific optimization opportunities
  • standard-compliant iterators for iterating over contiguous ranges of data in a SIMD-compatible way
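
As a taste of the first two components, here is a minimal sketch combining the pack wrapper with one of the vectorized mathematical functions. The bs::pack type and bs::sqrt are taken from the bSIMD documentation; the exact header layout and the printable pack are assumptions of this sketch and may differ between releases.

#include <iostream>
#include <boost/simd/pack.hpp>
#include <boost/simd/function/sqrt.hpp>

namespace bs = boost::simd;

int main() {
  // A pack of four floats with value semantics: it copies and assigns
  // like a plain value, but maps onto a SIMD register.
  bs::pack<float, 4> p{1.f, 4.f, 9.f, 16.f};
  // bs::sqrt applies the square root to every lane at once.
  std::cout << bs::sqrt(p) << '\n';
  return 0;
}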

A Short Example


Let's take a simple case where we calculate the sum of two vectors of 32-bit floats:

for (int i = 0; i < size; ++i) {
  res[i] = data0[i] + data1[i];
}

Each element of the result vector is independent of every other element, so this loop may easily be vectorized: there is latent data parallelism which may be exploited. This simple loop may be vectorized for an x86 processor using Intel intrinsic functions. For example, the following code vectorizes this loop for an SSE-enabled processor:

for (int i = 0; i < size; i += 4) {
  __m128 v0_sse = _mm_load_ps(&data0[i]); // load 4 floats from each input
  __m128 v1_sse = _mm_load_ps(&data1[i]);
  __m128 r_sse  = _mm_add_ps(v0_sse, v1_sse); // add 4 pairs at once
  _mm_store_ps(&res[i], r_sse); // store 4 results
}
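
Note that this version silently assumes that size is a multiple of 4 and that the arrays are 16-byte aligned, as _mm_load_ps and _mm_store_ps require. A hedged sketch of the usual remedy, finishing any leftover elements in scalar code:

#include <xmmintrin.h>

int i = 0;
// Process full 4-wide vectors while at least 4 elements remain.
for (; i + 4 <= size; i += 4) {
  __m128 v0 = _mm_load_ps(&data0[i]);
  __m128 v1 = _mm_load_ps(&data1[i]);
  _mm_store_ps(&res[i], _mm_add_ps(v0, v1));
}
// Scalar tail for sizes that are not a multiple of 4.
for (; i < size; ++i)
  res[i] = data0[i] + data1[i];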

Looks difficult? How about we vectorize it for the next generation of Intel processors, equipped with AVX instructions:

for (int i = 0; i < size; i += 8) {
  __m256 v0_avx = _mm256_load_ps(&data0[i]); // load 8 floats from each input
  __m256 v1_avx = _mm256_load_ps(&data1[i]);
  __m256 r_avx  = _mm256_add_ps(v0_avx, v1_avx); // add 8 pairs at once
  _mm256_store_ps(&res[i], r_avx); // store 8 results
}
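
The aligned loads and stores also constrain the buffers themselves: _mm_load_ps expects 16-byte aligned addresses and _mm256_load_ps expects 32-byte aligned ones. One common way to satisfy both, sketched here, is _mm_malloc/_mm_free from the same intrinsics headers:

#include <immintrin.h>

// 32-byte alignment satisfies both the SSE and the AVX variants.
float* data0 = static_cast<float*>(_mm_malloc(size * sizeof(float), 32));
float* data1 = static_cast<float*>(_mm_malloc(size * sizeof(float), 32));
float* res   = static_cast<float*>(_mm_malloc(size * sizeof(float), 32));
// ... fill the inputs and run the vectorized loop ...
_mm_free(res);
_mm_free(data1);
_mm_free(data0);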

Both of these processors are manufactured by Intel, yet two different versions of the code are required to get the best possible performance from each. Imagine the complication of moving to another manufacturer's processor, for example the ARM processors found in most smartphones. Developing for a smartphone is much harder than for a desktop PC, as you have significantly less processing power to play with, as well as limited battery life to consider. Software performance is even more important in such a constrained environment! Thankfully, the bSIMD development team has been thinking about you: bSIMD is designed to integrate seamlessly into any mobile development environment.

Let's try re-writing this same simple loop for a smartphone with a NEON-equipped processor:

std::size_t const f_card_arm = 4; // a NEON register holds 4 floats
for (int i = 0; i < size; i += f_card_arm) {
  float32x4_t v0_arm = vld1q_f32(&data0[i]); // load 4 floats from each input
  float32x4_t v1_arm = vld1q_f32(&data1[i]);
  float32x4_t r_arm  = vaddq_f32(v0_arm, v1_arm); // add 4 pairs at once
  vst1q_f32(&res[i], r_arm); // store 4 results
}

This is quickly getting complicated and annoying. Wouldn't life be much easier if someone else took care of this mess? Imagine being able to write one version of your code with optimal performance across all architectures, compilers and operating systems. Imagine not having to re-write your code for each new processor released. Well, imagine no more and behold the beauty and simplicity of bSIMD.

Now, look at how much simpler the code becomes with bSIMD:

namespace bs = boost::simd;
using pack_t = bs::pack<float>;
for (int i = 0; i < size; i += pack_t::static_size) {
  pack_t v0(&data0[i]), v1(&data1[i]); // load one register's worth from each input
  bs::aligned_store(v0 + v1, &res[i]); // SIMD add, then aligned store
}
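
Putting it together, here is a hedged sketch of a complete program. The bs::allocator aligned allocator and the header paths follow the bSIMD documentation, but treat their exact locations as assumptions that may vary between releases:

#include <vector>
#include <boost/simd/pack.hpp>
#include <boost/simd/function/aligned_store.hpp>
#include <boost/simd/memory/allocator.hpp>

namespace bs = boost::simd;

int main() {
  using pack_t = bs::pack<float>;
  std::size_t const size = 1024; // assumed to be a multiple of pack_t::static_size

  // bs::allocator guarantees storage aligned for SIMD loads and stores.
  std::vector<float, bs::allocator<float>> data0(size, 1.f);
  std::vector<float, bs::allocator<float>> data1(size, 2.f);
  std::vector<float, bs::allocator<float>> res(size);

  for (std::size_t i = 0; i < size; i += pack_t::static_size) {
    pack_t v0(&data0[i]), v1(&data1[i]); // load one register's worth from each input
    bs::aligned_store(v0 + v1, &res[i]); // SIMD add, then aligned store
  }
  return 0;
}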

And of course, if your C++ sense is tingling, you may have noticed that this piece of code could actually be written using a standard algorithm such as std::transform. The good news is that bSIMD also provides vectorized versions of these algorithms:

boost::simd::transform(&data0[0], &data0[0] + size, &data1[0], &res[0], boost::simd::plus);

Happy user of C++14? bSIMD also supports the latest additions to the language, such as generic lambdas:

boost::simd::transform(&data0[0], &data0[0] + size, &data1[0], &res[0],
                       [](auto const& a, auto const& b) { return a + b; });
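
For comparison, the scalar standard-library call looks almost identical; the idea behind bSIMD's transform is to keep this familiar interface while processing the data in pack-sized chunks internally:

#include <algorithm>
#include <functional>

// Scalar baseline: the same call shape, one element at a time.
std::transform(&data0[0], &data0[0] + size, &data1[0], &res[0],
               std::plus<float>{});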

Supported Compilers and Hardware


bSIMD includes support for Intel, ARM, AMD and IBM processors:

Architecture                          Extensions
Intel x86/x86_64, Xeon Phi KNC/KNL    SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, FMA3, AVX2
Intel Xeon Phi KNC                    AVX512 (IMCI)
Intel Xeon Phi KNL                    AVX512-F
ARM                                   NEON, AArch64 (ARM64)
IBM Power 6, 7 & 8                    VMX (Altivec), VSX
AMD x86/x86_64                        SSE2, SSE3, SSSE3, SSE4.1, SSE4a, AVX, XOP, FMA4, AVX2

bSIMD requires a C++11 compliant compiler and is thoroughly tested on the following compilers:

Compiler                   Version
g++                        4.8 or above
clang++                    3.5 or above
Microsoft Visual Studio    2015 Update 1 or above

bSIMD requires Boost version 1.60 or newer.