Evaluation of a Neural Network

In this tutorial we will demonstrate how the evaluation of a neural network can be significantly accelerated using bSIMD.

Objectives

In this tutorial we will vectorize the sigmoid activation function of a neural network using bs::transform, and compare the performance of the scalar and SIMD versions when compiled for SSE, AVX and AVX2.

Neural Network Activation Function

The activation function of a neural network is typically a sigmoid function of the form:

\[\sigma(z) = \frac{1}{1+e^{-z}}\]

This is a function that could significantly benefit from vectorization. We vectorize this function using bs::transform as this handles the case where the input size is not an exact multiple of the SIMD vector size.

The following scalar code is used for this calculation:

std::transform(activations.data(), activations.data() + activations.size(), results.data(),
               [](T const& a) { return T(1) / (T(1) + std::exp(-a)); });
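
For reference, a complete scalar version could be assembled as in the following sketch; the element type, container names and fill value are illustrative assumptions rather than the tutorial's actual benchmark code:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main()
{
  using T = float;                                  // assumed element type
  std::vector<T> activations(16000000, T(0.5));     // assumed input data
  std::vector<T> results(activations.size());

  // Scalar evaluation of the sigmoid 1 / (1 + exp(-z)) for every activation
  std::transform(activations.data(), activations.data() + activations.size(), results.data(),
                 [](T const& a) { return T(1) / (T(1) + std::exp(-a)); });

  std::cout << results.front() << '\n';             // keep the result observable
}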

Vectorization of Neural Network Activation Function

The above code is vectorized as follows:

bs::transform(activations.data(), activations.data() + activations.size(), results.data(),
              activation_function{});

The actual calculation is performed in the following functor:

struct activation_function
{
  template <typename T>
  BOOST_FORCEINLINE T operator()(T const& a) const
  {
    return T(1) / (T(1) + bs::exp(-a));
  }
};

A functor must be used for bs::transform because C++11 does not support generic lambda functions. If you are using a C++14 compiler, you may use a generic lambda instead.
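
For example, a C++14 version of the vectorized call might look like the following sketch, assuming the same activations and results containers and that <type_traits> is available for std::decay_t:

bs::transform(activations.data(), activations.data() + activations.size(), results.data(),
              [](auto const& a) {
                using V = std::decay_t<decltype(a)>; // either a scalar T or a bs::pack of T
                return V(1) / (V(1) + bs::exp(-a));
              });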

This calculation may also be performed using bs::rec, which computes the reciprocal

\[\frac{1}{x}\]

struct activation_function_rec
{
  template <typename T>
  BOOST_FORCEINLINE T operator()(T const& a) const
  {
    return bs::rec(T(1) + bs::exp(-a));
  }
};
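
This functor is passed to bs::transform in exactly the same way as before:

bs::transform(activations.data(), activations.data() + activations.size(), results.data(),
              activation_function_rec{});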

The scalar computation may also be performed using bs::exp as follows:

std::transform(activations.data(), activations.data() + activations.size(), results.data(),
               [](T const& a) { return T(1) / (T(1) + bs::exp(-a)); });

Performance

Each version of the code was run using a sample size of 16,000,000. The code was compiled with g++ 6.0 using the compiler flag -msse2 and executed on an Intel Xeon CPU E3-1240 v3 @ 3.40GHz.
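
Measurements of this kind can be reproduced with a small std::chrono harness such as the following sketch; it is illustrative only and is not the exact harness used to produce the numbers below:

#include <chrono>

// Returns the wall-clock time, in microseconds, taken by a single call to f().
template <typename F>
double time_us(F&& f)
{
  auto t0 = std::chrono::high_resolution_clock::now();
  f();
  auto t1 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Example: time_us([&]{ bs::transform(activations.data(),
//                                     activations.data() + activations.size(),
//                                     results.data(), activation_function{}); });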

Calculation         Time (\(\mu s\))
Scalar              143
Scalar - bs::exp    102
SIMD                 38
SIMD rec             38

There are some very interesting results to observe here. Firstly, when std::exp is replaced by bs::exp, a speed-up of roughly 1.4 is obtained even though the code is not vectorized. This is due to the implementation of bs::exp: it is much more efficient than the standard library exp, whilst maintaining the same or better precision. The bSIMD standard library replacement functions can therefore be very advantageous even in non-vectorized code. A speed-up of 3.76 is observed between the scalar and SIMD versions of this calculation, which is in line with the theoretical maximum of 4 for single-precision SSE code.

This test was repeated, compiling for AVX:

Calculation         Time (\(\mu s\))
Scalar              1058
Scalar - bs::exp      99
SIMD                  36
SIMD rec              36

When the same code is re-compiled for AVX, we note a large regression in the performance of std::exp, while the time taken by the other computations remains unchanged. The regression is explained by the fact that this code mixes legacy SSE instructions with AVX instructions, the cost of which is documented by Intel. It may be rectified by calling the intrinsic _mm256_zeroupper() before each call to std::exp, as sketched below; however, this zeroes the upper halves of the YMM registers and therefore destroys any AVX data you may be holding in them. The much safer solution is to replace all calls to std::exp with calls to bs::exp when your code is compiled for AVX and above. The same applies to all standard library functions for which bSIMD provides a replacement.
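
As an illustration only, the workaround described above might look like the following sketch; using bs::exp remains the safer and recommended fix:

#include <immintrin.h> // _mm256_zeroupper

std::transform(activations.data(), activations.data() + activations.size(), results.data(),
               [](T const& a) {
                 _mm256_zeroupper(); // zero the upper halves of the YMM registers before
                                     // calling into the SSE-compiled standard library
                 return T(1) / (T(1) + std::exp(-a));
               });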

There is no performance gain for the vectorized functions when moving from SSE4.2 to AVX. Although all of the calculations in this tutorial are performed on floating-point numbers, the calculation of bs::exp requires integer operations, and AVX does not provide 256-bit integer instructions, so parts of the exponential calculation are still performed using 128-bit SSE instructions. We therefore expect to see a performance gain between AVX and AVX2, which adds 256-bit integer support:

Calculation         Time (\(\mu s\))
Scalar              1063
Scalar - bs::exp     100
SIMD                  21
SIMD rec              21

We observe a speed-up of roughly 1.7 between AVX and AVX2 in this calculation. Although the theoretical maximum speed-up is 2, it is often difficult to achieve in practice.

Conclusions

We observed significant speed-ups by vectorizing this code using bSIMD. Using SSE4.2, a speed-up of 3.76 was observed, and using AVX2, a speed-up of 6.81 was observed. If we compare the result obtained using std::exp compiled for AVX2 with that obtained using bSIMD, the speed-up is 50.6! It is clear that the use of bSIMD in any project involving vectorization is very beneficial, not just for the ease of vectorization and the portability between architectures, compilers and operating systems, but also because of the performance of its standard library replacement functions.