Julia Set

Table of Contents

In this tutorial we will calculate a Julia set using bSIMD. This example is interesting because the number of iterations required for each pixel of the image is different. This means that we need a method of exiting the calculation loop when all of the pixels currently being worked on have converged.

Objectives


In this tutorial we will:

Julia Set

The calculation of the Julia set is a example of a problem which is completely compute bound, therefore we expect to observe significant speed-up using bSIMD.

Here is the scalar version:

void evaluate_scalar()
{
for (int i = 0; i < size; ++i) {
T x0 = T(i) / T(size) * x_range + x_min;
for (int j = 0; j < size; ++j) {
int iteration = 0;
T y0 = T(j) / T(size) * y_range + y_min;
T x = 0;
T y = 0;
T x2 = x * x;
T y2 = y * y;
while (x2 + y2 < 4 && iteration < max_iter) {
x2 = x * x;
y2 = y * y;
T x_temp = x2 - y2 + x0;
y = 2 * x * y + y0;
x = x_temp;
++iteration;
}
iterations[j + i * size] = iteration;
}
}
}

This code is vectorized as follows:

void evaluate_simd()
{
pack_t step =
bs::enumerate<pack_t>(0); // produce a vector containing {0, 1, ..., pack_t::static_size-1}
for (int i = 0; i < size; ++i) {
pack_t x0{T(i) / T(size) * x_range + x_min};
pack_t fac{y_range / T(size)};
pack_t y_min_t{y_min};
for (int j = 0; j < size; j += pack_t::static_size) {
int iteration = 0;
pack_t y0 = bs::fma(step + j, fac, y_min_t);
pack_t x{0};
pack_t y{0};
pack_i iter{0};
pack_l mask;
do {
pack_t x2 = bs::sqr(x);
pack_t y2 = bs::sqr(y);
y = bs::fma(x + x, y, y0);
x = x2 - y2 + x0;
mask = x2 + y2 < 4;
++iteration;
iter = bs::if_inc(mask, iter);
} while (bs::any(mask) && iteration < max_iter);
bs::aligned_store(iter, &iterations[j + i * size]);
}
}
}

The function of interest here is bs::if_inc, which increments each element of the iter vector which has not yet converged. This allows us to continue our calculation on the relevant elements.

We have also used the function bs::any returns a boolean value if any of its parameter element is non zero, We have also used the bs::sqr function which squares its argument and the bs::fma function (fused multiply add) which can accelerate and increase accuracy "a*b+c" computations on some architectures.

Performance

This code was run using an image size of 1024 pixels square. The code was compiled using g++-6.0 for both SSE4.2 and AVX2 and executed on an Intel Xeon CPU E3-1240 v3 @ 3.40GHz.

Loop Time ( \(\mu s\)) Speed-up
Scalar 768
SIMD SSE4.2 210 x3.65
SIMD AVX2 119 x6.45