N-Body Problem

In this tutorial we will show you how to optimize an n-body simulation using bSIMD,

Objectives

In this tutorial we will:

N-Body Problem

This problem is concerned with the calculation of the individual motion of celestial objects caused by gravity. This is an example of a problem of $$N^2$$ complexity. The formula to calculate the gravitational force between two objects is:

$F=\frac{Gm_im_j(q_i - q_j)}{{\|q_i-q_j\|}^3}$

Where: $$\mathbf{q_i}$$ is the position vector of object $$i$$ $$\mathbf{m_i}$$ is the mass of object $$i$$ $$G$$ is the gravitational constant

Calculation

The scalar calculation of this problem is done as follows:

void nbody_step_scalar(particles& ps)
{
T epsilon = 0.00125f;
for (std::size_t i = 0; i < ps.size(); ++i) {
T ax{0};
T ay{0};
T az{0};
T pix = ps.x[i];
T piy = ps.y[i];
T piz = ps.z[i];
T pim = ps.m[i];
for (std::size_t j = i + 1; j < ps.size(); ++j) {
T pjx = ps.x[j];
T pjy = ps.y[j];
T pjz = ps.z[j];
T dx = pjx - pix;
T dy = pjy - piy;
T dz = pjz - piz;
T inorm = 6.667408e-11 / std::sqrt(dx * dx + dy * dy + dz * dz + epsilon);
T fi = ps.m[j] * inorm * inorm * inorm;
T fj = pim * inorm * inorm * inorm;
ax += dx * fi;
ay += dy * fi;
az += dz * fi;
ps.vx[j] -= dx * fj;
ps.vy[j] -= dy * fj;
ps.vz[j] -= dz * fj;
}
T pivx = ps.vx[i];
T pivy = ps.vy[i];
T pivz = ps.vz[i];
pivx += ax;
pivy += ay;
pivz += az;
pix += pivx;
piy += pivy;
piz += pivz;
ps.vx[i] = pivx;
ps.vy[i] = pivy;
ps.vz[i] = pivz;
ps.x[i] = pix;
ps.y[i] = piy;
ps.z[i] = piz;
}
}

Note the use of a small constant, $$\epsilon$$. This is referred to as softening, a numerical trick to prevent division by zero if two particles are too close together.

This calculation is trivially vectorizable as follows:

void nbody_step(particles& ps)
{
pack_t grav_con{6.67408e-11};
pack_t epsilon{0.00125f};
for (std::size_t i = 0; i < ps.size(); i += pack_t::static_size) {
pack_t ax{0};
pack_t ay{0};
pack_t az{0};
pack_t pix{&ps.x[i]};
pack_t piy{&ps.y[i]};
pack_t piz{&ps.z[i]};
pack_t pim{&ps.m[i]};
for (std::size_t j = i + 1; j < ps.size(); ++j) {
auto dx = pjx - pix;
auto dy = pjy - piy;
auto dz = pjz - piz;
auto inorm = grav_con / bs::sqrt(dx * dx + dy * dy + dz * dz + epsilon);
auto inorm3 = inorm * inorm * inorm;
auto fi = bs::load<pack_t>(&ps.m[j]) * inorm3;
auto fj = pim * inorm3;
ax += dx * fi;
ay += dy * fi;
az += dz * fi;
pjvx -= dx * fj;
bs::store(pjvx, &ps.vx[j]);
pjvy -= dy * fj;
bs::store(pjvy, &ps.vy[j]);
pjvz -= dz * fj;
bs::store(pjvz, &ps.vz[j]);
}
pack_t pivx{&ps.vx[i]};
pack_t pivy{&ps.vy[i]};
pack_t pivz{&ps.vz[i]};
pivx += ax;
pivy += ay;
pivz += az;
pix += pivx;
piy += pivy;
piz += pivz;
bs::aligned_store(pivx, &ps.vx[i]);
bs::aligned_store(pivy, &ps.vy[i]);
bs::aligned_store(pivz, &ps.vz[i]);
bs::aligned_store(pix, &ps.x[i]);
bs::aligned_store(piy, &ps.y[i]);
bs::aligned_store(piz, &ps.z[i]);
}
}
Note
All constants are loaded into SIMD vectors outside of the main loop, so that these vectors are not generated each iteration of the calculation.

It is not possible to use aligned stores in the interior loop as these stores will not be aligned for each store.

Performance

This code as compiled using g++6.0 for sse4.2 and avx2. It was executed on an Intel Xeon CPU E3-1240 v3 @ 3.40GHz. The following results were obtained:

Loop Time ( $$\mu s$$) Speed-up
Scalar 13185
SIMD SSE4.2 4104 x3.21
SIMD AVX2 2321 x5.68

The performance obtained is broadly in line with the theoretical maximum possible speed-ups of x4 for SSE4.2 and x8 for AVX2. There are very few problems where these theoretical speed-ups are obtainable.