Application Benchmarks with Arch-R

The following benchmarks were run on Numscale's physical hardware, including Intel, ARM, AMD and IBM machines. Results for more architectures will follow. For further information or specific analysis requests, please contact us.

Descriptive Statistics Using Arch-R

Descriptive statistics are mathematical functions that give a quantitative summary of a data set's properties, including measures of central tendency such as the mean or the median, and measures of variability such as the variance, kurtosis, skewness and extrema. Arch-R provides a sensible selection of such functions.

Sum, mean and related measures

The first family of measures is the central tendency measures. These functions include the sum, the mean, the weighted mean, the sum of absolute values (asum) and the sum of squared values (asum2). Arch-R provides all of them for both single- and double-precision data sets. The benchmarks were run on arrays of 2048 32-bit floating-point values and compared to a C++ implementation using the Standard Library.
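For reference, the five measures above can be sketched in scalar C++ with the Standard Library. This is only a minimal sketch of what such a baseline might look like, not the code used in the benchmarks and not the Arch-R API; the `ref` namespace and the function signatures are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical scalar reference functions. The names mirror the measures in
// the text; they are NOT the Arch-R API.
namespace ref {

float sum(const std::vector<float>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0f);
}

float mean(const std::vector<float>& x) {
    assert(!x.empty());
    return sum(x) / static_cast<float>(x.size());
}

// Weighted mean: sum(w[i] * x[i]) / sum(w[i]).
float weighted_mean(const std::vector<float>& x, const std::vector<float>& w) {
    assert(!x.empty() && x.size() == w.size());
    const float weighted = std::inner_product(x.begin(), x.end(), w.begin(), 0.0f);
    return weighted / sum(w);
}

// asum: sum of absolute values.
float asum(const std::vector<float>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0f,
                           [](float acc, float v) { return acc + std::fabs(v); });
}

// asum2: sum of squared values.
float asum2(const std::vector<float>& x) {
    return std::inner_product(x.begin(), x.end(), x.begin(), 0.0f);
}

}  // namespace ref
```

Each function is a single pass over contiguous data, which is the kind of loop that maps naturally onto SIMD registers.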

[Figure: sum benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: mean benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: weighted mean benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: asum benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: asum2 benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

Interaction with invalid values

In some cases, the data being analyzed is corrupted or invalid: a sensor may fail, or a person may fill in a questionnaire incorrectly. A common practice is then to use the NaN value to mark erroneous or missing data. Functions cannot be run directly on such data, as the results would be invalid. Arch-R provides functions both to count the invalid values (invalid_count) and to compute central tendency measures while ignoring them (filtered_mean). The Arch-R implementations of these functions were benchmarked on arrays of 2048 32-bit floating-point values and compared to a C++ implementation using the Standard Library.
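A scalar sketch of the two operations, assuming that "invalid" means NaN as described above. The function names are taken from the text, but the signatures and the choice to return NaN when no valid value remains are assumptions for illustration, not the Arch-R API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical scalar reference functions, NOT the Arch-R API.
namespace ref {

// Number of NaN entries in the data set.
std::size_t invalid_count(const std::vector<float>& x) {
    return static_cast<std::size_t>(
        std::count_if(x.begin(), x.end(), [](float v) { return std::isnan(v); }));
}

// Mean of the non-NaN entries. Returns NaN when every entry is invalid
// (an assumption of this sketch).
float filtered_mean(const std::vector<float>& x) {
    float total = 0.0f;
    std::size_t valid = 0;
    for (float v : x) {
        if (!std::isnan(v)) {
            total += v;
            ++valid;
        }
    }
    return valid != 0 ? total / static_cast<float>(valid)
                      : std::numeric_limits<float>::quiet_NaN();
}

}  // namespace ref
```

Without the filter, a single NaN would propagate through the running sum and make the whole result NaN.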

[Figure: invalid count benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: filtered mean benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

Variance and associated measures

The variance measures how far a set of values is spread out from its average value. Along with the standard deviation, it is used in various applications such as statistical inference or Monte Carlo sampling. The Arch-R implementations of these functions were benchmarked on arrays of 2048 32-bit floating-point values and compared to a C++ implementation using the Standard Library.
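As an illustration, a two-pass scalar computation of the variance and standard deviation might look like the sketch below. It uses the population form (dividing by n); whether Arch-R divides by n or by n - 1 is not stated here, so that choice, like the function names, is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical scalar reference functions, NOT the Arch-R API.
namespace ref {

// Population variance: mean of the squared deviations from the mean.
// First pass computes the mean, second pass the squared deviations.
float variance(const std::vector<float>& x) {
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    const float mu = std::accumulate(x.begin(), x.end(), 0.0f) / n;
    const float squares = std::accumulate(
        x.begin(), x.end(), 0.0f,
        [mu](float acc, float v) { return acc + (v - mu) * (v - mu); });
    return squares / n;
}

// Standard deviation: square root of the variance.
float standard_deviation(const std::vector<float>& x) {
    return std::sqrt(variance(x));
}

}  // namespace ref
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the variance is 4 and the standard deviation is 2.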

[Figure: standard deviation benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

[Figure: variance benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

Skewness and associated measures

Skewness is a measure of the asymmetry of the distribution of values around their mean. Two different measures of skewness are commonly used, Pearson's and the sample version, and Arch-R provides both. The Arch-R implementations of these functions were benchmarked on arrays of 2048 32-bit floating-point values and compared to a C++ implementation using the Standard Library.
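The sketch below shows one common pair of definitions: the moment coefficient of skewness g1 = m3 / m2^(3/2), and the sample-adjusted form G1 = sqrt(n(n-1)) / (n-2) * g1. Whether these are exactly the two variants Arch-R implements is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical scalar reference functions, NOT the Arch-R API.
namespace ref {

// k-th central moment: (1/n) * sum((x[i] - mean)^k), accumulated in double.
double central_moment(const std::vector<float>& x, int k) {
    assert(!x.empty());
    const double n = static_cast<double>(x.size());
    const double mu = std::accumulate(x.begin(), x.end(), 0.0) / n;
    double acc = 0.0;
    for (float v : x) {
        acc += std::pow(v - mu, k);
    }
    return acc / n;
}

// Moment coefficient of skewness: g1 = m3 / m2^(3/2).
double skewness(const std::vector<float>& x) {
    return central_moment(x, 3) / std::pow(central_moment(x, 2), 1.5);
}

// Sample-adjusted skewness: G1 = sqrt(n(n-1)) / (n-2) * g1, for n >= 3.
double sample_skewness(const std::vector<float>& x) {
    assert(x.size() >= 3);
    const double n = static_cast<double>(x.size());
    return std::sqrt(n * (n - 1.0)) / (n - 2.0) * skewness(x);
}

}  // namespace ref
```

A symmetric data set such as 1, 2, 3, 4, 5 has zero skewness, while 1, 1, 1, 5 has a long right tail and a positive skewness.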

[Figure: skewness benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

Kurtosis and associated measures

Kurtosis is the last family of dispersion measures provided by Arch-R, in both its classic form and its excess version. The Arch-R implementations of these functions were benchmarked on arrays of 2048 32-bit floating-point values and compared to a C++ implementation using the Standard Library.

[Figure: kurtosis & excess kurtosis benchmark, with panels for ARM aarch64 (NEON), Intel x86_64 (SSE 4.2, AVX 2) and IBM Power8 (VSX)]

Fast Fourier Transform Using Arch-R

Numscale specialises in portable, high-performance software. Many of our clients require an FFT with the best possible performance on x86_64, ARM and PowerPC. In this benchmark, the performance of a single-precision 2D Fourier transform of size 2048x2048 using Arch-R is compared to the well-known FFTW library. FFTW is distributed under the highly restrictive GPL licence.
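To make the benchmarked operation concrete, here is a minimal scalar sketch of a single-precision 2D transform built from 1D transforms along rows and then columns. It uses a textbook radix-2 Cooley-Tukey FFT and assumes power-of-two sizes. It is neither the Arch-R nor the FFTW implementation, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative reference code only, NOT the Arch-R or FFTW implementation.
namespace ref {

using cfloat = std::complex<float>;

// In-place forward radix-2 FFT; the length must be a power of two.
void fft1d(std::vector<cfloat>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    // Combine sub-transforms of length len/2 into transforms of length len.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::size_t half = len / 2;
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < half; ++k) {
                const double angle = -2.0 * pi * static_cast<double>(k) /
                                     static_cast<double>(len);
                const cfloat w(static_cast<float>(std::cos(angle)),
                               static_cast<float>(std::sin(angle)));
                const cfloat u = a[start + k];
                const cfloat v = a[start + k + half] * w;
                a[start + k] = u + v;
                a[start + k + half] = u - v;
            }
        }
    }
}

// Forward 2D FFT of a row-major rows x cols array: rows first, then columns.
void fft2d(std::vector<cfloat>& img, std::size_t rows, std::size_t cols) {
    assert(img.size() == rows * cols);
    std::vector<cfloat> row(cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) row[c] = img[r * cols + c];
        fft1d(row);
        for (std::size_t c = 0; c < cols; ++c) img[r * cols + c] = row[c];
    }
    std::vector<cfloat> col(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) col[r] = img[r * cols + c];
        fft1d(col);
        for (std::size_t r = 0; r < rows; ++r) img[r * cols + c] = col[r];
    }
}

}  // namespace ref
```

The column pass walks memory with a stride of `cols` elements, which is one reason memory access patterns matter so much for 2D FFT performance.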

[Figure: FFT benchmark, with panels for ARM aarch64 (NEON), IBM Power8 (VSX) and Intel AVX (AVX)]

Arch-R's 2D FFT is almost 50% faster than FFTW on Intel x86, ARM and PowerPC, without sacrificing any numerical precision!

[Figure: FFT benchmark, with panels for ARM aarch64 (NEON), IBM Power8 (VSX) and Intel AVX (AVX)]

We are faster in 1D too! Arch-R FFT was used to compute a single-precision 1D FFT of size 2048, and this was benchmarked against FFTW on ARM, PowerPC and Intel x86_64. Arch-R FFT is again faster than FFTW on all architectures! Arch-R FFT is lightweight, easy to use and easy to integrate into your existing projects, while still being high-performance and cross-platform. It's time to use Arch-R in all of your critical projects!

Image Morphology using Arch-R

In the following benchmarks, several common image processing operations are implemented in Arch-R using bSIMD so that the maximum performance possible is realised across all architectures. These are benchmarked against OpenCV 3.1.0 on ARM and x86.

Binary image morphology is widely used in image processing and computer vision to locate objects of a certain size or to eliminate noise of a certain form. Image morphology can be very expensive, as the value of each pixel is computed as a function of its neighbouring pixels. In the following benchmarks, we apply various structuring elements to input images and compare the time taken by a version implemented using Arch-R against OpenCV 3.1.0, the most popular image processing library, on ARM and x86. In each benchmark, OpenCV is compiled with all optimizations enabled to ensure an accurate comparison of the performance of Arch-R and OpenCV. As the image morphology code is implemented using Arch-R, the exact same code is used on both ARM and x86.

In the first test, we compare the time taken to perform a "closing" operation, which consists of a dilation followed by an erosion with the same structuring element. In this test, a circular structuring element of radius one is used on an 8-bit image of size 2048x2048.
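The closing operation can be sketched in scalar C++ as a maximum filter (dilation) followed by a minimum filter (erosion) over the same set of offsets. This is a naive reference for illustration only, not the Arch-R or OpenCV code. The `Image` type, the function names and the choice to ignore neighbours that fall outside the image are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Naive illustrative reference, NOT the Arch-R or OpenCV implementation.
namespace ref {

struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // row-major, one byte per pixel
};

using Offsets = std::vector<std::pair<int, int>>;

// Offsets (dx, dy) of a disk: all points with dx^2 + dy^2 <= radius^2.
Offsets disk(int radius) {
    Offsets offsets;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
            if (dx * dx + dy * dy <= radius * radius) offsets.emplace_back(dx, dy);
    return offsets;
}

// Max filter (take_max = true) or min filter over the structuring element.
// Neighbours outside the image are ignored (a choice made for this sketch).
Image morph(const Image& in, const Offsets& se, bool take_max) {
    Image out{in.width, in.height, std::vector<std::uint8_t>(in.pixels.size())};
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            std::uint8_t best = take_max ? 0 : 255;
            for (const auto& o : se) {
                const int nx = x + o.first;
                const int ny = y + o.second;
                if (nx < 0 || ny < 0 || nx >= in.width || ny >= in.height) continue;
                const std::uint8_t v =
                    in.pixels[static_cast<std::size_t>(ny * in.width + nx)];
                best = take_max ? std::max(best, v) : std::min(best, v);
            }
            out.pixels[static_cast<std::size_t>(y * in.width + x)] = best;
        }
    }
    return out;
}

Image dilate(const Image& in, const Offsets& se) { return morph(in, se, true); }
Image erode(const Image& in, const Offsets& se) { return morph(in, se, false); }

// Closing: dilation followed by erosion with the same structuring element.
Image closing(const Image& in, const Offsets& se) {
    return erode(dilate(in, se), se);
}

}  // namespace ref
```

With `disk(1)`, which contains the centre pixel and its four direct neighbours, a closing fills a one-pixel hole in a bright region and leaves an isolated bright pixel unchanged. The cost per pixel grows with the number of offsets, which is why larger structuring elements are so much more expensive in a naive implementation.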

[Figure: Closing benchmark, with panels for ARM aarch64 (NEON) and Intel x86_64 (SSE4.2, AVX2)]

These results show that interesting speed-ups are obtained on both ARM and x86. They also show that OpenCV's image morphology functionality is highly optimized on x86. The same optimizations are clearly not implemented on ARM, which explains why Arch-R is so much more efficient there. The speed-up of the Arch-R implementation is due to the use of vector intrinsics and efficient memory accesses. Versus a scalar implementation, a speed-up of 16 is expected for an 8-bit image on SSE and NEON, and a speed-up of 32 on AVX2. However, the speed-ups actually observed lead us to conclude that OpenCV uses Intel intrinsics in its x86 version. Nevertheless, the Arch-R version of this algorithm is still significantly quicker.
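The expected factors of 16 and 32 follow from the register width divided by the element width: SSE and NEON registers are 128 bits wide and AVX2 registers are 256 bits wide. The arithmetic can be written as a small compile-time check; the `lanes` helper is a hypothetical name used only for this illustration, and the sketch assumes 8-bit bytes.

```cpp
#include <cstddef>
#include <cstdint>

namespace ref {

// Number of elements of type T that fit in one SIMD register of the given
// width in bits (hypothetical helper, assumes 8-bit bytes).
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return register_bits / (8 * sizeof(T));
}

static_assert(lanes<std::uint8_t>(128) == 16, "SSE and NEON: 16 8-bit pixels");
static_assert(lanes<std::uint8_t>(256) == 32, "AVX2: 32 8-bit pixels");

}  // namespace ref
```

These figures are upper bounds for a compute-bound loop; memory bandwidth and loads that cross register boundaries usually keep measured speed-ups below them.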

In the second test, we compare a binary dilation performed using a circular structuring element of radius 9 implemented using Arch-R against OpenCV 3.1.0, again on ARM and x86.

[Figure: DilationDisk benchmark, with panels for ARM aarch64 (NEON) and Intel x86_64 (SSE 4.2, AVX 2)]

In this test, we observe very impressive speed-ups for Arch-R versus OpenCV. These speed-ups are again explained by the use of SIMD instructions via bSIMD, combined with efficient memory accesses and a highly optimized algorithm. As the structuring element in this example is significantly larger than that in the previous example, the effect is multiplied. Again, we note that the Arch-R code in this example is identical on both ARM and x86.

In the final image morphology test, we compare an erosion performed using a square structuring element of size 25.
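A square structuring element has a useful property: a 2D minimum filter over a square window equals a horizontal 1D minimum filter followed by a vertical one. The scalar sketch below illustrates an erosion written that way. Whether Arch-R or OpenCV exploits this property is not stated here, and the function names and border handling (out-of-image neighbours are ignored) are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive illustrative reference, NOT the Arch-R or OpenCV implementation.
namespace ref {

// 1D minimum filter with an odd window size, applied along rows
// (horizontal = true) or along columns (horizontal = false).
std::vector<std::uint8_t> min_filter_1d(const std::vector<std::uint8_t>& in,
                                        int width, int height, int size,
                                        bool horizontal) {
    assert(size % 2 == 1);
    assert(in.size() == static_cast<std::size_t>(width * height));
    const int half = size / 2;
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint8_t best = 255;
            for (int k = -half; k <= half; ++k) {
                const int nx = horizontal ? x + k : x;
                const int ny = horizontal ? y : y + k;
                if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                best = std::min(best, in[static_cast<std::size_t>(ny * width + nx)]);
            }
            out[static_cast<std::size_t>(y * width + x)] = best;
        }
    }
    return out;
}

// Erosion by a size x size square: horizontal pass, then vertical pass.
std::vector<std::uint8_t> erode_square(const std::vector<std::uint8_t>& in,
                                       int width, int height, int size) {
    const auto rows_done = min_filter_1d(in, width, height, size, true);
    return min_filter_1d(rows_done, width, height, size, false);
}

}  // namespace ref
```

For a window of size 25, the two 1D passes read 50 neighbours per pixel instead of the 625 that a direct 2D loop would read.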

[Figure: DilationSquare benchmark, with panels for ARM aarch64 (NEON) and Intel x86_64 (SSE 4.2, AVX 2)]

Again, Arch-R performs extremely well on all architectures.