ARM NEON Compositor  master
Fast SIMD alpha overlay and blending for ARM
examples/perf_test/perf_test.py

Performance tests

This example measures the performance of the overlay_alpha* functions on random images of different sizes.
The alpha-lib library is loaded dynamically from Python using ctypes.

SIMD intrinsics

The following two graphs show the results of four experiments comparing the performance of overlaying one image onto another, using GCC's -O3 optimization level on the one hand, and using hand-crafted NEON intrinsics on the other hand. Especially for small images, the NEON version is much faster. For larger images, memory throughput and caching effects start to become more important factors than raw processing power, but the NEON version is still significantly faster than the version without intrinsics.

Rounding methods

The performance difference between the scaling and rounding methods is negligible. As expected, an exact rounding division by 255 is slowest. An approximation is slightly faster, because it eliminates the vector load instruction that loads the rounding constant. An exact flooring division by 255 is a tiny bit faster still.
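The division-by-255 variants can be illustrated in plain Python. This is a minimal sketch, not the library's NEON code: the function names are hypothetical, and the shift-based formulas are well-known integer identities that are exact for intermediate values between 0 and 255 * 255. Whether alpha-lib uses exactly these formulas is an assumption. The rounding version needs the constant 128, which is the kind of rounding constant mentioned above.

```python
# Largest intermediate value: fg * alpha + bg * (255 - alpha) with 8-bit inputs
MAX_PRODUCT = 255 * 255


def div255_round_shift(x: int) -> int:
    """Exact round(x / 255) for 0 <= x <= 255 * 255, using adds and shifts."""
    t = x + 128  # rounding constant
    return (t + (t >> 8)) >> 8


def div255_floor_shift(x: int) -> int:
    """Exact floor(x / 255) for 0 <= x <= 255 * 255, using adds and shifts."""
    return (x + 1 + (x >> 8)) >> 8


def check_exactness() -> bool:
    """Compare both identities against true integer division exhaustively."""
    for x in range(MAX_PRODUCT + 1):
        # (x + 127) // 255 is round-to-nearest; 255 is odd, so there are no ties
        if div255_round_shift(x) != (x + 127) // 255:
            return False
        if div255_floor_shift(x) != x // 255:
            return False
    return True
```

Calling `check_exactness()` walks over every possible intermediate value, so it doubles as a quick sanity test when experimenting with other approximations.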

The fastest option is to divide by 256 instead of 255, as both the rounding and flooring divisions by powers of two can be implemented using a single bit shift instruction.
This does result in a small error in the output image. Most notably, combining two white pixels with color values 0xFF will result in a slightly less white pixel, with color value 0xFE.
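The white-pixel error can be reproduced with a few lines of integer arithmetic. The sketch below assumes the usual per-channel blend `fg * alpha + bg * (255 - alpha)` followed by a rescaling step; the function names are hypothetical and only mirror the names of the rescaling options.

```python
def div255_round(x: int) -> int:
    """Exact rounding division by 255."""
    return (x + 127) // 255


def div256_round(x: int) -> int:
    """Rounding division by 256: add half, then a single shift."""
    return (x + 128) >> 8


def div256_floor(x: int) -> int:
    """Flooring division by 256: a single shift."""
    return x >> 8


def blend_channel(fg: int, bg: int, alpha: int, rescale) -> int:
    """Blend one 8-bit channel (assumed formula, see the lead-in)."""
    return rescale(fg * alpha + bg * (255 - alpha))


# Two white channels: the weighted sum is 255 * 255 = 65025 for any alpha
white = 0xFF
exact = blend_channel(white, white, 128, div255_round)
approx_round = blend_channel(white, white, 128, div256_round)
approx_floor = blend_channel(white, white, 128, div256_floor)
print(hex(exact), hex(approx_round), hex(approx_floor))
```

With division by 255 the result stays at 0xFF, while both divisions by 256 give 0xFE, because 65025 / 256 is just above 254.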

This graph also clearly shows the slightly better performance when the image size is a multiple of eight. The reason is the size of the NEON registers: 128 bits, which is four 32-bit words, or eight 16-bit integers. When the number of columns of the foreground image is not a multiple of eight, extra code is needed to process the last pixels of each row, resulting in lower performance.
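The split between full eight-element groups and leftover elements can be sketched as follows. This is only an illustration of the idea in plain Python, with hypothetical names; how the library actually handles the remaining pixels of a row is not described here beyond the fact that extra code is needed.

```python
LANES = 8  # eight 16-bit lanes fit in one 128-bit register


def process_row(values, op):
    """Apply op to a row in groups of LANES, then handle the leftover tail.

    Returns the processed row, the number of full groups, and the tail length.
    """
    n_full = len(values) - len(values) % LANES
    out = []
    # Main loop: each iteration stands in for one full-width vector operation
    for start in range(0, n_full, LANES):
        out.extend(op(v) for v in values[start:start + LANES])
    # Tail: fewer than LANES elements remain and need separate handling
    tail = values[n_full:]
    out.extend(op(v) for v in tail)
    return out, n_full // LANES, len(tail)
```

A row of 16 elements needs two full groups and no tail, whereas a row of 20 elements needs two full groups plus a tail of four.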

#!/usr/bin/env python3.8

import os.path as path
import cv2
import numpy as np
import ctypes
import time
import platform
import argparse

# Directory of this script; the shared library is loaded from the same folder
script_dir = path.dirname(path.realpath(__file__))

parser = argparse.ArgumentParser(description='Benchmark for overlay_alpha')
parser.add_argument('--no-simd', dest='SIMD', action='store_false',
                    help='Disable SIMD intrinsics')
parser.add_argument('--N', dest='N', type=int, default=25,
                    help='The number of different sizes to test')
parser.add_argument('--min', dest='min_size', type=int, default=10,
                    help='The size in pixels of the smallest image in the test')
parser.add_argument('--max', dest='max_size', type=int, default=2000,
                    help='The size in pixels of the largest image in the test')
parser.add_argument('--it', dest='max_iterations', type=int, default=10,
                    help='The number of test iterations for the largest images')
parser.add_argument('--rescale', dest='rescale', choices=['div255_round',
                    'div255_round_approx', 'div255_floor', 'div256_round',
                    'div256_floor'], default='div255_round',
                    help='The rescaling and rounding method to use')
args = parser.parse_args()
print(args)

# C types
uint8_t_p = ctypes.POINTER(ctypes.c_uint8)
size_t = ctypes.c_size_t
void = None
# Load the function and declare its signature
so = "libalpha-lib-"
so += platform.machine()
if not args.SIMD:
    so += "-no-simd"
so += ".so"
dll = ctypes.cdll.LoadLibrary(path.join(script_dir, so))
overlay_alpha = dll['overlay_alpha_stride_' + args.rescale]
overlay_alpha.argtypes = [
    uint8_t_p,  # bg_img
    uint8_t_p,  # fg_img
    uint8_t_p,  # out_img
    size_t,  # bg_full_cols
    size_t,  # fg_rows
    size_t,  # fg_cols
    size_t,  # fg_full_cols
]
overlay_alpha.restype = void

# The different image sizes (np.int was removed from NumPy, use int instead)
sizes = np.linspace(args.min_size, args.max_size, args.N, dtype=int)
times = np.zeros((args.N, ))

# Run the function for all sizes
for i, size in enumerate(sizes):
    print(i + 1, '/', args.N, ':', size)

    # Generate some random images of the given size
    # (the upper bound of randint is exclusive, so 256 covers 0 to 255)
    bg_img = np.random.randint(256, size=(size, size, 4), dtype=np.uint8)
    fg_img = np.random.randint(256, size=(size, size, 4), dtype=np.uint8)
    out_img = np.zeros((size, size, 4), dtype=np.uint8)
    bg_img_p = bg_img.ctypes.data_as(uint8_t_p)
    fg_img_p = fg_img.ctypes.data_as(uint8_t_p)
    out_img_p = out_img.ctypes.data_as(uint8_t_p)

    # Overlay the random images, do it multiple times for accurate timing
    iterations = int(round(args.max_size * args.max_iterations / size))
    start_time = time.perf_counter()
    for _ in range(iterations):
        overlay_alpha(bg_img_p, fg_img_p, out_img_p, size, size, size, size)
    end_time = time.perf_counter()
    times[i] = (end_time - start_time) / iterations

# Save the results as a CSV file
results = np.column_stack((sizes, times))
simd = (' simd ' if args.SIMD else ' no-simd ')
name = str(time.asctime()) + simd + args.rescale + ' ' + platform.machine()
np.savetxt(name + '.csv', results, delimiter=',')

# Plot the results
import matplotlib.pyplot as plt
plt.plot(sizes, times, '.-')
plt.xlabel('Image size [pixels]')
plt.ylabel('Time [s]')
plt.savefig(name + '.svg')
plt.show()
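The script writes one CSV file per run, with the image size in the first column and the average time per call in the second. To compare a SIMD run against a run with --no-simd, the two files can be combined into a speedup table. The following is a minimal sketch under that assumption; the function name `speedup` is hypothetical, and both runs are assumed to have used the same --N, --min and --max settings.

```python
import numpy as np


def speedup(simd_csv, no_simd_csv):
    """Return an array with columns (size, no-SIMD time / SIMD time).

    Both arguments are paths or file-like objects in the format written
    by np.savetxt(..., delimiter=',') in the benchmark script.
    """
    simd = np.loadtxt(simd_csv, delimiter=',', ndmin=2)
    no_simd = np.loadtxt(no_simd_csv, delimiter=',', ndmin=2)
    # The ratio is only meaningful if both runs tested the same sizes
    if not np.array_equal(simd[:, 0], no_simd[:, 0]):
        raise ValueError('The two runs used different image sizes')
    return np.column_stack((simd[:, 0], no_simd[:, 1] / simd[:, 1]))
```

A value above 1 in the second column means the SIMD build was faster for that image size.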