# C++ Implementation

*Pieter P*

#### Table of Contents *list*

### Dividing by Powers of 2

The factor

Both floating point multiplication and integer division are relatively
expensive operations, especially on embedded devices or microcontrollers.

We can, however, choose the value for

This is useful, because a division by a power of two can be replaced by a
very fast right bitshift:

We can now rewrite the difference equation of the EMA with this optimization
in mind:

#### Negative Numbers

There's one caveat though: this doesn't work for negative numbers.
For example, if we try to calculate the integer division

This means we'll have to be careful not to use this trick on any negative
numbers. In our difference equation, both the input

__Base case__:

__Induction step__:

Since

Therefore, the entire right-hand side is always positive or zero, because it is a sum of two numbers that are themselves greater than or equal to zero.

#### Rounding

A final improvement we can make to our division algorithm is to round the
result to the nearest integer, instead of truncating it towards zero.

Consider the rounded result of the division

### Implementation in C++

We now have everything in place to write a basic implementation of the EMA in C++:

`#pragma once`

`#include <cstdint> // uint8_t, uint16_t`

`#include <type_traits> // std::is_unsigned`

`/// The first Exponential Moving Average implementation for unsigned integers.`

`///`

`/// @note An improved implementation is presented further down the page.`

`template <uint8_t K, class uint_t = uint16_t>`

`class EMA {`

`public:`

`/// Update the filter with the given input and return the filtered output.`

`uint_t operator()(uint_t input) {`

`state += input;`

`uint_t output = (state + half) >> K;`

`state -= output;`

`return output;`

`}`

`static_assert(std::is_unsigned<uint_t>::value,`

`"The `uint_t` type should be an unsigned integer, "`

`"otherwise, the division using bit shifts is invalid.");`

`/// Fixed point representation of one half, used for rounding.`

`constexpr static uint_t half = 1 << (K - 1);`

`private:`

`uint_t state = 0;`

`};`

Note how we save

### Signed Rounding Division

It's possible to implement a signed division using bit shifts as well. The only difference is that we have to subtract 1 from the dividend if it's negative.

On ARM and x86 platforms, the absolute performance difference between the signed and unsigned version is not too big, it requires just a few more instructions. However, on some other architectures, like the AVR architecture used by some Arduino microcontrollers, the division is by far the most expensive step of the EMA algorithm, so a slower signed division might have a significant impact on the overall performance. In theory, it should only take a couple instructions to conditionally subtract 1, based on the sign of the dividend, but this sometimes causes the compiler to refactor the entire division, resulting in a much slower algorithm.

I provided two implementations of the signed division. Notice how on x86 and ARM the second one is faster, while on AVR, the first one is faster. The unsigned division is included for reference.

The code was compiled using the `-O2`

optimization level.

#### Implementation of Signed and Unsigned Division by a Multiple of Two

`constexpr unsigned int K = 3;`

`signed int div_s1(signed int val) {`

`int round = val + (1 << (K - 1));`

`if (val < 0)`

`round -= 1;`

`return round >> K;`

`}`

`signed int div_s2(signed int val) {`

`int neg = val < 0 ? 1 : 0;`

`return (val + (1 << (K - 1)) - neg) >> K;`

`}`

`unsigned int div_u(unsigned int val) {`

`return (val + (1 << (K - 1))) >> K;`

`}`

#### Assembly Generated on x86_64 (GCC 9.2)

`div_s1(int):`

`mov eax, edi`

`not eax`

`shr eax, 31`

`lea eax, [rax+3+rdi]`

`sar eax, 3`

`ret`

`div_s2(int):`

`lea eax, [rdi+4]`

`shr edi, 31`

`sub eax, edi`

`sar eax, 3`

`ret`

`div_u(unsigned int):`

`lea eax, [rdi+4]`

`shr eax, 3`

`ret`

#### Assembly Generated on ARM 64 (GCC 8.2)

`div_s1(int):`

`mvn w1, w0`

`add w0, w0, w1, lsr 31`

`add w0, w0, 3`

`asr w0, w0, 3`

`ret`

`div_s2(int):`

`add w1, w0, 4`

`sub w0, w1, w0, lsr 31`

`asr w0, w0, 3`

`ret`

`div_u(unsigned int):`

`add w0, w0, 4`

`lsr w0, w0, 3`

`ret`

#### Assembly Generated on AVR (GCC 5.3)

`div_s1(int):`

`sbrc r25,7 # Skip if Bit in Register is Cleared: val >= 0`

`rjmp .L2`

`# val >= 0`

`adiw r24,4 # Add Immediate to Word: val + (1 << (K - 1)) = val + 4`

`asr r25 # Arithmetic Shift Right: shift high byte (preserve sign)`

`ror r24 # Rotate Right through Carry: shift low byte`

`asr r25 # Two more times`

`ror r24`

`asr r25`

`ror r24`

`ret`

`.L2:`

`# val < 0`

`adiw r24,3 # Add Immediate to Word: val + (1 << (K - 1)) - 1 = val + 3`

`asr r25 # Arithmetic Shift Right: shift high byte (preserve sign)`

`ror r24 # Rotate Right through Carry: shift low byte`

`asr r25 # Two more times`

`ror r24`

`asr r25`

`ror r24`

`ret`

`div_s2(int):`

`movw r18,r24`

`subi r18,-4 # Subtract immediate: val + (1 << (K - 1)) = val + 4`

`sbci r19,-1 # Subtract Immediate with Carry: (low byte)`

`mov r24,r25`

`rol r24 # Rotate Left through Carry: C flag is now sign bit`

`clr r24 # Clear Register`

`rol r24 # Rotate Left through Carry: original sign bit is now lsb`

`movw r20,r18`

`sub r20,r24 # Subtract without Carry: val + 4 - neg`

`sbc r21,__zero_reg__ # Subtract with Carry: (low byte)`

`movw r24,r20`

`asr r25 # Arithmetic Shift Right: shift high byte (preserve sign)`

`ror r24 # Rotate Right through Carry: shift low byte`

`asr r25 # Two more times`

`ror r24`

`asr r25`

`ror r24`

`ret`

`div_u(unsigned int):`

`adiw r24,4 # Add Immediate to Word: val + (1 << (K - 1)) = val + 4`

`lsr r25 # Logical Shift Right: shift high byte (no sign extension)`

`ror r24 # Rotate Right through Carry: shift low byte`

`lsr r25 # Two more times`

`ror r24`

`lsr r25`

`ror r24`

`ret`

Keep in mind that an `int`

on AVR is only 16 bits wide, whereas
an `int`

on ARM or x86 is 32 bits wide.

If you use 32-bit integers on AVR, the result is even more atrocious.

You can experiment with the different implementations yourself on the Compiler Explorer.

The main takeaway from this section is that signed (rounding) division is more expensive than unsigned division.

#### A better alternative for signed division

Since the EMA is a linear filter, adding a constant offset to the input results in the same output, but with the same offset added to it. This means that we don't have to worry about negative numbers, we can just add a constant offset to the negative inputs, resulting in only positive numbers. At the output of the filter, the offset is simply removed again.

This approach turns out to be significantly more efficient than the signed divisions discussed above. It allows us to use only unsigned rounding divisions, which are very cheap, and just a single extra subtraction to handle signed types. (Yes, just one, it turns out that adding the offset to the input and subtracting it again from the output can be combined.)

The assumption that the EMA is a linear filter is not really valid anymore, because of the rounding and truncation errors introduced by the use of integers in the algorithm. Luckily, the output of the filter turns out to be exactly the same, it doesn't matter if you use true signed rounding division or an unsigned rounding division with offset.

### Improved C++ implementation

The following snippet is an improved version of the previous implementation: it supports both signed and unsigned inputs, allows initialization to a specific value, and has a check to prevent overflow.

`#pragma once`

`#include <type_traits> // std::make_unsigned_t, make_signed_t, is_unsigned`

`#include <limits> // std::numeric_limits`

`#include <cstdint> // uint_fast16_t`

`/**`

`* @brief Exponential moving average filter.`

`*`

`* Fast integer EMA implementation where the weight factor is a power of two.`

`*`

`* Difference equation: @f$ y[n] = \alpha·x[n]+(1-\alpha)·y[n-1] @f$`

`* where @f$ \alpha = \left(\frac{1}{2}\right)^{K} @f$, @f$ x @f$ is the`

`* input sequence, and @f$ y @f$ is the output sequence.`

`*`

`* [An in-depth explanation of the EMA filter](https://tttapa.github.io/Pages/Mathematics/Systems-and-Control-Theory/Digital-filters/Exponential%20Moving%20Average/)`

`*`

`* @tparam K`

`* The amount of bits to shift by. This determines the location`

`* of the pole in the EMA transfer function, and therefore the`

`* cut-off frequency.`

`* The higher this number, the more filtering takes place.`

`* The pole location is @f$ 1 - 2^{-K} @f$.`

`* @tparam input_t`

`* The integer type to use for the input and output of the filter.`

`* Can be signed or unsigned.`

`* @tparam state_t`

`* The unsigned integer type to use for the internal state of the`

`* filter. A fixed-point representation with @f$ K @f$ fractional`

`* bits is used, so this type should be at least @f$ M + K @f$ bits`

`* wide, where @f$ M @f$ is the maximum number of bits of the input.`

`*`

`* Some examples of different combinations of template parameters:`

`*`

`* 1. Filtering the result of `analogRead`: analogRead returns an integer`

`* between 0 and 1023, which can be represented using 10 bits, so`

`* @f$ M = 10 @f$. If `input_t` and `output_t` are both `uint16_t`,`

`* the maximum shift factor `K` is @f$ 16 - M = 6 @f$. If `state_t``

`* is increased to `uint32_t`, the maximum shift factor `K` is`

`* @f$ 32 - M = 22 @f$.`

`* 2. Filtering a signed integer between -32768 and 32767: this can be`

`* represented using a 16-bit signed integer, so `input_t` is `int16_t`,`

`* and @f$ M = 16 @f$. (2¹⁵ = 32768)`

`* Let's say the shift factor `K` is 1, then the minimum width of`

`* `state_t` should be @f$ M + K = 17 @f$ bits, so `uint32_t` would be`

`* a sensible choice.`

`*/`

`template <uint8_t K,`

`class input_t = uint_fast16_t,`

`class state_t = std::make_unsigned_t<input_t>>`

`class EMA {`

`public:`

`/// Constructor: initialize filter to zero or optional given value.`

`EMA(input_t initial = input_t(0))`

`: state(zero + (state_t(initial) << K) - initial) {}`

`/// Update the filter with the given input and return the filtered output.`

`input_t operator()(input_t input) {`

`state += state_t(input);`

`state_t output = (state + half) >> K;`

`output -= zero >> K;`

`state -= output;`

`return input_t(output);`

`}`

`constexpr static state_t`

`max_state = std::numeric_limits<state_t>::max(),`

`half_state = max_state / 2 + 1,`

`zero = std::is_unsigned<input_t>::value ? state_t(0) : half_state,`

`half = K > 0 ? state_t(1) << (K - 1) : state_t(0);`

`static_assert(std::is_unsigned<state_t>::value,`

`"state type should be unsigned");`

`static_assert(max_state >= std::numeric_limits<input_t>::max(),`

`"state type cannot be narrower than input type");`

`/// Verify the input range to make sure it's compatible with the shift`

`/// factor and the width of the state type.`

`template <class T>`

`constexpr static bool supports_range(T min, T max) {`

`using sstate_t = std::make_signed_t<state_t>;`

`return min <= max &&`

`min >= std::numeric_limits<input_t>::min() &&`

`max <= std::numeric_limits<input_t>::max() &&`

`(std::is_unsigned<input_t>::value`

`? state_t(max) <= (max_state >> K)`

`: min >= -sstate_t(max_state >> (K + 1)) - 1 &&`

`max <= sstate_t(max_state >> (K + 1)));`

`}`

`private:`

`state_t state;`

`};`

When the type is signed, an offset of

To check the range of the input for specific template parameters, you can use the `supports_range`

method:

`EMA<5, int_fast16_t, uint_fast16_t> filter;`

`static_assert(filter.supports_range(-1024, 1023),`

`"use a wider state or input type, or a smaller shift factor");`

### Arduino Example

`template <uint8_t K, class uint_t = uint16_t>`

`class EMA {`

`public:`

`/// Update the filter with the given input and return the filtered output.`

`uint_t operator()(uint_t input) {`

`state += input;`

`uint_t output = (state + half) >> K;`

`state -= output;`

`return output;`

`}`

`static_assert(`

`uint_t(0) < uint_t(-1), // Check that `uint_t` is an unsigned type`

`"The `uint_t` type should be an unsigned integer, otherwise, "`

`"the division using bit shifts is invalid.");`

`/// Fixed point representation of one half, used for rounding.`

`constexpr static uint_t half = 1 << (K - 1);`

`private:`

`uint_t state = 0;`

`};`

`void setup() {`

`Serial.begin(115200);`

`while (!Serial);`

`}`

`const unsigned long interval = 10000; // 10000 µs = 100 Hz`

`void loop() {`

`static EMA<2> filter;`

`static unsigned long prevMicros = micros() - interval;`

`if (micros() - prevMicros >= interval) {`

`int rawValue = analogRead(A0);`

`int filteredValue = filter(rawValue);`

`Serial.print(rawValue);`

`Serial.print('\t');`

`Serial.println(filteredValue);`

`prevMicros += interval;`

`}`

`}`

### Additional resources

The idea to use a constant offset to deal with negative inputs originated in this PJRC forum thread. It also includes a discussion about how the filter works, simulations comparing integer EMA implementations with and without rounding, a pure C implementation, etc.