Saving valuable flash (and data) space was the main objective for creating the fp64lib library, and this was definitely achieved. My main use case, an HP-style RPN calculator, constantly needed more than 32 KByte of flash memory on the Arduino Nano when I used the avr_f64 library. With fp64lib, all the needed routines fit nicely into 32 KByte.
However, as a side effect, execution also sped up dramatically. On average, fp64lib routines are 4 times faster than their avr_f64 counterparts, which are written in C and take advantage of internal knowledge of the IEEE 754 data format and some very clever algorithms.
The following table compares execution times of fp64lib routines with their avr_f64 counterparts, or with my C implementation of a reference algorithm if there was no avr_f64 counterpart. All library and C files were compiled with the AVR gcc cross-compiler version 5.4.0-atmel3.6.1-arduino2 with the -Os option (optimize for size).
How to read the table:
function        fp64lib  avr_f64  diff
fp64_sqrt       156296   2907808  -94.62%
__fp64_mulsf3x  32876    274340   -88.02%
The column "function" names the tested fp64lib function, e.g. fp64_sqrt. Usually, the name of the avr_f64 equivalent starts with f_, e.g. f_sqrt.
Each line corresponds to 1000 calls to the fp64lib/avr_f64 function with the same argument. The time for these 1000 calls is given in microseconds, measured on a 16 MHz Arduino Pro Mini clone. The timing includes all the necessary overhead, such as incrementing counters and calling the function.
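The measurement procedure described above can be sketched roughly as follows. This is only an illustration, not the benchmark code that produced the table: the names f64_bits, unary_fn, clock_us_fn and time_calls are hypothetical, the 64-bit float is assumed to be passed as an opaque 64-bit pattern, and the clock is passed in as a function pointer so that the same code can be tried on a PC with a fake clock or on an Arduino with a small wrapper around micros().

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t f64_bits;               /* hypothetical: opaque 64-bit float pattern */
typedef f64_bits (*unary_fn)(f64_bits);  /* function under test */
typedef uint32_t (*clock_us_fn)(void);   /* microsecond clock */

/* Time `calls` invocations of fn with the same argument.
 * The result includes loop and call overhead, as in the table. */
uint32_t time_calls(unary_fn fn, f64_bits arg, uint16_t calls, clock_us_fn now_us)
{
    assert(fn && now_us);
    volatile f64_bits sink = 0;  /* keeps the compiler from dropping the calls */
    uint32_t start = now_us();
    for (uint16_t i = 0; i < calls; i++) {
        sink = fn(arg);
    }
    /* unsigned subtraction stays correct across one counter wrap-around */
    uint32_t elapsed = now_us() - start;
    (void)sink;
    return elapsed;
}

/* Host-side stand-ins for trying the harness without hardware:
 * every call to fake_op advances the fake clock by 3 microseconds. */
uint32_t fake_now = 0;
uint32_t fake_clock_us(void) { return fake_now; }
f64_bits fake_op(f64_bits x) { fake_now += 3; return x; }
```

With the stand-ins, time_calls(fake_op, 42, 1000, fake_clock_us) reports 3000 fake microseconds. On the real board, fake_op would be replaced by the routine under test and fake_clock_us by the board's microsecond timer.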
So, from the first line above we see that 1000 calls to fp64_sqrt take 156296 microseconds, which is 0.156 seconds (column fp64lib), and 1000 calls to f_sqrt take 2907808 microseconds, which is 2.908 seconds (column avr_f64).
The last column "diff" shows how much time was saved by using fp64lib compared to avr_f64: (156296 - 2907808) / 2907808 = -94.62%. In other words, fp64_sqrt needs about 95% less time, or, turned the other way around, fp64_sqrt is 2.908/0.156 = 18.6 times faster than f_sqrt.
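The arithmetic behind the "diff" column and the speed-up factor can be written as two small helpers. This is a sketch intended for a PC, not part of fp64lib or avr_f64. The function names diff_percent and speedup are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Relative change in run time in percent.
 * Negative values mean fp64lib needed less time than avr_f64. */
double diff_percent(uint32_t fp64lib_us, uint32_t avr_f64_us)
{
    assert(avr_f64_us > 0);
    return ((double)fp64lib_us - (double)avr_f64_us) * 100.0 / (double)avr_f64_us;
}

/* Speed-up factor: how many times faster fp64lib ran. */
double speedup(uint32_t fp64lib_us, uint32_t avr_f64_us)
{
    assert(fp64lib_us > 0);
    return (double)avr_f64_us / (double)fp64lib_us;
}
```

Fed with the fp64_sqrt row, diff_percent(156296, 2907808) gives roughly -94.6 and speedup(156296, 2907808) roughly 18.6, matching the values discussed above.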
However, fp64lib is not always faster. Handling subnormal numbers is very time consuming, because for some routines the subnormal has to be shifted, and shifting a multi-byte, multi-register value is an expensive operation. Also, for some routines, e.g. fp64_frexp, significantly smaller code could be produced by following standard procedures that operate on unpacked data, instead of using faster but bigger routines that operate on the packed data.
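To give a feeling for why shifting a subnormal is costly on an 8-bit CPU, here is a C model of the idea. It is not the fp64lib assembly code. It assumes the significand is held in 7 byte-wide registers (SIG_BYTES, a choice made for this example) and is shifted left one bit at a time, since the AVR shift and rotate instructions move a single register by a single bit.

```c
#include <assert.h>
#include <stdint.h>

#define SIG_BYTES 7  /* assumption: significand spread over 7 byte registers */

/* Shift a little-endian multi-byte value left by one bit,
 * carrying the top bit of each byte into the next one. */
void shift_left_1(uint8_t sig[SIG_BYTES])
{
    uint8_t carry = 0;
    for (int i = 0; i < SIG_BYTES; i++) {
        uint8_t next = (uint8_t)(sig[i] >> 7);
        sig[i] = (uint8_t)((sig[i] << 1) | carry);
        carry = next;
    }
}

/* Shift until the topmost bit is set; returns the number of
 * one-bit shifts performed (0 for an all-zero significand). */
int normalize(uint8_t sig[SIG_BYTES])
{
    assert(sig);
    uint8_t any = 0;
    for (int i = 0; i < SIG_BYTES; i++) {
        any |= sig[i];
    }
    if (!any) {
        return 0;  /* zero cannot be normalized */
    }
    int shifts = 0;
    while (!(sig[SIG_BYTES - 1] & 0x80)) {
        shift_left_1(sig);
        shifts++;
    }
    return shifts;
}
```

In this model, a significand with only its lowest bit set needs 55 one-bit shifts, and each of them touches all 7 bytes, i.e. 385 byte-wide shift steps before the actual computation can start.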