FAQ

Why don't you just use the standard 'double' datatype?

On the AVR/Arduino microprocessors, double is identical to float: both refer to 32-bit floating point numbers with 7-8 digits of precision. So you do not get extra precision by rewriting your code to use double; it is still just float arithmetic.
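
You can check this yourself. Compiled with avr-gcc for an ATmega, both sizes come out as 4, whereas on a PC sizeof(double) is 8:

    #include <stdio.h>

    int main(void) {
        /* on AVR both lines print 4; on a desktop compiler double prints 8 */
        printf("sizeof(float)  = %u\n", (unsigned)sizeof(float));
        printf("sizeof(double) = %u\n", (unsigned)sizeof(double));
        return 0;
    }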

Why is fp64lib not included in gcc as standard?

Integration into the AVR toolchain for gcc is a project of a totally different scale than creating a stand-alone library. Creating this library already took me quite some time, but it works fine and I think it can also be used by others.

However, even though the library includes all the entry points for gcc (as far as I know them), integrating it into gcc is on quite a different level. You definitely need some background knowledge of gcc code generation. Maybe, if enough of you like this library and you find someone on the gcc team to support it, the library could be integrated into gcc. It would probably have to go into a special version where 64-bit floating point arithmetic is only activated by a special command line switch, as there is far too much code out there that assumes that double and float are the same data type on the Arduino.

Who needs a 64-bit floating point library on an 8-bit microprocessor?

I definitely had the need when creating a project for a clone of an HP RPN-style calculator with minimum production costs. It was clear that I would use one of the AVR microprocessors, and I went for an Arduino Pro Mini with an ATmega328P. And for a calculator, having 6-7 digits of precision and a number range from 1E-38 to 1E+38 is just not sufficient. You definitely expect at least 10 digits of precision and a number range from 1E-99 to 1E+99.

There is another use case where people have already asked for a 64-bit library: if you attach a GPS sensor and want to do some calculations based on the current position. A standard NMEA position is XXXYY.ZZZZ, with XXX being degrees, YY minutes and ZZZZ the decimal fraction of a minute. E.g., 03541.0200 N, 13946.2800 E is a position in Tokyo, Japan. To get a decimal reading, the conversion is XXX + YY.ZZZZ / 60, i.e. 35.683666667 N, 139.771333333 E for the above example. With the normal 32-bit library, you only get a precision of 7, sometimes 8 digits (counting digits in scientific notation), making 35.68367 N, 139.7713 E. If you do not take precautions, these little differences will add up, and after a short period of time your position, your mileage or your speed will differ significantly.
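
The effect is easy to reproduce. This small C program (run on a PC, where double really is 64 bits) converts the latitude from the example above once in float and once in double precision; on an AVR without fp64lib, both variants would silently be the float one:

    #include <stdio.h>

    int main(void) {
        /* 03541.0200 N = 35 degrees, 41.0200 minutes */
        float  latF = 35.0f + 41.0200f / 60.0f;
        double latD = 35.0  + 41.0200  / 60.0;

        printf("float : %.12f\n", (double)latF);  /* good to ~7 digits  */
        printf("double: %.12f\n", latD);          /* good to ~16 digits */
        return 0;
    }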

Other use cases include steering and tracking telescopes, where you also need celestial computations with more than 7 digits of precision.

Why didn't you just compile the softfloat library from Berkeley or the generic gcc math library?

Short answer:

Because they are generic and therefore not optimized for the AVR/Arduino microprocessors.

Long answer:

The instruction set of the ATmega328 microprocessor is not very well suited for operations that affect more than one register (= one byte). There are only a few instructions that work on a register pair, like sbiw or movw. And some of the operations you definitely need for mathematics, like working with bitmasks, cannot be applied to the lower 16 of the available 32 registers.

So, even with the -O2 option of the gcc compiler, code for manipulating 64-bit values tends to be quite large and sometimes clumsy. If you have some insight into the algorithms used, you can often find ways to code them with fewer instructions.

On top of that, the libraries mentioned are reference implementations for compatibility with IEEE 754, and this takes up quite some code. Full compatibility is not always needed on a microcontroller. Of course, if you are computing a numerical simulation, you need to control the rounding mode. But if you just want to calculate the distance between several readings of a GPS sensor, you can get away with the default rounding mode.

So, fp64lib sacrifices some IEEE 754 compliance, and sometimes some accuracy, in favor of a “fully” space-optimized library (there is still room for improvement) for the ATmega328 microprocessor.

Why did you not choose a different microprocessor, e.g. ESP32?

Yes, there are other microprocessors out there that already have full 64-bit libraries. But these alternatives all come with some drawbacks: they are more expensive, the libraries you are looking for are not available, the user community is smaller, the pin layout is different, … you name it.

The AVR/Arduino microprocessors have a huge community, and you can definitely find everything for them. Getting easy support was one of my motivations to move to that product line.

Why do you support only the AVR ATmega product line and not the AVR ATtiny or the AT90?

The ATtiny and AT90 product lines lack important instructions that are used in several places throughout the code, like MUL. It is already a challenge to create a 64-bit multiplication on top of just an 8-bit multiplication, but doing it without any multiplication instruction is way more awkward and time consuming (see the sketch below). This is also the reason why the standard 32-bit floating point library is not available for these models either.
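
To illustrate the point, here is a minimal shift-and-add sketch in C of what an MCU without a MUL instruction has to do for every single 8x8-bit product, and a 64-bit floating point multiplication needs many of those:

    #include <stdint.h>

    /* multiply two 8-bit values without a MUL instruction:
       add the shifted multiplicand once for every set bit of b */
    uint16_t mul8x8(uint8_t a, uint8_t b) {
        uint16_t result = 0;
        uint16_t addend = a;
        while (b) {
            if (b & 1)
                result += addend;
            addend <<= 1;
            b >>= 1;
        }
        return result;
    }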

What is all that nonsense about NaN and Inf?

Actually, both NaN (for “Not a Number”) and Inf (for “Infinity”) are very useful concepts, especially on microprocessors. Both concepts help you concentrate your error checking in a very few places.

Routines will return NaN if the result of the computation is not defined. Typical examples are the square root of a negative number like -1.0, or the inverse sine of a number > 1. fp64_sqrt() will return NaN if you supply it with a negative number. The good thing is that most routines will also return NaN if one of their arguments is NaN. So you can daisy-chain calculations without checking intermediate results and only do a final check, e.g. before displaying the result. This saves valuable flash space and keeps your code clean.
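
As a sketch of such a daisy-chain: fp64_sqrt() is mentioned above; fp64_add(), fp64_sd() (float to float64_t conversion) and fp64_isnan() follow the library's naming convention, but verify them against your copy of fp64lib.h:

    #include <fp64lib.h>

    float64_t x = fp64_sd(-1.0f);              /* convert float to float64_t     */
    float64_t y = fp64_sqrt(x);                /* NaN: square root of a negative */
    float64_t z = fp64_add(y, fp64_sd(2.0f));  /* NaN propagates through the add */

    if (fp64_isnan(z)) {
        /* one single check at the end of the whole chain */
    }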

The same goes for Inf. Inf is used to extend the range of usable numbers. If, for example, you are multiplying two big numbers, e.g. 1e200 * 2e150, the result would be 2e350, which is outside the range of representable numbers. However, both 1e200 and 2e150 are valid arguments to the multiplication, so returning NaN is not an option. Some implementations return the biggest number that can be represented, about 1.8e308. This is a bad option, as that result is not correct (it is off by a factor of roughly 1e42) and cannot be distinguished from a correct result (like 1e200 * 1e108). So, returning +Inf is the solution. And as with NaN, +Inf is propagated throughout daisy-chained calculations.
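
And a matching sketch for Inf, again with assumed entry points (fp64_atof(), fp64_mul(), fp64_isinf(); check fp64lib.h), using the constants from the example above:

    #include <fp64lib.h>

    float64_t a = fp64_atof("1e200");
    float64_t b = fp64_atof("2e150");
    float64_t c = fp64_mul(a, b);   /* 2e350 is out of range -> +Inf */

    if (fp64_isinf(c)) {
        /* handle the overflow here, not after every multiplication */
    }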

Again, both NaN and Inf will help you save valuable flash space, as you can reduce your error checking to a few places, with cleaner code as a result.

Why did you implement the trigonometric functions using the CORDIC algorithm and not with a Taylor series?

Taylor series are great for approximating even the most complex functions. However, you have to understand the mechanics behind them and what the consequences are.

First, a Taylor series approximates best in the close neighbourhood of the point you selected for deriving it. The further you move away, the bigger the numerical difference between the exact value and the Taylor approximation becomes. This can of course be compensated by adding more and more terms to the series.

The Taylor series usually used for sine is

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... - ...

With n terms of the above series (the example above stops at n = 4), the maximum error is bounded by

ErrSin(x) <= abs(x^(2*n+1))/(2*n+1)!

This is great for small angles: for PI/18 (10 degrees), the maximum error is <= 3.3*10^-13. And if we add one more term to the series, the maximum error drops to 9.2*10^-17. So with just 5 terms, we have 17 digits of precision.

However, the error depends on x and looks really bad for bigger angles. Let's take PI/4 (45 degrees): the maximum error for the 4-term Taylor series is 2.5*10^-7, so 6 digits worse than for PI/18. And by adding a 5th term, it drops only to 1.4*10^-9, still 8 digits away from our 17 digits of precision!
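
You can reproduce errors of this magnitude with a few lines of C on a PC, comparing the n-term polynomial against the 64-bit library sine:

    #include <stdio.h>
    #include <math.h>

    /* n-term Taylor polynomial for sin(x): x - x^3/3! + x^5/5! - ... */
    double taylor_sin(double x, int terms) {
        double term = x, sum = x;
        for (int k = 1; k < terms; k++) {
            /* next term: multiply by -x^2 / ((2k) * (2k+1)) */
            term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
            sum  += term;
        }
        return sum;
    }

    int main(void) {
        double angles[] = { M_PI / 18, M_PI / 4 };
        for (int i = 0; i < 2; i++)
            for (int n = 4; n <= 5; n++)
                printf("x = %8.6f, %d terms: error = %.2e\n", angles[i], n,
                       fabs(taylor_sin(angles[i], n) - sin(angles[i])));
        return 0;
    }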

As a consequence, either more terms have to be added or other techniques, like argument folding, are needed to reach a defined precision. Adding another term to the Taylor series requires at least two floating point multiplications (x^2 and 1/(2n+1)!) and one floating point addition, all very time-consuming operations on the AVR microprocessors.

The CORDIC algorithm avoids these drawbacks. HP calculators have used CORDIC since the late 1960s, as it gives a guaranteed precision per iteration and uses only simple add and shift operations in each step. As a positive side effect, this also gives us a guaranteed maximum execution time on our microprocessor.

On top of that, both the sine and the cosine are calculated in each iteration, so for calculating the tangent only a final division is needed. And as an additional benefit, the inverse functions (inverse sine and cosine) can be calculated by the same engine, so flash space does not increase if we add the inverse functions to our project.
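
To give an idea of the structure, here is a minimal CORDIC rotation in host C with double arithmetic. It is not the fp64lib implementation, just an illustration of the principle: one loop of additions and scalings by 2^-i (a plain right shift in fixed-point arithmetic) produces sine and cosine simultaneously:

    #include <stdio.h>
    #include <math.h>

    #define ITER 40

    int main(void) {
        double angle = M_PI / 4;   /* target angle, |angle| < ~1.74 rad       */
        double x = 1.0, y = 0.0, z = angle;
        double K = 1.0;            /* accumulated inverse CORDIC gain         */
        double pow2 = 1.0;         /* 2^-i, a pure right shift in fixed point */

        for (int i = 0; i < ITER; i++) {
            double alpha = atan(pow2); /* taken from a small table in a real implementation */
            double d = (z >= 0.0) ? 1.0 : -1.0;
            double xn = x - d * y * pow2;   /* shift-and-add rotation step */
            y = y + d * x * pow2;
            x = xn;
            z -= d * alpha;                 /* drive the residual angle to 0 */
            K *= cos(alpha);                /* in practice the constant ~0.6072529 */
            pow2 *= 0.5;
        }
        printf("cos = %.15f (libm: %.15f)\n", x * K, cos(angle));
        printf("sin = %.15f (libm: %.15f)\n", y * K, sin(angle));
        return 0;
    }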

Why ... ?

42