FAQ

Why don't you just use the standard 'double' datatype?

On the AVR/Arduino microprocessors, double is identical to float: both refer to 32-bit floating point numbers with 7-8 digits of precision. So you do not get extra precision by rewriting your code to use double; it is still just float arithmetic.
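
You can check this yourself. Compiled with avr-gcc for an ATmega, both sizes come out as 4, whereas on a PC sizeof(double) is 8:

    #include <stdio.h>

    int main(void) {
        /* on AVR both lines print 4; on a desktop compiler double prints 8 */
        printf("sizeof(float)  = %u\n", (unsigned)sizeof(float));
        printf("sizeof(double) = %u\n", (unsigned)sizeof(double));
        return 0;
    }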

Why is fp64lib not included in gcc as standard?

Integration into the AVR toolchain for gcc is a project of a totally different scale than creating a stand-alone library. Creating this library already took me quite some time, but it works fine and I think it can also be used by others.

However, even though the library includes all the entry points for gcc (as far as I know them), integrating it into gcc is on quite a different level. You definitely need some background knowledge of gcc code generation. Maybe, if enough of you like this library and you find someone on the gcc team to support it, the library could be integrated into gcc. It would probably have to go into a special version where 64-bit floating point arithmetic is only activated by a special command line switch, as there is far too much code out there that assumes that double and float are the same data type on the Arduino.

Who needs a 64-bit floating point library on an 8-bit microprocessor?

I definitely had the need when creating a project for a clone of an HP RPN-style calculator with minimum production costs. It was clear that I would use one of the AVR microprocessors, and I went for an Arduino Pro Mini with an ATmega328P. And for a calculator, having 6-7 digits of precision and a number range from 1E-38 to 1E+38 is just not sufficient. You definitely expect at least 10 digits of precision and a number range from 1E-99 to 1E+99.

There is another use case where people have already asked for a 64-bit library: if you attach a GPS sensor and want to do some calculations based on the current position. A standard NMEA position is XXXYY.ZZZZ, with XXX being degrees, YY minutes and ZZZZ the decimal fraction of a minute. E.g., 03541.0200 N, 13946.2800 E is a position in Tokyo, Japan. To get a decimal reading, the conversion is XXX + YY.ZZZZ / 60, i.e. 35.683666667 N, 139.771333333 E for the above example. With the normal 32-bit library, you only get a precision of 7, sometimes 8 digits (counting digits in scientific notation), making 35.68367 N, 139.7713 E. If you do not take precautions, these little differences will add up, and after a short period of time your position, your mileage or your speed will differ significantly.
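
The effect is easy to reproduce. This small C program (run on a PC, where double really is 64 bits) converts the latitude from the example above once in float and once in double precision; on an AVR without fp64lib, both variants would silently be the float one:

    #include <stdio.h>

    int main(void) {
        /* 03541.0200 N = 35 degrees, 41.0200 minutes */
        float  latF = 35.0f + 41.0200f / 60.0f;
        double latD = 35.0  + 41.0200  / 60.0;

        printf("float : %.12f\n", (double)latF);  /* good to ~7 digits  */
        printf("double: %.12f\n", latD);          /* good to ~16 digits */
        return 0;
    }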

Other use cases include steering and tracking telescopes, where you also need celestial computations with more than 7 digits of precision.

Why didn't you just compile the softfloat library from Berkeley or the generic gcc math library?

Short answer:

Because they are generic and therefore not optimized for the AVR/Arduino microprocessors.

Long answer:

The instruction set of the ATmega328 microprocessor is not very well suited for operations that affect more than one register (= one byte). There are only a few instructions that work on a register pair, like sbiw or movw. And some of the operations you definitely need for mathematics, like working with bitmasks, cannot be applied to the lower 16 of the available 32 registers.

So, even with the -O2 option of the gcc compiler, code for manipulating 64-bit values tends to be quite large and sometimes clumsy. If you have some insight into the algorithms used, you can often find ways to code them with fewer instructions.

On top of that, the libraries mentioned are reference implementations for compatibility with IEEE 754, and this takes up quite some code. Full compatibility is not always needed on a microcontroller. Of course, if you are computing a numerical simulation, you need to control the rounding mode. But if you just want to calculate the distance between several readings of a GPS sensor, you can get away with the default rounding mode.

So, fp64lib sacrifices some IEEE 754 compliance, and sometimes some accuracy, in favor of a “fully” space-optimized library (there is still room for improvement) for the ATmega328 microprocessor.

Why did you not choose a different microprocessor, e.g. ESP32?

Yes, there are other microprocessors out there that already have full 64-bit libraries. But these alternatives all come with some drawbacks: they are more expensive, the libraries you are looking for are not available, the user community is smaller, the pin layout is different, … you name it.

The AVR/Arduino microprocessors have a huge community, and you can definitely find everything for them. Getting easy support was one of my motivations to move to that product line.

Why do you support only the AVR ATmega product line and not the AVR ATtiny or the AT90?

The ATtiny and AT90 product lines lack important instructions that are used in several places throughout the code, like MUL. It is already a challenge to create a 64-bit multiplication on top of just an 8-bit multiplication, but doing it without any multiplication instruction is way more awkward and time consuming (see the sketch below). This is also the reason why the standard 32-bit floating point library is not available for these models either.
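
To illustrate the point, here is a minimal shift-and-add sketch in C of what an MCU without a MUL instruction has to do for every single 8x8-bit product, and a 64-bit floating point multiplication needs many of those:

    #include <stdint.h>

    /* multiply two 8-bit values without a MUL instruction:
       add the shifted multiplicand once for every set bit of b */
    uint16_t mul8x8(uint8_t a, uint8_t b) {
        uint16_t result = 0;
        uint16_t addend = a;
        while (b) {
            if (b & 1)
                result += addend;
            addend <<= 1;
            b >>= 1;
        }
        return result;
    }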

What is all that nonsense about NaN and Inf?

Actually, both NaN (for “Not a Number”) and Inf (for “Infinity”) are very useful concepts, especially on microprocessors. Both concepts help you concentrate your error checking in a very few places.

Routines will return NaN if the result of the computation is not defined. Typical examples are the square root of a negative number like -1.0, or the inverse sine of a number > 1. fp64_sqrt() will return NaN if you supply it with a negative number. The good thing is that most routines will also return NaN if one of their arguments is NaN. So you can daisy-chain calculations without checking intermediate results and only do a final check, e.g. before displaying the result. This saves valuable flash space and keeps your code clean.
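
As a sketch of such a daisy-chain: fp64_sqrt() is mentioned above; fp64_add(), fp64_sd() (float to float64_t conversion) and fp64_isnan() follow the library's naming convention, but verify them against your copy of fp64lib.h:

    #include <fp64lib.h>

    float64_t x = fp64_sd(-1.0f);              /* convert float to float64_t     */
    float64_t y = fp64_sqrt(x);                /* NaN: square root of a negative */
    float64_t z = fp64_add(y, fp64_sd(2.0f));  /* NaN propagates through the add */

    if (fp64_isnan(z)) {
        /* one single check at the end of the whole chain */
    }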

The same goes for Inf. Inf is used to extend the range of usable numbers. If, for example, you are multiplying two big numbers, e.g. 1e200 * 2e150, the result would be 2e350, which is outside the range of representable numbers. However, both 1e200 and 2e150 are valid arguments to the multiplication, so returning NaN is not an option. Some implementations return the biggest number that can be represented, about 1.8e308. This is a bad option, as that result is not correct (it is off by a factor of roughly 1e42) and cannot be distinguished from a correct result (like 1e200 * 1e108). So, returning +Inf is the solution. And as with NaN, +Inf is propagated throughout daisy-chained calculations.
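
And a matching sketch for Inf, again with assumed entry points (fp64_atof(), fp64_mul(), fp64_isinf(); check fp64lib.h), using the constants from the example above:

    #include <fp64lib.h>

    float64_t a = fp64_atof("1e200");
    float64_t b = fp64_atof("2e150");
    float64_t c = fp64_mul(a, b);   /* 2e350 is out of range -> +Inf */

    if (fp64_isinf(c)) {
        /* handle the overflow here, not after every multiplication */
    }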

Again, both NaN and Inf will help you save valuable flash space, as you can reduce your error checking to a few places, with cleaner code as a result.

Why did you implement the trigonometric functions using the CORDIC algorithm and not with a Taylor series?

Taylor series are great for approximating even the most complex functions. However, you have to understand the mechanics behind them and what the consequences are.

First, a Taylor series approximates best in the close neighbourhood of the point you selected for deriving it. The further you move away, the bigger the numerical difference between the exact value and the Taylor approximation becomes. This can of course be compensated by adding more and more terms to the series.

The Taylor series usually used for sine is

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ... - ...

With n terms of the above series (the example above stops at n = 4), the maximum error is bounded by

ErrSin(x) <= abs(x^(2*n+1))/(2*n+1)!

This is great for small angles: for PI/18 (10 degrees), the maximum error is <= 3.3*10^-13. And if we add one more term to the series, the maximum error drops to 9.2*10^-17. So with just 5 terms, we have 17 digits of precision.

However, the error depends on x and looks really bad for bigger angles. Let's take PI/4 (45 degrees): the maximum error for the 4-term Taylor series is 2.5*10^-7, so 6 digits worse than for PI/18. And by adding a 5th term, it drops only to 1.4*10^-9, still 8 digits away from our 17 digits of precision!
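
You can reproduce errors of this magnitude with a few lines of C on a PC, comparing the n-term polynomial against the 64-bit library sine:

    #include <stdio.h>
    #include <math.h>

    /* n-term Taylor polynomial for sin(x): x - x^3/3! + x^5/5! - ... */
    double taylor_sin(double x, int terms) {
        double term = x, sum = x;
        for (int k = 1; k < terms; k++) {
            /* next term: multiply by -x^2 / ((2k) * (2k+1)) */
            term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
            sum  += term;
        }
        return sum;
    }

    int main(void) {
        double angles[] = { M_PI / 18, M_PI / 4 };
        for (int i = 0; i < 2; i++)
            for (int n = 4; n <= 5; n++)
                printf("x = %8.6f, %d terms: error = %.2e\n", angles[i], n,
                       fabs(taylor_sin(angles[i], n) - sin(angles[i])));
        return 0;
    }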

As a consequence, either more terms have to be added or other techniques, like argument folding, are needed to reach a defined precision. Adding another term to the Taylor series requires at least two floating point multiplications (x^2 and 1/(2n+1)!) and one floating point addition, all very time-consuming operations on the AVR microprocessors.

The CORDIC algorithm avoids these drawbacks. HP calculators have used CORDIC since the late 1960s, as it gives a guaranteed precision per iteration and uses only simple add and shift operations in each step. As a positive side effect, this also gives us a guaranteed maximum execution time on our microprocessor.

On top of that, both the sine and the cosine are calculated in each iteration, so for calculating the tangent only a final division is needed. And as an additional benefit, the inverse functions (inverse sine and cosine) can be calculated by the same engine, so flash space does not increase if we add the inverse functions to our project.
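
To give an idea of the structure, here is a minimal CORDIC rotation in host C with double arithmetic. It is not the fp64lib implementation, just an illustration of the principle: one loop of additions and scalings by 2^-i (a plain right shift in fixed-point arithmetic) produces sine and cosine simultaneously:

    #include <stdio.h>
    #include <math.h>

    #define ITER 40

    int main(void) {
        double angle = M_PI / 4;   /* target angle, |angle| < ~1.74 rad       */
        double x = 1.0, y = 0.0, z = angle;
        double K = 1.0;            /* accumulated inverse CORDIC gain         */
        double pow2 = 1.0;         /* 2^-i, a pure right shift in fixed point */

        for (int i = 0; i < ITER; i++) {
            double alpha = atan(pow2); /* taken from a small table in a real implementation */
            double d = (z >= 0.0) ? 1.0 : -1.0;
            double xn = x - d * y * pow2;   /* shift-and-add rotation step */
            y = y + d * x * pow2;
            x = xn;
            z -= d * alpha;                 /* drive the residual angle to 0 */
            K *= cos(alpha);                /* in practice the constant ~0.6072529 */
            pow2 *= 0.5;
        }
        printf("cos = %.15f (libm: %.15f)\n", x * K, cos(angle));
        printf("sin = %.15f (libm: %.15f)\n", y * K, sin(angle));
        return 0;
    }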

Why ... ?

42