fp64lib Library reference (A-Z)

This reference is ordered alphabetically, with the exception of the conversion functions from and to float64_t. All the conversion functions are listed near the end of the reference: first the functions that convert 64-bit float data into other data formats, then the functions that convert data from other formats into 64-bit floating point.

Where available, the headline shows the corresponding name or concept for a standard CPU and math.h compatible double implementation, e.g. abs or (long). Then the exact calling convention is listed, e.g. float64_t fp64_abs( float64_t x);, followed by a brief description of the function. When there are any special conditions/cases for one or more of the function arguments, these are then listed in a table. The value any for a parameter refers to any other value that is not listed in the table for the respective parameter.

Reference by Functional Groups see here…

Constants

fp64lib comes with the constants listed in the following table, defined in fp64lib.h. These constants can be used for any parameter of a function and can be directly assigned to a variable. They always have the maximum possible 64-bit precision, even if their decimal representation shown below is only an approximation.

constant name	(approx.) value	comment
float64_EULER_E	2.7182818284590452	Euler’s number e
float64_NUMBER_PI	3.1415926535897932	pi, π
float64_NUMBER_PIO2	1.5707963267948966	π/2
float64_NUMBER_2PI	6.2831853071795862	2*π
float64_NUMBER_ONE	+1
float64_NUMBER_MINUS_ONE	-1
float64_NUMBER_PLUS_ZERO	0	0 or +0
float64_NUMBER_MINUS_ZERO	-0	*see below
float64_ONE_POSSIBLE_NAN_REPRESENTATION	NaN	**see below
float64_PLUS_INFINITY	+INF	+∞
float64_MINUS_INFINITY	-INF	-∞

*The IEEE754 concept of having two representations of 0 looks unfamiliar to most users, as “there is only one 0”. However, in algorithms for numerical approximations, having two representations of 0 has the big advantage of providing additional information about the side from which 0 was approached. E.g., fp64_mul and fp64_div will return -0 if the result of the operation underflows on the negative side, and +0 for an underflow on the positive side.

**To allow a function to return some partial information about the failed operation, that information can be included as part of the NaN. As a result, NaN is not a single value but has many possible representations, whereas positive and negative infinity each have exactly one representation. So your code may compare against +Inf or -Inf, but tests against NaN will most probably fail. Instead of comparing to NaN, use fp64_isnan().
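
The footnote on NaN can be illustrated without the library itself. The following is a minimal sketch in plain C that works on raw IEEE 754 binary64 bit patterns. It assumes that a float64_t value carries such a 64-bit pattern; the helper name bits_is_nan and the three named patterns are hypothetical and not part of fp64lib.

```c
#include <stdint.h>

/* Hypothetical helper, not part of fp64lib: classify a raw IEEE 754
   binary64 bit pattern. A NaN has all 11 exponent bits set and a
   non-zero 52-bit mantissa; the sign bit is ignored. */
static int bits_is_nan(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7ffu;
    uint64_t mantissa = bits & 0x000fffffffffffffULL;
    return exponent == 0x7ffu && mantissa != 0;
}

/* Two of the many valid NaN encodings, and the single +Inf encoding. */
static const uint64_t NAN_ALL_ONES = 0x7fffffffffffffffULL;
static const uint64_t NAN_QUIET    = 0x7ff8000000000000ULL;
static const uint64_t PLUS_INF     = 0x7ff0000000000000ULL;
```

Both NaN patterns pass the test although they are not equal to each other, which is why an equality comparison against one particular NaN constant is unreliable.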

abs

float64_t fp64_abs( float64_t x);

Return absolute value of x, removing sign. This works even if x is NaN, Inf or zero, e.g. fp64_abs(-INF) will return +INF.

acos

float64_t fp64_acos( float64_t x );

Returns the arc cosine of x, the inverse function to fp64_cos(). The following special cases apply:

case	x	fp64_acos(x)
1	NaN	NaN
2	x > +1 (incl. +INF)	NaN
3	x < -1 (incl. -INF)	NaN

acosh

float64_t fp64_acosh( float64_t x );

Returns the inverse hyperbolic cosine (a.k.a. area hyperbolic cosine) of x, which is defined mathematically as

arcosh(x) = ln(x + sqrt(x*x-1)) for x >= 1

If x < 1, NaN is returned.

Available with release 1.1.20 onwards.

add (+)

float64_t fp64_add( float64_t A, float64_t B );

Adds two 64-bit floating point numbers and rounds the result to a 64-bit floating point number. The following special cases apply:

case	A	B	fp64_add(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3a	+Inf	-Inf	NaN
3b	+Inf	+Inf	+Inf
3c	-Inf	-Inf	-Inf
3d	-Inf	+Inf	NaN
4	Inf	any	Inf
5	any	Inf	Inf
6	0	any	any
7	any	0	any

asin

float64_t fp64_asin( float64_t x );

Returns the arc sine of x, the inverse function to fp64_sin(). The following special cases apply:

case	x	fp64_asin(x)
1	NaN	NaN
2	x > +1 (incl. +INF)	NaN
3	x < -1 (incl. -INF)	NaN

asinh

float64_t fp64_asinh( float64_t x );

Returns the inverse hyperbolic sine (a.k.a. area hyperbolic sine) of x, which is defined mathematically as

arsinh(x) = ln(x + sqrt(x*x+1))

Available with release 1.1.20 onwards.

The following special cases apply:

case	x	fp64_asinh(x)
1	NaN	NaN
2a	+Inf	+Inf
2b	-Inf	-Inf

atan

float64_t fp64_atan( float64_t x );

Returns the arc tangent of x, the inverse function to fp64_tan(). The following special cases apply:

case	x	fp64_atan(x)
1	NaN	NaN
2a	+Inf	+PI/2
2b	-Inf	-PI/2

atan2

float64_t fp64_atan2 ( float64_t A, float64_t B );

The fp64_atan2() function calculates the arc tangent of the two variables A and B, where A can be interpreted as the y coordinate and B as the x coordinate. It is similar to calculating the arc tangent of A/B, except that the signs of both arguments are used to determine the quadrant of the result. The fp64_atan2() function returns the result in radians, which is between -PI and PI (inclusive):

Sector|   x  |   y  | |y|<=|x| | result range
------+------+------+----------+-----------------
I     | >= 0 | >= 0 |    yes   | 0..PI/4
II    | >= 0 | >= 0 |     no   | PI/4..PI/2
III   | < 0  | >= 0 |     no   | PI/2..3*PI/4
IV    | < 0  | >= 0 |    yes   | 3*PI/4..PI
V     | < 0  | < 0  |    yes   | -PI..-3*PI/4
VI    | < 0  | < 0  |     no   | -3*PI/4..-PI/2
VII   | >= 0 | < 0  |     no   | -PI/2..-PI/4
VIII  | >= 0 | < 0  |    yes   | -PI/4..0

Note: Unlike the 32bit floating library, this implementation respects all four cases of 0/0, like x86 (GCC/Glibc): +0/+0, +0/-0, -0/+0 and -0/-0.

The following special cases apply:

case	A = y	B = x	fp64_atan2(A,B)
1a	NaN	any	NaN
1b	any	NaN	NaN
2a	+0	>= 0	+0
2b	+0	< 0	+PI
2c	-0	>= 0	-0
2d	-0	< 0	-PI
3a	+finite	+Inf	+0
3b	+finite	-Inf	+PI
3c	-finite	+Inf	-0
3d	-finite	-Inf	-PI
3e	+/- Inf	+/- Inf	NaN
3f	+/- Inf	any	+/- PI/2
4	|y| <= |x|	>= 0	see sector I & VIII: atan(y/x)+0
5	|y| <= |x|	< 0	see sector IV & V: atan(y/x)+/-PI
6	>= 0	|y| > |x|	see sector II & III: -atan(x/y)+PI/2
7	< 0	|y| > |x|	see sector VI & VII: -atan(x/y)-PI/2
8a	+any	+/- 0	+PI/2
8b	-any	+/- 0	-PI/2
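
The sector table can be read as a small decision procedure. The following is a minimal sketch on the host's double type, not fp64lib code; the function name atan2_sector is hypothetical. It only covers ordinary finite inputs and does not model the signed-zero, Inf and NaN cases of the special-case table.

```c
/* Hypothetical helper, not part of fp64lib: return the sector (1..8,
   i.e. I..VIII in the sector table) that the point (x, y) falls into.
   Argument order follows fp64_atan2(A = y, B = x). */
static int atan2_sector(double y, double x)
{
    double ax = x < 0 ? -x : x;
    double ay = y < 0 ? -y : y;
    int steep = ay > ax;          /* |y| <= |x| is "yes" in the table */

    if (y >= 0) {
        if (x >= 0) return steep ? 2 : 1;
        return steep ? 3 : 4;
    }
    if (x < 0) return steep ? 6 : 5;
    return steep ? 7 : 8;
}
```

For example, the point x = -1, y = 0.5 lands in sector IV, so its result lies between 3*PI/4 and PI.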

atanh

float64_t fp64_atanh( float64_t x );

Returns the inverse hyperbolic tangent (a.k.a. area hyperbolic tangent) of x, which is defined mathematically as

artanh(x) = 0.5*ln((1+x)/(1-x)) for |x| < 1

If |x| >= 1, NaN is returned.

Available with release 1.1.20 onwards.

atof

float64_t fp64_atof( char *str );

Converts a string to a number; also handles NaN, +INF and -INF.
A valid number (neither NaN nor INF) can be expressed by the
following regex: [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?

cbrt

float64_t fp64_cbrt(float64_t x);

The fp64_cbrt() function returns the cube root of x. It works for positive and
negative values of x.

ceil

float64_t fp64_ceil( float64_t A );

Rounds A upwards to the nearest integer greater than or equal to A.
For A < 0 this is identical to fp64_trunc(A). The following rules apply:

case	A	fp64_ceil(A)
0	A < 0	fp64_trunc(A)
1	NaN	NaN
2	+Inf	+Inf
3	+0	+0
4	A >= 2^52	A
5	0 < A < 1	1.0
6	1 <= A < 2^52	fp64_ceil(A)

classify

int fp64_classify( float64_t x );

Categorizes floating point value x into the following categories:
zero, subnormal, normal, infinite, and NAN. For easier handling, the FP_xxx constants shown below are defined in fp64lib.h and can be used instead of the numerical values.
The following cases apply:

case	x	fp64_classify(x)
1	NaN	1 (FP_NAN)
2	+/- Inf	2 (FP_INFINITE)
3	+/- 0	0 (FP_ZERO)
4	0 < |x| < 2^-1022	3 (FP_SUBNORMAL)
5	any	4 (FP_NORMAL)

compare ( <, ==, > )

float64_t fp64_compare (float64_t A, float64_t B);

Compares the two values A and B and returns -1/0/1, based on the following rules:

case	A	B	fp64_compare(A,B)
1	NaN	any	+1
2	any	NaN	+1
3	A < B	any	-1
4	A == B	any	0
5	A > B	any	1
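
Note that a result of +1 is returned both for A > B and when one argument is NaN. The following minimal sketch mirrors the table on the host's double type to make that visible; it is not the fp64lib implementation, and the names compare_sketch and is_strictly_greater are hypothetical.

```c
/* Hypothetical stand-in on the host's double that mirrors the table
   of fp64_compare; it is not the fp64lib implementation. */
static int compare_sketch(double a, double b)
{
    if (a != a || b != b) return 1;    /* NaN is the only value with x != x */
    if (a < b) return -1;
    return (a > b) ? 1 : 0;
}

/* A result of +1 alone cannot tell "A > B" from "unordered", so a
   caller that cares has to rule out NaN first. */
static int is_strictly_greater(double a, double b)
{
    return a == a && b == b && compare_sketch(a, b) == 1;
}
```

With fp64lib itself, the NaN check would presumably be done with fp64_isnan() before calling fp64_compare().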

copysign

float64_t fp64_copysign ( float64_t x, float64_t y );

The fp64_copysign() function returns x but with the sign of y. This works even if x or y are NaN, Inf or zero.

cos

float64_t fp64_cos( float64_t phi );

Returns the cosine of phi. The following special cases apply:

case	phi	fp64_cos(phi)
1	NaN	NaN
2	+INF	NaN
3	-INF	NaN

See also the article “fp64_sin(float64_NUMBER_PI) is not 0”.

cosh

float64_t fp64_cosh( float64_t x );

The fp64_cosh() function returns the hyperbolic cosine of x, which is
defined mathematically as (exp(x) + exp(-x)) / 2.

cotan

float64_t fp64_cotan( float64_t x );

The fp64_cotan() function returns the cotangent of x, which is
defined mathematically as cos(x) / sin(x) or 1/tan(x).

fp64_cut_noninteger_fraction

float64_t fp64_cut_noninteger_fraction( float64_t x );

Alias to fp64_trunc, see there.

div ( / )

float64_t fp64_div( float64_t A, float64_t B );

Divides two 64-bit floating point numbers and rounds the result of A/B to a 64-bit floating point number. The following special cases apply:

case	A	B	fp64_div(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3	Inf	0	NaN
4	Inf	Inf	NaN
5	0	Inf	0
6	Inf	!= 0	Inf
7	!= 0	Inf	0
8	0	0	NaN
9	0	!= 0	0

etoa

char *fp64_ftoa( float64_t x, uint8_t maxDigits, uint8_t expSep, int16_t *exp10 )

Converts a number to a string with maxDigits of significand in engineering format, where the exponent is always a multiple of 3.

parameters:
x: number to convert in float64_t format
maxDigits: maximum number of digits in significand
expSep: flag to store significand and exponent separately
exp10: if != NULL, store exponent base 10 here, will be a multiple of 3

NOTE: this function supports conversion of up to 17 digits. However, accuracy of IEEE754 is only 52 bit which is 15-16 decimal digits!

WARNING: function returns a pointer to a static temporary scratch area which might be used also for other functions. The returned string might become invalid/scrambled if one of the other fp64_ functions will be called. So copy the result to some other allocated memory that you have under control before calling further fp64lib-routines.
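
The hazard described in the warning, and the recommended copy, can be sketched as follows. The function to_text is a hypothetical stand-in that only imitates the "pointer to a shared static buffer" behaviour with snprintf; it is not an fp64lib routine, and to_text_copy is likewise an assumed helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a conversion routine: like the functions
   described above, it returns a pointer to a shared static buffer. */
static char *to_text(double x)
{
    static char scratch[32];
    snprintf(scratch, sizeof scratch, "%.3E", x);
    return scratch;
}

/* Copy the result into caller-owned memory before the next call can
   overwrite the shared buffer. */
static void to_text_copy(double x, char *dst, size_t dst_size)
{
    strncpy(dst, to_text(x), dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

A pointer kept from an earlier call sees the text of the latest call, whereas the copied string stays intact.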

exp

float64_t fp64_exp (float64_t x);

The fp64_exp() function returns the value of e (the base of natural
logarithms) raised to the power of x (e^x). The following rules apply:

case	x	fp64_exp(x)
1	NaN	NaN
2	+Inf	+Inf
3	-Inf	0
4	0	1
5	> 709	+Inf (Overflow)
6	< -744	0 (Underflow)

exp10

float64_t fp64_exp10 (float64_t x);

The fp64_exp10() function returns the value of 10 raised to the power of x, i.e. 10^x.

fdim

float64_t fp64_fdim (float64_t A, float64_t B);

The fp64_fdim() function returns max(A-B,0). If A or B or both are NaN, NaN is returned.

floor

float64_t fp64_floor( float64_t A );

Rounds A downwards to the nearest integer less than or equal to A. For A > 0 this is identical to fp64_trunc(A).

Examples:
fp64_floor(1.9) --> 1.0
fp64_floor(-1.9) --> -2.0

The following rules apply:

case	A	fp64_floor(A)
0	A > 0	fp64_trunc(A)
1	NaN	NaN
2	Inf	Inf
3	+0	+0
4	A <= -2^52	A
5	-1 <= A < 0	-1.0
6	-2^52 <= A < -1	fp64_floor(A)

fma

float64_t fp64_fma (float64_t A, float64_t B, float64_t C);

The fp64_fma() function performs a floating-point multiply-add, i.e. the operation (A * B) + C. The current implementation is space optimized, so no effort is made to work on the internal intermediate result. The only advantage for the caller is to save space for the call sequence.

fmax

float64_t fp64_fmax (float64_t A, float64_t B);

The fp64_fmax() function returns the greater of the two values A and B.
The following rules apply:

case	A	B	fp64_fmax(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3	< 0	>= 0	B
4	>= 0	< 0	A
5	A >= B	any	A
6	A < B	any	B

fmin

float64_t fp64_fmin(float64_t A, float64_t B);

The fp64_fmin() function returns the lesser of the two values A and B.
The following rules apply:

case	A	B	fp64_fmin(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3	< 0	>= 0	A
4	>= 0	< 0	B
5	A >= B	any	B
6	A < B	any	A

fmod ( % )

float64_t fp64_fmod (float64_t x, float64_t y);

The fp64_fmod() function computes the remainder of dividing x by y. The
return value is x - n*y, where n is the quotient of x/y, rounded
towards zero to an integer. The following rules apply:

case	x	y	fp64_fmod(x,y)
1	NaN	any	NaN
2	any	NaN	NaN
3	Inf	0	NaN
4	Inf	Inf	NaN
5	Inf	!= 0	NaN
6	0	Inf	0
7	!= 0	Inf	x
8	0	0	NaN
9	0	!= 0	0
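
The definition of the remainder can be written down literally. The following is a minimal sketch on the host's double type, not the fp64lib implementation; the name fmod_sketch and the optional quotient pointer of type long long are assumptions made for illustration. A literal x - n*y loses precision when |x/y| is large, so the sketch is only meant for moderate quotients and does not handle the special cases of the table.

```c
/* Hypothetical stand-in on the host's double, not fp64lib code.
   It applies the definition literally and is only meant for
   moderate quotients and finite, non-zero y. */
static double fmod_sketch(double x, double y, long long *np)
{
    long long n = (long long)(x / y);   /* quotient rounded towards zero */
    if (np) *np = n;                    /* optionally report the quotient */
    return x - (double)n * y;
}
```

The result carries the sign of x, because the quotient is rounded towards zero rather than downwards.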

fmodn

float64_t fp64_fmodn( float64_t x, float64_t y, unsigned long *np );

Like fp64_fmod, the fp64_fmodn() function computes the remainder of dividing x by y. The return value is x - n*y, where n is the quotient of x/y, rounded towards zero to an integer. If a pointer np != NULL is passed, the value of n will be stored in *np. See fp64_fmod for further rules.

Available with release 1.08 onwards.

fp64_fmodx_pi2

float64_t fp64_fmodx_pi2( float64_t x, unsigned long *np );

Computes the remainder of dividing x by π/2 with extended 96-bit internal precision to avoid loss of significant digits for x >> π. The return value is x - n*(π/2), where n is the quotient of x/(π/2), rounded towards zero to an integer. If a pointer np != NULL is passed, the value of n will be stored in *np. See fp64_fmod for further rules.

Available with release 1.08 onwards.

frexp

float64_t fp64_frexp (float64_t A, int *pexp);

The fp64_frexp() function splits the number A into a normalized
fraction and an exponent, which is stored in *pexp.

Return:
If A is a normal floating point number, the fp64_frexp() function returns the
value v, such that v has a magnitude in the interval [1/2, 1) or zero,
and A equals v times 2 raised to the power *pexp. If A is zero, both
parts of the result are zero. If A is not a finite number, fp64_frexp()
returns A as is and stores 0 in *pexp.

Note:
This implementation permits a null pointer as a directive to skip
storing the exponent.

ftoa

char *fp64_ftoa( float64_t x, uint8_t maxDigits, uint8_t expSep, int16_t *exp10 )

Converts a number to a string with maxDigits of significand. For details, see fp64_to_decimalExp.

hypot

float64_t fp64_hypot(float64_t x, float64_t y);

The fp64_hypot() function returns sqrt(x*x + y*y). This is the length
of the hypotenuse of a right triangle with sides of length x and y,
or the distance of the point (x, y) from the origin.

ilogb

int fp64_ilogb(float64_t x)

Returns the integral part of the logarithm of |x| to base 2. This is the exponent used internally to express the floating-point value x, when it uses a significand between 1.0 and 2, so that, for a positive x

x = significand * 2^exponent

Generally, the value returned by this function is one less than the exponent obtained with frexp (because of the different significand normalization as [1.0,2.0) instead of [0.5,1.0)).

For easier testing of the result, the FP_ILOGBxxx constants shown below are defined in fp64lib.h

The following rules apply:

case	x	fp64_ilogb(x)
1	NaN	FP_ILOGBNAN
2	+/-INF	INT_MAX
3	+/-0.0	FP_ILOGB0
4	else	exponent, with |x| = significand * 2^exponent
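
The relation to the frexp exponent can be checked directly on the bit pattern. The following is a minimal sketch on the host's double type, assuming it is an IEEE 754 binary64 value; it is not fp64lib code, the helper names are hypothetical, and it is only valid for normal numbers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper on the host's double, not fp64lib code: read the
   unbiased exponent straight from the IEEE 754 bit pattern. Only valid
   for normal numbers (not 0, subnormal, Inf or NaN). */
static int ilogb_normal(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);             /* reinterpret the 64 bits */
    return (int)((bits >> 52) & 0x7ff) - 1023;  /* remove the exponent bias */
}

/* The frexp-style exponent belongs to a significand in [0.5, 1),
   so it is always one larger. */
static int frexp_exponent_normal(double x)
{
    return ilogb_normal(x) + 1;
}
```

For x = 8.0 the first function yields 3 (8 = 1.0 * 2^3) and the second yields 4 (8 = 0.5 * 2^4).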

inverse

float64_t fp64_inverse (float64_t A);

The fp64_inverse() function returns the value 1/A.

isfinite

int fp64_isfinite (float64_t x);

The fp64_isfinite() function returns a nonzero value if x is finite: not
plus or minus infinity, and not NaN.

isinf

int fp64_isinf (float64_t x);

The fp64_isinf() function returns -1 if x represents negative
infinity, 1 if x represents positive infinity, and 0 otherwise.

isnan

int fp64_isnan( float64_t x );

The fp64_isnan() function returns a non-zero value if x is
“not-a-number” (NaN), and 0 otherwise.

ldexp

float64_t fp64_ldexp (float64_t x, int exp);

The fp64_ldexp() function returns the result of multiplying the
floating-point number x by 2 raised to the power exp.
The following rules apply:

case	x	fp64_ldexp(x,exp)
1	NaN	NaN
2	+/- Inf	+/- Inf
3	+/- 0	+/- 0
4	exponent(x)+exp>1023	+/- Inf (Overflow)
5	exponent(x)+exp<-1074	+/- 0.0 (Underflow)
6	exponent(x)+exp<-1023	x*2^exp as subnormal number
7	else	x*2^exp

log

float64_t fp64_log( float64_t x );

Returns the natural logarithm ln of x. The following rules apply:

case	x	fp64_log(x)
1	< 0	NaN (< 0 includes -Inf)
2	NaN	NaN
3	+Inf	+Inf
4	0	-Inf
5	1	0

logb

float64_t fp64_logb( float64_t x );

Returns the base 2 logarithm of |x|: logb(x) = log2(|x|) = log(|x|) / log(2)

Available with release 1.1.20 onwards.

log10

float64_t fp64_log10( float64_t A );

The fp64_log10() function returns the base 10 logarithm of A.
log10(A) = log(A) / log(10).

log2

float64_t fp64_log2( float64_t A );

The fp64_log2() function returns the base 2 logarithm of A.
log2(A) = log(A) / log(2).

lrint

long fp64_lrint (float64_t A);

The fp64_lrint() function rounds A to the nearest integer, rounding
halfway cases to the nearest even integer (that is, both 1.5 and
2.5 are rounded to 2). This function is similar to the rint()
function, but it differs in the type of the return value and in that an
overflow is possible.

Return:
The rounded long integer value. If A is infinite, NaN or an overflow
occurred, this implementation returns the LONG_MIN value (0x80000000).

Examples:
fp64_lrint(1.25) --> 1L
fp64_lrint(1.5)  --> 2L
fp64_lrint(2.5)  --> 2L
fp64_lrint(2.75) --> 3L
fp64_lrint(3.5)  --> 4L
fp64_lrint(-1.25) --> -1L
fp64_lrint(-1.5)  --> -2L
fp64_lrint(-2.5)  --> -2L
fp64_lrint(-2.75) --> -3L
fp64_lrint(-3.5)  --> -4L

The following rules apply:

case	A	fp64_lrint(A)
1	NaN	LONG_MIN (0x80000000)
2	+/- Inf	LONG_MIN (0x80000000)
3	+/- 0	0L
4	|A| >= 2^31	LONG_MIN (0x80000000)
5	0 < |A| <= 0.5	0L
6	0.5 < |A| < 1.0	+/- 1L
7	1.0 <= |A| < 2^31	trunc(A) if (|A|-trunc(|A|)) < 0.5; trunc(A+sign(A)) if (|A|-trunc(|A|)) > 0.5; the even one of these two if (|A|-trunc(|A|)) = 0.5

lround

long fp64_lround (float64_t x);

The fp64_lround() function rounds x to the nearest integer, but rounds
halfway cases away from zero (instead of to the nearest even integer).
This function is similar to the round() function, but it differs in
the type of the return value and in that an overflow is possible.

Return:
The rounded long integer value. If x is infinite, NaN or an overflow
occurred, this implementation returns the LONG_MIN value (0x80000000).

Examples:
fp64_lround(1.1) --> 1L
fp64_lround(1.5) --> 2L
fp64_lround(-1.1) --> -1L
fp64_lround(-1.5) --> -2L

The following rules apply:

case	x	fp64_lround(x)
1	NaN	LONG_MIN (0x80000000)
2	+/- Inf	LONG_MIN (0x80000000)
3	+/- 0	0L
4	|x| >= 2^31	LONG_MIN (0x80000000)
5	0 < |x| < 0.5	0L
6	0.5 <= |x| < 1.0	+/- 1L
7	1.0 <= |x| < 2^31	trunc(x) if (|x|-trunc(|x|)) < 0.5; trunc(x+sign(x)) if (|x|-trunc(|x|)) >= 0.5
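
The only difference between fp64_lrint() and fp64_lround() is the handling of exact halfway cases. The following is a minimal sketch of both tie rules on the host's double type; it is not the fp64lib implementation, all names are hypothetical, and int32_t is used to model a 32-bit long.

```c
#include <stdint.h>

#define ROUND_ERROR INT32_MIN   /* 0x80000000, the error value in the tables */

/* Hypothetical stand-in on the host's double, not fp64lib code.
   half_even selects the tie rule: 1 = to even, 0 = away from zero. */
static int32_t round_sketch(double a, int half_even)
{
    if (a != a || a >= 2147483648.0 || a <= -2147483648.0)
        return ROUND_ERROR;                 /* NaN, Inf or |a| >= 2^31 */

    int64_t t = (int64_t)a;                 /* truncate towards 0 */
    double frac = a - (double)t;
    if (frac < 0) frac = -frac;
    int64_t step = (a < 0) ? -1 : 1;
    int64_t r;

    if (frac > 0.5)      r = t + step;
    else if (frac < 0.5) r = t;
    else if (half_even)  r = (t % 2 != 0) ? t + step : t;  /* tie: pick even */
    else                 r = t + step;                     /* tie: away from 0 */

    if (r > INT32_MAX || r < INT32_MIN) return ROUND_ERROR;
    return (int32_t)r;
}

static int32_t lrint_sketch(double a)  { return round_sketch(a, 1); }
static int32_t lround_sketch(double a) { return round_sketch(a, 0); }
```

With these definitions, 2.5 becomes 2 under the lrint rule and 3 under the lround rule, while 1.5 and 3.5 give the same result under both.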

modf

float64_t fp64_modf (float64_t x, float64_t *iptr);

The fp64_modf() function breaks the argument x into an integral part and a fractional part, each of which has the same sign as x. The integral part
is stored in *iptr. If iptr is a null pointer, this implementation skips the write.

Examples:
fp64_modf(123.45, iptr) returns 0.45 and stores 123.0 in *iptr

The following rules apply:

case	x	fp64_modf(x)	*iptr
1	NaN	NaN	NaN
2	Inf	Inf	0.0
3	0.0	0.0	0.0
4	|x| > 2^53	0.0	x
5	|x| < 1.0	x	0.0

mul (*)

float64_t fp64_mul( float64_t A, float64_t B );

Multiplies two 64-bit floating point numbers and rounds the result to a 64-bit floating point number.

The following rules apply:

case	A	B	fp64_mul(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3	Inf	0.0	NaN
4	0.0	Inf	NaN
5	Inf	!= 0	Inf
6	!= 0	Inf	Inf

neg ( -x )

float64_t fp64_neg ( float64_t A );

Returns -A. Also works for special cases (NaN, +Inf, -Inf), subnormal numbers and +0 and -0.

pow

float64_t fp64_pow(float64_t x, float64_t y);

The fp64_pow() function returns the value of x raised to the power of y. The following rules apply:

case	x	y	fp64_pow(x,y)
1	any	0	1.0 (includes x = NaN or Inf or 0)
2	+1.0	any	1.0 (includes y = NaN or Inf)
3	any	+1.0	x (includes x = NaN or Inf)
4a	x < 0	integer > 0	x^y
4b	x < 0	any	NaN, if y is not a positive integer
5	x > 0	any	x^y

pow10

float64_t fp64_pow10 (float64_t x);

The fp64_pow10() function returns the value of 10 raised to the power of x, i.e. 10^x. It is an alias to fp64_exp10(x).

round

float64_t fp64_round( float64_t x );

Rounds x to the nearest integer, but rounds halfway cases away from 0.

Examples: 
fp64_round(1.1) --> 1.0
fp64_round(1.5) --> 2.0
fp64_round(-1.1) --> -1.0
fp64_round(-1.5) --> -2.0

The following rules apply:

case	x	fp64_round(x)
1	NaN	NaN
2	+/- Inf	+/- Inf
3	+/- 0	+/- 0
4	|x| >= 2^52	x
5	0 < |x| < 0.5	0.0
6	0.5 <= |x| < 1	+/- 1.0
7	1 <= |x| <= 2^52	trunc(x) if (|x|-trunc(|x|)) < 0.5; trunc(x+sign(x)) if (|x|-trunc(|x|)) >= 0.5

scalbln

float64_t fp64_scalbln (float64_t x, long n);

fp64_scalbln() behaves identically to fp64_ldexp() and fp64_scalbn(),
but can be called with n being of type long instead of int.

Available with release 1.1.20 onwards.

scalbn

float64_t fp64_scalbn (float64_t x, int n);

fp64_scalbn is an alias to fp64_ldexp(), see there. It returns x * 2ⁿ.

Available with release 1.1.20 onwards.

signbit

int fp64_signbit (float64_t x);

The fp64_signbit() function returns a nonzero value if the value of x has
its sign bit set. This is not the same as x < 0.0, because IEEE 754
floating point allows zero to be signed. The comparison -0.0 < 0.0
is false, but fp64_signbit(-0.0) will return a nonzero value.

This implementation returns 1 if the sign bit is set. This corresponds
to the built-in GCC implementation for float.
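
The difference between the sign bit and a comparison with zero can be sketched on the host's double type, assuming it is an IEEE 754 binary64 value. The helper name signbit_sketch is hypothetical; this is not fp64lib code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in on the host's double, not fp64lib code:
   return 1 if the IEEE 754 sign bit (bit 63) is set, else 0. */
static int signbit_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the 64 bits */
    return (int)(bits >> 63);
}
```

For -0.0 the sketch returns 1 although the comparison -0.0 < 0.0 is false.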

sin

float64_t fp64_sin( float64_t phi );

Returns the sine of phi. The following special cases apply:

case	phi	fp64_sin(phi)
1	NaN	NaN
2	+INF	NaN
3	-INF	NaN

See also the article “fp64_sin(float64_NUMBER_PI) is not 0”.

sinh

float64_t fp64_sinh( float64_t x);

The fp64_sinh() function returns the hyperbolic sine of x, which is
defined mathematically as (exp(x) - exp(-x)) / 2.

sqrt

float64_t fp64_sqrt (float64_t);

Square root function. The following special cases apply:

sqrt(NaN) = NaN
sqrt(+Inf) = +Inf
sqrt(-Inf) = NaN
sqrt(+-0) = +-0

Be especially aware of the last special case: fp64_sqrt(-0) = -0. This is fully compliant with the IEEE 754 standard, see Wikipedia, but might cause trouble in your code if you are relying on the result always being positive.

square

float64_t fp64_square (float64_t A);

The fp64_square() function returns square of A = A².

strtod

float64_t fp64_strtod( char *str, char **endptr );

Converts a string to a number; also handles NaN, +INF and -INF.
A valid number (neither NaN nor INF) can be expressed by the
following regex: [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?

Returns the identified number as result, and *endptr contains a pointer to the last parsed position in the string str.

sub ( - )

float64_t fp64_sub( float64_t A, float64_t B );

Subtracts two 64-bit floating point numbers and rounds the result to a 64-bit floating point number. The following special cases apply:

case	A	B	fp64_sub(A,B)
1	NaN	any	NaN
2	any	NaN	NaN
3a	+Inf	-Inf	+Inf
3b	+Inf	+Inf	NaN
3c	-Inf	-Inf	NaN
3d	-Inf	+Inf	-Inf
4	Inf	any	Inf
5	any	Inf	-Inf
6	0	any	-any
7	any	0	any

tan

float64_t fp64_tan( float64_t phi );

Returns the tangent of phi. The following special cases apply:

tan(NaN) = NaN
tan(+Inf) = NaN
tan(-Inf) = NaN

tanh

float64_t fp64_tanh (float64_t x);

The fp64_tanh() function returns the hyperbolic tangent of x, which is
defined mathematically as sinh(x) / cosh(x).

fp64_to_decimalExp

char *fp64_to_decimalExp( float64_t x, uint8_t maxDigits, uint8_t expSep, int16_t *exp10 )

Converts a number to a string with maxDigits of significand.

NOTE: this function supports conversion of up to 17 digits. However, accuracy of IEEE754 is only 52 bit which is 15-16 decimal digits!

(char)

char fp64_to_int8( float64_t A );

The fp64_to_int8() function converts A to the signed integer value,
rounding towards 0, the fractional part is lost. No saturation.
Only when called in assembly language: Besides a normal 8-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_int8 (A)
1	NaN	1	0x00
2	Inf	1	0x00
3	|A| >= 2^7	1	0x00
4	0 < |A| < 1.0	0	0x00
5	|A| < 2^7	0	(char) A
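
The rules of the table can be sketched on the host's double type. This is not the fp64lib implementation: the function name to_int8_sketch is hypothetical, and the carry flag of the assembly interface is modelled by an extra int pointer, which is an assumption made only for this illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in on the host's double, not fp64lib code. The
   carry flag of the assembly interface is modelled by *carry. */
static int8_t to_int8_sketch(double a, int *carry)
{
    if (a != a || a >= 128.0 || a <= -128.0) {  /* NaN, Inf, |A| >= 2^7 */
        *carry = 1;
        return 0;                               /* no saturation */
    }
    *carry = 0;
    return (int8_t)a;                           /* truncates towards 0 */
}
```

Out-of-range input yields 0 rather than the nearest representable value, so the error flag is the only way to tell it apart from a genuine 0.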

(float)

float fp64_ds( float64_t A );

Converts a float64 to the float32 representing the same number. The significand is rounded up if bit 25 of the significand of A is set.
As the range of float is significantly smaller than that of float64_t,
the following rules apply:

case	A	fp64_ds(A) as hex	fp64_ds(A)
1	NaN	0x7fffffff	NaN in float
2	+Inf	0x7f800000	+Inf in float
3	-Inf	0xff800000	-Inf in float
4	0.0	0x00000000	0.0
5	-0.0	0x80000000	-0.0
6	A >= 2^128	0x7f800000	+Inf in float
7	A <= -2^128	0xff800000	-Inf in float
8	0 < A < 2^-149	0x00000000	0.0 as float, underflow
9	-2^-149 < A < 0	0x80000000	-0.0 as float, underflow
10	2^-149 < A < 2^-126	0x00mmmmmm	subnormal number with mmmmmm < 0x800000
11	-2^-126 < A < -2^-149	0x80mmmmmm	subnormal number with mmmmmm < 0x800000
12	2^-126 < |A| < 2^127		(float) A

(int)

int fp64_to_int16( float64_t A );

The fp64_to_int16() function converts A to the signed integer value,
rounding towards 0, the fractional part is lost. No saturation.
Only when called in assembly language: Besides a normal 16-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_int16 (A)
1	NaN	1	0x0000
2	Inf	1	0x0000
3	|A| >= 2^15	1	0x0000
4	0 < |A| < 1.0	0	0x0000
5	|A| < 2^15	0	(int) A

(long)

long fp64_float64_to_long( float64_t A );

Alias to fp64_to_int32, see there.

(long int)

long fp64_to_int32( float64_t A );

The fp64_to_int32() function converts A to the signed integer value,
rounding towards 0, the fractional part is lost. No saturation.
Only when called in assembly language: Besides a normal 32-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_int32 (A)
1	NaN	1	0x00000000
2	Inf	1	0x00000000
3	|A| >= 2^31	1	0x00000000
4	0 < |A| < 1.0	0	0x00000000
5	|A| < 2^31	0	(long) A

(long long int)

long long fp64_to_int64( float64_t A );

The fp64_to_int64() function converts A to the signed integer value,
rounding towards 0, the fractional part is lost. No saturation.
Only when called in assembly language: Besides a normal 64-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_int64 (A)
1	NaN	1	0x0000000000000000
2	Inf	1	0x0000000000000000
3	|A| >= 2^63	1	0x0000000000000000
4	0 < |A| < 1.0	0	0x0000000000000000
5	|A| < 2^63	0	(long long) A

(unsigned char)

unsigned char fp64_to_uint8( float64_t A );

The fp64_to_uint8() function converts A to the unsigned integer value,
rounding towards 0, the fractional part is lost. No saturation.
Negative input is permissible (like GCC/x86).
Only when called in assembly language: Besides a normal 8-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_uint8 (A)
1	NaN	1	0x00
2	Inf	1	0x00
3	|A| >= 2^8	1	0x00
4	0 < |A| < 1.0	0	0x00
5	0 < A < 2^8	0	(unsigned char) A
6	-2^8 < A < 0	0	-((unsigned char) (-A))
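
Case 6 means that negative input wraps around modulo 2^8. The following is a minimal sketch of that rule on the host's double type; it is not the fp64lib implementation, the name to_uint8_sketch is hypothetical, and the int pointer that models the carry flag is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in on the host's double, not fp64lib code. The
   carry flag of the assembly interface is modelled by *carry. */
static uint8_t to_uint8_sketch(double a, int *carry)
{
    if (a != a || a >= 256.0 || a <= -256.0) {  /* NaN, Inf, |A| >= 2^8 */
        *carry = 1;
        return 0;
    }
    *carry = 0;
    if (a < 0) {
        unsigned magnitude = (uint8_t)(-a);     /* truncate |A| towards 0 */
        return (uint8_t)(0u - magnitude);       /* negate, wraps modulo 2^8 */
    }
    return (uint8_t)a;
}
```

An input of -1.0 therefore yields 0xFF, the same bit pattern that a signed 8-bit -1 would have.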

(unsigned int)

unsigned int fp64_to_uint16( float64_t A );

The fp64_to_uint16() function converts A to the unsigned integer value,
rounding towards 0, the fractional part is lost. No saturation.
Negative input is permissible (like GCC/x86).
Only when called in assembly language: Besides a normal 16-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_uint16 (A)
1	NaN	1	0x0000
2	Inf	1	0x0000
3	|A| >= 2^16	1	0x0000
4	0 < |A| < 1.0	0	0x0000
5	0 < A < 2^16	0	(unsigned int) A
6	-2^16 < A < 0	0	-((unsigned int) (-A))

(unsigned long)

unsigned long fp64_to_uint32( float64_t A );

The fp64_to_uint32() function converts A to the unsigned integer value,
rounding towards 0, the fractional part is lost. No saturation.
Negative input is permissible (like GCC/x86).
Only when called in assembly language: Besides a normal 32-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_uint32 (A)
1	NaN	1	0x00000000
2	Inf	1	0x00000000
3	|A| >= 2^32	1	0x00000000
4	0 < |A| < 1.0	0	0x00000000
5	0 < A < 2^32	0	(unsigned long) A
6	-2^32 < A < 0	0	-((unsigned long) (-A))

(unsigned long long)

unsigned long long fp64_to_uint64( float64_t A );

The fp64_to_uint64() function converts A to the unsigned integer value,
rounding towards 0, the fractional part is lost. No saturation.
Negative input is permissible (like GCC/x86).
Only when called in assembly language: Besides a normal 64-bit value, the carry is returned as extra error flag.

The following rules apply:

case	A	Carry	fp64_to_uint64 (A)
1	NaN	1	0x0000000000000000
2	Inf	1	0x0000000000000000
3	|A| >= 2^64	1	0x0000000000000000
4	0 < |A| < 1.0	0	0x0000000000000000
5	0 < A < 2^64	0	(unsigned long long) A
6	-2^64 < A < 0	0	-((unsigned long long) (-A))

fp64_to_string

char *fp64_to_string(float64_t x, uint8_t max_chars, uint8_t max_zeroes);

Converts the float64 to the decimal representation of the number x. fp64_to_string adjusts the number of decimal digits so that the result
will fit into a string with max_chars characters. However, a longer string
will be returned if the minimum representation will not fit into max_chars.
Minimum representation is “mESn[nn]” for x > 0, else “-mESn[nn]”, so at least 4 to 7 characters are needed.

parameters:
x: number to convert in float64_t format
max_chars: maximum space for result
max_zeroes: use “s0.mmmmmm” when result has less than this # of 0s

The following rules apply:

case	x	fp64_to_string(x)
1	NaN	“NaN”
2	+Inf	“+INF”
3	-Inf	“-INF”
4	|x| < 1	“s0.mmmmmm” representation without exponent, if x can be displayed with less than max_zeroes “0” after “0.”; leading sign s only for x < 0; else exponential form “sm.mmmmmESnnn” is used
5	log10(|x|) < max_chars	“smmm.mmm” representation without exponent
6	all other cases	“sm.mmmmmESn” exponential form is used; leading sign s only for x < 0; exponent “Snnn” always has a sign S (“+” or “-”) and one to three digits nnn for the exponent

NOTE: this function supports conversion of up to 17 digits. However, accuracy of IEEE754 is only 52 bit, which is 15-16 decimal digits! Therefore, in no case more than 17 digits of precision are generated.

Available with release 1.08 onwards.

(float64_t) from float

float64_t fp64_sd (float A);

Converts a float32 to the float64 representing the same number. As float64_t fully includes all values of float, no error or truncation occurs.

(float64_t) from int

float64_t fp64_int16_to_float64( int16_t x );

Convert a signed 16-bit integer to float64_t. Overflow cannot occur.

(float64_t) from long

float64_t fp64_long_to_float64( long A );

Alias to fp64_int32_to_float64, see there.

(float64_t) from long int

float64_t fp64_int32_to_float64 ( long x );

Convert a signed 32-bit integer (long) to float64_t.
No overflow will occur, as a 32-bit long will always fit into the 53-bit significand.

(float64_t) from long long

Same as for long long int, see below.

(float64_t) from long long int

float64_t fp64_int64_to_float64( long long x );

Convert a signed 64-bit integer (long long) to float64_t.
Overflow cannot occur, but precision is lost if abs(x) > 2^53.

(float64_t) from unsigned int

float64_t fp64_uint16_to_float64( uint16_t x );

Convert an unsigned 16-bit integer to float64_t. No overflow or loss of precision can occur.

(float64_t) from unsigned long

float64_t fp64_uint32_to_float64( unsigned long x );

Convert an unsigned 32-bit integer to float64_t. No overflow or loss of precision can occur.

(float64_t) from unsigned long long

float64_t fp64_uint64_to_float64( unsigned long long x );

Convert an unsigned 64-bit integer (unsigned long long) to float64_t.
Overflow cannot occur, but precision is lost if x > 2^53.

trunc

float64_t fp64_trunc( float64_t A );

Rounds A to the nearest integer not larger in absolute value,
by cutting the noninteger part ( = setting the fractional part to 0).
This is effectively the same as rounding A towards 0.

Examples: 
fp64_trunc(1.9) --> 1.0
fp64_trunc(-1.9) --> -1.0

The following rules apply:

case	A	fp64_trunc(A)
1	NaN	NaN
2	+Inf	+Inf
3	-Inf	-Inf
4	0.0	0.0
5	-0.0	-0.0
6	|A| >= 2^52	A
7	0 < |A| < 1	0.0
8	1 <= |A| <= 2^52	A - fractional_part(A)
