### float64_t fp64_abs(float64_t x);

Returns the absolute value of x, removing the sign. This works even if x is NaN, Inf or zero.

### float64_t fp64_acos(float64_t x );

Returns the arc cosine of x, the inverse function of fp64_cos(). The following special cases apply:

acos(NaN) = NaN

acos(x) = NaN for any x > +1, including +Inf

acos(x) = NaN for any x < -1, including -Inf

### float64_t fp64_add( float64_t A, float64_t B );

Adds two 64-bit floating point numbers and rounds the result to a 64-bit floating point number. The following special cases apply:

| case | A | B | fp64_add(A,B) |
|------|---|---|---------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3a | +Inf | -Inf | NaN |
| 3b | +Inf | +Inf | +Inf |
| 3c | -Inf | -Inf | -Inf |
| 3d | -Inf | +Inf | NaN |
| 4 | Inf | any | Inf |
| 5 | any | Inf | Inf |
| 6 | 0 | any | any |
| 7 | any | 0 | any |

### float64_t fp64_asin(float64_t x );

Returns the arc sine of x, the inverse function of fp64_sin(). The following special cases apply:

asin(NaN) = NaN

asin(x) = NaN for any x > +1, including +Inf

asin(x) = NaN for any x < -1, including -Inf

### float64_t fp64_atan( float64_t x);

Returns the arc tangent of x, the inverse function of fp64_tan(). The following special cases apply:

atan(NaN) = NaN

atan(+Inf) = +PI/2

atan(-Inf) = -PI/2

### float64_t fp64_atan2 (float64_t A, float64_t B);

The fp64_atan2() function calculates the arc tangent of the two variables A and B, where A can be interpreted as the y coordinate and B as the x coordinate. It is similar to calculating the arc tangent of A/B, except that the signs of both arguments are used to determine the quadrant of the result. The fp64_atan2() function returns the result in radians, which is between -PI and PI (inclusive):

| Sector | x | y | \|y\| <= \|x\| | result range |
|--------|---|---|----------------|--------------|
| I | >= 0 | >= 0 | yes | 0..PI/4 |
| II | >= 0 | >= 0 | no | PI/4..PI/2 |
| III | < 0 | >= 0 | no | PI/2..3*PI/4 |
| IV | < 0 | >= 0 | yes | 3*PI/4..PI |
| V | < 0 | < 0 | yes | -PI..-3*PI/4 |
| VI | < 0 | < 0 | no | -3*PI/4..-PI/2 |
| VII | >= 0 | < 0 | no | -PI/2..-PI/4 |
| VIII | >= 0 | < 0 | yes | -PI/4..0 |

Note:

Unlike the 32-bit floating point library, this implementation respects all four cases of 0/0, like x86 (GCC/Glibc): +0/+0, +0/-0, -0/+0 and -0/-0.

The following special cases apply:

| case | A = y | B = x | fp64_atan2(A,B) |
|------|-------|-------|-----------------|
| 1a | NaN | any | NaN |
| 1b | any | NaN | NaN |
| 2a | +0 | >= 0 | +0 |
| 2b | +0 | < 0 | +PI |
| 2c | -0 | >= 0 | -0 |
| 2d | -0 | < 0 | -PI |
| 3a | +finite | +Inf | +0 |
| 3b | +finite | -Inf | +PI |
| 3c | -finite | +Inf | -0 |
| 3d | -finite | -Inf | -PI |
| 3e | +/- Inf | +/- Inf | NaN |
| 3f | +/- Inf | any | +/- PI/2 |
| 4 | \|y\| <= \|x\| | >= 0 | see sector I & VIII: atan(y/x)+0 |
| 5 | \|y\| <= \|x\| | < 0 | see sector IV & V: atan(y/x)+/-PI |
| 6 | >= 0 | \|y\| > \|x\| | see sector II & III: -atan(x/y)+PI/2 |
| 7 | < 0 | \|y\| > \|x\| | see sector VI & VII: -atan(x/y)-PI/2 |
| 8a | +any | +/- 0 | +PI/2 |
| 8b | -any | +/- 0 | -PI/2 |
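The four 0/0 cases from the note can be tried on a host machine with the standard C atan2() for double, which the note names as the reference behaviour. This is only a sketch under the assumption of an IEEE 754 host: it does not call fp64lib, and is_zero_with_sign, is_close and check_atan2_zero_cases are hypothetical helper names.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: returns 1 if r is a zero with the given sign bit. */
static int is_zero_with_sign(double r, int negative)
{
    return r == 0.0 && (signbit(r) != 0) == (negative != 0);
}

/* Hypothetical helper: returns 1 if r is within 1e-12 of expected. */
static int is_close(double r, double expected)
{
    return fabs(r - expected) < 1e-12;
}

/* Checks the four 0/0 cases of the standard C atan2() on the host. */
static void check_atan2_zero_cases(void)
{
    const double pi = acos(-1.0);

    assert(is_zero_with_sign(atan2(+0.0, +0.0), 0)); /* +0/+0 -> +0  */
    assert(is_close(atan2(+0.0, -0.0), pi));         /* +0/-0 -> +PI */
    assert(is_zero_with_sign(atan2(-0.0, +0.0), 1)); /* -0/+0 -> -0  */
    assert(is_close(atan2(-0.0, -0.0), -pi));        /* -0/-0 -> -PI */
}
```

Only the sign bit distinguishes the inputs here, so a comparison such as y == 0.0 cannot be used to predict which of the four results is returned.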

### float64_t fp64_atof( char *str);

Converts a string to a number; also handles NaN, +INF and -INF.

A valid number (neither NaN nor INF) can be expressed by the following regex: [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?
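The regex can be turned into a small hand-written matcher, for example to validate input before passing it on for conversion. The sketch below is not part of fp64lib: is_plain_number and skip_digits are hypothetical helpers, and the matcher is assumed to be anchored to the whole string, so it rejects the NaN and INF spellings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts the decimal digits at the start of s. */
static size_t skip_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Hypothetical helper: returns 1 if the whole string s matches
 * [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)? and 0 otherwise. */
static int is_plain_number(const char *s)
{
    size_t n;

    if (*s == '+' || *s == '-')          /* [+-]?  */
        s++;
    n = skip_digits(s);                  /* [0-9]+ */
    if (n == 0)
        return 0;
    s += n;
    if (*s == '.')                       /* \.?    */
        s++;
    s += skip_digits(s);                 /* [0-9]* */
    if (*s == 'e' || *s == 'E') {        /* ([eE][+-]?[0-9]+)? */
        s++;
        if (*s == '+' || *s == '-')
            s++;
        n = skip_digits(s);
        if (n == 0)
            return 0;
        s += n;
    }
    return *s == '\0';                   /* nothing may follow */
}

/* Usage example: a few accepted and rejected inputs. */
static void check_is_plain_number(void)
{
    assert(is_plain_number("-12.5e+3"));
    assert(is_plain_number("3."));
    assert(!is_plain_number(".5"));   /* a digit is required before the point */
    assert(!is_plain_number("1e"));   /* the exponent needs at least one digit */
    assert(!is_plain_number("NaN"));  /* special values are not covered */
}
```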

### float64_t fp64_cbrt(float64_t x);

The fp64_cbrt() function returns the cube root of x. It works for positive and negative values of x.

### float64_t fp64_ceil( float64_t A );

Rounds A upwards to the nearest integer greater than or equal to A.

For A < 0 this is identical to trunc(A). The following rules apply:

| case | A | fp64_ceil(A) |
|------|---|--------------|
| 0 | A < 0 | fp64_trunc(A) |
| 1 | NaN | NaN |
| 2 | +Inf | +Inf |
| 3 | +0 | +0 |
| 4 | A >= 2^52 | A |
| 5 | 0 < A < 1 | 1.0 |
| 6 | 1 <= A < 2^52 | fp64_ceil(A) |

### int fp64_classify( float64_t x );

Categorizes the floating point value x into the following categories: zero, subnormal, normal, infinite, and NaN.

The following cases apply:

| case | x | fp64_classify(x) |
|------|---|------------------|
| 1 | NaN | 1 (FP_NAN) |
| 2 | +/- Inf | 2 (FP_INFINITE) |
| 3 | +/- 0 | 0 (FP_ZERO) |
| 4 | < 2^-1023 | 3 (FP_SUBNORMAL) |
| 5 | any | 4 (FP_NORMAL) |

### float64_t fp64_compare (float64_t A, float64_t B);

Compares the two values A and B and returns -1/0/1, based on the following rules:

| case | A | B | fp64_compare(A,B) |
|------|---|---|-------------------|
| 1 | NaN | any | +1 |
| 2 | any | NaN | +1 |
| 3 | A < B | any | -1 |
| 4 | A == B | any | 0 |
| 5 | A > B | any | 1 |
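The rules in the table can be mirrored on a host with plain double arithmetic. The following is a minimal sketch, assuming IEEE 754 doubles; host_compare is a hypothetical stand-in and not the fp64lib routine.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical host-side model of the table: NaN -> +1, else -1/0/+1. */
static int host_compare(double a, double b)
{
    if (isnan(a) || isnan(b))
        return 1;              /* cases 1 and 2 */
    if (a < b)
        return -1;             /* case 3 */
    if (a > b)
        return 1;              /* case 5 */
    return 0;                  /* case 4 */
}

/* Usage example covering each row of the table. */
static void check_host_compare(void)
{
    assert(host_compare(1.0, 2.0) == -1);
    assert(host_compare(2.0, 2.0) == 0);
    assert(host_compare(3.0, 2.0) == 1);
    assert(host_compare(NAN, 2.0) == 1);  /* same result as A > B */
    assert(host_compare(2.0, NAN) == 1);
}
```

Because a NaN operand and A > B both yield +1, a caller that needs to tell the two apart would have to test the operands with fp64_isnan() first.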

### float64_t fp64_copysign (float64_t x, float64_t y);

The fp64_copysign() function returns x but with the sign of y. This works even if x or y are NaN, Inf or zero.

### float64_t fp64_cos( float64_t phi );

Returns the cosine of phi. The following special cases apply:

cos(NaN) = NaN

cos(+Inf) = NaN

cos(-Inf) = NaN

### float64_t fp64_cosh( float64_t x );

The fp64_cosh() function returns the hyperbolic cosine of x, which is defined mathematically as (exp(x) + exp(-x)) / 2.

### float64_t fp64_cut_noninteger_fraction( float64_t x );

Alias to fp64_trunc, see there.

### float64_t fp64_div( float64_t A, float64_t B );

Divides two 64-bit floating point numbers and rounds the result of A/B to a 64-bit floating point number. The following special cases apply:

| case | A | B | fp64_div(A,B) |
|------|---|---|---------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3 | Inf | 0 | NaN |
| 4 | Inf | Inf | NaN |
| 5 | 0 | Inf | 0 |
| 6 | Inf | != 0 | Inf |
| 7 | != 0 | Inf | 0 |
| 8 | 0 | 0 | NaN |
| 9 | 0 | != 0 | 0 |

### float fp64_ds( float64_t A );

Converts a float64 to the float32 representing the same number. The significand is rounded up if bit 25 of the significand of A is set.

As the range of float is significantly smaller than that of float64_t, the following rules apply:

| case | A | result (hex) | result |
|------|---|--------------|--------|
| 1 | NaN | 0x7fffffff | NaN in float |
| 2 | +Inf | 0x7f800000 | +Inf in float |
| 3 | -Inf | 0xff800000 | -Inf in float |
| 4 | 0.0 | 0x00000000 | 0.0 |
| 5 | -0.0 | 0x80000000 | -0.0 |
| 6 | x >= 2^128 | 0x7f800000 | +Inf in float |
| 7 | x <= -2^128 | 0xff800000 | -Inf in float |
| 8 | 0 < x < 2^-149 | 0x00000000 | 0.0 as float, underflow |
| 9 | -2^-149 < x < 0 | 0x80000000 | -0.0 as float, underflow |
| 10 | 2^-149 < x < 2^-126 | 0x00mmmmmm | subnormal number with mmmmmm < 0x800000 |
| 11 | -2^-126 < x < -2^-149 | 0x80mmmmmm | subnormal number with mmmmmm < 0x800000 |
| 12 | 2^-126 < \|x\| < 2^127 | | (float) A |

### float64_t fp64_exp (float64_t x);

The fp64_exp() function returns the value of e (the base of natural logarithms) raised to the power of x (e^x). The following rules apply:

| case | x | fp64_exp(x) |
|------|---|-------------|
| 1 | NaN | NaN |
| 2 | +Inf | +Inf |
| 3 | -Inf | 0 |
| 4 | 0 | 1 |
| 5 | > 709 | +Inf (Overflow) |
| 6 | < -744 | 0 (Underflow) |
| 7 | any other x | e^x |

### float64_t fp64_fdim (float64_t A, float64_t B);

The fp64_fdim() function returns max(A-B,0). If A or B or both are NaN, NaN is returned.

### long fp64_float64_to_long( float64_t A );

Alias to fp64_to_int32, see there.

### float64_t fp64_floor( float64_t A );

Rounds A downwards to the nearest integer less than or equal to A. For A > 0 this is identical to trunc(A).

Examples:

fp64_floor(1.9) --> 1.0

fp64_floor(-1.9) --> -2.0

The following rules apply:

| case | A | fp64_floor(A) |
|------|---|---------------|
| 0 | A > 0 | fp64_trunc(A) |
| 1 | NaN | NaN |
| 2 | Inf | Inf |
| 3 | +0 | +0 |
| 4 | A <= -2^52 | A |
| 5 | -1 <= A < 0 | -1.0 |
| 6 | -2^52 <= A < -1 | fp64_floor(A) |

### float64_t fp64_fma (float64_t A, float64_t B, float64_t C);

The fp64_fma() function performs a floating-point multiply-add. This is the operation (A * B) + C. The current implementation is optimized for space, so it makes no effort to work on the internal intermediate result. The only advantage for the caller is a shorter call sequence.

### float64_t fp64_fmax (float64_t A, float64_t B);

The fp64_fmax() function returns the greater of the two values A and B. The following rules apply:

| case | A | B | fp64_fmax(A,B) |
|------|---|---|----------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3 | < 0 | >= 0 | B |
| 4 | >= 0 | < 0 | A |
| 5 | A >= B | any | A |
| 6 | A < B | any | B |

### float64_t fp64_fmin(float64_t A, float64_t B);

The fp64_fmin() function returns the lesser of the two values A and B. The following rules apply:

| case | A | B | fp64_fmin(A,B) |
|------|---|---|----------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3 | < 0 | >= 0 | A |
| 4 | >= 0 | < 0 | B |
| 5 | A >= B | any | B |
| 6 | A < B | any | A |

### float64_t fp64_fmod (float64_t x, float64_t y);

The fp64_fmod() function computes the remainder of dividing x by y. The return value is x - n*y, where n is the quotient of x/y, rounded towards zero to an integer. The following rules apply:

| case | x | y | fp64_fmod(x,y) |
|------|---|---|----------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3 | Inf | 0 | NaN |
| 4 | Inf | Inf | NaN |
| 5 | Inf | != 0 | NaN |
| 6 | 0 | Inf | 0 |
| 7 | != 0 | Inf | x |
| 8 | 0 | 0 | NaN |
| 9 | 0 | != 0 | 0 |

### float64_t fp64_frexp (float64_t A, int *pexp);

The fp64_frexp() function splits the number A into a normalized fraction and an exponent, which is stored in *pexp.

Return:

If A is a normal floating point number, fp64_frexp() returns the value v, such that v has a magnitude in the interval [1/2, 1) or is zero, and A equals v times 2 raised to the power *pexp. If A is zero, both parts of the result are zero. If A is not a finite number, fp64_frexp() returns A as is and stores 0 in *pexp.

Note:

This implementation permits a null pointer as a directive to skip storing the exponent.
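The standard C frexp() for double follows the same contract for normal numbers and zero, so the split can be tried on a host. This sketch does not call fp64lib, and unlike fp64_frexp() the standard function must not be given a null pointer. check_frexp is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

/* Splits a few exactly representable values and checks both parts. */
static void check_frexp(void)
{
    int e = 0;
    double v = frexp(8.0, &e);      /* 8 = 0.5 * 2^4 */

    assert(v == 0.5 && e == 4);
    assert(ldexp(v, e) == 8.0);     /* ldexp() reverses the split */

    v = frexp(-0.375, &e);          /* -0.375 = -0.75 * 2^-1 */
    assert(v == -0.75 && e == -1);

    v = frexp(0.0, &e);             /* zero: both parts are zero */
    assert(v == 0.0 && e == 0);
}
```

The returned fraction always carries the sign of the input, and its magnitude stays inside [1/2, 1) for every nonzero finite value.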

### float64_t fp64_hypot(float64_t x, float64_t y);

The fp64_hypot() function returns sqrt(x*x + y*y). This is the length of the hypotenuse of a right triangle with sides of length x and y, or the distance of the point (x, y) from the origin.

### float64_t fp64_int32_to_float64 ( long x );

Convert a signed 32-bit integer (long) to float64_t.

No overflow will occur, as 32bit long will always fit into the 53-bit significand.

### float64_t fp64_int64_to_float64( long long x );

Convert a signed 64-bit integer (long long) to float64_t.

Overflow cannot occur, but precision is lost if abs(x) > 2^53.

### float64_t fp64_inverse (float64_t A);

The fp64_inverse() function returns 1/A.

### int fp64_isfinite (float64_t x);

The fp64_isfinite() function returns a nonzero value if x is finite: not plus or minus infinity, and not NaN.

### int fp64_isinf (float64_t);

The fp64_isinf() function returns -1 if value represents negative infinity, 1 if value represents positive infinity, and 0 otherwise.

### int fp64_isnan( float64_t x );

The fp64_isnan() function returns a non-zero value if value is "not-a-number" (NaN), and 0 otherwise.

### float64_t fp64_ldexp (float64_t x, int exp);

The fp64_ldexp() function returns the result of multiplying the floating-point number x by 2 raised to the power exp.

The following rules apply:

| case | x | fp64_ldexp(x,exp) |
|------|---|-------------------|
| 1 | NaN | NaN |
| 2 | +/- Inf | +/- Inf |
| 3 | +/- 0 | +/- 0 |
| 4 | exponent(x)+exp > 1023 | +/- Inf (Overflow) |
| 5 | exponent(x)+exp < -1074 | +/- 0.0 (Underflow) |
| 6 | exponent(x)+exp < -1023 | x*2^exp as subnormal number |
| 7 | all other cases | x*2^exp |

### float64_t fp64_log( float64_t x );

Returns the natural logarithm ln of x. The following rules apply:

| case | x | fp64_log(x) |
|------|---|-------------|
| 1 | < 0 | NaN (< 0 includes -Inf) |
| 2 | NaN | NaN |
| 3 | +Inf | +Inf |
| 4 | 0 | -Inf |
| 5 | 1 | 0 |

### float64_t fp64_log10( float64_t A );

The fp64_log10() function returns the base 10 logarithm of A.

log10(A) = log(A) / log(10)

### float64_t fp64_long_to_float64( float64_t A );

Alias to fp64_int32_to_float64, see there.

### long fp64_lrint (float64_t A);

The fp64_lrint() function rounds A to the nearest integer, rounding halfway cases to the even integer (that is, both 1.5 and 2.5 are rounded to 2). This function is similar to the rint() function, but it differs in the type of the return value and in that an overflow is possible.

Return:

The rounded long integer value. If A is infinite or NaN, or if an overflow occurred, this implementation returns the LONG_MIN value (0x80000000).

Examples:

fp64_lrint(1.25) --> 1L

fp64_lrint(1.5) --> 2L

fp64_lrint(2.5) --> 2L

fp64_lrint(2.75) --> 3L

fp64_lrint(3.5) --> 4L

fp64_lrint(-1.25) --> -1L

fp64_lrint(-1.5) --> -2L

fp64_lrint(-2.5) --> -2L

fp64_lrint(-2.75) --> -3L

fp64_lrint(-3.5) --> -4L

The following rules apply:

| case | A | fp64_lrint(A) |
|------|---|---------------|
| 1 | NaN | LONG_MIN (0x80000000) |
| 2 | +/- Inf | LONG_MIN (0x80000000) |
| 3 | +/- 0 | 0L |
| 4 | \|A\| >= 2^31 | LONG_MIN (0x80000000) |
| 5 | 0 < \|A\| <= 0.5 | 0L |
| 6 | 0.5 < \|A\| < 1.0 | +/- 1L |
| 7 | 1.0 <= \|A\| < 2^31 | trunc(A) if (\|A\|-trunc(\|A\|)) < 0.5; trunc(A+sign(A)) if (\|A\|-trunc(\|A\|)) > 0.5; the even one of the two if (\|A\|-trunc(\|A\|)) = 0.5 |

### long fp64_lround (float64_t x);

The fp64_lround() function rounds x to the nearest integer, but rounds halfway cases away from zero (instead of to the nearest even integer). This function is similar to the round() function, but it differs in the type of the return value and in that an overflow is possible.

Return:

The rounded long integer value. If x is infinite or NaN, or if an overflow occurred, this implementation returns the LONG_MIN value (0x80000000).

Examples:

fp64_lround(1.1) --> 1L

fp64_lround(1.5) --> 2L

fp64_lround(-1.1) --> -1L

fp64_lround(-1.5) --> -2L

The following rules apply:

| case | x | fp64_lround(x) |
|------|---|----------------|
| 1 | NaN | LONG_MIN (0x80000000) |
| 2 | +/- Inf | LONG_MIN (0x80000000) |
| 3 | +/- 0 | 0L |
| 4 | \|x\| >= 2^31 | LONG_MIN (0x80000000) |
| 5 | 0 < \|x\| < 0.5 | 0L |
| 6 | 0.5 <= \|x\| < 1.0 | +/- 1L |
| 7 | 1.0 <= \|x\| < 2^31 | trunc(x) if (\|x\|-trunc(\|x\|)) < 0.5; trunc(x+sign(x)) if (\|x\|-trunc(\|x\|)) >= 0.5 |

### float64_t fp64_modf (float64_t x, float64_t *iptr);

The fp64_modf() function breaks the argument x into an integral part and a fractional part, each of which has the same sign as x. The integral part is stored in *iptr. This implementation skips the store if iptr is a null pointer.

Examples:

fp64_modf(123.45, iptr) --> 0.45 (return value), 123.0 (stored in *iptr)

The following rules apply:

| case | x | fp64_modf(x, iptr) | *iptr |
|------|---|--------------------|-------|
| 1 | NaN | NaN | NaN |
| 2 | Inf | Inf | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 |
| 4 | \|x\| > 2^53 | 0.0 | x |
| 5 | \|x\| < 1.0 | x | 0.0 |
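For finite inputs the split can be tried on a host with the standard C modf() for double. This sketch does not call fp64lib and deliberately leaves out NaN and Inf, since the special cases in the table are specific to this library. check_modf is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

/* Splits exactly representable values so the parts can be compared exactly. */
static void check_modf(void)
{
    double ipart = 0.0;
    double frac = modf(-3.75, &ipart);

    assert(ipart == -3.0);          /* integral part keeps the sign of x */
    assert(frac == -0.75);          /* so does the fractional part */
    assert(ipart + frac == -3.75);  /* the two parts add up to x */

    frac = modf(0.5, &ipart);       /* case 5: |x| < 1.0 */
    assert(frac == 0.5 && ipart == 0.0);
}
```

Values such as 123.45 are not exactly representable in binary, so a comparison of the fractional part against 0.45 would need a tolerance.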

### float64_t fp64_mul( float64_t A, float64_t B );

Multiplies two 64-bit floating point numbers and rounds the result to a 64-bit floating point number.

The following rules apply:

| case | A | B | fp64_mul(A,B) |
|------|---|---|---------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3 | Inf | 0.0 | NaN |
| 4 | 0.0 | Inf | NaN |
| 5 | Inf | != 0 | Inf |
| 6 | != 0 | Inf | Inf |

### float64_t fp64_neg (float64_t A);

Returns -A. Also works for special cases (NaN, +Inf, -Inf), subnormal numbers and +0 and -0.

### float64_t fp64_pow(float64_t x, float64_t y);

The fp64_pow() function returns the value of x raised to the power of y. The following rules apply:

| case | x | y | fp64_pow(x,y) |
|------|---|---|---------------|
| 1 | any | 0 | 1.0 (includes x = NaN or Inf) |
| 2 | +1.0 | any | 1.0 (includes y = NaN or Inf) |
| 3 | any | +1.0 | x (includes x = NaN or Inf) |
| 4 | x < 0 | any | NaN |
| 5 | x > 0 | any | x^y |

### float64_t fp64_round( float64_t x );

Rounds x to the nearest integer, but rounds halfway cases away from 0.

Examples:

fp64_round(1.1) --> 1.0

fp64_round(1.5) --> 2.0

fp64_round(-1.1) --> -1.0

fp64_round(-1.5) --> -2.0

The following rules apply:

| case | x | fp64_round(x) |
|------|---|---------------|
| 1 | NaN | NaN |
| 2 | +/- Inf | +/- Inf |
| 3 | +/- 0 | +/- 0 |
| 4 | \|x\| >= 2^52 | x |
| 5 | 0 < \|x\| < 0.5 | 0.0 |
| 6 | 0.5 <= \|x\| < 1 | +/- 1.0 |
| 7 | 1 <= \|x\| <= 2^52 | trunc(x) if (\|x\|-trunc(\|x\|)) < 0.5; trunc(x+sign(x)) if (\|x\|-trunc(\|x\|)) >= 0.5 |

### float64_t fp64_sd (float A);

Converts a float32 to the float64 representing the same number. As float64_t fully includes all values of float, no error or truncation occurs.

### int fp64_signbit (float64_t x);

The fp64_signbit() function returns a nonzero value if the value of x has its sign bit set. This is not the same as "x < 0.0", because IEEE 754 floating point allows zero to be signed. The comparison "-0.0 < 0.0" is false, but "fp64_signbit(-0.0)" will return a nonzero value.

This implementation returns 1 if the sign bit is set. This corresponds to the GCC builtin implementation for float.
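The distinction between a comparison and a sign-bit test can be tried on a host with the standard C signbit() macro for double. This is a sketch that does not call fp64lib; check_signbit is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

/* Shows that -0.0 compares equal to 0.0 but still has its sign bit set. */
static void check_signbit(void)
{
    double neg_zero = -0.0;

    assert(!(neg_zero < 0.0));        /* -0.0 is not less than 0.0 */
    assert(neg_zero == 0.0);          /* it compares equal to +0.0 */
    assert(signbit(neg_zero) != 0);   /* but its sign bit is set */
    assert(signbit(0.0) == 0);
    assert(signbit(-1.5) != 0);
}
```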

### float64_t fp64_sin( float64_t phi );

Returns the sine of phi. The following special cases apply:

sin(NaN) = NaN

sin(+Inf) = NaN

sin(-Inf) = NaN

### float64_t fp64_sinh( float64_t x);

The fp64_sinh() function returns the hyperbolic sine of x, which is defined mathematically as (exp(x) - exp(-x)) / 2.

### float64_t fp64_sqrt (float64_t);

Square root function. The following special cases apply:

sqrt(NaN) = NaN

sqrt(+Inf) = +Inf

sqrt(-Inf) = NaN

sqrt(+-0) = +-0

### float64_t fp64_square (float64_t A);

The fp64_square() function returns square of A.

### float64_t fp64_strtod( char *str, char **endptr );

Converts a string to a number; also handles NaN, +INF and -INF. A valid number (neither NaN nor INF) can be expressed by the following regex: [+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?

Returns the identified number as the result; *endptr then contains a pointer to the last parsed position in the string str.

### float64_t fp64_sub( float64_t A, float64_t B );

Subtracts two 64-bit floating point numbers and rounds the result to a 64-bit floating point number. The following special cases apply:

| case | A | B | fp64_sub(A,B) |
|------|---|---|---------------|
| 1 | NaN | any | NaN |
| 2 | any | NaN | NaN |
| 3a | +Inf | -Inf | +Inf |
| 3b | +Inf | +Inf | NaN |
| 3c | -Inf | -Inf | NaN |
| 3d | -Inf | +Inf | -Inf |
| 4 | Inf | any | Inf |
| 5 | any | Inf | -Inf |
| 6 | 0 | any | -any |
| 7 | any | 0 | any |

### float64_t fp64_tan( float64_t phi );

Returns the tangent of phi. The following special cases apply:

tan(NaN) = NaN

tan(+Inf) = NaN

tan(-Inf) = NaN

### float64_t fp64_tanh (float64_t x);

The fp64_tanh() function returns the hyperbolic tangent of x, which is defined mathematically as sinh(x) / cosh(x).

### char *fp64_to_decimalExp( float64_t x, uint8_t maxDigits, uint8_t expSep, int16_t *exp10 )

Converts a number to a string with at most maxDigits digits of significand.

parameters:

x: number to convert in float64_t format

maxDigits: maximum number of digits in the significand

expSep: flag to store significand and exponent separately

exp10: if != NULL, the base 10 exponent is stored here

NOTE: This function supports conversion of up to 17 digits. However, the accuracy of IEEE 754 double precision is only 53 bits, which corresponds to 15-16 decimal digits!

WARNING: The function returns a pointer to a static temporary scratch area that may also be used by other functions. The returned string may become invalid or scrambled as soon as one of the other fp64_ functions is called. Copy the result to memory that you have under control before calling further fp64lib routines.
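Copying the result out of the scratch area can be wrapped in a small helper. The sketch below is not part of fp64lib: save_result is a hypothetical helper, and the test uses a local array as a stand-in for the pointer that a conversion routine would return, so that it can run on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a result string out of a shared scratch
 * area into a caller-owned buffer, always NUL-terminating it.
 * Returns 1 if the whole string fit, 0 if it was truncated. */
static int save_result(char *dst, size_t dst_size, const char *scratch)
{
    size_t len = strlen(scratch);

    if (dst_size == 0)
        return 0;
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, scratch, len);
    dst[len] = '\0';
    return scratch[len] == '\0';
}

/* Usage example with a stand-in for the scratch area. */
static void check_save_result(void)
{
    char scratch[] = "1.2345E+3";
    char mine[16];

    assert(save_result(mine, sizeof mine, scratch) == 1);
    scratch[0] = '?';                       /* scratch area gets reused */
    assert(strcmp(mine, "1.2345E+3") == 0); /* the copy is unaffected */

    assert(save_result(mine, 4, "1.2345E+3") == 0);
    assert(strcmp(mine, "1.2") == 0);       /* truncated, still terminated */
}
```

On the target, the third argument would be the pointer returned by the conversion routine, and the copy would have to be made before any other fp64_ function is called.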

### int fp64_to_int16( float64_t A );

The fp64_to_int16() function converts A to a signed integer value, rounding towards 0; the fractional part is lost. No saturation.

Only when called from assembly language: besides the normal 16-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_int16(A) |
|------|---|-------|------------------|
| 1 | NaN | 1 | 0x0000 |
| 2 | Inf | 1 | 0x0000 |
| 3 | \|A\| >= 2^15 | 1 | 0x0000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x0000 |
| 5 | \|A\| < 2^15 | 0 | (int) A |

### long fp64_to_int32( float64_t A );

The fp64_to_int32() function converts A to a signed integer value, rounding towards 0; the fractional part is lost. No saturation.

Only when called from assembly language: besides the normal 32-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_int32(A) |
|------|---|-------|------------------|
| 1 | NaN | 1 | 0x00000000 |
| 2 | Inf | 1 | 0x00000000 |
| 3 | \|A\| >= 2^31 | 1 | 0x00000000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x00000000 |
| 5 | \|A\| < 2^31 | 0 | (long) A |

### long long fp64_to_int64( float64_t A );

The fp64_to_int64() function converts A to a signed integer value, rounding towards 0; the fractional part is lost. No saturation.

Only when called from assembly language: besides the normal 64-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_int64(A) |
|------|---|-------|------------------|
| 1 | NaN | 1 | 0x0000000000000000 |
| 2 | Inf | 1 | 0x0000000000000000 |
| 3 | \|A\| >= 2^63 | 1 | 0x0000000000000000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x0000000000000000 |
| 5 | \|A\| < 2^63 | 0 | (long long) A |

### char fp64_to_int8( float64_t A );

The fp64_to_int8() function converts A to a signed integer value, rounding towards 0; the fractional part is lost. No saturation.

Only when called from assembly language: besides the normal 8-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_int8(A) |
|------|---|-------|-----------------|
| 1 | NaN | 1 | 0x00 |
| 2 | Inf | 1 | 0x00 |
| 3 | \|A\| >= 2^7 | 1 | 0x00 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x00 |
| 5 | \|A\| < 2^7 | 0 | (char) A |

### char *fp64_to_string(float64_t x, uint8_t max_chars, uint8_t max_zeroes);

Converts the float64 to the decimal representation of the number x. fp64_to_string adjusts the number of decimal digits so that the result will fit into a string with max_chars characters. However, a longer string will be returned if the minimum representation does not fit into max_chars. The minimum representation is "mESn[nn]" for x > 0, otherwise "-mESn[nn]".

parameters:

x: number to convert in float64_t format

max_chars: maximum space for result

max_zeroes: use "s0.mmmmmm" when the result has fewer than this number of zeroes

WARNING: The function returns a pointer to a static temporary scratch area that may also be used by other functions. The returned string may become invalid or scrambled as soon as one of the other fp64_ functions is called. Copy the result to memory that you have under control before calling further fp64_ routines.

The following rules apply:

| case | x | fp64_to_string(x) |
|------|---|-------------------|
| 1 | NaN | "NaN" |
| 2 | +Inf | "+INF" |
| 3 | -Inf | "-INF" |
| 4 | \|x\| < 1 | "s0.mmmmmm" representation without exponent, if x can be displayed with fewer than max_zeroes "0" after "0." (leading sign s only for x < 0); otherwise the exponential form "sm.mmmmmESnnn" is used |
| 5 | log10(\|x\|) < max_chars | "smmm.mmm" representation without exponent |
| 6 | all other cases | "sm.mmmmmESn" exponential form is used; leading sign s only for x < 0; the exponent "Snnn" always has a sign S ("+" or "-") and one to three digits nnn |

### unsigned int fp64_to_uint16( float64_t A );

The fp64_to_uint16() function converts A to an unsigned integer value, rounding towards 0; the fractional part is lost. No saturation.

Negative input is permissible (like GCC/x86).

Only when called from assembly language: besides the normal 16-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_uint16(A) |
|------|---|-------|-------------------|
| 1 | NaN | 1 | 0x0000 |
| 2 | Inf | 1 | 0x0000 |
| 3 | \|A\| >= 2^16 | 1 | 0x0000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x0000 |
| 5 | 0 < A < 2^16 | 0 | (unsigned int) A |
| 6 | -2^16 < A < 0 | 0 | -((unsigned int) (-A)) |

### unsigned long fp64_to_uint32( float64_t A );

The fp64_to_uint32() function converts A to an unsigned integer value, rounding towards 0; the fractional part is lost. No saturation.

Negative input is permissible (like GCC/x86).

Only when called from assembly language: besides the normal 32-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_uint32(A) |
|------|---|-------|-------------------|
| 1 | NaN | 1 | 0x00000000 |
| 2 | Inf | 1 | 0x00000000 |
| 3 | \|A\| >= 2^32 | 1 | 0x00000000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x00000000 |
| 5 | 0 < A < 2^32 | 0 | (unsigned long) A |
| 6 | -2^32 < A < 0 | 0 | -((unsigned long) (-A)) |

### unsigned long long fp64_to_uint64( float64_t A );

The fp64_to_uint64() function converts A to an unsigned integer value, rounding towards 0; the fractional part is lost. No saturation.

Negative input is permissible (like GCC/x86).

Only when called from assembly language: besides the normal 64-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_uint64(A) |
|------|---|-------|-------------------|
| 1 | NaN | 1 | 0x0000000000000000 |
| 2 | Inf | 1 | 0x0000000000000000 |
| 3 | \|A\| >= 2^64 | 1 | 0x0000000000000000 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x0000000000000000 |
| 5 | 0 < A < 2^64 | 0 | (unsigned long long) A |
| 6 | -2^64 < A < 0 | 0 | -((unsigned long long) (-A)) |

### unsigned char fp64_to_uint8( float64_t A );

The fp64_to_uint8() function converts A to an unsigned integer value, rounding towards 0; the fractional part is lost. No saturation.

Negative input is permissible (like GCC/x86).

Only when called from assembly language: besides the normal 8-bit value, the carry is returned as an extra error flag.

The following rules apply:

| case | A | Carry | fp64_to_uint8(A) |
|------|---|-------|------------------|
| 1 | NaN | 1 | 0x00 |
| 2 | Inf | 1 | 0x00 |
| 3 | \|A\| >= 2^8 | 1 | 0x00 |
| 4 | 0 < \|A\| < 1.0 | 0 | 0x00 |
| 5 | 0 < A < 2^8 | 0 | (unsigned char) A |
| 6 | -2^8 < A < 0 | 0 | -((unsigned char) (-A)) |

### float64_t fp64_trunc( float64_t A );

Rounds A to the nearest integer not larger in absolute value, by cutting the noninteger part (= setting the fractional part to 0). This is effectively the same as rounding A towards 0.

Examples:

fp64_trunc(1.9) --> 1.0

fp64_trunc(-1.9) --> -1.0

The following rules apply:

| case | A | fp64_trunc(A) |
|------|---|---------------|
| 1 | NaN | NaN |
| 2 | +Inf | +Inf |
| 3 | -Inf | -Inf |
| 4 | 0.0 | 0.0 |
| 5 | -0.0 | -0.0 |
| 6 | \|A\| >= 2^52 | A |
| 7 | 0 < \|A\| < 1 | 0.0 |
| 8 | 1 <= \|A\| <= 2^52 | A - fractional_part(A) |
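How truncation relates to the other rounding functions in this library (fp64_floor, fp64_ceil and fp64_round) can be seen side by side on a host with the standard C functions for double. This is a sketch that does not call fp64lib; check_rounding_family is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

/* Applies the four rounding functions to the same inputs. */
static void check_rounding_family(void)
{
    /* trunc(): towards zero */
    assert(trunc(1.9) == 1.0 && trunc(-1.9) == -1.0);

    /* floor(): towards -Inf, equals trunc() for positive input */
    assert(floor(1.9) == 1.0 && floor(-1.9) == -2.0);

    /* ceil(): towards +Inf, equals trunc() for negative input */
    assert(ceil(1.9) == 2.0 && ceil(-1.9) == -1.0);

    /* round(): nearest, halfway cases away from zero */
    assert(round(1.5) == 2.0 && round(-1.5) == -2.0);
}
```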

### float64_t fp64_uint64_to_float64( unsigned long long x );

Convert an unsigned 64-bit integer (unsigned long long) to float64_t.

Overflow cannot occur, but precision is lost if x > 2^53.