X-Git-Url: http://nsz.repo.hu/git/?a=blobdiff_plain;f=libm%2Findex.html;h=6eb718c2a3be826149d31ec3a4138889218c8b2d;hb=008785a2f30bfabc39893ed0937829b0c2a44572;hp=e177d6c972a686f49e95d89bec27e4c920ade1db;hpb=6a433528075d5cc62d4d4c3fefb59d42ec78355a;p=www diff --git a/libm/index.html b/libm/index.html index e177d6c..6eb718c 100644 --- a/libm/index.html +++ b/libm/index.html @@ -35,7 +35,7 @@ cvs -d $CVSROOT get src/li
|error| < 1.5 ulp-should hold. +should hold for most functions. (error is the difference between the exact result and the calculated floating-point value) (in theory correct rounding can be achieved but with big implementation cost, @@ -69,16 +69,21 @@ sqrt, trunc.
Binary representation of floating point numbers matter because bit hacks are often needed in the math code. +(in particular bit hacks are used instead of relational operations for nan +and sign checks becuase relational operators raise invalid fp exception on nan +and they treat -0.0 and +0.0 equally and more often than not these are not desired)
float and double bit manipulation can be handled in a portable way in c using union types: @@ -173,6 +181,8 @@ on the endianness and write different code for different architectures.
The ugly parts of libm hacking.
Some notes are from: http://www.vinc17.org/research/extended.en.html +
Useful info about floating-point in gcc: +http://gcc.gnu.org/wiki/FloatingPointMath
So x = a+b may give different results depending on @@ -199,6 +209,18 @@ but that's the most common one)
C99 annex F prohibits double rounding, but that's non-normative. +
+Note that the value of the result can only be ruined by +double rounding in nearest rounding mode, but the double +rounding issue haunts directed rounding modes as well: +raising the underflow flag might be omitted. +On x86 with downward rounding +
+(double)(0x1p-1070 + 0x1p-2000L) ++does not raise underflow (only inexact) eventhough the +final result is an inexact subnormal. +
@@ -229,9 +251,12 @@ So may be false when the two sides are kept in different precision. (This is not an x87 specific problem, it matters whenever there -is a higher precision fp type than the currently used one. +is a higher precision fp type than the currently used one and +FLT_EVAL_METHOD!=0. It goes away if the highest precision (long double) is used everywhere, but that can have a huge penalty). +(clang uses sse by default on i386 with FLT_EVAL_METHOD==0, +while gcc uses the 80bit x87 fp registers and FLT_EVAL_METHOD==2)
C99 has a way to control this (see
5.1.2.3 example 4,
@@ -260,6 +285,11 @@ also see gcc bug3657
gcc 4.5 fixed it with '-fexcess-precision=standard'
(it is enabled by '-std=c99', but the default is
'-fexcess-precision=fast')
+(An alternative solution would be if gcc spilled the
+registers with temporary results without rounding,
+storing the 80 bit registers entirely in memory
+which would make the behaviour under FLT_EVAL_METHOD==2
+mode more predictable)
The workaround for older gcc is to force the
compiler to store the intermediate results:
@@ -278,6 +308,40 @@ use higher precision variables when that's
what we want and don't depend on the implicit
excess precision).
+
+The standard allows 1 ulp errors in the conversion
+of decimal floating-point literals into floating-point
+values (it only requires the same result for the same
+literal thus 1.1 - 1.1 is always 0,
+but 1.1 - 11e-1 maybe +-0x1p-52 or 0).
+
+A reasonable compiler always use correctly rounded
+conversion according to the default (nearest) rounding
+mode, but there are exceptions:
+the x87 has builtin constants which are faster to load
+from hardware than from memory
+(and the hw has sticky bit set correctly for rounding).
+gcc can recognize these constants so an operation on
+
+According to the freebsd libm code gcc truncates long double
+const literals on i386.
+I assume this happens because freebsd uses 64bit long doubles by default
+(double precision) and gcc incorrectly uses the precision setting of the
+host platform instead of the target one, but i did not observe this on linux.
+(as a workaround sometimes double-double arithmetics was used
+to initialize long doubles on i386, but most of these should be
+fixed in musl's math code now)
+
Runtime and compile time semantics may be different
@@ -292,7 +356,8 @@ different precision than at runtime).
C99 actually allows most of these optimizations
but they can be turned off with STDC pragmas (see
6.10.6).
-Unfortunately gcc does not support these pragmas.
+Unfortunately gcc does not support these pragmas
+nor clang (clang bug 8100).
FENV_ACCESS ON tells the compiler that the code wants
to access the floating point environment (eg. set different rounding mode)
@@ -307,12 +372,6 @@ variables for constants like
static const volatile two52 = 0x1p52;
and using the '-frounding-math' gcc flag.
-
-(According the freebsd libm code gcc truncates
-long double const literals on i386.
-I haven't yet verified if this still the case,
-but as a workaround double-double arithmetics is used:
-initializing the long double constant from two doubles)
@@ -358,27 +417,27 @@ but feraiseexcept is not available for some reason,
then simple arithmetics can be be used just for their
exception raising side effect
(eg. 1/0.0 to raise divbyzero), however beaware
-of compiler optimizations (dead code elimination,..).
+of compiler optimizations (constant folding and dead code elimination,..).
Unfortunately gcc does not always take fp exceptions into
account: a simple x = 1e300*1e300; may not raise overflow
exception at runtime, but get optimized into x = +inf.
see compiler optimizations above.
-Another x87 gcc bug related to fp exceptions is that
+Another x87 gcc bug related to fp exceptions is that in some cases
comparision operators (==, <, etc) don't raise invalid
when an operand is nan
(eventhough this is required by ieee + c99 annex F).
(see gcc bug52451).
-The ieee standard defines signaling and quite nan
+The ieee standard defines signaling and quiet nan
floating-point numbers as well.
The c99 standard only considers quiet nan, but it allows
signaling nans to be supported as well.
Without signaling nans x * 1 is equivalent to x,
but if signaling nan is supported then the former
raises an invalid exception.
-This may complicates things further if one wants to write
+This may complicate things further if one wants to write
portable fp math code.
A further libm design issue is the math_errhandling macro:
@@ -389,8 +448,23 @@ but errno is hard to support: certain library functions
are implemented as a single asm instruction (eg sqrt),
the only way to set errno is to query the fp exception flags
and then set the errno variable based on that.
-So eventhough errno may be convenient in libm it is
+So eventhough errno may be convenient, in libm it is
not the right thing to do.
+
+For soft-float targets however errno seems to be the only option
+(which means annex K cannot be fully supported, as it requires
+the support of exception flags).
+The problem is that at context switches the fpu status should
+be saved and restored which is done by the kernel on hard-fp
+architectures when the state is in an fpu status word.
+In case of soft-fp emulation this must be done by the c runtime:
+context switches between threads can be supported with thread local
+storage of the exception state, but signal handlers may do floating-point
+arithmetics which should not alter the fenv state.
+Wrapping signal handlers is not possible/difficult for various
+reasons and the compiler cannot know which functions will be used
+as signal handlers, so the c runtime has no way to guarantee that
+signal handlers do not alter the fenv.
@@ -429,16 +503,26 @@ in gcc it can be 1.0fi).
The freebsd libm code has many inconsistencies
(naming conventions, 0x1p0 notation vs decimal notation,..),
-one of them is the integer type used for bitmanipulations:
+one of them is the integer type used for bit manipulations:
The bits of a double are unpacked into one of
-int32_t, uint32_t and u_int32_t
+int, int32_t, uint32_t and u_int32_t
integer types.
-int32_t is used most often which is wrong because of
-implementation defined signed int representation.
+int32_t is used the most often which is not wrong in itself
+but it is used incorrectly in many places.
+
+int is a bit worse because unlike int32_t it is not guaranteed
+to be 32bit two's complement representation. (but of course in
+practice they are the same)
+
+The issues found so far are left shift of negative integers
+(undefined behaviour), right shift of negative integers
+(implementation defined behaviour), signed overflow
+(implementation defined behaviour), unsigned to signed conversion
+(implementation defined behaviour).
-In general signed int is not handled carefully
-in the libm code: scalbn even depends on signed int overflow.
+It is easy to avoid these issues without performance impact,
+but a bit of care should be taken around bit manipulations.
+3.141592653589793238462643383L
+
+can turn into code that uses the fldpi
+instruction instead of memory loads.
+The only issue is that fldpi depends on
+the current rounding mode at runtime
+so the result can indeed be 1 ulp off compared
+to the compile-time rounded value.
+