X-Git-Url: http://nsz.repo.hu/git/?a=blobdiff_plain;f=libm%2Findex.html;h=6eb718c2a3be826149d31ec3a4138889218c8b2d;hb=008785a2f30bfabc39893ed0937829b0c2a44572;hp=360146e81229af8948d124fa0d41308d9fe3378b;hpb=23940a119a2f4bc56937741c9c47845a96998df3;p=www diff --git a/libm/index.html b/libm/index.html index 360146e..6eb718c 100644 --- a/libm/index.html +++ b/libm/index.html @@ -35,7 +35,7 @@ cvs -d $CVSROOT get src/li
 |error| < 1.5 ulp
-should hold.
+should hold for most functions.
 (error is the difference between the exact result and the
 calculated floating-point value)
 (in theory correct rounding can be achieved but with big
 implementation cost, see crlibm)
 Binary representation of floating point numbers matters because
 bit hacks are often needed in the math code.
+(in particular bit hacks are used instead of relational operations
+for nan and sign checks because relational operators raise invalid
+fp exception on nan and they treat -0.0 and +0.0 equally,
+and more often than not these are not desired)
 float and double bit manipulation can be handled
 in a portable way in c using union types:
@@ -157,9 +165,10 @@ ld128 is rare (eg. sparc64 with software emulation), it means
 There are other non-conformant long double types:
 eg. the old SVR4 abi for ppc
-uses 128 bit long doubles, but it's software emulated using
-two doubles.
-(the newer ppc eabi uses ld64).
+uses 128 bit long doubles, but it's software emulated and traditionally
+implemented using two doubles
+(also called ibm long double as this is what ibm aix used on ppc).
 The ibm s390 supports the ieee 754-2008 compliant binary128
 floating-point format, but previous ibm machines (S/370, S/360)
 used slightly different representation.
@@ -172,6 +181,8 @@ on the endianness and write different code for different architectures.
The ugly parts of libm hacking.
Some notes are from: http://www.vinc17.org/research/extended.en.html +
Useful info about floating-point in gcc: +http://gcc.gnu.org/wiki/FloatingPointMath
 So x = a+b may give different results depending on
@@ -198,6 +209,18 @@ but that's the most common one)
C99 annex F prohibits double rounding, but that's non-normative. +
+Note that the value of the result can only be ruined by
+double rounding in nearest rounding mode, but the double
+rounding issue haunts directed rounding modes as well:
+raising the underflow flag might be omitted.
+On x86 with downward rounding
+(double)(0x1p-1070 + 0x1p-2000L)
+does not raise underflow (only inexact) even though the
+final result is an inexact subnormal.
@@ -228,9 +251,12 @@ So
 may be false when the two sides are kept in different precision.
 (This is not an x87 specific problem, it matters whenever there
-is a higher precision fp type than the currently used one.
+is a higher precision fp type than the currently used one and
+FLT_EVAL_METHOD!=0.
 It goes away if the highest precision (long double) is
 used everywhere, but that can have a huge penalty).
+(clang uses sse by default on i386 with FLT_EVAL_METHOD==0,
+while gcc uses the 80bit x87 fp registers and FLT_EVAL_METHOD==2)
C99 has a way to control this (see
5.1.2.3 example 4,
@@ -259,6 +285,11 @@ also see gcc bug3657
gcc 4.5 fixed it with '-fexcess-precision=standard'
(it is enabled by '-std=c99', but the default is
'-fexcess-precision=fast')
+(An alternative solution would be if gcc spilled the
+registers with temporary results without rounding,
+storing the 80 bit registers entirely in memory
+which would make the behaviour under FLT_EVAL_METHOD==2
+mode more predictable)
The workaround for older gcc is to force the
compiler to store the intermediate results:
@@ -277,6 +308,40 @@ use higher precision variables when that's
what we want and don't depend on the implicit
excess precision).
+
+The standard allows 1 ulp errors in the conversion
+of decimal floating-point literals into floating-point
+values (it only requires the same result for the same
+literal thus 1.1 - 1.1 is always 0,
+but 1.1 - 11e-1 may be +-0x1p-52 or 0).
+
+A reasonable compiler always uses correctly rounded
+conversion according to the default (nearest) rounding
+mode, but there are exceptions:
+the x87 has builtin constants which are faster to load
+from hardware than from memory
+(and the hw has sticky bit set correctly for rounding).
+gcc can recognize these constants so an operation on
+
+According to the freebsd libm code gcc truncates long double
+const literals on i386.
+I assume this happens because freebsd uses 64bit long doubles by default
+(double precision) and gcc incorrectly uses the precision setting of the
+host platform instead of the target one, but I did not observe this on linux.
+(as a workaround sometimes double-double arithmetics was used
+to initialize long doubles on i386, but most of these should be
+fixed in musl's math code now)
+
Runtime and compile time semantics may be different
@@ -291,7 +356,8 @@ different precision than at runtime).
C99 actually allows most of these optimizations
but they can be turned off with STDC pragmas (see
6.10.6).
-Unfortunately gcc does not support these pragmas.
+Unfortunately gcc does not support these pragmas,
+and neither does clang (clang bug 8100).
FENV_ACCESS ON tells the compiler that the code wants
to access the floating point environment (eg. set different rounding mode)
@@ -306,12 +372,6 @@ variables for constants like
 static const volatile double two52 = 0x1p52;
and using the '-frounding-math' gcc flag.
-
-(According the freebsd libm code gcc truncates
-long double const literals on i386.
-I haven't yet verified if this still the case,
-but as a workaround double-double arithmetics is used:
-initializing the long double constant from two doubles)
@@ -357,27 +417,27 @@ but feraiseexcept is not available for some reason,
then simple arithmetics can be used just for their
exception raising side effect
(eg. 1/0.0 to raise divbyzero), however beware
-of compiler optimizations (dead code elimination,..).
+of compiler optimizations (constant folding and dead code elimination,..).
Unfortunately gcc does not always take fp exceptions into
account: a simple x = 1e300*1e300; may not raise overflow
exception at runtime, but get optimized into x = +inf.
see compiler optimizations above.
-Another x87 gcc bug related to fp exceptions is that
+Another x87 gcc bug related to fp exceptions is that in some cases
 comparison operators (==, <, etc) don't raise invalid
when an operand is nan
 (even though this is required by ieee + c99 annex F).
(see gcc bug52451).
-The ieee standard defines signaling and quite nan
+The ieee standard defines signaling and quiet nan
floating-point numbers as well.
The c99 standard only considers quiet nan, but it allows
signaling nans to be supported as well.
Without signaling nans x * 1 is equivalent to x,
but if signaling nan is supported then the former
raises an invalid exception.
-This may complicates things further if one wants to write
+This may complicate things further if one wants to write
portable fp math code.
A further libm design issue is the math_errhandling macro:
@@ -388,8 +448,23 @@ but errno is hard to support: certain library functions
are implemented as a single asm instruction (eg sqrt),
the only way to set errno is to query the fp exception flags
and then set the errno variable based on that.
-So eventhough errno may be convenient in libm it is
+So even though errno may be convenient, in libm it is
not the right thing to do.
+
+For soft-float targets however errno seems to be the only option
+(which means annex F cannot be fully supported, as it requires
+the support of exception flags).
+The problem is that at context switches the fpu status should
+be saved and restored, which the kernel does on hard-fp
+architectures where the state is in an fpu status word.
+In case of soft-fp emulation this must be done by the c runtime:
+context switches between threads can be supported with thread local
+storage of the exception state, but signal handlers may do floating-point
+arithmetics which should not alter the fenv state.
+Wrapping signal handlers is difficult or impossible for various
+reasons and the compiler cannot know which functions will be used
+as signal handlers, so the c runtime has no way to guarantee that
+signal handlers do not alter the fenv.
@@ -428,16 +503,26 @@ in gcc it can be 1.0fi).
The freebsd libm code has many inconsistencies
(naming conventions, 0x1p0 notation vs decimal notation,..),
-one of them is the integer type used for bitmanipulations:
+one of them is the integer type used for bit manipulations:
The bits of a double are unpacked into one of
-int32_t, uint32_t and u_int32_t
+int, int32_t, uint32_t and u_int32_t
integer types.
-int32_t is used most often which is wrong because of
-implementation defined signed int representation.
+int32_t is used the most often which is not wrong in itself
+but it is used incorrectly in many places.
+
+int is a bit worse because unlike int32_t it is not guaranteed
+to be 32bit two's complement representation. (but of course in
+practice they are the same)
+
+The issues found so far are left shift of negative integers
+(undefined behaviour), right shift of negative integers
+(implementation defined behaviour), signed overflow
+(undefined behaviour), unsigned to signed conversion
+(implementation defined behaviour).
-In general signed int is not handled carefully
-in the libm code: scalbn even depends on signed int overflow.
+It is easy to avoid these issues without performance impact,
+but a bit of care should be taken around bit manipulations.
+3.141592653589793238462643383L
+
+can turn into code that uses the fldpi
+instruction instead of memory loads.
+The only issue is that fldpi depends on
+the current rounding mode at runtime
+so the result can indeed be 1 ulp off compared
+to the compile-time rounded value.
+