Did any compiler fully use 80-bit floating point?



























There is a paradox about floating point that I'm trying to understand.



Floating point is an eternal struggle with the problem that real numbers are both essential and incomputable. It's the best solution we have for most calculations involving physical quantities, but it has the perennial problems of limited precision and range; many volumes have been written about how to deal with these problems, even down to hardware engineers getting headaches implementing support for subnormal numbers, only to have programmers promptly turn that support off because it kills performance on workloads where many numbers iterate to zero. (The usual reference here is the introductory document What every computer scientist should know about floating-point arithmetic, but for a more in-depth discussion it's worth reading the writings of William Kahan, one of the world's foremost experts on the topic and a very clear writer.)



The usual standard for floating point where substantial precision is required is IEEE-754 double precision, 64 bits. It's the best most hardware provides; doing even slightly better typically requires switching to a software solution for a dramatic slowdown.



The x87 went one better and provided extended precision, 80 bits. A Google search finds many articles about this, and almost all of them lament the problem that when compilers spill temporaries from registers to memory, they round them to 64 bits, so the exact results vary quasi-randomly depending on the behavior of the optimizer, which is admittedly a real problem.



The obvious solution is for the in-memory format to be also 80 bits, so that you get both extended precision and consistency. But I have not encountered any mention, ever, of this being used. It's moot now that one uses SSE2 which doesn't provide extended precision, but I would expect it to have been used in the days when x87 was the available floating-point instruction set.



The paradox is this: on the one hand, there is much discussion of limited precision being a big problem. On the other hand, Intel provided a solution with an extra eleven bits of precision and five bits of exponent, that would cost very little performance to use (since the hardware implemented it whether you used it or not), and yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.



So my question is:



Did any compilers ever make full use of extended precision (i.e. 80 bits in memory as well as in registers)? If not, why not?



































  • Comments are not for extended discussion; this conversation has been moved to chat.

    – Chenmunka
    yesterday











  • Can the title be edited to make it clear we're talking about a particular 80-bit implementation? Would 'x87' or 'Intel' be the best word to add?

    – another-dave
    11 hours ago






























5 Answers
































Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result, and to get long double precision, you would write 1.0L/3.0L.)
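As a minimal illustration (assuming an x87 target where long double is the 80-bit extended type), the literal suffix and the length modifier in the format string are what select the wider type:

    #include <stdio.h>

    int main(void)
    {
        double d       = 1.0  / 3.0;    /* double divide: 53-bit significand */
        long double ld = 1.0L / 3.0L;   /* long double divide: 64-bit significand
                                           on x87 targets                     */
        printf("double:      %.21g\n",  d);
        printf("long double: %.21Lg\n", ld);
        return 0;
    }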



GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.
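As a hedged illustration of the kind of surprise that option is aimed at (the exact behaviour depends on the compiler, optimization level, and whether x87 or SSE2 code is generated):

    #include <stdio.h>

    /* With x87 code generation, 'q' is rounded to a 53-bit double when it is
       stored, while the re-computed a / b may be kept at the register's
       64-bit-significand precision, so the comparison can evaluate to 0.
       With SSE2 code generation, both sides are genuine doubles and this
       prints 1. */
    static int quotient_self_equal(double a, double b)
    {
        double q = a / b;
        return q == a / b;
    }

    int main(void)
    {
        printf("%d\n", quotient_self_equal(1.0, 3.0));
        return 0;
    }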



Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables—except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.



Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.



Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.



Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating-point through Data.Number.CReal. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.




















TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anyway.





Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.



Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)



His whole series of FP articles is excellent; index in this one.






that would cost very little performance to use (since the hardware implemented it whether you used it or not)




This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but runtime variables usually can't make any assumptions.



Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.





  • 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support using 32 or 64-bit float/double memory operands.



    So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra"). On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double.



    And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.




  • fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double).



    fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].



    If you want to store 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)




80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.



Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.





  • P5 Pentium (in-order pipelined dual issue superscalar):





    • fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.


    • fld m80 : 3 cycles, not pairable, and (unlike fadd / fmul which are pipelined), not overlapable with later FP or integer instructions.


    • fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapable


    • fstp m80: (note only available in pop version that frees the x87 register): 3 cycles, not pairable




  • P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)

    (Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)





    • fld m32/m64 is 1 uop for the load port.


    • fld m80 : 4 uops total: 2 ALU p0, 2 load port


    • fst(p) m32/m64 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)


    • fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU extract into 64-bit and 16-bit chunks, as inputs for 2 stores.




Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)



Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.




  • Pentium-M: 1 per 3 cycle throughput for fstp m80 6 uops. vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.

  • Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops)
    fstp m80 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.

  • Pentium 4 (pre-Prescott): fld m80 3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.
    fstp m80: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar

  • Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.
    fstp m80: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.





  • AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64).
    fstp m80: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.


  • AMD Bulldozer: fld m80: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regular float/double x87 loads have half throughput of SSE2 / AVX loads.
    fstp m80: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.

    (Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)



There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.



Fun fact: fld can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So fld m80 can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.





So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.



x87 and MMX are de-prioritized, though, e.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.



(If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).








yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.




No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.



Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.



Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.




  • https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy

  • https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base


x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.
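(For what it's worth, C99 exposes the evaluation mode via FLT_EVAL_METHOD in <float.h>; a quick check, as a sketch:)

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0: float/double expressions evaluated at their own precision
              (typical for SSE2 code generation)
           2: everything evaluated at long double precision
              (typical for x87 code generation) */
        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }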



For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
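For reference, a minimal sketch of Kahan summation (the standard algorithm; variable names are mine). The compensation step relies on each operation rounding to double, which is exactly what extra temporary precision interferes with:

    #include <stddef.h>

    /* Compensated (Kahan) summation of an array of doubles. */
    double kahan_sum(const double *x, size_t n)
    {
        double sum = 0.0;
        double c   = 0.0;                /* running compensation for lost low bits */
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;
            double t = sum + y;
            c   = (t - sum) - y;         /* recovers the rounding error of sum + y,
                                            assuming t and sum were rounded to double */
            sum = t;
        }
        return sum;
    }

If the compiler keeps t, y and sum at 80-bit precision in registers, the recovered error term no longer corresponds to the rounding that actually happens when the result is finally stored as a double.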





When do compilers round:



Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)



So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.



When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.
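A sketch of that situation (hypothetical function, just to make the rounding points visible):

    #include <math.h>

    long double g(double y)
    {
        double a = y * y + 1.0;    /* declared double: rounded to 64 bits whenever
                                      it has to live in memory, e.g. across the call
                                      below, since the x87 stack is call-clobbered */
        long double s = sinl(y);   /* long double library call, 80-bit result */
        return s / a;              /* 'a' re-enters the computation at double
                                      precision; 's' keeps its extended precision */
    }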



But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with/without optimization. gcc -ffloat-store does this: it stores/reloads variables to their declared precision between statements even when optimization is enabled (but not temporaries within the evaluation of one expression). So for FP variables it's kind of like debug-mode code-gen, where vars are treated similarly to volatile.



But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.





Extended precision long double is still an option



(Obviously long double will prevent auto-vectorization, so only use it if you need it when writing modern code.)



Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).



Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double.



Beware that long double is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform, but they chose to make long double a 10-byte type despite MSVC making it the same as 8-byte double.
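A quick way to see the difference on whatever toolchain you're using (the figures in the comment are what typical builds report; treat them as an assumption to verify, and note they include alignment padding on top of the 10 data bytes):

    #include <stdio.h>

    int main(void)
    {
        /* gcc/clang targeting x86: typically 12 bytes (i386) or 16 bytes (x86-64),
           i.e. the 10-byte x87 format plus padding.  MSVC: 8 bytes, same as double. */
        printf("sizeof(long double) = %zu\n", sizeof(long double));
        return 0;
    }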



GCC has a -mlong-double-64/80/128 x86 option to set the width of long double, and the docs warn that it changes the ABI.



ICC has a /Qlong-double option that makes long double an 80-bit type even on Windows.



So functions that interact with any kind of long double are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double works, except as a function return value in st0 where it's in a register already.





If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option: it gives you extra range as well as significand precision, and only requires 1 instruction per computation.



(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)
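A sketch of that building block as scalar C (fma is from <math.h>, C99; exactness of the low part requires a correctly rounded fma, i.e. real FMA hardware rather than a multiply-then-add fallback):

    #include <math.h>

    /* Split the product of two doubles into a rounded head and an exact tail:
       hi + lo equals a * b exactly when fma() is correctly rounded. */
    static void two_prod(double a, double b, double *hi, double *lo)
    {
        *hi = a * b;              /* product rounded to double            */
        *lo = fma(a, b, -*hi);    /* exact rounding error of that product */
    }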





Further fun facts:



The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:




  • to 80-bit long double: 64-bit significand precision. The finit default, and normal setting except with MSVC.

  • to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.

  • to 24-bit float: 24-bit significand precision.


Apparently Visual C++'s CRT startup code (that calls main) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.



So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80 would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld decodes to multiple ALU uops on modern CPUs.)



I don't know if the motivation was to speed up fdiv and fsqrt (which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.



Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos/fsin and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)



Of course you can set it back to 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
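A sketch of that call on 32-bit MSVC (MSVC-specific CRT function; the constants come from <float.h>):

    #include <float.h>

    /* Set x87 precision control back to a 64-bit significand (80-bit extended)
       for the current thread; returns the previous control word so the caller
       can restore it. _MCW_PC masks the change to the precision-control field. */
    static unsigned int use_extended_precision(void)
    {
        unsigned int prev = 0;
        _controlfp_s(&prev, _PC_64, _MCW_PC);
        return prev;
    }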



























  • In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.

    – supercat
    Apr 19 at 21:30








  • @supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't promote locals and temporaries to 80-bit by default. (An option to do that would be possible, but GCC (still) doesn't have one. See discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)

    – Peter Cordes
    Apr 19 at 21:53






  • If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.

    – supercat
    Apr 19 at 22:07






  • @supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And it must be empty on call, and empty or holding the return value on ret.) Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.

    – Peter Cordes
    Apr 19 at 22:12








  • The option in GCC is -mlong-double-64/80/128. There's also a warning under them saying if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.

    – phuclv
    Apr 20 at 4:19



































I worked for Borland back in the days of the 8086/8087. Back then, both Turbo C and Microsoft C defined long double as an 80-bit type, matching the layout of Intel's 80-bit floating-point type. Some years later, when Microsoft got cross-hardware religion (maybe at the same time as they released Windows NT?) they changed their compiler to make long double a 64-bit type. To the best of my recollection, Borland continued to use 80 bits.






















  • Microsoft x86 16-bit tool sets support 80-bit long doubles. This was dropped in their x86 32-bit and x86 64-bit tool sets. Win32s for Windows 3.1 was released about the same time as NT 3.1. I'm not sure if Windows 3.1 winmem32 was released before or after win32s.

    – rcgldr
    Apr 19 at 13:46








  • C89 botched "long double" by violating the fundamental principle that all floating-point values passed to non-prototyped functions get converted to a common type. If it had specified that values of long double get converted to double except when wrapped using a special macro which would pass them in a struct __wrapped_long_double, that would have avoided the need for something like printf("%10.4f", x*y); to care about the types of both x and y [since the value isn't wrapped, it would get passed as double regardless of the types of x and y].

    – supercat
    Apr 19 at 21:37






  • IIRC Delphi 5 (and probably also 3, 4, 6, and 7) had the "Extended" type which used all 80 bits of the FPU registers. The generic "Real" type could be made an alias of that, of the 64-bit Double, or of a legacy Borland soft float format.

    – cyco130
    Apr 20 at 11:04








  • @MichaelKarcher: The introduction of long int came fairly late in the development of C, and caused considerable problems. Nonetheless, I think a fundamental difference between the relationship of long int and int, vs. long double and double, is that every value within the range of int can be represented just as accurately by that type as by any larger type. Thus, if scale_factor will never exceed the range of int, there would generally be no reason for it to be declared as a larger type. On the other hand, if one writes double one_tenth=0.1;, ...

    – supercat
    Apr 20 at 16:23








  • ...and then computes double x=one_tenth*y;, the calculation may be less precise than if one had written double x=y/10.0; or used long double scale_factor=0.1L. If neither wholeQuantity1 nor wholeQuantity2 would need to accommodate values outside the range of int, the expression wholeQuantity1+wholeQuantity2 will likely be of type int or unsigned. But in many cases involving floating-point, there would be some advantage to using longer-precision scale factors.

    – supercat
    Apr 20 at 16:29

































The Gnu Ada compiler ("Gnat") has supported 80-bit floating point as a fully-fledged built-in type with its Long_Long_Float type since at least 1998.



Here's a Usenet argument from February 1999 between Ada compiler vendors and users about whether not supporting 80-bit floats is an Ada LRM violation. This was a huge deal for compiler vendors, since a non-conforming compiler couldn't be used on many government contracts, and the rest of the Ada userbase at that time viewed the Ada LRM as the next best thing to holy writ.*




To take a simple example, an x86 compiler that does not support 80-bit
IEEE extended arithmetic is clearly violates B.2(10):




10 Floating point types corresponding to each floating
point format fully supported by the hardware.




and is thus non-conformant. It will still be fully validatable, since
this is not the sort of thing the validation can test with automated
tests.




...




P.S. Just to ensure that people do not regard the above as special
pleading for non-conformances in GNAT, please be sure to realize that
GNAT does support 80-bit float on the ia32 (x86).




Since this is a GCC-based compiler, it's debatable whether this is a revelation over the current top-rated answer, but I didn't see it mentioned.



* - It may look silly, but this user attitude kept Ada source code extremely portable. The only other languages that really can compare are ones that are effectively defined by the behavior of a single reference implementation, or under the control of a single developer.





















    Did any compilers ever make full use of extended precision (i.e. 80
    bits in memory as well as in registers)? If not, why not?




    Since any calculation inside the x87 FPU is carried out with 80-bit precision by default, any compiler that's able to generate x87 FPU code is already using extended precision.
    I also remember using long double even in 16-bit compilers for real mode.



    A very similar situation existed in the 68k world, with FPUs like the 68881 and 68882 supporting 80-bit precision by default, and any FPU code without special precautions would keep all register values in that precision. There was also a long double datatype.




    On the other hand, Intel provided a solution with an extra eleven bits
    of precision and five bits of exponent, that would cost very little
    performance to use (since the hardware implemented it whether you used
    it or not), and yet everyone seemed to behave as though this had no
    value




    Using long double would prevent contemporary compilers from ever doing the calculations in SSE/whatever registers and instructions. And SSE is actually a very fast engine, able to fetch data in large chunks and perform several computations in parallel every clock. The x87 FPU is now just a legacy unit and not very fast, so deliberately using 80-bit precision today would certainly be a huge performance hit.






























    • Right, I was talking about the historical context in which x87 was the only FPU on x86, so no performance hit from using it. Good point about 68881 being a very similar architecture.

      – rwallace
      Apr 19 at 8:12












    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "648"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fretrocomputing.stackexchange.com%2fquestions%2f9751%2fdid-any-compiler-fully-use-80-bit-floating-point%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    5 Answers
    5






    active

    oldest

    votes








    5 Answers
    5






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    28














    Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result, and to get long double precision, you would write 1.0L/3.0L.)



    GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.



    Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables—except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.



    Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.



    Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.



    Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating-point through Data.Number.CReal. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.






    share|improve this answer


























    • Comments are not for extended discussion; this conversation has been moved to chat.

      – Chenmunka
      yesterday
















    28














    Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result, and to get long double precision, you would write 1.0L/3.0L.)



    GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.



    Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables—except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.



    Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.



    Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.



    Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating-point through Data.Number.CReal. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.






    share|improve this answer


























    • Comments are not for extended discussion; this conversation has been moved to chat.

      – Chenmunka
      yesterday














    28












    28








    28







    Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result, and to get long double precision, you would write 1.0L/3.0L.)



    GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.



    Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables—except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.



    Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.



    Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.



    Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating-point through Data.Number.CReal. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.






    share|improve this answer















    Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result, and to get long double precision, you would write 1.0L/3.0L.)



    GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.



    Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables—except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.



    Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.



    Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.



    Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating-point through Data.Number.CReal. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Apr 19 at 20:34

























    answered Apr 19 at 6:57









    DavislorDavislor

    1,495411




    1,495411













    • Comments are not for extended discussion; this conversation has been moved to chat.

      – Chenmunka
      yesterday



















    • Comments are not for extended discussion; this conversation has been moved to chat.

      – Chenmunka
      yesterday

















    Comments are not for extended discussion; this conversation has been moved to chat.

    – Chenmunka
    yesterday





    Comments are not for extended discussion; this conversation has been moved to chat.

    – Chenmunka
    yesterday











    26














    TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.





    Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.



    Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)



    His whole series of FP articles is excellent; index in this one.






    that would cost very little performance to use (since the hardware implemented it whether you used it or not)




    This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but runtime variables usually can't make any assumptions.



    Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.





    • 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support using 32 or 64-bit float/double memory operands.



      So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra"). On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double.



      And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.




    • fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double).



      fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].



      If you want to store 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)




    80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.



    Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.





    • P5 Pentium (in-order pipelined dual issue superscalar):





      • fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.


      • fld m80 : 3 cycles, not pairable, and (unlike fadd / fmul, which are pipelined) not overlappable with later FP or integer instructions.


      • fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlappable


      • fstp m80 (only available as the pop version, which frees the x87 register): 3 cycles, not pairable




    • P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)

      (Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)





      • fld m32/m64 is 1 uop for the load port.


      • fld m80 : 4 uops total: 2 ALU p0, 2 load port


      • fst(p) m32/m64 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)


      • fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess the ALU uops extract the value into 64-bit and 16-bit chunks, as inputs for the 2 stores.




    Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)



    Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.




    • Pentium-M: 1 per 3 cycle throughput for fstp m80 6 uops. vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.

    • Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops)
      fstp m80 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.

    • Pentium 4 (pre-Prescott): fld m80 3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.
      fstp m80: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar.

    • Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.
      fstp m80: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.





    • AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64).
      fstp m80: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.


    • AMD Bulldozer: fld m80: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regular float/double x87 loads have half the throughput of SSE2 / AVX loads.
      fstp m80: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar; that catastrophic store throughput of one per 20 or 19 cycles is real.

      (Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)



    There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.



    Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So fld m80 can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64, which has to expand the significand/exponent fields.





    So ironically, on recent CPUs, where the main use-case for x87 is 80-bit work, the penalty for 80-bit loads/stores relative to plain float/double is even larger than on older CPUs. Obviously CPU designers don't put much weight on that, and assume x87 is mostly used by old 32-bit binaries.



    x87 and MMX are de-prioritized, though, e.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.



    (If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert to/from BCD, which requires converting between binary and decimal with division by 10. Never mind those; they're always microcoded.)








    yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.




    No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.



    Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.



    Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.




    • https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy

    • https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base


    x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.



    For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
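
    For reference, here is a minimal sketch of the standard Kahan summation algorithm (not code from this answer); the compensation step (t - sum) - y is exactly the kind of carefully ordered rounding that extra temporary precision can perturb:

        #include <stddef.h>

        /* Standard Kahan compensated summation, written assuming strict
         * double evaluation.  If the compiler keeps sum/y/t/c in 80-bit
         * x87 registers, the error term captured by c is computed at a
         * different precision, so the compensation no longer does exactly
         * what the algorithm was designed around. */
        double kahan_sum(const double *x, size_t n)
        {
            double sum = 0.0, c = 0.0;
            for (size_t i = 0; i < n; i++) {
                double y = x[i] - c;   /* re-inject the previously lost low bits */
                double t = sum + y;    /* big + small: low bits of y may be lost */
                c = (t - sum) - y;     /* recover what was just lost */
                sum = t;
            }
            return sum;
        }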





    When do compilers round:



    Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)



    So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.
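
    A small sketch of that (function and variable names are made up):

        #include <math.h>

        /* sinl() evaluates in long double, but the double locals around the
         * calls are spilled/reloaded at their declared binary64 precision,
         * because the whole x87 stack is call-clobbered. */
        double f(double x, double y)
        {
            double xy = x * y;                     /* double temporary */
            long double s = sinl((long double)x);  /* 80-bit library call */
            return (double)(s + sinl((long double)y) + xy);
        }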



    When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.



    But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with vs. without optimization. gcc -ffloat-store does this (store/reload variables to their declared precision between statements) even when optimization is enabled, though not for temporaries within the evaluation of a single expression. So for FP variables it's kind of like debug-mode code-gen, where variables are treated somewhat like volatile.



    But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
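
    A classic illustration of the kind of surprise involved (whether the comparison prints 1 or 0 depends on the compiler, target, and optimization level, so treat this as a sketch rather than a guaranteed result):

        #include <float.h>
        #include <stdio.h>

        int main(void)
        {
            double third = 1.0 / 3.0;
            double x = third * 3.0;   /* may be rounded to binary64 when stored... */
            /* ...while the right-hand side below may be evaluated as an 80-bit
             * temporary under FLT_EVAL_METHOD == 2, so the comparison can come
             * out false with x87 code-gen but true with SSE2 or -ffloat-store. */
            printf("FLT_EVAL_METHOD=%d equal=%d\n",
                   (int)FLT_EVAL_METHOD, x == third * 3.0);
            return 0;
        }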





    Extended precision long double is still an option



    (Obviously long double will prevent auto-vectorization, so only use it if you need it when writing modern code.)



    Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).



    Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double.



    Beware that long double is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct-layout rules of the platform, but they chose to make long double a 10-byte type despite MSVC making it the same as 8-byte double.



    GCC has a -mlong-double-64/80/128 x86 option to set the width of long double, and the docs warn that it changes the ABI.



    ICC has a /Qlong-double option that makes long double an 80-bit type even on Windows.



    So functions that interact with any kind of long double are not ABI-compatible between MSVC and other compilers (except GCC or ICC with special options); they expect a different-sized object, so passing even a single long double doesn't work, except as a function return value in st0, where it's already in a register.
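
    A quick way to see which flavour of long double a given compiler/ABI hands you (just a probe program, not from the answer above):

        #include <float.h>
        #include <stdio.h>

        /* gcc/clang targeting x86/x86-64 typically report sizeof 12 or 16
         * (with padding) and LDBL_MANT_DIG == 64 (the 80-bit x87 type);
         * MSVC reports sizeof 8 and LDBL_MANT_DIG == 53 (same as double). */
        int main(void)
        {
            printf("sizeof(long double)=%zu LDBL_MANT_DIG=%d FLT_EVAL_METHOD=%d\n",
                   sizeof(long double), LDBL_MANT_DIG, (int)FLT_EVAL_METHOD);
            return 0;
        }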





    If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range) or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option: it gives you extra exponent range as well as significand precision, and only requires one instruction per computation.



    (On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)
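
    Here's the multiply step from that link written out as a tiny helper (names are illustrative, not from a particular library); hi + lo together represent the exact product:

        #include <math.h>

        typedef struct { double hi, lo; } ddouble;

        /* Error-free transformation of a*b using FMA: hi is the correctly
         * rounded double product, lo is the exact rounding error, so
         * hi + lo == a*b exactly (barring overflow). */
        static ddouble two_prod(double a, double b)
        {
            ddouble r;
            r.hi = a * b;
            r.lo = fma(a, b, -r.hi);
            return r;
        }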





    Further fun facts:



    The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:




    • to 80-bit long double: 64-bit significand precision. This is the finit default, and the normal setting except with MSVC.

    • to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.

    • to 24-bit float: 24-bit significand precision.


    Apparently Visual C++'s CRT startup code (which calls main) reduces x87 precision from the 64-bit significand default down to 53 bits (64-bit double). Even x86 (32-bit) VS2012 and later still do this, if I'm reading Bruce Dawson's article correctly.



    So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80 would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld decodes to multiple ALU uops on modern CPUs.)



    I don't know if the motivation was to speed up fdiv and fsqrt (which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
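
    For the GNU/Linux side, a sketch of how the precision-control field can be adjusted from C using glibc's <fpu_control.h> (per-thread; error handling omitted). This just sets it back to the 64-bit-significand default:

        #include <fpu_control.h>

        static void set_x87_extended_precision(void)
        {
            fpu_control_t cw;
            _FPU_GETCW(cw);
            cw = (cw & ~_FPU_EXTENDED) | _FPU_EXTENDED;  /* set both precision-control bits */
            _FPU_SETCW(cw);
        }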



    Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos/fsin and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)



    Of course you can set it back to 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
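
    A minimal sketch of that for 32-bit x86 MSVC (the precision-control mask isn't supported by the x64 CRT):

        #include <float.h>

        /* Restore 64-bit significand precision for the current thread before
         * running asm, or long double code built with GCC/clang/ICC, that
         * expects the finit default. */
        static void restore_x87_extended_precision(void)
        {
            unsigned int current;
            _controlfp_s(&current, _PC_64, _MCW_PC);
        }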






      In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.

      – supercat
      Apr 19 at 21:30








      @supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't default to promoting locals and temporaries to 80-bit. (An option to do that would be possible, but GCC (still) doesn't have one. See the discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)

      – Peter Cordes
      Apr 19 at 21:53






      If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.

      – supercat
      Apr 19 at 22:07






      @supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And must be empty on call, and empty or holding the return value on ret). Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.

      – Peter Cordes
      Apr 19 at 22:12








      The option in GCC is -mlong-double-64/80/128. There's also a warning under them saying that if you override the default value for your target ABI, it changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double; hence they are not binary-compatible with code compiled without that switch.

      – phuclv
      Apr 20 at 4:19


















    26














    TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.





    Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.



    Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)



    His whole series of FP articles is excellent; index in this one.






    that would cost very little performance to use (since the hardware implemented it whether you used it or not)




    This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but runtime variables usually can't make any assumptions.



    Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.





    • 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support using 32 or 64-bit float/double memory operands.



      So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra"). On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double.



      And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.




    • fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double).



      fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].



      If you want to store 80-bit long double, fstp is your only option. You might need use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)




    80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.



    Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.





    • P5 Pentium (in-order pipelined dual issue superscalar):





      • fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxchg.


      • fld m80 : 3 cycles, not pairable, and (unlike fadd / fmul which are pipelined), not overlapable with later FP or integer instructions.


      • fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapable


      • fstp m80: (note only available in pop version that frees the x87 register): 3 cycles, not pairable




    • P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)

      (Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)





      • fld m32/m64 is 1 uop for the load port.


      • fld m80 : 4 uops total: 2 ALU p0, 2 load port


      • fst(p) m32/m64 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)


      • fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU extract into 64-bit and 16-bit chunks, as inputs for 2 stores.




    Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)



    Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.




    • Pentium-M: 1 per 3 cycle throughput for fstp m80 6 uops. vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.

    • Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops)
      fstp m80 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.

    • Pentium 4 (pre-Prescott): fld m80 3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.
      fstp m80: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar

    • Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.
      fstp m80: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.





    • AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64).
      fstp m80: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.


    • AMD Bulldozer: fld m80: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regular float/double x87 loads have half throughput of SSE2 / AVX loads.
      fstp m80: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.

      (Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)



    There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.



    Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So it can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.





    So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.



    x87 and MMX are de-prioritized, though, e.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.



    (If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).








    yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.




    No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.



    Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.



    Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.




    • https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy

    • https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base


    x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.



    For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.





    When do compilers round:



    Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)



    So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.



    When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.



    But even less predictable is when the compiler decides to spill something because it's running out of registers. Or when you compile with/without optimization. gcc -ffloat-store does this store/reload variables to the declared precision between statements even when optimization is enabled. (Not temporaries within the evaluation of one expression.) So for FP variables, kind of like debug-mode code-gen where vars are treated similar to volatile.



    But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.





    Extended precision long double is still an option



    (Obviously long double will prevent auto-vectorization, so only use it if you need it when writing modern code.)



    Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).



    Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double.



    Beware long double is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform. But they chose to make long double a 10-byte type despite MSVC making it the same as 8-byte double.



    GCC has a -mlong-double-64/80/128 x86 option to set the width of long double, and the docs warn that it changes the ABI.



    ICC has a /Qlong-double option that makes long double an 80-bit type even on Windows.



    So functions that interact with any kind of long double are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double works, except as a function return value in st0 where it's in a register already.





    If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option, and gives you extra range as well as significand precision, and only requires 1 instruction per computation).



    (On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)





    Further fun facts:



    The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:




    • to 80-bit long double: 64-bit significand precision. The finit default, and normal setting except with MSVC.

    • to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.

    • to 24-bit float: 24-bit significand precision.


    Apparently Visual C++'s CRT startup code (that calls main) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.



    So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80 would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld decodes to multiple ALU uops on modern CPUs.)



    I don't know if the motivation was to speed up fdiv and fsqrt (which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.



    Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos/fsin and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)



    Of course you can set it back to 64-bit significand with _controlfp_s, so you could useful use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.






    share|improve this answer





















    • 1





      In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.

      – supercat
      Apr 19 at 21:30








    • 1





      @supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't default to promoting locals and temporaries to 80-bit by default. (An option to do that would be possible, but GCC (still) doesn't have one. See discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.

      – Peter Cordes
      Apr 19 at 21:53






    • 1





      If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.

      – supercat
      Apr 19 at 22:07






    • 2





      @supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And must be empty on call, and empty or holding return value on ret). Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use give you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.

      – Peter Cordes
      Apr 19 at 22:12








    • 3





      the option in GCC is -mlong-double-64/80/128. There's also a warning under them saying if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.

      – phuclv
      Apr 20 at 4:19
















    26












    26








    26







    TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.





    Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.



    Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)



    His whole series of FP articles is excellent; index in this one.






    that would cost very little performance to use (since the hardware implemented it whether you used it or not)




    This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but runtime variables usually can't make any assumptions.



    Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.





    • 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support using 32 or 64-bit float/double memory operands.



      So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra"). On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double.



      And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.




    • fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double).



      fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].



      If you want to store 80-bit long double, fstp is your only option. You might need use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)




    80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.



    Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.





    • P5 Pentium (in-order pipelined dual issue superscalar):





      • fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxchg.


      • fld m80 : 3 cycles, not pairable, and (unlike fadd / fmul which are pipelined), not overlapable with later FP or integer instructions.


      • fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapable


      • fstp m80: (note only available in pop version that frees the x87 register): 3 cycles, not pairable




    • P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)

      (Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)





      • fld m32/m64 is 1 uop for the load port.


      • fld m80 : 4 uops total: 2 ALU p0, 2 load port


      • fst(p) m32/m64 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)


      • fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU extract into 64-bit and 16-bit chunks, as inputs for 2 stores.




    Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)



    Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.




    • Pentium-M: 1 per 3 cycle throughput for fstp m80 6 uops. vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.

    • Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops)
      fstp m80 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.

    • Pentium 4 (pre-Prescott): fld m80 3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.
      fstp m80: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar

    • Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.
      fstp m80: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.





    • AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64).
      fstp m80: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.


    • AMD Bulldozer: fld m80: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regular float/double x87 loads have half throughput of SSE2 / AVX loads.
      fstp m80: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.

      (Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)



    There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.



    Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So it can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.





    So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.



    x87 and MMX are de-prioritized, though, e.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.



    (If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).








    yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.




    No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.



    Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.



    Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.




    • https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy

    • https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base


    x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.



    For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.





    When do compilers round:



    Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)



    So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.



    When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.



    But even less predictable is when the compiler decides to spill something because it's running out of registers. Or when you compile with/without optimization. gcc -ffloat-store does this store/reload variables to the declared precision between statements even when optimization is enabled. (Not temporaries within the evaluation of one expression.) So for FP variables, kind of like debug-mode code-gen where vars are treated similar to volatile.



    But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.





    Extended precision long double is still an option



    TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anyway.





    Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.



    Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand, which is what MSVC does without SSE/SSE2.)



    His whole series of FP articles is excellent; index in this one.






    that would cost very little performance to use (since the hardware implemented it whether you used it or not)




    This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands 80-bit values are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing; something like Mandelbrot iterations is a rare exception at the upper end of computational intensity. Some constants that are exactly representable can be stored as float without precision loss, but you usually can't assume that about runtime variables.



    Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.





    • 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support using 32 or 64-bit float/double memory operands.



      So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra"). On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double.



      And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.




    • fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double).



      fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].



    If you want to store an 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp m80 to store and pop that copy. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)




    80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.



    Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.





    • P5 Pentium (in-order pipelined dual issue superscalar):





      • fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.

      • fld m80: 3 cycles, not pairable, and (unlike fadd / fmul, which are pipelined) not overlappable with later FP or integer instructions.

      • fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlappable.

      • fstp m80 (note: only available in the pop version that frees the x87 register): 3 cycles, not pairable.




    • P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)

      (Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)





      • fld m32/m64 is 1 uop for the load port.


      • fld m80 : 4 uops total: 2 ALU p0, 2 load port


      • fst(p) m32/m64 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)


      • fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess the ALU uops extract the value into 64-bit and 16-bit chunks as inputs for the 2 stores.




    Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)



    Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.




    • Pentium-M: 1 per 3 cycle throughput for fstp m80 6 uops. vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.

    • Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops)
      fstp m80 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.

    • Pentium 4 (pre-Prescott): fld m80 3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.
      fstp m80: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar

    • Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.
      fstp m80: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.





    • AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64).
      fstp m80: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.


    • AMD Bulldozer: fld m80: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regular float/double x87 loads have half throughput of SSE2 / AVX loads.
      fstp m80: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.

      (Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)



    There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.



    Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So fld m80 can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 which has to expand the significand/exponent fields.





    So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.



    x87 and MMX are de-prioritized, though, e.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.



    (If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).








    yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.




    No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.



    Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.



    Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.




    • https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy

    • https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base


    x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.
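    As a small, hedged illustration of that contraction point (the function names are just for this example, and whether the first version actually contracts depends on the compiler and flags such as gcc/clang's -ffp-contract):

        #include <math.h>

        /* With contraction enabled, the compiler may fuse the multiply and add
           into a single FMA, skipping the intermediate rounding of x*y. */
        double mad_maybe_contracted(double x, double y, double z) {
            return x * y + z;        /* may compile to one fused multiply-add */
        }

        /* Requesting the fused operation explicitly: exactly one rounding. */
        double mad_fused(double x, double y, double z) {
            return fma(x, y, z);
        }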



    For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
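    To make that concrete, here's a minimal textbook Kahan-summation sketch (my example, not code from any particular project):

        #include <stddef.h>

        /* Compensated summation: c captures the rounding error of each sum + y,
           assuming every intermediate result is rounded to double.  If the
           compiler keeps sum, y, t and c at 80-bit x87 precision (or reassociates
           the expressions), c no longer tracks the error of the double-precision
           adds and the compensation can be silently lost. */
        double kahan_sum(const double *x, size_t n) {
            double sum = 0.0, c = 0.0;
            for (size_t i = 0; i < n; i++) {
                double y = x[i] - c;       /* apply the previous correction */
                double t = sum + y;        /* big + small: low bits of y are lost */
                c = (t - sum) - y;         /* algebraically 0; actually the lost bits */
                sum = t;
            }
            return sum;
        }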





    When do compilers round:



    Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)



    So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.



    When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.
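    A hedged sketch of what that looks like in practice, assuming x87 code generation (FLT_EVAL_METHOD == 2):

        #include <math.h>

        double example(double a, double b) {
            double t = a * b + a;   /* may be evaluated and kept at 80-bit in a register */
            double s = sinl(t);     /* the whole x87 stack is call-clobbered, so t (still
                                       live afterwards) is spilled/reloaded as a 64-bit
                                       double around the call; its extra bits are gone */
            return s + t;
        }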



    But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with vs. without optimization. gcc -ffloat-store does this (stores/reloads variables at their declared precision between statements) even when optimization is enabled, though not for temporaries within the evaluation of a single expression. So for FP variables it's kind of like debug-mode code-gen, where variables are treated somewhat like volatile.



    But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.





    Extended precision long double is still an option



    (Obviously long double will prevent auto-vectorization, so in modern code only use it if you actually need it.)



    Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).



    Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double.



    Beware that long double is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct-layout rules of the platform, but they chose to make long double a 10-byte type even though MSVC makes it the same as 8-byte double.



    GCC has a -mlong-double-64/80/128 x86 option to set the width of long double, and the docs warn that it changes the ABI.



    ICC has a /Qlong-double option that makes long double an 80-bit type even on Windows.



    So functions that interact with any kind of long double are not ABI compatible between MSVC and other compilers (except GCC or ICC with the special options above); they expect a different-sized object, so you can't even pass a single long double between them, except as a function return value in st0 where it's already in a register.
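    A quick way to see the type-width difference on whatever compiler you're using (the numbers in the comment are what I'd expect, not guarantees):

        #include <float.h>
        #include <stdio.h>

        int main(void) {
            /* MSVC: sizeof(long double) == 8 and LDBL_MANT_DIG == 53 (same as double).
               gcc/clang targeting x86 typically report 12 (i386) or 16 (x86-64) bytes
               of storage holding a 10-byte x87 value, with LDBL_MANT_DIG == 64. */
            printf("sizeof(long double) = %zu\n", sizeof(long double));
            printf("LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG);
            return 0;
        }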





    If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option: it gives you extra range as well as significand precision, and only requires 1 instruction per computation.
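    For example, a minimal sketch (mine, not from any particular codebase) of deliberately using the x87 hardware for an extended-precision accumulator:

        /* On targets where long double is the 80-bit x87 type, each step here is
           just an fmul + fadd at 64-bit significand precision, with one final
           rounding to binary64 at the end. */
        double dot_extended(const double *a, const double *b, int n) {
            long double acc = 0.0L;
            for (int i = 0; i < n; i++)
                acc += (long double)a[i] * b[i];
            return (double)acc;
        }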



    (On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)
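    That exact-product trick looks like this as a sketch (assuming FMA hardware; high + low equals the true product exactly by construction):

        #include <math.h>

        /* Split a double*double product into a rounded result plus its rounding
           error: high = round(a*b), low = a*b - high computed exactly via FMA. */
        static void two_prod(double a, double b, double *high, double *low) {
            *high = a * b;
            *low  = fma(a, b, -*high);
        }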





    Further fun facts:



    The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:




    • to 80-bit long double: 64-bit significand precision. The finit default, and normal setting except with MSVC.

    • to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.

    • to 32-bit float: 24-bit significand precision.


    Apparently Visual C++'s CRT startup code (the code that calls main) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double), and x86 (32-bit) VS2012 and later still do this, if I'm reading Bruce Dawson's article correctly.



    So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80 would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld decodes to multiple ALU uops on modern CPUs.)



    I don't know if the motivation was to speed up fdiv and fsqrt (which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.



    Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos/fsin and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)



    Of course you can set it back to the 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
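    For instance, something along these lines with the MSVC CRT (a sketch; check the _controlfp_s docs, and note the precision-control field only applies to x87, i.e. 32-bit code):

        #include <float.h>   /* _controlfp_s, _PC_64, _MCW_PC (MSVC CRT) */

        /* Restore the full 64-bit-significand (80-bit extended) setting on the
           current thread, undoing the CRT's 53-bit default. */
        void enable_x87_extended_precision(void)
        {
            unsigned int prev;
            _controlfp_s(&prev, _PC_64, _MCW_PC);
        }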







    answered Apr 19 at 12:06









    Peter Cordes

    • 1





      In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.

      – supercat
      Apr 19 at 21:30








    • 1





      @supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't promote locals and temporaries to 80-bit by default. (An option to do that would be possible, but GCC (still) doesn't have one. See discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)

      – Peter Cordes
      Apr 19 at 21:53






    • 1





      If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.

      – supercat
      Apr 19 at 22:07






    • 2





      @supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And must be empty on call, and empty or holding the return value on ret.) Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.

      – Peter Cordes
      Apr 19 at 22:12








    • 3





      the option in GCC is -mlong-double-64/80/128. There's also a warning under them saying if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.

      – phuclv
      Apr 20 at 4:19
















    16














    I worked for Borland back in the days of the 8086/8087. Back then, both Turbo C and Microsoft C defined long double as an 80-bit type, matching the layout of Intel's 80-bit floating-point type. Some years later, when Microsoft got cross-hardware religion (maybe at the same time as they released Windows NT?) they changed their compiler to make long double a 64-bit type. To the best of my recollection, Borland continued to use 80 bits.






    answered Apr 19 at 11:47









    Pete Becker

    • 4





      Microsoft x86 16-bit tool sets supported 80-bit long doubles. This was dropped in their x86 32-bit and x86-64 tool sets. Win32s for Windows 3.1 was released about the same time as NT 3.1. I'm not sure if Windows 3.1 winmem32 was released before or after Win32s.

      – rcgldr
      Apr 19 at 13:46








    • 2





      C89 botched "long double" by violating the fundamental principle that all floating-point values passed to non-prototyped functions get converted to a common type. If it had specified that values of long double get converted to double except when wrapped using a special macro which would pass them in a struct __wrapped_long_double, that would have avoided the need for something like printf("%10.4f", x*y); to care about the types of both x and y [since the value isn't wrapped, the value would get passed to double regardless of the types of x and y].

      – supercat
      Apr 19 at 21:37






    • 1





      IIRC Delphi 5 (and probably also 3,4,6, and 7) had the "Extended" type which used all 80 bits of the FPU registers. The generic "Real" type could be made an alias of that, of the 64-bit Double, or of a legacy Borland soft float format.

      – cyco130
      Apr 20 at 11:04








    • 2





      @MichaelKarcher: The introduction of long int came fairly late in the development of C, and caused considerable problems. Nonetheless, I think a fundamental difference between the relationship of long int and int, vs. long double and double, is that every value within the range of int can be represented just as accurately by that type as by any larger type. Thus, if scale_factor will never exceed the range of int, there would generally be no reason for it to be declared as a larger type. On the other hand, if one writes double one_tenth=0.1;, ...

      – supercat
      Apr 20 at 16:23








    • 2





      ...and then computes double x=one_tenth*y;, the calculation may be less precise than if one had written double x=y/10.0; or used long double scale_factor=0.1L. If neither wholeQuantity1 nor wholeQuantity2 would need to accommodate values outside the range of int, the expression wholeQuantity1+wholeQuantity2 will likely be of type int or unsigned. But in many cases involving floating-point, there would be some advantage to using longer-precision scale factors.

      – supercat
      Apr 20 at 16:29














    7














    The Gnu Ada compiler ("Gnat") has supported 80-bit floating point as a fully-fledged built-in type with its Long_Long_Float type since at least 1998.



    Here's a Usenet argument from February 1999 between Ada compiler vendors and users about whether not supporting 80-bit floats is an Ada LRM violation. This was a huge deal for compiler vendors, since a non-conforming compiler couldn't be used on many government contracts, and the rest of the Ada userbase at that time viewed the Ada LRM as the next best thing to holy writ.*




    To take a simple example, an x86 compiler that does not support 80-bit
    IEEE extended arithmetic is clearly violates B.2(10):




    10 Floating point types corresponding to each floating
    point format fully supported by the hardware.




    and is thus non-conformant. It will still be fully validatable, since
    this is not the sort of thing the validation can test with automated
    tests.




    ...




    P.S. Just to ensure that people do not regard the above as special
    pleading for non-conformances in GNAT, please be sure to realize that
    GNAT does support 80-bit float on the ia32 (x86).




    Since this is a GCC-based compiler, it's debatable whether this is a revelation over the current top-rated answer, but I didn't see it mentioned.



    * - It may look silly, but this user attitude kept Ada source code extremely portable. The only other languages that really can compare are ones that are effectively defined by the behavior of a single reference implementation, or under the control of a single developer.






        answered Apr 19 at 13:35









        T.E.D.

        61125




        61125























            6















            Did any compilers ever make full use of extended precision (i.e. 80
            bits in memory as well as in registers)? If not, why not?




            Since any calculations inside the x87 fpu have 80bit precision by default, any compiler that's able to generate x87 fpu code, is already using extended precision.
            I also remember using long double even in 16-bit compilers for real mode.



            The very similar situation was in 68k world, with FPUs like 68881 and 68882 supporting 80bit precision by default and any FPU code without special precautions would keep all register values in that precision. There was also long double datatype.




            On the other hand, Intel provided a solution with an extra eleven bits
            of precision and five bits of exponent, that would cost very little
            performance to use (since the hardware implemented it whether you used
            it or not), and yet everyone seemed to behave as though this had no
            value




            The usage of long double would prevent contemporary compilers from ever making calculations using SSE/whatever registers and instructions. And SSE is actually a very fast engine, able to fetch data in large chunks and make several computations in parallel, every clock. The x87 fpu now is just a legacy, not being very fast. So the deliberate usage of 80bit precision now would be certainly a huge performance hit.






answered Apr 19 at 7:23 by lvd
            • Right, I was talking about the historical context in which x87 was the only FPU on x86, so no performance hit from using it. Good point about 68881 being a very similar architecture.

              – rwallace
              Apr 19 at 8:12















