Csdrhrt

Question

The following Java program takes on average between 0.50s and 0.55s to run:

public static void main(String args) {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");

    System.out.println("n = " + n);

}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?

I ran each version of the program 15 times, alternating between the two. Here are the results:

 2*(i*i)  |  2*i*i

----------+----------

0.5183738 | 0.6246434

0.5298337 | 0.6049722

0.5308647 | 0.6603363

0.5133458 | 0.6243328

0.5003011 | 0.6541802

0.5366181 | 0.6312638

0.515149  | 0.6241105

0.5237389 | 0.627815

0.5249942 | 0.6114252

0.5641624 | 0.6781033

0.538412  | 0.6393969

0.5466744 | 0.6608845

0.531159  | 0.6201077

0.5048032 | 0.6511559

0.5232789 | 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.

I get similar results (slightly different numbers, but definitely noticeable and consistent gap, definitely more than sampling error) — Nov 23 at 20:47
@Krease Good that you caught my mistake. According to the new benchmark I ran 2 * i * i is slower. I'll try running with Graal as well. — Nov 23 at 21:07
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs. — Nov 23 at 21:29
You could rename your question to "Why is i * i * 2 faster than 2 * i * i?" for improved clarity that the issue is on the order of the operations. — Nov 28 at 4:02

Ian Kemp 16.4k126797 · Accepted Answer · 2018-11-27 09:18:36Z

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT)¹.

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.

But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)²:

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]

  vmovdqa ymm3, YMMWORD PTR .LC1[rip]

  xor eax, eax

  vpxor xmm2, xmm2, xmm2

.L2:

  vpmulld ymm1, ymm0, ymm0

  inc eax

  vpaddd ymm0, ymm0, ymm3

  vpslld ymm1, ymm1, 1

  vpaddd ymm2, ymm2, ymm1

  cmp eax, 125000000      ; 8 calculations per iteration

  jne .L2

  vmovdqa xmm0, xmm2

  vextracti128 xmm2, ymm2, 1

  vpaddd xmm2, xmm0, xmm2

  vpsrldq xmm0, xmm2, 8

  vpaddd xmm0, xmm2, xmm0

  vpsrldq xmm1, xmm0, 4

  vpaddd xmm0, xmm0, xmm1

  vmovd eax, xmm0

  vzeroupper

With run times:

SSE: 0.24 s, or 2 times faster.

AVX: 0.15 s, or 3 times faster.

AVX2: 0.08 s, or 5 times faster.

¹_{To get JIT generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly}

²_{The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.}

The single biggest problem the optimizer encounters in the C example is the undefined behavior invoked by signed integer overflow. Which, otherwise, would probably result in simply loading a constant as the whole loop can be calculated at compiletime. — Nov 25 at 18:28
@Damon Why would undefined behavior be a problem for the optimizer? If the optimizer sees it overflows when trying to calculate the result, that just means it can optimize it however it wants, because the behavior is undefined. — Nov 25 at 18:51
@Runemoro: if the optimizer proves that calling the function will inevitably result in undefined behaviour, it could choose to assume that the function will never be called, and emit no body for it. Or emit just a ret instruction, or emit a label and no ret instruction so execution just falls through. GCC does in fact behave this was sometimes when it encounters UB, though. For example: why ret disappear with optimization?. You definitely want to compile well-formed code to be sure the asm is sane. — Nov 25 at 22:26
It's probably just a front-end uop throughput bottleneck because of the inefficient code-gen. It's not even using LEA as a peephole for mov / add-immediate. e.g. movl RBX, R9 / addl RBX, #8 should be leal ebx, [r9 + 8], 1 uop to copy-and-add. Or leal ebx, [r9 + r9 + 16] to do ebx = 2*(r9+8). So yeah, unrolling to the point of spilling is dumb, and so is naive braindead codegen that doesn't take advantage of integer identities and associative integer math. — Nov 25 at 22:38
Vectorization for sequential reduction was disabled in C2 (bugs.openjdk.java.net/browse/JDK-8078563), but is now being considered for re-enabling (bugs.openjdk.java.net/browse/JDK-8188313). — Nov 30 at 14:31

Runemoro 2,58521239 · Accepted Answer · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — Nov 23 at 22:27
I think on the optimizedVersion, it should be n *= 2000000000; — Nov 24 at 1:19
@StefansArya - No. Consider the case where the limit is 4, and we are trying to calculate 2*1*1 + 2*2*2 + 2*3*3. It is obvious that calculating 1*1 + 2*2 + 3*3 and multiplying by 2 is correct, whereas multiply by 8 would not be. — Nov 26 at 15:57
The math equation was just like this 2(1²) + 2(2²) + 2(3²) = 2(1² + 2² + 3²). That was very simple and I just forgot it because the loop increment. — Nov 26 at 17:22
If you print out the assembly using a debug jvm, this does not appear to be correct. You will see a bunch of sall ... ,#1, which are multiplies by 2, in the loop. Interestingly, the slower version does not appear to have multiplies in the loop. — Dec 1 at 4:53

score 39 · Accepted Answer · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

But why is push-push-multiply-multiply faster than push-multiply-push-multiply? — Dec 1 at 12:04

Puzzled 45632 · Accepted Answer · 2018-11-25 18:18:01Z

Kasperd asked in a comment of the accepted answer:

The Java and C examples use quite different register names. Are both example using the AMD64 ISA?

xor edx, edx

xor eax, eax

.L2:

mov ecx, edx

imul ecx, edx

add edx, 1

lea eax, [rax+rcx*2]

cmp edx, 1000000000

jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

score 28 · Accepted Answer · 2018-12-03 11:11:50Z

While not directly related to the question's environment, just for the curiosity, I did the same test on .Net Core 2.1, x64, release mode.
Here is the interesting result, confirming similar phonemenia (other way around) happening over the dark side of the force. Code:

static void Main(string args)

    {

        Stopwatch watch = new Stopwatch();



        Console.WriteLine("2 * (i * i)");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * (i * i);

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }



        Console.WriteLine();

        Console.WriteLine("2 * i * i");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * i * i;

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }

    }

Result:

2 * (i * i)

result:119860736, 438ms

result:119860736, 433ms

result:119860736, 437ms

result:119860736, 435ms

result:119860736, 436ms

result:119860736, 435ms

result:119860736, 435ms

result:119860736, 439ms

result:119860736, 436ms

result:119860736, 437ms

2 * i * i

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 418ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 416ms

result:119860736, 417ms

result:119860736, 418ms

While this isn't an answer to the question, it does add value. That being said, if something is vital to your post, please in-line it in the post rather than linking to an off-site resource. Links go dead. — Nov 28 at 13:54
@JaredSmith Thanks for the feedback. Considering the link you mention is the "result" link, that image is not an off-site source. I uploaded it to the stackoverflow via its own panel. — Nov 28 at 14:32
It's a link to imgur, so yes, it is, it doesn't matter how you added the link. I fail to see what's so difficult about copy-pasting some console output. — Nov 28 at 14:49
@SamB it's still on the imgur.com domain, which means it'll survive only for as long as imgur. — Dec 1 at 10:22

paulsm4 77k999126 · Accepted Answer · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — Nov 23 at 21:11
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — Nov 23 at 21:14
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — Nov 23 at 21:16
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — Nov 23 at 22:42
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — Nov 24 at 8:06

NoDataFound 5,6671740 · Accepted Answer · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

AFAICT bytecode is pretty irrelevant for performance, and I wouldn't try to estimate what's faster based on it. It's just the source code for the JIT compiler... sure can meaning-preserving reordering source code lines change the resulting code and it's efficiency, but that all pretty unpredictable. — Nov 26 at 2:33

GhostCat 88.1k1684144 · Accepted Answer · 2018-11-30 21:07:23Z

More of an addendum. I did repro the experiment using the latest Java 8 JVM from IBM:

java version "1.8.0_191"

Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))

Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

and this shows very similar results:

0.374653912 s

n = 119860736

0.447778698 s

n = 119860736

( second results using 2 * i * i ).

Interestingly enough, when running on the same machine, but using Oracle java:

Java version "1.8.0_181"

Java(TM) SE Runtime Environment (build 1.8.0_181-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

results are on average a bit slower:

0.414331815 s

n = 119860736

0.491430656 s

n = 119860736

Long story short: even the minor version number of HotSpot matter here, as subtle differences within the JIT implementation can have notable effects.

score 12 · Accepted Answer · 2018-12-03 19:23:50Z

Interesting observation using Java 11 and switching off loop unrolling with the following VM option:

-XX:LoopUnrollLimit=0

The loop with the 2 * (i * i) expression results in a more compact native code¹:

L0001: add    eax,r11d

       inc    r8d

       mov    r11d,r8d

       imul   r11d,r8d

       shl    r11d,1h

       cmp    r8d,r10d

       jl     L0001

in comparison with the 2 * i * i version:

L0001: add    eax,r11d

       mov    r11d,r8d

       shl    r11d,1h

       add    r11d,2h

       inc    r8d

       imul   r11d,r8d

       cmp    r8d,r10d

       jl     L0001

Java version:

java version "11" 2018-09-25

Java(TM) SE Runtime Environment 18.9 (build 11+28)

Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

Benchmark results:

Benchmark          (size)  Mode  Cnt    Score     Error  Units

LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op

LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

Benchmark source code:

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MILLISECONDS)

@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)

@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)

@State(Scope.Thread)

@Fork(1)

public class LoopTest {



    @Param("1000000000") private int size;



    public static void main(String args) throws RunnerException {

        Options opt =

            new OptionsBuilder().include(LoopTest.class.getSimpleName())

                                .jvmArgs("-XX:LoopUnrollLimit=0")

                                .build();

        new Runner(opt).run();

    }



    @Benchmark

    public int slow() {

        int n = 0;

        for (int i = 0; i < size; i++) {

            n += 2 * i * i;

        }

        return n;

    }



    @Benchmark

    public int fast() {

        int n = 0;

        for (int i = 0; i < size; i++) {

            n += 2 * (i * i);

        }

        return n;

    }

}

^{1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0}

Wow, that's some braindead asm. Instead of incrementing i before copying it to calculate 2*i, it does it after so it needs an extra add r11d,2 instruction. (Plus it misses the add same,same peephole instead of shl by 1 (add runs on more ports). It also misses an LEA peephole for x*2 + 2 (lea r11d, [r8*2 + 2]) if it really wants to do things in that order for some crazy instruction-scheduling reason. We could already see from the unrolled version that missing out on LEA was costing it a lot of uops, same as both loops here. — Dec 2 at 2:50
lea eax, [rax + r11 * 2] would replace 2 instructions (in both loops) if the JIT compiler had time to look for that optimization in long-running loops. Any decent ahead-of-time compiler would find it. (Unless maybe tuning only for AMD, where scaled-index LEA has 2 cycle latency so maybe not worth it.) — Dec 2 at 2:51

score 3 · Accepted Answer · 2018-12-23 14:04:23Z

The two methods of adding do generate slightly different byte code:

  17: iconst_2

  18: iload         4

  20: iload         4

  22: imul

  23: imul

  24: iadd

For 2 * (i * i) vs:

  17: iconst_2

  18: iload         4

  20: imul

  21: iload         4

  23: imul

  24: iadd

For 2 * i * i.

And when using a JMH benchmark like this:

@Warmup(iterations = 5, batchSize = 1)

@Measurement(iterations = 5, batchSize = 1)

@Fork(1)

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MILLISECONDS)

@State(Scope.Benchmark)

public class MyBenchmark {



    @Benchmark

    public int noBrackets() {

        int n = 0;

        for (int i = 0; i < 1000000000; i++) {

            n += 2 * i * i;

        }

        return n;

    }



    @Benchmark

    public int brackets() {

        int n = 0;

        for (int i = 0; i < 1000000000; i++) {

            n += 2 * (i * i);

        }

        return n;

    }



}

The difference is clear:

# JMH version: 1.21

# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28

# VM options: <none>



Benchmark                      (n)  Mode  Cnt    Score    Error  Units

MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op

MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup, see How do I write a correct micro-benchmark in Java?)

Running again with Graal:

# JMH version: 1.21

# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28

# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler



Benchmark                      (n)  Mode  Cnt    Score    Error  Units

MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op

MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

You see that the results are much closer, which makes sense, since Graal is an overall better performing, more modern, compiler.

So this is really just up to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason to it.

uhrrr.. your noBrackets is using brackets..
– Krease
Nov 23 at 20:55 — Nov 23 at 20:55
@Krease Oops, let me fix that.
– Jorn Vernee
Nov 23 at 20:56 — Nov 23 at 20:56

Ian Kemp 16.4k126797 · Accepted Answer · 2018-11-27 09:18:36Z

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT)¹.

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.

But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)²:

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]

  vmovdqa ymm3, YMMWORD PTR .LC1[rip]

  xor eax, eax

  vpxor xmm2, xmm2, xmm2

.L2:

  vpmulld ymm1, ymm0, ymm0

  inc eax

  vpaddd ymm0, ymm0, ymm3

  vpslld ymm1, ymm1, 1

  vpaddd ymm2, ymm2, ymm1

  cmp eax, 125000000      ; 8 calculations per iteration

  jne .L2

  vmovdqa xmm0, xmm2

  vextracti128 xmm2, ymm2, 1

  vpaddd xmm2, xmm0, xmm2

  vpsrldq xmm0, xmm2, 8

  vpaddd xmm0, xmm2, xmm0

  vpsrldq xmm1, xmm0, 4

  vpaddd xmm0, xmm0, xmm1

  vmovd eax, xmm0

  vzeroupper

With run times:

SSE: 0.24 s, or 2 times faster.

AVX: 0.15 s, or 3 times faster.

AVX2: 0.08 s, or 5 times faster.

¹_{To get JIT generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly}

²_{The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.}

The single biggest problem the optimizer encounters in the C example is the undefined behavior invoked by signed integer overflow. Which, otherwise, would probably result in simply loading a constant as the whole loop can be calculated at compiletime. — Nov 25 at 18:28
@Damon Why would undefined behavior be a problem for the optimizer? If the optimizer sees it overflows when trying to calculate the result, that just means it can optimize it however it wants, because the behavior is undefined. — Nov 25 at 18:51
@Runemoro: if the optimizer proves that calling the function will inevitably result in undefined behaviour, it could choose to assume that the function will never be called, and emit no body for it. Or emit just a ret instruction, or emit a label and no ret instruction so execution just falls through. GCC does in fact behave this was sometimes when it encounters UB, though. For example: why ret disappear with optimization?. You definitely want to compile well-formed code to be sure the asm is sane. — Nov 25 at 22:26
It's probably just a front-end uop throughput bottleneck because of the inefficient code-gen. It's not even using LEA as a peephole for mov / add-immediate. e.g. movl RBX, R9 / addl RBX, #8 should be leal ebx, [r9 + 8], 1 uop to copy-and-add. Or leal ebx, [r9 + r9 + 16] to do ebx = 2*(r9+8). So yeah, unrolling to the point of spilling is dumb, and so is naive braindead codegen that doesn't take advantage of integer identities and associative integer math. — Nov 25 at 22:38
Vectorization for sequential reduction was disabled in C2 (bugs.openjdk.java.net/browse/JDK-8078563), but is now being considered for re-enabling (bugs.openjdk.java.net/browse/JDK-8188313). — Nov 30 at 14:31

Runemoro 2,58521239 · Accepted Answer · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — Nov 23 at 22:27
I think on the optimizedVersion, it should be n *= 2000000000; — Nov 24 at 1:19
@StefansArya - No. Consider the case where the limit is 4, and we are trying to calculate 2*1*1 + 2*2*2 + 2*3*3. It is obvious that calculating 1*1 + 2*2 + 3*3 and multiplying by 2 is correct, whereas multiply by 8 would not be. — Nov 26 at 15:57
The math equation was just like this 2(1²) + 2(2²) + 2(3²) = 2(1² + 2² + 3²). That was very simple and I just forgot it because the loop increment. — Nov 26 at 17:22
If you print out the assembly using a debug jvm, this does not appear to be correct. You will see a bunch of sall ... ,#1, which are multiplies by 2, in the loop. Interestingly, the slower version does not appear to have multiplies in the loop. — Dec 1 at 4:53

score 39 · Accepted Answer · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

But why is push-push-multiply-multiply faster than push-multiply-push-multiply? — Dec 1 at 12:04

Puzzled 45632 · Accepted Answer · 2018-11-25 18:18:01Z

Kasperd asked in a comment of the accepted answer:

The Java and C examples use quite different register names. Are both example using the AMD64 ISA?

xor edx, edx

xor eax, eax

.L2:

mov ecx, edx

imul ecx, edx

add edx, 1

lea eax, [rax+rcx*2]

cmp edx, 1000000000

jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

score 28 · Accepted Answer · 2018-12-03 11:11:50Z

While not directly related to the question's environment, just for the curiosity, I did the same test on .Net Core 2.1, x64, release mode.
Here is the interesting result, confirming similar phonemenia (other way around) happening over the dark side of the force. Code:

static void Main(string args)

    {

        Stopwatch watch = new Stopwatch();



        Console.WriteLine("2 * (i * i)");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * (i * i);

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }



        Console.WriteLine();

        Console.WriteLine("2 * i * i");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * i * i;

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }

    }

Result:

2 * (i * i)

result:119860736, 438ms

result:119860736, 433ms

result:119860736, 437ms

result:119860736, 435ms

result:119860736, 436ms

result:119860736, 435ms

result:119860736, 435ms

result:119860736, 439ms

result:119860736, 436ms

result:119860736, 437ms

2 * i * i

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 418ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 416ms

result:119860736, 417ms

result:119860736, 418ms

While this isn't an answer to the question, it does add value. That being said, if something is vital to your post, please in-line it in the post rather than linking to an off-site resource. Links go dead. — Nov 28 at 13:54
@JaredSmith Thanks for the feedback. Considering the link you mention is the "result" link, that image is not an off-site source. I uploaded it to the stackoverflow via its own panel. — Nov 28 at 14:32
It's a link to imgur, so yes, it is, it doesn't matter how you added the link. I fail to see what's so difficult about copy-pasting some console output. — Nov 28 at 14:49
@SamB it's still on the imgur.com domain, which means it'll survive only for as long as imgur. — Dec 1 at 10:22

paulsm4 77k999126 · Accepted Answer · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — Nov 23 at 21:11
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — Nov 23 at 21:14
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — Nov 23 at 21:16
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — Nov 23 at 22:42
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — Nov 24 at 8:06

NoDataFound 5,6671740 · Accepted Answer · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

AFAICT bytecode is pretty irrelevant for performance, and I wouldn't try to estimate what's faster based on it. It's just the source code for the JIT compiler... sure can meaning-preserving reordering source code lines change the resulting code and it's efficiency, but that all pretty unpredictable. — Nov 26 at 2:33

GhostCat 88.1k1684144 · Accepted Answer · 2018-11-30 21:07:23Z

More of an addendum. I did repro the experiment using the latest Java 8 JVM from IBM:

java version "1.8.0_191"

Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))

Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

and this shows very similar results:

0.374653912 s

n = 119860736

0.447778698 s

n = 119860736

( second results using 2 * i * i ).

Interestingly enough, when running on the same machine, but using Oracle java:

Java version "1.8.0_181"

Java(TM) SE Runtime Environment (build 1.8.0_181-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

results are on average a bit slower:

0.414331815 s

n = 119860736

0.491430656 s

n = 119860736

Long story short: even the minor version number of HotSpot matter here, as subtle differences within the JIT implementation can have notable effects.

score 12 · Accepted Answer · 2018-12-03 19:23:50Z

Interesting observation using Java 11 and switching off loop unrolling with the following VM option:

-XX:LoopUnrollLimit=0

The loop with the 2 * (i * i) expression results in a more compact native code¹:

L0001: add    eax,r11d

       inc    r8d

       mov    r11d,r8d

       imul   r11d,r8d

       shl    r11d,1h

       cmp    r8d,r10d

       jl     L0001

in comparison with the 2 * i * i version:

L0001: add    eax,r11d

       mov    r11d,r8d

       shl    r11d,1h

       add    r11d,2h

       inc    r8d

       imul   r11d,r8d

       cmp    r8d,r10d

       jl     L0001

Java version:

java version "11" 2018-09-25

Java(TM) SE Runtime Environment 18.9 (build 11+28)

Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

Benchmark results:

Benchmark          (size)  Mode  Cnt    Score     Error  Units

LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op

LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

Benchmark source code:

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MILLISECONDS)

@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)

@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)

@State(Scope.Thread)

@Fork(1)

public class LoopTest {



    @Param("1000000000") private int size;



    public static void main(String args) throws RunnerException {

        Options opt =

            new OptionsBuilder().include(LoopTest.class.getSimpleName())

                                .jvmArgs("-XX:LoopUnrollLimit=0")

                                .build();

        new Runner(opt).run();

    }



    @Benchmark

    public int slow() {

        int n = 0;

        for (int i = 0; i < size; i++) {

            n += 2 * i * i;

        }

        return n;

    }



    @Benchmark

    public int fast() {

        int n = 0;

        for (int i = 0; i < size; i++) {

            n += 2 * (i * i);

        }

        return n;

    }

}

^{1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0}

Wow, that's some braindead asm. Instead of incrementing i before copying it to calculate 2*i, it does it after so it needs an extra add r11d,2 instruction. (Plus it misses the add same,same peephole instead of shl by 1 (add runs on more ports). It also misses an LEA peephole for x*2 + 2 (lea r11d, [r8*2 + 2]) if it really wants to do things in that order for some crazy instruction-scheduling reason. We could already see from the unrolled version that missing out on LEA was costing it a lot of uops, same as both loops here. — Dec 2 at 2:50
lea eax, [rax + r11 * 2] would replace 2 instructions (in both loops) if the JIT compiler had time to look for that optimization in long-running loops. Any decent ahead-of-time compiler would find it. (Unless maybe tuning only for AMD, where scaled-index LEA has 2 cycle latency so maybe not worth it.) — Dec 2 at 2:51

score 3 · Accepted Answer · 2018-12-23 14:04:23Z

The two methods of adding do generate slightly different byte code:

  17: iconst_2

  18: iload         4

  20: iload         4

  22: imul

  23: imul

  24: iadd

For 2 * (i * i) vs:

  17: iconst_2

  18: iload         4

  20: imul

  21: iload         4

  23: imul

  24: iadd

For 2 * i * i.

And when using a JMH benchmark like this:

@Warmup(iterations = 5, batchSize = 1)

@Measurement(iterations = 5, batchSize = 1)

@Fork(1)

@BenchmarkMode(Mode.AverageTime)

@OutputTimeUnit(TimeUnit.MILLISECONDS)

@State(Scope.Benchmark)

public class MyBenchmark {



    @Benchmark

    public int noBrackets() {

        int n = 0;

        for (int i = 0; i < 1000000000; i++) {

            n += 2 * i * i;

        }

        return n;

    }



    @Benchmark

    public int brackets() {

        int n = 0;

        for (int i = 0; i < 1000000000; i++) {

            n += 2 * (i * i);

        }

        return n;

    }



}

The difference is clear:

# JMH version: 1.21

# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28

# VM options: <none>



Benchmark                      (n)  Mode  Cnt    Score    Error  Units

MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op

MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup, see How do I write a correct micro-benchmark in Java?)

Running again with Graal:

# JMH version: 1.21

# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28

# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler



Benchmark                      (n)  Mode  Cnt    Score    Error  Units

MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op

MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

You see that the results are much closer, which makes sense, since Graal is an overall better performing, more modern, compiler.

So this is really just up to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason to it.

uhrrr.. your noBrackets is using brackets..
– Krease
Nov 23 at 20:55 — Nov 23 at 20:55
@Krease Oops, let me fix that.
– Jorn Vernee
Nov 23 at 20:56 — Nov 23 at 20:56

搜尋此網誌

Csdrhrt

Why is 2 * (i * i) faster than 2 * i * i in Java?

10 Answers
10

protected by Cassio Mazzochi Molin Dec 7 at 10:23

10 Answers
10

10 Answers
10

protected by Cassio Mazzochi Molin Dec 7 at 10:23

Popular posts from this blog

Plaza Victoria

Puebla de Zaragoza

Musa

Why is 2 * (i * i) faster than 2 * i * i in Java?

10 Answers 10

protected by Cassio Mazzochi Molin Dec 7 at 10:23

10 Answers 10

10 Answers 10

protected by Cassio Mazzochi Molin Dec 7 at 10:23

Popular posts from this blog

Plaza Victoria

Puebla de Zaragoza

Musa

10 Answers
10

10 Answers
10

10 Answers
10