Csdrhrt

Question

The following Java program takes on average between 0.50s and 0.55s to run:

public static void main(String args) {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");

    System.out.println("n = " + n);

}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65s to run. How come?

I ran each version of the program 15 times, alternating between the two. Here are the results:

2 * (i * i): 0.5183738, 0.5298337, 0.5308647, 0.5133458, 0.5003011, 0.5366181, 0.515149, 0.5237389, 0.5249942, 0.5641624, 0.538412, 0.5466744, 0.531159, 0.5048032, 0.5232789

2 * i * i: 0.6246434, 0.6049722, 0.6603363, 0.6243328, 0.6541802, 0.6312638, 0.6241105, 0.627815, 0.6114252, 0.6781033, 0.6393969, 0.6608845, 0.6201077, 0.6511559, 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they were both as efficient, the probability of this happening would be less than 1/2^15 = 0.00305%.

I get similar results (slightly different numbers, but definitely noticeable and consistent gap, definitely more than sampling error) — Nov 23 at 20:47
@nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs. — Nov 23 at 21:29
You could rename your question to "Why is i * i * 2 faster than 2 * i * i?" for improved clarity that the issue is on the order of the operations. — yesterday

Ian Kemp 16.2k126797 · Accepted Answer · 2018-11-27 09:18:36Z

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT)¹.

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.

But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)²:

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]

  vmovdqa ymm3, YMMWORD PTR .LC1[rip]

  xor eax, eax

  vpxor xmm2, xmm2, xmm2

.L2:

  vpmulld ymm1, ymm0, ymm0

  inc eax

  vpaddd ymm0, ymm0, ymm3

  vpslld ymm1, ymm1, 1

  vpaddd ymm2, ymm2, ymm1

  cmp eax, 125000000      ; 8 calculations per iteration

  jne .L2

  vmovdqa xmm0, xmm2

  vextracti128 xmm2, ymm2, 1

  vpaddd xmm2, xmm0, xmm2

  vpsrldq xmm0, xmm2, 8

  vpaddd xmm0, xmm2, xmm0

  vpsrldq xmm1, xmm0, 4

  vpaddd xmm0, xmm0, xmm1

  vmovd eax, xmm0

  vzeroupper

With run times:

SSE: 0.24 s, or 2 times faster.

AVX: 0.15 s, or 3 times faster.

AVX2: 0.08 s, or 5 times faster.

¹_{To get JIT generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly}

²_{The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.}

The single biggest problem the optimizer encounters in the C example is the undefined behavior invoked by signed integer overflow. Which, otherwise, would probably result in simply loading a constant as the whole loop can be calculated at compiletime. — Nov 25 at 18:28
@Damon Why would undefined behavior be a problem for the optimizer? If the optimizer sees it overflows when trying to calculate the result, that just means it can optimize it however it wants, because the behavior is undefined. — Nov 25 at 18:51
@Runemoro: if the optimizer proves that calling the function will inevitably result in undefined behaviour, it could choose to assume that the function will never be called, and emit no body for it. Or emit just a ret instruction, or emit a label and no ret instruction so execution just falls through. GCC does in fact behave this was sometimes when it encounters UB, though. For example: why ret disappear with optimization?. You definitely want to compile well-formed code to be sure the asm is sane. — Nov 25 at 22:26
It's probably just a front-end uop throughput bottleneck because of the inefficient code-gen. It's not even using LEA as a peephole for mov / add-immediate. e.g. movl RBX, R9 / addl RBX, #8 should be leal ebx, [r9 + 8], 1 uop to copy-and-add. Or leal ebx, [r9 + r9 + 16] to do ebx = 2*(r9+8). So yeah, unrolling to the point of spilling is dumb, and so is naive braindead codegen that doesn't take advantage of integer identities and associative integer math. — Nov 25 at 22:38
@kasperd The register names are different because the vector instructions use an entirely different set of registers to the normal instructions. — 2 days ago

Runemoro 2,00111238 · Answer 2 · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — Nov 23 at 22:27
I think on the optimizedVersion, it should be n *= 2000000000; — Nov 24 at 1:19
@StefansArya - No. Consider the case where the limit is 4, and we are trying to calculate 2*1*1 + 2*2*2 + 2*3*3. It is obvious that calculating 1*1 + 2*2 + 3*3 and multiplying by 2 is correct, whereas multiply by 8 would not be. — 2 days ago
@MartinBonner - Agreed, I think someone was explain it few days ago, but now his comment was gone. Thanks anyway for explaining. — 2 days ago
The math equation was just like this 2(1²) + 2(2²) + 2(3²) = 2(1² + 2² + 3²). That was very simple and I just forgot it because the loop increment. — 2 days ago

score 26 · Answer 3 · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

Puzzled 1412 · Answer 4 · 2018-11-25 18:18:01Z

Kasperd asked in a comment of the accepted answer:

The Java and C examples use quite different register names. Are both example using the AMD64 ISA?

xor edx, edx

xor eax, eax

.L2:

mov ecx, edx

imul ecx, edx

add edx, 1

lea eax, [rax+rcx*2]

cmp edx, 1000000000

jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

Ask the question yourself and self-answer :) this answer here should be removed since it's not an answer to the question. — 2 days ago
@Puzzled - thank you for the additional information. It was helpful. — 2 days ago

NoDataFound 5,4421739 · Answer 5 · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

AFAICT bytecode is pretty irrelevant for performance, and I wouldn't try to estimate what's faster based on it. It's just the source code for the JIT compiler... sure can meaning-preserving reordering source code lines change the resulting code and it's efficiency, but that all pretty unpredictable. — Nov 26 at 2:33

paulsm4 75.9k999122 · Answer 6 · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — Nov 23 at 21:11
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — Nov 23 at 21:14
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — Nov 23 at 21:16
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — Nov 23 at 22:42
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — Nov 24 at 8:06

score 2 · Answer 7 · 2018-11-28 14:58:04Z

While not directly related to the question's environment, just for the curiosity, I did the same test on .Net Core 2.1, x64, release mode.
Here is the interesting result, confirming same phonemenia happening over the dark side of the force. Code:

static void Main(string args)

    {

        Stopwatch watch = new Stopwatch();



        Console.WriteLine("2 * (i * i)");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * (i * i);

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }



        Console.WriteLine();

        Console.WriteLine("2 * i * i");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * i * i;

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }

    }

Result:

2 * (i * i)

result:119860736, 438ms

result:119860736, 433ms

result:119860736, 437ms

result:119860736, 435ms

result:119860736, 436ms

result:119860736, 435ms

result:119860736, 435ms

result:119860736, 439ms

result:119860736, 436ms

result:119860736, 437ms

2 * i * i

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 418ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 416ms

result:119860736, 417ms

result:119860736, 418ms

While this isn't an answer to the question, it does add value. That being said, if something is vital to your post, please in-line it in the post rather than linking to an off-site resource. Links go dead. — 14 hours ago
@JaredSmith Thanks for the feedback. Considering the link you mention is the "result" link, that image is not an off-site source. I uploaded it to the stackoverflow via its own panel. — 14 hours ago
It's a link to imgur, so yes, it is, it doesn't matter how you added the link. I fail to see what's so difficult about copy-pasting some console output. — 13 hours ago

Ian Kemp 16.2k126797 · Accepted Answer · 2018-11-27 09:18:36Z

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2

     iload0

     iload0

     imul

     imul

     iadd

vs 2 * i * i:

     iconst_2

     iload0

     imul

     iload0

     imul

     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT)¹.

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006

030     addl    R11, RBP    # int

033     movl    RBP, R13    # spill

036     addl    RBP, #14    # int

039     imull   RBP, RBP    # int

03c     movl    R9, R13 # spill

03f     addl    R9, #13 # int

043     imull   R9, R9  # int

047     sall    RBP, #1

049     sall    R9, #1

04c     movl    R8, R13 # spill

04f     addl    R8, #15 # int

053     movl    R10, R8 # spill

056     movdl   XMM1, R8    # spill

05b     imull   R10, R8 # int

05f     movl    R8, R13 # spill

062     addl    R8, #12 # int

066     imull   R8, R8  # int

06a     sall    R10, #1

06d     movl    [rsp + #32], R10    # spill

072     sall    R8, #1

075     movl    RBX, R13    # spill

078     addl    RBX, #11    # int

07b     imull   RBX, RBX    # int

07e     movl    RCX, R13    # spill

081     addl    RCX, #10    # int

084     imull   RCX, RCX    # int

087     sall    RBX, #1

089     sall    RCX, #1

08b     movl    RDX, R13    # spill

08e     addl    RDX, #8 # int

091     imull   RDX, RDX    # int

094     movl    RDI, R13    # spill

097     addl    RDI, #7 # int

09a     imull   RDI, RDI    # int

09d     sall    RDX, #1

09f     sall    RDI, #1

0a1     movl    RAX, R13    # spill

0a4     addl    RAX, #6 # int

0a7     imull   RAX, RAX    # int

0aa     movl    RSI, R13    # spill

0ad     addl    RSI, #4 # int

0b0     imull   RSI, RSI    # int

0b3     sall    RAX, #1

0b5     sall    RSI, #1

0b7     movl    R10, R13    # spill

0ba     addl    R10, #2 # int

0be     imull   R10, R10    # int

0c2     movl    R14, R13    # spill

0c5     incl    R14 # int

0c8     imull   R14, R14    # int

0cc     sall    R10, #1

0cf     sall    R14, #1

0d2     addl    R14, R11    # int

0d5     addl    R14, R10    # int

0d8     movl    R10, R13    # spill

0db     addl    R10, #3 # int

0df     imull   R10, R10    # int

0e3     movl    R11, R13    # spill

0e6     addl    R11, #5 # int

0ea     imull   R11, R11    # int

0ee     sall    R10, #1

0f1     addl    R10, R14    # int

0f4     addl    R10, RSI    # int

0f7     sall    R11, #1

0fa     addl    R11, R10    # int

0fd     addl    R11, RAX    # int

100     addl    R11, RDI    # int

103     addl    R11, RDX    # int

106     movl    R10, R13    # spill

109     addl    R10, #9 # int

10d     imull   R10, R10    # int

111     sall    R10, #1

114     addl    R10, R11    # int

117     addl    R10, RCX    # int

11a     addl    R10, RBX    # int

11d     addl    R10, R8 # int

120     addl    R9, R10 # int

123     addl    RBP, R9 # int

126     addl    RBP, [RSP + #32 (32-bit)]   # int

12a     addl    R13, #16    # int

12e     movl    R11, R13    # spill

131     imull   R11, R13    # int

135     sall    R11, #1

138     cmpl    R13, #999999985

13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006

05a     addl    RBX, R11    # int

05d     movl    [rsp + #32], RBX    # spill

061     movl    R11, R8 # spill

064     addl    R11, #15    # int

068     movl    [rsp + #36], R11    # spill

06d     movl    R11, R8 # spill

070     addl    R11, #14    # int

074     movl    R10, R9 # spill

077     addl    R10, #16    # int

07b     movdl   XMM2, R10   # spill

080     movl    RCX, R9 # spill

083     addl    RCX, #14    # int

086     movdl   XMM1, RCX   # spill

08a     movl    R10, R9 # spill

08d     addl    R10, #12    # int

091     movdl   XMM4, R10   # spill

096     movl    RCX, R9 # spill

099     addl    RCX, #10    # int

09c     movdl   XMM6, RCX   # spill

0a0     movl    RBX, R9 # spill

0a3     addl    RBX, #8 # int

0a6     movl    RCX, R9 # spill

0a9     addl    RCX, #6 # int

0ac     movl    RDX, R9 # spill

0af     addl    RDX, #4 # int

0b2     addl    R9, #2  # int

0b6     movl    R10, R14    # spill

0b9     addl    R10, #22    # int

0bd     movdl   XMM3, R10   # spill

0c2     movl    RDI, R14    # spill

0c5     addl    RDI, #20    # int

0c8     movl    RAX, R14    # spill

0cb     addl    RAX, #32    # int

0ce     movl    RSI, R14    # spill

0d1     addl    RSI, #18    # int

0d4     movl    R13, R14    # spill

0d7     addl    R13, #24    # int

0db     movl    R10, R14    # spill

0de     addl    R10, #26    # int

0e2     movl    [rsp + #40], R10    # spill

0e7     movl    RBP, R14    # spill

0ea     addl    RBP, #28    # int

0ed     imull   RBP, R11    # int

0f1     addl    R14, #30    # int

0f5     imull   R14, [RSP + #36 (32-bit)]   # int

0fb     movl    R10, R8 # spill

0fe     addl    R10, #11    # int

102     movdl   R11, XMM3   # spill

107     imull   R11, R10    # int

10b     movl    [rsp + #44], R11    # spill

110     movl    R10, R8 # spill

113     addl    R10, #10    # int

117     imull   RDI, R10    # int

11b     movl    R11, R8 # spill

11e     addl    R11, #8 # int

122     movdl   R10, XMM2   # spill

127     imull   R10, R11    # int

12b     movl    [rsp + #48], R10    # spill

130     movl    R10, R8 # spill

133     addl    R10, #7 # int

137     movdl   R11, XMM1   # spill

13c     imull   R11, R10    # int

140     movl    [rsp + #52], R11    # spill

145     movl    R11, R8 # spill

148     addl    R11, #6 # int

14c     movdl   R10, XMM4   # spill

151     imull   R10, R11    # int

155     movl    [rsp + #56], R10    # spill

15a     movl    R10, R8 # spill

15d     addl    R10, #5 # int

161     movdl   R11, XMM6   # spill

166     imull   R11, R10    # int

16a     movl    [rsp + #60], R11    # spill

16f     movl    R11, R8 # spill

172     addl    R11, #4 # int

176     imull   RBX, R11    # int

17a     movl    R11, R8 # spill

17d     addl    R11, #3 # int

181     imull   RCX, R11    # int

185     movl    R10, R8 # spill

188     addl    R10, #2 # int

18c     imull   RDX, R10    # int

190     movl    R11, R8 # spill

193     incl    R11 # int

196     imull   R9, R11 # int

19a     addl    R9, [RSP + #32 (32-bit)]    # int

19f     addl    R9, RDX # int

1a2     addl    R9, RCX # int

1a5     addl    R9, RBX # int

1a8     addl    R9, [RSP + #60 (32-bit)]    # int

1ad     addl    R9, [RSP + #56 (32-bit)]    # int

1b2     addl    R9, [RSP + #52 (32-bit)]    # int

1b7     addl    R9, [RSP + #48 (32-bit)]    # int

1bc     movl    R10, R8 # spill

1bf     addl    R10, #9 # int

1c3     imull   R10, RSI    # int

1c7     addl    R10, R9 # int

1ca     addl    R10, RDI    # int

1cd     addl    R10, [RSP + #44 (32-bit)]   # int

1d2     movl    R11, R8 # spill

1d5     addl    R11, #12    # int

1d9     imull   R13, R11    # int

1dd     addl    R13, R10    # int

1e0     movl    R10, R8 # spill

1e3     addl    R10, #13    # int

1e7     imull   R10, [RSP + #40 (32-bit)]   # int

1ed     addl    R10, R13    # int

1f0     addl    RBP, R10    # int

1f3     addl    R14, RBP    # int

1f6     movl    R10, R8 # spill

1f9     addl    R10, #16    # int

1fd     cmpl    R10, #999999985

204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.

But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite
considerable if the average instruction length is more than 4 bytes.
The following methods of optimizing the use of the µop cache may
be considered:

Make sure that critical loops are small enough to fit into the µop cache.

Align the most critical loop entries and function entries by 32.

Avoid unnecessary loop unrolling.

Avoid instructions that have extra load time

. . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)²:

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]

  vmovdqa ymm3, YMMWORD PTR .LC1[rip]

  xor eax, eax

  vpxor xmm2, xmm2, xmm2

.L2:

  vpmulld ymm1, ymm0, ymm0

  inc eax

  vpaddd ymm0, ymm0, ymm3

  vpslld ymm1, ymm1, 1

  vpaddd ymm2, ymm2, ymm1

  cmp eax, 125000000      ; 8 calculations per iteration

  jne .L2

  vmovdqa xmm0, xmm2

  vextracti128 xmm2, ymm2, 1

  vpaddd xmm2, xmm0, xmm2

  vpsrldq xmm0, xmm2, 8

  vpaddd xmm0, xmm2, xmm0

  vpsrldq xmm1, xmm0, 4

  vpaddd xmm0, xmm0, xmm1

  vmovd eax, xmm0

  vzeroupper

With run times:

SSE: 0.24 s, or 2 times faster.

AVX: 0.15 s, or 3 times faster.

AVX2: 0.08 s, or 5 times faster.

¹_{To get JIT generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly}

²_{The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.}

The single biggest problem the optimizer encounters in the C example is the undefined behavior invoked by signed integer overflow. Which, otherwise, would probably result in simply loading a constant as the whole loop can be calculated at compiletime. — Nov 25 at 18:28
@Damon Why would undefined behavior be a problem for the optimizer? If the optimizer sees it overflows when trying to calculate the result, that just means it can optimize it however it wants, because the behavior is undefined. — Nov 25 at 18:51
@Runemoro: if the optimizer proves that calling the function will inevitably result in undefined behaviour, it could choose to assume that the function will never be called, and emit no body for it. Or emit just a ret instruction, or emit a label and no ret instruction so execution just falls through. GCC does in fact behave this was sometimes when it encounters UB, though. For example: why ret disappear with optimization?. You definitely want to compile well-formed code to be sure the asm is sane. — Nov 25 at 22:26
It's probably just a front-end uop throughput bottleneck because of the inefficient code-gen. It's not even using LEA as a peephole for mov / add-immediate. e.g. movl RBX, R9 / addl RBX, #8 should be leal ebx, [r9 + 8], 1 uop to copy-and-add. Or leal ebx, [r9 + r9 + 16] to do ebx = 2*(r9+8). So yeah, unrolling to the point of spilling is dumb, and so is naive braindead codegen that doesn't take advantage of integer identities and associative integer math. — Nov 25 at 22:38
@kasperd The register names are different because the vector instructions use an entirely different set of registers to the normal instructions. — 2 days ago

Runemoro 2,00111238 · Answer 9 · 2018-11-23 21:44:55Z

When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;

for (int i = 0; i < 1000000000; i++) {

    n += i * i;

}

n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the addition.

Here are a few reasons why I think this is the case:

Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same

The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String args) {

    long fastVersion = 0;

    long slowVersion = 0;

    long optimizedVersion = 0;

    long modifiedFastVersion = 0;

    long modifiedSlowVersion = 0;



    for (int i = 0; i < 10; i++) {

        fastVersion += fastVersion();

        slowVersion += slowVersion();

        optimizedVersion += optimizedVersion();

        modifiedFastVersion += modifiedFastVersion();

        modifiedSlowVersion += modifiedSlowVersion();

    }



    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");

    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");

    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");

    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");

    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");

}



private static long fastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long slowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}



private static long optimizedVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        n += i * i;

    }

    n *= 2;

    return System.nanoTime() - startTime;

}



private static long modifiedFastVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * (i * i);

    }

    return System.nanoTime() - startTime;

}



private static long modifiedSlowVersion() {

    long startTime = System.nanoTime();

    int n = 0;

    for (int i = 0; i < 1000000000; i++) {

        if (n == 0) n = 1;

        n += 2 * i * i;

    }

    return System.nanoTime() - startTime;

}

And here are the results:

Fast version: 5.7274411 s

Slow version: 7.6190804 s

Optimized version: 5.1348007 s

Modified fast version: 7.1492705 s

Modified slow version: 7.2952668 s

here is a benchmark: github.com/jawb-software/stackoverflow-53452713 — Nov 23 at 22:27
I think on the optimizedVersion, it should be n *= 2000000000; — Nov 24 at 1:19
@StefansArya - No. Consider the case where the limit is 4, and we are trying to calculate 2*1*1 + 2*2*2 + 2*3*3. It is obvious that calculating 1*1 + 2*2 + 3*3 and multiplying by 2 is correct, whereas multiply by 8 would not be. — 2 days ago
@MartinBonner - Agreed, I think someone was explain it few days ago, but now his comment was gone. Thanks anyway for explaining. — 2 days ago
The math equation was just like this 2(1²) + 2(2²) + 2(3²) = 2(1² + 2² + 3²). That was very simple and I just forgot it because the loop increment. — 2 days ago

score 26 · Answer 10 · 2018-11-23 21:30:11Z

ByteCodes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html

ByteCodes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Win10 64 1.8.0_65-b17) I can reproduce and explain:

public static void main(String args) {

    int repeat = 10;

    long A = 0;

    long B = 0;

    for (int i = 0; i < repeat; i++) {

        A += test();

        B += testB();

    }



    System.out.println(A / repeat + " ms");

    System.out.println(B / repeat + " ms");

}





private static long test() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multi(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multi(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms A " + n);

    return ms;

}





private static long testB() {

    int n = 0;

    for (int i = 0; i < 1000; i++) {

        n += multiB(i);

    }

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < 1000000000; i++) {

        n += multiB(i);

    }

    long ms = (System.currentTimeMillis() - startTime);

    System.out.println(ms + " ms B " + n);

    return ms;

}



private static int multiB(int i) {

    return 2 * (i * i);

}



private static int multi(int i) {

    return 2 * i * i;

}

Output:

...

405 ms A 785527736

327 ms B 785527736

404 ms A 785527736

329 ms B 785527736

404 ms A 785527736

328 ms B 785527736

404 ms A 785527736

328 ms B 785527736

410 ms

333 ms

So why?
The Byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         iload0

         imul

         imul

         ireturn

     }

     L2 {

     }

 }



 private static multi(int arg0) { // 2 * i * i

     <localVar:index=0 , name=i , desc=I, sig=null, start=L1, end=L2>



     L1 {

         iconst_2

         iload0

         imul

         iload0

         imul

         ireturn

     }

     L2 {

     }

 }

The difference being:

With brackets (2 * (i * i)):

push const stack

push local on stack

push local on stack

multiply top of stack

multiply top of stack

Without brackets (2 * i * i):

push const stack

push local on stack

multiply top of stack

push local on stack

multiply top of stack

Loading all on stack and then working back down is faster than switching between putting on stack and operating on it.

Puzzled 1412 · Answer 11 · 2018-11-25 18:18:01Z

Kasperd asked in a comment of the accepted answer:

The Java and C examples use quite different register names. Are both example using the AMD64 ISA?

xor edx, edx

xor eax, eax

.L2:

mov ecx, edx

imul ecx, edx

add edx, 1

lea eax, [rax+rcx*2]

cmp edx, 1000000000

jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

Ask the question yourself and self-answer :) this answer here should be removed since it's not an answer to the question. — 2 days ago
@Puzzled - thank you for the additional information. It was helpful. — 2 days ago

NoDataFound 5,4421739 · Answer 12 · 2018-11-23 22:10:50Z

I tried a JMH using the default archetype: I also added optimized version based Runemoro' explanation .

@State(Scope.Benchmark)

@Warmup(iterations = 2)

@Fork(1)

@Measurement(iterations = 10)

@OutputTimeUnit(TimeUnit.NANOSECONDS)

//@BenchmarkMode({ Mode.All })

@BenchmarkMode(Mode.AverageTime)

public class MyBenchmark {

  @Param({ "100", "1000", "1000000000" })

  private int size;



  @Benchmark

  public int two_square_i() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * (i * i);

    }

    return n;

  }



  @Benchmark

  public int square_i_two() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += i * i;

    }

    return 2*n;

  }  



  @Benchmark

  public int two_i_() {

    int n = 0;

    for (int i = 0; i < size; i++) {

      n += 2 * i * i;

    }

    return n;

  }

}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units

o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op

o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op

o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op

o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op

o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op

o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op

o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op

o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op

o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860, doing nothing much apart reading on my smartphone):

n += i*i then n*2 is first

2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

Differences between 2*(i*i) (left) and 2*i*i (right) here: https://www.diffchecker.com/cvSFppWI

Differences between 2*(i*i) and the optimized version here: https://www.diffchecker.com/I1XFu5dP

I am not expert on bytecode but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, there is no need to load it again) whilst in the 2*i*i it can't.

AFAICT bytecode is pretty irrelevant for performance, and I wouldn't try to estimate what's faster based on it. It's just the source code for the JIT compiler... sure can meaning-preserving reordering source code lines change the resulting code and it's efficiency, but that all pretty unpredictable. — Nov 26 at 2:33

paulsm4 75.9k999122 · Answer 13 · 2018-11-23 21:10:06Z

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736

2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: iload         4

    30: imul

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:

     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V

     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J

    11: lstore_1

    12: iconst_0

    13: istore_3

    14: iconst_0

    15: istore        4

    17: iload         4

    19: ldc           #6                  // int 1000000000

    21: if_icmpge     40

    24: iload_3

    25: iconst_2

    26: iload         4

    28: imul

    29: iload         4

    31: imul

    32: iadd

    33: istore_3

    34: iinc          4, 1

    37: goto          17

FYI -

java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

A better answer and maybe you can vote to undelete - stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. — Nov 23 at 21:11
@nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant" — Nov 23 at 21:14
That one was self-deleted since it measured the wrong thing - see that author's comment on the question above — Nov 23 at 21:16
Get a debug jre and run with -XX:+PrintOptoAssembly. Or just use vtune or alike. — Nov 23 at 22:42
@ rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found above with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And also explains why other JRE's behave "differently". +1: thank you for the excellent detective work! — Nov 24 at 8:06

score 2 · Answer 14 · 2018-11-28 14:58:04Z

While not directly related to the question's environment, just for the curiosity, I did the same test on .Net Core 2.1, x64, release mode.
Here is the interesting result, confirming same phonemenia happening over the dark side of the force. Code:

static void Main(string args)

    {

        Stopwatch watch = new Stopwatch();



        Console.WriteLine("2 * (i * i)");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * (i * i);

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }



        Console.WriteLine();

        Console.WriteLine("2 * i * i");



        for (int a = 0; a < 10; a++)

        {

            int n = 0;



            watch.Restart();



            for (int i = 0; i < 1000000000; i++)

            {

                n += 2 * i * i;

            }



            watch.Stop();



            Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");

        }

    }

Result:

2 * (i * i)

result:119860736, 438ms

result:119860736, 433ms

result:119860736, 437ms

result:119860736, 435ms

result:119860736, 436ms

result:119860736, 435ms

result:119860736, 435ms

result:119860736, 439ms

result:119860736, 436ms

result:119860736, 437ms

2 * i * i

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 418ms

result:119860736, 417ms

result:119860736, 418ms

result:119860736, 416ms

result:119860736, 417ms

result:119860736, 418ms

While this isn't an answer to the question, it does add value. That being said, if something is vital to your post, please in-line it in the post rather than linking to an off-site resource. Links go dead. — 14 hours ago
@JaredSmith Thanks for the feedback. Considering the link you mention is the "result" link, that image is not an off-site source. I uploaded it to the stackoverflow via its own panel. — 14 hours ago
It's a link to imgur, so yes, it is, it doesn't matter how you added the link. I fail to see what's so difficult about copy-pasting some console output. — 13 hours ago

搜尋此網誌

Csdrhrt

Why is “2 * (i * i)” faster than “2 * i * i”?

7 Answers
7

Your Answer

Post as a guest

7 Answers
7

7 Answers
7

Post as a guest

Popular posts from this blog

Plaza Victoria

In PowerPoint, is there a keyboard shortcut for bulleted / numbered list?

How to put 3 figures in Latex with 2 figures side by side and 1 below these side by side images but in...

Why is “2 * (i * i)” faster than “2 * i * i”?

7 Answers 7

Your Answer

Sign up or log in

Post as a guest

Post as a guest

7 Answers 7

7 Answers 7

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Plaza Victoria

In PowerPoint, is there a keyboard shortcut for bulleted / numbered list?

How to put 3 figures in Latex with 2 figures side by side and 1 below these side by side images but in...

7 Answers
7

7 Answers
7

7 Answers
7