Slow Instructions in Simple Loop on x86

MattCochrane

I wrote a simple loop in C++ to profile the performance of the multiply instruction on my CPU. When I profiled it, I found some interesting nuances in the generated assembly code.

Here is the C++ program:

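#include <cstdint>

#define TESTS 10000000
#define BUFSIZE 1000

int main() {
    // All three buffers are locals, so they are addressed relative to esp.
    uint32_t buf_in1[BUFSIZE];
    uint32_t buf_in2[BUFSIZE];
    uint32_t volatile buf_out[BUFSIZE];

    unsigned int i, j;

    // Fill both input buffers with known values.
    for (i = 0; i < BUFSIZE; i++) {
        buf_in1[i] = i;
        buf_in2[i] = i;
    }

    // The loop being profiled: one multiply and one store per element.
    for (j = 0; j < TESTS; j++) {
        for (i = 0; i < BUFSIZE; i++) {
            buf_out[i] = buf_in1[i] * buf_in2[i];
        }
    }

    return 0;
}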

I compiled with the following flags:

Optimization: [screenshot of the Optimization settings, not available]

Code Generation: [screenshot of the Code Generation settings, not available]

It's compiled in Visual Studio 2012 as a Win32 build, although I am running it on a 64-bit machine.

Note the volatile qualifier on buf_out. It's just to stop the compiler from optimising the loop away.

I ran this code through a profiler (AMD's CodeXL) and I see that the multiplication instruction doesn't take up the majority of the CPU time. About 30% is taken up by the imul instruction, but around 60% is also spent on two other instructions:

[Screenshot: CodeXL profiler output with per-instruction timer ticks, not available]

Note that the Timer column shows the number of timer ticks during which the profiler found the code on this instruction. The timer tick is 1ms so 2609 ticks is approximately 2609ms spent on that instruction.

The two instructions other than the multiply instruction which are taking up a lot of time are a mov instruction and the jb (jump if condition is met) instruction.

The mov instruction,

mov [esp+eax+00001f40h],ecx

is moving the result of the multiply (ecx) back into buf_out at the offset held in eax (which is the register representing i). This makes sense, but why does it take so much longer than the other mov instruction, i.e. this one:

mov ecx,[esp+eax+00000fa0h]

They both access similar locations in memory. Each array is 1000 uint32_t's, or 4000 bytes, long, so the three arrays together take 4000*3 = 12kB. My L1 cache is 64kB, so it should all easily fit in L1 as far as I can see...
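The working-set arithmetic above can be written down as a small compile-time check. This is only a sketch: the names kBufSize, kNumBuffers, kL1Bytes, working_set_bytes and fits_in_l1 are made up for illustration, and the 64kB L1 size is simply the figure quoted in this post.

```cpp
#include <cstddef>
#include <cstdint>

// Values taken from the post: three buffers of 1000 uint32_t each.
constexpr std::size_t kBufSize = 1000;
constexpr std::size_t kNumBuffers = 3;

// Assumption: 64 kB of L1, the figure quoted in the post.
constexpr std::size_t kL1Bytes = 64 * 1024;

// Total bytes touched by the inner loop across all three buffers.
constexpr std::size_t working_set_bytes() {
    return kNumBuffers * kBufSize * sizeof(std::uint32_t);
}

// True if the whole working set is no larger than the assumed L1 size.
constexpr bool fits_in_l1() {
    return working_set_bytes() <= kL1Bytes;
}

static_assert(working_set_bytes() == 12000, "3 buffers * 1000 elements * 4 bytes");
static_assert(fits_in_l1(), "12000 bytes is well under 64 kB");
```

Fitting in L1 only rules out capacity misses under these assumptions; it says nothing about how a sampling profiler attributes time to individual instructions.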

Here are results showing my cache sizes etc. from Coreinfo:

[Screenshot: Coreinfo output, not available]

As for the jump instruction:

jb $-1ah (0x903732)

I can't tell why it's taking up 33% of the program execution time either. My processor's cache line size is 64 bytes and the jump only goes backwards 0x1A bytes, or 26 bytes. Could it be because this jump crosses a 64-byte boundary? (0x903740 is a 64-byte boundary.)
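The boundary arithmetic can be checked with a few lines of C++. This is a sketch under one assumption: that "$" in the disassembly means the address of the jb instruction itself, which would put the jb at 0x903732 + 0x1A = 0x90374C. The helper names line_index and spans_lines are invented for this example.

```cpp
#include <cstdint>

// Index of the cache line that contains addr.
constexpr std::uint32_t line_index(std::uint32_t addr, std::uint32_t line_size) {
    return addr / line_size;
}

// True if the addresses first and last fall in different cache lines.
constexpr bool spans_lines(std::uint32_t first, std::uint32_t last,
                           std::uint32_t line_size) {
    return line_index(first, line_size) != line_index(last, line_size);
}

constexpr std::uint32_t kLineSize = 64;       // line size quoted in the post
constexpr std::uint32_t kLoopTop = 0x903732;  // jump target shown by the disassembler

// Assumption: "$" is the address of the jb, so the jb sits 0x1A bytes past the target.
constexpr std::uint32_t kJumpAddr = kLoopTop + 0x1A;

static_assert(0x903740 % kLineSize == 0, "0x903740 is 64-byte aligned");
static_assert(spans_lines(kLoopTop, kJumpAddr, kLineSize),
              "the loop body straddles the boundary at 0x903740");
```

Under that assumption the loop body does straddle 0x903740. The check only confirms the address arithmetic; it does not show that the crossing accounts for the time attributed to the jb.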

So can anyone explain these behaviours?

Thanks.

Alexis Wilke

As mentioned by Mystical, the timings you are looking at do not map one-to-one onto the instructions they are shown against.

Modern processors run many instructions in parallel. The imul and the add of 4 to eax can both execute in parallel, and the address arithmetic for the mov also uses the ALU and can be computed before the imul completes.

Most profilers compute their timings using timed interrupts, so what you see are the instructions that happened to be executing when each interrupt fired.

To use a profiler properly, run it against a large program and see where the program spends most of its time. On a per-instruction basis, the results do not have much value.

If you really want to do speed tests, read the CPU timer before and after your loops and see how you can change the code to make it run faster.
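A minimal sketch of timing the whole loop rather than individual instructions is shown below. It uses std::chrono::steady_clock from the standard library in place of a raw CPU timer, which is a substitution made here for portability. The function names init_buffers and time_multiply_loop are invented for this example, and the buffers are globals rather than locals as in the question.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBufSize = 1000;

std::uint32_t buf_in1[kBufSize];
std::uint32_t buf_in2[kBufSize];
volatile std::uint32_t buf_out[kBufSize];  // volatile so the stores are not optimised away

// Fill both input buffers with the same values as in the question.
void init_buffers() {
    for (std::size_t i = 0; i < kBufSize; ++i) {
        buf_in1[i] = static_cast<std::uint32_t>(i);
        buf_in2[i] = static_cast<std::uint32_t>(i);
    }
}

// Run the multiply loop `tests` times and return the elapsed wall-clock seconds.
double time_multiply_loop(std::size_t tests) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < tests; ++j) {
        for (std::size_t i = 0; i < kBufSize; ++i) {
            buf_out[i] = buf_in1[i] * buf_in2[i];
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling init_buffers() and then time_multiply_loop(10000000) gives one number for the whole run. Dividing it by the total number of iterations gives an average cost per iteration, which can be compared between variants of the loop.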
