Edit 3: the images are links to the full-size versions. Sorry for posting screenshots of text, but copying the graphs into a text table would be difficult.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, despite the instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both executed n times. This happens every time I run the profiler, on both a Westmere Xeon and the Haswell laptop I'm using right now (compiled with SSE because that's what I'm targeting and learning at the moment).
What am I missing?
Ignore the poor concurrency; it's most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not a case of micro-optimisation, since those three instruction groups together account for a decent percentage of the total time, and I'm really interested in the possible cause of this behavior.
Edit: running with OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune... gives the same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. The gather probably cache-misses a lot, but those costs aren't going to get charged to these SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to the cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation, with the loads having to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
Instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler making a poor choice about which vector register to spill (i.e. if it ran out of registers).
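A minimal sketch of that idea, with made-up names (advance, step, niter are mine, not from the question): the point is that a_pointf lives in a __m128 the whole time, so there is no store/reload in the loop.

```cpp
#include <xmmintrin.h>  // SSE

// Hypothetical sketch: instead of writing a_pointf to a float[4] each
// iteration and reloading it, carry it across iterations in a register.
static __m128 advance(__m128 a_pointf, __m128 step, int niter) {
    for (int i = 0; i < niter; ++i) {
        // Stays in an xmm register: no store/reload, no store-forwarding latency.
        a_pointf = _mm_add_ps(a_pointf, step);
    }
    return a_pointf;
}
```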
Calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
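For illustration, a sketch of that conversion (the function name is mine): CVTTPS2DQ truncates toward zero in a single instruction, which equals floor for the non-negative voxel coordinates, so no separate ROUNDPS is needed on the path that feeds the gather indices.

```cpp
#include <emmintrin.h>  // SSE2

// Truncating float->int convert (CVTTPS2DQ). For non-negative inputs,
// truncation toward zero is the same as floor, so the rounding step
// drops out of the index dependency chain.
static __m128i floats_to_indices(__m128 vf) {
    return _mm_cvttps_epi32(vf);
}
```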
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it (that's a guaranteed store-forwarding failure; see Agner Fog's optimization guides and other links from the x86 tag wiki).
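One possible shape for such a manual gather (a sketch, not the question's exact code): four scalar loads combined with shuffles, so the data never round-trips through a temporary array.

```cpp
#include <xmmintrin.h>  // SSE

// Manual 4-element gather: combine scalar loads with UNPCKLPS/MOVLHPS
// instead of storing to a temporary array and doing a vector reload
// (which would guarantee a store-forwarding stall).
static __m128 gather4(const float *base, int i0, int i1, int i2, int i3) {
    __m128 a = _mm_load_ss(base + i0);   // [x0, 0, 0, 0]
    __m128 b = _mm_load_ss(base + i1);
    __m128 c = _mm_load_ss(base + i2);
    __m128 d = _mm_load_ss(base + i3);
    __m128 ab = _mm_unpacklo_ps(a, b);   // [x0, x1, 0, 0]
    __m128 cd = _mm_unpacklo_ps(c, d);   // [x2, x3, 0, 0]
    return _mm_movelh_ps(ab, cd);        // [x0, x1, x2, x3]
}
```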
It might be worth partially vectorizing the address math (the calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget).
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that the pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0, 0, 0, 0).
Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly; the function call forces the compiler to spill all the xmm registers to memory. Compile with -ffast-math so it can inline to a ROUNDSS, or do it manually. Especially since you go on to load the float you compute from it into a vector anyway!
Use a vector compare instead of scalar prev_x / prev_y / prev_z, and use MOVMSKPS to get the result into an integer you can test. (You only care about the low three elements, so test it as compare_mask & 0b0111 — true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for tables of how it all works: http://www.felixcloutier.com/x86/CMPPD.html.)
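Putting that together (a sketch; the function name is mine, prev_x/prev_y/prev_z assumed packed into a __m128): one CMPNEQPS plus MOVMSKPS replaces three scalar comparisons, and masking with 0b0111 ignores the unused 4th element.

```cpp
#include <xmmintrin.h>  // SSE

// True if any of the low three elements of cur differ from prev.
// CMPPS (predicate NEQ) sets each lane to all-ones/all-zeros; MOVMSKPS
// packs the sign bits into a 4-bit integer; bit 3 (the unused lane)
// is masked off with 0b0111.
static bool any_of_3_changed(__m128 cur, __m128 prev) {
    __m128 neq = _mm_cmpneq_ps(cur, prev);
    return (_mm_movemask_ps(neq) & 0b0111) != 0;
}
```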