Edit 3: the images are links to the full-size versions. Sorry for posting screenshots of text, but copying the graphs into a text table would be difficult.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, despite the instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both executed n times. This happens every time I run the profiler, on both a Westmere Xeon and the Haswell laptop I'm using right now (compiled with SSE because that's what I'm targeting and learning at the moment).
What am I missing?
Ignore the poor concurrency; it's most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not a case of micro-optimisation, since those three instruction groups together account for a decent percentage of the total time, and I'm really interested in the possible cause of this behavior.
Edit: running with OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune... gives the same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. The gather probably cache-misses a lot, but those costs aren't going to get charged to these SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to the cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation, with the loads having to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
Instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler making a poor choice about which vector register to spill (i.e. if it ran out of registers).
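A minimal sketch of that idea, with made-up names (advance, step, niter are mine, not from the question): the point is that a_pointf lives in a __m128 the whole time, so there is no store/reload in the loop.

```cpp
#include <xmmintrin.h>  // SSE

// Hypothetical sketch: instead of writing a_pointf to a float[4] each
// iteration and reloading it, carry it across iterations in a register.
static __m128 advance(__m128 a_pointf, __m128 step, int niter) {
    for (int i = 0; i < niter; ++i) {
        // Stays in an xmm register: no store/reload, no store-forwarding latency.
        a_pointf = _mm_add_ps(a_pointf, step);
    }
    return a_pointf;
}
```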
Calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
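For illustration, a sketch of that conversion (the function name is mine): CVTTPS2DQ truncates toward zero in a single instruction, which equals floor for the non-negative voxel coordinates, so no separate ROUNDPS is needed on the path that feeds the gather indices.

```cpp
#include <emmintrin.h>  // SSE2

// Truncating float->int convert (CVTTPS2DQ). For non-negative inputs,
// truncation toward zero is the same as floor, so the rounding step
// drops out of the index dependency chain.
static __m128i floats_to_indices(__m128 vf) {
    return _mm_cvttps_epi32(vf);
}
```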
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it (that's a guaranteed store-forwarding failure; see Agner Fog's optimization guides and other links from the x86 tag wiki).
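One possible shape for such a manual gather (a sketch, not the question's exact code): four scalar loads combined with shuffles, so the data never round-trips through a temporary array.

```cpp
#include <xmmintrin.h>  // SSE

// Manual 4-element gather: combine scalar loads with UNPCKLPS/MOVLHPS
// instead of storing to a temporary array and doing a vector reload
// (which would guarantee a store-forwarding stall).
static __m128 gather4(const float *base, int i0, int i1, int i2, int i3) {
    __m128 a = _mm_load_ss(base + i0);   // [x0, 0, 0, 0]
    __m128 b = _mm_load_ss(base + i1);
    __m128 c = _mm_load_ss(base + i2);
    __m128 d = _mm_load_ss(base + i3);
    __m128 ab = _mm_unpacklo_ps(a, b);   // [x0, x1, 0, 0]
    __m128 cd = _mm_unpacklo_ps(c, d);   // [x2, x3, 0, 0]
    return _mm_movelh_ps(ab, cd);        // [x0, x1, x2, x3]
}
```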
It might be worth partially vectorizing the address math (the calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget).
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that the pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0, 0, 0, 0).
Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly; the function call forces the compiler to spill all the xmm registers to memory. Compile with -ffast-math so it can inline to a ROUNDSS, or do it manually. Especially since you go on to load the float you compute from it into a vector anyway!
Use a vector compare instead of scalar prev_x / prev_y / prev_z, and use MOVMSKPS to get the result into an integer you can test. (You only care about the low three elements, so test it as compare_mask & 0b0111 — true if any of the low 3 bits of the 4-bit mask are set, after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for tables of how it all works: http://www.felixcloutier.com/x86/CMPPD.html.)
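Putting that together (a sketch; the function name is mine, prev_x/prev_y/prev_z assumed packed into a __m128): one CMPNEQPS plus MOVMSKPS replaces three scalar comparisons, and masking with 0b0111 ignores the unused 4th element.

```cpp
#include <xmmintrin.h>  // SSE

// True if any of the low three elements of cur differ from prev.
// CMPPS (predicate NEQ) sets each lane to all-ones/all-zeros; MOVMSKPS
// packs the sign bits into a 4-bit integer; bit 3 (the unused lane)
// is masked off with 0b0111.
static bool any_of_3_changed(__m128 cur, __m128 prev) {
    __m128 neq = _mm_cmpneq_ps(cur, prev);
    return (_mm_movemask_ps(neq) & 0b0111) != 0;
}
```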