SSE runs slow after using AVX

Geoffrey

I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application with GCC, using runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:

g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
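For context, a minimal runtime-dispatch sketch (the function names and scalar stubs below are hypothetical, standing in for the real per-ISA object files built above) could use GCC's __builtin_cpu_supports:

```cpp
#include <cstdint>

// Hypothetical dispatcher, not from the question. In the real build each
// implementation lives in its own object file compiled with its own -m flag;
// here identical scalar stubs stand in so the sketch is self-contained.
static void Float_S16_Scalar(const float *in, int16_t *out, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        out[i] = (int16_t)(in[i] * 32767.0f);  // illustrative scaling only
}
static void Float_S16_SSE2(const float *in, int16_t *out, unsigned n)
{
    Float_S16_Scalar(in, out, n);  // stand-in for the -msse2 object
}
static void Float_S16_AVX(const float *in, int16_t *out, unsigned n)
{
    Float_S16_Scalar(in, out, n);  // stand-in for the -mavx object
}

using ConvertFn = void (*)(const float *, int16_t *, unsigned);

// Pick the best routine once, at startup, based on what the CPU supports.
static ConvertFn select_Float_S16()
{
    if (__builtin_cpu_supports("avx"))  return Float_S16_AVX;
    if (__builtin_cpu_supports("sse2")) return Float_S16_SSE2;
    return Float_S16_Scalar;
}
```

The selected pointer is typically cached in a global at startup so the feature check runs only once.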

When I first launch the program, the SSE2 routines perform as expected, with a nice speed boost over the non-SSE routines (around 100% faster). But after I run any AVX routine, the exact same SSE2 routine runs much slower.

Could someone please explain what the cause of this may be?

Before the AVX routine runs, all the tests are around 80-130% faster than FPU math; after the AVX routine runs, the SSE routines are much slower.

If I skip the AVX test routines I never see this performance loss.

Here is my SSE2 routine:

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float  ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m128 mul   = _mm_set_ps1(ratio);

  unsigned int i;
  for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
  {
    __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
  }

  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}

And the AVX version of the same:

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m256 mul  = _mm256_set1_ps(ratio);

  unsigned int i;
  for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
  {
    __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
    out[4] = ((int16_t*)&con)[8];
    out[5] = ((int16_t*)&con)[10];
    out[6] = ((int16_t*)&con)[12];
    out[7] = ((int16_t*)&con)[14];
  }

  for(; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}

I have also run this through Valgrind, which detects no errors.

Nayuki

Mixing 256-bit AVX code and legacy SSE code incurs a state-transition penalty. The most reasonable solution is to execute the VZEROUPPER (or VZEROALL) instruction at the end of any AVX section of code, especially just before executing SSE code; from intrinsics this is _mm256_zeroupper().

Per Intel's state-transition diagram, the penalty when transitioning into or out of state C (legacy SSE executing while the upper halves of the YMM registers contain saved data) is on the order of 100 clock cycles; the other transitions cost only 1 cycle.
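As a sketch (the kernel below is illustrative, not the asker's exact routine), an AVX function can clear the upper YMM halves with _mm256_zeroupper() before returning, so that subsequently executed legacy SSE code stays in the fast state:

```cpp
#include <immintrin.h>

// Illustrative AVX kernel: scale 8 floats, then issue VZEROUPPER so that
// legacy SSE code executed afterwards avoids the state-C transition penalty.
// The target attribute lets this compile without building the whole file -mavx.
__attribute__((target("avx")))
void scale8_avx(const float *in, float *out, float factor)
{
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(in), _mm256_set1_ps(factor));
    _mm256_storeu_ps(out, v);
    _mm256_zeroupper();  // emits VZEROUPPER: zeroes the upper YMM halves
}
```

GCC can also insert VZEROUPPER automatically at function boundaries (the -mvzeroupper option), but in a hand-dispatched mixed binary built at -O0 as above, an explicit call at the AVX/SSE boundary is unambiguous.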
