SSE runs slow after using AVX

Geoffrey

I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application with GCC, using runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:

g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
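For context, a minimal runtime-dispatch sketch (the function names and scalar stubs below are hypothetical, standing in for the real per-ISA object files built above) could use GCC's __builtin_cpu_supports:

```cpp
#include <cstdint>

// Hypothetical dispatcher, not from the question. In the real build each
// implementation lives in its own object file compiled with its own -m flag;
// here identical scalar stubs stand in so the sketch is self-contained.
static void Float_S16_Scalar(const float *in, int16_t *out, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        out[i] = (int16_t)(in[i] * 32767.0f);  // illustrative scaling only
}
static void Float_S16_SSE2(const float *in, int16_t *out, unsigned n)
{
    Float_S16_Scalar(in, out, n);  // stand-in for the -msse2 object
}
static void Float_S16_AVX(const float *in, int16_t *out, unsigned n)
{
    Float_S16_Scalar(in, out, n);  // stand-in for the -mavx object
}

using ConvertFn = void (*)(const float *, int16_t *, unsigned);

// Pick the best routine once, at startup, based on what the CPU supports.
static ConvertFn select_Float_S16()
{
    if (__builtin_cpu_supports("avx"))  return Float_S16_AVX;
    if (__builtin_cpu_supports("sse2")) return Float_S16_SSE2;
    return Float_S16_Scalar;
}
```

The selected pointer is typically cached in a global at startup so the feature check runs only once.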

When I first launch the program, the SSE2 routines perform as expected, with a nice speed boost over the non-SSE routines (around 100% faster). But after I run any AVX routine, the exact same SSE2 routine runs much slower.

Could someone please explain what the cause of this may be?

Before the AVX routine runs, all the tests are around 80-130% faster than FPU math; after the AVX routine runs, the SSE routines are much slower.

If I skip the AVX test routines I never see this performance loss.

Here is my SSE2 routine:

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float  ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m128 mul   = _mm_set_ps1(ratio);

  unsigned int i;
  for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
  {
    __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
  }

  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}

And the AVX version of the same:

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m256 mul  = _mm256_set1_ps(ratio);

  unsigned int i;
  for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
  {
    __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
    out[4] = ((int16_t*)&con)[8];
    out[5] = ((int16_t*)&con)[10];
    out[6] = ((int16_t*)&con)[12];
    out[7] = ((int16_t*)&con)[14];
  }

  for(; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}

I have also run this through Valgrind, which detects no errors.

Nayuki

Mixing 256-bit AVX code and legacy SSE code incurs a state-transition penalty. The most reasonable solution is to execute the VZEROUPPER (or VZEROALL) instruction at the end of any AVX section of code, especially just before executing SSE code; from intrinsics this is _mm256_zeroupper().

Per Intel's state-transition diagram, the penalty when transitioning into or out of state C (legacy SSE executing while the upper halves of the YMM registers contain saved data) is on the order of 100 clock cycles; the other transitions cost only 1 cycle.
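As a sketch (the kernel below is illustrative, not the asker's exact routine), an AVX function can clear the upper YMM halves with _mm256_zeroupper() before returning, so that subsequently executed legacy SSE code stays in the fast state:

```cpp
#include <immintrin.h>

// Illustrative AVX kernel: scale 8 floats, then issue VZEROUPPER so that
// legacy SSE code executed afterwards avoids the state-C transition penalty.
// The target attribute lets this compile without building the whole file -mavx.
__attribute__((target("avx")))
void scale8_avx(const float *in, float *out, float factor)
{
    __m256 v = _mm256_mul_ps(_mm256_loadu_ps(in), _mm256_set1_ps(factor));
    _mm256_storeu_ps(out, v);
    _mm256_zeroupper();  // emits VZEROUPPER: zeroes the upper YMM halves
}
```

GCC can also insert VZEROUPPER automatically at function boundaries (the -mvzeroupper option), but in a hand-dispatched mixed binary built at -O0 as above, an explicit call at the AVX/SSE boundary is unambiguous.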
