Working with the compiler, not against it
Evgeniy Muralev - Senior Software Engineer
Mark Vince - Senior Rendering Engineer
Sperasoft Spb.
We will talk about…
• Fragility of optimizations done by the compiler
• Coding strategies to “fit” the modern CPU
• Ways of optimizing algorithms respecting the CPU
The talk contains assembly code and some low-level material.
We will focus on compilation for x86 and associated CPUs, but the material is
applicable to other instruction sets and architectures
Games industry
• We do care about performance
• ~16-33 ms per frame is the reality we are living in
• A lot of computation performed each frame (both CPU and GPU)
Myths about compilers
• Two extremes:
• “Compilers are terrible, use assembler!”
• “Compiler will do everything for you and is always better than you”
• In reality, neither is true
So,…
• Over the last couple of decades compilers have become significantly
smarter
• Smart optimizations
• Unrolling
• LTO
• …
• However, there is a great danger in assuming that compilers will do
everything for us (they won’t)
• The compiler doesn’t know all data constraints
• It may not be fully aware of the underlying CPU micro-architecture
• Hopefully we will be able to demonstrate this
Utilizing ranges
• Assume we perform some integer (or floating-point) math in code
• Knowing that a variable lies in a specific range can often enable many optimizations
• We wondered how we can specify a limited range of values in C++
Utilizing ranges
How we can specify a limited range of values:
• Types such as int8_t, int_least8_t, int_fast8_t, int16_t…
• Number of bits of a data member in a class/struct (bitfields)
• // struct S { int a : 4; };
• Range of values on a branch:
• // if (a >= 0 && a <= 100)
• Deduced from logic flow:
• // a = a % 100
Utilizing ranges
C++:
int divTest(uint n, uint k)
{
if (k > 0)
{
n %= (k+1);
return n / k;
}
return 0;
}
After this operation n is in the range [0, k], so n / k can only be 0 or 1 (1 exactly when n == k)
The result is easily computed as (pseudo-assembly):
cmp n, k
sete al
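The range argument above can be sketched in C++. This is a minimal illustration, assuming `uint` means `unsigned int` and that `k + 1` does not wrap around; the names `divTestReference` and `divTestCompare` are hypothetical and not from the talk.

```cpp
#include <cassert>
#include <climits>

// Reference version from the slide: a modulo followed by a division.
int divTestReference(unsigned n, unsigned k)
{
    assert(k != UINT_MAX);               // assumption: k + 1 must not wrap to 0
    if (k > 0)
    {
        n %= (k + 1);                    // n is now in [0, k]
        return static_cast<int>(n / k);  // 1 only when n == k, otherwise 0
    }
    return 0;
}

// Hand-written equivalent: the second division becomes a comparison.
int divTestCompare(unsigned n, unsigned k)
{
    assert(k != UINT_MAX);               // same assumption as above
    if (k > 0)
    {
        return (n % (k + 1)) == k ? 1 : 0;
    }
    return 0;
}
```

Both functions return the same value for every valid input; only the second one avoids the extra `div`.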
Utilizing ranges
C++:
int divTest(uint n, uint k)
{
if (k > 0)
{
n %= (k+1);
return n / k;
}
return 0;
}
gcc 6.3 -O3:
test esi, esi
je .L7
lea ecx, [rsi+1]
mov eax, edi
xor edx, edx
div ecx
mov eax, edx
xor edx, edx
div esi
mov esi, eax
.L7:
mov eax, esi
ret
MSVC15 /O2:
test edi,edi
je .L0
push esi
lea esi,[edi+1]
xor edx,edx
mov eax,ecx
div eax,esi
pop esi
mov eax,edx
xor edx,edx
div eax,edi
pop edi
ret
.L0
xor eax,eax
pop edi
ret
• Both compilers emit a second div…
• Each div may take ~30 cycles, depending on the CPU
Utilizing ranges
int doDivisionBranch(int n)
{
if (n >= 0 && n < 123)
{
return n / 123;
}
return 0;
}
gcc 6.3 -O3:
xor eax, eax
ret
MSVC15 /O2:
cmp ecx, 122
ja L1
mov eax, 558694933
imul ecx
sar edx, 4
mov eax, edx
shr eax, 31
add eax, edx
ret
L1:
xor eax, eax
ret
We will look at the MSVC output later
Utilizing ranges
C++:
int doDivisionBranch(int n)
{
if (n >= 0 && n < 123)
{
return n / 123;
}
return 1;
}
gcc 6.3 -O3:
xor eax, eax
cmp edi, 122
seta al
ret
MSVC15 /O2:
cmp ecx, 122
ja L1
mov eax, 558694933
imul ecx
sar edx, 4
mov eax, edx
shr eax, 31
add eax, edx
ret
L1:
mov eax, 1
ret
Still bad
Ranges
A little more complicated, and the compiler already fails to recognize the optimization opportunity
C++:
int doDivisionBranch(int n)
{
if (n >= 0 && n <= 123)
{
return n / 123;
}
return 1;
}
You expect:
xor eax, eax
cmp edi, 122
seta al
ret
gcc 6.3 -O3:
cmp edi, 123
mov eax, 1
ja .L2
mov eax, edi
mov edx, 558694933
sar edi, 31
imul edx
mov eax, edx
sar eax, 4
sub eax, edi
.L2:
rep ret
Oops…
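When the compiler misses this, the expected code can be written by hand. A minimal sketch: `doDivisionBranchManual` and `checkDoDivisionBranch` are hypothetical names, and the cast to `unsigned` relies on negative values mapping to large unsigned values.

```cpp
#include <cassert>

// The function from the slide.
int doDivisionBranch(int n)
{
    if (n >= 0 && n <= 123)
    {
        return n / 123;     // 0 for n in [0, 122], 1 for n == 123
    }
    return 1;               // everything outside [0, 123]
}

// Hand-written form: the result is 0 exactly for n in [0, 122].
// Negative n becomes a large unsigned value, so one compare covers both bounds.
int doDivisionBranchManual(int n)
{
    return static_cast<unsigned>(n) > 122u ? 1 : 0;
}

// Brute-force comparison over a window around the interesting range.
void checkDoDivisionBranch()
{
    for (int n = -1000; n <= 1000; ++n)
    {
        assert(doDivisionBranch(n) == doDivisionBranchManual(n));
    }
}
```

This mirrors the `cmp` / `seta` sequence in the "You expect" listing.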
Ranges continued
• Division by a constant can always be replaced by a multiplication and a
shift right
• For more information check out: Hacker’s Delight by Henry S. Warren
mov eax, edi
mov edx, 558694933
sar edi, 31
imul edx
mov eax, edx
sar eax, 4
sub eax, edi
• However this number above is for any 32-bit value being divided by 123
• Assuming we have just a narrow range of values, we could calculate a
much smaller number
Ranges
• Assuming n in [0, 40], we can use the magic-number multiplier 26 and shift right by 8.
• This gives the compiler more potential for optimization with surrounding code, less register usage, more vectorization opportunities, etc.
• However, our experiments seem to show that the compilers we have tried are not doing this.
C++:
if (n >= 0 && n <= 40)
{
return n / 10;
}
gcc 6.3 -O3:
mov eax, edi
mov edx, 1717986919
sar edi, 31
imul edx
mov eax, edx
sar eax, 2
sub eax, edi
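The narrow-range multiplier (26, shift right by 8) can be checked directly. A minimal sketch, assuming the caller guarantees n in [0, 40]; `div10Narrow` is a hypothetical name.

```cpp
#include <cassert>

// n / 10 for n in [0, 40], using the small multiplier 26 and a shift by 8.
// 26 / 256 is slightly more than 1 / 10, which is exact enough for this range only.
int div10Narrow(int n)
{
    assert(n >= 0 && n <= 40);  // assumption: the range is known to the caller
    return (n * 26) >> 8;
}
```

The small multiplier is only valid because the range is narrow: the same expression first gives a wrong answer at n = 69, where (69 * 26) >> 8 is 7 while 69 / 10 is 6.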
Remember SIMD?
• Vector registers to perform multiple operations in one instruction
• Vector registers are used even for scalar operations
Even wider registers on modern micro-architectures/instruction sets:
ymm (256-bit) / zmm (512-bit)
SIMD support
• SSE/SSE2/SSE3/SSE4/AVX/AVX2/AVX512
• And friends…
Hardware support?
• E.g. SSE2 came out in 2000 (with the Pentium 4), so it is pretty safe to assume that the target processor supports it
SSE refresher
Vectorized code (SIMD):
movups xmm0, XMMWORD PTR [rsi]
movups xmm1, XMMWORD PTR [rdi]
addps xmm1, xmm0
movaps XMMWORD PTR [rsp-24], xmm1
Scalar code (no SIMD):
movss xmm0, DWORD PTR [rsi]
movss xmm1, DWORD PTR [rdi]
addss xmm1, xmm0
movss DWORD PTR [r0], xmm1
movss xmm0, DWORD PTR [rsi]
movss xmm1, DWORD PTR [rdi]
addss xmm1, xmm0
movss DWORD PTR [r0], xmm1
movss xmm0, DWORD PTR [rsi]
movss xmm1, DWORD PTR [rdi]
addss xmm1, xmm0
movss DWORD PTR [r0], xmm1
movss xmm0, DWORD PTR [rsi]
movss xmm1, DWORD PTR [rdi]
addss xmm1, xmm0
movss DWORD PTR [r0], xmm1
struct Vec4
{
float x, y, z, w;
};
Vec4 operator+(const Vec4& v1, const Vec4& v2)
{
return Vec4{v1.x+v2.x, v1.y+v2.y,
v1.z+v2.z, v1.w+v2.w};
}
C++:
Reduction challenge
// int arr[N];
// Randomly generate array arr
static int sum = 0;
void doSomething(int a)
{
sum += a;
}
for (int i = 0; i < N; ++i)
{
doSomething(arr[i]);
}
.L1:
paddd xmm0, XMMWORD PTR [rbx]
add rbx, 16
cmp r13, rbx
jne .L1
movdqa xmm1, xmm0
psrldq xmm1, 8
paddd xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm1, 4
paddd xmm0, xmm1
movd eax, xmm0
add eax, edx
mov DWORD PTR sum[rip], eax
gcc 6.3 -O3 (listing above): packed addition of ints!
May increase overall code size, but the critical loop is still small…
Addition of partial sums across the register after the loop:
s1 s2 s3 s4
s1+s3 s2+s4
s1+s2+s3+s4
C++:
MSVC output
• Vectorized
• 2x loop unrolled
L1:
movups xmm0,xmmword ptr [edi+eax*4]
paddd xmm1,xmm0
movups xmm0,xmmword ptr [edi+eax*4+10h]
add eax,8
paddd xmm2,xmm0
cmp eax,100000h
jl .L1
paddd xmm2,xmm1
lea eax,[esp+10h]
movaps xmm0,xmm2
psrldq xmm0,8
paddd xmm2,xmm0
movaps xmm0,xmm2
psrldq xmm0,4
paddd xmm2,xmm0
push eax
movd esi,xmm2
MSVC15 /O2 (listing above)
• What could possibly confuse the compiler?
Scalar addition
The problem with this one is that FP math is not associative
Reduction challenge
C++:
static float sum = 0;
void doSomething(float a)
{
sum += a;
}
for (int i = 0; i < N; ++i)
{
doSomething(arr[i]);
}
gcc 6.3 -O3:
.L4:
addss xmm0, DWORD PTR [rbp]
add rbp, 4
cmp rbp, OFFSET FLAT:arr+4194304
jne .L4
pop rbx
pop rbp
movss DWORD PTR sum[rip], xmm0
pop r12
ret
• E.g. we changed type to floating-point
.L4:
addps xmm1, XMMWORD PTR [rbp+0]
add rbp, 16
cmp r12, rbp
jne .L4
movaps xmm0, xmm1
pop rbx
movhlps xmm0, xmm1
pop rbp
addps xmm1, xmm0
pop r12
movaps xmm0, xmm1
shufps xmm0, xmm1, 85
addps xmm1, xmm0
movaps xmm0, xmm1
addss xmm0, xmm2
movss DWORD PTR sum[rip], xmm0
ret
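Why the compiler keeps the scalar `addss` loop unless told otherwise can be shown with a small experiment. This is a sketch under the assumption of IEEE-754 single precision evaluated without extra precision (the usual case on x86-64); `sumSequential` and `sumFourLanes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Sequential sum: exactly the order the source code asks for.
float sumSequential(const float* arr, std::size_t n)
{
    assert(arr != nullptr || n == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        sum += arr[i];
    }
    return sum;
}

// Four independent partial sums: the order a 4-wide SIMD reduction would use.
float sumFourLanes(const float* arr, std::size_t n)
{
    assert(arr != nullptr || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    float sum = (s0 + s2) + (s1 + s3);   // combine lanes across the "register"
    for (; i < n; ++i)
    {
        sum += arr[i];                   // tail elements
    }
    return sum;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f} the sequential sum is 1 (the first 1.0f is rounded away) while the four-lane sum is 2. Because the two orders can disagree, the compiler may only switch between them when a flag such as -ffast-math permits it.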
Reduction challenge
C++:
static float sum = 0;
void doSomething(float a)
{
sum += a;
}
for (int i = 0; i < N; ++i)
{
doSomething(arr[i]);
}
gcc 6.3 -O3 -ffast-math
Give the compiler more context
vxorps xmm0, xmm0, xmm0
.L4:
vaddps ymm0, ymm0, YMMWORD PTR [r13]
add r13, 32
cmp r14, r13
jne .L4
vhaddps ymm0, ymm0, ymm0
vhaddps ymm1, ymm0, ymm0
vperm2f128 ymm0, ymm1, ymm1, 1
vaddps ymm0, ymm0, ymm1
vaddss xmm0, xmm0, xmm2
vmovss DWORD PTR sum[rip], xmm0
gcc 6.3 -O3 -ffast-math -march=haswell
• Assume we know the target CPU is the Haswell microarchitecture?
• Let the compiler know!
Neat!
C++:
static float sum = 0;
void doSomething(float a)
{
sum += a;
}
for (int i = 0; i < N; ++i)
{
doSomething(arr[i]);
}
• An external function the compiler cannot see breaks all of these optimizations!
• Link-time optimization may help
• If not building a DLL
• You are not just paying for an extra function call, you also lose vectorization!
• And other potential optimizations!
Reduction challenge
C++:
static float sum = 0;
extern void doSomething(float a);
for (int i = 0; i < N; ++i)
{
doSomething(arr[i]);
}
RISC vs CISC
• RISC: Reduced Instruction Set Computing
• CISC: Complex Instruction Set Computing
• RISC:
• Simple addressing modes
• Uniform instruction format
• Fewer data types in hardware
• => Larger semantic gap between ISA and higher-level program
• => But simplifies hardware a lot
Microops
• On x86 the visible ISA is CISC, but the underlying CPU actually works in a RISC-like fashion…
• Instructions are decoded to micro-operations
add eax, [s] => load & add (2 microops)
push eax => sub esp, 4 & mov [esp], eax (2 microops)
Out-of-order execution
Front-end: fetch and decode instructions; in-order
Out-of-order part: execute microops; out-of-order*
*Serialized again later to conform to memory consistency model requirements
Data dependencies
• This kind of loop can actually perform badly
• Each iteration depends on the previous one
• Data dependencies kill out-of-order execution!
C++:
for (int i = 2; i < N; ++i)
{
arr[i] = arr[i-2] + arr[i-1];
}
False dependencies
• But we are talking about *real* dependencies
• Remember Register renaming?
movaps xmm1, XMMWORD PTR .LC0[rip]
xor eax, eax
.L4:
add rax, 16
movaps xmm0, XMMWORD PTR a[rax-16]
mulps xmm0, xmm1
movaps XMMWORD PTR b[rax-16], xmm0
cmp rax, 4194304
jne .L4
C++:
for (int i = 0; i < N; ++i)
{
b[i] = a[i] * 0.5f;
}
False dependencies
Dot = a1.x * a2.x + a1.y * a2.y + a1.z * a2.z + a1.w * a2.w
movss xmm0, DWORD PTR [rdi]
movss xmm1, DWORD PTR [rdi+4]
mulss xmm0, DWORD PTR [rsi]
mulss xmm1, DWORD PTR [rsi+4]
movss xmm2, DWORD PTR [rdi+12]
mulss xmm2, DWORD PTR [rsi+12]
addss xmm0, xmm1
movss xmm1, DWORD PTR [rdi+8]
mulss xmm1, DWORD PTR [rsi+8]
addss xmm1, xmm2
addss xmm0, xmm1
• False dependencies!
• Write-after-write are false dependencies
• Write-after-read are false dependencies
• Register renaming eliminates both
Register renaming
Physical register file
• Register renaming maps architectural registers to physical ones
• In B2 xmm1 will map to different physical location than in B1
movss xmm0, DWORD PTR [rdi]
movss xmm1, DWORD PTR [rdi+4] ; B1: first write to xmm1
mulss xmm0, DWORD PTR [rsi]
mulss xmm1, DWORD PTR [rsi+4]
movss xmm2, DWORD PTR [rdi+12]
mulss xmm2, DWORD PTR [rsi+12]
addss xmm0, xmm1
movss xmm1, DWORD PTR [rdi+8] ; B2: xmm1 is written again
mulss xmm1, DWORD PTR [rsi+8]
addss xmm1, xmm2
addss xmm0, xmm1
Small recap
• Recap:
• Compiler won’t solve all problems for you
• Compiler optimizations may unexpectedly break, so be careful!
• Instructions executed out of order, blocked by real dependencies only
• Register renaming removes false dependencies
• Code that looks simple and elegant is not necessarily fast
• Let’s examine some code examples to see how this affects optimization
• Frame = one iteration of a loop
• Discrete frame = frame NOT dependent on any other frame
• Iterative frame = frame depends on another (previous) frame
// Discrete frames
for( int i = …. )
{
result[i] = i* 3 + 6;
}
// Iterative frames
for( int i = …. )
{
result[i] = i* 3 + result[i-1];
}
Frame dependencies
What blocks vectorization?
• There are various reasons why code may not be vectorized by the compiler.
• Compilers continue to improve, but don’t assume your code is vectorized
• Check the assembler!
Two important causes of vectorization failure:
Frame dependency - a frame depends on data from other (previous) frames.
Data collision - the target (e.g. an array index) is not unique.
AVX 512
AVX512 ( Intel Xeon Phi, Knights Landing )
• Special instructions (compress / expand) – give compilers a chance to
process several loop iterations (typically 4 or 8) in a single iteration.
Gives compiler better chance to vectorize.
• Conflict detection. AVX512-CD - Can be used when uniqueness of array
target cannot be determined.
• When we are writing to an array in a loop –
• construct a SIMD vector of indices in the array, for every 4 or 8
iterations.
• use vpconflictd to create a bitmask of non-unique indices.
• bitmask can then be used to vectorize unique elements and process
separate non-unique targets after.
• Specifically designed for compilers to vectorize loops.
for (int i = … )
{
int index = calculateIndex(i);
array[index] = ….;
}
Index may NOT be unique
- For example, hashing algorithms
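The conflict check can be modelled in scalar code. This is only an illustrative model of what a conflict-detection step reports, not the instruction itself; `conflictMasks` is a hypothetical name and the 32-lane limit is an assumption tied to the 32-bit mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each lane i, build a bitmask of the earlier lanes j < i that hold the same index.
// A zero mask means lane i can be written together with the lanes before it.
std::vector<std::uint32_t> conflictMasks(const std::vector<std::uint32_t>& indices)
{
    assert(indices.size() <= 32);   // one mask bit per earlier lane
    std::vector<std::uint32_t> masks(indices.size(), 0u);
    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        for (std::size_t j = 0; j < i; ++j)
        {
            if (indices[j] == indices[i])
            {
                masks[i] |= (1u << j);
            }
        }
    }
    return masks;
}
```

For the indices {5, 7, 5, 5} the masks are {0, 0, 1, 5}: lanes 0 and 1 are unique so far, lane 2 collides with lane 0, and lane 3 collides with lanes 0 and 2. The lanes with a non-zero mask are the ones that need separate processing.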
AVX512 – conflict detection
Dependencies
We shall…
Look at how dependencies between frames can ‘break’ your optimizations.
Look at what we can do to get around these problems (when you don’t have AVX512!).
Look at just one type of optimization, evaluating polynomials, to demonstrate.
Finally, cover some useful mathematical rules for breaking down difficult formulas, and how to use what we know about vectorization.
• Summing squares.
• Each iteration is independent, so it is easy to vectorize.
• Yes, the sum of squares from 1 to m could be calculated as m(m+1)(2m+1) / 6 (here m = n - 1), but the compiler doesn’t know this!
int xSqrSum(const int n)
{
int result = 0;
int a = 1;
int b = 0;
for (int i = 1; i < n; ++i)
{
b += a;
a += 2;
result += b;
}
return result;
}
Dependencies block
vectorization
- Dependencies between frames
- Dependencies inside frame
int xSqrSum(const int n)
{
int result = 0;
for (int i = 1; i < n; ++i)
{
result += i*i;
}
return result;
}
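The closed form mentioned on this slide can be compared against the loop. A minimal sketch: the function names are hypothetical, and the bound on n is an assumption chosen so that the result fits in a 32-bit int.

```cpp
#include <cassert>

// Discrete loop from the slide: sums i*i for i in [1, n).
int xSqrSumDiscrete(int n)
{
    int result = 0;
    for (int i = 1; i < n; ++i)
    {
        result += i * i;
    }
    return result;
}

// Closed form: with m = n - 1, the sum of squares 1..m is m(m+1)(2m+1)/6.
int xSqrSumClosedForm(int n)
{
    assert(n <= 1800);              // assumption: keeps the result inside a 32-bit int
    if (n <= 1)
    {
        return 0;                   // the loop body never runs
    }
    const long long m = n - 1;      // the loop stops one short of n
    return static_cast<int>(m * (m + 1) * (2 * m + 1) / 6);
}
```

Note the off-by-one: because the loop runs while i < n, the formula is applied to n - 1, not to n.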
Discrete or Iterative - $x^2$
Assembler
// Too much code to list here … main loop is….
.L30:
movdqa xmm2, xmm3
movdqa xmm1, xmm3
add edx, 1
pmuludq xmm2, xmm3
psrlq xmm1, 32
pshufd xmm2, xmm2, 8
pmuludq xmm1, xmm1
pshufd xmm1, xmm1, 8
cmp eax, edx
paddd xmm3, xmm4
punpckldq xmm2, xmm1
paddd xmm0, xmm2
ja .L30
xSqrSum (int):
cmp edi, 1
jle .L24
lea esi, [rdi-1+rdi]
xor ecx, ecx
mov edx, 1
xor eax, eax
.L23:
add ecx, edx
add edx, 2
add eax, ecx
cmp esi, edx
jne .L23
rep ret
.L24:
xor eax, eax
ret
Listings above: Discrete (vectorized) main loop first, then Iterative
Two multiplies?
- If we knew $x^2$ stays in a smaller range, we could do better.
- Another thing the compiler could use range information for.
• Vectorized code often has a bigger footprint, but is much faster
• Tail code can be reduced by making the loop limit an exact multiple of the block size.
• We are really concerned with the main loop.
• Beware: if a loop calls an external method that cannot be ‘seen’ (e.g. in a DLL), vectorization can fail because the method takes single scalar values, one per call.
• Same problem if you do NOT use link-time optimization (whole program optimization).
Setup code - initial values, etc.
Main loop - processes blocks of typically 4 or 8 elements at once.
Tail code - processes remaining elements.
Fast code, big code
Strip-mining and vectorization
• We can break the range of values in a loop into multiple sub-ranges.
• To calculate squares from 1 to 100 we could split this into 4 ranges, and process each strip
in a different channel of a SIMD register, in parallel.
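[Figure: the range 1 to 100 is split into four strips (1 to 25, 26 to 50, 51 to 75, 76 to 100); $x^2$ is evaluated for each strip in its own channel of one SIMD pack.]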
Breaking up ranges
• Firstly, we create a function that takes a range, rather than 1..N
int xSqrIterativeRange( const int lo, const int hi )
{
    int result = 0;
    // Initial loop values calculated, explained later
    int b = (lo - 1)*(lo - 1);
    int a = lo + lo - 1;
    for (int i = lo; i < hi; ++i)
    {
        b += a;        // b is now i*i
        a += 2;        // next odd difference
        result += b;   // accumulate the square
    }
    return result;
}
Parallel ranges .. 4 channels
int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range)
{
    int result = 0;
    int b0 = (lo0 - 1)*(lo0 - 1); int a0 = lo0 + lo0 - 1;
    int b1 = (lo1 - 1)*(lo1 - 1); int a1 = lo1 + lo1 - 1;
    …… // b2, a2, b3, a3
    for (int i = 0; i < range; ++i)
    {
        b0 += a0; a0 += 2;
        b1 += a1; a1 += 2;
        ….. // b2,a2, b3,a3 …
        result += b0 + b1 + b2 + b3;
    }
    return result;
}
Ranges NOT vectorized
• gcc failed to vectorize this. Why?
• Answer: the single combined sum in the loop:
result += b0 + b1 + b2 + b3;
• If we replace the sum in the loop with four separate partial sums, it does vectorize:
result0 += b0;
result1 += b1;
result2 += b2;
result3 += b3;
• Or… we can optimize this by hand, using SIMD intrinsics.
int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range)
{
    int result0 = 0;
    int result1 = 0;
    …. // result2, result3
    int b0 = (lo0 - 1)*(lo0 - 1); int a0 = lo0 + lo0 - 1;
    int b1 = (lo1 - 1)*(lo1 - 1); int a1 = lo1 + lo1 - 1;
    …… // b2, a2, b3, a3
    for (int i = 0; i < range; ++i)
    {
        b0 += a0; a0 += 2; result0 += b0;
        b1 += a1; a1 += 2; result1 += b1;
        ….. // b2,a2, b3,a3 …
    }
    return result0 + result1 + result2 + result3;
}
Vectorizable code
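A complete version of the four-strip idea can be sketched with small arrays standing in for the four SIMD channels. The names `sumSquaresRange` and `sumSquaresRange4` are hypothetical, and whether a given compiler vectorizes the inner loop is not guaranteed; check the assembler.

```cpp
#include <cassert>

// Sum of i*i for i in [lo, hi), using additions only inside the loop.
int sumSquaresRange(int lo, int hi)
{
    assert(lo <= hi);
    int result = 0;
    int b = (lo - 1) * (lo - 1);    // square just before the range
    int a = lo + lo - 1;            // difference lo^2 - (lo-1)^2
    for (int i = lo; i < hi; ++i)
    {
        b += a;                     // b is now i*i
        a += 2;                     // next odd difference
        result += b;
    }
    return result;
}

// Four strips of equal length, one per channel, each with its own partial sum.
int sumSquaresRange4(const int (&lo)[4], int range)
{
    int a[4], b[4], result[4];
    for (int k = 0; k < 4; ++k)
    {
        b[k] = (lo[k] - 1) * (lo[k] - 1);
        a[k] = lo[k] + lo[k] - 1;
        result[k] = 0;
    }
    for (int i = 0; i < range; ++i)
    {
        for (int k = 0; k < 4; ++k)     // no dependency between channels
        {
            b[k] += a[k];
            a[k] += 2;
            result[k] += b[k];
        }
    }
    return result[0] + result[1] + result[2] + result[3];
}
```

Calling `sumSquaresRange4` with the strip starts {1, 26, 51, 76} and a range of 25 covers 1 to 100 and gives 338350, the same as `sumSquaresRange(1, 101)`.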
Quick summary
• Optimization of discrete loops can add dependencies which are hard to vectorize
• We can break up ranges into sub-ranges and re-code for parallelism, i.e. strip-mining.
• We may need to hand-craft SIMD with intrinsics.
• Often we can find clever iterative ways of ‘optimizing’ loop functions to require fewer mathematical operations …
Iterative - floating point
• Big problem – cumulative errors.
• Compiler may not vectorize because re-arranging terms may not give the same
result due to rounding etc.
• Some classical optimizations not always worth the problems.
Eg. loop for calculating sin and cos iteratively done with a couple of additions
per iteration, but cumulative errors may be problematic.
Example… Polynomials
• A polynomial can be evaluated iteratively using only additions per frame.
• Number of additions = maximum power + 1
• But which is fastest? We can’t assume that fewer mathematical operations means faster code.
• Only looking at integers here. The same principles apply to floating point, but there are other problems too.
E.g. $5x^3 + 3x^2 + x - 4$
Iterative: 4 additions, each dependent on the last
Discrete: several multiplications and adds.
Polynomial differencing
• This is an old idea, going back to Babbage’s Difference Engine etc.
• Let $P(x)$ be a polynomial of order $n$; then $P(x+1) - P(x)$ is of order $n-1$
E.g. $P(x) = x^2$, $P(x+1) - P(x) = 2x + 1$ (1)
Let $Q(x) = 2x + 1$, $Q(x+1) - Q(x) = 2$ (2)
• Keep repeating the process, recording each new polynomial, until a constant is reached.
• In this example, we now have the polynomials ($x^2$, $2x + 1$, $2$)
• We begin by calculating each of these terms at our initial value; then every time we iterate the loop we add each term to the previous one, and we get the next value of the polynomial.
$x$      = 2  3  4   5 …
$x^2$    = 4  9  16  25 …
$2x + 1$ = 5  7  9 …
$2$      = 2  2 …
void Loop()
{
int a = 4;
int b = 5;
int c = 2;
while(….)
{
// a = x^2 here
a += b;
b += c;
}
}
Polynomial differences
($5x^3 + 3x^2 + x - 4$)
• Discrete: (compiler vectorizes the main loop)
int polynomial(int n)
{
int sum = 0;
for( int i = 1; i < n; ++i )
{
int result = 5 * i*i*i + 3 * i*i + i - 4;
sum += result;
}
return sum;
}
Compiler reduces the number of multiplies
with shifts and adds, etc
($5x^3 + 3x^2 + x - 4$)
• Iterative: (compiler does not vectorize this! )
int polynomial(int n)
{
int sum = 0;
int a = -4 ; // 5x^3 + 3x^2 + x – 4, at x = 0
int b = 9 ; // q(x) = p(x+1) - p(x) = 15x^2 + 21x + 9, at x = 0
int c = 36; // r(x) = q(x+1) – q(x) = 30x + 36, at x = 0
int d = 30; // r(x+1) – r(x) = 30
for( int i = 1; i < n; ++i )
{
a += b; b+= c; c+= d;
sum += a;
}
return sum;
}
Dependencies !!!
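The starting values a, b, c, d do not have to be derived by hand: they are the forward differences of the polynomial at the start point. A minimal sketch for cubics; `Cubic`, `evalCubic`, `forwardDifferences` and `sumByDifferences` are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Coefficients {c0, c1, c2, c3} of c3*x^3 + c2*x^2 + c1*x + c0.
using Cubic = std::array<int, 4>;

// Direct (discrete) evaluation, Horner form.
int evalCubic(const Cubic& c, int x)
{
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0];
}

// Returns {p(x0), first difference, second difference, third difference} at x0.
std::array<int, 4> forwardDifferences(const Cubic& c, int x0)
{
    std::array<int, 4> d{};
    for (int k = 0; k < 4; ++k)
    {
        d[k] = evalCubic(c, x0 + k);
    }
    // In-place difference table: after pass j, d[j..3] hold the j-th differences.
    for (int j = 1; j < 4; ++j)
    {
        for (int k = 3; k >= j; --k)
        {
            d[k] -= d[k - 1];
        }
    }
    return d;
}

// Sum of p(i) for i in [1, n), using three additions per value plus the sum.
int sumByDifferences(const Cubic& c, int n)
{
    assert(n >= 0);
    std::array<int, 4> d = forwardDifferences(c, 0);
    int sum = 0;
    for (int i = 1; i < n; ++i)
    {
        d[0] += d[1];   // each step depends on the previous one
        d[1] += d[2];
        d[2] += d[3];
        sum += d[0];
    }
    return sum;
}
```

For the coefficients {-4, 1, 3, 5} of the example polynomial, `forwardDifferences` at 0 gives {-4, 9, 36, 30}, matching the a, b, c, d used in the iterative loop.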
Modular arithmetic
• To prevent integer calculations going out of range (outside the native bit-length of the CPU)
• We limit our discussion here to briefly showing a simple technique to solve some problems.
• We look at an example to demonstrate.
• Compilers are not using these techniques, so it’s up to you!
Continued…
• What kind of problem?
• The result is ‘in range’ but the partial results are too big.
• A simple example is: $r = a \cdot b / c$
• The result of $a \cdot b$ may be more than one machine word
• Instruction sets typically have a multiply that produces two words, and a divide that takes a two-word numerator as input.
On x86 we multiply two 32-bit values to get a 64-bit result in two registers (edx:eax).
If we divide it, the idiv instruction takes both registers as its 64-bit numerator input.
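In C++ the same effect is usually reached by widening the intermediate product. A minimal sketch, assuming unsigned 32-bit inputs and a quotient that fits back into 32 bits; `mulDiv` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// r = a * b / c, where a * b may not fit in 32 bits but the final result does.
std::uint32_t mulDiv(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    assert(c != 0);
    const std::uint64_t wide = static_cast<std::uint64_t>(a) * b;  // two-word product
    const std::uint64_t q = wide / c;
    assert(q <= UINT32_MAX);        // assumption: the result is 'in range'
    return static_cast<std::uint32_t>(q);
}
```

For example, 4000000000 * 3 overflows 32 bits, yet `mulDiv(4000000000u, 3u, 4u)` returns 3000000000. Which instructions the compiler picks for the wide multiply and divide depends on the target.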
Modular Exponentiation
So, let’s look at a more interesting problem that is not solved so easily:
$r = a^b \bmod c$
Remember… we are dealing with positive integers here.
We know that $r$ must be in the range $[0, c-1]$.
Raising a to the power b can result in huge numbers, far too big for the machine word length.
• A few maths things we can do, but we want a general solution.
• We reduce the power b.
• We create an algorithm which iterates through a loop to get the result.
Reducing the power
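$\varphi(c)$ = Euler totient function:
$\varphi(c) = c \prod_{p \mid c} \left(1 - \frac{1}{p}\right) = c \left(1 - \frac{1}{p_0}\right)\left(1 - \frac{1}{p_1}\right) \cdots$
where each $p$ is a distinct prime factor of $c$.
Example:
$\varphi(100) = 100 \left(1 - \frac{1}{2}\right)\left(1 - \frac{1}{5}\right) = 40$,
since $100 = 2^2 \cdot 5^2$ (the prime factors of 100 being 2 and 5).
If $r = a^b \bmod c$ and $a$ and $c$ are coprime (Euler's theorem), then
$r = a^{\,b \bmod \varphi(c)} \bmod c$
Example:
$567^{123} \bmod 100$
$= 567^{(123 \bmod 40)} \bmod 100$ // since $\varphi(100) = 40$ and $\gcd(567, 100) = 1$
$= 567^3 \bmod 100$
• A smaller power makes the modular exponentiation faster.
• Keep a table of precomputed totient functions.
$567^{123}$ has 1126 bits!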
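The totient itself is straightforward to compute by trial division. A minimal sketch for small moduli, as an alternative to a precomputed table; `totient` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Euler totient by trial division: for each distinct prime factor p of c,
// multiply the running result by (1 - 1/p).
std::uint32_t totient(std::uint32_t c)
{
    assert(c != 0);
    std::uint32_t result = c;
    for (std::uint32_t p = 2; static_cast<std::uint64_t>(p) * p <= c; ++p)
    {
        if (c % p == 0)
        {
            while (c % p == 0)
            {
                c /= p;             // strip every power of p
            }
            result -= result / p;   // result *= (1 - 1/p)
        }
    }
    if (c > 1)
    {
        result -= result / c;       // one prime factor larger than sqrt(c) remains
    }
    return result;
}
```

With `totient(100)` equal to 40, the exponent 123 reduces to 123 % 40 = 3. The reduction is only valid when the base and the modulus are coprime.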
Useful relationships
Suppose we have three integers: a, b, c
Let: $r = a \bmod c$
$s = b \bmod c$
Then:
$ab \bmod c = (rs) \bmod c$ (1)
$(a + b) \bmod c = (r + s) \bmod c$ (2)
Simple, but really useful for breaking a formula into smaller parts.
This allows us to find solutions which consist of loops that iteratively evaluate a result, without going out of range.
To evaluate: $r = a^b \bmod c$
First reduce b using the totient function => less computation.
Then we can break up $a^p$ using the bits of $p$: // $p = b \bmod \varphi(c)$
e.g. ($p = 21$): $a^{21} \bmod c = (a^{16} \cdot a^{4} \cdot a^{1}) \bmod c$
Create a loop to calculate each binary power of ‘a’ mod c by squaring the previous one:
$a^{2n} \bmod c = (a^{n} \bmod c)^2 \bmod c$
Combining these formulae we can get some code…
Exponentiation continued…
int expMod( int a, int b, int c) // for 32 bit, b < 2^30 so that doubling the mask cannot overflow
{
    int mask = 1;
    int r = 1;
    while(1)
    {
        if (b & mask)   // combine for set bits…
        {
            r *= a;
            r %= c;
        }
        mask += mask;
        if (mask > b) break;
        a *= a;         // next binary power of a
        a %= c;
    }
    return r;
}
(recall, $a^{21} \bmod c = (a^{16} \cdot a^{4} \cdot a^{1}) \bmod c$)
$a^{2n} \bmod c = (a^{n} \bmod c)^2 \bmod c$
expMod
uint expMod( uint a, uint b, uint c ) // assumes a <= MAXROOT and c <= MAXROOT + 1
{
    uint mask = 1; uint r = 1;
    while(1)
    {
        if (b & mask)
        {
            r *= a;
            if ( r > MAXROOT ) r %= c;
        }
        mask += mask;
        if (mask > b) break;
        a *= a;
        if ( a > MAXROOT ) a %= c;
    }
    if (r >= c) r %= c;
    return r;
}
• We only need to take the mod when the next multiply would otherwise cause an overflow
• This means that the final result may need a mod to bring it into range.
• MAXROOT = the square root of the biggest integer: 32-bit = 0xFFFF, 64-bit = 0xFFFFFFFF
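For comparison, here is a self-contained square-and-multiply routine that can be compiled and tested as is. It is a sketch, not the talk's code: it shifts b right instead of doubling a mask, and it uses 64-bit intermediates instead of the MAXROOT test. The name `expMod64` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// r = a^b mod c by repeated squaring, examining the bits of b from least significant up.
std::uint32_t expMod64(std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
    assert(c != 0);
    std::uint64_t base = a % c;     // a^(2^i) mod c, starting with i = 0
    std::uint64_t r = 1 % c;        // handles c == 1
    while (b != 0)
    {
        if (b & 1u)
        {
            r = (r * base) % c;     // combine for set bits
        }
        base = (base * base) % c;   // base < 2^32, so the square fits in 64 bits
        b >>= 1;
    }
    return static_cast<std::uint32_t>(r);
}
```

As a check of the totient reduction, `expMod64(567, 123, 100)` and `expMod64(567, 3, 100)` both return 63.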
One last optimization
• Dependencies both inside loop frames and between frames.
• Difficult to break into sub-ranges – can’t determine initial state for
each sub-range.
• BUT!... Many other ways to improve performance of this kind of
algorithm – beyond the scope of this discussion.
Same old problems
Another code sample
• Another example of using modular arithmetic
• Uses just additive operations and compares
• Examines the bits of ‘b’ from right to left (least significant to most)
• Again, the code is very suboptimal, but just for demonstration.
b mod c: no division, any size numerator
int mod( int b, int c) // b % c without division for 32 bit, b range [1, 2^30-1] so that doubling the mask cannot overflow
{
    int mask = 1;
    int r = 0;
    int a = 1;          // a = 2^i mod c
    while(1)
    {
        if (b & mask)
        {
            r += a;     // add 2^i mod c to the total
            if( r >= c ) r -= c;
        }
        mask += mask;
        if (mask > b) break;
        a += a;         // calculate 2^i mod c each iteration
        if( a >= c ) a -= c;
    }
    return r;
}
Recap
• Remember instruction-level parallelism
• Vectorization for SIMD is easily broken
• Range information is not fully used by the compiler
• Mathematical tricks to optimize may not be as effective as they appear
• Modular arithmetic can be used to break up difficult computations
Questions?
Evgeniy: evgeniy.muralev@sperasoft.com
Mark: mark.vince@sperasoft.com
Sperasoft Spb.

Contenu connexe

Tendances

C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingcppfrug
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеSergey Platonov
 
Алексей Кутумов, Coroutines everywhere
Алексей Кутумов, Coroutines everywhereАлексей Кутумов, Coroutines everywhere
Алексей Кутумов, Coroutines everywhereSergey Platonov
 
Multithreading done right
Multithreading done rightMultithreading done right
Multithreading done rightPlatonov Sergey
 
Дмитрий Демчук. Кроссплатформенный краш-репорт
Дмитрий Демчук. Кроссплатформенный краш-репортДмитрий Демчук. Кроссплатформенный краш-репорт
Дмитрий Демчук. Кроссплатформенный краш-репортSergey Platonov
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Mr. Vengineer
 
C++20 the small things - Timur Doumler
C++20 the small things - Timur DoumlerC++20 the small things - Timur Doumler
C++20 the small things - Timur Doumlercorehard_by
 
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...corehard_by
 
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Gor Nishanov,  C++ Coroutines – a negative overhead abstractionGor Nishanov,  C++ Coroutines – a negative overhead abstraction
Gor Nishanov, C++ Coroutines – a negative overhead abstractionSergey Platonov
 
Алексей Кутумов, Вектор с нуля
Алексей Кутумов, Вектор с нуляАлексей Кутумов, Вектор с нуля
Алексей Кутумов, Вектор с нуляSergey Platonov
 
Basic C++ 11/14 for Python Programmers
Basic C++ 11/14 for Python ProgrammersBasic C++ 11/14 for Python Programmers
Basic C++ 11/14 for Python ProgrammersAppier
 
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Mr. Vengineer
 
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...Sergey Platonov
 
Protocol handler in Gecko
Protocol handler in GeckoProtocol handler in Gecko
Protocol handler in GeckoChih-Hsuan Kuo
 
Windbg랑 친해지기
Windbg랑 친해지기Windbg랑 친해지기
Windbg랑 친해지기Ji Hun Kim
 
C++の話(本当にあった怖い話)
C++の話(本当にあった怖い話)C++の話(本当にあった怖い話)
C++の話(本当にあった怖い話)Yuki Tamura
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageablecorehard_by
 
Best Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' MistakesBest Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' MistakesAndrey Karpov
 
Антон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиАнтон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиSergey Platonov
 

Tendances (20)

C++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogrammingC++ How I learned to stop worrying and love metaprogramming
C++ How I learned to stop worrying and love metaprogramming
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI веке
 
Алексей Кутумов, Coroutines everywhere
Алексей Кутумов, Coroutines everywhereАлексей Кутумов, Coroutines everywhere
Алексей Кутумов, Coroutines everywhere
 
Multithreading done right
Multithreading done rightMultithreading done right
Multithreading done right
 
Дмитрий Демчук. Кроссплатформенный краш-репорт
Дмитрий Демчук. Кроссплатформенный краш-репортДмитрий Демчук. Кроссплатформенный краш-репорт
Дмитрий Демчук. Кроссплатформенный краш-репорт
 
TensorFlow XLA RPC
TensorFlow XLA RPCTensorFlow XLA RPC
TensorFlow XLA RPC
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。
 
C++20 the small things - Timur Doumler
C++20 the small things - Timur DoumlerC++20 the small things - Timur Doumler
C++20 the small things - Timur Doumler
 
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
C++ CoreHard Autumn 2018. Concurrency and Parallelism in C++17 and C++20/23 -...
 
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Gor Nishanov,  C++ Coroutines – a negative overhead abstractionGor Nishanov,  C++ Coroutines – a negative overhead abstraction
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
 
Алексей Кутумов, Вектор с нуля
Алексей Кутумов, Вектор с нуляАлексей Кутумов, Вектор с нуля
Алексей Кутумов, Вектор с нуля
 
Basic C++ 11/14 for Python Programmers
Basic C++ 11/14 for Python ProgrammersBasic C++ 11/14 for Python Programmers
Basic C++ 11/14 for Python Programmers
 
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
 
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
 
Protocol handler in Gecko
Protocol handler in GeckoProtocol handler in Gecko
Protocol handler in Gecko
 
Windbg랑 친해지기
Windbg랑 친해지기Windbg랑 친해지기
Windbg랑 친해지기
 
C++の話(本当にあった怖い話)
C++の話(本当にあった怖い話)C++の話(本当にあった怖い話)
C++の話(本当にあった怖い話)
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageable
 
Best Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' MistakesBest Bugs from Games: Fellow Programmers' Mistakes
Best Bugs from Games: Fellow Programmers' Mistakes
 
Антон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствамиАнтон Наумович, Система автоматической крэш-аналитики своими средствами
Антон Наумович, Система автоматической крэш-аналитики своими средствами
 


Evgeniy Muralev, Mark Vince, Working with the compiler, not against it

  • 1. Working with the compiler, not against it Evgeniy Muralev - Senior Software Engineer Mark Vince - Senior Rendering Engineer Sperasoft Spb.
  • 2. We will talk about… • Fragility of optimizations done by the compiler • Coding strategies to “fit” the modern CPU • Ways of optimizing algorithms respecting the CPU The talk contains assembly code and some low-level material. We will focus on compilation for x86 and associated CPUs, but the material is applicable to other instruction sets and architectures.
  • 3. Games industry • We do care about performance • ~16-33ms per frame is reality we are living in • A lot of computation performed each frame (both CPU and GPU)
  • 4. Myths about compilers • Two extremes: • “Compilers are terrible, use assembler!” • “Compiler will do everything for you and is always better than you” • In reality, neither is true
  • 5. So,… • Over the last couple of decades compilers have become significantly smarter • Smart optimizations • Unrolling • LTO • … • However, there is a great danger in assuming that compilers will do everything for us (they won’t) • The compiler doesn’t know all data constraints • It may not be fully aware of the underlying CPU micro-architecture • Hopefully we will be able to demonstrate this
  • 6. Utilizing ranges • Assume we perform some integer (or floating-point) math in code • Knowing that some variable lies in a specific range may often enable many optimizations • We wondered how we can specify a limited range of values in C++
  • 7. Utilizing ranges How we can specify a limited range of values: • Types such as int8_t, int_least8_t, int_fast8_t, int16_t… • Number of bits on a data element in class/struct (bitfields) • // struct S { int a : 4; }; • Range of values on a branch: • // if (a >= 0 && a <= 100) • Deduced from logic flow: • // a = a % 100
  • 8. Utilizing ranges
    C++:
      int divTest(uint n, uint k)
      {
          if (k > 0)
          {
              n %= (k+1);
              return n / k;
          }
          return 0;
      }
    After the modulo operation n is in the range [0, k], so n / k is 1 only when n == k and 0 otherwise.
    The result is easily computed as (pseudo-assembly):
      cmp n, k
      sete al
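The range argument on slide 8 can be sketched in plain C++. This is only an illustration, not code from the talk: `divTestReduced` is a hypothetical name, `uint` is assumed to be a 32-bit unsigned type, and k is assumed to be smaller than the maximum unsigned value so that `k + 1` does not wrap to zero.

```cpp
#include <cstdint>

// Form from the slide: one modulo followed by one division.
int divTest(std::uint32_t n, std::uint32_t k)
{
    if (k > 0)
    {
        n %= (k + 1);                    // n is now in [0, k]
        return static_cast<int>(n / k);  // quotient is 1 only when n == k
    }
    return 0;
}

// Hypothetical hand-reduced form: the second division becomes a comparison,
// which is what the "cmp / sete" pseudo-assembly expresses.
int divTestReduced(std::uint32_t n, std::uint32_t k)
{
    if (k > 0)
    {
        n %= (k + 1);
        return n == k ? 1 : 0;
    }
    return 0;
}
```

Both functions still pay for the modulo; only the second division is replaced.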
  • 9. Utilizing ranges
    C++:
      int divTest(uint n, uint k)
      {
          if (k > 0)
          {
              n %= (k+1);
              return n / k;
          }
          return 0;
      }
    gcc 6.3 -O3:
      test esi, esi
      je .L7
      lea ecx, [rsi+1]
      mov eax, edi
      xor edx, edx
      div ecx
      mov eax, edx
      xor edx, edx
      div esi
      mov esi, eax
      .L7:
      mov eax, esi
      ret
    MSVC15 /O2:
      test edi,edi
      je .L0
      push esi
      lea esi,[edi+1]
      xor edx,edx
      mov eax,ecx
      div eax,esi
      pop esi
      mov eax,edx
      xor edx,edx
      div eax,edi
      pop edi
      ret
      .L0:
      xor eax,eax
      pop edi
      ret
    • Both compilers insert an extra div…
    • It may take ~30 cycles depending on the CPU
  • 10. Utilizing ranges
    C++:
      int doDivisionBranch(int n)
      {
          if (n >= 0 && n < 123)
          {
              return n / 123;
          }
          return 0;
      }
    gcc 6.3 -O3:
      xor eax, eax
      ret
    MSVC15 /O2:
      cmp ecx, 122
      ja L1
      mov eax, 558694933
      imul ecx
      sar edx, 4
      mov eax, edx
      shr eax, 31
      add eax, edx
      ret
      L1:
      xor eax, eax
      ret
    Will look at it later
  • 11. Utilizing ranges
    C++:
      int doDivisionBranch(int n)
      {
          if (n >= 0 && n < 123)
          {
              return n / 123;
          }
          return 1;
      }
    gcc 6.3 -O3:
      xor eax, eax
      cmp edi, 122
      seta al
      ret
    MSVC15 /O2:
      cmp ecx, 122
      ja L1
      mov eax, 558694933
      imul ecx
      sar edx, 4
      mov eax, edx
      shr eax, 31
      add eax, edx
      ret
      L1:
      mov eax, 1
      ret
    Still bad
  • 12. Ranges
    A slightly more complicated case, and the compiler already fails to recognize the optimization opportunity
    C++:
      int doDivisionBranch(int n)
      {
          if (n >= 0 && n <= 123)
          {
              return n / 123;
          }
          return 1;
      }
    You expect:
      xor eax, eax
      cmp edi, 122
      seta al
      ret
    gcc 6.3 -O3:
      cmp edi, 123
      mov eax, 1
      ja .L2
      mov eax, edi
      mov edx, 558694933
      sar edi, 31
      imul edx
      mov eax, edx
      sar eax, 4
      sub eax, edi
      .L2:
      rep ret
    Oops…
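The "You expect" listing on slide 12 corresponds to a single unsigned comparison. A minimal sketch of that reduction follows; `doDivisionBranchReduced` is a hypothetical name introduced here for illustration, and the sketch assumes a 32-bit `int`.

```cpp
// Branch-and-divide form from the slide.
int doDivisionBranch(int n)
{
    if (n >= 0 && n <= 123)
    {
        return n / 123;  // 0 for n in [0, 122], 1 for n == 123
    }
    return 1;
}

// Hypothetical reduced form: the result is 0 for n in [0, 122] and 1 otherwise.
int doDivisionBranchReduced(int n)
{
    // Converting to unsigned maps negative n to large values, so one
    // "above 122" test covers both n < 0 and n > 122.
    return static_cast<unsigned>(n) > 122u ? 1 : 0;
}
```

The unsigned conversion plays the same role as the `cmp edi, 122` / `seta al` pair in the expected listing.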
  • 13. Ranges continued
    • Division by a constant can always be replaced by a multiplication and a shift right
    • For more information check out: Hacker’s Delight by Henry S. Warren
      mov eax, edi
      mov edx, 558694933
      sar edi, 31
      imul edx
      mov eax, edx
      sar eax, 4
      sub eax, edi
    • However, the number above is for any 32-bit value being divided by 123
    • Assuming we have just a narrow range of values, we could calculate a much smaller number
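To make the multiply-and-shift idea from slide 13 concrete, here is a small sketch restricted to non-negative inputs. The function name `divideBy123` is an assumption made for this example. The constant is the one shown in the listing (558694933, which is 2^36 / 123 rounded up), and the shift by 36 combines taking the high 32 bits of the product with the `sar ..., 4` step.

```cpp
#include <cstdint>

// Divide a non-negative 32-bit value by 123 with a multiply and a shift.
// Assumption: n >= 0. The sign-correction step of the compiler's sequence
// is left out of this sketch.
std::int32_t divideBy123(std::int32_t n)
{
    // 64-bit product so that the high bits are not lost.
    const std::int64_t product = static_cast<std::int64_t>(n) * 558694933LL;
    // High 32 bits (>> 32) followed by the extra shift of 4.
    return static_cast<std::int32_t>(product >> 36);
}
```

In gcc's output the `sar edi, 31` and `sub eax, edi` instructions additionally correct the result for negative n, which this sketch does not attempt.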
  • 14. Ranges
    C++:
      if (n >= 0 && n <= 40)
      {
          return n / 10;
      }
    gcc 6.3 -O3:
      mov eax, edi
      mov edx, 1717986919
      sar edi, 31
      imul edx
      mov eax, edx
      sar eax, 2
      sub eax, edi
    • Assuming n in [0, 40], we can use the magic-number multiplier 26 and shift right by 8.
    • This gives the compiler more potential for optimization with surrounding code, less register usage, more vectorization opportunities, etc.
    • However, our experiments seem to show that the compilers we have tried are not doing this.
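The narrow-range constant from slide 14 can be checked directly. The sketch below is an illustration only; `divideBy10Narrow` is a hypothetical name, and the function is only meant to be called with n in [0, 40] as the slide assumes.

```cpp
#include <cstdint>

// Divide by 10 using the small multiplier 26 and a shift by 8.
// Assumption: 0 <= n <= 40. 26/256 slightly over-estimates 1/10, and the
// accumulated error stays too small to change the result in this range.
std::uint32_t divideBy10Narrow(std::uint32_t n)
{
    return (n * 26u) >> 8;
}
```

Outside the assumed range the approximation eventually breaks: for n = 69 the expression yields 7 while 69 / 10 is 6, which is why the full 32-bit range needs the much larger constant.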
  • 15. Remember SIMD? • Vector registers to perform multiple operations in one instruction • Vector registers are used even for scalar operations Even wider registers on modern micro-architectures/IS: ymm(256bit)/zmm(512bit)
  • 16. SIMD support • SSE/SSE2/SSE3/SSE4/AVX/AVX2/AVX512 • And friends… • Hardware support? • E.g. SSE2 came out in 2001 – pretty safe to assume that target processor supports it
  • 17. SSE refresher
    C++:
      struct Vec4
      {
          float x, y, z, w;
      };
      Vec4 operator+(const Vec4& v1, const Vec4& v2)
      {
          return Vec4{v1.x+v2.x, v1.y+v2.y, v1.z+v2.z, v1.w+v2.w};
      }
    Vectorized code (SIMD):
      movups xmm0, XMMWORD PTR [rsi]
      movups xmm1, XMMWORD PTR [rdi]
      addps xmm1, xmm0
      movaps XMMWORD PTR [rsp-24], xmm1
    Scalar code (No SIMD):
      movss xmm0, DWORD PTR [rsi]
      movss xmm1, DWORD PTR [rdi]
      addss xmm1, xmm0
      movss DWORD PTR [r0], xmm1
      movss xmm0, DWORD PTR [rsi]
      movss xmm1, DWORD PTR [rdi]
      addss xmm1, xmm0
      movss DWORD PTR [r0], xmm1
      movss xmm0, DWORD PTR [rsi]
      movss xmm1, DWORD PTR [rdi]
      addss xmm1, xmm0
      movss DWORD PTR [r0], xmm1
      movss xmm0, DWORD PTR [rsi]
      movss xmm1, DWORD PTR [rdi]
      addss xmm1, xmm0
      movss DWORD PTR [r0], xmm1
  • 18. Reduction challenge
    C++:
      // int arr[N];
      // Randomly generate array arr
      static int sum = 0;
      void doSomething(int a)
      {
          sum += a;
      }
      for (int i = 0; i < N; ++i)
      {
          doSomething(arr[i]);
      }
    gcc 6.3 -O3:
      .L1:
      paddd xmm0, XMMWORD PTR [rbx]
      add rbx, 16
      cmp r13, rbx
      jne .L1
      movdqa xmm1, xmm0
      psrldq xmm1, 8
      paddd xmm0, xmm1
      movdqa xmm1, xmm0
      psrldq xmm1, 4
      paddd xmm0, xmm1
      movd eax, xmm0
      add eax, edx
      mov DWORD PTR sum[rip], eax
    Packed addition of ints! May increase overall code size, but the size of the critical loop is still small…
    Addition of partial sums across the register: (s1, s2, s3, s4) → (s1+s3, s2+s4) → s1+s2+s3+s4
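The vectorized reduction on slide 18 can be modelled in scalar C++ to show what the instructions compute. This is a sketch under assumptions, not the compiler's algorithm verbatim: `sumWithFourLanes` is a hypothetical name, the four array slots stand in for the four 32-bit lanes of an xmm register, and the inputs are assumed small enough that the sums do not overflow.

```cpp
#include <cstddef>

// Scalar model of the vectorized loop: four running partial sums, one per
// lane, combined at the end in two folding steps.
int sumWithFourLanes(const int* arr, std::size_t count)
{
    int lane[4] = {0, 0, 0, 0};
    std::size_t i = 0;

    // Main loop: models one packed add (paddd) per four elements.
    for (; i + 4 <= count; i += 4)
    {
        for (int j = 0; j < 4; ++j)
        {
            lane[j] += arr[i + j];
        }
    }

    // First fold (psrldq by 8 bytes, then paddd): s1+s3 and s2+s4.
    const int low = lane[0] + lane[2];
    const int high = lane[1] + lane[3];

    // Second fold (psrldq by 4 bytes, then paddd): s1+s2+s3+s4.
    int total = low + high;

    // Leftover elements when count is not a multiple of four.
    for (; i < count; ++i)
    {
        total += arr[i];
    }
    return total;
}
```

Reordering the additions this way is valid for integers because integer addition is associative, which is the property the later floating-point slides rely on losing.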
  • 19. MSVC output
    • Vectorized
    • 2x loop unrolled
    MSVC15 /O2:
      .L1:
      movups xmm0,xmmword ptr [edi+eax*4]
      paddd xmm1,xmm0
      movups xmm0,xmmword ptr [edi+eax*4+10h]
      add eax,8
      paddd xmm2,xmm0
      cmp eax,100000h
      jl .L1
      paddd xmm2,xmm1
      lea eax,[esp+10h]
      movaps xmm0,xmm2
      psrldq xmm0,8
      paddd xmm2,xmm0
      movaps xmm0,xmm2
      psrldq xmm0,4
      paddd xmm2,xmm0
      push eax
      movd esi,xmm2
  • 20. Reduction challenge • What can possibly confuse the compiler? • E.g. we changed the type to floating-point C++: static float sum = 0; void doSomething(float a) { sum += a; } for (int i = 0; i < N; ++i) { doSomething(arr[i]); } gcc 6.3 -O3: .L4: addss xmm0, DWORD PTR [rbp] add rbp, 4 cmp rbp, OFFSET FLAT:arr+4194304 jne .L4 pop rbx pop rbp movss DWORD PTR sum[rip], xmm0 pop r12 ret • Scalar addition! • The problem with this one is that FP math is not associative
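The non-associativity claim on this slide can be checked with three values chosen so that one grouping loses the small term. This is a minimal sketch; the function names `sumLeft` and `sumRight` are illustrative and not from the talk, and it assumes the code is compiled without `-ffast-math`.

```cpp
#include <cassert>

// Adds three floats grouped as (a + b) + c.
float sumLeft(float a, float b, float c) { return (a + b) + c; }

// Adds the same three floats grouped as a + (b + c).
float sumRight(float a, float b, float c) { return a + (b + c); }

// Returns true when both groupings agree for the given inputs.
bool groupingAgrees(float a, float b, float c) {
    return sumLeft(a, b, c) == sumRight(a, b, c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f the left grouping cancels the large terms first and yields 1, while the right grouping absorbs the 1 into -1e20f and yields 0. A vectorized reduction changes the grouping in the same way, which is why the compiler keeps the scalar loop unless told that reordering is acceptable.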
  • 21. Reduction challenge C++: static float sum = 0; void doSomething(float a) { sum += a; } for (int i = 0; i < N; ++i) { doSomething(arr[i]); } gcc 6.3 -O3 -ffast-math: .L4: addps xmm1, XMMWORD PTR [rbp+0] add rbp, 16 cmp r12, rbp jne .L4 movaps xmm0, xmm1 pop rbx movhlps xmm0, xmm1 pop rbp addps xmm1, xmm0 pop r12 movaps xmm0, xmm1 shufps xmm0, xmm1, 85 addps xmm1, xmm0 movaps xmm0, xmm1 addss xmm0, xmm2 movss DWORD PTR sum[rip], xmm0 ret
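The shape of the `-ffast-math` loop can be written out by hand in scalar C++: four independent partial sums, followed by a horizontal reduction. The sketch below is an illustration only; `sum4` is a hypothetical helper, and it assumes the element count is a multiple of 4 so that no tail loop is needed.

```cpp
#include <cassert>
#include <cstddef>

// Sums n floats with four independent accumulators, one per SIMD lane.
// Assumption: n is a multiple of 4 (no tail handling in this sketch).
float sum4(const float* arr, std::size_t n) {
    assert(n % 4 == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += arr[i];      // lane 0
        s1 += arr[i + 1];  // lane 1
        s2 += arr[i + 2];  // lane 2
        s3 += arr[i + 3];  // lane 3
    }
    // Horizontal reduction: fold the upper half onto the lower half, then
    // add the two remaining partial sums.
    return (s0 + s2) + (s1 + s3);
}
```

Because the additions are grouped differently from the original loop, the result can differ in the last bits for general floating-point data, which is exactly the permission `-ffast-math` grants.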
  • 22. Give compiler more context • Assume we know target CPU = Haswell microarchitecture? • Let the compiler know! C++: static float sum = 0; void doSomething(float a) { sum += a; } for (int i = 0; i < N; ++i) { doSomething(arr[i]); } gcc 6.3 -O3 -ffast-math -march=haswell: vxorps xmm0, xmm0, xmm0 .L4: vaddps ymm0, ymm0, YMMWORD PTR [r13] add r13, 32 cmp r14, r13 jne .L4 vhaddps ymm0, ymm0, ymm0 vhaddps ymm1, ymm0, ymm0 vperm2f128 ymm0, ymm1, ymm1, 1 vaddps ymm0, ymm0, ymm1 vaddss xmm0, xmm0, xmm2 vmovss DWORD PTR sum[rip], xmm0 • Neat!
  • 23. Reduction challenge C++: static float sum = 0; extern void doSomething(float a); for (int i = 0; i < N; ++i) { doSomething(arr[i]); } • Breaks all optimizations! • You are not just paying for an extra function call, you also lose vectorization! • And other potential optimizations! • Link-time optimization may help • If not building a DLL
  • 24. RISC vs CISC • RISC: Reduced Instruction Set Computing • CISC: Complex Instruction Set Computing • RISC: • Simple addressing modes • Uniform instruction format • Fewer data types in hardware • => Larger semantic gap between ISA and higher-level program • => But simplifies hardware a lot
  • 25. Microops • On x86 we have a visible CISC ISA, but the underlying CPU actually works in RISC fashion… • Instructions are decoded into micro-operations: add eax, [s] => load & add (2 microops); push eax => sub esp, 4 & mov [esp], eax (2 microops)
  • 26. Out-of-order execution • Front-end: fetch and decode instructions; in-order • Out-of-order part: execute microops; out-of-order* • *Serialized again later to conform to memory consistency model requirements
  • 27. Data dependencies C++: for (int i = 2; i < N; ++i) { arr[i] = arr[i-2] + arr[i-1]; } • This kind of loop can actually perform badly • Each iteration depends on the previous one • Data dependencies kill out-of-order execution!
  • 28. False dependencies • But we are talking about *real* dependencies • Remember Register renaming? movaps xmm1, XMMWORD PTR .LC0[rip] xor eax, eax .L4: add rax, 16 movaps xmm0, XMMWORD PTR a[rax-16] mulps xmm0, xmm1 movaps XMMWORD PTR b[rax-16], xmm0 cmp rax, 4194304 jne .L4 C++: for (int i = 0; i < N; ++i) { b[i] = a[i] * 0.5f; }
  • 29. False dependencies Dot = a1.x * a2.x + a1.y * a2.y + a1.z * a2.z + a1.w * a2.w movss xmm0, DWORD PTR [rdi] movss xmm1, DWORD PTR [rdi+4] mulss xmm0, DWORD PTR [rsi] mulss xmm1, DWORD PTR [rsi+4] movss xmm2, DWORD PTR [rdi+12] mulss xmm2, DWORD PTR [rsi+12] addss xmm0, xmm1 movss xmm1, DWORD PTR [rdi+8] mulss xmm1, DWORD PTR [rsi+8] addss xmm1, xmm2 addss xmm0, xmm1 • False dependencies! • Write-after-write are false dependencies • Write-after-read are false dependencies • Register renaming eliminates both
  • 30. Register renaming Physical register file • Register renaming maps architectural registers to physical ones • In B2 xmm1 will map to different physical location than in B1 movss xmm0, DWORD PTR [rdi] movss xmm1, DWORD PTR [rdi+4] mulss xmm0, DWORD PTR [rsi] mulss xmm1, DWORD PTR [rsi+4] movss xmm2, DWORD PTR [rdi+12] mulss xmm2, DWORD PTR [rsi+12] addss xmm0, xmm1 movss xmm1, DWORD PTR [rdi+8] mulss xmm1, DWORD PTR [rsi+8] addss xmm1, xmm2 addss xmm0, xmm1 B2 B1
  • 31. Small recap • Recap: • Compiler won’t solve all problems for you • Compiler optimizations may unexpectedly break, so be careful! • Instructions are executed out of order, blocked by real dependencies only • Register renaming removes false dependencies • Code that looks simple and elegant is not necessarily fast • Let’s examine some code examples and see how this affects optimization
  • 32. Frame dependencies • Frame = one iteration of a loop • Discrete frame = frame NOT dependent on any other frame • Iterative frame = frame depends on another (previous) frame Discrete: for( int i = …. ) { result[i] = i* 3 + 6; } Iterative: for( int i = …. ) { result[i] = i* 3 + result[i-1]; }
  • 33. What blocks vectorization? • There are various reasons why code may not be vectorized by the compiler. • Compilers continue to improve, but don’t assume your code is vectorized • Check the assembly! Two important causes of vectorization failure: • Frame dependency: a frame depends on data from other (previous) frames. • Data collision: the target (e.g. an array index) is not unique.
  • 34. AVX512 (Intel Xeon Phi, Knights Landing) • Special instructions (compress / expand) give compilers a chance to process several loop iterations (typically 4 or 8) in a single iteration. This gives the compiler a better chance to vectorize. • Conflict detection (AVX512-CD) can be used when uniqueness of an array target cannot be determined.
  • 35. AVX512: conflict detection for (int i = … ) { int index = calculateIndex(i); array[index] = ….; } • Index may NOT be unique, for example in hashing algorithms • When we are writing to an array in a loop: • construct a SIMD vector of indices in the array, for every 4 or 8 iterations. • use vpconflictd to create a bitmask of non-unique indices. • the bitmask can then be used to vectorize unique elements and process separate non-unique targets after. • Specifically designed for compilers to vectorize loops.
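The conflict-detection steps above can be imitated in plain scalar C++ to show the control flow. This is a sketch under stated assumptions, not AVX-512 code: it uses no intrinsics, it reduces the per-lane comparison result to a single "has an earlier duplicate" bit per lane, and the function name `histogramWithConflictCheck` and the histogram workload are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of a histogram update driven by a conflict mask.
// Indices are handled in blocks of 4 "lanes". Assumptions: every index is a
// valid position in hist, and indices.size() is a multiple of 4.
void histogramWithConflictCheck(const std::vector<std::size_t>& indices,
                                std::vector<int>& hist) {
    constexpr std::size_t kLanes = 4;
    assert(indices.size() % kLanes == 0);
    for (std::size_t base = 0; base < indices.size(); base += kLanes) {
        // Bit j is set when lane j repeats the index of an earlier lane.
        unsigned conflictMask = 0;
        for (std::size_t j = 1; j < kLanes; ++j) {
            for (std::size_t k = 0; k < j; ++k) {
                if (indices[base + j] == indices[base + k]) {
                    conflictMask |= 1u << j;
                }
            }
        }
        // Lanes with unique targets: these could be updated as one vector.
        for (std::size_t j = 0; j < kLanes; ++j) {
            if (!(conflictMask & (1u << j))) hist[indices[base + j]] += 1;
        }
        // Conflicting lanes: handled one at a time afterwards.
        for (std::size_t j = 0; j < kLanes; ++j) {
            if (conflictMask & (1u << j)) hist[indices[base + j]] += 1;
        }
    }
}
```

The result matches a plain sequential loop because the conflicting lanes are replayed after the unique ones, so no update is lost when two lanes target the same element.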
  • 36. Dependencies We shall… • Look at how dependencies between frames can ‘break’ your optimizations. • See what we can do to get around these problems (when you don’t have AVX512!). • Look at just one type of optimization, evaluating polynomials, to demonstrate. • Finally, cover some useful mathematical rules for breaking down difficult formulas, and how to use what we know about vectorization.
  • 37. Discrete or Iterative: x^2 • Summing squares. Discrete: int xSqrSum(const int n) { int result = 0; for (int i = 1; i < n; ++i) { result += i*i; } return result; } • Each iteration is independent and it’s easy to vectorize. Iterative: int xSqrSum(const int n) { int result = 0; int a = 1; int b = 0; for (int i = 1; i < n; ++i) { b += a; a += 2; result += b; } return result; } • Dependencies block vectorization: dependencies between frames and dependencies inside a frame • Yes, the sum of the first n squares could be calculated as n(n+1)(2n+1) / 6, but the compiler doesn’t know this!
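To make the comparison concrete, the sketch below puts the discrete loop, the iterative loop and the closed form side by side. The function names are illustrative. Note that both loops on the slide stop at i = n - 1, so the matching closed form is (n-1)n(2n-1)/6, which is the slide's n(n+1)(2n+1)/6 with n replaced by n - 1.

```cpp
#include <cassert>

// Discrete form: each iteration depends only on i.
long long sqrSumDiscrete(int n) {
    long long result = 0;
    for (int i = 1; i < n; ++i) result += static_cast<long long>(i) * i;
    return result;
}

// Iterative form: b walks through the squares using additions only,
// because (i+1)^2 - i^2 = 2i + 1.
long long sqrSumIterative(int n) {
    long long result = 0;
    long long a = 1;  // next odd number to add
    long long b = 0;  // current square
    for (int i = 1; i < n; ++i) {
        b += a;
        a += 2;
        result += b;
    }
    return result;
}

// Closed form for 1^2 + 2^2 + ... + (n-1)^2.
long long sqrSumClosed(int n) {
    assert(n >= 1);
    return static_cast<long long>(n - 1) * n * (2LL * n - 1) / 6;
}
```

All three agree for any n >= 1 that does not overflow; only the first form has iterations that are independent of one another.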
  • 38. Assembler Discrete (vectorized): // Too much code to list here … main loop is…. .L30: movdqa xmm2, xmm3 movdqa xmm1, xmm3 add edx, 1 pmuludq xmm2, xmm3 psrlq xmm1, 32 pshufd xmm2, xmm2, 8 pmuludq xmm1, xmm1 pshufd xmm1, xmm1, 8 cmp eax, edx paddd xmm3, xmm4 punpckldq xmm2, xmm1 paddd xmm0, xmm2 ja .L30 Iterative: xSqrSum (int): cmp edi, 1 jle .L24 lea esi, [rdi-1+rdi] xor ecx, ecx mov edx, 1 xor eax, eax .L23: add ecx, edx add edx, 2 add eax, ecx cmp esi, edx jne .L23 rep ret .L24: xor eax, eax ret • Two multiplies? If we know x^2 lies in a smaller range we could do better. • Another thing the compiler could use range information for.
  • 39. Fast code, big code • Often vectorized code has a bigger footprint, but is much faster • Structure: ‘Setup’ code (initial values, etc.), main loop (processes blocks of typically 4 or 8 elements at once), tail code (processes remaining elements) • Tail code can be reduced by using only exact multiples of the block size for the loop limit. • We are really concerned with the main loop. • Beware: if a loop calls an external method that cannot be ‘seen’ (e.g. in a DLL) then vectorization can fail because the method may take discrete single loop values. • Same problem if you do NOT use link-time optimization (whole program optimization).
  • 40. Strip-mining and vectorization • We can break the range of values in a loop into multiple sub-ranges. • To calculate squares from 1 to 100 we could split this into 4 ranges, and process each strip in a different channel of a SIMD register, in parallel. [Figure: the range 1 to 100 is split into four strips (1 to 25, 26 to 50, 51 to 75, 76 to 100); x^2 is computed for each strip in its own lane of a SIMD pack]
  • 41. Breaking up ranges • Firstly, we create a function to take a range, rather than 1..N int xSqrIterativeRange( const int lo, const int hi ) { int result = 0; int b = (lo - 1)*(lo - 1); int a = lo + lo - 1; for (int i = lo; i < hi; ++i) { b += a; a += 2; result += b; } return result; } • Initial loop values are calculated from lo, explained later
  • 42. Parallel ranges .. 4 channels int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range) { int result = 0; int b0 = ( lo0 - 1 ) * ( lo0 - 1 ); int a0 = lo0 + lo0 - 1; int b1 = ( lo1 - 1 ) * ( lo1 - 1 ); int a1 = lo1 + lo1 - 1; …… // b2, a2, b3, a3 for (int i = 0; i < range; ++i) { b0 += a0; a0 += 2; b1 += a1; a1 += 2; ….. // b2,a2, b3,a3 … result += b0 + b1 + b2 + b3; } return result; }
  • 43. Ranges NOT vectorized • Gcc failed to vectorize this. Why? result += b0 + b1 + b2 + b3; • Answer: • If we replace the sum in the loop with four separate partial sums, it does vectorize: result0 += b0; result1 += b1; result2 += b2; result3 += b3; • Or… we can optimize this by hand, using SIMD intrinsics.
  • 44. Vectorizable code int xSqrIterativeRange4( const int lo0, const int lo1, const int lo2, const int lo3, const int range) { int result0 = 0; int result1 = 0; …. // result2, result3 int b0 = ( lo0 - 1 ) * ( lo0 - 1 ); int a0 = lo0 + lo0 - 1; int b1 = ( lo1 - 1 ) * ( lo1 - 1 ); int a1 = lo1 + lo1 - 1; …… // b2, a2, b3, a3 for (int i = 0; i < range; ++i) { b0 += a0; a0 += 2; b1 += a1; a1 += 2; ….. // b2,a2, b3,a3 … result0 += b0; result1 += b1; …. } return result0 + result1 + result2 + result3; }
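For reference, here is a complete version of the four-channel idea with nothing elided. It is a sketch rather than the authors' exact code: the name `sqrSumStrips4` is hypothetical, and each channel accumulates its running square `b` into its own partial sum so that the total is the sum of squares over all four strips.

```cpp
#include <cassert>

// Sum of squares over four strips [lo_k, lo_k + range), one strip per channel.
// Each channel keeps its own state (a, b) and its own partial sum, so the
// four channels never depend on one another inside the loop.
int sqrSumStrips4(int lo0, int lo1, int lo2, int lo3, int range) {
    assert(range >= 0);
    int result0 = 0, result1 = 0, result2 = 0, result3 = 0;
    // b starts at (lo - 1)^2 and a at 2*lo - 1, so the first b += a gives lo^2.
    int b0 = (lo0 - 1) * (lo0 - 1), a0 = 2 * lo0 - 1;
    int b1 = (lo1 - 1) * (lo1 - 1), a1 = 2 * lo1 - 1;
    int b2 = (lo2 - 1) * (lo2 - 1), a2 = 2 * lo2 - 1;
    int b3 = (lo3 - 1) * (lo3 - 1), a3 = 2 * lo3 - 1;
    for (int i = 0; i < range; ++i) {
        b0 += a0; a0 += 2; result0 += b0;
        b1 += a1; a1 += 2; result1 += b1;
        b2 += a2; a2 += 2; result2 += b2;
        b3 += a3; a3 += 2; result3 += b3;
    }
    return result0 + result1 + result2 + result3;
}
```

Calling `sqrSumStrips4(1, 26, 51, 76, 25)` covers 1 to 100 in four strips of 25 values each. Whether a given compiler vectorizes this form still has to be confirmed by reading the generated assembly.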
  • 45. Quick summary • Optimization of discrete loops can add dependencies which are hard to vectorize • Can break up ranges into sub-ranges and re-code for parallelism, i.e. strip-mining. • May need to hand-craft SIMD with intrinsics. • Often we can find clever iterative ways of ‘optimizing’ loop functions to require fewer mathematical operations …
  • 46. Iterative - floating point • Big problem: cumulative errors. • The compiler may not vectorize because re-arranging terms may not give the same result due to rounding etc. • Some classical optimizations are not always worth the problems. E.g. a loop calculating sin and cos iteratively can be done with a couple of additions per iteration, but cumulative errors may be problematic.
  • 47. Example… Polynomials • Can be evaluated iteratively using only additions per frame. • Number of additions = maximum power + 1 • But which is fastest? We can’t assume that fewer math operations means faster code. • Only looking at integers here. The same principles apply to floating point but there are other problems too. E.g. 5x^3 + 3x^2 + x − 4 Iterative: 4 additions, each dependent on the last. Discrete: several multiplications and adds.
  • 48. Polynomial differencing • This is an old idea: Babbage’s Difference Engine etc. • Let P(x) be a polynomial of order n; then P(x+1) − P(x) is of order n−1. E.g. P(x) = x^2, P(x+1) − P(x) = 2x + 1 (1). Let Q(x) = 2x + 1, Q(x+1) − Q(x) = 2 (2) • Keep repeating the process, recording each new polynomial, until a constant is reached. • In this example, we now have the polynomials (x^2, 2x + 1, 2) • We begin by calculating each of these terms at our initial value; then every time we iterate the loop we add each term to the previous one, and we get the next value of the polynomial.
  • 49. Polynomial differences x = 2, 3, 4, 5, …; x^2 = 4, 9, 16, 25, …; 2x + 1 = 5, 7, 9, …; 2 = 2, 2, … void Loop() { int a = 4; int b = 5; int c = 2; while(….) { // a = x^2 here a += b; b += c; } }
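The differencing procedure can be written generically for any integer polynomial. The sketch below is one possible arrangement, with hypothetical names (`evalPoly`, `initialDifferences`, `stepDifferences`): it evaluates the polynomial directly at the first few points, differences those values in place to get the starting column, and then advances with additions only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluates p(x) = coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ... (Horner's rule).
long long evalPoly(const std::vector<long long>& coeffs, long long x) {
    long long r = 0;
    for (std::size_t k = coeffs.size(); k-- > 0;) r = r * x + coeffs[k];
    return r;
}

// Returns the starting column {p(x0), first difference, second difference, ...}.
std::vector<long long> initialDifferences(const std::vector<long long>& coeffs,
                                          long long x0) {
    assert(!coeffs.empty());
    std::vector<long long> d(coeffs.size());
    for (std::size_t k = 0; k < d.size(); ++k) {
        d[k] = evalPoly(coeffs, x0 + static_cast<long long>(k));
    }
    // Pass j turns entries j..end into j-th differences, working from the
    // back so that each subtraction still sees the previous pass's values.
    for (std::size_t j = 1; j < d.size(); ++j) {
        for (std::size_t k = d.size() - 1; k >= j; --k) {
            d[k] -= d[k - 1];
        }
    }
    return d;
}

// Advances the column from x to x + 1 using additions only; d[0] becomes p(x + 1).
void stepDifferences(std::vector<long long>& d) {
    for (std::size_t k = 0; k + 1 < d.size(); ++k) d[k] += d[k + 1];
}
```

For x^2 starting at x = 2 this yields the column {4, 5, 2}, matching the values a, b, c in the slide's loop.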
  • 50. (5x^3 + 3x^2 + x − 4) • Discrete: (compiler vectorizes the main loop) int polynomial(int n) { int sum = 0; for( int i = 1; i < n; ++i ) { int result = 5 * i*i*i + 3 * i*i + i - 4; sum += result; } return sum; } • The compiler reduces the number of multiplies with shifts and adds, etc.
  • 51. (5x^3 + 3x^2 + x − 4) • Iterative: (compiler does not vectorize this!) int polynomial(int n) { int sum = 0; int a = -4 ; // p(x) = 5x^3 + 3x^2 + x - 4, at x = 0 int b = 9 ; // q(x) = p(x+1) - p(x) = 15x^2 + 21x + 9, at x = 0 int c = 36; // r(x) = q(x+1) - q(x) = 30x + 36, at x = 0 int d = 30; // r(x+1) - r(x) = 30 for( int i = 1; i < n; ++i ) { a += b; b += c; c += d; sum += a; } return sum; } • Dependencies!!!
  • 52. Modular arithmetic • Used to prevent integer calculations going out of range (outside the native bit-length of the CPU) • We limit our discussion here to briefly showing a simple technique to solve some problems. • Look at an example to demonstrate. • Compilers are not using these techniques, so it’s up to you!
  • 53. Continued… • What kind of problem? • The result is ‘in range’ but partial results are too big. • Simple example: r = a * b / c • The result of a * b may be more than one machine word • Instruction sets typically have a multiply that produces two words, and a divide that takes a two-word numerator as input. On x86 we multiply two 32-bit values to get a 64-bit result in the register pair edx:eax. If we divide it, the idiv instruction takes both registers as a 64-bit numerator.
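In C++ the same effect is usually obtained by widening before the multiply and letting the compiler pick the instructions. A minimal sketch, assuming unsigned 32-bit inputs and a quotient that fits in 32 bits; `mulDiv` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Computes a * b / c without losing the high bits of the product.
// Assumptions: c is non-zero and the final quotient fits in 32 bits.
std::uint32_t mulDiv(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    assert(c != 0);
    // Widen first, so the multiplication is carried out in 64 bits.
    const std::uint64_t wide = static_cast<std::uint64_t>(a) * b;
    return static_cast<std::uint32_t>(wide / c);
}
```

For example, `mulDiv(4000000000u, 3u, 6u)` has an intermediate product of 12,000,000,000, which does not fit in 32 bits, while the final result 2,000,000,000 does.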
  • 54. Modular Exponentiation So, let’s look at a more interesting problem that is not solved so easily: r = (a^b) mod c. Remember: we are dealing with positive integers here. We know that r must be in the range [0, c-1]. Raising a to the power b can result in huge numbers, far too big for the machine word length.
  • 55. r = a^b mod c • A few maths things we can do, but we want a general solution. • We reduce the power b. • We create an algorithm which iterates through a loop to get the result.
  • 56. Reducing the power φ(c) = Euler totient function: φ(c) = c * ∏(1 − 1/p) = c * (1 − 1/p0) * (1 − 1/p1) … where each p is a distinct prime factor of c. Example: φ(100) = 100 * (1 − 1/2) * (1 − 1/5) = 40, since 100 = 2^2 · 5^2 (prime factors of 100 being 2 and 5). If r = a^b mod c, then r = a^(b mod φ(c)) mod c (valid when a and c have no common factor, by Euler’s theorem).
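The product formula translates directly into a trial-division routine. This is a straightforward sketch rather than anything from the talk (which suggests a precomputed table instead); `totient` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// Euler totient via trial division: start from c and multiply by (1 - 1/p)
// once for every distinct prime factor p of c.
std::uint32_t totient(std::uint32_t c) {
    assert(c > 0);
    std::uint32_t result = c;
    for (std::uint32_t p = 2; static_cast<std::uint64_t>(p) * p <= c; ++p) {
        if (c % p == 0) {
            while (c % p == 0) c /= p;  // strip every copy of this prime
            result -= result / p;       // result *= (1 - 1/p), exactly
        }
    }
    if (c > 1) result -= result / c;    // one prime factor above sqrt remains
    return result;
}
```

`totient(100)` gives 40, as in the example above. Trial division is adequate for occasional calls on 32-bit values; a lookup table avoids the cost entirely when the set of moduli is known in advance.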
  • 57. Example 567^123 mod 100 = 567^(123 mod 40) mod 100 = 567^3 mod 100 // since φ(100) = 40 • A smaller power makes the modular exponentiation faster. • Keep a table of precomputed totient functions. • 567^123 has 1126 bits!
  • 58. Useful relationships Suppose we have three integers: a, b, c. Let: r = a mod c, s = b mod c. Then: (a * b) mod c = (r * s) mod c (1) and (a + b) mod c = (r + s) mod c (2). Simple, but really useful to break formulas into smaller parts. Allows us to find solutions which consist of loops that iteratively evaluate a result, without going out of range.
  • 59. Exponentiation continued… To evaluate r = a^b mod c: first reduce b using the totient function => less computation. Then we can break up a^p using the bits of p (p = b mod φ(c)), e.g. for p = 21: a^21 mod c = (a^16 · a^4 · a^1) mod c. Create a loop to calculate each binary power of ‘a’ mod c by squaring the previous one: a^(2n) mod c = (a^n mod c)^2 mod c. Combining these formulae we can get some code…
  • 60. expMod int expMod( int a, int b, int c) /* for 32 bit, b < 2^31-1 because of the mask test */ { int mask = 1; int r = 1; while(1) { if (b & mask) { r *= a; r %= c; } mask += mask; if (mask > b) break; a *= a; a %= c; } return r; } • Combine for set bits… (recall, a^21 mod c = (a^16 · a^4 · a^1) mod c) • Square for the next bit: a^(2n) mod c = (a^n mod c)^2 mod c
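A runnable variant of the square-and-multiply loop is sketched below. It differs from the slide in two labeled ways: it shifts b right instead of doubling a mask, and it uses 64-bit intermediates so that every product of two values below c fits without overflow. The name `expModWide` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Computes (a^b) mod c by square-and-multiply, scanning b from its least
// significant bit. Products are formed in 64 bits, so any 32-bit c is safe.
std::uint32_t expModWide(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    assert(c > 0);
    std::uint64_t base = a % c;  // current binary power a^(2^i) mod c
    std::uint64_t r = 1 % c;     // running result (0 when c == 1)
    while (b != 0) {
        if (b & 1u) r = (r * base) % c;  // combine for set bits
        base = (base * base) % c;        // square for the next bit
        b >>= 1;
    }
    return static_cast<std::uint32_t>(r);
}
```

As a check against the totient example, `expModWide(567, 123, 100)` and `expModWide(567, 3, 100)` both give 63.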
  • 61. One last optimization uint expMod( uint a, uint b, uint c ) { uint mask = 1; uint r = 1; while(1) { if (b & mask) { r *= a; if ( r > MAXROOT ) r %= c; } mask += mask; if (mask > b) break; a *= a; if ( a > MAXROOT ) a %= c; } if (r >= c) r %= c; return r; } • We only need to take the mod when the next multiply would cause an overflow • This means that the final result may need a mod to bring it into range. • MAXROOT = the square root of the biggest (unsigned) integer: 32-bit = 0xFFFF, 64-bit = 0xFFFFFFFF
  • 62. Same old problems • Dependencies both inside loop frames and between frames. • Difficult to break into sub-ranges: we can’t determine the initial state for each sub-range. • BUT!... There are many other ways to improve the performance of this kind of algorithm, beyond the scope of this discussion.
  • 63. Another code sample: b mod c • Another example of using modular arithmetic • No division: uses just additive operations and compares • Any size numerator • Examines bits of ‘b’ in a right-to-left way (least significant to most) • Again, the code is very suboptimal, but just for demonstration.
  • 64. b mod c int mod( int b, int c) /* b % c without division for 32 bit, b range [1, 2^31-1] */ { int mask = 1; int r = 0; int a = 1; while(1) { if (b & mask) { r += a; if( r >= c ) r -= c; } mask += mask; if (mask > b) break; a += a; if( a >= c ) a -= c; } return r; } • Inside the if: add 2^i mod c to the total • After the mask test: calculate 2^i mod c for the next iteration
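The same idea is sketched below in a form that can be compiled and compared against the built-in `%` operator. It is an illustration under stated assumptions: unsigned 32-bit operands, c at most 2^31 so that the doubling and the addition cannot wrap, and a right shift of b in place of the slide's mask. The name `modNoDiv` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Computes b mod c using only additions, subtractions, compares and shifts.
// Assumption: 0 < c <= 2^31, so a + a and r + a cannot wrap around.
std::uint32_t modNoDiv(std::uint32_t b, std::uint32_t c) {
    assert(c > 0 && c <= 0x80000000u);
    std::uint32_t r = 0;                 // running total, always below c
    std::uint32_t a = (c == 1) ? 0 : 1;  // 2^i mod c, starting with i = 0
    while (b != 0) {
        if (b & 1u) {                    // bit i of b is set:
            r += a;                      // add 2^i mod c to the total
            if (r >= c) r -= c;
        }
        a += a;                          // 2^(i+1) mod c for the next bit
        if (a >= c) a -= c;
        b >>= 1;
    }
    return r;
}
```

Each step relies on relationship (2) from the "Useful relationships" slide: the sum of the per-bit remainders, reduced as it goes, equals the remainder of the whole number.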
  • 65. Recap • Remember instruction-level parallelism • Vectorization for SIMD is easily broken • Range information is not fully used by the compiler • Mathematical tricks to optimize may not be as effective as they appear • Modular arithmetic can be used to break up difficult computations

Editor's notes

  1. Why does it matter for us? Well…
  2. Do you remember SIMD? Very heavily used in games…
  3. Do you remember SIMD? Very heavily used in games…
  4. Do you remember SIMD? Very heavily used in games…
  5. *Bullet points of what is CISC and RISC*
  6. *Bullet points of what is CISC and RISC*
  7. *Bullet points of what is CISC and RISC*
  8. *Bullet points of what is CISC and RISC*
  9. *Bullet points of what is CISC and RISC*
  10. *Bullet points of what is CISC and RISC*
  11. *Bullet points of what is CISC and RISC*
  12. Let’s take a look at the block diagram of the Sandy Bridge microarchitecture… It will be important when we discuss algorithmic stuff later in the talk…