Cranking Floating Point
Performance Up To 11 
             Noel Llopis
            Snappy Touch

    http://twitter.com/snappytouch
       noel@snappytouch.com
     http://gamesfromwithin.com
void* p = &s_particles2[0];
asm volatile (
    "fldmias %1, {s0}         \n\t"
    "fldmias %2, {s1}         \n\t"
    "mov r1, %0               \n\t"
    "mov r2, %0               \n\t"
    "mov r3, %3               \n\t"
    "0:                       \n\t"
    "fldmias r1!, {s8-s13}    \n\t"
    "fldmias r1!, {s16-s21}   \n\t"
    "fmacs s8, s16, s0        \n\t"
    "fmuls s16, s16, s1       \n\t"
    "fstmias r2!, {s8-s13}    \n\t"
    "fstmias r2!, {s16-s21}   \n\t"
    "subs r3, r3, #1          \n\t"
    "bne 0b                   \n\t"
    :
    : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
    : "r0", "r1", "r2", "r3", "cc", "memory"
);
“Don’t do that bit-twiddling thing”

“Optimize at the algorithm level”

Yes, but key to good performance is looking at your data and your target platform
Floating Point
Performance
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format
• Single precision: 32 bits
• Double precision: 64 bits
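
To make the single-precision layout concrete, here is a minimal C sketch that splits a 32-bit float into its IEEE 754 sign, exponent and mantissa fields. The names `FloatFields` and `float_fields` are made up for illustration, and the code assumes `float` is a 32-bit IEEE 754 type (true on the iPhone's ARM CPUs and on most desktop compilers).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE 754 single. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits, implicit leading 1 for normal numbers */
} FloatFields;

static FloatFields float_fields(float value)
{
    uint32_t bits;
    FloatFields f;

    /* Assumption: float is 32 bits wide (IEEE 754 single precision). */
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);  /* portable way to read the raw bits */

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, 1.0f comes out as sign 0, biased exponent 127 and an all-zero mantissa.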
Why floating point
     performance?

• Most games use floating point numbers for
  most of their calculations
• Positions, velocities, physics, etc, etc.
• Maybe not so much for regular apps
CPU

• 32-bit RISC ARM 11
• 400-535 MHz
• iPhone 2G/3G and iPod
  Touch 1st and 2nd gen
CPU (iPhone 3GS)


• Cortex-A8 600MHz
• More advanced
  architecture
CPU


• No floating point support
  in the ARM CPU!!!
How about integer
        math?

• No need to do any floating point
  operations
• Fully supported in the ARM processor
• But...
Integer Divide




There is no integer divide instruction
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a fixed-point library.
• Performs rational arithmetic with integer
  values at a reduced range/resolution.
• Not so great...
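
As a rough illustration of what such a fixed-point library does, here is a minimal sketch of a 16.16 format in C. The type `fixed16` and the helper names are hypothetical, not taken from any particular library. It shows the reduced range and resolution: 16 bits for the integer part, and steps of 1/65536 for the fraction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16.16 format: 16 integer bits, 16 fractional bits. */
typedef int32_t fixed16;
#define FIXED_ONE 65536  /* 1.0 in 16.16 */

static fixed16 fixed_from_int(int value)
{
    /* Only values in roughly [-32768, 32767] fit. */
    return (fixed16)(value * FIXED_ONE);
}

static fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    /* Widen to 64 bits so the intermediate product does not overflow.
       Right-shifting a negative value is implementation-defined in C;
       common compilers use an arithmetic shift. */
    return (fixed16)(((int64_t)a * b) >> 16);
}

static fixed16 fixed_div(fixed16 a, fixed16 b)
{
    assert(b != 0);
    /* Pre-scale the numerator; note this still needs an integer divide. */
    return (fixed16)(((int64_t)a * FIXED_ONE) / b);
}
```

Every multiply needs a widening step and a shift, and division still falls back on an integer divide, which is part of why this approach is not so great here.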
Floating point support
• There’s a floating point
  unit

• Compiled C/C++/ObjC
  code uses the VFP unit
  for any floating point
  operations.
Sample program
    struct Particle
    {
        float x, y, z;
        float vx, vy, vz;
    };

    for (int i=0; i<MaxParticles; ++i)
    {
        Particle& p = s_particles[i];
        p.x += p.vx*dt;
        p.y += p.vy*dt;
        p.z += p.vz*dt;
        p.vx *= drag;
        p.vy *= drag;
        p.vz *= drag;
    }

• 14.1 seconds on an iPod Touch 2nd gen
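
To reproduce this kind of measurement, something like the following C harness could be used. It is only a sketch: the function names, the use of `clock()` and the step count are assumptions, and the slides do not say how the 14.1 seconds were actually measured.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef struct {
    float x, y, z;
    float vx, vy, vz;
} Particle;

/* One simulation step over `count` particles, same math as the slide's loop. */
static void update_particles(Particle* particles, size_t count, float dt, float drag)
{
    for (size_t i = 0; i < count; ++i) {
        Particle* p = &particles[i];
        p->x += p->vx * dt;
        p->y += p->vy * dt;
        p->z += p->vz * dt;
        p->vx *= drag;
        p->vy *= drag;
        p->vz *= drag;
    }
}

/* Runs `steps` updates and returns the elapsed CPU time in seconds. */
static double time_updates(Particle* particles, size_t count, int steps,
                           float dt, float drag)
{
    assert(particles != NULL);
    clock_t start = clock();
    for (int s = 0; s < steps; ++s) {
        update_particles(particles, count, dt, drag);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Timing the same function with different compiler settings (Thumb on and off, for instance) gives comparable numbers, as long as the particle count and step count stay fixed.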
Floating point support

         Trust no one!



 When in doubt, check the
   assembly generated
Thumb Mode
   • CPU has a special thumb mode.
   • Less memory, maybe better
     performance.
   • No floating point support.
   • Every time there’s an fp operation, it
     switches out of Thumb, does the fp
     operation, and switches back on.
Thumb Mode

    • It’s on by default!
    • Potentially HUGE wins
      turning it off.
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
• Heavy usage of floating point operations
  though
• Most games will probably benefit from
  turning it off (especially 3D games)
Sample program with Thumb mode off: 5.1 seconds!
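
One way to double-check which mode a file is really built in is to test the compiler's predefined macros. The sketch below relies on `__thumb__` and `__arm__`, which GCC and Clang define on 32-bit ARM targets; the function name is made up for illustration. With GCC, `-mthumb` and `-marm` select the instruction set.

```c
/* Reports which instruction set this translation unit was compiled for.
   __thumb__ and __arm__ are predefined by GCC and Clang on 32-bit ARM. */
static const char* instruction_set(void)
{
#if defined(__thumb__)
    return "thumb";
#elif defined(__arm__)
    return "arm";
#else
    return "not 32-bit ARM";   /* e.g. the simulator or a desktop build */
#endif
}
```

Logging this string once at startup from a floating-point-heavy source file is a cheap sanity check that the build setting took effect.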
ARM assembly
              DISCLAIMER:
I’m not an ARM assembly expert!!!




          Z80!!!
ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
• http://bit.ly/arminfo
ARM assembly

• Reading assembly is a very important skill
  for high-performance programming
• Writing is more specialized. Most people
  don’t need to.
VFP unit
A0 + B0 = C0
A1 + B1 = C1
A2 + B2 = C2
A3 + B3 = C3
VFP unit
A0   A1   A2   A3
        +
B0   B1   B2   B3
        =
C0   C1   C2   C3

Sweet! How do we use the vfp?
Like this!

"fldmias %2, {s8-s23}     \n\t"
"fldmias %1!, {s0-s3}     \n\t"
"fmuls s24, s8, s0        \n\t"
"fmacs s24, s12, s1       \n\t"

"fldmias %1!, {s4-s7}     \n\t"

"fmacs s24, s16, s2       \n\t"
"fmacs s24, s20, s3       \n\t"
"fstmias %0!, {s24-s27}   \n\t"
Writing vfp assembly

• There are two parts to it
 • How to write any assembly in gcc
 • Learning ARM and VFP assembly
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
• Might not be exactly what you need, but it’s
  a great starting point
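
As a plain-C reference for the kind of vector/matrix routine such a library provides, the sketch below multiplies a 4x4 matrix by a 4-component vector. It appears to match what the "Like this!" snippet earlier computes, assuming a vector length of 4 and column-major storage. Both are assumptions, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: out = m * v, where m is a 4x4 matrix
   stored column-major (m[col * 4 + row]) and v, out are 4-vectors.
   out must not alias v. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    assert(m != NULL && v != NULL && out != NULL);
    for (int row = 0; row < 4; ++row) {
        /* One scaled column at a time: one multiply, then three
           multiply-accumulates, like fmuls followed by three fmacs. */
        out[row]  = m[0 * 4 + row] * v[0];
        out[row] += m[1 * 4 + row] * v[1];
        out[row] += m[2 * 4 + row] * v[2];
        out[row] += m[3 * 4 + row] * v[3];
    }
}
```

A routine like this is handy as a correctness check: run it next to the assembly version on the same inputs and compare the results.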
Assembly in gcc
 • Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
	 #define USE_VFP
#endif
Assembly in gcc
• The basics

          asm ("cmp r2, r1");

http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Assembly in gcc
• Multiple lines
            asm (
                "mov r0, #1000\n\t"
                "cmp r2, r1\n\t"
            );
Assembly in gcc
• Accessing C variables
    asm (//assembly code
        : // output operands
        : // input operands
        : // clobbered registers
    );

    int src = 19;
    int dest = 0;

    asm volatile (
        "add %0, %1, #42"
        : "=r" (dest)
        : "r" (src)
        :
    );

%0, %1, etc. are the variables in order
Assembly in gcc
	   	   int src = 19;      volatile prevents “optimizations”
	   	   int dest = 0;
	   	
	   	   asm volatile (
	   	   	 "add r10, %1, #42nt"
	   	   	 "add %0, r10, #33nt"
	   	   	 : "=r" (dest)
	   	   	 : "r" (src)
	   	   	 : "r10"
	   	   );

                        Clobber register list
                        are registers used by
                           the asm block
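
Putting the pieces together, here is a sketch of how the example above could be wrapped so the same source also builds off-device. The function name `add_constants` is made up, and the guard uses the compiler's `__arm__` / `__thumb__` macros rather than the `TargetConditionals.h` check shown earlier. That swap is an assumption made here to keep the example self-contained.

```c
/* Returns src + 42 + 33. Uses the slide's inline assembly when compiled for
   32-bit ARM in ARM (non-Thumb) mode, and plain C everywhere else. */
static int add_constants(int src)
{
#if defined(__arm__) && !defined(__thumb__)
    int dest;
    __asm__ volatile (
        "add r10, %1, #42\n\t"   /* r10 = src + 42 */
        "add %0, r10, #33\n\t"   /* dest = r10 + 33 */
        : "=r" (dest)            /* %0: output operand */
        : "r" (src)              /* %1: input operand */
        : "r10"                  /* scratch register the block overwrites */
    );
    return dest;
#else
    return src + 42 + 33;        /* fallback for the simulator and other CPUs */
#endif
}
```

Keeping a C fallback next to every asm block also gives you a reference result to compare against when the assembly misbehaves.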
VFP asm
Four banks of 8 32-bit registers each




        Can address them as single precision
                  or as doubles
VFP asm

#define VFP_VECTOR_LENGTH(VEC_LENGTH) \
    "fmrx    r0, fpscr                         \n\t" \
    "bic     r0, r0, #0x00370000               \n\t" \
    "orr     r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
    "fmxr    fpscr, r0                         \n\t"
VFP asm




       Bank 0 is always scalar!
Operations only work on a single bank
       (wrap around possible)
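
The bit arithmetic inside the VFP_VECTOR_LENGTH macro can be modelled in plain C. In the FPSCR register, bits 16-18 (LEN) hold the vector length minus one and bits 20-21 hold the stride, so the macro's argument is the raw field value: an argument of 3 selects vectors of four floats. The function below is a hypothetical helper for illustration, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 16-18 of FPSCR hold LEN (vector length - 1); bits 20-21 hold STRIDE. */
#define FPSCR_LEN_STRIDE_MASK 0x00370000u

/* Returns the FPSCR value the macro would write for a given vector length. */
static uint32_t fpscr_with_vector_length(uint32_t fpscr, unsigned length)
{
    assert(length >= 1 && length <= 8);
    fpscr &= ~FPSCR_LEN_STRIDE_MASK;          /* bic r0, r0, #0x00370000 */
    fpscr |= (uint32_t)(length - 1) << 16;    /* orr r0, r0, #0x000N0000 */
    return fpscr;
}
```

All other FPSCR bits pass through unchanged, which is why the macro reads the register first instead of writing a constant.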
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    Particle& p = s_particles[i];
    p.x += p.vx*dt;
    p.y += p.vy*dt;
    p.z += p.vz*dt;
    p.vx *= drag;
    p.vy *= drag;
    p.vz *= drag;
}

for (int i=0; i<MaxParticles; ++i)
{
    void* p = &s_particles[i];
    asm volatile (
        "fldmias %1, {s0}        \n\t"
        "fldmias %2, {s1}        \n\t"
        "fldmias %0, {s8-s13}    \n\t"
        "fmacs s8, s11, s0       \n\t"
        "fmuls s11, s11, s1      \n\t"
        "fstmias %0, {s8-s13}    \n\t"
        :
        : "r" (p), "r" (&dt), "r" (&drag)
        : "r0", "cc", "memory"
    );
}

Was: 5.1 seconds
Now: 2.7 seconds!!
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    void* p = &s_particles[i];
    asm volatile (
        "fldmias %1, {s0}        \n\t"
        "fldmias %2, {s1}        \n\t"
        "fldmias %0, {s8-s13}    \n\t"
        "fmacs s8, s11, s0       \n\t"
        "fmuls s11, s11, s1      \n\t"
        "fstmias %0, {s8-s13}    \n\t"
        :
        : "r" (p), "r" (&dt), "r" (&drag)
        : "r0", "cc", "memory"
    );
}

The dt and drag loads are the same every loop!
VFP asm
void* p = &s_particles[0];
asm volatile (
    "fldmias %1, {s0}        \n\t"
    "fldmias %2, {s1}        \n\t"
    "mov r1, %0              \n\t"
    "mov r2, %0              \n\t"
    "mov r3, %3              \n\t"
    "0:                      \n\t"
    "fldmias r1!, {s8-s13}   \n\t"
    "fmacs s8, s11, s0       \n\t"
    "fmuls s11, s11, s1      \n\t"
    "fstmias r2!, {s8-s13}   \n\t"
    "subs r3, r3, #1         \n\t"
    "bne 0b                  \n\t"
    :
    : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
    : "r0", "r1", "r2", "r3", "cc", "memory"
);

Was: 2.7 seconds
Now: 2.7 seconds
VFP asm
We can do 8 operations at once. So let’s try doing two
           particles in a single operation.

                	   struct Particle2
                	   {
                	   	 float x0, y0, z0;
                	   	 float x1, y1, z1;
                	   	 float vx0, vy0, vz0;
                	   	 float vx1, vy1, vz1;
                	   };
VFP asm
void* p = &s_particles2[0];
asm volatile (
    "fldmias %1, {s0}         \n\t"
    "fldmias %2, {s1}         \n\t"
    "mov r1, %0               \n\t"
    "mov r2, %0               \n\t"
    "mov r3, %3               \n\t"
    "0:                       \n\t"
    "fldmias r1!, {s8-s13}    \n\t"
    "fldmias r1!, {s16-s21}   \n\t"
    "fmacs s8, s16, s0        \n\t"
    "fmuls s16, s16, s1       \n\t"
    "fstmias r2!, {s8-s13}    \n\t"
    "fstmias r2!, {s16-s21}   \n\t"
    "subs r3, r3, #1          \n\t"
    "bne 0b                   \n\t"
    :
    : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
    : "r0", "r1", "r2", "r3", "cc", "memory"
);

Was: 2.77 seconds
Now: 2.67 seconds
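
For comparison, this is roughly what the paired loop computes, written as plain C. It is a sketch for checking results, not the author's code: the helper names are hypothetical, and `iterations` is assumed to be the number of Particle2 structs (two particles each).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float x0, y0, z0;
    float x1, y1, z1;
    float vx0, vy0, vz0;
    float vx1, vy1, vz1;
} Particle2;

/* Integrates one component: position += velocity * dt; velocity *= drag. */
static void integrate(float* pos, float* vel, float dt, float drag)
{
    *pos += *vel * dt;
    *vel *= drag;
}

/* Six positions and six velocities per struct, matching the two six-register
   loads (s8-s13 and s16-s21) in the assembly version. */
static void update_particles2(Particle2* particles, size_t iterations,
                              float dt, float drag)
{
    assert(particles != NULL);
    for (size_t i = 0; i < iterations; ++i) {
        Particle2* p = &particles[i];
        integrate(&p->x0, &p->vx0, dt, drag);
        integrate(&p->y0, &p->vy0, dt, drag);
        integrate(&p->z0, &p->vz0, dt, drag);
        integrate(&p->x1, &p->vx1, dt, drag);
        integrate(&p->y1, &p->vy1, dt, drag);
        integrate(&p->z1, &p->vz1, dt, drag);
    }
}
```

Note how the struct layout groups all positions before all velocities, so each group can be loaded into one register bank.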
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Particle* p = &s_particles[i];
	   	   	   	 p->x = p->vx;
	   	   	   	 p->y = p->vy;
	   	   	   	 p->z = p->vz;
	   	   	   }



            Was: 2.67 seconds
            Now: 2.41 seconds!!!!
Matrix multiply
Straight from vfpmathlib

Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s

     About 2x faster!
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
 • Skinning
 • Physics
 • Procedural content generation
 • ....
What about the 3GS?
Sample program, in seconds
          3G     3GS
 Thumb    14.1   14.5
 Normal   5.14   4.76
  VFP1    2.77   4.53
  VFP2    2.77   4.26
  VFP3    2.66   3.57
 Touch    2.41   0.42
Matrix multiply on 3GS
          In ms
            3G    3GS
 Normal     96    82

  VFP1      42    90

  VFP2      42    75

  Touch     38    19
VFP resources
• ARM and VFP reference in your USB drive
• http://code.google.com/p/vfpmathlibrary
• http://aleiby.blogspot.com/2008/12/iphone-vfp-for-n00bs.html
• http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
• Not many examples yet
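
As a starting point, here is a hedged sketch of the particle update written with NEON intrinsics from `arm_neon.h`, with a scalar fallback so it builds anywhere. It was not measured for these slides. It assumes positions and velocities sit in separate float arrays (a different layout from the Particle struct) and that the count is a multiple of 4. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* pos[i] += vel[i] * dt; vel[i] *= drag, four floats per iteration.
   Assumption: count is a multiple of 4. */
static void integrate4(float* pos, float* vel, size_t count, float dt, float drag)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        float32x4_t p = vld1q_f32(pos + i);
        float32x4_t v = vld1q_f32(vel + i);
        p = vmlaq_n_f32(p, v, dt);            /* p += v * dt, four lanes */
        v = vmulq_n_f32(v, drag);             /* v *= drag */
        vst1q_f32(pos + i, p);
        vst1q_f32(vel + i, v);
#else
        for (size_t k = i; k < i + 4; ++k) {  /* scalar fallback */
            pos[k] += vel[k] * dt;
            vel[k] *= drag;
        }
#endif
    }
}
```

Unlike the VFP code, this needs no inline assembly: the intrinsics map to NEON instructions and the compiler handles register allocation.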
NEON resources

• Cortex A8 reference in USB drive
• http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
• http://code.google.com/p/oolongengine/source/browse/trunk/Oolong+Engine2/Math/neonmath
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
  older hardware by using vfp
• Not much difference in 3GS (but it’s fast
  already)
• NEON SIMD tech still unused. Research
  that and be the first one with the killer 3GS
  app!
Thank you!


         Noel Llopis
        Snappy Touch

http://twitter.com/snappytouch
   noel@snappytouch.com
 http://gamesfromwithin.com

Contenu connexe

En vedette

Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bitsChiou-Nan Chen
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSantosh Verma
 
Qualcomm SnapDragon 800 Mobile Device
Qualcomm SnapDragon 800 Mobile DeviceQualcomm SnapDragon 800 Mobile Device
Qualcomm SnapDragon 800 Mobile DeviceJJ Wu
 
OpenCV for Embedded: Lessons Learned
OpenCV for Embedded: Lessons LearnedOpenCV for Embedded: Lessons Learned
OpenCV for Embedded: Lessons LearnedYury Gorbachev
 
ARM architcture
ARM architcture ARM architcture
ARM architcture Hossam Adel
 
iPhone Architecture - Review
iPhone Architecture - ReviewiPhone Architecture - Review
iPhone Architecture - ReviewAbdelrahman Hosny
 
12 Cooling Load Calculations
12 Cooling Load Calculations12 Cooling Load Calculations
12 Cooling Load Calculationsspsu
 

En vedette (13)

ARM cortex A15
ARM cortex A15ARM cortex A15
ARM cortex A15
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 
Imaging on embedded GPUs
Imaging on embedded GPUsImaging on embedded GPUs
Imaging on embedded GPUs
 
Android Optimization: Myth and Reality
Android Optimization: Myth and RealityAndroid Optimization: Myth and Reality
Android Optimization: Myth and Reality
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
Qualcomm SnapDragon 800 Mobile Device
Qualcomm SnapDragon 800 Mobile DeviceQualcomm SnapDragon 800 Mobile Device
Qualcomm SnapDragon 800 Mobile Device
 
Snapdragon Processor
Snapdragon ProcessorSnapdragon Processor
Snapdragon Processor
 
OpenCV for Embedded: Lessons Learned
OpenCV for Embedded: Lessons LearnedOpenCV for Embedded: Lessons Learned
OpenCV for Embedded: Lessons Learned
 

Cranking Floating Point Performance To 11 On The iPhone

  • 1. Cranking Floating Point Performance Up To 11
      Noel Llopis, Snappy Touch
      http://twitter.com/snappytouch
      noel@snappytouch.com
      http://gamesfromwithin.com
  • 2.
      void* p = &s_particles2[0];
      asm volatile (
          "fldmias  %1, {s0}         \n\t"
          "fldmias  %2, {s1}         \n\t"
          "mov      r1, %0           \n\t"
          "mov      r2, %0           \n\t"
          "mov      r3, %3           \n\t"
          "0:                        \n\t"
          "fldmias  r1!, {s8-s13}    \n\t"
          "fldmias  r1!, {s16-s21}   \n\t"
          "fmacs    s8, s16, s0      \n\t"
          "fmuls    s16, s16, s1     \n\t"
          "fstmias  r2!, {s8-s13}    \n\t"
          "fstmias  r2!, {s16-s21}   \n\t"
          "subs     r3, r3, #1       \n\t"
          "bne      0b               \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
  • 12. “Don’t do that bit-twiddling thing”
      “Optimize at the algorithm level”
      Yes, but the key to good performance is looking at your data and your target platform
  • 19. Floating point numbers
      • Representation of rational numbers
      • 1.2345, -0.8374, 2.0000, 14388439.34, etc.
      • Following IEEE 754 format
      • Single precision: 32 bits
      • Double precision: 64 bits
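To make the 32-bit layout concrete, here is a minimal sketch that splits a single-precision value into its IEEE 754 fields (1 sign bit, 8 exponent bits biased by 127, 23 mantissa bits). The names `FloatBits` and `float_bits` are made up for this illustration and are not from the talk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 single-precision value (illustrative type). */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits; normal numbers have an implicit leading 1 */
} FloatBits;

static FloatBits float_bits(float value)
{
    uint32_t raw;
    FloatBits out;

    /* This sketch assumes float is 32 bits wide, as on the iPhone. */
    assert(sizeof raw == sizeof value);
    /* memcpy is the portable way to reinterpret the bytes. */
    memcpy(&raw, &value, sizeof raw);

    out.sign = raw >> 31;
    out.exponent = (raw >> 23) & 0xFFu;
    out.mantissa = raw & 0x7FFFFFu;
    return out;
}
```

For example, 2.0f has a biased exponent of 128 (that is, 2^1) and an all-zero mantissa.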
  • 25. Why floating point performance?
      • Most games use floating point numbers for most of their calculations
      • Positions, velocities, physics, etc.
      • Maybe not so much for regular apps
  • 29. CPU
      • 32-bit RISC ARM 11
      • 400-535 MHz
      • iPhone 2G/3G and iPod Touch 1st and 2nd gen
  • 32. CPU (iPhone 3GS)
      • Cortex-A8 600 MHz
      • More advanced architecture
  • 34. CPU
      • No floating point support in the ARM CPU!!!
  • 38. How about integer math?
      • No need to do any floating point operations
      • Fully supported in the ARM processor
      • But...
  • 41. Integer Divide
      There is no integer divide instruction
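Without a divide instruction, every `/` on integers ends up in software, normally as a call to a runtime library routine the compiler supplies. The sketch below shows the general idea with a plain shift-and-subtract loop. `udiv32` is a hypothetical name, and real runtime routines are considerably more optimized than this.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit divide by shift-and-subtract (restoring division).
   Illustrative only; not the routine any particular compiler uses. */
static uint32_t udiv32(uint32_t numerator, uint32_t denominator,
                       uint32_t *remainder)
{
    uint32_t quotient = 0;
    uint64_t rem = 0;   /* 64 bits so the shift below can never overflow */
    int bit;

    assert(denominator != 0);

    for (bit = 31; bit >= 0; --bit) {
        /* Bring down the next bit of the numerator. */
        rem = (rem << 1) | ((numerator >> bit) & 1u);
        if (rem >= denominator) {
            rem -= denominator;
            quotient |= (uint32_t)1 << bit;
        }
    }
    if (remainder != 0)
        *remainder = (uint32_t)rem;
    return quotient;
}
```

One divide costs 32 loop iterations here, which is why divisions in inner loops are worth avoiding on these CPUs.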
  • 47. Fixed-point arithmetic
      • Sometimes integer arithmetic doesn’t cut it
      • You need to represent rational numbers
      • Can use a fixed-point library
      • Performs rational arithmetic with integer values at a reduced range/resolution
      • Not so great...
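A rough sketch of what such a library does, assuming a 16.16 format (16 integer bits, 16 fractional bits). The type and function names are invented for this example and do not come from any specific library.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed16;                    /* hypothetical 16.16 format */
#define FIXED_ONE ((fixed16)1 << 16)        /* the value 1.0 */

static fixed16 fixed_from_float(float v)
{
    return (fixed16)(v * (float)FIXED_ONE);
}

static float fixed_to_float(fixed16 v)
{
    return (float)v / (float)FIXED_ONE;
}

static fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    /* Widen to 64 bits so the intermediate product does not overflow,
       then drop the extra 16 fractional bits. Right-shifting a negative
       value is implementation-defined in C; mainstream compilers use an
       arithmetic shift, which is what this sketch assumes. */
    return (fixed16)(((int64_t)a * (int64_t)b) >> 16);
}
```

The "reduced range/resolution" is visible in the format itself: values must stay within roughly ±32768 and the smallest step is 1/65536.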
  • 50. Floating point support
      • There’s a floating point unit
      • Compiled C/C++/ObjC code uses the VFP unit for any floating point operations
  • 54. Sample program
      struct Particle
      {
          float x, y, z;
          float vx, vy, vz;
      };

      for (int i=0; i<MaxParticles; ++i)
      {
          Particle& p = s_particles[i];
          p.x += p.vx*dt;
          p.y += p.vy*dt;
          p.z += p.vz*dt;
          p.vx *= drag;
          p.vy *= drag;
          p.vz *= drag;
      }
      • 14.1 seconds on an iPod Touch 2nd gen
  • 57. Floating point support
      Trust no one!
      When in doubt, check the generated assembly
  • 64. Thumb Mode
      • CPU has a special Thumb mode
      • Less memory, maybe better performance
      • No floating point support
      • Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on
  • 68. Thumb Mode
      • It’s on by default!
      • Potentially HUGE wins turning it off
  • 72. Thumb Mode
      • Turning off Thumb mode increased performance in Flower Garden by over 2x
      • Heavy usage of floating point operations though
      • Most games will probably benefit from turning it off (especially 3D games)
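One cheap way to confirm which mode a source file is actually built in: GCC and Clang predefine `__thumb__` when they generate Thumb code, so a small check like the sketch below can be logged at startup. The function name is made up for this example. With plain GCC the mode is chosen with `-mthumb` or `-marm`; the matching Xcode project setting is not shown here.

```c
/* Returns 1 when this translation unit was compiled to Thumb code,
   0 otherwise (including on non-ARM targets such as the simulator). */
static int compiled_for_thumb(void)
{
#if defined(__thumb__)
    return 1;
#else
    return 0;
#endif
}
```

Since the mode is chosen per compiled file, the check only describes the file it lives in.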
  • 79. ARM assembly
      DISCLAIMER: I’m not an ARM assembly expert!!!
      Z80!!!
  • 84. ARM assembly
      • Hit the docs
      • References included in your USB card
      • Or download them from the ARM site
      • http://bit.ly/arminfo
  • 87. ARM assembly
      • Reading assembly is a very important skill for high-performance programming
      • Writing is more specialized. Most people don’t need to.
  • 96. VFP unit
      A0 + B0 = C0
      A1 + B1 = C1
      A2 + B2 = C2
      A3 + B3 = C3
  • 103. VFP unit
      A0 A1 A2 A3
      +
      B0 B1 B2 B3
      =
      C0 C1 C2 C3
      Sweet! How do we use the vfp?
  • 104. Like this!
      "fldmias  %2, {s8-s23}    \n\t"
      "fldmias  %1!, {s0-s3}    \n\t"
      "fmuls    s24, s8, s0     \n\t"
      "fmacs    s24, s12, s1    \n\t"
      "fldmias  %1!, {s4-s7}    \n\t"
      "fmacs    s24, s16, s2    \n\t"
      "fmacs    s24, s20, s3    \n\t"
      "fstmias  %0!, {s24-s27}  \n\t"
  • 108. Writing vfp assembly
      • There are two parts to it
      • How to write any assembly in gcc
      • Learning ARM and VFP assembly
  • 113. vfpmath library
      • Already done a lot of work for you
      • http://code.google.com/p/vfpmathlibrary
      • Vector/matrix math
      • Might not be exactly what you need, but it’s a great starting point
  • 115. Assembly in gcc
      • Only use it when targeting the device
      #include <TargetConditionals.h>
      #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
      #define USE_VFP
      #endif
  • 117. Assembly in gcc
      • The basics
      asm ("cmp r2, r1");
      http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
  • 118. Assembly in gcc
      • Multiple lines
      asm (
          "mov r0, #1000\n\t"
          "cmp r2, r1\n\t"
      );
  • 121. Assembly in gcc
      • Accessing C variables
      asm ( // assembly code
          : // output operands
          : // input operands
          : // clobbered registers
      );

      int src = 19;
      int dest = 0;
      asm volatile (
          "add %0, %1, #42"
          : "=r" (dest)
          : "r" (src)
          :
      );
      %0, %1, etc. are the variables, in order
  • 125. Assembly in gcc
      int src = 19;
      int dest = 0;
      asm volatile (
          "add r10, %1, #42\n\t"
          "add %0, r10, #33\n\t"
          : "=r" (dest)
          : "r" (src)
          : "r10"
      );
      volatile prevents “optimizations”
      The clobber list names the registers the asm block overwrites
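Putting the pieces together, the sketch below wraps the `add` example in a function with a plain C fallback, so the same file builds for the device and for the simulator. It keys off the compiler's `__arm__` macro instead of the `USE_VFP` define from the earlier slide, which is a substitution made for this example. It also assumes the ARM build uses ARM mode or Thumb-2, where this three-operand `add` with an immediate is valid.

```c
/* Adds 42 to src. Uses inline assembly on 32-bit ARM builds and
   ordinary C everywhere else. add42 is an illustrative name. */
static int add42(int src)
{
    int dest;
#if defined(__arm__) && defined(__GNUC__)
    __asm__ volatile (
        "add %0, %1, #42"
        : "=r" (dest)      /* %0: output operand */
        : "r" (src)        /* %1: input operand  */
    );
#else
    dest = src + 42;       /* fallback for the simulator and other targets */
#endif
    return dest;
}
```

Keeping a C fallback next to every asm block also gives you a reference to test the assembly against.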
  • 126. VFP asm
      Four banks of 8 32-bit registers each
      Can address them as single precision or as doubles
  • 128. VFP asm
      #define VFP_VECTOR_LENGTH(VEC_LENGTH) \
          "fmrx    r0, fpscr                         \n\t" \
          "bic     r0, r0, #0x00370000               \n\t" \
          "orr     r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
          "fmxr    fpscr, r0                         \n\t"
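The macro edits two fields of the FPSCR register: the mask `0x00370000` clears LEN (bits 16-18) and STRIDE (bits 20-21), and the `orr` writes the new LEN. LEN is stored as the vector length minus one, so `VFP_VECTOR_LENGTH(3)` selects 4-element vectors. The sketch below does the same bit arithmetic in plain C so it can be checked off-device; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same mask as the macro: LEN is bits 16-18, STRIDE is bits 20-21. */
#define FPSCR_LEN_STRIDE_MASK 0x00370000u

/* Returns the FPSCR value the macro would write for a vector of
   `length` elements (1 to 8), given the current FPSCR value. */
static uint32_t fpscr_with_vector_length(uint32_t fpscr, unsigned length)
{
    assert(length >= 1 && length <= 8);
    fpscr &= ~FPSCR_LEN_STRIDE_MASK;          /* bic r0, r0, #0x00370000 */
    fpscr |= (uint32_t)(length - 1) << 16;    /* orr r0, r0, #0x000N0000 */
    return fpscr;
}
```

Resetting the length to 1 after a vector block matters, because compiled C code expects the VFP unit to be in scalar mode.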
  • 129. VFP asm
      Bank 0 is always scalar!
      Operations only work on a single bank (wrap around possible)
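A small model of the bank rules, assuming single-precision registers s0 to s31 and a stride of 1: register sN lives in bank N / 8, and a vector operand that runs past the end of its bank wraps back to the start of the same bank. Both helper names are invented for this sketch.

```c
#include <assert.h>

/* Register number of element `element` of a vector that starts at
   register s<first>. Elements stay in the starting register's bank
   and wrap around inside it. */
static unsigned vfp_vector_register(unsigned first, unsigned element)
{
    unsigned bank, offset;

    assert(first < 32);
    bank = first / 8;
    offset = (first % 8 + element) % 8;
    return bank * 8 + offset;
}

/* Registers s0 to s7 (bank 0) are always treated as scalars. */
static int vfp_register_is_scalar(unsigned reg)
{
    return reg / 8 == 0;
}
```

A 4-element vector starting at s13 therefore covers s13, s14, s15 and then s8, which is rarely what you want. It is also why the particle code keeps dt and drag in s0 and s1 and its vectors in s8 and up.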
  • 132. VFP asm
      for (int i=0; i<MaxParticles; ++i)
      {
          Particle& p = s_particles[i];
          p.x += p.vx*dt;
          p.y += p.vy*dt;
          p.z += p.vz*dt;
          p.vx *= drag;
          p.vy *= drag;
          p.vz *= drag;
      }
  • 135. VFP asm
      for (int i=0; i<MaxParticles; ++i)
      {
          void* p = &s_particles[i];
          asm volatile (
              "fldmias  %1, {s0}        \n\t"   // s0 = dt
              "fldmias  %2, {s1}        \n\t"   // s1 = drag
              "fldmias  %0, {s8-s13}    \n\t"   // load x, y, z, vx, vy, vz
              "fmacs    s8, s11, s0     \n\t"   // position += velocity * dt
              "fmuls    s11, s11, s1    \n\t"   // velocity *= drag
              "fstmias  %0, {s8-s13}    \n\t"   // store the particle back
              :
              : "r" (p), "r" (&dt), "r" (&drag)
              : "r0", "cc", "memory"
          );
      }
      Was: 5.1 seconds
      Now: 2.7 seconds!!
  • 137. VFP asm
      The dt and drag loads (the first two fldmias) are the same every loop!
  • 140. VFP asm
      void* p = &s_particles[0];
      asm volatile (
          "fldmias  %1, {s0}        \n\t"   // s0 = dt, loaded once
          "fldmias  %2, {s1}        \n\t"   // s1 = drag, loaded once
          "mov      r1, %0          \n\t"   // r1 = read pointer
          "mov      r2, %0          \n\t"   // r2 = write pointer
          "mov      r3, %3          \n\t"   // r3 = loop counter
          "0:                       \n\t"
          "fldmias  r1!, {s8-s13}   \n\t"
          "fmacs    s8, s11, s0     \n\t"
          "fmuls    s11, s11, s1    \n\t"
          "fstmias  r2!, {s8-s13}   \n\t"
          "subs     r3, r3, #1      \n\t"
          "bne      0b              \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
      Was: 2.7 seconds
      Now: 2.7 seconds
  • 141. VFP asm
      We can do 8 operations at once. So let’s try doing two particles in a single operation.
      struct Particle2
      {
          float x0, y0, z0;
          float x1, y1, z1;
          float vx0, vy0, vz0;
          float vx1, vy1, vz1;
      };
  • 144. VFP asm
      void* p = &s_particles2[0];
      asm volatile (
          "fldmias  %1, {s0}         \n\t"
          "fldmias  %2, {s1}         \n\t"
          "mov      r1, %0           \n\t"
          "mov      r2, %0           \n\t"
          "mov      r3, %3           \n\t"
          "0:                        \n\t"
          "fldmias  r1!, {s8-s13}    \n\t"   // positions of both particles
          "fldmias  r1!, {s16-s21}   \n\t"   // velocities of both particles
          "fmacs    s8, s16, s0      \n\t"
          "fmuls    s16, s16, s1     \n\t"
          "fstmias  r2!, {s8-s13}    \n\t"
          "fstmias  r2!, {s16-s21}   \n\t"
          "subs     r3, r3, #1       \n\t"
          "bne      0b               \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
      Was: 2.77 seconds
      Now: 2.67 seconds
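For checking the assembly, a plain C version of the same two-particle update is handy. The sketch below assumes the `Particle2` layout from the slide (six position floats followed by six velocity floats) but stores them as two arrays so the six lanes can be looped over. The array form and the function name are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Same memory layout as the slide's Particle2: 6 positions, 6 velocities. */
typedef struct {
    float pos[6];   /* x0, y0, z0, x1, y1, z1 */
    float vel[6];   /* vx0, vy0, vz0, vx1, vy1, vz1 */
} Particle2;

/* Scalar reference for what one pass of the asm loop computes. */
static void update_particles2(Particle2 *particles, size_t count,
                              float dt, float drag)
{
    size_t i;
    int lane;

    assert(particles != NULL || count == 0);

    for (i = 0; i < count; ++i) {
        for (lane = 0; lane < 6; ++lane) {
            particles[i].pos[lane] += particles[i].vel[lane] * dt; /* fmacs */
            particles[i].vel[lane] *= drag;                        /* fmuls */
        }
    }
}
```

Running both versions on the same input and comparing results is a quick way to catch a wrong register range or vector length.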
  • 147. VFP asm
      What’s the loop/cache overhead?
      for (int i=0; i<MaxParticles; ++i)
      {
          Particle* p = &s_particles[i];
          p->x = p->vx;
          p->y = p->vy;
          p->z = p->vz;
      }
      Was: 2.67 seconds
      Now: 2.41 seconds!!!!
  • 154. Matrix multiply
      Straight from vfpmathlib
      Touch:  0.0379 s
      Normal: 0.0968 s
      VFP:    0.0422 s
      About 2x faster!
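For reference, this is roughly the work being timed: a plain C 4x4 multiply. It is a sketch written for this transcript, not the vfpmathlib source, and it assumes column-major storage (OpenGL style), which is a guess about the layout used in the benchmark.

```c
/* out = a * b for 4x4 matrices stored column-major:
   element (row, col) lives at index col * 4 + row.
   out must not alias a or b. */
static void matrix4_multiply(const float a[16], const float b[16],
                             float out[16])
{
    int col, row, k;

    for (col = 0; col < 4; ++col) {
        for (row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            out[col * 4 + row] = sum;
        }
    }
}
```

Each output column is a sum of four scaled columns of `a`, which is the pattern the `fmuls`/`fmacs` sequence on the "Like this!" slide follows with 4-element vectors.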
  • 162. Good use of vfp
      Something with lots of fp operations in a row
      • Matrix operations
      • Particle systems
      • Skinning
      • Physics
      • Procedural content generation
      • ....
  • 167. What about the 3GS?
    Times in seconds, as in the earlier timings.

              3G      3GS
    Thumb     14.1    14.5
    Normal    5.14    4.76
    VFP1      2.77    4.53
    VFP2      2.77    4.26
    VFP3      2.66    3.57
    Touch     2.41    0.42
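The timings on this slide are easier to compare as ratios. A tiny helper (the name `speedup` is hypothetical) makes the arithmetic explicit:

```c
#include <assert.h>

/* How many times faster `elapsed` is than `baseline` (same unit for both). */
double speedup(double baseline, double elapsed)
{
    assert(baseline > 0.0 && elapsed > 0.0);
    return baseline / elapsed;
}
```

With the numbers above, going from Normal to VFP3 gives `speedup(5.14, 2.66)`, about 1.9x on the 3G, but only `speedup(4.76, 3.57)`, about 1.3x on the 3GS. Turning Thumb off on the 3G gives `speedup(14.1, 5.14)`, about 2.7x.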
  • 172. Matrix multiply on 3GS
    In ms

              3G      3GS
    Normal    96      82
    VFP1      42      90
    VFP2      42      75
    Touch     38      19
  • 173. VFP resources
    • ARM and VFP reference in your USB drive
    • http://code.google.com/p/vfpmathlibrary
    • http://aleiby.blogspot.com/2008/12/iphone-vfp-for-n00bs.html
    • http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
  • 178. More 3GS: NEON
    • SIMD coprocessor
    • Floating point and integer
    • Huge potential
    • Not many examples yet
  • 182. NEON resources
    • Cortex A8 reference in USB drive
    • http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
    • http://code.google.com/p/oolongengine/source/browse/trunk/Oolong+Engine2/Math/neonmath
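As a taste of what the NEON intrinsics look like, here is a hedged sketch of the particle-style update written with `arm_neon.h`. It is not from the talk: the function name `integrate` and the separate `pos`/`vel` arrays are assumptions, and the code falls back to plain scalar C when NEON is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* pos[i] += vel[i] * dt; vel[i] *= drag; for `count` floats. */
void integrate(float *pos, float *vel, int count, float dt, float drag)
{
    assert(pos != NULL && vel != NULL && count >= 0);
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= count; i += 4) {
        float32x4_t p = vld1q_f32(pos + i);   /* load four positions  */
        float32x4_t v = vld1q_f32(vel + i);   /* load four velocities */
        p = vmlaq_n_f32(p, v, dt);            /* p += v * dt, four lanes */
        v = vmulq_n_f32(v, drag);             /* v *= drag */
        vst1q_f32(pos + i, p);
        vst1q_f32(vel + i, v);
    }
#endif
    for (; i < count; ++i) {                  /* scalar tail and fallback */
        pos[i] += vel[i] * dt;
        vel[i] *= drag;
    }
}
```

Whether this beats the scalar path on a given device has to be measured. The slides give no NEON timings.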
  • 187. Conclusions
    • Turn Thumb mode off NOW
    • Expect to get at least 2x performance on older hardware by using vfp
    • Not much difference on the 3GS (but it’s fast already)
    • NEON SIMD tech still unused. Research that and be the first one with the killer 3GS app!
  • 188. Thank you!
    Noel Llopis
    Snappy Touch
    http://twitter.com/snappytouch
    noel@snappytouch.com
    http://gamesfromwithin.com

Editor's notes

  1. Slides will be up on my web site after the talk
  2. I just wanted to flash that to scare people off :-)
  3-8. In the last 11 years I’ve made games for almost every major platform out there
  9. During that time I worked on engine technology, trying to achieve maximum performance. Performance was always very important.
  10-12. Almost a year ago I started Snappy Touch
  13-16. Don’t waste time optimizing if you don’t have to. Sometimes you need performance to back up design. Also, beware optimizing the best case. Optimize the worst case!
  17. Not necessary to understand the format, but it helps a lot to understand the source of problems. Single precision (32 bits) vs. double precision (64 bits).
  18-20. Let’s see what we have to work with
  21-22. We want to optimize for the old model
  23-24. So plan accordingly!
  25-27. ACTION: Go to Xcode. Simulator vs. device. Release.
  28. What’s going on in there??
  29. That’s more like it!
  30-38. I’m saying that so you can see how anybody can get a good understanding of that, not so that you leave me right now :-)
  39-44. Single operation
  45. Need to get down to the metal
  46-47. Can control how many registers to use for each operation. Up to 8 at once! Bank 0 is always scalar!!!
  48. Stride
  49. Read the manual for the specific instructions
  50-52. That didn’t help much! What’s going on?
  53-54. Barely an improvement!
  55-56. VFP is almost as fast as just iterating through the loop. Calculations become almost free!!!
  57. This is one pipeline. Can squeeze out more performance by doing up to 3 operations at the same time.
  58-62. ACTION: Switch to Xcode
  63-66. The VFP on the 3GS seems to be running at half the speed in comparison to the 3G. That, combined with a much faster CPU, makes it pretty useless there.
  67-70. Maybe next year’s 360iDev session :-)