SlideShare a Scribd company logo
1 of 87
Download to read offline
IBM Systems and Technology Group
© 2005 IBM Corporation
Tutorial
Hardware and Software Architectures
for the CELL BROADBAND ENGINE processor
Michael Day, Peter Hofstee
IBM Systems & Technology Group, Austin, Texas
CODES+ISSS Conference, September 2005
2
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Agenda
Trends in Processors and Systems
Introduction to Cell
Challenges to processor performance
Cell Broadband Engine Architecture (CBEA)
Cell Broadband Engine (CBE) Processor Overview
Cell Programming Models
Prototype Software Environment
TRE Demo
© 2005 IBM Corporation
3
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Trends in Microprocessors and Systems
4
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Processor Performance over Time
(Game processors take the lead on media performance)
10
100
1000
10000
100000
1000000
1993 1995 1996 1998 2000 2002 2004
Game Processors
PC Processors
Flops (SP)
5
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
System Trends toward Integration
Implied loss of system configuration flexibility
Must compensate with generality of acceleration function to maintain
market size.
Processor
Southbridge
NorthbridgeMemory
Accel
Next-Gen
Processor
Memory
IO
IO
6
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Motivation: Cell Goals
Outstanding performance, especially on
game/multimedia applications.
• Challenges: Memory Wall, Power Wall, Frequency Wall
Real time responsiveness to the user and the network.
• Challenges: Real-time in an SMP environment, Security
Applicable to a wide range of platforms.
• Challenge: Maintain programmability while increasing performance
Support an introduction in 2005/6.
• Challenge: Structure innovation such that 5yr. schedule can be met
7
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Performance Limiters in Conventional
Microprocessors
Memory Wall
• Latency induced bandwidth limitations
Power Wall
• Must improve efficiency and performance equally
Frequency Wall
• Diminishing returns from deeper pipelines
(can be negative if power is taken into account)
8
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell
© 2005 IBM Corporation
Systems and Technology Group
9
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell History
IBM, SCEI/Sony, Toshiba Alliance formed in 2000
Design Center opened in March 2001
Based in Austin, Texas
February 7, 2005: First external technical disclosures
• Cell Broadband Engine Architecture documentation can be found at:
http://www.ibm.com/developerworks/power/cell
• Additional publications on Cell can be downloaded from:
http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell
• A paper on Cell in the upcoming issue of the IBM Journal of Research and Development can be
found at:
http://www.research.ibm.com/journal/rd/494/kahle.html
Systems and Technology Group
© 2005 IBM Corporation
10
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Highlights
Supercomputer on a chip
Multi-core microprocessor (9 cores)
3.2 GHz clock frequency
10x performance for many applications
Digital home to distributed computing
Systems and Technology Group
© 2005 IBM Corporation
11
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Introducing Cell
Sets a new performance standard
• Exploits parallelism while achieving high frequency
• Supercomputer attributes with extreme floating point capabilities
• Sustains high memory bandwidth with smart DMA controllers
Designed for natural human interaction
• Photo-realistic effects
• Predictable real-time response
• Virtualized resources for concurrent activities
Designed for flexibility
• Wide variety of application domains
• Highly abstracted to highly exploitable programming models
• Reconfigurable I/O interfaces
• Autonomic power management
12
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Microprocessor Architecture Trends
CPU
GPU
NIC
Security
Media
…
CPU
Streaming
Graphics
Processor
Network
Processor
Security
Processor
Media
Processor
…
64b Power
Processor
Synergistic
Processor
Synergistic
Processor
...
Mem.
Contr.
Config.
IO
Hardwired
Function
Programmable
ASIC Cell
13
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Key Attributes of Cell
Cell is Multi-Core
• Contains 64-bit Power Architecture TM
• Contains 8 Synergistic Processor Elements (SPE)
Cell is a Flexible Architecture
• Multi-OS support (including Linux) with Virtualization technology
• Path for OS, legacy apps, and software development
Cell is a Broadband Architecture
• SPE is RISC architecture with SIMD organization and Local Store
• 128+ concurrent transactions to memory per processor
Cell is a Real-Time Architecture
• Resource allocation (for Bandwidth Measurement)
• Locking Caches (via Replacement Management Tables)
Cell is a Security Enabled Architecture
• SPE dynamically reconfigurable as secure processors
14
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Architecture is …
COHERENT BUS
Power
ISA
MMU/BIU
Power
ISA
MMU/BIU
…
IO
transl.
Memory
Incl. coherence/memory
compatible with 32/64b Power Arch. Applications and OS’s
64b Power Architecture™
15
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Architecture is … 64b Power Architecture™
COHERENT BUS (+RAG)
Power
ISA
+RMT
MMU/BIU
+RMT
Power
ISA
+RMT
MMU/BIU
+RMT
IO
transl.
Memory
Plus
Memory
Flow Control (MFC)
MMU/DMA
+RMT
Local Store
Memory
MMU/DMA
+RMT
Local Store
Memory
LS Alias
LS Alias
…
…
…
16
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Architecture is … 64b Power Architecture™+ MFC
COHERENT BUS (+RAG)
Power
ISA
+RMT
MMU/BIU
+RMT
Power
ISA
+RMT
MMU/BIU
+RMT
IO
transl.
Memory
Plus
Synergistic
Processors
MMU/DMA
+RMT
Local Store
Memory
MMU/DMA
+RMT
Local Store
Memory
LS Alias
LS Alias
…
…
…
Syn.
Proc.
ISA
Syn.
Proc.
ISA
17
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell - Attacking the Performance Walls
Multi-Core Non-Homogeneous Architecture
• Control Plane vs. Data Plane processors
• Attacks Power Wall
3-level Model of Memory
• Main Memory, Local Store, Registers
• Attacks Memory Wall
Large Shared Register File & SW Controlled Branching
• Allows deeper pipelines
• Attacks Frequency Wall
Systems and Technology Group
© 2005 IBM Corporation
18
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Frequency Increase vs Power Consumption
0
0.5
1
1.5
2
2.5
3
3.5
0.9 1 1.1 1.2 1.3
Voltage
Realative
Power
Frequency
19
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Power Efficient Architecture and the CellBE
Non-Homogeneous Coherent Multi-Processor
• Data-plane/Control-plane specialization
• More efficient than homogeneous SMP
3-level model of Memory
• Bandwidth without (inefficient) speculation
• High-bandwidth .. Low power
20
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Power Efficient Architecture and the SPE
Power Efficient ISA allows Simple Control
• Single mode architecture
• No cache
• Branch hint
• Large unified register file
• Channel Interface
Efficient Microarchitecture
• Single port local store
• Extensive clock gating
Efficient implementation
• See Cool Chips paper by O. Takahashi et al. and T. Asano et al.
21
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Broadband Engine Components
22
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Processor Components
Power Core
(PPE)
L2 Cache
NCU
Power Processor Element (PPE):
•General Purpose, 64-bit RISC
Processor (PowerPC AS 2.0.2)
•2-Way Hardware Multithreaded
•L1 : 32KB I ; 32KB D
•L2 : 512KB
•Coherent load/store
•VMX
•3+ GHz
•Real-time Control
•Locking L2 Cache & TLB
•Bandwidth Reservation
In the Beginning
– the solitary Power Processor
Custom Designed
– for high frequency, space
and power efficiency
23
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
96 Byte/Cycle 300+GB/sec @ 3.2 GHz
Element Interconnect Bus
Cell Processor Components
Power Core
(PPE)
L2 Cache
NCU
N N
N
MIC
EIB data ring for internal communication
• Four 16 byte data rings, supporting multiple
transfers
• 96B/cycle peak bandwidth
• Over 100 outstanding requests
• 300+ GByte/sec @ 3.2 GHz
24
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
96 Byte/Cycle 300+ GB/sec @ 3.2 GHz
Element Interconnect Bus
Cell Processor Components
Power Core
(PPE)
L2 Cache
NCU
N N
N
MIC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
SPE provides computational performance
• Dual issue, up to 8-way 32-bit SIMD
• Dedicated resources: 128 128-bit RF,
256KB Local Store
• Each DMA MMU can be dynamically
configured to protect resources
• Dedicated DMA engine: Up to 16
outstanding requests
Local Store
SPU MFC
N
AUC
Local Store
SPU MFC
N
AUC
25
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
96 Byte/Cycle 300+ GB/sec @ 3.2GHz
Element Interconnect Bus
Cell Processor Components
Power Core
(PPE)
L2 Cache
NCU
N N
N
MIC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Memory Flow Controller (MFC) and Atomic
Update Cache (AUC)
• 4 line cache for shared memory atomic update
primitives
•High performance DMA unit
•Local Store aliased into PPE system memory
•16 element SPE side DMA Command Queue
• 8 element PPE side DMA Command Queue
•DMA List supports 2K entry scatter/gather
•MFC-MMU controls SPE DMA accesses
•Compatible with PowerPC Virtual Memory
architecture
•Full memory protection for MFC DMA
•S/W controllable from PPE MMIO
•DMA 1,2,4,8,16,128 -> 16Kbyte transfers for I/O
access
26
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Processor Components
IOIF0
20 GB/sec
BIF or IOIF0
IOIF1
Southbridge
I/O
5 GB/sec
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPU MFC
N
AUC
Local Store
SPU MFC
N
AUC
96 Byte/Cycle
Element Interconnect Bus
Power Core
(PPE)
L2 Cache
NCU
Memory Interface Controller (MIC):
• Dual XDRTM
controller (25.6GB/s @ 3.2Gbps)
• ECC support
• Suspend to DRAM support
Broadband Interface Controller (BIC):
Provides a wide connection to external devices
Two configurable interfaces (50+GB/s @ 5Gbps)
– Configurable number of bytes
– Coherent (BIF) and / or
I/O (IOIFx) protocols
Supports two virtual channels per interface
Supports multiple system configurations
MIC
25 GB/sec
XDR DRAM
27
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Processor Components
I/O Bus Master Translation (IOT)
Translates Bus Addresses to System Real Addresses
Two Level Translation
– I/O Segments (256 MB)
– I/O Pages (4K, 64K, 1M, 16M byte)
I/O Device Identifier per page for LPAR
IOST and IOPT Cache – hardware / software managed
IOIF0
20 GB/sec
BIF or IOIF0
MIC
25 GB/sec
XDR DRAM
IOIF1
Southbridge
I/O
5 GB/sec
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPU MFC
N
AUC
Local Store
SPU MFC
N
AUC
96 Byte/Cycle
Element Interconnect Bus
Power Core
(PPE)
L2 Cache
NCU
IIC IOT
Internal Interrupt Controller (IIC)
Handles SPE Interrupts
Handles External Interrupts
– From Coherent Interconnect
– From IOIF0 or IOIF1
Interrupt Priority Level Control
Interrupt Generation Ports for IPI
Duplicated for each PPE hardware thread
28
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
96 Byte/Cycle 256GB/sec @ 4GHz
Element Interconnect Bus
Cell Processor Components
Power Core
(PPE)
L2 Cache
NCU
N N
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
LocalStore
SPUMFC
N
AUC
MIC
IOIF1 IOIF0
Southbridge
I/O
20 GB / sec
BIF or IOIF1
25 GB / sec
XDR DRAM
5 GB/sec
Token Manager – Bandwidth Reservation for shared resources
• Optionally Enabled for RT Tasks or LPAR
• Multiple Resource Allocation Groups
• Generates access tokens at configurable rate for each
allocation group
1 per each memory bank (16 total)
2 for each IOIF (4 total)
• Requestors assigned RAG ID by OS/hypervisor
Each SPE
PPE L2 / NCU
IOIF 0 Bus Master
IOIF 1 Bus Master
• Priority order for using another RAGs unused tokens
• Resource overcommit warning interrupt
IIC IOTTKM
29
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell ProcessorCell Processor
Power Processor Element (PPE):
•General Purpose, 64-bit RISC
Processor (PowerPC AS 2.0)
•2-Way Hardware Multithreaded
•L1 : 32KB I ; 32KB D
•L2 : 512KB
•Coherent load/store
•VMX
•3.2 GHz
Internal Interconnect:
•Coherent ring structure
•300+ GB/s total internal
interconnect bandwidth
•DMA control to/from SPEs
supports >100 outstanding
memory requests
Synergistic Processor Elements
(SPE):
•8 per chip
•128-bit wide SIMD Units
•Integer and Floating Point capable
•256KB Local Store
•Up to 25.6 GF/s per SPE ---
200GF/s total *
External Interconnects:
•25.6 GB/sec BW memory interface
•2 Configurable I/O Interfaces
•Coherent interface (SMP)
•Normal I/O interface (I/O & Graphics)
•Total BW configurable between interfaces
•Up to 35 GB/s out
•Up to 25 GB/s in
Memory Management & Mapping
•SPE Local Store aliased into PPE system memory
•MFC/MMU controls SPE DMA accesses
•Compatible with PowerPC Virtual Memory
architecture
•S/W controllable from PPE MMIO
•Hardware or Software TLB management
•SPE DMA access protected by MFC/MMU
““Supercomputer-on-a-Chip”
* At clock speed of 3.2GHz
Element Interconnect Bus
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
Power Core
(PPE)
L2 Cache
NCU
Local Store
SPU
MFC
N
AUC
Local Store
SPU
MFC
N
AUC
N N
N N
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
20 GB/sec
Coherent
Interconnect
25.6 GB/sec
Memory
Inteface
5 GB/sec
I/O Bus
30
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Highlights
RISC like organization
• 32 bit fixed instructions
• Clean design – unified Register file
User-mode architecture
• No translation/protection within SPU
• DMA is full Power Arch protect/x-late
VMX-like SIMD dataflow
• Broad set of operations (8 / 16 / 32 Byte)
• Graphics SP-Float
• IEEE DP-Float
Unified register file
• 128 entry x 128 bit
256KB Local Store
• Combined I & D
• 16B/cycle L/S bandwidth
• 128B/cycle DMA bandwidth
LS
LS
LS
LS
GPR
FXU ODD
FXU EVN
SFP
DP
CONTROL
CHANNEL
DMA SMM
ATO
SBI
RTB
FWD
14.5mm2 (90nm SOI)
31
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
What is a Synergistic Processor?
(and why is it efficient?)
Local Store “is” large 2nd level register file / private instruction store instead of cache
• Asynchronous transfer (DMA) to shared memory
• Frontal attack on the Memory Wall
Media Unit turned into a Processor
• Unified (large) Register File
• 128 entry x 128 bit
Media & Compute optimized
• One context
• SIMD architecture
LS
LS
LS
LS
GPR
FXU ODD
FXU EVN
SFP
DP
CONTROL
CHANNEL
DMA SMM
ATO
SBI
RTB
FWD
SPU
SMF
32
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Detail
LS
LS
LS
LS
GPR
FXU ODD
FXU EVN
SFP
DP
CONTROL
CHANNEL
DMA SMM
ATO
SBI
RTB
FWD
Synergistic Processor Element (SPE)
User-mode architecture
• No translation/protection
within SPE
• DMA is full PowerPC
protect/xlate
Direct programmer control
• DMA/DMA-list
• Branch hint
VMX-like SIMD dataflow
• Graphics SP-Float
• No saturate arith, some byte
• IEEE DP-Float (BlueGene-
like)
Unified register file
• 128 entry x 128 bit
256KB Local Store
• Combined I & D
• 16B/cycle L/S bandwidth
• 128B/cycle DMA bandwidth
SPE Latencies
• Simple fixed point - 2 cycles*
• Complex fixed point - 4 cycles*
• Load - 6 cycles*
Local store size = 256 KB
• Single-precision (ER) float - 6 cycles*
• Integer multiply - 7 cycles*
• Branch miss - 20 cycles
No penalty if correctly hinted
• DP (IEEE) float - 13 cycles*
Partially pipelined
• Enqueue DMA Command - 20 cycles*
SPU Units:
• Simple (FXU even)
Add/Compare
Rotate
Logical, Count Leading Zero
• Permute (FXU odd)
Permute
Table-lookup
• FPU (Single / Double Precision)
• Control (SCN)
Dual Issue, Load/Store, ECC Handling
• Channel (SSC) – Interface to MFC
• Register File (GPR/FWD)
33
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Coherent Offload Model
DMA into and out of Local Store similar to Power core loads & stores
Governed by Power Architecture page and segment tables for translation and
protection
Shared memory model
• Power architecture compatible addressing
• MMIO capabilities for SPEs
• Local Store is mapped (alias) allowing LS to LS DMA transfers
• DMA equivalents of locking loads & stores
• OS management/virtualization of SPEs
Pre-emptive context switch is supported
34
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
MFC Detail
Memory Flow Control System
•DMA Unit
•LS <-> LS, LS<-> Sys Memory, LS<->
I/O Transfers
•8 PPE-side Command Queue entries
•16 SPU-side Command Queue entries
•MMU similar to PowerPC MMU
•8 SLBs, 256 TLBs
•multiple page sizes
•Software/HW page table walk
•PT/SLB misses interrupt PPE
•Atomic Cache Facility
•4 cache lines for atomic updates
•2 cache lines for cast out/MMU reload
•Up to 16 outstanding DMA requests in BIU
•Resource / Bandwidth Management Tables
•Token Based Bus Access Management
•TLB Locking
Isolation Mode Support (Security Feature)
Hardware enforced “isolation”
• SPE and Local Store not visible (bus or jtag)
• Small LS “untrusted area” for communication area
Secure Boot
• Chip Specific Key
• Decrypt/Authenticate Boot code
“Secure Vault” – Runtime Isolation Support
• Isolate Load Feature
• Isolate Exit Feature
Legend:
Data Bus
Snoop Bus
Control Bus
Xlate Ld/St
MMIO
Local
Store
SPU
DMA Engine DMA
Queue
Atomic
Facility
MMU RMT
Bus I/F Control
SPE
MMIO
35
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Implementation Aspects
36
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Pre-Decode
L1 Instruction Cache
Microcode
SMT Dispatch (Queue)
Decode
Dependency
Issue
Branch Scan
Fetch Control
L2
Interface
VMX/FPU Issue (Queue)
VMX
Load/Store/
Permute
VMX
Arith./Logic Unit
FPU
Load/Store
FPU
Arith/Logic Unit
Load/Store
Unit
Branch
Execution
Unit
Fixed-Point
Unit
FPU CompletionVMX Completion
Completion/Flush
8
4
2 1
2
11 1 1
111
4
Thread A Thread B
Threads alternate
fetch and dispatch
cycles
Thread A
Thread B
Thread A
L1 Data Cache
2
PPE BLOCK DIAGRAM
37
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
ID2 IS1
RF2 EX1 EX2
IC1 IC2 IC3 ID1
RF1 RF2 EX1 EX3 EX4EX2
MC1
IS2
RF1 RF2 EX1 EX2 EX3 EX4
Fixed Point Unit Instruction
PPE PIPELINE FRONT END
PPE PIPELINE BACK END
Microcode
MC2 MC3 MC4
EX5
EX5
... MC9 MC10 MC11
IB2 IS3
EX7
WB
EX6
IB1 ID3
EX3RF1DLY DLY DLY
DLY DLY DLY
EX4 IBZ IC0
Branch Instruction
EX8 WB
IC4
BP1 BP2 BP3 BP4
Load/Store Instruction
Branch Prediction
Instruction Decode and IssueInstruction Cache and Buffer
IC Instruction Cache
IB Instruction Buffer
BP Branch Prediction
MC Microcode
ID Instruction Decode
IS Instruction Issue
DLY Delay Stage
RF Register File Access
EX Execution
WB Write Back
38
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Permute Unit
Load-Store Unit
Floating-Point Unit
Fixed-Point Unit
Branch Unit
Channel Unit
Result Forwarding and Staging
Register File
Local Store
(256kB)
Single Port SRAM
128B Read 128B Write
DMA Unit
Instruction Issue Unit / Instruction Line Buffer
8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle64 Byte/Cycle
On-Chip Coherent Bus
SPE BLOCK DIAGRAM
39
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
EX1 EX3 EX4EX2
SPU PIPELINE FRONT END
SPE PIPELINE BACK END
EX5 EX6
RF1 RF2
Branch Instruction
WB
Load/Store Instruction
IF Instruction Fetch
IB Instruction Buffer
ID Instruction Decode
IS Instruction Issue
RF Register File Access
EX Execution
WB Write Back
IF1 IF2 ID2 IS1IF3 IF4 IF5 ID1 IS2IB2IB1 ID3
EX2
Fixed Point Instruction
WBEX1
Floating Point Instruction
WBEX1
EX2
Permute Instruction
WBEX1
EX3 EX4 EX5 EX6EX2
EX3 EX4
40
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
CellBE Processor
~250M transistors
~235mm2
Top frequency >3GHz
9 cores, 10 threads
> 200+ GFlops (SP) @3.2 GHz
> 20+ GFlops (DP) @3.2 GHz
Up to 25.6GB/s memory B/W
Up to 50+ GB/s I/O B/W
~400M$(US) design investment
41
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Processor Can Support Many Systems
Game console systems
Blades
HDTV
Home media servers
Supercomputers
Cell BE
Processor
XDRtm XDRtm
IOIF0 IOIF1
Cell BE
Processor
XDRtm XDRtm
IOIF BIF
Cell BE
Processor
XDRtm XDRtm
IOIF
Cell BE
Processor
XDRtm XDRtm
IOIF
BIF
Cell BE
Processor
XDRtm XDRtm
IOIF
CellBE
Processor
XDRtm XDRtm
IOIF
BIF
CellBE
Processor
XDRtm XDRtm
IOIFSW
42
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell BE Processor Initial Application Areas
Cell excels at processing of rich media content in the context of broad
connectivity
• Digital content creation (games and movies)
• Game playing and game serving
• Distribution of dynamic, media rich content
• Imaging and image processing
• Image analysis (e.g. video surveillance)
• Next-generation physics-based visualization
• Video conferencing
• Streaming applications (codecs etc.)
• Physical simulation & science
Cell is an excellent match for any applications that require:
• Parallel processing
• Real time processing
• Graphics content creation or rendering
• Pattern matching
• High-performance SIMD capabilities
43
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Programmability
Programmer
Productivity
Raw Hardware
Performance
BE
Design
25 GB/sec Memory Inteface
20 GB/sec Coherent Interconnect 5 GB/sec I/O Bus
Cell Software Overview
256 GB/sec Coherent Ring
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
Power Processor
(PPE)
L2 Cache
NCU
Local Store
SPU
MFC
N
AUC
Local Store
SPU
MFC
N
AUC
N N
N N
Cell
Chip
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
MFC
LocalStore
SPU
N
AUC
Cell Programming Features
Flexible Programming Models
Function Offload Model
Existing Application Accelerator
Parallel Computational Acceleration
Heterogeneous Multi-Threading Runtime
44
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Programming Characteristics
Exploit all of Cell’s 18 asynchronous engines
• Through function offload, or parallel computational tasks
Decompose work into data parallel blocks
• Independent Data parallel tasks are most efficient
Overlap SPE compute with DMA loads and stores
• Multibuffering , pipelining
Reduce PPE/SPE workload ratio
• PPE as Control / OS processor
• SPE as heavy lifting computational engines
Super-Linear speedups over conventional processors may be
achieved
• Through exploitation of asynchronous DMA engines
45
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Features Exploited by Software
Keeping Intermediate/Control Data on-Chip
MMU – ERATS, SLBs, TLBs
DMA from L2 cache-> LS
LS to LS DMA
Cache <-> Cache transfers (atomic update)
SPE Signalling Registers
SPE <-> PPE Mailboxes
Resource Reservation and Allocation
PPE can be shared across logical partitions
SPEs can be assigned to logical partitions
SPEs independently or Group Allocated
Cache Replacement Management
TLB Replacement Management
Bandwidth Reservation
Cell Programming Features
Power
Processor
(PPE)
L2 Cache
MFC
Local Store
SPU
N
AUC
DMA with Intervention
Atomic Update Cache
Local Store to Local Store DMA
Hardware Managed Cache Coherency
Cell Chip
System Memory I/O
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
BIF/IOIF
46
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Subsystem Programming Model
Dedicated Function (problem/privileged subsystem)
Programmer writes/uses SPU "libraries"
Graphics Pipeline
Audio Processing
MPEG Encoding/Decoding
Encryption / Decryption
Main Application in PPE, invokes SPU bound services
RPC Like Function Call
I/O Device Like Interface (FIFO/ Command Queue)
1 or more SPUs cooperating in subsystem
Problem State (Application Allocated)
–Transcoding
–Realtime data transformation & streaming
–Graphics Processing
–MPEG Encoding / Decoding
–Physical Simulation
Privileged State (OS Allocated)
–Encryption / Decryption Services
–Network Packet Filtering / Routing
Code-to-data or data-to-code pipelining possible
Very efficient in real-time data streaming applications
Function Offload Power Core
(PPE)
Sys. Memory/QoS
SPU
Local Store
MFC
N
SPU
Local Store
MFC
N
Multi-stage Pipeline
SPU
Local Store
MFC
N
SPU
Local Store
MFC
N
SPU
Local Store
MFC
N
Parallel-stages
Power Core
(PPE)
Sys. Memory/QoS
47
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Application Specific Acceleration ModelApplication Specific Acceleration Model –– SPC Accelerated SubsystemsSPC Accelerated Subsystems
System Memory Parameter Area
PPE
PowerPC
Application
Cell Aware
OS (Linux)
Compression/
Decompression
Realtime
MPEG
Encoding
TCP
Offload
Data
Encryption
Data
Decryption
Conventional PowerPC App.
with Library Dependencies
OpenGL Encrypt Decrypt Encoding
OS loads 3rd party function libraries
containing PPC wrapped SPU code
PPE part of library code loads
SPU allocated by OS
Loads SPU part of library code
PPC part of library
App calls library function,
PPE library stub
communications to SPEmpeg_encode()
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
Maintains PowerPC programming model
Similiar to IOP but integrated with main
processor
• Supports shared memory programming
• Extremely high on-chip bandwidth
• Scales with “conventional” processor
SPU code provided in:
• Application or middleware libraries
• Operating System Services
Does not require application rewrite
• Specific subsystems targetted
• Separate compilations
• Leverage ELF
• SPUs managed by OS
SPE Programming localized in library
• SPU Code vectorization required
• Private Local Store and DMA model
•Scheduling of code / data movement
• Code debug and fault isolation complexity
Application Specific Accelerators
Decoding
48
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Parallel Computational Acceleration
Single Source Compiler (PPE and SPE targets)
• Auto parallelization ( treat target Cell as an Shared Memory MP )
• Auto SIMD-ization ( SIMD-vectorization ) for PPE VMX and SPE
• Compiler management of Local Store as Software managed cache (I&D)
Optimization Options
• OpenMP-like pragmas
• MPI based Microtasking
• Streaming languages
• Vector.org SIMD intrinsics
• Data/Code partitioning
• Streaming / pre-specifying code/data use
Compiler or Programmer scheduling of DMAs
Compiler use of Local store as soft-cache
IBM Research Prototype Single Source Compiler
• C Frontend
• XLC SPE and XLC PPE back-end
IBM Research Prototype Parallelizing Compiler
• UPC front-end
• Fortran front-end
• XLC SPE backend
256 GB/sec Coherent
Ring
MFC
Local Store
SPU
N
AU
C
MFC
Local Store
SPU
N
AU
C
Power
Processor
(PPE)
L2
Cache
NCU
Local Store
SPU
MFC N
AU
C
Local Store
SPU
MFC
N
AU
C
N N
N N
Broadband
Engine
Chip
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
256 GB/sec Coherent
Ring
MFC
Local Store
SPU
N
AU
C
MFC
Local Store
SPU
N
AU
C
Power
Processor
(PPE)
L2
Cache
NCU
Local Store
SPU
MFC N
AU
C
Local Store
SPU MFC
N
AU
C
N N
N N
Broadband
Engine
Chip
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
MFC
Local
Store
SPU
N
AU
C
XDRtm
XDRtm
I/O
I/O
Tools and Programmer Driven
49
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Application Source
& Libraries
PPE object files SPE object files
SPESPE SPE SPE
Physical SPEs
SPESPE SPE SPE
Cell AwareOS ( Linux)
SPE Virtualization / Scheduling Layer (m->n spe lwp/threads)
New SPE lwp/threadsExisting PPE
lwp/threads
Physical PPE
PPE
MT1 MT2
Operating System Runtime Strategy
Heterogeneous Multi-Threading Model
PPE LWP/Threads
SPU LWP/Threads
SPU DMA EA = PPE Process EA Space
OS supports Create/Destroy SPE tasks
Atomic Update Primitives used for Mutex
SPE Context Fully Managed
Context Save/Restore for Debug
Virtualization Mode (indirect access)
Direct Access Mode (realtime)
OS assignment of SPE threads to SPUs
Programmer directed using affinity
mask
50
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Prototype Cell Extensions to Linux
System Call Interface
exec Loader
File System
Framework
Device
Framework
Network
Framework
Streams
Framework
SPU Management
Framework
Privileged
Kernel
Extensions
Firmware / Hypervisor
ILP32 Processes LP64 EM Processes
Cell Reference System Hardware
32-bit GNU Libs (glibc,etc)
64-bit Linux Kernel
64-bit GNU Libs (glibc)
SPUFS
FilesystemMisc format bin
SPU Object
Loader Extension
ptrace, large page, madvise, SPE error, fault handling, IIC support, IOS
Architecture Specific Code
SPU Allocation, Scheduling
& Dispatch Extension
PPC32 Apps. Cell32 Workloads PPC64 Apps.Cell64 Workloads
std. PPC32
elf interp
SPE Object Loader
Services
SPE Management Runtime
Library (32-bit)
std. PPC64
elf interp
Programming Models Offered: RPC, Device Subsystem, Non-Realtime and Realtime
Asymmetric Threads -- Single SPU, SPU Groups, Shared Memory
SPE Management Runtime
Library (64-bit)
51
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Programmer
Experience
Development
Tools Stack
System Level Simulator
Linux PPC64 with Cell Extensions
SPE Management Lib
Application Libs
Samples
Workloads
Demos
Code Dev Tools
Debug Tools
Performance Tools
Standards:
Language extensions
ABI
Verification Hypervisor
Development
Environment
End-User
Experience
Execution
Environment
Cell Prototype Software Environment
Miscellaneous Tools
52
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Cell Standards
Application Binary Interface Specifications
• Defines such things as data types, register usage, calling conventions, and object formats to ensure
compatibility of code generators and portability of code.
SPE ABI
Linux Cell ABI
SPE C/C++ Language Extensions
• Defines standardized data types, compiler directives, and language intrinsics used to exploit SIMD
capabilities in the core.
• Data types and Intrinsics styled to be similar to Altivec/VMX.
SPE Assembly Language Specification
Standards
53
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
System Level Simulator
Cell BE – full system simulator
• Uni-Cell and multi-Cell simulation
• User Interfaces – TCL and GUI
• Cycle accurate SPU simulation (pipeline mode)
• Pseudo accurate memory and MFC modes
• Emitter facility for tracing and viewing simulation events
Other simulators
• spusim – standalone SPU simulator
Execution Environment
54
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Verification Hypervisor (vHype)
Seamless Integration of Dual
Environments
•Low overhead – small footprint
•Realtime Resource management
•SPE management
•Separate policy manager
•Pre-emptive partition switching on
high priority interrupts
Real time Open Source / Proprietary
OS
Conventional
Network
OS
(Linux)
I/O Hosting
Hypervisor
(Hypervisor - Multi-Partition Resource Allocation / Reservation /Protection)
Hardware
Networked
Apps.
Inter-partition Communications
Logical Partitioning RTOS/Linux
Execution Environment
55
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Linux on Cell
All software in STIDC written on Linux OS
• Started with Linux 2.4 PPC64 on Cell Simulator
SPEs exposed as I/O Devices (function offload model)
SPE DMA required pre-pinned memory
Inflexible programming model
Moved to 2.6.3
• Added heterogenous lwp/thread model – via system call – moved to SPUFS in 2.6.13
SPE thread API created (similar to pthreads library)
User mode direct and indirect SPE access models
Full pre-emptive SPE context management
spe_ptrace() added for gdb support
spe_schedule() for thread to physical SPE assignment
currently FIFO – run to completion
• SPE threads share address space with parent PPE process (through DMA)
Demand paging for SPE accesses
Shared hardware page table with PPE
• SPE Error, Event and Signal handling directed to parent PPE thread
• SPE elf objects wrapped into PPE shared objects with extended gld
SPE-side mini-loader
• madvise() extended for L2 cache and TLB locking/preloading (realtime feature)
• All patches for Cell in architecture dependent layer (subtree of PPC64)
Publishing Initial CellBE Patches for 2.6.13 (Fall 2005 target)
Execution Environment
56
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Management Library
SPEs are exposed as threads
• SPE thread model interface is similar to POSIX threads.
• SPE thread consists of the local store, register file, program counter, and MFC-DMA queue.
• Associated with a single Linux task.
• Features include:
Threads - create, groups, wait, kill, set affinity, set context
Thread Queries - get local store pointer, get problem state pointer, get affinity, get
context
Groups - create, set group defaults, destroy, memory map/unmap, madvise.
Group Queries - get priority, get policy, get threads, get max threads per group, get
events.
SPE image files - opening and closing
SPE Executable
• Standalone SPE program managed by a PPE executive.
• Executive responsible for loading and executing SPE program. It also services “syscall”
requests for I/O (eg, fopen, fwrite, fprintf) and memory requests (eg, mmap, shmat, …).
Execution Environment
57
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Optimized Prototype Libraries
Audio (resampling)
Cryptographic
Fast fourier transform
Game math
Image
Large matrix
Math
Matrix (4x4)
Miscellaneous
Memory management
Multi-precision math
Noise & Turbulence
Oscillator
Parallel programming (MPI like)
Shared memory
SPU plugin
SPU system call
Surfaces / curves
Synchronization
Vector
Execution Environment
58
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Code Development Tools
GNU based binutils
• gas SPE assembler
• gld SPE ELF object linker
gld extensions for embedding SPE object modules in PPE executables
• misc bin utils (ar, nm, ...) targeting SPE modules
• hosted on Linux IA32, Linux PowerPC
GNU based C/C++ compiler targeting SPE
• From STI Partner
• retargeted compiler to SPE
• Supports common SPE Language Extensions and ABI (ELF/Dwarf2) object output
Cell Broadband Engine Optimizing Compiler (IBM Proprietary)
• IBM XLC C/C++ for PowerPC (Tobey)
• IBM XLC C retargeted to SPE assembler (including vector intrinsics) - highly optimizing
• Prototype - XLC Compiler supporting CellBE Programmer Productivity Aids
Single Source compilation using OpenMP like pragmas (PPE and SPE object code generated)
Auto-Vectorization (auto-SIMD) for SPE code
Auto-Parallelization across SPEs
UPC Front end – parallelization across SPEs
Local Store software managed caching model
• Hosted on Linux
• Executables to be available on IBM Alphaworks – Fall 2005
Development Environment
59
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Debug Tools
CellBE system simulator
• Executable availability on AlphaWorks (Fall 2005 target)
GNU gdb
• ptrace and spe_ptrace enabled
• Multi-core Application source level debugger supporting PPE multithreading, SPE
multithreading, interacting PPE and SPE threads
• Three modes of debugging SPU threads
Attach to SPE thread
Launch mode – launch a new debug session for each SPE thread
Pass-thru mode – follow execution into SPE thread
RISCwatch
• Low level hardware (JTAG) debugger
Development Environment
60
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Prototype Performance Tools
pmcount
• Tool to access to HW performance counters
Performance inspector
• Suite of GPL based performance analysis tools extended to support SPE threads
tprof – timer based analysis tool
ptt – per thread time
ai – above idle
post – report generator
a2n – address to name
ctrace
• Branch tracing performance monitor (under development)
Development Environment
61
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Performance Tools
Static analysis (spexlc_timing)
• Annotates assembly source with instruction pipeline state
Dynamic analysis (CellBE System Simulator)
• Generates statistical data on SPE execution
Cycles, instructions, and CPI
Single/Dual issue rates
Stall statistics
Register usage
Instruction histogramming
Development Environment
62
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Miscellaneous Tools – IDL Compiler
Development Environment
63
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Samples / Workloads / Demos
Numerous code samples provided
to demonstrate system design
constructs
Complex workloads and demos
used to evaluate and demonstrate
system performance
Terrain Rendering Engine
Subdivision Surfaces
Physics Simulation
Geometry Engine
Execution Environment
64
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Subsystem Sample – Geometry Engine
OpenGL-like geometry engine
•Geometry processing is offloaded to compile-time configurable SPE
“vertex shader”
•User Queue communication model consisting of 4KB blocks for SPU
command requests with command headers in SPE Mailbox FIFO
PPE
Application
G
E
L
I
B
GE
Shader
User
Queues
PPE SPE
System
Memory
Viewer
Execution Environment
65
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Terrain Rendering Engine Overview
Visualization of Terrain Data Increasingly
Important
•General availability of high resolution satellite
images
•Publicly available USGS Digital Elevation Models
(DEMs)
•Mobile GPS devices (land, sea, air) + wireless
networks
Inferior Current Solutions
•Polygonal Models (low quality, non-Real Time)
•Requires CPU + GPU (Graphics Processing Unit)
Superior CellBE Solution
•Highly Compressed Height Map Models (10x less
data)
•Requires only one CellBE (No GPU)
•High Quality Images (Multi-Sampled Raycast)
•Fast (Real Time Animations)
66
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Earthviewer – Polygon Render
67
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Real-Time Terrain Rendering Engine (TRE)
Cell Optimized Raycaster
Advanced SPUAdvanced SPU shadershader functionfunction
Real time rendering with only one Cell processoreal time rendering with only one Cell processor
No graphics adapter assistNo graphics adapter assist
High definition resolutionHigh definition resolution
Ray/Terrain intersection computationRay/Terrain intersection computation
Texture FilteringTexture Filtering
Normal computationNormal computation
Bump map computationBump map computation
Diffuse + Ambient lighting modelDiffuse + Ambient lighting model
PerlinPerlin Noise based cloudsNoise based clouds
Atmosphere computation (haze, sun, halo)Atmosphere computation (haze, sun, halo)
Dynamic multi-sampling (4Dynamic multi-sampling (4 –– 16 samples per pixel)16 samples per pixel)
Image based input (16 bit height + 16 bit texture)Image based input (16 bit height + 16 bit texture)
47 KB of SPU object code47 KB of SPU object code
MJPEG like compression via SPUMJPEG like compression via SPU
Performance scales linearly with number of availablePerformance scales linearly with number of available SPUsSPUs
Written completely in C withWritten completely in C with intrinsicsintrinsics
Client supportClient support –– OpenGL workstation, wireless PDAOpenGL workstation, wireless PDA
User inputsUser inputs –– graphical, joystick, GPS/accelerometergraphical, joystick, GPS/accelerometer
68
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
3D Surface from 25x25 Height Data
0
1000
2000
3000
4000
5000
6000
7.5 KB for just the Surface
69
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Raw 25x25 Height Data
0
1000
2000
3000
4000
5000
6000
1.25 KB for Height Map (6x Compression)
70
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Ray Casting
71
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Height Color Data Layout (Main Memory)
© 2005 IBM Corporation71
Quad Word (128 bits)Quad Word (128 bits)
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C
H C
H C H C
H C
H C
H C
H C
H C
H C
H C
H C
H C
H C
H C
H C
H C H C
H C H C
H C H C H C H C
H C H C H C H C
H C H C H C
H C H C H C
H C H C
H C
H C
H C
H C
72
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
PPE Data Staging via L2
Main MemoryMain Memory
PPEPPE
L2L2
SPESPE SPESPESPESPE SPESPE SPESPE SPESPESPESPE SPESPE
The L2 has 4 Outstanding Loads + 2 Prefetch
73
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Data Staging
16 Outstanding Loads per SPE
Main MemoryMain Memory
PPEPPE
L2L2
SPESPE SPESPESPESPE SPESPE SPESPE SPESPESPESPE SPESPE
74
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Height Color Data Layout (Main Memory)
© 2005 IBM Corporation74
Quad Word (128 bits)
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
Quad Word (128 bits)
75
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Height Color Data Layout (Local Store)
© 2005 IBM Corporation75
SIMD Register (128 bits)
H C H C H C H C
H C H C H C H C
H C
H C H C H C
H C
H C H C H C
H C H C
H C H C
H C H C H C H C
H C H C H C H C
H C H C H C H C
Shuffle Byte
Quad Word (128 bits)
76
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
TRE Image Pipeline
Time
PPE
SPE 0-6
SPE 7
Prep Frame 0Prep Frame 0 Prep Frame 1Prep Frame 1
Render Frame 0Render Frame 0 Render Frame 1Render Frame 1
Encode Frame 0Encode Frame 0 Encode Frame 1Encode Frame 1
Prep Frame 2Prep Frame 2 Prep Frame 3Prep Frame 3
Render Frame 2Render Frame 2
77
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
TRE SPE Render Pipeline
MFC
Load
RC 0
Load
RC 0
Load HC/Accm 0Load HC/Accm 0 Load HC/Accm 1Load HC/Accm 1
Store
Accm 1
Store
Accm 1
Load HC/Accm 0Load HC/Accm 0
Store
Accm 0
Store
Accm 0
Gen Lists 0Gen Lists 0 Gen Samples 1Gen Samples 1 Gen Lists 0Gen Lists 0Gen Samples 0Gen Samples 0Gen Lists 1Gen Lists 1
Store
Accm 0
Store
Accm 0
Load
RC 0
Load
RC 0
Gen Samples 1Gen Samples 1
Load
RC 1
Load
RC 1
TimeTime
SPU
78
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Ray-casting with SIMD
Ray 1 Ray 4Ray 3Ray 2
Vector (128 bits)
4
3
2
1
79
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
TRE SPE Ray Kernel
Ray IntersectionRay Intersection
Multi - Sample AdjustmentMulti - Sample Adjustment
Shader
Texture Filtering
Normal Generation
Bump Mapping
Lighting
Atmosphere
Shader
Texture Filtering
Normal Generation
Bump Mapping
Lighting
Atmosphere
Sample AccumulationSample Accumulation
80
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
SPE Local Store Memory Layout
Height Color Data
(57 KB)
Height Color Data
(57 KB)
Height Color Data
(57 KB)
Height Color Data
(57 KB)
Accumulation Data
(30 KB)
Accumulation Data
(30 KB)
Accumulation Data
(30 KB)
Accumulation Data
(30 KB)
Accum DMA List (15 KB)Accum DMA List (15 KB) Accum DMA List (15 KB)Accum DMA List (15 KB)
RCRC RCRC ACAC Stack 4 KBStack 4 KB
Code
(29 KB)
Code
(29 KB)
HC DMA List (13 KB)HC DMA List (13 KB) HC DMA List (13 KB)HC DMA List (13 KB)
81
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Chip
82
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Sample SPE Simulator Output
Performance Cycle count 96187613
Performance Instruction count 113483578 (108381008)
Performance CPI 0.85 (0.89)
Branch instructions 1255537
Branch taken 826730
Branch not taken 428807
Hint instructions 507711
Hint hit 809520
Single cycle 59886926 ( 62.3%)
Dual cycle 24247041 ( 25.2%)
Nop cycle 133131 ( 0.1%)
Stall due to branch miss 1024965 ( 1.1%)
Stall due to prefetch miss 22394 ( 0.0%)
Stall due to dependency 9709208 ( 10.1%)
Stall due to fp resource conflict 0 ( 0.0%)
Stall due to waiting for hint target 1163937 ( 1.2%)
Stall due to dp pipeline 0 ( 0.0%)
Channel stall cycle 0 ( 0.0%)
SPU Initialization cycle 9 ( 0.0%)
-----------------------------------------------------------------------
Total cycle 96187611 (100.0%)
The number of used registers are 128, the used ratio is 100.00
83
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Austin
84
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Mount Saint Helens
85
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
TRE 720P Performance
2.0 GHz Apple G5 0.6 frames/sec
• 40% of cycles spent waiting for Memory
3.2 GHz Cell 30.0 frames/sec
• 1% of cycles spent waiting for Memory
Cell has 50x advantage
86
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
Summary
Cell ushers in a new era of leading edge processors optimized for digital
media and entertainment
Desire for realism is driving a convergence between supercomputing and
entertainment
New levels of performance and power efficiency beyond what is achieved by
PC processors
Responsiveness to the human user and the network are key drivers for Cell
Cell will enable entirely new classes of applications, even beyond those we
contemplate today
87
IBM Systems and Technology Group
STI Technology © 2005 IBM Corporation
(c) Copyright International Business Machines Corporation 2005.
All Rights Reserved. Printed in the United Sates September 2005.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.
IBM Microelectronics Division The IBM home page is http://www.ibm.com
1580 Route 52, Bldg. 504 The IBM Microelectronics Division home page is
Hopewell Junction, NY 12533-6351 http://www.chips.ibm.com

More Related Content

What's hot

Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband EngineMichael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband EngineMichael Gschwind
 
Symm configuration management
Symm configuration managementSymm configuration management
Symm configuration management.Gastón. .Bx.
 
Soc - Intro, Design Aspects, HLS, TLM
Soc - Intro, Design Aspects, HLS, TLMSoc - Intro, Design Aspects, HLS, TLM
Soc - Intro, Design Aspects, HLS, TLMSubhash Iyer
 
IBM BladeCenter Fundamentals Introduction
IBM BladeCenter Fundamentals Introduction IBM BladeCenter Fundamentals Introduction
IBM BladeCenter Fundamentals Introduction Dsunte Wilson
 
System On Chip
System On ChipSystem On Chip
System On ChipA B Shinde
 
System-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design ChallengesSystem-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design Challengespboulet
 
Intel i7 Technologies
Intel i7 TechnologiesIntel i7 Technologies
Intel i7 TechnologiesBibhu Biswal
 
System-on-Chip Programmable Retina
System-on-Chip Programmable RetinaSystem-on-Chip Programmable Retina
System-on-Chip Programmable RetinaVanya Valindria
 
System on Chip (SoC) for mobile phones
System on Chip (SoC) for mobile phonesSystem on Chip (SoC) for mobile phones
System on Chip (SoC) for mobile phonesJeffrey Funk
 
POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI Anand Haridass
 
What is system on chip (1)
What is system on chip (1)What is system on chip (1)
What is system on chip (1)Jagadeshgoud
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overviewlambertt
 
SOC - system on a chip
SOC - system on a chipSOC - system on a chip
SOC - system on a chipParth Kavi
 
2123.a better waytoprint.universal print
2123.a better waytoprint.universal print2123.a better waytoprint.universal print
2123.a better waytoprint.universal printSumit Tambe
 
Intel core i7 processors
Intel core i7 processorsIntel core i7 processors
Intel core i7 processorsSelf employed
 

What's hot (20)

Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband EngineMichael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
Michael Gschwind, Chip Multiprocessing and the Cell Broadband Engine
 
Blue Gene Active Storage
Blue Gene Active StorageBlue Gene Active Storage
Blue Gene Active Storage
 
Symm configuration management
Symm configuration managementSymm configuration management
Symm configuration management
 
Soc - Intro, Design Aspects, HLS, TLM
Soc - Intro, Design Aspects, HLS, TLMSoc - Intro, Design Aspects, HLS, TLM
Soc - Intro, Design Aspects, HLS, TLM
 
IBM BladeCenter Fundamentals Introduction
IBM BladeCenter Fundamentals Introduction IBM BladeCenter Fundamentals Introduction
IBM BladeCenter Fundamentals Introduction
 
System On Chip
System On ChipSystem On Chip
System On Chip
 
SOC Design Challenges and Practices
SOC Design Challenges and PracticesSOC Design Challenges and Practices
SOC Design Challenges and Practices
 
Ibm aix
Ibm aixIbm aix
Ibm aix
 
Intel i7
Intel i7Intel i7
Intel i7
 
Soc lect1
Soc lect1Soc lect1
Soc lect1
 
System-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design ChallengesSystem-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design Challenges
 
Intel i7 Technologies
Intel i7 TechnologiesIntel i7 Technologies
Intel i7 Technologies
 
System-on-Chip Programmable Retina
System-on-Chip Programmable RetinaSystem-on-Chip Programmable Retina
System-on-Chip Programmable Retina
 
System on Chip (SoC) for mobile phones
System on Chip (SoC) for mobile phonesSystem on Chip (SoC) for mobile phones
System on Chip (SoC) for mobile phones
 
POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI POWER9 AC922 Newell System - HPC & AI
POWER9 AC922 Newell System - HPC & AI
 
What is system on chip (1)
What is system on chip (1)What is system on chip (1)
What is system on chip (1)
 
Power 7 Overview
Power 7 OverviewPower 7 Overview
Power 7 Overview
 
SOC - system on a chip
SOC - system on a chipSOC - system on a chip
SOC - system on a chip
 
2123.a better waytoprint.universal print
2123.a better waytoprint.universal print2123.a better waytoprint.universal print
2123.a better waytoprint.universal print
 
Intel core i7 processors
Intel core i7 processorsIntel core i7 processors
Intel core i7 processors
 

Viewers also liked

Chapter 1 engine components and classification
Chapter 1   engine components and classificationChapter 1   engine components and classification
Chapter 1 engine components and classificationHafizkamaruddin
 
Six stroke-engine-presenation
Six stroke-engine-presenationSix stroke-engine-presenation
Six stroke-engine-presenationgunjan panchal
 
The Animated 4 Cycle Internal Combustion Engine
The Animated 4 Cycle Internal Combustion EngineThe Animated 4 Cycle Internal Combustion Engine
The Animated 4 Cycle Internal Combustion EngineChris Coggon
 
Cryogenic rocket engine
Cryogenic rocket engineCryogenic rocket engine
Cryogenic rocket engineguestfadacb
 
Green engine - an introduction
Green engine - an introductionGreen engine - an introduction
Green engine - an introductionDIVINE SEBASTIAN
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborSvetlin Nakov
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engineAmanpreet Singh
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Diesel engine power plant
Diesel engine power plantDiesel engine power plant
Diesel engine power plantSURAJ PRASAD
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Fuel Injection and Ignition
Fuel Injection and IgnitionFuel Injection and Ignition
Fuel Injection and Ignitionguest2c6da6
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Dennis Deacon
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandSearch Engine Land
 

Viewers also liked (20)

Chapter 1 engine components and classification
Chapter 1   engine components and classificationChapter 1   engine components and classification
Chapter 1 engine components and classification
 
Six stroke-engine-presenation
Six stroke-engine-presenationSix stroke-engine-presenation
Six stroke-engine-presenation
 
The Animated 4 Cycle Internal Combustion Engine
The Animated 4 Cycle Internal Combustion EngineThe Animated 4 Cycle Internal Combustion Engine
The Animated 4 Cycle Internal Combustion Engine
 
Cryogenic rocket engine
Cryogenic rocket engineCryogenic rocket engine
Cryogenic rocket engine
 
6 stroke engine
6 stroke engine6 stroke engine
6 stroke engine
 
Green engine - an introduction
Green engine - an introductionGreen engine - an introduction
Green engine - an introduction
 
Introduction to Google App Engine
Introduction to Google App EngineIntroduction to Google App Engine
Introduction to Google App Engine
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engine
 
Understanding the Diesel Engine
Understanding the Diesel EngineUnderstanding the Diesel Engine
Understanding the Diesel Engine
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
IC Engine PPt
IC Engine PPtIC Engine PPt
IC Engine PPt
 
Search Engine Demystified
Search Engine DemystifiedSearch Engine Demystified
Search Engine Demystified
 
RoomCloud Booking Engine
RoomCloud Booking EngineRoomCloud Booking Engine
RoomCloud Booking Engine
 
Diesel engine power plant
Diesel engine power plantDiesel engine power plant
Diesel engine power plant
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Fuel Injection and Ignition
Fuel Injection and IgnitionFuel Injection and Ignition
Fuel Injection and Ignition
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
 
Ic engine
Ic engineIc engine
Ic engine
 

Similar to Hardware and Software Architectures for the CELL BROADBAND ENGINE processor

M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...Michael Gschwind
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCSlide_N
 
Enterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technologyEnterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technologysolarisyougood
 
Cell Broadband EngineTM: and Cell/B.E. based blade technology
Cell Broadband EngineTM: and Cell/B.E. based blade technologyCell Broadband EngineTM: and Cell/B.E. based blade technology
Cell Broadband EngineTM: and Cell/B.E. based blade technologySlide_N
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPeter Griffin
 
Xtw01t7v021711 cluster
Xtw01t7v021711 clusterXtw01t7v021711 cluster
Xtw01t7v021711 clusterpgnguyen44
 
mobile processors introduction..
mobile processors introduction..mobile processors introduction..
mobile processors introduction..Muhammad Sayam
 
Connection Machine
Connection MachineConnection Machine
Connection Machinebutest
 
Bladeservertechnology 111018061151-phpapp02
Bladeservertechnology 111018061151-phpapp02Bladeservertechnology 111018061151-phpapp02
Bladeservertechnology 111018061151-phpapp02gov1991
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computePerry Lea
 
Architecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter EfficiencyArchitecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter EfficiencyIntel IT Center
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)Heiko Joerg Schick
 
Visão geral do hardware do servidor System z e Linux on z - Concurso Mainframe
Visão geral do hardware do servidor System z e Linux on z - Concurso MainframeVisão geral do hardware do servidor System z e Linux on z - Concurso Mainframe
Visão geral do hardware do servidor System z e Linux on z - Concurso MainframeAnderson Bassani
 

Similar to Hardware and Software Architectures for the CELL BROADBAND ENGINE processor (20)

M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
 
Enterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technologyEnterprise power systems transition to power7 technology
Enterprise power systems transition to power7 technology
 
Cell Broadband EngineTM: and Cell/B.E. based blade technology
Cell Broadband EngineTM: and Cell/B.E. based blade technologyCell Broadband EngineTM: and Cell/B.E. based blade technology
Cell Broadband EngineTM: and Cell/B.E. based blade technology
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
11136442.ppt
11136442.ppt11136442.ppt
11136442.ppt
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
Xtw01t7v021711 cluster
Xtw01t7v021711 clusterXtw01t7v021711 cluster
Xtw01t7v021711 cluster
 
Computer Evolution
Computer EvolutionComputer Evolution
Computer Evolution
 
mobile processors introduction..
mobile processors introduction..mobile processors introduction..
mobile processors introduction..
 
Connection Machine
Connection MachineConnection Machine
Connection Machine
 
Bladeservertechnology 111018061151-phpapp02
Bladeservertechnology 111018061151-phpapp02Bladeservertechnology 111018061151-phpapp02
Bladeservertechnology 111018061151-phpapp02
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric compute
 
Architecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter EfficiencyArchitecting for Hyper-Scale Datacenter Efficiency
Architecting for Hyper-Scale Datacenter Efficiency
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 
Visão geral do hardware do servidor System z e Linux on z - Concurso Mainframe
Visão geral do hardware do servidor System z e Linux on z - Concurso MainframeVisão geral do hardware do servidor System z e Linux on z - Concurso Mainframe
Visão geral do hardware do servidor System z e Linux on z - Concurso Mainframe
 

More from Slide_N

New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiSlide_N
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSlide_N
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60 Slide_N
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomSlide_N
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignSlide_N
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: TheorySlide_N
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMSlide_N
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Slide_N
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionSlide_N
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerSlide_N
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesSlide_N
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Slide_N
 
MLAA on PS3
MLAA on PS3MLAA on PS3
MLAA on PS3Slide_N
 
SPU gameplay
SPU gameplaySPU gameplay
SPU gameplaySlide_N
 
Insomniac Physics
Insomniac PhysicsInsomniac Physics
Insomniac PhysicsSlide_N
 
SPU Shaders
SPU ShadersSPU Shaders
SPU ShadersSlide_N
 
SPU Physics
SPU PhysicsSPU Physics
SPU PhysicsSlide_N
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Slide_N
 
Practical SPU Programming in God of War III
Practical SPU Programming in God of War IIIPractical SPU Programming in God of War III
Practical SPU Programming in God of War IIISlide_N
 

More from Slide_N (20)

New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - Kutaragi
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - Kutaragi
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living Room
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor Design
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTM
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution Continues
 
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis...
 
MLAA on PS3
MLAA on PS3MLAA on PS3
MLAA on PS3
 
SPU gameplay
SPU gameplaySPU gameplay
SPU gameplay
 
Insomniac Physics
Insomniac PhysicsInsomniac Physics
Insomniac Physics
 
SPU Shaders
SPU ShadersSPU Shaders
SPU Shaders
 
SPU Physics
SPU PhysicsSPU Physics
SPU Physics
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 
Practical SPU Programming in God of War III
Practical SPU Programming in God of War IIIPractical SPU Programming in God of War III
Practical SPU Programming in God of War III
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Hardware and Software Architectures for the CELL BROADBAND ENGINE processor

  • 1. IBM Systems and Technology Group © 2005 IBM Corporation Tutorial Hardware and Software Architectures for the CELL BROADBAND ENGINE processor Michael Day, Peter Hofstee IBM Systems & Technology Group, Austin, Texas CODES+ISSS Conference, September 2005
  • 2. 2 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Agenda Trends in Processors and Systems Introduction to Cell Challenges to processor performance Cell Broadband Engine Architecture (CBEA) Cell Broadband Engine (CBE) Processor Overview Cell Programming Models Prototype Software Environment TRE Demo © 2005 IBM Corporation
  • 3. 3 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Trends in Microprocessors and Systems
  • 4. 4 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Processor Performance over Time (Game processors take the lead on media performance) 10 100 1000 10000 100000 1000000 1993 1995 1996 1998 2000 2002 2004 Game Processors PC Processors Flops (SP)
  • 5. 5 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation System Trends toward Integration Implied loss of system configuration flexibility Must compensate with generality of acceleration function to maintain market size. Processor Southbridge NorthbridgeMemory Accel Next-Gen Processor Memory IO IO
  • 6. 6 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Motivation: Cell Goals Outstanding performance, especially on game/multimedia applications. • Challenges: Memory Wall, Power Wall, Frequency Wall Real time responsiveness to the user and the network. • Challenges: Real-time in an SMP environment, Security Applicable to a wide range of platforms. • Challenge: Maintain programmability while increasing performance Support an introduction in 2005/6. • Challenge: Structure innovation such that 5yr. schedule can be met
  • 7. 7 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Performance Limiters in Conventional Microprocessors Memory Wall • Latency induced bandwidth limitations Power Wall • Must improve efficiency and performance equally Frequency Wall • Diminishing returns from deeper pipelines (can be negative if power is taken into account)
  • 8. 8 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell © 2005 IBM Corporation Systems and Technology Group
  • 9. 9 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell History IBM, SCEI/Sony, Toshiba Alliance formed in 2000 Design Center opened in March 2001 Based in Austin, Texas February 7, 2005: First external technical disclosures • Cell Broadband Engine Architecture documentation can be found at: http://www.ibm.com/developerworks/power/cell • Additional publications on Cell can be downloaded from: http://www.ibm.com/chips/techlib/techlib.nsf/products/Cell • A paper on Cell in the upcoming issue of the IBM Journal of Research and Development can be found at: http://www.research.ibm.com/journal/rd/494/kahle.html Systems and Technology Group © 2005 IBM Corporation
  • 10. 10 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Highlights Supercomputer on a chip Multi-core microprocessor (9 cores) 3.2 GHz clock frequency 10x performance for many applications Digital home to distributed computing Systems and Technology Group © 2005 IBM Corporation
  • 11. 11 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Introducing Cell Sets a new performance standard • Exploits parallelism while achieving high frequency • Supercomputer attributes with extreme floating point capabilities • Sustains high memory bandwidth with smart DMA controllers Designed for natural human interaction • Photo-realistic effects • Predictable real-time response • Virtualized resources for concurrent activities Designed for flexibility • Wide variety of application domains • Highly abstracted to highly exploitable programming models • Reconfigurable I/O interfaces • Autonomic power management
  • 12. 12 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Microprocessor Architecture Trends CPU GPU NIC Security Media … CPU Streaming Graphics Processor Network Processor Security Processor Media Processor … 64b Power Processor Synergistic Processor Synergistic Processor ... Mem. Contr. Config. IO Hardwired Function Programmable ASIC Cell
  • 13. 13 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Key Attributes of Cell Cell is Multi-Core • Contains 64-bit Power Architecture TM • Contains 8 Synergistic Processor Elements (SPE) Cell is a Flexible Architecture • Multi-OS support (including Linux) with Virtualization technology • Path for OS, legacy apps, and software development Cell is a Broadband Architecture • SPE is RISC architecture with SIMD organization and Local Store • 128+ concurrent transactions to memory per processor Cell is a Real-Time Architecture • Resource allocation (for Bandwidth Measurement) • Locking Caches (via Replacement Management Tables) Cell is a Security Enabled Architecture • SPE dynamically reconfigurable as secure processors
  • 14. 14 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Architecture is … COHERENT BUS Power ISA MMU/BIU Power ISA MMU/BIU … IO transl. Memory Incl. coherence/memory compatible with 32/64b Power Arch. Applications and OS’s 64b Power Architecture™
  • 15. 15 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Architecture is … 64b Power Architecture™ COHERENT BUS (+RAG) Power ISA +RMT MMU/BIU +RMT Power ISA +RMT MMU/BIU +RMT IO transl. Memory Plus Memory Flow Control (MFC) MMU/DMA +RMT Local Store Memory MMU/DMA +RMT Local Store Memory LS Alias LS Alias … … …
  • 16. 16 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Architecture is … 64b Power Architecture™+ MFC COHERENT BUS (+RAG) Power ISA +RMT MMU/BIU +RMT Power ISA +RMT MMU/BIU +RMT IO transl. Memory Plus Synergistic Processors MMU/DMA +RMT Local Store Memory MMU/DMA +RMT Local Store Memory LS Alias LS Alias … … … Syn. Proc. ISA Syn. Proc. ISA
  • 17. 17 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell - Attacking the Performance Walls Multi-Core Non-Homogeneous Architecture • Control Plane vs. Data Plane processors • Attacks Power Wall 3-level Model of Memory • Main Memory, Local Store, Registers • Attacks Memory Wall Large Shared Register File & SW Controlled Branching • Allows deeper pipelines • Attacks Frequency Wall Systems and Technology Group © 2005 IBM Corporation
  • 18. 18 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Frequency Increase vs Power Consumption 0 0.5 1 1.5 2 2.5 3 3.5 0.9 1 1.1 1.2 1.3 Voltage Realative Power Frequency
  • 19. 19 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Power Efficient Architecture and the CellBE Non-Homogeneous Coherent Multi-Processor • Data-plane/Control-plane specialization • More efficient than homogeneous SMP 3-level model of Memory • Bandwidth without (inefficient) speculation • High-bandwidth .. Low power
  • 20. 20 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Power Efficient Architecture and the SPE Power Efficient ISA allows Simple Control • Single mode architecture • No cache • Branch hint • Large unified register file • Channel Interface Efficient Microarchitecture • Single port local store • Extensive clock gating Efficient implementation • See Cool Chips paper by O. Takahashi et al. and T. Asano et al.
  • 21. 21 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Broadband Engine Components
  • 22. 22 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Processor Components Power Core (PPE) L2 Cache NCU Power Processor Element (PPE): •General Purpose, 64-bit RISC Processor (PowerPC AS 2.0.2) •2-Way Hardware Multithreaded •L1 : 32KB I ; 32KB D •L2 : 512KB •Coherent load/store •VMX •3+ GHz •Real-time Control •Locking L2 Cache & TLB •Bandwidth Reservation In the Beginning – the solitary Power Processor Custom Designed – for high frequency, space and power efficiency
  • 23. 23 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation 96 Byte/Cycle 300+GB/sec @ 3.2 GHz Element Interconnect Bus Cell Processor Components Power Core (PPE) L2 Cache NCU N N N MIC EIB data ring for internal communication • Four 16 byte data rings, supporting multiple transfers • 96B/cycle peak bandwidth • Over 100 outstanding requests • 300+ GByte/sec @ 3.2 GHz
  • 24. 24 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation 96 Byte/Cycle 300+ GB/sec @ 3.2 GHz Element Interconnect Bus Cell Processor Components Power Core (PPE) L2 Cache NCU N N N MIC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPUMFC N AUC SPE provides computational performance • Dual issue, up to 8-way 32-bit SIMD • Dedicated resources: 128 128-bit RF, 256KB Local Store • Each DMA MMU can be dynamically configured to protect resources • Dedicated DMA engine: Up to 16 outstanding requests Local Store SPU MFC N AUC Local Store SPU MFC N AUC
  • 25. 25 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation 96 Byte/Cycle 300+ GB/sec @ 3.2GHz Element Interconnect Bus Cell Processor Components Power Core (PPE) L2 Cache NCU N N N MIC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPUMFC N AUC Memory Flow Controller (MFC) and Atomic Update Cache (AUC) • 4 line cache for shared memory atomic update primitives •High performance DMA unit •Local Store aliased into PPE system memory •16 element SPE side DMA Command Queue • 8 element PPE side DMA Command Queue •DMA List supports 2K entry scatter/gather •MFC-MMU controls SPE DMA accesses •Compatible with PowerPC Virtual Memory architecture •Full memory protection for MFC DMA •S/W controllable from PPE MMIO •DMA 1,2,4,8,16,128 -> 16Kbyte transfers for I/O access
  • 26. 26 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Processor Components IOIF0 20 GB/sec BIF or IOIF0 IOIF1 Southbridge I/O 5 GB/sec LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPU MFC N AUC Local Store SPU MFC N AUC 96 Byte/Cycle Element Interconnect Bus Power Core (PPE) L2 Cache NCU Memory Interface Controller (MIC): • Dual XDRTM controller (25.6GB/s @ 3.2Gbps) • ECC support • Suspend to DRAM support Broadband Interface Controller (BIC): Provides a wide connection to external devices Two configurable interfaces (50+GB/s @ 5Gbps) – Configurable number of bytes – Coherent (BIF) and / or I/O (IOIFx) protocols Supports two virtual channels per interface Supports multiple system configurations MIC 25 GB/sec XDR DRAM
  • 27. 27 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Processor Components I/O Bus Master Translation (IOT) Translates Bus Addresses to System Real Addresses Two Level Translation – I/O Segments (256 MB) – I/O Pages (4K, 64K, 1M, 16M byte) I/O Device Identifier per page for LPAR IOST and IOPT Cache – hardware / software managed IOIF0 20 GB/sec BIF or IOIF0 MIC 25 GB/sec XDR DRAM IOIF1 Southbridge I/O 5 GB/sec LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPU MFC N AUC Local Store SPU MFC N AUC 96 Byte/Cycle Element Interconnect Bus Power Core (PPE) L2 Cache NCU IIC IOT Internal Interrupt Controller (IIC) Handles SPE Interrupts Handles External Interrupts – From Coherent Interconnect – From IOIF0 or IOIF1 Interrupt Priority Level Control Interrupt Generation Ports for IPI Duplicated for each PPE hardware thread
  • 28. 28 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation 96 Byte/Cycle 256GB/sec @ 4GHz Element Interconnect Bus Cell Processor Components Power Core (PPE) L2 Cache NCU N N LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC Local Store SPUMFC N AUC Local Store SPUMFC N AUC LocalStore SPUMFC N AUC LocalStore SPUMFC N AUC MIC IOIF1 IOIF0 Southbridge I/O 20 GB / sec BIF or IOIF1 25 GB / sec XDR DRAM 5 GB/sec Token Manager – Bandwidth Reservation for shared resources • Optionally Enabled for RT Tasks or LPAR • Multiple Resource Allocation Groups • Generates access tokens at configurable rate for each allocation group 1 per each memory bank (16 total) 2 for each IOIF (4 total) • Requestors assigned RAG ID by OS/hypervisor Each SPE PPE L2 / NCU IOIF 0 Bus Master IOIF 1 Bus Master • Priority order for using another RAGs unused tokens • Resource overcommit warning interrupt IIC IOTTKM
  • 29. 29 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell ProcessorCell Processor Power Processor Element (PPE): •General Purpose, 64-bit RISC Processor (PowerPC AS 2.0) •2-Way Hardware Multithreaded •L1 : 32KB I ; 32KB D •L2 : 512KB •Coherent load/store •VMX •3.2 GHz Internal Interconnect: •Coherent ring structure •300+ GB/s total internal interconnect bandwidth •DMA control to/from SPEs supports >100 outstanding memory requests Synergistic Processor Elements (SPE): •8 per chip •128-bit wide SIMD Units •Integer and Floating Point capable •256KB Local Store •Up to 25.6 GF/s per SPE --- 200GF/s total * External Interconnects: •25.6 GB/sec BW memory interface •2 Configurable I/O Interfaces •Coherent interface (SMP) •Normal I/O interface (I/O & Graphics) •Total BW configurable between interfaces •Up to 35 GB/s out •Up to 25 GB/s in Memory Management & Mapping •SPE Local Store aliased into PPE system memory •MFC/MMU controls SPE DMA accesses •Compatible with PowerPC Virtual Memory architecture •S/W controllable from PPE MMIO •Hardware or Software TLB management •SPE DMA access protected by MFC/MMU ““Supercomputer-on-a-Chip” * At clock speed of 3.2GHz Element Interconnect Bus MFC Local Store SPU N AUC MFC Local Store SPU N AUC Power Core (PPE) L2 Cache NCU Local Store SPU MFC N AUC Local Store SPU MFC N AUC N N N N MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC 20 GB/sec Coherent Interconnect 25.6 GB/sec Memory Inteface 5 GB/sec I/O Bus
  • 30. 30 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Highlights RISC like organization • 32 bit fixed instructions • Clean design – unified Register file User-mode architecture • No translation/protection within SPU • DMA is full Power Arch protect/x-late VMX-like SIMD dataflow • Broad set of operations (8 / 16 / 32 Byte) • Graphics SP-Float • IEEE DP-Float Unified register file • 128 entry x 128 bit 256KB Local Store • Combined I & D • 16B/cycle L/S bandwidth • 128B/cycle DMA bandwidth LS LS LS LS GPR FXU ODD FXU EVN SFP DP CONTROL CHANNEL DMA SMM ATO SBI RTB FWD 14.5mm2 (90nm SOI)
  • 31. 31 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation What is a Synergistic Processor? (and why is it efficient?) Local Store “is” large 2nd level register file / private instruction store instead of cache • Asynchronous transfer (DMA) to shared memory • Frontal attack on the Memory Wall Media Unit turned into a Processor • Unified (large) Register File • 128 entry x 128 bit Media & Compute optimized • One context • SIMD architecture LS LS LS LS GPR FXU ODD FXU EVN SFP DP CONTROL CHANNEL DMA SMM ATO SBI RTB FWD SPU SMF
  • 32. 32 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Detail LS LS LS LS GPR FXU ODD FXU EVN SFP DP CONTROL CHANNEL DMA SMM ATO SBI RTB FWD Synergistic Processor Element (SPE) User-mode architecture • No translation/protection within SPE • DMA is full PowerPC protect/xlate Direct programmer control • DMA/DMA-list • Branch hint VMX-like SIMD dataflow • Graphics SP-Float • No saturate arith, some byte • IEEE DP-Float (BlueGene- like) Unified register file • 128 entry x 128 bit 256KB Local Store • Combined I & D • 16B/cycle L/S bandwidth • 128B/cycle DMA bandwidth SPE Latencies • Simple fixed point - 2 cycles* • Complex fixed point - 4 cycles* • Load - 6 cycles* Local store size = 256 KB • Single-precision (ER) float - 6 cycles* • Integer multiply - 7 cycles* • Branch miss - 20 cycles No penalty if correctly hinted • DP (IEEE) float - 13 cycles* Partially pipelined • Enqueue DMA Command - 20 cycles* SPU Units: • Simple (FXU even) Add/Compare Rotate Logical, Count Leading Zero • Permute (FXU odd) Permute Table-lookup • FPU (Single / Double Precision) • Control (SCN) Dual Issue, Load/Store, ECC Handling • Channel (SSC) – Interface to MFC • Register File (GPR/FWD)
  • 33. 33 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Coherent Offload Model DMA into and out of Local Store similar to Power core loads & stores Governed by Power Architecture page and segment tables for translation and protection Shared memory model • Power architecture compatible addressing • MMIO capabilities for SPEs • Local Store is mapped (alias) allowing LS to LS DMA transfers • DMA equivalents of locking loads & stores • OS management/virtualization of SPEs Pre-emptive context switch is supported
  • 34. 34 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation MFC Detail Memory Flow Control System •DMA Unit •LS <-> LS, LS<-> Sys Memory, LS<-> I/O Transfers •8 PPE-side Command Queue entries •16 SPU-side Command Queue entries •MMU similar to PowerPC MMU •8 SLBs, 256 TLBs •multiple page sizes •Software/HW page table walk •PT/SLB misses interrupt PPE •Atomic Cache Facility •4 cache lines for atomic updates •2 cache lines for cast out/MMU reload •Up to 16 outstanding DMA requests in BIU •Resource / Bandwidth Management Tables •Token Based Bus Access Management •TLB Locking Isolation Mode Support (Security Feature) Hardware enforced “isolation” • SPE and Local Store not visible (bus or jtag) • Small LS “untrusted area” for communication area Secure Boot • Chip Specific Key • Decrypt/Authenticate Boot code “Secure Vault” – Runtime Isolation Support • Isolate Load Feature • Isolate Exit Feature Legend: Data Bus Snoop Bus Control Bus Xlate Ld/St MMIO Local Store SPU DMA Engine DMA Queue Atomic Facility MMU RMT Bus I/F Control SPE MMIO
  • 35. 35 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Implementation Aspects
  • 36. 36 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Pre-Decode L1 Instruction Cache Microcode SMT Dispatch (Queue) Decode Dependency Issue Branch Scan Fetch Control L2 Interface VMX/FPU Issue (Queue) VMX Load/Store/ Permute VMX Arith./Logic Unit FPU Load/Store FPU Arith/Logic Unit Load/Store Unit Branch Execution Unit Fixed-Point Unit FPU CompletionVMX Completion Completion/Flush 8 4 2 1 2 11 1 1 111 4 Thread A Thread B Threads alternate fetch and dispatch cycles Thread A Thread B Thread A L1 Data Cache 2 PPE BLOCK DIAGRAM
  • 37. 37 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation ID2 IS1 RF2 EX1 EX2 IC1 IC2 IC3 ID1 RF1 RF2 EX1 EX3 EX4EX2 MC1 IS2 RF1 RF2 EX1 EX2 EX3 EX4 Fixed Point Unit Instruction PPE PIPELINE FRONT END PPE PIPELINE BACK END Microcode MC2 MC3 MC4 EX5 EX5 ... MC9 MC10 MC11 IB2 IS3 EX7 WB EX6 IB1 ID3 EX3RF1DLY DLY DLY DLY DLY DLY EX4 IBZ IC0 Branch Instruction EX8 WB IC4 BP1 BP2 BP3 BP4 Load/Store Instruction Branch Prediction Instruction Decode and IssueInstruction Cache and Buffer IC Instruction Cache IB Instruction Buffer BP Branch Prediction MC Microcode ID Instruction Decode IS Instruction Issue DLY Delay Stage RF Register File Access EX Execution WB Write Back
  • 38. 38 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Permute Unit Load-Store Unit Floating-Point Unit Fixed-Point Unit Branch Unit Channel Unit Result Forwarding and Staging Register File Local Store (256kB) Single Port SRAM 128B Read 128B Write DMA Unit Instruction Issue Unit / Instruction Line Buffer 8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle64 Byte/Cycle On-Chip Coherent Bus SPE BLOCK DIAGRAM
  • 39. 39 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation EX1 EX3 EX4EX2 SPU PIPELINE FRONT END SPE PIPELINE BACK END EX5 EX6 RF1 RF2 Branch Instruction WB Load/Store Instruction IF Instruction Fetch IB Instruction Buffer ID Instruction Decode IS Instruction Issue RF Register File Access EX Execution WB Write Back IF1 IF2 ID2 IS1IF3 IF4 IF5 ID1 IS2IB2IB1 ID3 EX2 Fixed Point Instruction WBEX1 Floating Point Instruction WBEX1 EX2 Permute Instruction WBEX1 EX3 EX4 EX5 EX6EX2 EX3 EX4
  • 40. 40 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation CellBE Processor ~250M transistors ~235mm2 Top frequency >3GHz 9 cores, 10 threads > 200+ GFlops (SP) @3.2 GHz > 20+ GFlops (DP) @3.2 GHz Up to 25.6GB/s memory B/W Up to 50+ GB/s I/O B/W ~400M$(US) design investment
  • 41. 41 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Processor Can Support Many Systems Game console systems Blades HDTV Home media servers Supercomputers Cell BE Processor XDRtm XDRtm IOIF0 IOIF1 Cell BE Processor XDRtm XDRtm IOIF BIF Cell BE Processor XDRtm XDRtm IOIF Cell BE Processor XDRtm XDRtm IOIF BIF Cell BE Processor XDRtm XDRtm IOIF CellBE Processor XDRtm XDRtm IOIF BIF CellBE Processor XDRtm XDRtm IOIFSW
  • 42. 42 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell BE Processor Initial Application Areas Cell excels at processing of rich media content in the context of broad connectivity • Digital content creation (games and movies) • Game playing and game serving • Distribution of dynamic, media rich content • Imaging and image processing • Image analysis (e.g. video surveillance) • Next-generation physics-based visualization • Video conferencing • Streaming applications (codecs etc.) • Physical simulation & science Cell is an excellent match for any applications that require: • Parallel processing • Real time processing • Graphics content creation or rendering • Pattern matching • High-performance SIMD capabilities
  • 43. 43 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Programmability Programmer Productivity Raw Hardware Performance BE Design 25 GB/sec Memory Inteface 20 GB/sec Coherent Interconnect 5 GB/sec I/O Bus Cell Software Overview 256 GB/sec Coherent Ring MFC Local Store SPU N AUC MFC Local Store SPU N AUC Power Processor (PPE) L2 Cache NCU Local Store SPU MFC N AUC Local Store SPU MFC N AUC N N N N Cell Chip MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC MFC LocalStore SPU N AUC Cell Programming Features Flexible Programming Models Function Offload Model Existing Application Accelerator Parallel Computational Acceleration Heterogeneous Multi-Threading Runtime
  • 44. 44 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Programming Characteristics Exploit all of Cell’s 18 asynchronous engines • Through function offload, or parallel computational tasks Decompose work into data parallel blocks • Independent Data parallel tasks are most efficient Overlap SPE compute with DMA loads and stores • Multibuffering , pipelining Reduce PPE/SPE workload ratio • PPE as Control / OS processor • SPE as heavy lifting computational engines Super-Linear speedups over conventional processors may be achieved • Through exploitation of asynchronous DMA engines
  • 45. 45 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Features Exploited by Software Keeping Intermediate/Control Data on-Chip MMU – ERATS, SLBs, TLBs DMA from L2 cache-> LS LS to LS DMA Cache <-> Cache transfers (atomic update) SPE Signalling Registers SPE <-> PPE Mailboxes Resource Reservation and Allocation PPE can be shared across logical partitions SPEs can be assigned to logical partitions SPEs independently or Group Allocated Cache Replacement Management TLB Replacement Management Bandwidth Reservation Cell Programming Features Power Processor (PPE) L2 Cache MFC Local Store SPU N AUC DMA with Intervention Atomic Update Cache Local Store to Local Store DMA Hardware Managed Cache Coherency Cell Chip System Memory I/O MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC BIF/IOIF
  • 46. 46 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Subsystem Programming Model Dedicated Function (problem/privileged subsystem) Programmer writes/uses SPU "libraries" Graphics Pipeline Audio Processing MPEG Encoding/Decoding Encryption / Decryption Main Application in PPE, invokes SPU bound services RPC Like Function Call I/O Device Like Interface (FIFO/ Command Queue) 1 or more SPUs cooperating in subsystem Problem State (Application Allocated) –Transcoding –Realtime data transformation & streaming –Graphics Processing –MPEG Encoding / Decoding –Physical Simulation Privileged State (OS Allocated) –Encryption / Decryption Services –Network Packet Filtering / Routing Code-to-data or data-to-code pipelining possible Very efficient in real-time data streaming applications Function Offload Power Core (PPE) Sys. Memory/QoS SPU Local Store MFC N SPU Local Store MFC N Multi-stage Pipeline SPU Local Store MFC N SPU Local Store MFC N SPU Local Store MFC N Parallel-stages Power Core (PPE) Sys. Memory/QoS
  • 47. 47 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Application Specific Acceleration ModelApplication Specific Acceleration Model –– SPC Accelerated SubsystemsSPC Accelerated Subsystems System Memory Parameter Area PPE PowerPC Application Cell Aware OS (Linux) Compression/ Decompression Realtime MPEG Encoding TCP Offload Data Encryption Data Decryption Conventional PowerPC App. with Library Dependencies OpenGL Encrypt Decrypt Encoding OS loads 3rd party function libraries containing PPC wrapped SPU code PPE part of library code loads SPU allocated by OS Loads SPU part of library code PPC part of library App calls library function, PPE library stub communications to SPEmpeg_encode() MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC MFC Local Store SPU N AUC Maintains PowerPC programming model Similiar to IOP but integrated with main processor • Supports shared memory programming • Extremely high on-chip bandwidth • Scales with “conventional” processor SPU code provided in: • Application or middleware libraries • Operating System Services Does not require application rewrite • Specific subsystems targetted • Separate compilations • Leverage ELF • SPUs managed by OS SPE Programming localized in library • SPU Code vectorization required • Private Local Store and DMA model •Scheduling of code / data movement • Code debug and fault isolation complexity Application Specific Accelerators Decoding
  • 48. 48 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Parallel Computational Acceleration Single Source Compiler (PPE and SPE targets) • Auto parallelization ( treat target Cell as an Shared Memory MP ) • Auto SIMD-ization ( SIMD-vectorization ) for PPE VMX and SPE • Compiler management of Local Store as Software managed cache (I&D) Optimization Options • OpenMP-like pragmas • MPI based Microtasking • Streaming languages • Vector.org SIMD intrinsics • Data/Code partitioning • Streaming / pre-specifying code/data use Compiler or Programmer scheduling of DMAs Compiler use of Local store as soft-cache IBM Research Prototype Single Source Compiler • C Frontend • XLC SPE and XLC PPE back-end IBM Research Prototype Parallelizing Compiler • UPC front-end • Fortran front-end • XLC SPE backend 256 GB/sec Coherent Ring MFC Local Store SPU N AU C MFC Local Store SPU N AU C Power Processor (PPE) L2 Cache NCU Local Store SPU MFC N AU C Local Store SPU MFC N AU C N N N N Broadband Engine Chip MFC Local Store SPU N AU C MFC Local Store SPU N AU C MFC Local Store SPU N AU C MFC Local Store SPU N AU C 256 GB/sec Coherent Ring MFC Local Store SPU N AU C MFC Local Store SPU N AU C Power Processor (PPE) L2 Cache NCU Local Store SPU MFC N AU C Local Store SPU MFC N AU C N N N N Broadband Engine Chip MFC Local Store SPU N AU C MFC Local Store SPU N AU C MFC Local Store SPU N AU C MFC Local Store SPU N AU C XDRtm XDRtm I/O I/O Tools and Programmer Driven
  • 49. 49 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Application Source & Libraries PPE object files SPE object files SPESPE SPE SPE Physical SPEs SPESPE SPE SPE Cell AwareOS ( Linux) SPE Virtualization / Scheduling Layer (m->n spe lwp/threads) New SPE lwp/threadsExisting PPE lwp/threads Physical PPE PPE MT1 MT2 Operating System Runtime Strategy Heterogeneous Multi-Threading Model PPE LWP/Threads SPU LWP/Threads SPU DMA EA = PPE Process EA Space OS supports Create/Destroy SPE tasks Atomic Update Primitives used for Mutex SPE Context Fully Managed Context Save/Restore for Debug Virtualization Mode (indirect access) Direct Access Mode (realtime) OS assignment of SPE threads to SPUs Programmer directed using affinity mask
  • 50. 50 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Prototype Cell Extensions to Linux System Call Interface exec Loader File System Framework Device Framework Network Framework Streams Framework SPU Management Framework Privileged Kernel Extensions Firmware / Hypervisor ILP32 Processes LP64 EM Processes Cell Reference System Hardware 32-bit GNU Libs (glibc,etc) 64-bit Linux Kernel 64-bit GNU Libs (glibc) SPUFS FilesystemMisc format bin SPU Object Loader Extension ptrace, large page, madvise, SPE error, fault handling, IIC support, IOS Architecture Specific Code SPU Allocation, Scheduling & Dispatch Extension PPC32 Apps. Cell32 Workloads PPC64 Apps.Cell64 Workloads std. PPC32 elf interp SPE Object Loader Services SPE Management Runtime Library (32-bit) std. PPC64 elf interp Programming Models Offered: RPC, Device Subsystem, Non-Realtime and Realtime Asymmetric Threads -- Single SPU, SPU Groups, Shared Memory SPE Management Runtime Library (64-bit)
  • 51. 51 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Programmer Experience Development Tools Stack System Level Simulator Linux PPC64 with Cell Extensions SPE Management Lib Application Libs Samples Workloads Demos Code Dev Tools Debug Tools Performance Tools Standards: Language extensions ABI Verification Hypervisor Development Environment End-User Experience Execution Environment Cell Prototype Software Environment Miscellaneous Tools
  • 52. 52 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Cell Standards Application Binary Interface Specifications • Defines such things as data types, register usage, calling conventions, and object formats to ensure compatibility of code generators and portability of code. SPE ABI Linux Cell ABI SPE C/C++ Language Extensions • Defines standardized data types, compiler directives, and language intrinsics used to exploit SIMD capabilities in the core. • Data types and Intrinsics styled to be similar to Altivec/VMX. SPE Assembly Language Specification Standards
  • 53. 53 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation System Level Simulator Cell BE – full system simulator • Uni-Cell and multi-Cell simulation • User Interfaces – TCL and GUI • Cycle accurate SPU simulation (pipeline mode) • Pseudo accurate memory and MFC modes • Emitter facility for tracing and viewing simulation events Other simulators • spusim – standalone SPU simulator Execution Environment
  • 54. 54 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Verification Hypervisor (vHype) Seamless Integration of Dual Environments •Low overhead – small footprint •Realtime Resource management •SPE management •Separate policy manager •Pre-emptive partition switching on high priority interrupts Real time Open Source / Proprietary OS Conventional Network OS (Linux) I/O Hosting Hypervisor (Hypervisor - Multi-Partition Resource Allocation / Reservation /Protection) Hardware Networked Apps. Inter-partition Communications Logical Partitioning RTOS/Linux Execution Environment
  • 55. 55 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Linux on Cell All software in STIDC written on Linux OS • Started with Linux 2.4 PPC64 on Cell Simulator SPEs exposed as I/O Devices (function offload model) SPE DMA required pre-pinned memory Inflexible programming model Moved to 2.6.3 • Added heterogenous lwp/thread model – via system call – moved to SPUFS in 2.6.13 SPE thread API created (similar to pthreads library) User mode direct and indirect SPE access models Full pre-emptive SPE context management spe_ptrace() added for gdb support spe_schedule() for thread to physical SPE assignment currently FIFO – run to completion • SPE threads share address space with parent PPE process (through DMA) Demand paging for SPE accesses Shared hardware page table with PPE • SPE Error, Event and Signal handling directed to parent PPE thread • SPE elf objects wrapped into PPE shared objects with extended gld SPE-side mini-loader • madvise() extended for L2 cache and TLB locking/preloading (realtime feature) • All patches for Cell in architecture dependent layer (subtree of PPC64) Publishing Initial CellBE Patches for 2.6.13 (Fall 2005 target) Execution Environment
  • 56. 56 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Management Library SPEs are exposed as threads • SPE thread model interface is similar to POSIX threads. • SPE thread consists of the local store, register file, program counter, and MFC-DMA queue. • Associated with a single Linux task. • Features include: Threads - create, groups, wait, kill, set affinity, set context Thread Queries - get local store pointer, get problem state pointer, get affinity, get context Groups - create, set group defaults, destroy, memory map/unmap, madvise. Group Queries - get priority, get policy, get threads, get max threads per group, get events. SPE image files - opening and closing SPE Executable • Standalone SPE program managed by a PPE executive. • Executive responsible for loading and executing SPE program. It also services “syscall” requests for I/O (eg, fopen, fwrite, fprintf) and memory requests (eg, mmap, shmat, …). Execution Environment
  • 57. 57 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Optimized Prototype Libraries Audio (resampling) Cryptographic Fast fourier transform Game math Image Large matrix Math Matrix (4x4) Miscellaneous Memory management Multi-precision math Noise & Turbulence Oscillator Parallel programming (MPI like) Shared memory SPU plugin SPU system call Surfaces / curves Synchronization Vector Execution Environment
  • 58. 58 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Code Development Tools GNU based binutils • gas SPE assembler • gld SPE ELF object linker gld extensions for embedding SPE object modules in PPE executables • misc bin utils (ar, nm, ...) targeting SPE modules • hosted on Linux IA32, Linux PowerPC GNU based C/C++ compiler targeting SPE • From STI Partner • retargeted compiler to SPE • Supports common SPE Language Extensions and ABI (ELF/Dwarf2) object output Cell Broadband Engine Optimizing Compiler (IBM Proprietary) • IBM XLC C/C++ for PowerPC (Tobey) • IBM XLC C retargeted to SPE assembler (including vector intrinsics) - highly optimizing • Prototype - XLC Compiler supporting CellBE Programmer Productivity Aids Single Source compilation using OpenMP like pragmas (PPE and SPE object code generated) Auto-Vectorization (auto-SIMD) for SPE code Auto-Parallelization across SPEs UPC Front end – parallelization across SPEs Local Store software managed caching model • Hosted on Linux • Executables to be available on IBM Alphaworks – Fall 2005 Development Environment
  • 59. 59 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Debug Tools CellBE system simulator • Executable availability on AlphaWorks (Fall 2005 target) GNU gdb • ptrace and spe_ptrace enabled • Multi-core Application source level debugger supporting PPE multithreading, SPE multithreading, interacting PPE and SPE threads • Three modes of debugging SPU threads Attach to SPE thread Launch mode – launch a new debug session for each SPE thread Pass-thru mode – follow execution into SPE thread RISCwatch • Low level hardware (JTAG) debugger Development Environment
  • 60. 60 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Prototype Performance Tools pmcount • Tool to access to HW performance counters Performance inspector • Suite of GPL based performance analysis tools extended to support SPE threads tprof – timer based analysis tool ptt – per thread time ai – above idle post – report generator a2n – address to name ctrace • Branch tracing performance monitor (under development) Development Environment
  • 61. 61 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Performance Tools Static analysis (spexlc_timing) • Annotates assembly source with instruction pipeline state Dynamic analysis (CellBE System Simulator) • Generates statistical data on SPE execution Cycles, instructions, and CPI Single/Dual issue rates Stall statistics Register usage Instruction histogramming Development Environment
  • 62. 62 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Miscellaneous Tools – IDL Compiler Development Environment
  • 63. 63 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Samples / Workloads / Demos Numerous code samples provided to demonstrate system design constructs Complex workloads and demos used to evaluate and demonstrate system performance Terrain Rendering Engine Subdivision Surfaces Physics Simulation Geometry Engine Execution Environment
  • 64. 64 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Subsystem Sample – Geometry Engine OpenGL-like geometry engine •Geometry processing is offloaded to compile-time configurable SPE “vertex shader” •User Queue communication model consisting of 4KB blocks for SPU command requests with command headers in SPE Mailbox FIFO PPE Application G E L I B GE Shader User Queues PPE SPE System Memory Viewer Execution Environment
  • 65. 65 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Terrain Rendering Engine Overview Visualization of Terrain Data Increasingly Important •General availability of high resolution satellite images •Publicly available USGS Digital Elevation Models (DEMs) •Mobile GPS devices (land, sea, air) + wireless networks Inferior Current Solutions •Polygonal Models (low quality, non-Real Time) •Requires CPU + GPU (Graphics Processing Unit) Superior CellBE Solution •Highly Compressed Height Map Models (10x less data) •Requires only one CellBE (No GPU) •High Quality Images (Multi-Sampled Raycast) •Fast (Real Time Animations)
  • 66. 66 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Earthviewer – Polygon Render
  • 67. 67 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Real-Time Terrain Rendering Engine (TRE) Cell Optimized Raycaster Advanced SPUAdvanced SPU shadershader functionfunction Real time rendering with only one Cell processoreal time rendering with only one Cell processor No graphics adapter assistNo graphics adapter assist High definition resolutionHigh definition resolution Ray/Terrain intersection computationRay/Terrain intersection computation Texture FilteringTexture Filtering Normal computationNormal computation Bump map computationBump map computation Diffuse + Ambient lighting modelDiffuse + Ambient lighting model PerlinPerlin Noise based cloudsNoise based clouds Atmosphere computation (haze, sun, halo)Atmosphere computation (haze, sun, halo) Dynamic multi-sampling (4Dynamic multi-sampling (4 –– 16 samples per pixel)16 samples per pixel) Image based input (16 bit height + 16 bit texture)Image based input (16 bit height + 16 bit texture) 47 KB of SPU object code47 KB of SPU object code MJPEG like compression via SPUMJPEG like compression via SPU Performance scales linearly with number of availablePerformance scales linearly with number of available SPUsSPUs Written completely in C withWritten completely in C with intrinsicsintrinsics Client supportClient support –– OpenGL workstation, wireless PDAOpenGL workstation, wireless PDA User inputsUser inputs –– graphical, joystick, GPS/accelerometergraphical, joystick, GPS/accelerometer
  • 68. 68 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation 3D Surface from 25x25 Height Data 0 1000 2000 3000 4000 5000 6000 7.5 KB for just the Surface
  • 69. 69 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Raw 25x25 Height Data 0 1000 2000 3000 4000 5000 6000 1.25 KB for Height Map (6x Compression)
  • 70. 70 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Ray Casting
  • 71. 71 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Height Color Data Layout (Main Memory) © 2005 IBM Corporation71 Quad Word (128 bits)Quad Word (128 bits) H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C
  • 72. 72 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation PPE Data Staging via L2 Main MemoryMain Memory PPEPPE L2L2 SPESPE SPESPESPESPE SPESPE SPESPE SPESPESPESPE SPESPE The L2 has 4 Outstanding Loads + 2 Prefetch
  • 73. 73 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Data Staging 16 Outstanding Loads per SPE Main MemoryMain Memory PPEPPE L2L2 SPESPE SPESPESPESPE SPESPE SPESPE SPESPESPESPE SPESPE
  • 74. 74 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Height Color Data Layout (Main Memory) © 2005 IBM Corporation74 Quad Word (128 bits) H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C Quad Word (128 bits)
  • 75. 75 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Height Color Data Layout (Local Store) © 2005 IBM Corporation75 SIMD Register (128 bits) H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C H C Shuffle Byte Quad Word (128 bits)
  • 76. 76 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation TRE Image Pipeline Time PPE SPE 0-6 SPE 7 Prep Frame 0Prep Frame 0 Prep Frame 1Prep Frame 1 Render Frame 0Render Frame 0 Render Frame 1Render Frame 1 Encode Frame 0Encode Frame 0 Encode Frame 1Encode Frame 1 Prep Frame 2Prep Frame 2 Prep Frame 3Prep Frame 3 Render Frame 2Render Frame 2
  • 77. 77 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation TRE SPE Render Pipeline MFC Load RC 0 Load RC 0 Load HC/Accm 0Load HC/Accm 0 Load HC/Accm 1Load HC/Accm 1 Store Accm 1 Store Accm 1 Load HC/Accm 0Load HC/Accm 0 Store Accm 0 Store Accm 0 Gen Lists 0Gen Lists 0 Gen Samples 1Gen Samples 1 Gen Lists 0Gen Lists 0Gen Samples 0Gen Samples 0Gen Lists 1Gen Lists 1 Store Accm 0 Store Accm 0 Load RC 0 Load RC 0 Gen Samples 1Gen Samples 1 Load RC 1 Load RC 1 TimeTime SPU
  • 78. 78 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Ray-casting with SIMD Ray 1 Ray 4Ray 3Ray 2 Vector (128 bits) 4 3 2 1
  • 79. 79 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation TRE SPE Ray Kernel Ray IntersectionRay Intersection Multi - Sample AdjustmentMulti - Sample Adjustment Shader Texture Filtering Normal Generation Bump Mapping Lighting Atmosphere Shader Texture Filtering Normal Generation Bump Mapping Lighting Atmosphere Sample AccumulationSample Accumulation
  • 80. 80 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation SPE Local Store Memory Layout Height Color Data (57 KB) Height Color Data (57 KB) Height Color Data (57 KB) Height Color Data (57 KB) Accumulation Data (30 KB) Accumulation Data (30 KB) Accumulation Data (30 KB) Accumulation Data (30 KB) Accum DMA List (15 KB)Accum DMA List (15 KB) Accum DMA List (15 KB)Accum DMA List (15 KB) RCRC RCRC ACAC Stack 4 KBStack 4 KB Code (29 KB) Code (29 KB) HC DMA List (13 KB)HC DMA List (13 KB) HC DMA List (13 KB)HC DMA List (13 KB)
  • 81. 81 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Chip
  • 82. 82 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Sample SPE Simulator Output Performance Cycle count 96187613 Performance Instruction count 113483578 (108381008) Performance CPI 0.85 (0.89) Branch instructions 1255537 Branch taken 826730 Branch not taken 428807 Hint instructions 507711 Hint hit 809520 Single cycle 59886926 ( 62.3%) Dual cycle 24247041 ( 25.2%) Nop cycle 133131 ( 0.1%) Stall due to branch miss 1024965 ( 1.1%) Stall due to prefetch miss 22394 ( 0.0%) Stall due to dependency 9709208 ( 10.1%) Stall due to fp resource conflict 0 ( 0.0%) Stall due to waiting for hint target 1163937 ( 1.2%) Stall due to dp pipeline 0 ( 0.0%) Channel stall cycle 0 ( 0.0%) SPU Initialization cycle 9 ( 0.0%) ----------------------------------------------------------------------- Total cycle 96187611 (100.0%) The number of used registers are 128, the used ratio is 100.00
  • 83. 83 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Austin
  • 84. 84 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Mount Saint Helens
  • 85. 85 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation TRE 720P Performance 2.0 GHz Apple G5 0.6 frames/sec • 40% of cycles spent waiting for Memory 3.2 GHz Cell 30.0 frames/sec • 1% of cycles spent waiting for Memory Cell has 50x advantage
  • 86. 86 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation Summary Cell ushers in a new era of leading edge processors optimized for digital media and entertainment Desire for realism is driving a convergence between supercomputing and entertainment New levels of performance and power efficiency beyond what is achieved by PC processors Responsiveness to the human user and the network are key drivers for Cell Cell will enable entirely new classes of applications, even beyond those we contemplate today
  • 87. 87 IBM Systems and Technology Group STI Technology © 2005 IBM Corporation (c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United Sates September 2005. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. IBM Microelectronics Division The IBM home page is http://www.ibm.com 1580 Route 52, Bldg. 504 The IBM Microelectronics Division home page is Hopewell Junction, NY 12533-6351 http://www.chips.ibm.com