Computer Architecture – An Introduction
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture: A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Walls
 Classes of computers
 Instruction set architecture
 Trends
 Technology
 Power & energy
 Cost
 Principles of computer design
Single Processor Performance
[Figure: single-processor performance over time, annotated with the RISC era and the move to multi-processor.]
Why Such Rapid Change?
 Performance improvements
 Improvements in semiconductor technology
 Clock speed, feature size
 Improvements in computer architectures
 High-level language compilers, UNIX
 Lead to RISC architectures
 Lower costs
 Simpler development
 Higher volumes
 Lower margins
 Function
 Rise of networking & interconnection technology
Today’s Status
Moore’s Law – number of transistors on a chip tends to double about every 2 years
Transistor count still rising
Clock speed flattening sharply
Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
Clock Speed vs. Power
 Intel 80386 consumed ~2 W
 3.3 GHz Intel Core i7 consumes 130 W
 Heat must be dissipated from a 1.5 cm × 1.5 cm chip
 Limits what can be cooled by air
Conventional Wisdom in Question
 Conventional Wisdom – Power is free, transistors are expensive
 Today – Power is expensive, transistors are free
 Power wall
 Can put more on chip than can afford to turn on
 Conventional Wisdom – Increase Instruction Level Parallelism (ILP) via compilers, innovation
 Out-of-order, speculation, VLIW
 Today – Law of diminishing returns on more hardware for ILP
 ILP wall
Conventional Wisdom in Question (Cont.)
 Conventional Wisdom – Multiplies are slow, memory access is fast
 Today – Memory is slow, multiplies are fast
 Memory wall
 200 clock cycles to access DRAM, 4 clock cycles to multiply
 Conventional Wisdom – Uniprocessor performance 2× / 1.5 years
 Today – Power Wall + ILP Wall + Memory Wall = Brick Wall
 Multi-cores
 Simpler processors are more power efficient
Current Trends in Architecture
 Can’t continue to leverage ILP
 Uniprocessor performance improvement ended in 2003
 New models for performance
 Data-level parallelism (DLP)
 Thread-level parallelism (TLP)
 Request-level parallelism (RLP)
 These require explicit restructuring of applications
Parallelism
Parallelism (Cont.)
 Classes of parallelism in applications
 Data-Level Parallelism (DLP)
 Task-Level Parallelism (TLP)
 Classes of architectural parallelism
 Instruction-Level Parallelism (ILP)
 Exploits DLP in pipelining & speculative execution
 Vector architectures/Graphics Processing Units (GPUs)
 Exploit DLP by applying same instruction on many data items
 Thread-Level Parallelism
 Exploit DLP & TLP in cooperative processing by threads
 Request-Level Parallelism
 Parallel execution of tasks that are independent
Flynn’s Taxonomy
 Single instruction stream, single data stream (SISD)
 Normal sequential programs
 Uniprocessor
 Single instruction stream, multiple data streams (SIMD)
 Data parallelism
 Vector architectures
 Multimedia extensions (Intel MMX)
 Graphics Processor Units (GPUs)
 Multiple instruction streams, single data stream (MISD)
 No commercial implementation
 Fault-tolerant schemes
 Multiple instruction streams, multiple data streams (MIMD)
 Most parallel programs
 Multi-core
Classes of Computers & Performance Metrics
Want to achieve these performance metrics?
Then you need to understand & design based on
principles of computer architecture
Classes of Computers
 Personal Mobile Device (PMD)
 Smart phones & tablets
 Emphasis is on energy efficiency, cost, responsiveness, & multimedia performance
 Desktop Computing
 Desktops, netbooks, & laptops
 Emphasis is on price-performance, energy, & graphics performance
 Servers
 Emphasis is on availability, scalability, throughput, & energy
Classes of Computers (Cont.)
 Clusters / Warehouse Scale Computers
 Used for “Software as a Service (SaaS)”
 Emphasis on availability, price-performance, throughput, & energy
 Sub-class – Supercomputers
 Emphasis – floating-point performance & fast internal networks
 Embedded Computers
 Emphasis on price, power, size, application-specific performance
Terminology
Computer Design
 Maps a given organization to a logic design, logic design to a silicon layout, & chip packaging
 View of hardware designer
 Design decisions based on constraints like circuit-level delays, silicon real estate, heat generation, & cost
 e.g., Intel Core i7-6800K vs. Xeon E5-2643 v4
Computer Organization
 Internal details of operational units, their interconnection, & control
 View of a computer designer
 How to support multiplication – multiply circuit or repeated addition
 e.g., Intel & AMD both support x86 with different organizations
Computer Architecture
 Blueprint/plan that is visible to programmer
 Key functional units, their interconnection, & instructions to program them
 Instruction Set Architecture (ISA)
 e.g., x86 vs. ARM
Blocks of a Microprocessor
[Figure: block diagram of a microprocessor, divided into a program execution section and a register processing section. Labeled blocks: program memory, instruction register (operation, address, and literal fields), instruction decoder, program counter, stack, timing/control and register selection, accumulator, ALU, RAM & data registers, flag & special function registers, I/O, clock, reset, interrupts, and an internal data bus.]
Source: Makis Malliris & Sabir Ghauri, UWE
Uniprocessor – Internal Structure
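[Figure: internal structure of a uniprocessor. Registers A–E, ALU, flag register, instruction register (IR), program counter (PC) with a +1 incrementer, and control unit, connected to the address, data, and control buses.]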
Instruction Execution Sequence
1. Fetch next instruction from memory to IR
2. Change PC to point to next instruction
3. Determine type of instruction just fetched
4. If instruction needs data from memory, determine where it is
5. Fetch data if needed into register
6. Execute instruction
7. Go to step 1 & continue with next instruction
Sample Program
 100: Load A,10
 101: Load B,15
 102: Add A,B
 103: STORE A,[20]
Program memory: addresses 100–103 hold the four instructions above; 104 and 105 are unused
Data memory: addresses 18–21, each initially 00
Before Execution 1st Fetch Cycle
[Figure: processor state. PC = 100; IR and registers A–E not yet loaded.]
After 1st Fetch Cycle …
[Figure: processor state. IR = Load A,10; PC = 101; registers A–E not yet loaded.]
After 1st Instruction Cycle …
[Figure: processor state. A = 10; IR = Load A,10; PC = 101.]
Sample Program
 100: Load A,10
 101: Load B,15
 102: Add A,B
After 2nd Fetch Cycle …
[Figure: processor state. IR = Load B,15; PC = 102; A still holds 10 from the previous instruction.]
After 2nd Instruction Cycle …
[Figure: processor state. A = 10; B = 15; IR = Load B,15; PC = 102.]
Sample Program
 100: Load A,10
 101: Load B,15
 102: Add A,B
After 3rd Fetch Cycle …
[Figure: processor state. A = 10; B = 15; IR = ADD A,B; PC = 103.]
After 3rd Instruction Cycle …
[Figure: processor state. A = 25; B = 15; IR = ADD A,B; PC = 103.]
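The fetch-execute trace above can be sketched as a small C simulator. This is only an illustration under assumptions: the instruction encoding, the `Cpu` struct, the `OP_HALT` instruction at address 104, and the function name `run_sample_program` are all hypothetical and do not come from the slides.

```c
/* Hypothetical encoding of the sample program; not a real ISA. */
typedef enum { OP_LOAD_IMM, OP_ADD, OP_STORE, OP_HALT } Opcode;

typedef struct {
    Opcode op;
    int dst;      /* register index: 0 = A, 1 = B */
    int operand;  /* immediate value, source register index, or memory address */
} Instruction;

typedef struct {
    int pc;          /* program counter */
    Instruction ir;  /* instruction register */
    int reg[5];      /* registers A..E */
    int data[32];    /* data memory, initially all zero */
} Cpu;

enum { PROGRAM_BASE = 100 };  /* first program memory address on the slides */

/* Runs the sample program and returns the final processor state. */
Cpu run_sample_program(void) {
    static const Instruction program[] = {
        { OP_LOAD_IMM, 0, 10 },  /* 100: Load A,10 */
        { OP_LOAD_IMM, 1, 15 },  /* 101: Load B,15 */
        { OP_ADD,      0, 1  },  /* 102: Add A,B */
        { OP_STORE,    0, 20 },  /* 103: STORE A,[20] */
        { OP_HALT,     0, 0  },  /* 104: stop (assumption, added for the sketch) */
    };
    Cpu cpu = { .pc = PROGRAM_BASE };  /* remaining fields start at zero */

    for (;;) {
        cpu.ir = program[cpu.pc - PROGRAM_BASE];  /* step 1: fetch into IR */
        cpu.pc += 1;                              /* step 2: point PC at next instruction */
        switch (cpu.ir.op) {                      /* step 3: decode */
        case OP_LOAD_IMM:                         /* step 6: execute */
            cpu.reg[cpu.ir.dst] = cpu.ir.operand;
            break;
        case OP_ADD:
            cpu.reg[cpu.ir.dst] += cpu.reg[cpu.ir.operand];
            break;
        case OP_STORE:                            /* writes a register to data memory */
            cpu.data[cpu.ir.operand] = cpu.reg[cpu.ir.dst];
            break;
        case OP_HALT:
            return cpu;
        }
    }                                             /* step 7: loop back to fetch */
}
```

No instruction in this program reads data memory, so steps 4 and 5 of the sequence are not exercised. Calling `run_sample_program()` should leave A = 25, B = 15, and 25 stored at data address 20.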
Architectural Differences
 Length of microprocessors’ data word
 4, 8, 16, 32, & 64 bit
 Speed of instruction execution
 Clock rate & processor speed
 Size of directly addressable memory
 CPU architecture
 Instruction set
 Number & types of registers
 Support circuits
 Compatibility with existing software & hardware development systems
Instruction Set Architecture (ISA)
[Figure: the instruction set as the interface between software and hardware.]
Properties of a Good ISA Abstraction
 Lasts through many generations (portability)
 Used in many different ways (generality)
 Provides convenient functionality to higher levels
 Permits an efficient implementation at lower levels
Computer Architecture Topics
[Figure: computer architecture topics. Instruction set architecture: addressing, protection, exception handling. Pipelining & instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP. Memory hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols. Input/output & storage: disks, WORM, tape; RAID, SSD; emerging technologies.]
Course Focus
Understanding design techniques, machine structures, technology factors, & evaluation methods that will determine forms of computers in the 21st century
[Figure: computer architecture (instruction set design, organization, hardware) at the centre, surrounded by technology, programming languages, operating systems, history, applications, interface design (ISA), measurement & evaluation, and parallelism.]
Trends in Technology
 Integrated circuit technology
 Transistor density – +35%/year
 Die size – +10-20%/year
 Integration overall – +40-55%/year
 DRAM capacity – +25-40%/year (slowing)
 Flash capacity – +50-60%/year
 15-20× cheaper/bit than DRAM
 Magnetic disk technology – +40%/year (slowing)
 15-25× cheaper/bit than Flash
 300-500× cheaper/bit than DRAM
Measuring Performance
 Typical performance metrics
 Response time
 Throughput
 Execution time
 Wall clock time – includes all system overheads
 CPU time – only computation time
 Speedup of X relative to Y
 Speedup = Execution time_Y / Execution time_X
 Benchmarks
 Kernels (e.g., matrix multiply)
 Toy programs (e.g., sorting)
 Synthetic benchmarks (e.g., Dhrystone)
 Benchmark suites (e.g., SPEC06fp, TPC-C, PCMark)
Bandwidth & Latency
 Bandwidth or throughput
 Total work done in a given time
 10,000-25,000X improvement for processors
 300-1200X improvement for memory & disks
 Latency or response time
 Time between start & completion of an event
 30-80X improvement for processors
 6-8X improvement for memory & disks
 Bandwidth has improved far faster than latency
Transistors & Wires
 Feature size
 Minimum size of transistor or wire in x or y dimension
 10 microns in 1971 to 0.014 microns in 2014
 Transistor performance used to scale
 Wires
 Feature size reduces → shorter wires
 High density
 But resistance & capacitance per unit length grow
 Wire delay doesn’t reduce with feature size!
 While transistors are getting smaller, latency isn’t reducing
Power & Energy
 Problem – Getting power in & out
 Thermal Design Power (TDP)
 Characterizes sustained power consumption
 Used as target for power supply & cooling system
 Lower than peak power, higher than average power
 Intel i7-4770K, 4 cores @ 3.5 GHz – TDP 84 W & peak ~140 W
 Clock rate can be reduced dynamically to limit power consumption
 Intel i7, AMD Ryzen
 Energy per task is often a better measurement
 Tied to the task & execution time
Techniques for Reducing Power
 Do nothing well
 Dynamic Voltage-Frequency Scaling (DVFS)
 e.g., AMD Opteron
 Low power state for DRAM, disks
 Sleep mode
 Overclocking, turning off cores
 Intel i7, AMD Ryzen
Source: AMD
Dynamic Energy & Power
 Dynamic energy
 Transistor switches from 0 → 1 or 1 → 0
 ½ × Capacitive load × Voltage²
 Dynamic power
 ½ × Capacitive load × Voltage² × Frequency switched
 Reducing voltage reduces energy
 Reducing clock rate reduces power, not energy
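As a worked illustration of the two formulas above, assume (hypothetically) that both supply voltage and switching frequency are lowered by 15% while the capacitive load stays the same. Dynamic energy then scales with the square of the voltage ratio, and dynamic power picks up one more factor from the frequency:

```latex
\frac{E'_{\text{dynamic}}}{E_{\text{dynamic}}}
  = \frac{(0.85\,V)^2}{V^2} = 0.85^2 \approx 0.72
\qquad
\frac{P'_{\text{dynamic}}}{P_{\text{dynamic}}}
  = \frac{(0.85\,V)^2\,(0.85\,f)}{V^2 f} = 0.85^3 \approx 0.61
```

Under these assumptions energy per switch drops to about 72% and power to about 61% of the original. Lowering frequency alone would reduce power but leave the energy ratio at 1.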
Static Power
 Static power consumption
 Current_static × Voltage
 Scales with number of transistors
 Stopping the clock signal is insufficient, because leakage current still flows
 Power gating
Exercise
 Which processor has better performance-power gain?
 Core i7-4770K
 4 core, 3.9 GHz
 TDP – 84W, average consumption 95.5W
 Apple A8
 2 core, 1.5 GHz (iPad Mini)
 2W
Trends in Cost
 Cost driven down by learning curve
 Yield
 Microprocessors – price depends on volume
 10% less for each doubling of volume
 DRAM – price closely tracks cost
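The "10% less for each doubling of volume" rule can be written as a formula. This is a sketch that assumes the 10% figure applies uniformly at every doubling; the symbols P_0 (price at a reference volume V_0) and P(V) are introduced here for illustration only.

```latex
P(V) = P_0 \times 0.9^{\log_2 (V / V_0)},
\qquad
P(8\,V_0) = P_0 \times 0.9^{3} \approx 0.73\,P_0
```

So three doublings (8× the volume) would bring the price to roughly 73% of the starting point under this assumption.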
Principles of Computer Design
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl’s Law
5. Processor Performance Equation
1. Taking Advantage of Parallelism
 Increasing throughput via multiple processors or multiple disks
 Examples
 Multiple processors
 RAID
 Memory banks
 Pipelining
 Multiple functional units – superscalar
Instruction Level Parallelism (ILP)
[Figure: pipelining illustration. Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm]
Pipelining
 Overlap instruction execution to reduce total time to complete an instruction sequence
 Not every instruction depends on its immediate predecessor → execute instructions completely/partially in parallel when possible
 Classic 5-stage pipeline
1. Instruction Fetch
2. Register Read
3. Execute (ALU)
4. Data Memory Access
5. Register Write (Reg)
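A rough way to quantify the overlap: assuming an ideal pipeline (one clock cycle per stage, no hazards or stalls), the first instruction needs k cycles to pass through k stages and each later instruction completes one cycle after the previous one. The function names below are illustrative and not from the slides.

```c
/* Cycles to finish n instructions on an ideal k-stage pipeline:
 * k cycles to fill the pipeline, then one completion per cycle. */
long pipelined_cycles(long n, long k) {
    return n > 0 ? k + (n - 1) : 0;
}

/* Cycles when each instruction runs all k stages before the next starts. */
long unpipelined_cycles(long n, long k) {
    return n * k;
}
```

For 4 instructions on the 5-stage pipeline this gives 8 cycles instead of 20; for long instruction sequences the ratio approaches k, the number of stages.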
Pipelined Instruction Execution
[Figure: pipeline diagram. Four instructions in program order (vertical axis) against time in clock cycles 1–7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, and Reg.]
Limits to Pipelining
 Hazards prevent next instruction from executing during its designated clock cycle
 Structural hazards
 Attempt to use same hardware to do 2 different things at once
 Data hazards
 Instruction depends on result of prior instruction still in pipeline
 Control hazards
 Caused by delay between fetching of instructions & decisions about changes in control flow (branches & jumps)
2. Principle of Locality
 Programs access a relatively small portion of address space at any instant of time
 Types of locality
 Spatial Locality
 If an item is referenced, items whose addresses are close by tend to be referenced soon
 e.g., straight-line code, array access
 Temporal Locality
 If an item is referenced, it will tend to be referenced again soon
 e.g., loops, reuse
Locality – Example
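int sum_array(const int *a, int n) {  /* wrapper and name added so the fragment compiles */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];                  /* a[i]: spatial locality; sum: temporal locality */
    return sum;
}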
 Data
 Access array elements in succession – Spatial locality
 Reference sum each iteration – Temporal locality
 Instructions
 Reference instructions in sequence – Spatial locality
 Cycle through loop repeatedly – Temporal locality
[Figure: array elements a[0], a[1], a[2], a[3], … laid out in consecutive memory locations.]
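Spatial locality also depends on traversal order. The sketch below is an added illustration, not from the slides: it sums the same matrix in row-major and column-major order. C stores 2-D arrays row by row, so the first loop touches consecutive addresses while the second jumps `COLS` elements at a time. The matrix size and function names are assumptions.

```c
#include <stddef.h>

enum { ROWS = 64, COLS = 64 };  /* arbitrary size chosen for the sketch */

/* Visits elements in the order they are laid out in memory. */
long sum_row_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* Visits one element per row before moving to the next column,
 * so consecutive accesses are COLS elements apart. */
long sum_column_major(int m[ROWS][COLS]) {
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}
```

Both functions return the same value; on machines with caches the row-major version typically runs faster for large matrices, though the difference depends on the cache and matrix sizes.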
3. Focus on Common Case
 Common sense guides computer design
 It’s engineering!
 Favor frequent case over infrequent case
 e.g., instruction fetch & decode unit used more frequently than multiplier, so optimize it 1st
 e.g., in databases storage dependability dominates system dependability, so optimize it 1st
 Frequent case is often simpler & can be done faster than infrequent case
 e.g., overflow is rare when adding numbers, so improve performance by optimizing common case of no overflow
 May slow down overflow, but overall performance improved by optimizing for normal case
4. Amdahl’s Law
Best you could ever hope to do:
$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$
Amdahl’s Law – Example
 Floating-point instructions improved to run 2× faster, but only 10% of actual instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
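The calculation above generalizes to any enhanced fraction and speedup. A minimal sketch in C follows; the function and parameter names are illustrative, and the fraction is assumed to lie in [0, 1).

```c
/* Overall speedup when a fraction of execution time is sped up
 * by a factor speedup_enhanced (Amdahl's Law). */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

/* Upper bound on speedup as the enhancement becomes infinitely fast. */
double amdahl_limit(double fraction_enhanced) {
    return 1.0 / (1.0 - fraction_enhanced);
}
```

With the slide's numbers, `amdahl_speedup(0.1, 2.0)` evaluates to about 1.053, and `amdahl_limit(0.1)` to about 1.11: even an infinitely fast FP unit could not give more than roughly 11% improvement here.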
5. Processor Performance Equation
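$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
The three factors are the instruction count, cycles per instruction (CPI), and cycle time.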
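The equation can be evaluated directly once the three factors are known. The sketch below takes clock rate in Hz (the reciprocal of cycle time); the function name and the example figures are hypothetical, chosen only to make the arithmetic easy to follow.

```c
/* CPU time in seconds = instruction count × CPI × cycle time,
 * where cycle time = 1 / clock rate. */
double cpu_time_seconds(double instruction_count, double cpi, double clock_rate_hz) {
    return instruction_count * cpi / clock_rate_hz;
}
```

For instance, a program of 1e9 instructions at a CPI of 2.0 on a 2 GHz clock gives `cpu_time_seconds(1e9, 2.0, 2e9)`, i.e. 1 second. Halving the CPI or doubling the clock rate would each halve that time, which is why the three factors are traded off against each other.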
5. Processor Performance Equation (Cont.)
Inst Count CPI Clock Rate
Program X
Compiler X (X)
Inst. Set. X X
Organization X
Technology X
Fallacies & Pitfalls
 Fallacies – commonly held misconceptions
 When discussing a fallacy, we try to give a counterexample
 Pitfalls – easily made mistakes
 Often generalizations of principles that are true only in a limited context
 Fallacies & pitfalls are shown to help you avoid these errors
Fallacies & Pitfalls (Cont.)
 Fallacy – Benchmarks remain valid indefinitely
 Once a benchmark becomes popular, there is tremendous pressure to improve performance by
 Targeted optimizations or
 Aggressive interpretation of rules for running the benchmark
 A.k.a. “benchmarksmanship”
 70 benchmarks from the 5 SPEC releases
 70% were dropped from next release because no longer useful
Fallacies & Pitfalls (Cont.)
 Pitfall – A single point of failure
 System is as reliable as its weakest link
 Rule of thumb for fault-tolerant systems – make sure that every component is redundant so that no single component failure can bring down the whole system
 e.g., power supply vs. fan

Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Advanced Computer Architecture – An Introduction

  • 1. Computer Architecture – An Introduction CS4342 Advanced Computer Architecture Dilum Bandara Dilum.Bandara@uom.lk Slides adapted from “Computer Architecture, A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
  • 2. Outline  Walls  Classes of computers  Instruction set architecture  Trends  Technology  Power & energy  Cost  Principles of computer design 2
  • 4. Why Such Rapid Change?  Performance improvements  Improvements in semiconductor technology  Clock speed, feature size  Improvements in computer architectures  High-level language compilers, UNIX  Lead to RISC architectures  Lower costs  Simpler development  Higher volumes  Lower margins  Function  Rise of networking & interconnection technology 4
  • 5. Today’s Status  Moore’s Law – number of transistors on a chip tends to double about every 2 years  [Figure annotations: transistor count still rising; clock speed flattening sharply]  Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
  • 6. Clock Speed vs. Power  Intel 80386 consumed ~2 W  3.3 GHz Intel Core i7 consumes 130 W  Heat must be dissipated from a 1.5 × 1.5 cm² chip  Limits what can be cooled by air
  • 7. Conventional Wisdom in Question  Conventional Wisdom – Power is free, Transistors are expensive  Today – Power is expensive, Transistors are free  Power wall  Can put more on chip than can afford to turn on  Conventional Wisdom – Increase Instruction Level Parallelism (ILP) via compilers, innovation  Out-of-order, speculation, VLIW  Today – Law of diminishing returns on more hardware for ILP  ILP wall 7
  • 8. Conventional Wisdom in Question (Cont.)  Conventional Wisdom – Multiplies are slow, Memory access is fast  Today – Memory is slow, multiplies are fast  Memory wall  200 clock cycles to DRAM memory, 4 clocks to multiply  Conventional Wisdom – Uniprocessor performance 2× / 1.5 years  Today – Power Wall + ILP Wall + Memory Wall = Brick Wall  Multi-cores  Simpler processors are more power efficient 8
  • 9. Current Trends in Architecture  Can’t continue to leverage ILP  Uniprocessor performance improvement ended in 2003  New models for performance  Data-level parallelism (DLP)  Thread-level parallelism (TLP)  Request-level parallelism (RLP)  These require explicit restructuring of applications 9
  • 11. Parallelism (Cont.)  Classes of parallelism in applications  Data-Level Parallelism (DLP)  Task-Level Parallelism (TLP)  Classes of architectural parallelism  Instruction-Level Parallelism (ILP)  Exploits DLP in pipelining & speculative execution  Vector architectures/Graphic Processor Units (GPUs)  Exploit DLP by applying same instruction on many data items  Thread-Level Parallelism  Exploit DLP & TLP in cooperative processing by threads  Request-Level Parallelism  Parallel execution of tasks that are independent 11
  • 12. Flynn’s Taxonomy  Single instruction stream, single data stream (SISD)  Normal sequential programs  Uniprocessor  Single instruction stream, multiple data streams (SIMD)  Data parallelism  Vector architectures  Multimedia extensions (Intel MMX)  Graphics Processor Units (GPUs)  Multiple instruction streams, single data stream (MISD)  No commercial implementation  Fault-tolerant schemes  Multiple instruction streams, multiple data streams (MIMD)  Most parallel programs  Multi-core
  • 13. Classes of Computers & Performance Metrics 13 Want to achieve these performance metrics? Then you need to understand & design based on principles of computer architecture
  • 14. Classes of Computers  Personal Mobile Device (PMD)  Smart phones & tablets  Emphasis is on energy efficiency, cost, responsiveness, & multimedia performance  Desktop Computing  Desktops, netbooks, & laptops  Emphasis is on price-performance, energy, & graphic performance  Servers  Emphasis is on availability, scalability, throughput, & energy 14
  • 15. Classes of Computers (Cont.)  Clusters / Warehouse Scale Computers  Used for “Software as a Service (SaaS)”  Emphasis on availability, price-performance, throughput, & energy  Sub-class – Supercomputers  Emphasis – floating-point performance & fast internal networks  Embedded Computers  Emphasis on price, power, size, application-specific performance 15
  • 16. Terminology  Computer Design – Maps a given organization to a logic design, logic design to a silicon layout, & chip packaging; view of a hardware designer; design decisions based on constraints like circuit-level delays, silicon real estate, heat generation, & cost; e.g., Intel Core i7-6800K vs. Xeon E5-2643 v4  Computer Organization – Internal details of operational units, their interconnection, & control; view of a computer designer; e.g., how to support multiplication (multiply circuit or repeated addition); Intel & AMD both support x86 with different organizations  Computer Architecture – Blueprint/plan that is visible to the programmer; key functional units, their interconnection, & instructions to program them; Instruction Set Architecture (ISA); e.g., x86 vs. ARM
  • 17. Blocks of a Microprocessor 17 Literal Address Operation Program Memory Instruction Register STACK Program Counter Instruction Decoder Timing, Control and Register selection Accumulator RAM & Data Registers ALU IO IO FLAG & Special Function Registers Clock Reset Interrupts Program Execution Section Register Processing Section Set up Set up Modify Address Internal data bus Source: Makis Malliris & Sabir Ghauri, UWE
  • 18. 18 Uniprocessor – Internal Structure A E D C B ALU Address BUS Control Unit IR FLAG ALU PC +1 Data BUS CTRL BUS
  • 19. 19 Instruction Execution Sequence 1. Fetch next instruction from memory to IR 2. Change PC to point to next instruction 3. Determine type of instruction just fetched 4. If instruction needs data from memory, determine where it is 5. Fetch data if needed into register 6. Execute instruction 7. Go to step 1 & continue with next instruction
  • 20. Sample Program  100: Load A,10  101: Load B,15  102: Add A,B  103: STORE A,[20]  [Figure: program memory (addresses 100–105) holds the four instructions at 100–103; data memory (addresses 18–21) holds 00 in every location]
  • 21. Before Execution, 1st Fetch Cycle  [Figure: block diagram of slide 18] PC = 100; IR & registers A–E not yet loaded
  • 22. After 1st Fetch Cycle …  [Figure: block diagram of slide 18] IR = Load A,10; PC = 101; registers A–E not yet loaded
  • 23. After 1st Instruction Cycle …  [Figure: block diagram of slide 18] A = 10; IR = Load A,10; PC = 101
  • 24. Sample Program  100: Load A,10  101: Load B,15  102: Add A,B
  • 25. After 2nd Fetch Cycle …  [Figure: block diagram of slide 18] IR = Load B,15; PC = 102
  • 26. After 2nd Instruction Cycle …  [Figure: block diagram of slide 18] A = 10; B = 15; IR = Load B,15; PC = 102
  • 27. Sample Program  100: Load A,10  101: Load B,15  102: Add A,B
  • 28. After 3rd Fetch Cycle …  [Figure: block diagram of slide 18] A = 10; B = 15; IR = ADD A,B; PC = 103
  • 29. After 3rd Instruction Cycle …  [Figure: block diagram of slide 18] A = 25; B = 15; IR = ADD A,B; PC = 103
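
The fetch and instruction cycles traced on slides 21–29 can be reproduced with a small simulation. This is only a sketch: the tuple encoding of instructions, the function name `run`, and the dictionary-based memories are assumptions made for illustration, not part of the slides.

```python
# Minimal sketch of the fetch-decode-execute loop for the sample program.
# The tuple encoding (opcode, operand1, operand2) is an illustrative assumption.

def run(program, start_pc, data_memory):
    registers = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0}
    pc = start_pc
    trace = []  # (PC after fetch, IR, register snapshot) per instruction cycle
    while pc in program:
        ir = program[pc]              # fetch next instruction into IR
        pc += 1                       # point PC to the next instruction
        op, first, second = ir        # decode
        if op == "LOAD":              # LOAD reg, immediate value
            registers[first] = second
        elif op == "ADD":             # ADD reg, reg (result goes to first register)
            registers[first] = registers[first] + registers[second]
        elif op == "STORE":           # STORE reg, [address]
            data_memory[second] = registers[first]
        else:
            raise ValueError(f"unknown opcode: {op}")
        trace.append((pc, ir, dict(registers)))
    return registers, data_memory, trace


PROGRAM = {
    100: ("LOAD", "A", 10),
    101: ("LOAD", "B", 15),
    102: ("ADD", "A", "B"),
    103: ("STORE", "A", 20),
}

registers, memory, trace = run(PROGRAM, 100, {18: 0, 19: 0, 20: 0, 21: 0})
for pc, ir, regs in trace:
    print(f"PC={pc} IR={ir} A={regs['A']} B={regs['B']}")
```

Each printed row corresponds to one "After Nth Instruction Cycle" slide; the final STORE step, which the slides do not illustrate, writes register A to data memory address 20.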
  • 30. Architectural Differences  Length of microprocessors’ data word  4, 8, 16, 32, & 64 bit  Speed of instruction execution  Clock rate & processor speed  Size of direct addressable memory  CPU architecture  Instruction set  Number & types of registers  Support circuits  Compatibility with existing software & hardware development systems 30
  • 31. Instruction Set Architecture (ISA) 31 Instruction Set Software Hardware
  • 32. Properties of a Good ISA Abstraction  Lasts through many generations (portability)  Used in many different ways (generality)  Provides convenient functionality to higher levels  Permits an efficient implementation at lower levels 32
  • 33. Computer Architecture Topics 33 Instruction Set Architecture Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP Addressing, Protection, Exception Handling L1 Cache L2 Cache DRAM Disks, WORM, Tape Coherence, Bandwidth, Latency Emerging Technologies Interleaving Bus protocols RAID, SSD Input/Output & Storage Memory Hierarchy Pipelining & Instruction Level Parallelism
  • 34. Course Focus 34 Understanding design techniques, machine structures, technology factors, evaluation methods that will determine forms of computers in 21st Century Technology Programming Languages Operating Systems History Applications Interface Design (ISA) Measurement & Evaluation Parallelism Computer Architecture • Instruction Set Design • Organization • Hardware
  • 35. Trends in Technology  Integrated circuit technology  Transistor density – +35%/year  Die size – +10-20%/year  Integration overall – +40-55%/year  DRAM capacity – +25-40%/year (slowing)  Flash capacity – +50-60%/year  15-20× cheaper/bit than DRAM  Magnetic disk technology – +40%/year (slowing)  15-25× cheaper/bit than Flash  300-500× cheaper/bit than DRAM
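
To relate the annual growth rates above to a doubling period such as the one in Moore's Law, compound growth can be inverted. The helper name `doubling_time_years` is an assumption for illustration; the calculation assumes the rate stays constant from year to year.

```python
import math

def doubling_time_years(annual_growth):
    """Years needed to double at a constant compound annual growth rate.

    annual_growth is a fraction, e.g. 0.35 for +35%/year.
    """
    return math.log(2) / math.log(1.0 + annual_growth)

# Growth rates taken from the slide; results are printed rather than hard-coded.
for label, rate in [("Transistor density", 0.35),
                    ("Integration overall (low)", 0.40),
                    ("Integration overall (high)", 0.55)]:
    print(f"{label}: doubles in about {doubling_time_years(rate):.2f} years")
```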
  • 36. Measuring Performance  Typical performance metrics  Response time  Throughput  Execution time  Wall clock time – includes all system overheads  CPU time – only computation time  Speedup of X relative to Y  Speedup = Execution time_Y / Execution time_X  Benchmarks  Kernels (e.g., matrix multiply)  Toy programs (e.g., sorting)  Synthetic benchmarks (e.g., Dhrystone)  Benchmark suites (e.g., SPEC06fp, TPC-C, PCMark)
  • 37. Bandwidth & Latency  Bandwidth or throughput  Total work done in a given time  10,000-25,000X improvement for processors  300-1200X improvement for memory & disks  Latency or response time  Time between start & completion of an event  30-80X improvement for processors  6-8X improvement for memory & disks  Bandwidth has improved far more than latency
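
The difference between wall clock time and CPU time, and the speedup ratio, can be sketched as follows. The helper names `measure` and `speedup` are assumptions for illustration; `time.perf_counter` and `time.process_time` are the standard-library clocks for elapsed time and process CPU time respectively.

```python
import time

def measure(task):
    """Return (wall_clock_seconds, cpu_seconds) for one call of task()."""
    wall_start = time.perf_counter()   # elapsed time, includes waiting & overheads
    cpu_start = time.process_time()    # CPU time of this process only
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def speedup(execution_time_y, execution_time_x):
    """Speedup of X relative to Y = Execution time_Y / Execution time_X."""
    return execution_time_y / execution_time_x

wall, cpu = measure(lambda: sum(i * i for i in range(100_000)))
print(f"wall clock: {wall:.4f} s, CPU: {cpu:.4f} s")

# Hypothetical timings: Y takes 15 s and X takes 10 s, so X is 1.5 times faster.
print(speedup(15.0, 10.0))
```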
  • 38. Transistors & Wires  Feature size  Minimum size of transistor or wire in x or y dimension  10 microns in 1971 to 0.014 microns in 2014  Transistor performance used to scale  Wires  Feature size reduces → shorter wires  High density  But resistance & capacitance per unit length grow  Wire delay doesn’t reduce with feature size!  While transistors are getting smaller, latency isn’t reducing
  • 39. Power & Energy  Problem – Getting power in & out  Thermal Design Power (TDP)  Characterizes sustained power consumption  Used as target for power supply & cooling system  Lower than peak power, higher than average power  Intel i7-4770K 4 Cores @ 3.5 GHz TDP 84 W & Peak ~140 W  Clock rate can be reduced dynamically to limit power consumption  Intel i7, AMD Ryzen  Energy per task is often a better measurement  Tied to the task & execution time
  • 40. Techniques for Reducing Power  Do nothing well  Dynamic Voltage-Frequency Scaling (DVFS)  e.g., AMD Opteron  Low power state for DRAM, disks  Sleep mode  Overclocking, turning off cores  Intel i7, AMD Ryzen 40 Source: AMD
  • 41. Dynamic Energy & Power  Dynamic energy  Transistor switches from 0 → 1 or 1 → 0  ½ × Capacitive load × Voltage²  Dynamic power  ½ × Capacitive load × Voltage² × Frequency switched  Reducing voltage reduces energy  Reducing clock rate reduces power, not energy
  • 42. Static Power  Static power consumption  Current_static × Voltage  Scales with number of transistors  Withholding the clock signal alone is insufficient  Power gating
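
The two formulas above can be checked numerically. In this sketch the capacitive load, voltage, and frequency values are hypothetical placeholders; only the ratios matter.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy of one 0 -> 1 or 1 -> 0 transition: 1/2 x C x V^2."""
    return 0.5 * capacitive_load * voltage ** 2

def dynamic_power(capacitive_load, voltage, frequency):
    """Power = 1/2 x C x V^2 x frequency switched."""
    return dynamic_energy(capacitive_load, voltage) * frequency

C_LOAD = 1e-9   # farads (hypothetical)
V = 1.0         # volts (hypothetical)
F = 3.0e9       # hertz (hypothetical)

# Lowering voltage by 15% cuts energy per transition to 0.85^2 of the original.
energy_ratio = dynamic_energy(C_LOAD, 0.85 * V) / dynamic_energy(C_LOAD, V)

# Halving the clock halves power, but energy per transition is unchanged.
power_ratio = dynamic_power(C_LOAD, V, F / 2) / dynamic_power(C_LOAD, V, F)

print(f"energy ratio at 0.85 V: {energy_ratio:.4f}")
print(f"power ratio at half clock: {power_ratio:.2f}")
```

Because a task needs the same number of transitions regardless of clock rate, slowing the clock stretches the task out in time without reducing the energy it consumes.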
  • 43. Exercise  Which processor has better performance-power gain?  Core i7-4770K  4 core, 3.9 GHz  TDP – 84W, average consumption 95.5W  Apple A8  2 core, 1.5 GHz (iPad Mini)  2W 43
  • 44. Trends in Cost  Cost driven down by learning curve  Yield  Microprocessors – price depends on volume  10% less for each doubling of volume  DRAM – price closely tracks cost 44
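
The rule of thumb "10% less for each doubling of volume" can be turned into a small calculation. The function name and the sample prices and volumes are hypothetical; the sketch assumes the 10% drop applies uniformly at every doubling.

```python
import math

def price_at_volume(base_price, base_volume, volume, drop_per_doubling=0.10):
    """Price after volume grows from base_volume to volume.

    Each doubling of volume multiplies the price by (1 - drop_per_doubling).
    """
    doublings = math.log2(volume / base_volume)
    return base_price * (1.0 - drop_per_doubling) ** doublings

# Hypothetical part priced at 100 when 1 million units are made.
for volume in (1_000_000, 2_000_000, 4_000_000, 8_000_000):
    print(volume, round(price_at_volume(100.0, 1_000_000, volume), 2))
```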
  • 45. Principles of Computer Design 1. Take Advantage of Parallelism 2. Principle of Locality 3. Focus on the Common Case 4. Amdahl’s Law 5. Processor Performance Equation 45
  • 46. 1. Taking Advantage of Parallelism  Increasing throughput via multiple processors or multiple disks  Examples  Multiple processors  RAID  Memory banks  Pipelining  Multiple functional units – superscalar 46
  • 48. Pipelining  Overlap instruction execution to reduce total time to complete an instruction sequence  Not every instruction depends on immediate predecessor → execute instructions completely/partially in parallel when possible  Classic 5-stage pipeline 1. Instruction Fetch 2. Register Read 3. Execute (ALU) 4. Data Memory Access 5. Register Write (Reg)
  • 49. Pipelined Instruction Execution  [Figure: pipeline diagram with time (clock cycles 1–7) on the horizontal axis and instruction order on the vertical axis; four instructions each pass through Ifetch, Reg, ALU, DMem, Reg, each starting one cycle after the previous one]
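
The benefit of overlapping instructions can be estimated with a short calculation. This sketch assumes an ideal pipeline in which every stage takes one cycle and no hazards occur; the function names are illustrative.

```python
# Stage labels as they appear in the pipeline figure.
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]

def pipelined_cycles(n, stages=len(STAGES)):
    """Cycles to finish n instructions when a new one starts every cycle."""
    return 0 if n == 0 else stages + n - 1

def unpipelined_cycles(n, stages=len(STAGES)):
    """Cycles when each instruction finishes before the next one starts."""
    return n * stages

def schedule(n):
    """Row i lists the stage instruction i occupies in each cycle ('-' if none)."""
    total = pipelined_cycles(n)
    rows = []
    for i in range(n):
        row = ["-"] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows

for row in schedule(4):
    print(" ".join(f"{cell:>6}" for cell in row))
print(pipelined_cycles(4), "cycles pipelined vs", unpipelined_cycles(4), "unpipelined")
```

For long instruction sequences the ratio approaches the number of stages, which is why hazards that stall the pipeline matter so much.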
  • 50. Limits to Pipelining  Hazards prevent next instruction from executing during its designated clock cycle  Structural hazards  Attempt to use same hardware to do 2 different things at once  Data hazards  Instruction depends on result of prior instruction still in pipeline  Control hazards  Caused by delay between fetching of instructions & decisions about changes in control flow (branches & jumps) 50
  • 51. 2. Principle of Locality  Programs access a relatively small portion of address space at any instant of time  Types of locality  Spatial Locality  If an item is referenced, items whose addresses are close by tend to be referenced soon  e.g., straight-line code, array access  Temporal Locality  If an item is referenced, it will tend to be referenced again soon  e.g., loops, reuse
  • 52. Locality – Example  sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum;  Data  Access array elements in succession – Spatial locality  Reference sum each iteration – Temporal locality  Instructions  Reference instructions in sequence – Spatial locality  Cycle through loop repeatedly – Temporal locality  [Figure: array elements a[0], a[1], a[2], a[3], … laid out consecutively in memory]
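
One way to see why accessing array elements in succession pays off is a toy memory model. Everything here is an assumption for illustration: 4-byte elements, 64-byte cache lines, an array starting at an aligned address, and a cache that remembers only the most recently fetched line.

```python
def count_line_fetches(indices, element_size=4, line_size=64):
    """Count how often an access falls outside the most recently fetched line."""
    fetches = 0
    current_line = None
    for i in indices:
        line = (i * element_size) // line_size   # which line holds a[i]
        if line != current_line:
            fetches += 1
            current_line = line
    return fetches

n = 1024
sequential = count_line_fetches(range(n))             # a[0], a[1], a[2], ...
strided = count_line_fetches(range(0, n * 16, 16))    # a[0], a[16], a[32], ...

print("sequential accesses:", n, "line fetches:", sequential)
print("strided accesses:   ", n, "line fetches:", strided)
```

Both loops touch the same number of elements, but the sequential one reuses each fetched line for 16 neighbouring elements, which is spatial locality at work.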
  • 53. 3. Focus on Common Case  Common sense guides computer design  It’s engineering!  Favor frequent case over infrequent case  e.g., instruction fetch & decode unit used more frequently than multiplier, so optimize it 1st  e.g., in databases storage dependability dominates system dependability, so optimize it 1st  Frequent case is often simpler & can be done faster than infrequent case  e.g., overflow is rare when adding numbers, so improve performance by optimizing common case of no overflow  May slow down overflow, but overall performance improved by optimizing for normal case 53
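  • 54. 4. Amdahl’s Law  Best you could ever hope to do: $\text{Speedup}_{\text{maximum}} = \dfrac{1}{1 - \text{Fraction}_{\text{enhanced}}}$
  • 55. Amdahl’s Law – Example  Floating point instructions improved to run 2× faster, but only 10% of actual instructions are FP  $\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{\text{old}}$  $\text{Speedup}_{\text{overall}} = \dfrac{1}{0.95} = 1.053$
  • 56. 5. Processor Performance Equation  $\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$  (Instruction count × CPI × Cycle time)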
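
The floating-point example can be reproduced with a few lines of code. The function names are illustrative; the formula is Amdahl's Law with the enhanced fraction and its local speedup as parameters.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only fraction_enhanced of the time is sped up."""
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time

def max_speedup(fraction_enhanced):
    """Upper bound as the enhanced part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)

# Slide example: 10% of instructions are FP and FP runs 2x faster.
print(f"overall speedup: {amdahl_speedup(0.10, 2.0):.3f}")
print(f"best possible:   {max_speedup(0.10):.3f}")
```

Even an infinitely fast floating-point unit could not push this workload much past an 11% improvement, which is the point of focusing on the common case.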
  • 57. 5. Processor Performance Equation (Cont.)  Terms affected by each factor  Program: Inst Count  Compiler: Inst Count, (CPI)  Inst. Set: Inst Count, CPI  Organization: CPI  Technology: Clock Rate
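
The processor performance equation is easy to experiment with numerically. The workload figures below are hypothetical; the function takes clock rate rather than cycle time, using cycle time = 1 / clock rate.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x cycles per instruction x cycle time."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time

# Hypothetical program: 1 billion instructions, CPI of 2, 2 GHz clock.
baseline = cpu_time(1e9, 2.0, 2e9)

# A better organization lowers CPI; a better compiler lowers instruction count.
lower_cpi = cpu_time(1e9, 1.5, 2e9)
fewer_instructions = cpu_time(0.8e9, 2.0, 2e9)

print(f"baseline:           {baseline:.3f} s")
print(f"CPI 2.0 -> 1.5:     {lower_cpi:.3f} s")
print(f"20% fewer instrs:   {fewer_instructions:.3f} s")
```

Since the three terms multiply, an improvement in one term only helps if it does not worsen another, for example a richer instruction set that lowers instruction count but raises CPI.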
  • 58. Fallacies & Pitfalls  Fallacies – commonly held misconceptions  When discussing a fallacy, we try to give a counterexample  Pitfalls – easily made mistakes  Often generalizations of principles true in limited context  Show Fallacies & Pitfalls to help you avoid these errors 58
  • 59. Fallacies & Pitfalls (Cont.)  Fallacy – Benchmarks remain valid indefinitely  Once a benchmark becomes popular, tremendous pressure to improve performance by  Targeted optimizations or  Aggressive interpretation of rules for running the benchmark  A.k.a. “benchmarksmanship”  70 benchmarks from the 5 SPEC releases  70% were dropped from next release because no longer useful 59
  • 60. Fallacies & Pitfalls (Cont.)  Pitfall – A single point of failure  System is only as reliable as its weakest link  Rule of thumb for fault-tolerant systems – make sure that every component is redundant so that no single component failure can bring down the whole system  e.g., power supply vs. fan