Introduction to advanced computer architecture, including classes of computers, instruction set architecture, trends, technology, power and energy, cost, and principles of computer design
Advanced Computer Architecture – An Introduction
1. Computer Architecture – An Introduction
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
2. Outline
Walls
Classes of computers
Instruction set architecture
Trends
Technology
Power & energy
Cost
Principles of computer design
4. Why Such Rapid Change?
Performance improvements
Improvements in semiconductor technology
Clock speed, feature size
Improvements in computer architectures
High-level language compilers, UNIX
Led to RISC architectures
Lower costs
Simpler development
Higher volumes
Lower margins
Function
Rise of networking & interconnection technology
5. Today’s Status
[Figure: CPU scaling chart – transistor count is still rising, while clock speed is flattening sharply. Moore's Law: the number of transistors on a chip tends to double about every 2 years.]
Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg
6. Clock Speed vs. Power
Intel 80386 consumed ~2 W
3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from a 1.5 × 1.5 cm chip
Limits what can be cooled by air
7. Conventional Wisdom in Question
Conventional Wisdom – Power is free, transistors are expensive
Today – Power is expensive, transistors are free
Power wall
Can put more on a chip than we can afford to turn on
Conventional Wisdom – Increase Instruction-Level Parallelism (ILP) via compilers & innovation
Out-of-order execution, speculation, VLIW
Today – Law of diminishing returns on more hardware for ILP
ILP wall
8. Conventional Wisdom in Question (Cont.)
Conventional Wisdom – Multiplies are slow, memory access is fast
Today – Memory is slow, multiplies are fast
Memory wall
~200 clock cycles to access DRAM, 4 clock cycles to multiply
Conventional Wisdom – Uniprocessor performance doubles every 1.5 years
Today – Power Wall + ILP Wall + Memory Wall = Brick Wall
Multi-cores
Simpler processors are more power efficient
9. Current Trends in Architecture
Can’t continue to leverage ILP
Uniprocessor performance improvement ended in 2003
New models for performance
Data-level parallelism (DLP)
Thread-level parallelism (TLP)
Request-level parallelism (RLP)
These require explicit restructuring of applications
11. Parallelism (Cont.)
Classes of parallelism in applications
Data-Level Parallelism (DLP)
Task-Level Parallelism (TLP)
Classes of architectural parallelism
Instruction-Level Parallelism (ILP)
Exploits DLP in pipelining & speculative execution
Vector architectures/Graphic Processor Units (GPUs)
Exploit DLP by applying same instruction on many data items
Thread-Level Parallelism
Exploit DLP & TLP in cooperative processing by threads
Request-Level Parallelism
Parallel execution of tasks that are independent
12. Flynn’s Taxonomy
Single instruction stream, single data stream (SISD)
Normal sequential programs
Uniprocessor
Single instruction stream, multiple data streams (SIMD)
Data parallelism
Vector architectures
Multimedia extensions (Intel MMX)
Graphics Processor Units (GPUs)
Multiple instruction streams, single data stream (MISD)
No commercial implementation
Fault-tolerant schemes
Multiple instruction streams, multiple data streams (MIMD)
Most parallel programs
Multi-core
13. Classes of Computers & Performance Metrics
Want to achieve these performance metrics?
Then you need to understand & design based on principles of computer architecture
14. Classes of Computers
Personal Mobile Device (PMD)
Smart phones & tablets
Emphasis is on energy efficiency, cost, responsiveness,
& multimedia performance
Desktop Computing
Desktops, netbooks, & laptops
Emphasis is on price-performance, energy, & graphic
performance
Servers
Emphasis is on availability, scalability, throughput, &
energy
15. Classes of Computers (Cont.)
Clusters / Warehouse Scale Computers
Used for “Software as a Service (SaaS)”
Emphasis on availability, price-performance,
throughput, & energy
Sub-class – Supercomputers
Emphasis – floating-point performance & fast internal
networks
Embedded Computers
Emphasis on price, power, size, application-specific
performance
16. Terminology
Computer Design
Maps a given organization to a logic design, a logic design to a silicon layout, & chip packaging
View of a hardware designer
Design decisions based on constraints like circuit-level delays, silicon real estate, heat generation, & cost
e.g., Intel Core i7-6800K vs. Xeon E5-2643 v4
Computer Organization
Internal details of operational units, their interconnection, & control
View of a computer designer
How to support multiplication – a multiply circuit or repeated addition
e.g., Intel & AMD both support x86 with different organizations
Computer Architecture
Blueprint/plan that is visible to the programmer
Key functional units, their interconnection, & instructions to program them
Instruction Set Architecture (ISA)
e.g., x86 vs. ARM
17. Blocks of a Microprocessor
[Block diagram: Program Execution Section (program memory, instruction register, instruction decoder, stack, program counter; timing, control, & register-selection logic; clock, reset, & interrupt inputs) connected over an internal data bus to the Register Processing Section (accumulator, ALU, RAM & data registers, FLAG & special-function registers, I/O).]
Source: Makis Malliris & Sabir Ghauri, UWE
18. Uniprocessor – Internal Structure
[Diagram: general-purpose registers A–E, ALU, FLAG register, instruction register (IR), program counter (PC) with +1 incrementer, & control unit, connected to memory via address, data, & control buses.]
19. Instruction Execution Sequence
1. Fetch next instruction from memory to IR
2. Change PC to point to next instruction
3. Determine type of instruction just fetched
4. If instruction needs data from memory, determine where
it is
5. Fetch data if needed into register
6. Execute instruction
7. Go to step 1 & continue with next instruction
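The seven steps above can be sketched as a loop. Below is a minimal Python sketch of the fetch–decode–execute cycle; the instruction tuples ("LOAD_IMM", "LOAD_MEM", "ADD", "STORE") and register names are illustrative assumptions, not a real ISA.

```python
# Minimal sketch of the 7-step instruction execution sequence.
# Instruction format and opcodes are made up for illustration.

def run(program, memory, registers):
    pc = 0  # program counter
    while pc < len(program):
        ir = program[pc]                 # 1. fetch next instruction into IR
        pc += 1                          # 2. advance PC to next instruction
        op, *args = ir                   # 3. determine instruction type
        if op == "LOAD_IMM":             # load an immediate value
            reg, value = args
            registers[reg] = value
        elif op == "LOAD_MEM":           # 4-5. fetch data from memory if needed
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "ADD":                # 6. execute
            dst, src = args
            registers[dst] += registers[src]
        elif op == "STORE":              # write a register back to memory
            reg, addr = args
            memory[addr] = registers[reg]
        # 7. loop back & continue with the next instruction
```

Running the sample program from the next slide through this loop leaves A = 25 and data memory location 20 = 25.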
20. Sample Program

Program memory:
100: Load A,10
101: Load B,15
102: ADD A,B
103: STORE A,[20]
104:
105:

Data memory:
18: 00
19: 00
20: 00
21: 00
21. Before Execution – 1st Fetch Cycle
[Diagram: PC = 100; IR & registers empty; data memory unchanged.]
22. After 1st Fetch Cycle …
[Diagram: IR = Load A,10; PC incremented to 101.]
23. After 1st Instruction Cycle …
[Diagram: register A = 10; IR = Load A,10; PC = 101.]
28. After 3rd Fetch Cycle …
[Diagram: registers A = 10, B = 15; IR = ADD A,B; PC incremented to 103.]
29. After 3rd Instruction Cycle …
[Diagram: register A = 25 (10 + 15), B = 15; IR = ADD A,B; PC = 103.]
30. Architectural Differences
Length of microprocessors’ data word
4, 8, 16, 32, & 64 bit
Speed of instruction execution
Clock rate & processor speed
Size of direct addressable memory
CPU architecture
Instruction set
Number & types of registers
Support circuits
Compatibility with existing software & hardware
development systems
32. Properties of a Good ISA Abstraction
Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower
levels
34. Course Focus
Understanding design techniques, machine structures, technology factors, & evaluation methods that will determine the form of computers in the 21st century

[Diagram: Computer Architecture – instruction set design, organization, & hardware – sits at the interface (ISA) between applications, programming languages, operating systems, & history on one side, and technology, parallelism, & measurement & evaluation on the other.]
35. Trends in Technology
Integrated circuit technology
Transistor density – +35%/year
Die size – +10-20%/year
Integration overall – +40-55%/year
DRAM capacity – +25-40%/year (slowing)
Flash capacity – +50-60%/year
15-20× cheaper/bit than DRAM
Magnetic disk technology – +40%/year (slowing)
15-25× cheaper/bit than Flash
300-500× cheaper/bit than DRAM
36. Measuring Performance
Typical performance metrics
Response time
Throughput
Execution time
Wall clock time – includes all system overheads
CPU time – only computation time
Speedup of X relative to Y
Speedup = Execution timeY / Execution timeX
Benchmarks
Kernels (e.g., matrix multiply)
Toy programs (e.g., sorting)
Synthetic benchmarks (e.g., Dhrystone)
Benchmark suites (e.g., SPEC06fp, TPC-C, PCMark)
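The speedup definition above can be written as a one-line helper; this is a minimal sketch, and the timing numbers below are made up for illustration.

```python
def speedup(exec_time_y, exec_time_x):
    """Speedup of X relative to Y: how many times faster X is."""
    return exec_time_y / exec_time_x

# If machine Y takes 12 s and machine X takes 4 s on the same program:
print(speedup(12.0, 4.0))  # → 3.0
```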
37. Bandwidth & Latency
Bandwidth or throughput
Total work done in a given time
10,000-25,000X improvement for processors
300-1200X improvement for memory & disks
Latency or response time
Time between start & completion of an event
30-80X improvement for processors
6-8X improvement for memory & disks
While bandwidth keeps increasing, latency improves far more slowly
38. Transistors & Wires
Feature size
Minimum size of transistor or wire in x or y dimension
10 microns in 1971 to 0.014 microns in 2014
Transistor performance used to scale with feature size
Wires
Smaller feature size means shorter wires
Higher density
But resistance & capacitance per unit length grow
Wire delay doesn’t reduce with feature size!
While transistors are getting smaller, wire delay isn’t reducing
39. Power & Energy
Problem – Getting power in & out
Thermal Design Power (TDP)
Characterizes sustained power consumption
Used as target for power supply & cooling system
Lower than peak power, higher than average power
Intel i7-4770K, 4 cores @ 3.5 GHz – TDP 84 W & peak ~140 W
Clock rate can be reduced dynamically to limit
power consumption
Intel i7, AMD Ryzen
Energy per task is often a better measurement
Tied to the task & its execution time
40. Techniques for Reducing Power
Do nothing well
Dynamic Voltage-Frequency
Scaling (DVFS)
e.g., AMD Opteron
Low power state for DRAM, disks
Sleep mode
Overclocking, turning off cores
Intel i7, AMD Ryzen
Source: AMD
41. Dynamic Energy & Power
Dynamic energy
Energy per transistor switch (0→1 or 1→0)
Energydynamic = ½ × Capacitive load × Voltage²
Dynamic power
Powerdynamic = ½ × Capacitive load × Voltage² × Frequency switched
Reducing voltage reduces both energy & power
Reducing clock rate reduces power, not energy
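The two formulas above can be checked numerically. A minimal sketch; the capacitive load, voltage, and frequency values are illustrative assumptions, not measurements of any real chip.

```python
# Dynamic energy & power from the formulas above.
# Capacitance, voltage, and frequency values are illustrative assumptions.

def dynamic_energy(cap_load, voltage):
    """Energy (J) dissipated per 0->1 or 1->0 switch."""
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq):
    """Average dynamic power (W) at a given switching frequency."""
    return dynamic_energy(cap_load, voltage) * freq

# Halving the clock halves power but leaves energy per switch unchanged.
p_full = dynamic_power(1e-9, 1.0, 3.0e9)   # 1 nF load, 1.0 V, 3 GHz
p_half = dynamic_power(1e-9, 1.0, 1.5e9)   # same voltage, half the clock
print(p_half / p_full)  # → 0.5
```

Because voltage enters quadratically, lowering voltage (as DVFS does) pays off in both energy and power, while lowering only the clock rate helps power alone.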
42. Static Power
Static power consumption
Powerstatic = Currentstatic × Voltage
Scales with the number of transistors
Withholding the clock signal alone is insufficient
Power gating
43. Exercise
Which processor has the better performance-to-power ratio?
Core i7-4770K
4 core, 3.9 GHz
TDP – 84W, average consumption 95.5W
Apple A8
2 core, 1.5 GHz (iPad Mini)
2W
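One rough back-of-the-envelope take on this exercise uses cores × clock (GHz) as a crude proxy for performance; this is a large simplifying assumption, since real performance depends on the workload, ISA, and microarchitecture.

```python
# Crude performance-per-watt comparison for the exercise above.
# cores * GHz is only a proxy for performance, not a real measurement.

def perf_per_watt(cores, ghz, watts):
    return cores * ghz / watts

i7 = perf_per_watt(4, 3.9, 95.5)  # Core i7-4770K, average consumption
a8 = perf_per_watt(2, 1.5, 2.0)   # Apple A8
print(i7, a8)  # under this proxy the A8 wins by roughly 9x
```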
44. Trends in Cost
Cost driven down by learning curve
Yield
Microprocessors – price depends on volume
10% less for each doubling of volume
DRAM – price closely tracks cost
45. Principles of Computer Design
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl’s Law
5. Processor Performance Equation
46. 1. Taking Advantage of Parallelism
Increasing throughput via multiple processors or
multiple disks
Examples
Multiple processors
RAID
Memory banks
Pipelining
Multiple functional units – superscalar
48. Pipelining
Overlap instruction execution to reduce total time
to complete an instruction sequence
Not every instruction depends on its immediate predecessor, so instructions can execute completely/partially in parallel when possible
Classic 5-stage pipeline
1. Instruction Fetch
2. Register Read
3. Execute (ALU)
4. Data Memory Access
5. Register Write
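The payoff of overlapping stages can be seen with simple cycle counts. A minimal sketch of an idealized k-stage pipeline with no hazards, where each stage takes one cycle (an assumption for illustration):

```python
# Cycle counts for an idealized pipeline with no hazards or stalls.

def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction runs all stages before the next one starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # The first instruction fills the pipeline; every later instruction
    # completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)

# 100 instructions on the classic 5-stage pipeline:
print(unpipelined_cycles(100, 5))  # → 500
print(pipelined_cycles(100, 5))    # → 104
```

As the instruction count grows, throughput approaches one instruction per cycle; the hazards on the next slide are what keep real pipelines below this ideal.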
50. Limits to Pipelining
Hazards prevent next instruction from executing
during its designated clock cycle
Structural hazards
Attempt to use same hardware to do 2 different things at once
Data hazards
Instruction depends on result of prior instruction still in pipeline
Control hazards
Caused by delay between fetching of instructions & decisions
about changes in control flow (branches & jumps)
51. 2. Principle of Locality
Programs access a relatively small portion of the address space at any instant of time
Types of locality
Spatial Locality
If an item is referenced, items whose addresses are close by
tend to be referenced soon
e.g., straight-line code, array access
Temporal Locality
If an item is referenced, it will tend to be referenced again
soon
e.g., loops, reuse
52. Locality – Example
sum = 0;
for (i = 0; i < n; i++)
sum += a[i];
return sum;
Data
Access array elements in succession – Spatial locality
Reference sum each iteration – Temporal locality
Instructions
Reference instructions in sequence – Spatial locality
Cycle through loop repeatedly – Temporal locality
53. 3. Focus on Common Case
Common sense guides computer design
It’s engineering!
Favor frequent case over infrequent case
e.g., instruction fetch & decode unit used more
frequently than multiplier, so optimize it 1st
e.g., in databases storage dependability dominates
system dependability, so optimize it 1st
Frequent case is often simpler & can be done
faster than infrequent case
e.g., overflow is rare when adding numbers, so improve
performance by optimizing common case of no overflow
May slow down overflow, but overall performance
improved by optimizing for normal case
54. 4. Amdahl’s Law
Best you could ever hope to do:

Speedupmaximum = 1 / (1 − Fractionenhanced)
55. Amdahl’s Law – Example
Floating point instructions improved to run 2X;
but only 10% of actual instructions are FP
ExTimenew = ExTimeold × (0.9 + 0.1/2) = 0.95 × ExTimeold

Speedupoverall = 1 / 0.95 = 1.053
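Amdahl's Law and the floating-point example above can be checked with a short function; a minimal sketch, with the fraction (10%) and enhancement (2×) taken from the slide.

```python
# Amdahl's Law, checked against the FP example above.

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.10, 2.0), 3))  # → 1.053
# Even an infinite FP speedup is capped at 1 / (1 - 0.10) ≈ 1.11x
```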
56. 5. Processor Performance Equation
CPU time = Seconds/Program
         = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
         = Instruction count × CPI × Cycle time
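The equation above is easy to evaluate directly. A minimal sketch; the instruction count, CPI, and clock rate below are illustrative assumptions.

```python
# CPU time from the processor performance equation above.
# Instruction count, CPI, and clock rate are illustrative assumptions.

def cpu_time(instruction_count, cpi, clock_hz):
    cycle_time = 1.0 / clock_hz          # seconds per cycle
    return instruction_count * cpi * cycle_time

# 1 billion instructions, CPI of 1.5, on a 3 GHz clock:
print(cpu_time(1e9, 1.5, 3e9))  # → 0.5 seconds
```

Note how the three factors give three independent levers: fewer instructions (compiler, ISA), fewer cycles per instruction (organization), or a shorter cycle (organization, technology).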
57. 5. Processor Performance Equation (Cont.)

             | Inst Count | CPI | Clock Rate
Program      |     X      |     |
Compiler     |     X      | (X) |
Inst. Set    |     X      |  X  |
Organization |            |  X  |     X
Technology   |            |     |     X
58. Fallacies & Pitfalls
Fallacies – commonly held misconceptions
When discussing a fallacy, we try to give a
counterexample
Pitfalls – easily made mistakes
Often generalizations of principles true in limited
context
We show fallacies & pitfalls to help you avoid these errors
59. Fallacies & Pitfalls (Cont.)
Fallacy – Benchmarks remain valid indefinitely
Once a benchmark becomes popular, tremendous
pressure to improve performance by
Targeted optimizations or
Aggressive interpretation of rules for running the benchmark
A.k.a. “benchmarksmanship”
Of the 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful
60. Fallacies & Pitfalls (Cont.)
Pitfall – A single point of failure
System is as reliable as its weakest link
Rule of thumb for fault-tolerant systems – make every component redundant, so that no single component failure can bring down the whole system
e.g., power supply vs. fan