SlideShare a Scribd company logo
1 of 28
Download to read offline
ECS-165A WQ'11 136 
8. Query Processing 
Goals: Understand the basic concepts underlying the steps in 
query processing and optimization and estimating query processing 
cost; apply query optimization techniques; 
Contents: 
 Overview 
 Catalog Information for Cost Estimation 
 Measures of Query Cost 
 Selection 
 Join Operations 
 Other Operations 
 Evaluation and Transformation of Expressions 
Query Processing  Optimization 
Task: Find an ecient physical query plan (aka execution plan) 
for an SQL query 
Goal: Minimize the evaluation time for the query, i.e., compute 
query result as fast as possible 
Cost Factors: Disk accesses, read/write operations, [I/O, page 
transfer] (CPU time is typically ignored) 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 137 
Basic Steps in Processing an SQL Query 
(System Catalogs) 
SQL Query 
Relational Algebra 
Expression 
Optimizer 
Statistics 
Query Result Evaluation Engine Execution Plan 
Data Files 
Parser  
Translator 
 Parsing and Translating 
{ Translate the query into its internal form (parse tree). 
This is then translated into an expression of the relational 
algebra. 
{ Parser checks syntax, validates relations, attributes and 
access permissions 
 Evaluation 
{ The query execution engine takes a physical query plan 
(aka execution plan), executes the plan, and returns the 
result. 
 Optimization: Find the cheapest execution plan for a query 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 138 
 A relational algebra expression may have many equivalent 
expressions, e.g., 
CName(Price5000((CUSTOMERS 1 ORDERS) 1 OFFERS)) 
CName((CUSTOMERS 1 ORDERS) 1 (Price5000(OFFERS))) 
Representation as logical query plan (a tree): 
o 
o 
CName 
Price  5000 CName 
Price  5000 
OFFERS 
ORDERS CUSTOMERS ORDERS OFFERS 
CUSTOMERS 
Non-leaf nodes  operations of relational algebra (with 
parameters); Leaf nodes  relations 
 A relational algebra expression can be evaluated in many 
ways. An annotated expression specifying detailed evaluation 
strategy is called the execution plan (includes, e.g., whether 
index is used, join algorithms, . . . ) 
 Among all semantically equivalent expressions, the one with 
the least costly evaluation plan is chosen. Cost estimate of a 
plan is based on statistical information in the system catalogs. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 139 
Catalog Information for Cost Estimation 
Information about relations and attributes: 
 NR: number of tuples in the relation R. 
 BR: number of blocks that contain tuples of the relation R. 
 SR: size of a tuple of R. 
 FR: blocking factor; number of tuples from R that
t into one 
block (FR = dNR=BRe) 
 V(A; R): number of distinct values for attribute A in R. 
 SC(A; R): selectivity of attribute A 
 average number of tuples of R that satisfy an 
equality condition on A. 
SC(A; R) = NR=V(A; R). 
Information about indexes: 
 HTI: number of levels in index I (B+-tree). 
 LBI: number of blocks occupied by leaf nodes in index I 
(
rst-level blocks). 
 ValI: number of distinct values for the search key. 
Some relevant tables in the Oracle system catalogs: 
USER TABLES USER TAB COLUMNS USER INDEXES 
NUM ROWS NUM DISTINCT BLEVEL 
BLOCKS LOW VALUE LEAF BLOCKS 
EMPTY BLOCKS HIGH VALUE DISTINCT KEYS 
AVG SPACE DENSITY AVG LEAF BLOCKS PER KEY 
CHAIN CNT NUM BUCKETS 
AVG ROW LEN LAST ANALYZED 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 140 
Measures of Query Cost 
 There are many possible ways to estimate cost, e.g., based on 
disk accesses, CPU time, or communication overhead. 
 Disk access is the predominant cost (in terms of time); 
relatively easy to estimate; therefore, number of block transfers 
from/to disk is typically used as measure. 
{ Simplifying assumption: each block transfer has the same 
cost. 
 Cost of algorithm (e.g., for join or selection) depends on 
database buer size; more memory for DB buer reduces disk 
accesses. Thus DB buer size is a parameter for estimating 
cost. 
 We refer to the cost estimate of algorithm S as cost(S). We 
do not consider cost of writing output to disk. 
Selection Operation 
A=a(R) where a is a constant value, A an attribute of R 
 File Scan { search algorithms that locate and retrieve records 
that satisfy a selection condition 
 S1 { Linear search 
cost(S1)= BR 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 141 
Selection Operation (cont.) 
 S2 { Binary search, i.e., the
le ordered based on attribute A 
(primary index) 
cost(S2) = dlog2(BR)e + 
 
SC(A; R) 
FR 
 
 1 
{ dlog2(BR)e  cost to locate the
rst tuple using binary 
search 
{ Second term  blocks that contain records satisfying the 
selection. 
{ If A is primary key, then SC(A; R) = 1, hence 
cost(S2) = dlog2(BR)e. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 142 
 Example (for Employee DB) 
{ FEmployee = 10; 
V(Deptno; Employee) = 50 (dierent departments) 
{ NEmployee = 10; 000 (Relation Employee has 10,000 tuples) 
{ Assume selection Deptno=20(Employee) and Employee is 
sorted on search key Deptno : 
=) 10,000/50 = 200 tuples in Employee belong to 
Deptno 20; 
(assuming an equal distribution) 
200/10 = 20 blocks for these tuples 
=) A binary search
nding the
rst block would require 
dlog2(1; 000)e = 10 block accesses 
Total cost of binary search is 10+20 block accesses (versus 
1,000 for linear search and Employee not sorted by Deptno). 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 143 
 Index scan { search algorithms that use an index (here, a 
B+-tree); selection condition is on search key of index 
 S3 { Primary index I for A, A primary key, equality A = a 
cost(S3) = HTI + 1 (only 1 tuple satis
es condition) 
 S4 { Primary index I on non-key A equality A = a 
cost(S4) = HTI + 
 
SC(A; R) 
FR 
 
 S5 { Non-primary (non-clustered) index on non-key A, 
equality A = a 
cost(S5) = HTI + SC(A; R) 
Worst case: each matching record resides in a dierent block. 
 Example (Cont.): 
{ Assume primary (B+-tree) index for attribute Deptno 
{ 200/10=20 blocks accesses are required to read Employee 
tuples 
{ If B+-tree index stores 20 pointers per (inner) node, then 
the B+-tree index must have between 3 and 5 leaf nodes 
and the entire tree has a depth of 2 
=) a total of 22 blocks must be read. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 144 
Selections Involving Comparisons 
 Selections of the form Av(R) or Av(R) are implemented 
using a
le scan or binary search, or by using either a 
{ S6 { A primary index on A, or 
{ S7 { A secondary index on A (in this case, typically a 
linear
le scan may be cheaper; but this depends on the 
selectivity of A) 
Complex Selections 
 General pattern: 
{ Conjunction { 1^:::^n(R) 
{ Disjunction { 1_:::_n(R) 
{ Negation { :(R) 
 The selectivity of a condition i is the probability that a tuple 
in the relation R satis
es i. If si is the number of tuples in 
R that satisfy i, then i's selectivity is estimated as si=NR. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 145 
Join Operations 
 There are several dierent algorithms that can be used to 
implement joins (natural-, equi-, condition-join) 
{ Nested-Loop Join 
{ Block Nested-Loop Join 
{ Index Nested-Loop Join 
{ Sort-Merge Join 
{ Hash-Join 
 Choice of a particular algorithm is based on cost estimate 
 For this, join size estimates are required and in particular 
cost estimates for outer-level operations in a relational algebra 
expression. 
 Example: 
Assume the query CUSTOMERS 1 ORDERS (with join 
attribute only being CName) 
{ NCUSTOMERS = 5,000 tuples 
{ FCUSTOMERS = 20, i.e., BCUSTOMERS = 5,000/20 = 250 blocks 
{ NORDERS = 10,000 tuples 
{ FORDERS = 25, i.e., BORDERS = 400 blocks 
{ V(CName; ORDERS) = 2,500, meaning that in this relation, 
on average, each customer has four orders 
{ Also assume that CName in ORDERS is a foreign key on 
CUSTOMERS 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 146 
Estimating the Size of Joins 
 The Cartesian product R  S results in NR  NS tuples; each 
tuple requires SR + SS bytes. 
 If schema(R)  schema(S) = primary key for R, then a tuple 
of S will match with at most one tuple from R. 
Therefore, the number of tuples in R1S is not greater than NS 
If schema(R)  schema(S) = foreign key in S referencing R, 
then the number of tuples in R1S is exactly NS. 
Other cases are symmetric. 
 In the example query CUSTOMERS 1 ORDERS, CName in 
ORDERS is a foreign key of CUSTOMERS; the result thus 
has exactly NORDERS = 10,000 tuples 
 If schema(R)  schema(S) = fAg is not a key for R or S; 
assume that every tuple in R produces tuples in R 1 S. Then 
the number of tuples in R 1 S is estimated to be: 
NR  NS 
V(A; S) 
If the reverse is true, the estimate is 
NR  NS 
V(A; R) 
and the lower of the two estimates is probably the more 
accurate one. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 147 
 Size estimates for CUSTOMERS 1 ORDERS without using 
information about foreign keys: 
{ V(CName; CUSTOMERS) = 5,000, and 
V(CName; ORDERS) = 2,500 
{ The two estimates are 5,000*10,000/2,500=20,000 and 
5,000*10,000/5,000=10,000. 
 We choose the lower estimate, which, in this case, is the same 
as our earlier computation using foreign key information. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 148 
Nested-Loop Join 
 Evaluate the condition join R 1C S 
 for each tuple tR in R do begin 
for each tuple tS in S do begin 
check whether pair (tR; tS) satis
es join condition 
if they do, add tR  tS to the result 
end 
end 
 R is called the outer and S the inner relation of the join. 
 Requires no indexes and can be used with any kind of join 
condition. 
 Worst case: db buer can only hold one block of each relation 
=) BR + NR  BS disk accesses 
 Best case: both relations
t into db buer 
=) BR + BS disk accesses. 
Dept. of Computer Science UC Davis 8. Query Processing and Optimization
ECS-165A WQ'11 149 
An Improvement: Block Nested-Loop Join 
 Evaluate the condition join R 1C S 
 for each block BR of R do begin 
for each block BS of S do begin 
for each tuple tR in BR do 
for each tuple tS in BS do 
check whether pair (tR; tS) 
satis
es join condition 
if they do, add tR  tS to the result 
end end end end 
 Also requires no indexes and can be used with any kind of join 
condition. 
 Worst case: db buer can only hold one block of each relation 
=) BR + BR  BS disk accesses. 
 Best case: both relations
t into db buer 
=) BR + BS disk accesses. 
 If smaller relation completely

More Related Content

What's hot

What's hot (20)

DBMS : Relational Algebra
DBMS : Relational Algebra DBMS : Relational Algebra
DBMS : Relational Algebra
 
Arrays in SAS
Arrays in SASArrays in SAS
Arrays in SAS
 
Chapter 4 ds
Chapter 4 dsChapter 4 ds
Chapter 4 ds
 
R getting spatial
R getting spatialR getting spatial
R getting spatial
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 
R Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB AcademyR Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB Academy
 
1643 y є r relational calculus-1
1643 y є r  relational calculus-11643 y є r  relational calculus-1
1643 y є r relational calculus-1
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R Language
 
Single linked list
Single linked listSingle linked list
Single linked list
 
Monad presentation scala as a category
Monad presentation   scala as a categoryMonad presentation   scala as a category
Monad presentation scala as a category
 
Chapter 10 ds
Chapter 10 dsChapter 10 ds
Chapter 10 ds
 
linked list
linked listlinked list
linked list
 
Relational Calculus
Relational CalculusRelational Calculus
Relational Calculus
 
Programming in R
Programming in RProgramming in R
Programming in R
 
Functional Programming in R
Functional Programming in RFunctional Programming in R
Functional Programming in R
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
14285 lecture2
14285 lecture214285 lecture2
14285 lecture2
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
5 top-down-parsers
5  top-down-parsers 5  top-down-parsers
5 top-down-parsers
 
La tex basics
La tex basicsLa tex basics
La tex basics
 

Viewers also liked

Where Are We Going With This?
Where Are We Going With This? Where Are We Going With This?
Where Are We Going With This? Jim Gray
 
Picture Presentation 112009
Picture Presentation 112009Picture Presentation 112009
Picture Presentation 112009Cresit
 
Dumpsters,Ninjas,Guerrillas
Dumpsters,Ninjas,GuerrillasDumpsters,Ninjas,Guerrillas
Dumpsters,Ninjas,GuerrillasJim Gray
 
Back To School Night 2009 Without Media
Back To School Night 2009 Without MediaBack To School Night 2009 Without Media
Back To School Night 2009 Without MediaRudy Alfonso
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsLinkedIn
 
An Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseAn Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseEditor IJMTER
 
Social Media in Construction and Property - Where are we Now?
Social Media in Construction and Property - Where are we Now?Social Media in Construction and Property - Where are we Now?
Social Media in Construction and Property - Where are we Now?Su Butcher
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMSkoolkampus
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerLuminary Labs
 

Viewers also liked (11)

Where Are We Going With This?
Where Are We Going With This? Where Are We Going With This?
Where Are We Going With This?
 
Picture Presentation 112009
Picture Presentation 112009Picture Presentation 112009
Picture Presentation 112009
 
Dumpsters,Ninjas,Guerrillas
Dumpsters,Ninjas,GuerrillasDumpsters,Ninjas,Guerrillas
Dumpsters,Ninjas,Guerrillas
 
Obra Parking
Obra ParkingObra Parking
Obra Parking
 
Back To School Night 2009 Without Media
Back To School Night 2009 Without MediaBack To School Night 2009 Without Media
Back To School Night 2009 Without Media
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
 
An Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed DatabaseAn Analysis on Query Optimization in Distributed Database
An Analysis on Query Optimization in Distributed Database
 
Social Media in Construction and Property - Where are we Now?
Social Media in Construction and Property - Where are we Now?Social Media in Construction and Property - Where are we Now?
Social Media in Construction and Property - Where are we Now?
 
Query processing
Query processingQuery processing
Query processing
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI Explainer
 

Similar to 8 query

Lecture21-Query-Optimization-1April-2018.pptx
Lecture21-Query-Optimization-1April-2018.pptxLecture21-Query-Optimization-1April-2018.pptx
Lecture21-Query-Optimization-1April-2018.pptxRishabMishra25
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluationavniS
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica"FENG "GEORGE"" YU
 
CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29Bilal Ahmed
 
Query processing and Query Optimization
Query processing and Query OptimizationQuery processing and Query Optimization
Query processing and Query OptimizationNiraj Gandha
 
Query processing and Query Optimization
Query processing and Query OptimizationQuery processing and Query Optimization
Query processing and Query OptimizationNiraj Gandha
 
lecture 41 cost optimization.pptx
lecture 41 cost optimization.pptxlecture 41 cost optimization.pptx
lecture 41 cost optimization.pptxSundaramyadav3
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
IR-ranking
IR-rankingIR-ranking
IR-rankingFELIX75
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Avelin Huo
 
Adbms 40 heuristics in query optimization
Adbms 40 heuristics in query optimizationAdbms 40 heuristics in query optimization
Adbms 40 heuristics in query optimizationVaibhav Khanna
 
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptxLeenaChaudhari24
 
Oracle sql tutorial
Oracle sql tutorialOracle sql tutorial
Oracle sql tutorialMohd Tousif
 

Similar to 8 query (20)

Lecture21-Query-Optimization-1April-2018.pptx
Lecture21-Query-Optimization-1April-2018.pptxLecture21-Query-Optimization-1April-2018.pptx
Lecture21-Query-Optimization-1April-2018.pptx
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluation
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
 
Chapter15
Chapter15Chapter15
Chapter15
 
CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29
 
Query processing and Query Optimization
Query processing and Query OptimizationQuery processing and Query Optimization
Query processing and Query Optimization
 
Query processing and Query Optimization
Query processing and Query OptimizationQuery processing and Query Optimization
Query processing and Query Optimization
 
lecture 41 cost optimization.pptx
lecture 41 cost optimization.pptxlecture 41 cost optimization.pptx
lecture 41 cost optimization.pptx
 
Cs341
Cs341Cs341
Cs341
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
IR-ranking
IR-rankingIR-ranking
IR-ranking
 
Ch14
Ch14Ch14
Ch14
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03
 
Adbms 40 heuristics in query optimization
Adbms 40 heuristics in query optimizationAdbms 40 heuristics in query optimization
Adbms 40 heuristics in query optimization
 
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx
3110003_PPS_GTU_Study_Material_Presentations_Unit-2_18122020041700AM (1).pptx
 
Unit04 dbms
Unit04 dbmsUnit04 dbms
Unit04 dbms
 
Oracle sql tutorial
Oracle sql tutorialOracle sql tutorial
Oracle sql tutorial
 

8 query

  • 1. ECS-165A WQ'11 136 8. Query Processing Goals: Understand the basic concepts underlying the steps in query processing and optimization and estimating query processing cost; apply query optimization techniques; Contents: Overview Catalog Information for Cost Estimation Measures of Query Cost Selection Join Operations Other Operations Evaluation and Transformation of Expressions Query Processing Optimization Task: Find an ecient physical query plan (aka execution plan) for an SQL query Goal: Minimize the evaluation time for the query, i.e., compute query result as fast as possible Cost Factors: Disk accesses, read/write operations, [I/O, page transfer] (CPU time is typically ignored) Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 2. ECS-165A WQ'11 137 Basic Steps in Processing an SQL Query (System Catalogs) SQL Query Relational Algebra Expression Optimizer Statistics Query Result Evaluation Engine Execution Plan Data Files Parser Translator Parsing and Translating { Translate the query into its internal form (parse tree). This is then translated into an expression of the relational algebra. { Parser checks syntax, validates relations, attributes and access permissions Evaluation { The query execution engine takes a physical query plan (aka execution plan), executes the plan, and returns the result. Optimization: Find the cheapest execution plan for a query Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 3. ECS-165A WQ'11 138 A relational algebra expression may have many equivalent expressions, e.g., CName(Price5000((CUSTOMERS 1 ORDERS) 1 OFFERS)) CName((CUSTOMERS 1 ORDERS) 1 (Price5000(OFFERS))) Representation as logical query plan (a tree): o o CName Price 5000 CName Price 5000 OFFERS ORDERS CUSTOMERS ORDERS OFFERS CUSTOMERS Non-leaf nodes operations of relational algebra (with parameters); Leaf nodes relations A relational algebra expression can be evaluated in many ways. An annotated expression specifying detailed evaluation strategy is called the execution plan (includes, e.g., whether index is used, join algorithms, . . . ) Among all semantically equivalent expressions, the one with the least costly evaluation plan is chosen. Cost estimate of a plan is based on statistical information in the system catalogs. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 4. ECS-165A WQ'11 139 Catalog Information for Cost Estimation Information about relations and attributes: NR: number of tuples in the relation R. BR: number of blocks that contain tuples of the relation R. SR: size of a tuple of R. FR: blocking factor; number of tuples from R that
  • 5. t into one block (FR = dNR=BRe) V(A; R): number of distinct values for attribute A in R. SC(A; R): selectivity of attribute A average number of tuples of R that satisfy an equality condition on A. SC(A; R) = NR=V(A; R). Information about indexes: HTI: number of levels in index I (B+-tree). LBI: number of blocks occupied by leaf nodes in index I (
  • 6. rst-level blocks). ValI: number of distinct values for the search key. Some relevant tables in the Oracle system catalogs: USER TABLES USER TAB COLUMNS USER INDEXES NUM ROWS NUM DISTINCT BLEVEL BLOCKS LOW VALUE LEAF BLOCKS EMPTY BLOCKS HIGH VALUE DISTINCT KEYS AVG SPACE DENSITY AVG LEAF BLOCKS PER KEY CHAIN CNT NUM BUCKETS AVG ROW LEN LAST ANALYZED Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 7. ECS-165A WQ'11 140 Measures of Query Cost There are many possible ways to estimate cost, e.g., based on disk accesses, CPU time, or communication overhead. Disk access is the predominant cost (in terms of time); relatively easy to estimate; therefore, number of block transfers from/to disk is typically used as measure. { Simplifying assumption: each block transfer has the same cost. Cost of algorithm (e.g., for join or selection) depends on database buer size; more memory for DB buer reduces disk accesses. Thus DB buer size is a parameter for estimating cost. We refer to the cost estimate of algorithm S as cost(S). We do not consider cost of writing output to disk. Selection Operation A=a(R) where a is a constant value, A an attribute of R File Scan { search algorithms that locate and retrieve records that satisfy a selection condition S1 { Linear search cost(S1)= BR Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 8. ECS-165A WQ'11 141 Selection Operation (cont.) S2 { Binary search, i.e., the
  • 9. le ordered based on attribute A (primary index) cost(S2) = dlog2(BR)e + SC(A; R) FR 1 { dlog2(BR)e cost to locate the
  • 10. rst tuple using binary search { Second term blocks that contain records satisfying the selection. { If A is primary key, then SC(A; R) = 1, hence cost(S2) = dlog2(BR)e. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 11. ECS-165A WQ'11 142 Example (for Employee DB) { FEmployee = 10; V(Deptno; Employee) = 50 (dierent departments) { NEmployee = 10; 000 (Relation Employee has 10,000 tuples) { Assume selection Deptno=20(Employee) and Employee is sorted on search key Deptno : =) 10,000/50 = 200 tuples in Employee belong to Deptno 20; (assuming an equal distribution) 200/10 = 20 blocks for these tuples =) A binary search
  • 13. rst block would require dlog2(1; 000)e = 10 block accesses Total cost of binary search is 10+20 block accesses (versus 1,000 for linear search and Employee not sorted by Deptno). Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 14. ECS-165A WQ'11 143 Index scan { search algorithms that use an index (here, a B+-tree); selection condition is on search key of index S3 { Primary index I for A, A primary key, equality A = a cost(S3) = HTI + 1 (only 1 tuple satis
  • 15. es condition) S4 { Primary index I on non-key A equality A = a cost(S4) = HTI + SC(A; R) FR S5 { Non-primary (non-clustered) index on non-key A, equality A = a cost(S5) = HTI + SC(A; R) Worst case: each matching record resides in a dierent block. Example (Cont.): { Assume primary (B+-tree) index for attribute Deptno { 200/10=20 blocks accesses are required to read Employee tuples { If B+-tree index stores 20 pointers per (inner) node, then the B+-tree index must have between 3 and 5 leaf nodes and the entire tree has a depth of 2 =) a total of 22 blocks must be read. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 16. ECS-165A WQ'11 144 Selections Involving Comparisons Selections of the form Av(R) or Av(R) are implemented using a
  • 17. le scan or binary search, or by using either a { S6 { A primary index on A, or { S7 { A secondary index on A (in this case, typically a linear
  • 18. le scan may be cheaper; but this depends on the selectivity of A) Complex Selections General pattern: { Conjunction { 1^:::^n(R) { Disjunction { 1_:::_n(R) { Negation { :(R) The selectivity of a condition i is the probability that a tuple in the relation R satis
  • 19. es i. If si is the number of tuples in R that satisfy i, then i's selectivity is estimated as si=NR. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 20. ECS-165A WQ'11 145 Join Operations There are several dierent algorithms that can be used to implement joins (natural-, equi-, condition-join) { Nested-Loop Join { Block Nested-Loop Join { Index Nested-Loop Join { Sort-Merge Join { Hash-Join Choice of a particular algorithm is based on cost estimate For this, join size estimates are required and in particular cost estimates for outer-level operations in a relational algebra expression. Example: Assume the query CUSTOMERS 1 ORDERS (with join attribute only being CName) { NCUSTOMERS = 5,000 tuples { FCUSTOMERS = 20, i.e., BCUSTOMERS = 5,000/20 = 250 blocks { NORDERS = 10,000 tuples { FORDERS = 25, i.e., BORDERS = 400 blocks { V(CName; ORDERS) = 2,500, meaning that in this relation, on average, each customer has four orders { Also assume that CName in ORDERS is a foreign key on CUSTOMERS Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 21. ECS-165A WQ'11 146 Estimating the Size of Joins The Cartesian product R S results in NR NS tuples; each tuple requires SR + SS bytes. If schema(R) schema(S) = primary key for R, then a tuple of S will match with at most one tuple from R. Therefore, the number of tuples in R1S is not greater than NS If schema(R) schema(S) = foreign key in S referencing R, then the number of tuples in R1S is exactly NS. Other cases are symmetric. In the example query CUSTOMERS 1 ORDERS, CName in ORDERS is a foreign key of CUSTOMERS; the result thus has exactly NORDERS = 10,000 tuples If schema(R) schema(S) = fAg is not a key for R or S; assume that every tuple in R produces tuples in R 1 S. Then the number of tuples in R 1 S is estimated to be: NR NS V(A; S) If the reverse is true, the estimate is NR NS V(A; R) and the lower of the two estimates is probably the more accurate one. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 22. ECS-165A WQ'11 147 Size estimates for CUSTOMERS 1 ORDERS without using information about foreign keys: { V(CName; CUSTOMERS) = 5,000, and V(CName; ORDERS) = 2,500 { The two estimates are 5,000*10,000/2,500=20,000 and 5,000*10,000/5,000=10,000. We choose the lower estimate, which, in this case, is the same as our earlier computation using foreign key information. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 23. ECS-165A WQ'11 148 Nested-Loop Join Evaluate the condition join R 1C S for each tuple tR in R do begin for each tuple tS in S do begin check whether pair (tR; tS) satis
  • 24. es join condition if they do, add tR tS to the result end end R is called the outer and S the inner relation of the join. Requires no indexes and can be used with any kind of join condition. Worst case: db buer can only hold one block of each relation =) BR + NR BS disk accesses Best case: both relations
  • 25. t into db buer =) BR + BS disk accesses. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 26. ECS-165A WQ'11 149 An Improvement: Block Nested-Loop Join Evaluate the condition join R 1C S for each block BR of R do begin for each block BS of S do begin for each tuple tR in BR do for each tuple tS in BS do check whether pair (tR; tS) satis
  • 27. es join condition if they do, add tR tS to the result end end end end Also requires no indexes and can be used with any kind of join condition. Worst case: db buer can only hold one block of each relation =) BR + BR BS disk accesses. Best case: both relations
  • 28. t into db buer =) BR + BS disk accesses. If smaller relation completely
  • 29. ts into db buer, use that as inner relation. Reduces the cost estimate to BR + BS disk accesses. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 30. ECS-165A WQ'11 150 Block Nested-Loop Join (cont.) Some improvements of block nested-loop algorithm { If equi-join attribute is the key on inner relation, stop inner loop with
  • 31. rst match { Use M 2 disk blocks as blocking unit for outer relation, where M = db buer size in blocks; use remaining two blocks to buer inner relation and output. Reduces number of scans of inner relation greatly. { Scan inner loop forward and backward alternately, to make use of blocks remaining in buer (with LRU replacement strategy) { Use index on inner relation, if available . . . Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 32. ECS-165A WQ'11 151 Index Nested-Loop Join If an index is available on the inner loop's join attribute and join is an equi-join or natural join, more ecient index lookups can replace
  • 33. le scans. It is even possible (reasonable) to construct index just to compute a join. For each tuple tR in the outer relation R, use the index to lookup tuples in S that satisfy join condition with tR Worst case: db buer has space for only one page of R and one page of the index associated with S: { BR disk accesses to read R, and for each tuple in R, perform index lookup on S. { Cost of the join: BR + NR c, where c is the cost of a single selection on S using the join condition. If indexes are available on both R and S, use the one with the fewer tuples as the outer relation. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 34. ECS-165A WQ'11 152 Example: { Compute CUSTOMERS 1 ORDERS, with CUSTOMERS as the outer relation. { Let ORDERS have a primary B+-tree index on the join- attribute CName, which contains 20 entries per index node { Since ORDERS has 10,000 tuples, the height of the tree is 4, and one more access is needed to
  • 35. nd the actual data records (based on tuple identi
  • 36. er). { Since NCUSTOMERS is 5,000, the total cost is 250 + 5000 5 = 25,250 disk accesses. { This cost is lower than the 100,250 accesses needed for a block nested-loop join. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 37. ECS-165A WQ'11 153 Sort-Merge Join Basic idea:
  • 38. rst sort both relations on join attribute (if not already sorted this way) Join steps are similar to the merge stage in the external sort-merge algorithm (discussed later) Every pair with same value on join attribute must be matched. values of join attributes 1 1 2 2 2 3 3 3 4 5 5 Relation R Relation S If no repeated join attribute values, each tuple needs to be read only once. As a result, each block is read only once. Thus, the number of block accesses is BR + BS (plus the cost of sorting, if relations are unsorted). Worst case: all join attribute values are the same. Then the number of block accesses is BR + BR BS. If one relation is sorted and the other has a secondary B+-tree index on the join attribute, a hybrid merge-join is possible. The sorted relation is merged with the leaf node entries of the B+-tree. The result is sorted on the addresses (rids) of the unsorted relation's tuples, and then the addresses can be replaced by the actual tuples eciently. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 39. ECS-165A WQ'11 154 Hash-Join { only applicable in case of equi-join or natural join { a hash function is used to partition tuples of both relations into sets that have the same hash value on the join attribute Partitioning Phase: 2 (BR + BS) block accesses Matching Phase: BR + BS block accesses (under the assumption that one partition of each relation
  • 40. ts into the database buer) Cost Estimates for other Operations Sorting: If whole relation
  • 41. ts into db buer ; quick-sort Or, build index on the relation, and use index to read relation in sorted order. Relation that does not
  • 42. t into db buer;external sort-merge 1. Phase: Create runs by sorting portions of the relation in db buer 2. Phase: Read runs from disk and merge runs in sort order Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 43. ECS-165A WQ'11 155 Duplicate Elimination: Sorting: remove all but one copy of tuples having identical value(s) on projection attribute(s) Hashing: partition relation using hash function on projection attribute(s); then read partitions into buer and create in-memory hash index; tuple is only inserted into index if not already present Set Operations: Sorting or hashing Hashing: Partition both relations using the same hash function; use in-memory index for partitions Ri R [ S: if tuple in Ri or in Si, add tuple to result : if tuple in Ri and in Si, . . . : if tuple in Ri and not in Si, . . . Grouping and aggregation: Compute groups via sorting or hashing. Hashing: while groups (partitions) are built, compute partial aggregate values (for group attribute A, V(A,R) tuples to store values) Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 44. ECS-165A WQ'11 156 Evaluation of Expressions Strategy 1: materialization. Evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized in temporary relations to evaluate next level operation(s). o Price 5000 CName CUSTOMERS ORDERS OFFERS First compute and store Price5000(OFFERS); then compute and store join of CUSTOMERS and ORDERS;
  • 45. nally, join the two materialized relations and project on to CName. Strategy 2: pipelining. evaluate several operations simultaneously, and pass the result (tuple- or block-wise) on to the next operation. In the example above, once a tuple from OFFERS satisfying selection condition has been found, pass it on to the join. Similarly, don't store result of (
  • 46. nal) join, but pass tuples directly to projection. Much cheaper than materialization, because temporary relations are not generated and stored on disk. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 47. ECS-165A WQ'11 157 Evaluation of Expressions (cont.) Pipelining is not always possible, e.g., for all operations that include sorting (blocking operation). Pipelining can be executed in either demand driven or producer driven fashion. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 48. ECS-165A WQ'11 158 Transformation of Relational Expressions Generating a query-evaluation plan for an expression of the relational algebra involves two steps: 1. generate logically equivalent expressions 2. annotate these evaluation plans by speci
  • 49. c algorithms and access structures to get alternative query plans Use equivalence rules to transform a relational algebra expression into an equivalent one. Based on estimated cost, the most cost-eective annotated plan is selected for evaluation. The process is called cost-based query optimization. Equivalence of Expressions Result relations generated by two equivalent relational algebra expressions have the same set of attributes and contain the same set of tuples, although their attributes may be ordered dierently. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 50. ECS-165A WQ'11 159 Equivalence Rules (for expressions E, E1, E2, conditions Fi) Applying distribution and commutativity of relational algebra operations 1. F1(F2(E)) F1^F2(E) 2. F(E1 [[; ;] E2) F(E1) [[; ;] F(E2) 3. F(E1 E2) F0(F1(E1) F2(E2)); F F0 ^ F1 ^ F2, Fi contains only attributes of Ei; i = 1; 2. 4. A=B(E1 E2) E1 1 A=B E2 5. A(E1 [[; ;] E2)6 A(E1) [[; ;] A(E2) 6. A(E1 E2) A1(E1) A2(E2), with Ai = A f attributes in Eig; i = 1; 2. 7. E1 [[; ] E2 E2 [[; ] E1 (E1 [ E2) [ E3 E1 [ (E2 [ E3) (the analogous holds for ) 8. E1 E2 A1;A2(E2 E1) (E1 E2) E3 E1 (E2 E3) (E1 E2) E3 ((E1 E3) E2) 9. E1 1 E2 E2 1 E1 (E1 1 E2) 1 E3 E1 1 (E2 1 E3) The application of equivalence rules to a relational algebra expression is also sometimes called algebraic optimization. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 51. ECS-165A WQ'11 160 Examples: Selection: { Find the name of all customers who have ordered a product for more than $5,000 from a supplier located in Davis. CName(SAddress like 0%Davis%0 ^ Price5000 (CUSTOMERS 1 (ORDERS 1 (OFFERS 1 SUPPLIERS)))) Perform selection as early as possible (but take existing indexes on relations into account) CName(CUSTOMERS 1 (ORDERS 1 (Price5000(OFFERS) 1 (SAddress like 0%Davis%0(SUPPLIERS))))) Projection: { CName;account(CUSTOMERS 1 Prodname=0CDROM0(ORDERS)) Reduce the size of argument relation in join CName;account(CUSTOMERS 1 CName(Prodname=0CDROM0(ORDERS))) Projection should not be shifted before selections, because minimizing the number of tuples in general leads to more ecient plans than reducing the size of tuples. Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 52. ECS-165A WQ'11 161 Join Ordering For relations R1, R2, R3, (R1 1 R2) 1 R3 R1 1 (R2 1 R3) If (R2 1 R3) is quite large and (R1 1 R2) is small, we choose (R1 1 R2) 1 R3 so that a smaller temporary relation is computed and materialized Example: List the name of all customers who have ordered a product from a supplier located in Davis. CName(SAddress like 0%Davis%0 (SUPPLIERS 1 ORDERS 1 CUSTOMERS)) ORDERS 1 CUSTOMERS is likely to be a large relation. Because it is likely that only a small fraction of suppliers are from Davis, we compute the join SAddress like 0%Davis%0(SUPPLIERS 1 ORDERS)
  • 53. rst. Summary of Algebraic Optimization Rules 1. Perform selection as early as possible 2. Replace Cartesian Product by join whenever possible 3. Project out useless attributes early. 4. If there are several joins, perform most restrictive join
  • 54. rst Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 55. ECS-165A WQ'11 162 Evaluation Plan An evaluation plan for a query exactly de
  • 56. nes what algorithm is used for each operation, which access structures are used (tables, indexes, clusters), and how the execution of the operations is coordinated. Example of Annotated Evaluation Plan Query: List the name of all customers who have ordered a product that costs more than $5,000. Assume that for both CUSTOMERS and ORDERS an index on CName exists: I1(CName; CUSTOMERS); I2(CName; ORDERS). CName (sort to remove duplicates) o get tuples for tids of I index−nested loop join block nested−loop join Price 5000 I1 (CName, CUSTOMERS) 2 I (CName, ORDERS) ORDERS 2 pipeline pipeline full table scan Dept. of Computer Science UC Davis 8. Query Processing and Optimization
  • 57. ECS-165A WQ'11 163 Choice of an Evaluation Plan Must consider interaction of evaluation techniques when choosing evaluation plan: choosing the algorithm with the least cost for each operation independently may not yield the best overall algorithm. Practical query optimizers incorporate elements of the following two optimization approaches: { Cost-based: enumerate all the plans and choose the best plan in a cost-based fashion. { Rule-based: Use rules (heuristics) to choose plan. Remarks on cost-based optimization: { Finding a join order for R1 1 R2 1 : : : 1 Rn: n! dierent left-deep join orders For example, for n = 9, the number is 362880. ; use of dynamic programming techniques Heuristic (or rule-based) optimization transforms a given query tree by using a set of rules that typically (but not in all cases) improve execution performance: { Perform selection early (reduces number of tuples) { Perform projection early (reduces number of attributes) { Perform most restrictive selection and join operations before other similar operations. Dept. of Computer Science UC Davis 8. Query Processing and Optimization