Databases for Data Science
Dr. Nalini N
SCOPE
VIT
Dr.Nalini N, SCOPE, VIT, Vellore
Introduction
• The role of a data scientist is to turn raw data into actionable
insights.
• Much of the world's raw data, such as electronic medical records and
customer transaction histories, lives in organized collections of tables
called relational databases.
• Therefore, to be an effective data scientist, you must know how to
wrangle and extract data from these databases using a domain-
specific language called SQL (Structured Query Language).
Relational databases
• Relational database - collection of tables.
• A table is just a set of rows and columns which represents exactly one type of entity.
• Each row, or record, of a table contains information about a single entity; e.g. in a table
representing employees, each row represents a single person.
• Each column, or field, of a table contains a single attribute for all rows in the table; e.g.
in a table representing employees, we might have a column containing the first and last
names of all employees.
SQL
• Although SQL can be used to create and modify databases, the focus of this
course is querying databases.
• A query is a request for data from a database table, or combination of
tables.
• Querying is an essential skill for a data scientist, since the data you
need for your analyses will often live in databases.
SQL Environment
• Catalog
• A set of schemas that constitute the description of a database
• Schema
• The structure that contains descriptions of objects created by a user (base tables, views, constraints)
• Data Definition Language (DDL)
• Commands that define a database, including creating, altering, and dropping tables and establishing
constraints
• Data Manipulation Language (DML)
• Commands that maintain and query a database
• Data Control Language (DCL)
• Commands that control a database, including administering privileges and committing data
Figure: DDL, DML, DCL, and the database development process
Copyright © 2014 Pearson Education, Inc.
Simple SQL Query
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT *
FROM Product
WHERE category = 'Gadgets'
Result (“selection”):
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
Simple SQL Query
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT PName, Price, Manufacturer
FROM Product
WHERE Price > 100
Result (“selection” and “projection”):
PName        Price    Manufacturer
SingleTouch  $149.99  Canon
MultiTouch   $203.99  Hitachi
Eliminating Duplicates - DISTINCT keyword
If your data includes duplicate values and you want to return only the
unique values from a column, you can use the DISTINCT keyword.
SELECT DISTINCT category
FROM Product
Category
Gadgets
Photography
Household
Compare to:
SELECT category
FROM Product
Category
Gadgets
Gadgets
Photography
Household
The OR keyword
• If you want to select rows based on multiple conditions where
some, but not all, of the conditions need to be met, you can use the OR
keyword.
• When combining AND and OR, enclose the individual clauses in
parentheses; AND is evaluated before OR, so omitting them can change the result.
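A minimal sketch of why the parentheses matter, run through Python's built-in sqlite3 module. The Product rows are the ones from the earlier slides; the price threshold of 25 is an assumption chosen for illustration.

```python
import sqlite3

# In-memory copy of the Product table from the earlier slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (pname TEXT, price REAL, category TEXT, manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# Parenthesized: the price test applies to both categories.
with_parens = [r[0] for r in conn.execute("""
    SELECT pname FROM Product
    WHERE (category = 'Gadgets' OR category = 'Photography') AND price > 25
    ORDER BY pname
""")]

# Unparenthesized: AND binds first, so the price test applies only to Photography.
without_parens = [r[0] for r in conn.execute("""
    SELECT pname FROM Product
    WHERE category = 'Gadgets' OR category = 'Photography' AND price > 25
    ORDER BY pname
""")]

print(with_parens)     # ['Powergizmo', 'SingleTouch']
print(without_parens)  # ['Gizmo', 'Powergizmo', 'SingleTouch']
```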
The IN keyword
• If you want to select rows based on three or more different values
from a single column, a WHERE clause built from repeated OR conditions
can become unwieldy.
• This is where the IN keyword is useful.
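As a rough illustration, the sketch below compares IN with the equivalent chain of OR conditions, using Python's sqlite3 module. The trimmed-down Product table and the category 'Toys' (which has no rows) are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down Product table: only the columns this example needs.
conn.execute("CREATE TABLE Product (pname TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [
        ("Gizmo", "Gadgets"),
        ("Powergizmo", "Gadgets"),
        ("SingleTouch", "Photography"),
        ("MultiTouch", "Household"),
    ],
)

# One IN list instead of three OR-ed equality tests.
in_rows = [r[0] for r in conn.execute("""
    SELECT pname FROM Product
    WHERE category IN ('Gadgets', 'Household', 'Toys')
    ORDER BY pname
""")]

or_rows = [r[0] for r in conn.execute("""
    SELECT pname FROM Product
    WHERE category = 'Gadgets' OR category = 'Household' OR category = 'Toys'
    ORDER BY pname
""")]

print(in_rows)  # ['Gizmo', 'MultiTouch', 'Powergizmo']
```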
The BETWEEN keyword
• If you want to get the records where a value, such as the average weight, lies
between two values, you don't have to combine >= and <=.
• Instead, you can use BETWEEN, which includes both endpoints.
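A short sketch showing that BETWEEN includes both endpoints, again via Python's sqlite3 module. The whole-number prices are an assumption (rounded from the Product slides) so that the endpoint comparisons are exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with whole-number prices.
conn.execute("CREATE TABLE Product (pname TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [("Gizmo", 20), ("Powergizmo", 30), ("SingleTouch", 150), ("MultiTouch", 204)],
)

# BETWEEN 30 AND 150 is the same as price >= 30 AND price <= 150.
between = [r[0] for r in conn.execute(
    "SELECT pname FROM Product WHERE price BETWEEN 30 AND 150 ORDER BY price"
)]

# Strict comparisons exclude the rows that sit exactly on the endpoints.
strict = [r[0] for r in conn.execute(
    "SELECT pname FROM Product WHERE price > 30 AND price < 150 ORDER BY price"
)]

print(between)  # ['Powergizmo', 'SingleTouch']
print(strict)   # []
```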
NULL and IS NULL
• NULL represents a missing or unknown value.
• You can check for NULL values using the expression IS NULL.
• IS NULL is useful when combined with the WHERE keyword to figure
out what data you’re missing.
• To filter out missing values so that you only get results
which are not NULL, use IS NOT NULL.
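The sketch below illustrates IS NULL and IS NOT NULL with Python's sqlite3 module. The people table and its rows are hypothetical. It also shows that comparing with = NULL matches nothing, which is why the IS NULL form exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; Python's None is stored as SQL NULL.
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "1980-01-01"), ("Bob", None), ("Cy", None)],
)

# How many birthdates are missing?
missing = conn.execute(
    "SELECT COUNT(*) FROM people WHERE birthdate IS NULL"
).fetchone()[0]

# Only the rows with a known birthdate.
known = [r[0] for r in conn.execute(
    "SELECT name FROM people WHERE birthdate IS NOT NULL"
)]

# '= NULL' never evaluates to true, so this counts no rows.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM people WHERE birthdate = NULL"
).fetchone()[0]

print(missing, known, equals_null)  # 2 ['Ann'] 0
```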
The LIKE and NOT LIKE keywords
• When filtering by text, the = operator in a WHERE clause only matches
text that equals your search criteria exactly.
• However, in the real world, you often want to search for a pattern
rather than a specific match.
• This is where the LIKE keyword comes in: LIKE allows you to search
for a pattern in a column.
• A LIKE pattern uses wildcards as placeholders for other characters.
There are two wildcards you can use with LIKE.
• The % wildcard matches zero, one, or many characters; e.g. the pattern
'Data%' matches ‘Data’, ‘DataC’, ‘DataCamp’, ‘DataMind’, and so on.
• The _ wildcard matches exactly one character; e.g. the pattern 'DataC_mp'
matches companies like ‘DataCamp’, ‘DataComp’, and so on.
• You can also use the NOT LIKE operator to find records that don’t match
the pattern you specify.
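A minimal sketch of the two wildcards and NOT LIKE, using Python's sqlite3 module. The companies table and the patterns 'Data%' and 'DataC_mp' are assumptions built from the company names mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of company names.
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [("Data",), ("DataC",), ("DataCamp",), ("DataComp",), ("DataMind",), ("Other",)],
)

def names(pattern_sql):
    """Run a WHERE condition against companies and return the matching names."""
    query = f"SELECT name FROM companies WHERE {pattern_sql} ORDER BY name"
    return [r[0] for r in conn.execute(query)]

percent = names("name LIKE 'Data%'")         # % matches zero or more characters
underscore = names("name LIKE 'DataC_mp'")   # _ matches exactly one character
negated = names("name NOT LIKE 'Data%'")     # everything that does not match

print(percent)     # ['Data', 'DataC', 'DataCamp', 'DataComp', 'DataMind']
print(underscore)  # ['DataCamp', 'DataComp']
print(negated)     # ['Other']
```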
Aggregation
SELECT count(*)
FROM Product
WHERE year > 1995
SELECT avg(price)
FROM Product
WHERE maker = 'Toyota'
• SQL supports several aggregation operations:
sum, count, min, max, avg
• Except count, all aggregations apply to a single attribute
• An aggregate may not appear in the WHERE clause.
Aggregation: Count
You can count the number of rows in your table by using the COUNT keyword.
SELECT Count(category)
FROM Product
WHERE year > 1995
This is the same as Count(*) when category contains no NULLs: duplicates are counted.
We probably want:
SELECT Count(DISTINCT category)
FROM Product
WHERE year > 1995
Simple Aggregations
Purchase
Product  Date   Price  Quantity
Bagel    10/21  1      20
Banana   10/3   0.5    10
Banana   10/10  1      10
Bagel    10/25  1.50   20
SELECT Sum(price * quantity)
FROM Purchase
WHERE product = 'Bagel'
Result: 50 (= 20 + 30)
Ordering the Results
• The ORDER BY keyword sorts the result rows by the values of one or more
columns, in either ascending or descending order.
• By default, it sorts in ascending order. Use the DESC
keyword to sort in descending order.
• Ties are broken by the second attribute on the ORDER BY list, and so on.
• Example: ORDER BY price, pname on the Product table orders by price; if some rows
have the same price, it orders them by pname. In the rows below, the two products tie on
price and are ordered by name:
Price  Product
500    Apple
500    Orange
Ordering the Results
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
TouchEX      $149.99  Photography  Canon
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
SELECT pname, price, manufacturer
FROM Product
WHERE category = 'Photography' AND price > 50
ORDER BY price, pname
Result:
PName        Price    Manufacturer
SingleTouch  $149.99  Canon
TouchEX      $149.99  Canon
Product
PName        Price    Category     Manufacturer
Gizmo        $19.99   Gadgets      GizmoWorks
Powergizmo   $29.99   Gadgets      GizmoWorks
TouchEX      $149.99  Photography  Canon
SingleTouch  $149.99  Photography  Canon
MultiTouch   $203.99  Household    Hitachi
What does each of the following queries return?
SELECT Category
FROM Product
ORDER BY PName
?
SELECT DISTINCT category
FROM Product
ORDER BY category
?
SELECT DISTINCT category
FROM Product
ORDER BY Pname DESC
?
Grouping and Aggregation
Purchase(product, date, price, quantity)
Find total sales after 10/1/2005 per product.
SELECT product, Sum(price*quantity) AS TotalSales
FROM Purchase
WHERE date > '10/1/2005'
GROUP BY product
1. Compute the FROM and WHERE clauses.
2. Group by the attributes in the GROUP BY clause.
3. Compute the SELECT clause: grouped attributes and aggregates.
Steps 1 and 2: FROM-WHERE-GROUP BY
Product  Date   Price  Quantity
Bagel    10/21  1      20
Bagel    10/25  1.50   20
Banana   10/3   0.5    10
Banana   10/10  1      10
Step 3: SELECT
SELECT product, Sum(price*quantity) AS TotalSales
FROM Purchase
WHERE date > '10/1/2005'
GROUP BY product
Product  Date   Price  Quantity
Bagel    10/21  1      20
Bagel    10/25  1.50   20
Banana   10/3   0.5    10
Banana   10/10  1      10
Result:
Product  TotalSales
Bagel    50
Banana   15
GROUP BY vs. Nested Queries
SELECT product, Sum(price*quantity) AS TotalSales
FROM Purchase
WHERE date > '10/1/2005'
GROUP BY product
The same result written with a correlated subquery:
SELECT DISTINCT x.product, (SELECT Sum(y.price*y.quantity)
FROM Purchase y
WHERE x.product = y.product
AND y.date > '10/1/2005')
AS TotalSales
FROM Purchase x
WHERE x.date > '10/1/2005'
Qualifying Results by Categories Using the HAVING Clause
• The HAVING clause was added to SQL because the WHERE keyword
cannot be used with aggregate functions. It is used with GROUP BY.
• HAVING operates on groups (categories), whereas WHERE operates on
individual rows. In the query below, only those groups whose total quantity
is greater than 5 are included in the final result.
SELECT product, count(quantity) AS TotalSales
FROM Purchase
GROUP BY product
HAVING Sum(quantity) > 5
HAVING is used with aggregations to filter the results returned by the aggregation.
It is similar to WHERE, except that WHERE removes rows before the aggregation
function is applied, and HAVING removes groups after aggregation has
occurred.
SELECT sales_agent, COUNT(sales_pipeline.close_value) AS `number won`
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
HAVING COUNT(sales_pipeline.close_value) > 200
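A small sketch of the difference between WHERE and HAVING, run with Python's sqlite3 module on the Purchase rows from the earlier slides. The year 2005 in the dates and the thresholds (price >= 1, total quantity > 15) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Purchase (product TEXT, date TEXT, price REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?, ?)",
    [
        ("Bagel", "2005-10-21", 1.0, 20),
        ("Banana", "2005-10-03", 0.5, 10),
        ("Banana", "2005-10-10", 1.0, 10),
        ("Bagel", "2005-10-25", 1.5, 20),
    ],
)

# HAVING alone: every row is aggregated, then groups are filtered.
having_only = conn.execute("""
    SELECT product, SUM(quantity) FROM Purchase
    GROUP BY product
    HAVING SUM(quantity) > 15
    ORDER BY product
""").fetchall()

# WHERE first drops the 0.5-priced Banana row, so Banana's total falls to 10
# and the group no longer passes the HAVING test.
where_then_having = conn.execute("""
    SELECT product, SUM(quantity) FROM Purchase
    WHERE price >= 1
    GROUP BY product
    HAVING SUM(quantity) > 15
    ORDER BY product
""").fetchall()

print(having_only)        # [('Bagel', 40), ('Banana', 20)]
print(where_then_having)  # [('Bagel', 40)]
```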
Summary
• Data filtering process consists of different
strategies for refining and reducing datasets.
• Clauses of the SELECT statement:
• SELECT
• List the columns (and expressions) to be returned from the query
• FROM
• Indicate the table(s) or view(s) from which data will be obtained
• WHERE
• Indicate the conditions under which a row will be included in the result
• GROUP BY
• Indicate categorization of results
• HAVING
• Indicate the conditions under which a category (group) will be included
• ORDER BY
• Sorts the result according to specified criteria
SQL statement processing order: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
Basic Statistics
• Mean - SELECT Avg(ColumnName) AS MEAN FROM TableName
• Mode - SELECT TOP 1 ColumnName FROM TableName GROUP BY
ColumnName ORDER BY COUNT(*) DESC
• Median
Code to calculate the median of salary column
Result: median value 5,500
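• Calculate Median Value Using PERCENTILE_CONT
PERCENTILE_CONT(percentile) WITHIN GROUP (ORDER BY column_name) OVER ()
A percentile of 0.5 gives the median. This is the SQL Server form, in which the OVER clause may also contain PARTITION BY.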
from stock;
Code to calculate the median of a column (here Marks)
• Calculate Median Value Using TOP 50 PERCENT (SQL Server syntax)
SELECT ( (SELECT MAX(Marks) FROM (SELECT TOP 50 PERCENT Marks
FROM student_details ORDER BY Marks) AS BottomHalf)
+ (SELECT MIN(Marks) FROM (SELECT TOP 50 PERCENT Marks
FROM student_details ORDER BY Marks DESC) AS TopHalf) ) / 2.0 AS
Median
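TOP and PERCENTILE_CONT are not available in every database. As one portable alternative (a swapped-in technique, not the slide's code), the sketch below computes the median with ROW_NUMBER() and COUNT(*) OVER () in SQLite through Python's sqlite3 module. The employees table, the salary values, and the helper median_salary are hypothetical, and window functions need SQLite 3.25 or later.

```python
import sqlite3

def median_salary(salaries):
    """Median via window functions: average the middle one or two rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?)", [(s,) for s in salaries])
    # rn numbers the rows in salary order; n is the total row count.
    # With integer division, (n + 1) / 2 and (n + 2) / 2 select the single
    # middle row when n is odd and the two middle rows when n is even.
    row = conn.execute("""
        SELECT AVG(salary)
        FROM (
            SELECT salary,
                   ROW_NUMBER() OVER (ORDER BY salary) AS rn,
                   COUNT(*) OVER () AS n
            FROM employees
        ) AS ranked
        WHERE rn IN ((n + 1) / 2, (n + 2) / 2)
    """).fetchone()
    conn.close()
    return row[0]

even_case = median_salary([4000, 5000, 6000, 7000])
odd_case = median_salary([4000, 5000, 6000, 7000, 8000])
print(even_case)  # 5500.0
print(odd_case)   # 6000.0
```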
Data Munging
• Data munging (or wrangling) is the phase of data transformation.
• It transforms data into various states so that the data is simpler to work with and
understand.
• The transformation may involve manually converting, merging, or updating
the data into a certain format.
• Data is transformed and mapped from one format to another to make
it more valuable for a variety of analytics tools.
Filtering
• Data filtering consists of different strategies for refining and
reducing datasets.
• Rows are filtered using the WHERE clause.
• FILTER is a modifier used on an aggregate function to limit the values
used in that aggregation. All the columns in the SELECT list that
aren’t aggregated should be specified in a GROUP BY clause in the
query.
FILTER Keyword
• FILTER is more flexible than WHERE because you can use more than
one FILTER modifier in an aggregate query, while you can use only
one WHERE clause.
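A minimal sketch of two FILTER modifiers in one aggregate query, using Python's sqlite3 module and the Purchase rows from the earlier slides. The price split at 1 and the column aliases are assumptions, and FILTER on aggregates needs SQLite 3.30 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (product TEXT, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [("Bagel", 1.0, 20), ("Banana", 0.5, 10), ("Banana", 1.0, 10), ("Bagel", 1.5, 20)],
)

# Each aggregate gets its own FILTER; a single WHERE clause could not
# express both conditions in one pass.
rows = conn.execute("""
    SELECT product,
           SUM(quantity) FILTER (WHERE price >= 1) AS qty_at_or_above_1,
           SUM(quantity) FILTER (WHERE price < 1)  AS qty_below_1
    FROM Purchase
    GROUP BY product
    ORDER BY product
""").fetchall()

# SUM over no matching rows is NULL, which Python shows as None.
print(rows)  # [('Bagel', 40, None), ('Banana', 10, 10)]
```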
Window Functions
• A window function performs a calculation over a set of rows and returns a value for
each row of the query.
• The term window refers to the set of rows on which the function operates;
the returned value is calculated from the values of the rows in that window.
• Window functions apply aggregate and ranking functions over a
particular window (set of rows). The OVER clause is used with window
functions to define that window.
• The OVER clause does two things:
• Partitions rows into sets of rows (using the PARTITION BY clause).
• Orders rows within those partitions (using the ORDER BY clause).
Window Functions - ROW_NUMBER()
• ROW_NUMBER() provides a guaranteed unique, ascending integer
value which starts from 1 and continues through the end of the set.
Window Functions - RANK()
• RANK() provides an ascending integer value which starts from 1, but it
is not guaranteed to be unique.
• Instead, any ties in the window get the same value, and the
next distinct entry gets the value ROW_NUMBER() would have given it.
• For example, if the second and third entries are tied, both get a
rank of 2 and the fourth entry gets a rank of 4.
Window Functions - DENSE_RANK()
• DENSE_RANK() behaves like RANK(), except that it does not skip
numbers even with ties.
• If the second and third entries are tied, both will get a dense rank of 2
and the fourth entry will get a dense rank of 3.
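The three ranking functions can be compared side by side. The sketch below uses Python's sqlite3 module (window functions need SQLite 3.25 or later); the scores table is hypothetical, and name is added to the ROW_NUMBER() ordering only as an assumed tie-breaker so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical scores; B and C are tied for second place.
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 90), ("B", 85), ("C", 85), ("D", 70)],
)

rows = conn.execute("""
    SELECT name,
           score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in rows:
    print(row)
# ('A', 90, 1, 1, 1)
# ('B', 85, 2, 2, 2)
# ('C', 85, 3, 2, 2)   <- tie: same RANK and DENSE_RANK as B
# ('D', 70, 4, 4, 3)   <- RANK skips 3, DENSE_RANK does not
```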
Window Functions - PERCENT_RANK()
• The PERCENT_RANK() function calculates the percentile rank of
each row.
• The percentile rank ranges from zero to one. For each
row, PERCENT_RANK() calculates it using the
following formula:
(rank - 1) / (total_number_rows - 1)
• In this formula, rank is the rank of the row and
total_number_rows is the number of rows being evaluated. The
first row always gets a percentile rank of 0.
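As a check on the formula (rank - 1) / (total_number_rows - 1), the sketch below computes PERCENT_RANK() next to the same value derived from RANK() and COUNT(*) OVER (), using Python's sqlite3 module. The scores table is hypothetical, with five rows chosen so the results are exact quarters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical scores; B and C are tied.
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 90), ("B", 85), ("C", 85), ("D", 70), ("E", 60)],
)

rows = conn.execute("""
    SELECT name,
           PERCENT_RANK() OVER (ORDER BY score DESC) AS pct_rank,
           -- the same value spelled out: (rank - 1) / (total rows - 1)
           (RANK() OVER (ORDER BY score DESC) - 1.0)
               / (COUNT(*) OVER () - 1)              AS by_formula
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in rows:
    print(row)
# ('A', 0.0, 0.0)
# ('B', 0.25, 0.25)
# ('C', 0.25, 0.25)
# ('D', 0.75, 0.75)
# ('E', 1.0, 1.0)
```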
Window Functions - NTILE()
• The SQL NTILE() function partitions a logically ordered
dataset into the number of buckets specified by its
argument and assigns a bucket number to each row.
• NTILE divides the rows into roughly equal-sized buckets.
• Suppose you have 20 rows:
• NTILE(2) gives 2 buckets with 10 rows each.
• NTILE(3) gives 2 buckets with 7 rows and 1 bucket
with 6 rows.
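The 20-row example can be reproduced with a short sketch in Python's sqlite3 module. The table t, its id column, and the helper bucket_sizes are hypothetical names used only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with 20 rows, ids 1 to 20.
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 21)])

def bucket_sizes(n):
    """Return (bucket number, row count) pairs for NTILE(n) over the 20 rows."""
    query = f"""
        SELECT bucket, COUNT(*)
        FROM (SELECT NTILE({int(n)}) OVER (ORDER BY id) AS bucket FROM t) AS tiled
        GROUP BY bucket
        ORDER BY bucket
    """
    return conn.execute(query).fetchall()

two = bucket_sizes(2)
three = bucket_sizes(3)
print(two)    # [(1, 10), (2, 10)]
print(three)  # [(1, 7), (2, 7), (3, 6)]
```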
Window Functions - LEAD()
• LEAD() gives access to the data of the following row, or of a row
a given number of rows further on.
LEAD(return_value, offset, default)
• return_value - The value to return from the row at the specified offset. It must
be a single value.
• offset - The number of rows forward from the current row from which to access data. The
offset should be a positive integer. If you don’t specify it, the default offset is 1.
• default - The value LEAD() returns if the offset goes beyond the scope of the partition. If not specified,
the default is NULL.
• As no next row is available for the last row, LEAD() returns NULL for the last row.
• The LEAD() function is also useful for calculating the difference between the value of
the current row and the value of the following row, for example the difference between
the salaries of people in the same department.
Window Functions - LAG()
• LAG() accesses a previous row’s data based on a defined offset value.
• It works like LEAD(): LEAD() accesses the values of subsequent rows, whereas
LAG() accesses previous rows.
• It is useful for comparing the current row’s value with the previous row’s value.
• As no previous row is available for the first row in each partition (for example,
each department), LAG() returns NULL for it.
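A minimal sketch of LAG() and LEAD() within departments, including the salary difference to the next row, using Python's sqlite3 module. The employees table, its rows, and the column aliases are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical employees; salaries are in thousands.
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "IT", 50), ("Bob", "IT", 60), ("Cy", "IT", 75),
        ("Dee", "HR", 40), ("Eve", "HR", 45),
    ],
)

rows = conn.execute("""
    SELECT dept, name, salary,
           LAG(salary)  OVER (PARTITION BY dept ORDER BY salary) AS prev_salary,
           LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS next_salary,
           LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) - salary
               AS gap_to_next
    FROM employees
    ORDER BY dept, salary
""").fetchall()

for row in rows:
    print(row)
# ('HR', 'Dee', 40, None, 45, 5)      <- first row of HR: no previous row
# ('HR', 'Eve', 45, 40, None, None)   <- last row of HR: no next row
# ('IT', 'Ann', 50, None, 60, 10)
# ('IT', 'Bob', 60, 50, 75, 15)
# ('IT', 'Cy', 75, 60, None, None)
```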
Window Functions - CUME_DIST()
• The SQL window function CUME_DIST() returns the cumulative
distribution of a value within a partition of values.
• The cumulative distribution of a value is the number of
rows with values less than or equal to (<=) the current row’s value,
divided by the total number of rows:
N / total_rows
• where N is the number of rows with a value less than or equal to
the current row’s value, and total_rows is the number of rows in the
group or result set.
• The function returns a value greater than 0 and at most 1.
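A short sketch of CUME_DIST() on four hypothetical values, run with Python's sqlite3 module. Tied rows share the same result because each counts all rows less than or equal to its value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the value 20 appears twice.
conn.execute("CREATE TABLE t (score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (20,), (40,)])

rows = conn.execute("""
    SELECT score, CUME_DIST() OVER (ORDER BY score) AS cume
    FROM t
    ORDER BY score
""").fetchall()

print(rows)
# [(10, 0.25), (20, 0.75), (20, 0.75), (40, 1.0)]
# 10: 1 of 4 rows <= 10; 20: 3 of 4 rows <= 20; 40: 4 of 4 rows <= 40
```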
Window Functions – Aggregation MIN() and MAX()
Window Functions – Aggregation SUM() and AVG()
Window Functions - FIRST_VALUE and LAST_VALUE
• FIRST_VALUE() returns the first value in an ordered
group of a result set or window frame.
• LAST_VALUE() returns the last value in an ordered
group of a result set or window frame.
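The window frame matters for LAST_VALUE(). In the sketch below (Python's sqlite3 module, hypothetical employees table), the default frame ends at the current row, so LAST_VALUE() returns the current row's own value unless the frame is widened to the whole partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical employees; salaries are in thousands.
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "IT", 50), ("Bob", "IT", 60), ("Cy", "IT", 75),
        ("Dee", "HR", 40), ("Eve", "HR", 45),
    ],
)

rows = conn.execute("""
    SELECT dept, name,
           FIRST_VALUE(name) OVER (PARTITION BY dept ORDER BY salary)
               AS lowest_paid,
           -- default frame stops at the current row
           LAST_VALUE(name) OVER (PARTITION BY dept ORDER BY salary)
               AS last_default_frame,
           -- explicit frame covering the whole partition
           LAST_VALUE(name) OVER (
               PARTITION BY dept ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS highest_paid
    FROM employees
    ORDER BY dept, salary
""").fetchall()

for row in rows:
    print(row)
# ('HR', 'Dee', 'Dee', 'Dee', 'Eve')
# ('HR', 'Eve', 'Dee', 'Eve', 'Eve')
# ('IT', 'Ann', 'Ann', 'Ann', 'Cy')
# ('IT', 'Bob', 'Ann', 'Bob', 'Cy')
# ('IT', 'Cy', 'Ann', 'Cy', 'Cy')
```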
Ordered Data & Join
• ORDER BY ... DESC
• Join
• Joins are commands used to combine rows from two or more tables.
• The tables are combined based on a related column between them.
• Inner, left, right, and full are the four basic types of SQL joins.
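A minimal sketch contrasting an inner join with a left join, using Python's sqlite3 module. The customers and orders tables are hypothetical. Right and full joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables related through orders.customer_id = customers.id.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cy")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 100), (11, 1, 50), (12, 2, 70)])

# Inner join: only customers that have at least one matching order.
inner = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# Left join: every customer; order columns are NULL where nothing matches.
left = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

print(inner)  # [('Ann', 50), ('Ann', 100), ('Bob', 70)]
print(left)   # [('Ann', 50), ('Ann', 100), ('Bob', 70), ('Cy', None)]
```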
Preparing Data for Analytics Tool
• One of the primary steps in data science is cleaning
the dataset.
• Most of the time spent by a data scientist or analyst goes into
preparing datasets for use in analysis. SQL can help to speed up this
step.
• Various SQL queries can be used to clean, update, and filter data, by
eliminating redundant and
• This can be done with SQL constructs such as CASE WHEN,
COALESCE, NULLIF, LEAST/GREATEST, casting, and DISTINCT.
CASE WHEN
• CASE WHEN - The CASE expression goes through the conditions specified with
WHEN clauses and returns the value for the first condition that is met.
• Suppose we fetch all the data of a sales table and want to add an extra
column, labelled summary, that categorizes sales into More, Less, and Avg.
This column can be created using a CASE expression.
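The summary column described above might look like the following sketch, run with Python's sqlite3 module. The sales table, its amounts, and the thresholds (above 300 is More, below 100 is Less) are assumptions, since the original table is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sales table.
conn.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 500), (2, 150), (3, 50)])

# WHEN branches are tested in order; the first true one supplies the value,
# and ELSE covers everything that matched no branch.
rows = conn.execute("""
    SELECT id, amount,
           CASE
               WHEN amount > 300 THEN 'More'
               WHEN amount < 100 THEN 'Less'
               ELSE 'Avg'
           END AS summary
    FROM sales
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 500, 'More'), (2, 150, 'Avg'), (3, 50, 'Less')]
```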
Coalesce
• COALESCE returns the first non-null value in a list. If all the values in the list
are NULL, the function returns NULL.
• This function is useful when combining the values from several columns
into one.
COALESCE(value_1, value_2, ...., value_n)
• The COALESCE() function takes at least one value (value_1).
• It first checks whether value_1 is NULL. If not, it returns value_1.
Otherwise, it checks value_2, and so on until the list is
exhausted.
• SELECT COALESCE(NULL, 1, 2, 'W3Schools.com'); Output: 1
COALESCE() two columns
• The Products table contains the product name and its description. Some descriptions
are too long (more than 60 characters). In that case, we replace the description
with the product name.
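One way to express this with COALESCE is sketched below, using Python's sqlite3 module. The first argument is the description when it is at most 60 characters long (and NULL otherwise), and the second is the product name. The products table, its rows, and the use of a CASE expression for the length test are assumptions, not the original slide's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical products: one short description, one missing, one over 60 characters.
conn.execute("CREATE TABLE products (name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Lamp", "Desk lamp"), ("Chair", None), ("Sofa", "x" * 61)],
)

# value_1: the description if it is short enough, otherwise NULL
#          (a CASE with no ELSE yields NULL; LENGTH(NULL) is NULL too).
# value_2: the product name, used whenever value_1 is NULL.
rows = conn.execute("""
    SELECT name,
           COALESCE(
               CASE WHEN LENGTH(description) <= 60 THEN description END,
               name
           ) AS label
    FROM products
    ORDER BY name
""").fetchall()

print(rows)  # [('Chair', 'Chair'), ('Lamp', 'Desk lamp'), ('Sofa', 'Sofa')]
```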
NULLIF
• The NULLIF() function returns NULL if two expressions are equal; otherwise it returns the first
expression.
• SELECT NULLIF(25, 25); output: NULL
• SELECT NULLIF('Hello', 'world'); output: Hello
• SELECT e.last_name, NULLIF(j.job_id, e.job_id) "Old Job ID" FROM employees e, job_history j
WHERE e.employee_id = j.employee_id ORDER BY last_name, "Old Job ID";
LEAST/GREATEST
• The GREATEST() function returns the largest of the input values.
• The LEAST() function returns the smallest of the input values.
• The number of arguments allowed depends on the database; the system described here needs
at least two input values, accepts a maximum of four, and does not support variable-length lists.
• The comparison of string values is based on character set values: the character
with the higher character set value is considered the greater.
Casting
• The CAST() function converts a value (of any type) into a specified
datatype. It is similar to the CONVERT function.
CAST(expression AS datatype(length))
• SELECT CAST(25.65 AS varchar); output: 25.65

भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 

FOUNDATION OF DATA SCIENCE SQL QUESTIONS

  • 1. Databases for Data Science Dr. Nalini N SCOPE VIT Dr.Nalini N, SCOPE, VIT, Vellore
  • 2. Introduction • The role of a data scientist is to turn raw data into actionable insights. • Much of the world's raw data, such as electronic medical records and customer transaction histories, lives in organized collections of tables called relational databases. • Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a domain-specific language called SQL (Structured Query Language).
  • 3. Relational databases • A relational database is a collection of tables. • A table is just a set of rows and columns which represents exactly one type of entity. • Each row, or record, of a table contains information about a single entity; e.g. in a table representing employees, each row represents a single person. • Each column, or field, of a table contains a single attribute for all rows in the table; e.g. in a table representing employees, we might have a column containing first and last names for all employees.
  • 4. SQL • Although SQL can be used to create and modify databases, the focus of this course will be querying databases. • A query is a request for data from a database table, or combination of tables. • Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases.
  • 5. SQL Environment • Catalog: a set of schemas that constitute the description of a database • Schema: the structure that contains descriptions of objects created by a user (base tables, views, constraints) • Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints • Data Manipulation Language (DML): commands that maintain and query a database • Data Control Language (DCL): commands that control a database, including administering privileges and committing data
  • 6. Figure: DDL, DML, DCL, and the database development process (Copyright © 2014 Pearson Education, Inc.)
  • 7. Simple SQL Query
    Product
    PName        Price     Category     Manufacturer
    Gizmo        $19.99    Gadgets      GizmoWorks
    Powergizmo   $29.99    Gadgets      GizmoWorks
    SingleTouch  $149.99   Photography  Canon
    MultiTouch   $203.99   Household    Hitachi
    SELECT * FROM Product WHERE category = 'Gadgets'
    Result ("selection"):
    PName        Price     Category     Manufacturer
    Gizmo        $19.99    Gadgets      GizmoWorks
    Powergizmo   $29.99    Gadgets      GizmoWorks
  • 8. Simple SQL Query
    Product
    PName        Price     Category     Manufacturer
    Gizmo        $19.99    Gadgets      GizmoWorks
    Powergizmo   $29.99    Gadgets      GizmoWorks
    SingleTouch  $149.99   Photography  Canon
    MultiTouch   $203.99   Household    Hitachi
    SELECT PName, Price, Manufacturer FROM Product WHERE Price > 100
    Result ("selection" and "projection"):
    PName        Price     Manufacturer
    SingleTouch  $149.99   Canon
    MultiTouch   $203.99   Hitachi
  • 9. Eliminating Duplicates - DISTINCT keyword
    • If your data includes duplicate values and you want to return only the unique values from a column, you can use the DISTINCT keyword.
    SELECT DISTINCT category FROM Product
    Category
    Gadgets
    Photography
    Household
    Compare to:
    SELECT category FROM Product
    Category
    Gadgets
    Gadgets
    Photography
    Household
  • 10. The OR keyword • If you want to select rows based on multiple conditions where some but not all of the conditions need to be met, you can use the OR keyword. • When combining AND and OR, enclose the individual clauses in parentheses, because AND is evaluated before OR.
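The effect of parentheses when mixing AND and OR can be sketched with Python's built-in sqlite3 module. This is only an illustration: the Product rows are copied from the earlier slides, while the column types and the price threshold of 25 are assumptions made for the example.

```python
import sqlite3

# In-memory database holding the Product rows from the earlier slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (pname TEXT, price REAL, category TEXT, manufacturer TEXT)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?, ?)",
    [
        ("Gizmo", 19.99, "Gadgets", "GizmoWorks"),
        ("Powergizmo", 29.99, "Gadgets", "GizmoWorks"),
        ("SingleTouch", 149.99, "Photography", "Canon"),
        ("MultiTouch", 203.99, "Household", "Hitachi"),
    ],
)

# Parentheses make the intent explicit: (Gadgets OR Photography) AND cheap.
with_parens = conn.execute(
    "SELECT pname FROM Product "
    "WHERE (category = 'Gadgets' OR category = 'Photography') AND price < 25 "
    "ORDER BY pname"
).fetchall()

# Without parentheses AND binds tighter than OR, so this is read as
# category = 'Gadgets' OR (category = 'Photography' AND price < 25).
without_parens = conn.execute(
    "SELECT pname FROM Product "
    "WHERE category = 'Gadgets' OR category = 'Photography' AND price < 25 "
    "ORDER BY pname"
).fetchall()

print(with_parens)     # only rows passing both the category and the price test
print(without_parens)  # every Gadgets row, regardless of price
```

The two queries differ only in the parentheses, yet the second one returns Powergizmo as well, which is usually not what was intended.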
  • 11. The IN keyword • If you want to select rows based upon three or more different values from a single column, the WHERE clause can start to become unwieldy. • This is where the IN keyword comes in useful.
  • 12. The BETWEEN keyword • If you want to get the records where a value such as the average weight lies between two bounds, you don't have to combine >= and <=. • Instead, you can use BETWEEN, which includes both endpoints.
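A minimal sketch of IN and BETWEEN, again using Python's sqlite3 module. The breeds table, its columns, and its rows are hypothetical and exist only to give the queries something to filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per dog breed with an average weight in kg.
conn.execute("CREATE TABLE breeds (name TEXT, country TEXT, avg_weight INTEGER)")
conn.executemany(
    "INSERT INTO breeds VALUES (?, ?, ?)",
    [
        ("Beagle", "UK", 10),
        ("Akita", "Japan", 40),
        ("Boxer", "Germany", 30),
        ("Pug", "China", 8),
        ("Husky", "Russia", 23),
    ],
)

# IN replaces a chain of country = ... OR country = ... comparisons.
in_rows = conn.execute(
    "SELECT name FROM breeds WHERE country IN ('UK', 'Japan', 'China') ORDER BY name"
).fetchall()

# BETWEEN is inclusive: rows with avg_weight exactly 10 or exactly 30 match.
between_rows = conn.execute(
    "SELECT name FROM breeds WHERE avg_weight BETWEEN 10 AND 30 ORDER BY name"
).fetchall()

print(in_rows)
print(between_rows)
```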
  • 13. NULL and IS NULL • NULL represents a missing or unknown value. • You can check for missing values using the expression IS NULL. • IS NULL is useful when combined with the WHERE keyword to figure out what data you're missing. • If you want to filter out missing values so that you only get results which are not NULL, you can use IS NOT NULL.
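The NULL checks above can be tried out as follows. The sketch uses Python's sqlite3 module and a hypothetical breeds table in which two weights are missing; it also shows that comparing with = NULL matches nothing, which is why IS NULL exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; Python's None is stored as SQL NULL.
conn.execute("CREATE TABLE breeds (name TEXT, avg_weight INTEGER)")
conn.executemany(
    "INSERT INTO breeds VALUES (?, ?)",
    [("Beagle", 10), ("Akita", None), ("Pug", 8), ("Husky", None)],
)

# Rows where the weight is missing.
missing = conn.execute(
    "SELECT name FROM breeds WHERE avg_weight IS NULL ORDER BY name"
).fetchall()

# Rows where the weight is present.
present = conn.execute(
    "SELECT name FROM breeds WHERE avg_weight IS NOT NULL ORDER BY name"
).fetchall()

# A comparison with NULL is never true, so this returns no rows at all.
equals_null = conn.execute(
    "SELECT name FROM breeds WHERE avg_weight = NULL"
).fetchall()

print(missing, present, equals_null)
```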
  • 14. The LIKE and NOT LIKE keywords • When filtering by text with the = operator, the WHERE clause only matches text that equals your search criteria exactly. • However, in the real world, you often want to search for a pattern rather than a specific match. • This is where the LIKE keyword comes in. • LIKE allows you to search for a pattern in a column. • A LIKE pattern uses wildcard placeholders that stand in for other characters. There are two wildcards you can use with LIKE.
  • 15. • The % wildcard matches zero, one, or many characters in text; e.g. a pattern such as 'Data%' would return 'Data', 'DataC', 'DataCamp', 'DataMind', and so on. • The _ wildcard matches a single character; e.g. a pattern such as 'DataC_mp' matches companies like 'DataCamp', 'DataComp', and so on. • You can also use the NOT LIKE operator to find records that don't match the pattern you specify.
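The two wildcards can be sketched with Python's sqlite3 module. The companies table and the extra name 'Datum' are hypothetical, and the patterns 'Data%' and 'DataC_mp' are assumed to be the ones that produce the matches listed on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with the company names mentioned above plus one non-match.
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [("Data",), ("DataC",), ("DataCamp",), ("DataComp",), ("DataMind",), ("Datum",)],
)

def names(where_clause):
    """Return the sorted company names that satisfy the given WHERE clause."""
    sql = f"SELECT name FROM companies WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

prefix_matches = names("name LIKE 'Data%'")     # % matches zero or more characters
single_matches = names("name LIKE 'DataC_mp'")  # _ matches exactly one character
non_matches = names("name NOT LIKE 'Data%'")    # everything the pattern rejects

print(prefix_matches)
print(single_matches)
print(non_matches)
```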
  • 16. Aggregation
    SELECT count(*) FROM Product WHERE year > 1995
    SELECT avg(price) FROM Product WHERE maker = 'Toyota'
    • SQL supports several aggregation operations: sum, count, min, max, avg.
    • Except for count, all aggregations apply to a single attribute.
    • An aggregate may not appear in the WHERE clause.
  • 17. Aggregation: Count
    • You can count the number of rows in your table by using the COUNT keyword.
    SELECT Count(category) FROM Product WHERE year > 1995
    • This is the same as Count(*) as long as category contains no NULL values; duplicate categories are counted each time.
    • We probably want:
    SELECT Count(DISTINCT category) FROM Product WHERE year > 1995
  • 18. Simple Aggregations
    Purchase
    Product  Date   Price  Quantity
    Bagel    10/21  1      20
    Banana   10/3   0.5    10
    Banana   10/10  1      10
    Bagel    10/25  1.50   20
    SELECT Sum(price * quantity) FROM Purchase WHERE product = 'Bagel'
    Result: 50 (= 20 + 30)
  • 19. Ordering the Results
    • The ORDER BY keyword sorts the values of a column in either ascending or descending order.
    • By default, it sorts in ascending order. Use the DESC keyword to sort in descending order.
    • Ties are broken by the second attribute in the ORDER BY list, and so on.
    • Example: a Product table sorted by the "price" and "pname" columns is ordered by price, and rows that have the same price are then ordered by pname.
    Price  Product
    500    Apple
    500    Orange
  • 20. Ordering the Results
    Product
    PName        Price     Category     Manufacturer
    Gizmo        $19.99    Gadgets      GizmoWorks
    Powergizmo   $29.99    Gadgets      GizmoWorks
    TouchEX      $149.99   Photography  Canon
    SingleTouch  $149.99   Photography  Canon
    MultiTouch   $203.99   Household    Hitachi
    SELECT pname, price, manufacturer FROM Product WHERE category = 'Photography' AND price > 50 ORDER BY price, pname
    Result:
    PName        Price     Manufacturer
    SingleTouch  $149.99   Canon
    TouchEX      $149.99   Canon
  • 21. Product
    PName        Price     Category     Manufacturer
    Gizmo        $19.99    Gadgets      GizmoWorks
    Powergizmo   $29.99    Gadgets      GizmoWorks
    TouchEX      $149.99   Photography  Canon
    SingleTouch  $149.99   Photography  Canon
    MultiTouch   $203.99   Household    Hitachi
    What does each of these queries return?
    SELECT Category FROM Product ORDER BY PName
    SELECT DISTINCT category FROM Product ORDER BY category
    SELECT DISTINCT category FROM Product ORDER BY PName DESC
  • 22. Grouping and Aggregation
    Purchase(product, date, price, quantity)
    Find total sales after 10/1/2005 per product.
    SELECT product, Sum(price*quantity) AS TotalSales FROM Purchase WHERE date > '10/1/2005' GROUP BY product
    1. Compute the FROM and WHERE clauses.
    2. Group by the attributes in the GROUP BY clause.
    3. Compute the SELECT clause: grouped attributes and aggregates.
  • 23. Steps 1 and 2: FROM-WHERE-GROUP BY
    Product  Date   Price  Quantity
    Bagel    10/21  1      20
    Bagel    10/25  1.50   20
    Banana   10/3   0.5    10
    Banana   10/10  1      10
  • 24. Step 3: SELECT
    SELECT product, Sum(price*quantity) AS TotalSales FROM Purchase WHERE date > '10/1/2005' GROUP BY product
    Product  Date   Price  Quantity
    Bagel    10/21  1      20
    Bagel    10/25  1.50   20
    Banana   10/3   0.5    10
    Banana   10/10  1      10
    Result:
    Product  TotalSales
    Bagel    50
    Banana   15
  • 25. GROUP BY vs. Nested Queries
    SELECT product, Sum(price*quantity) AS TotalSales FROM Purchase WHERE date > '10/1/2005' GROUP BY product
    SELECT DISTINCT x.product, (SELECT Sum(y.price*y.quantity) FROM Purchase y WHERE x.product = y.product AND y.date > '10/1/2005') AS TotalSales FROM Purchase x WHERE x.date > '10/1/2005'
  • 26. Qualifying Results by Categories Using the HAVING Clause
    • The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions. It is used with GROUP BY.
    • HAVING operates on groups (categories), not on individual rows as the WHERE clause does. Here, only those groups whose total quantity is greater than 5 will be included in the final result.
    SELECT product, count(quantity) AS TotalSales FROM Purchase GROUP BY product HAVING Sum(quantity) > 5
  • 27. HAVING is used with aggregations to filter the results returned by the aggregation. It is similar to WHERE, except that WHERE removes rows before the aggregate function is applied, and HAVING removes groups after aggregation has occurred.
    SELECT sales_agent, COUNT(sales_pipeline.close_value) AS `number won`
    FROM sales_pipeline
    WHERE sales_pipeline.deal_stage = "Won"
    GROUP BY sales_pipeline.sales_agent
    HAVING COUNT(sales_pipeline.close_value) > 200
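The difference between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) can be sketched with Python's sqlite3 module. The Purchase rows follow the earlier slides, but the ISO-formatted dates, the extra September row, and the threshold of 20 are assumptions added so that both filters have something to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Purchase (product TEXT, date TEXT, price REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?, ?)",
    [
        ("Bagel", "2005-10-21", 1.0, 20),
        ("Banana", "2005-10-03", 0.5, 10),
        ("Banana", "2005-10-10", 1.0, 10),
        ("Bagel", "2005-10-25", 1.5, 20),
        ("Bagel", "2005-09-15", 1.0, 100),  # dropped by WHERE before grouping
    ],
)

# Shared part of both queries: filter rows, then group them per product.
base = (
    "SELECT product, SUM(price * quantity) AS TotalSales FROM Purchase "
    "WHERE date > '2005-10-01' GROUP BY product "
)

# WHERE only: one total per product, computed from the October rows.
grouped = conn.execute(base + "ORDER BY product").fetchall()

# WHERE plus HAVING: groups whose total is 20 or less are dropped afterwards.
filtered = conn.execute(
    base + "HAVING SUM(price * quantity) > 20 ORDER BY product"
).fetchall()

print(grouped)
print(filtered)
```

The September purchase never reaches the aggregation, while the Banana group is computed first and only then discarded by HAVING.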
  • 28. Summary
    • The data filtering process consists of different strategies for refining and reducing datasets.
    • Clauses of the SELECT statement:
      • SELECT: lists the columns (and expressions) to be returned from the query
      • FROM: indicates the table(s) or view(s) from which data will be obtained
      • WHERE: indicates the conditions under which a row will be included in the result
      • GROUP BY: indicates categorization of results
      • HAVING: indicates the conditions under which a category (group) will be included
      • ORDER BY: sorts the result according to specified criteria
    • Figure: SQL statement processing order
  • 29. Basic Statistics
    • Mean - SELECT Avg(ColumnName) AS Mean FROM TableName
    • Mode - SELECT TOP 1 ColumnName FROM TableName GROUP BY ColumnName ORDER BY COUNT(*) DESC (TOP is SQL Server syntax; other systems typically use LIMIT 1)
    • Median
  • 31. Code to calculate the median of salary column (result: median value 5,500)
  • 32. Code to calculate the median of salary column
    • Calculate Median Value Using PERCENTILE_CONT
    PERCENTILE_CONT(percentile) WITHIN GROUP (ORDER BY column_name) OVER ()
    ... FROM stock;
  • 33. Code to calculate the median of salary column
    • Calculate Median Value Using the Ranking Function
    SELECT (
      (SELECT MAX(Marks) FROM (SELECT TOP 50 PERCENT Marks FROM student_details ORDER BY Marks) AS BottomHalf)
      +
      (SELECT MIN(Marks) FROM (SELECT TOP 50 PERCENT Marks FROM student_details ORDER BY Marks DESC) AS TopHalf)
    ) / 2.0 AS Median
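For systems that offer neither PERCENTILE_CONT nor TOP, a median can also be computed with ROW_NUMBER(). The following is a sketch of that alternative technique, not the query from the slides; it uses Python's sqlite3 module (window functions need SQLite 3.25 or newer), and the student_details rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical marks; an even row count exercises the "average the two
# middle values" case.
conn.execute("CREATE TABLE student_details (name TEXT, marks REAL)")
conn.executemany(
    "INSERT INTO student_details VALUES (?, ?)",
    [("a", 40), ("b", 55), ("c", 70), ("d", 85)],
)

median_sql = """
WITH ranked AS (
    SELECT marks,
           ROW_NUMBER() OVER (ORDER BY marks) AS rn,  -- position in sorted order
           COUNT(*) OVER () AS n                      -- total number of rows
    FROM student_details
)
-- Integer division picks the middle row twice when n is odd and the two
-- middle rows when n is even; AVG then yields the median in both cases.
SELECT AVG(marks) FROM ranked WHERE rn IN ((n + 1) / 2, (n + 2) / 2)
"""
median = conn.execute(median_sql).fetchone()[0]
print(median)
```

With four rows the positions selected are 2 and 3, so the result is the mean of 55 and 70.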
  • 34. Data Munging • Data munging (or wrangling) is the data transformation phase. • It transforms data into various states so that the data is simpler to work with and understand. • The transformation may involve manually converting, merging, or updating the data into a certain format. • It transforms and maps data from one format to another to make it more valuable for a variety of analytics tools.
  • 37. Filtering • The data filtering process consists of different strategies for refining and reducing datasets. • FILTER is a modifier used on an aggregate function to limit the values used in an aggregation. All the columns in the SELECT statement that aren't aggregated should be specified in a GROUP BY clause in the query. • Data can also be filtered using the WHERE clause.
  • 38. FILTER Keyword • FILTER is more flexible than WHERE because you can use more than one FILTER modifier in an aggregate query, while you can use only one WHERE clause.
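A minimal sketch of several FILTER modifiers in one aggregate query, run through Python's sqlite3 module. It assumes SQLite 3.30 or newer, which is the first version that accepts FILTER on aggregate functions; the Purchase rows follow the earlier slides and the two conditions are chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchase (product TEXT, price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO Purchase VALUES (?, ?, ?)",
    [("Bagel", 1.0, 20), ("Banana", 0.5, 10), ("Banana", 1.0, 10), ("Bagel", 1.5, 20)],
)

# Each aggregate gets its own condition, which a single WHERE clause
# could not express in one pass over the table.
rows = conn.execute(
    """
    SELECT product,
           SUM(quantity)                            AS total_qty,
           SUM(quantity) FILTER (WHERE price >= 1)  AS qty_at_one_or_more,
           COUNT(*) FILTER (WHERE quantity > 10)    AS big_orders
    FROM Purchase
    GROUP BY product
    ORDER BY product
    """
).fetchall()

for row in rows:
    print(row)
```

The non-aggregated column product appears in GROUP BY, as the Filtering slide requires.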
  • 39. Window Functions • A window function performs a calculation over a set of rows and returns a value for each row of the query. • The term window refers to the set of rows on which the function operates; the returned values are calculated from the values of the rows in that window. • Window functions apply aggregate and ranking functions over a particular window (set of rows). The OVER clause is used with window functions to define that window. • The OVER clause does two things: • It partitions rows into sets of rows (using the PARTITION BY clause). • It orders rows within those partitions (using the ORDER BY clause).
  • 40. Window Functions
  • 41. Window Functions - ROW_NUMBER() • ROW_NUMBER() provides a guaranteed unique, ascending integer value which starts from 1 and continues through the end of the set.
  • 42. Window Functions - Rank • RANK() provides an ascending integer value which starts from 1, but it is not guaranteed to be unique. • Instead, any ties in the window get the same value, and the next value after the ties gets its ROW_NUMBER() value. • For example, if the second and third entries are tied, both get a rank of 2 and the fourth entry gets a rank of 4.
  • 43. Window Functions - Rank
  • 44. Window Functions - Rank
  • 45. Window Functions - Rank
  • 46. Window Functions - DENSE_RANK() • DENSE_RANK() behaves like RANK(), except that it does not skip numbers even with ties. • If the second and third entries are tied, both get a dense rank of 2 and the fourth entry gets a dense rank of 3.
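The three ranking functions can be compared side by side. This sketch uses Python's sqlite3 module (window functions need SQLite 3.25 or newer) and a hypothetical scores table in which the second and third entries are tied, matching the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical scores; Bob and Cid are tied for second place.
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Ann", 90), ("Bob", 85), ("Cid", 85), ("Dee", 70)],
)

rows = conn.execute(
    """
    SELECT name,
           score,
           -- name is added as a tie-breaker so the numbering is deterministic
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in rows:
    print(row)
```

ROW_NUMBER() stays unique, RANK() jumps from 2 to 4 after the tie, and DENSE_RANK() continues with 3.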
  • 47. Window Functions - PERCENT_RANK() • The PERCENT_RANK() function calculates the SQL percentile rank of each row. • This percentile rank ranges from zero to one. For each row, PERCENT_RANK() calculates the percentile rank using the following formula: (rank - 1) / (total_number_rows - 1) • In this formula, rank represents the rank of the row and total_number_rows is the number of rows that are being evaluated. The function always returns 0 for the first row.
  • 48. Window Functions - PERCENT_RANK()
  • 49. Window Functions - NTILE • The SQL NTILE() function partitions a logically ordered dataset into the number of buckets specified by the expression and assigns a bucket number to each row. • NTILE divides the rows into roughly equal-sized buckets. • Suppose you have 20 rows: • NTILE(2) gives 2 buckets with 10 rows each. • NTILE(3) gives 2 buckets with 7 rows and 1 bucket with 6 rows.
  • 50. Window Functions - NTILE
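The 20-row bucket sizes quoted above can be checked with a short sketch. It uses Python's sqlite3 module (SQLite 3.25 or newer) and a hypothetical nums table holding the integers 1 to 20; the helper name bucket_sizes is likewise made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with 20 rows: the integers 1..20.
conn.execute("CREATE TABLE nums (n INTEGER)")
conn.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(1, 21)])

def bucket_sizes(k):
    """Return (bucket, row_count) pairs after splitting the rows with NTILE(k)."""
    sql = f"""
        SELECT bucket, COUNT(*)
        FROM (SELECT NTILE({int(k)}) OVER (ORDER BY n) AS bucket FROM nums)
        GROUP BY bucket
        ORDER BY bucket
    """
    return conn.execute(sql).fetchall()

two_buckets = bucket_sizes(2)
three_buckets = bucket_sizes(3)
print(two_buckets)
print(three_buckets)
```

When the rows do not divide evenly, the larger buckets come first.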
  • 51. Window Functions - LEAD() • The LEAD() function allows you to access data from the following row, the row after that, and so on. • return_value - The value returned from the subsequent row at the specified offset. It must be a single value. • offset - The number of rows forward from the current row from which to access data. The offset should be a positive integer. If you don't specify it, the default offset is 1. • default - The value that LEAD() returns if the offset goes beyond the scope of the partition. If not specified, the default is NULL. • Because no next row is available for the last row, LEAD() returns NULL for the last row.
  • 52. Window Functions - LEAD()
  • 53. Window Functions - LEAD()
  • 54. Window Functions - LEAD() • The LEAD() function can also be very useful for calculating the difference between the value of the current row and the value of the following row. • The following query finds the difference between the salaries of people in the same department.
  • 55. Window Functions - LAG() • LAG() accesses the previous row's data based on a defined offset value. • It works similarly to the LEAD() function. • With LEAD() we access the values of subsequent rows, but with LAG() we access the previous row's data. • It is useful for comparing the current row's value with the previous row's value. • Because no previous row is available for the first row in each department, LAG() returns NULL for that row.
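LEAD() and LAG() within a department can be sketched as follows, including the salary difference to the next row. The example uses Python's sqlite3 module (SQLite 3.25 or newer); the employees table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical employees in two departments.
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", 3000),
        ("Bob", "Sales", 4000),
        ("Cid", "Sales", 5500),
        ("Dee", "IT", 6000),
        ("Eve", "IT", 7000),
    ],
)

rows = conn.execute(
    """
    SELECT name,
           dept,
           salary,
           LAG(salary)  OVER (PARTITION BY dept ORDER BY salary) AS prev_salary,
           LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS next_salary,
           LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) - salary
               AS diff_to_next
    FROM employees
    ORDER BY dept, salary
    """
).fetchall()

for row in rows:
    print(row)  # NULL values arrive in Python as None
```

The first row of each department has no previous salary and the last row has no next salary, so those cells (and the difference built on them) are NULL.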
  • 56. Window Functions - CUME_DIST() • The SQL window function CUME_DIST() returns the cumulative distribution of a value within a partition of values. • The cumulative distribution of a value is calculated as the number of rows with values less than or equal to (<=) the current row's value, divided by the total number of rows: N / total_rows • Here N is the number of rows with a value less than or equal to the current row's value, and total_rows is the number of rows in the group or result set. • The function returns a value greater than 0 and at most 1.
  • 57. Window Functions - CUME_DIST()
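PERCENT_RANK() and CUME_DIST() can be compared on the same data. This is a sketch using Python's sqlite3 module (SQLite 3.25 or newer) and a hypothetical scores table with one tie; the loop at the end recomputes both values from the formulas (rank - 1) / (total_number_rows - 1) and N / total_rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical scores; Cid and Dee are tied.
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Ann", 60), ("Bob", 70), ("Cid", 85), ("Dee", 85), ("Eve", 90)],
)

rows = conn.execute(
    """
    SELECT name,
           score,
           RANK()         OVER (ORDER BY score) AS rnk,
           PERCENT_RANK() OVER (ORDER BY score) AS pct_rank,
           CUME_DIST()    OVER (ORDER BY score) AS cume_dist
    FROM scores
    ORDER BY score, name
    """
).fetchall()

total_rows = len(rows)
for name, score, rnk, pct_rank, cume_dist in rows:
    # N = number of rows whose score is <= this row's score.
    n_le = sum(1 for r in rows if r[1] <= score)
    expected_pct = (rnk - 1) / (total_rows - 1)
    expected_cume = n_le / total_rows
    print(name, pct_rank, expected_pct, cume_dist, expected_cume)
```

PERCENT_RANK() starts at 0 for the first row, whereas CUME_DIST() is never 0 because every row counts itself.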
  • 58. Window Functions - Aggregation MIN() and MAX()
  • 59. Window Functions - Aggregation SUM() and AVG()
  • 60. Window Functions - FIRST_VALUE and LAST_VALUE • FIRST_VALUE() returns the first value in an ordered group of a result set or window frame. • LAST_VALUE() returns the last value in an ordered group of a result set or window frame.
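A sketch of FIRST_VALUE(), LAST_VALUE(), and an aggregate used as a window function, run through Python's sqlite3 module (SQLite 3.25 or newer) on a hypothetical employees table. One detail worth knowing: when the window has an ORDER BY, the default frame ends at the current row, so LAST_VALUE() needs an explicit frame to see the whole partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical employees in two departments.
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", 3000),
        ("Bob", "Sales", 4000),
        ("Cid", "Sales", 5500),
        ("Dee", "IT", 6000),
        ("Eve", "IT", 7000),
    ],
)

rows = conn.execute(
    """
    SELECT name,
           dept,
           FIRST_VALUE(name) OVER (PARTITION BY dept ORDER BY salary) AS lowest_paid,
           -- default frame stops at the current row, so this is the row itself
           LAST_VALUE(name)  OVER (PARTITION BY dept ORDER BY salary) AS naive_last,
           -- explicit frame covering the whole partition
           LAST_VALUE(name)  OVER (
               PARTITION BY dept ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS highest_paid,
           MAX(salary) OVER (PARTITION BY dept) AS dept_max
    FROM employees
    ORDER BY dept, salary
    """
).fetchall()

for row in rows:
    print(row)
```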
  • 61. Ordered Data & Join • ORDER BY ... DESC • Join • Joins are commands that are used to combine rows from two or more tables. • These tables are combined based on a related column between those tables. • Inner, left, right, and full are the four basic types of SQL joins.
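An inner join and a left join can be contrasted on two small tables. The sketch uses Python's sqlite3 module; the customers and orders tables are hypothetical, and right and full joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables related through orders.customer_id = customers.id.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cid")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 50), (11, 1, 20), (12, 2, 70)])

# INNER JOIN keeps only customers that have at least one matching order;
# ORDER BY ... DESC lists the largest amounts first.
inner_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.amount DESC
    """
).fetchall()

# LEFT JOIN keeps every customer; the order columns are NULL when no match exists.
left_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
    """
).fetchall()

print(inner_rows)
print(left_rows)
```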
  • 62. Preparing Data for Analytics Tool • One of the primary steps performed in data science is cleaning the dataset. • Most of the time spent by a data scientist or analyst goes into preparing datasets for use in analysis. SQL can help to speed up this step. • Various SQL queries can be used to clean, update, and filter data, by eliminating redundant and ... • This can be done with different SQL constructs such as CASE WHEN, COALESCE, NULLIF, LEAST/GREATEST, casting, and DISTINCT.
  • 63. CASE WHEN • The CASE statement goes through the conditions specified with WHEN clauses and returns the value of the first condition that is met. • Suppose we fetch all data from a sales table and want to add an extra column, labelled summary, that categorizes sales into More, Less, and Avg; this can be done using a CASE statement.
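The More/Less/Avg labelling can be sketched as below with Python's sqlite3 module. The sales table, its rows, and the thresholds 300 and 100 are hypothetical, since the original table is not reproduced in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sales table.
conn.execute("CREATE TABLE sales (item TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 500), ("B", 150), ("C", 50)],
)

# CASE evaluates the WHEN conditions top to bottom and returns the value
# of the first one that is true; ELSE covers everything left over.
rows = conn.execute(
    """
    SELECT item,
           amount,
           CASE
               WHEN amount > 300 THEN 'More'
               WHEN amount < 100 THEN 'Less'
               ELSE 'Avg'
           END AS summary
    FROM sales
    ORDER BY item
    """
).fetchall()

for row in rows:
    print(row)
```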
  • 65. Coalesce • COALESCE returns the first non-null value in a list. If all the values in the list are NULL, the function returns NULL. • This function is useful when combining the values from several columns into one. • Syntax: COALESCE(value_1, value_2, ...., value_n) • The COALESCE() function takes at least one value (value_1) and returns the first value in the list that is non-null. • It first checks whether value_1 is null. If not, it returns value_1. Otherwise, it checks whether value_2 is null. The process goes on until the list is exhausted. • Example: SELECT COALESCE(NULL, 1, 2, 'W3Schools.com'); Output: 1
  • 68. COALESCE() two columns • The Products table contains the product name and its description. Some descriptions are too long (more than 60 characters); in that case, we replace the description with the product name.
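One way to express this replacement is to turn over-long descriptions into NULL with a CASE expression and let COALESCE fall back to the product name. The query on the original slide is not reproduced in the text, so the following is only a sketch under that assumption; it uses Python's sqlite3 module, and the products table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical products: a short description, a long one, and a missing one.
conn.execute("CREATE TABLE products (name TEXT, description TEXT)")
long_text = "A very long description " * 4  # 96 characters, over the 60 limit
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Gizmo", "Small gadget"), ("Powergizmo", long_text), ("MultiTouch", None)],
)

# The CASE has no ELSE, so it yields NULL for long (or missing) descriptions;
# COALESCE then returns its second argument, the product name.
rows = conn.execute(
    """
    SELECT name,
           COALESCE(
               CASE WHEN LENGTH(description) <= 60 THEN description END,
               name
           ) AS short_description
    FROM products
    ORDER BY name
    """
).fetchall()

for row in rows:
    print(row)
```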
  • 71. NULLIF
    • The NULLIF() function returns NULL if two expressions are equal; otherwise it returns the first expression.
    • SELECT NULLIF(25, 25); Output: NULL
    • SELECT NULLIF('Hello', 'world'); Output: Hello
    • SELECT e.last_name, NULLIF(j.job_id, e.job_id) "Old Job ID" FROM employees e, job_history j WHERE e.employee_id = j.employee_id ORDER BY last_name, "Old Job ID";
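The two literal examples above can be run as-is, and NULLIF is also handy for turning placeholder values into NULL during data cleaning. The sketch below uses Python's sqlite3 module; the contacts table and the choice of an empty string as the placeholder are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The two literal examples: equal arguments give NULL (None in Python),
# different arguments give back the first one.
literal_results = conn.execute(
    "SELECT NULLIF(25, 25), NULLIF('Hello', 'world')"
).fetchone()

# Hypothetical contacts table where a missing phone is stored either as
# an empty string or as NULL.
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ann", "555-0100"), ("Bob", ""), ("Cid", None)],
)

# NULLIF maps '' to NULL, then COALESCE supplies one uniform fallback.
cleaned = conn.execute(
    "SELECT name, COALESCE(NULLIF(phone, ''), 'unknown') FROM contacts ORDER BY name"
).fetchall()

print(literal_results)
print(cleaned)
```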
  • 72. LEAST/GREATEST • The greatest() function returns the largest of the input values. • The least() function returns the smallest of the input values. • You need to specify at least two input values and a maximum of four values. Variable-length lists are not supported. • The comparison for string values is based on the character set value. The character with the higher character set value is considered the greatest value.
  • 73. LEAST/GREATEST
  • 74. Casting • The CAST() function converts a value (of any type) into a specified datatype, similar to the CONVERT function. • Syntax: CAST(expression AS datatype(length)) • Example: SELECT CAST(25.65 AS varchar); Output: 25.65
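CAST can be tried out directly, without any table. This sketch uses Python's sqlite3 module, so the results reflect SQLite's casting rules; other database systems may, for example, round rather than truncate when casting a decimal to an integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three casts in one query: number to text, text to integer, real to integer.
as_text, as_int, truncated = conn.execute(
    "SELECT CAST(25.65 AS varchar), CAST('42' AS INTEGER), CAST(25.65 AS INTEGER)"
).fetchone()

print(repr(as_text))    # the number rendered as a string
print(repr(as_int))     # the string parsed as an integer
print(repr(truncated))  # SQLite drops the fractional part
```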