Column orientation - rotate your thinking 90 degrees

With ever-increasing data volumes and greater analytics requirements, a new breed of databases is becoming popular: column-based databases. Some popular real-world examples are Sybase IQ, Vertica and, to some degree, Infobright, a column-based storage engine for MySQL. These databases store data "column-wise" in pages instead of "row-wise". This re-orientation claims to provide significant advantages over row-based storage for read-oriented analytics queries. In my talk, I will discuss the technicalities, benefits and motivating use cases for column-based databases. We shall also see why more indexing or partitioning in row-based storage won't achieve the same effect.

What is a column based DB?
ID  NAME           SEX  AGE  SALARY  ADDRESS  PHONE  PAN  ...
1   Sunil Sharma   M    40   10,000  ...      ...    ...
2   Neha Agarwal   F    25   12,000  ...      ...    ...
3   Anant Agarwal  M    28   15,000  ...      ...    ...
4   Vishal Mehta   M    30   8,000   ...      ...    ...

One page of the table storage

Row based storage:
1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...

Column based storage:
1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...

Column stores
ID:     1|2|3|4|5|6|....
NAME:   Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta|Srinivas Pathak|Rubina Mehta....
SEX:    M|F|M|M|M|F|F...
AGE:    40|25|28|30|45|20...
SALARY: 10000|12000|15000|8000|15000|5000...
...

1st page of each column store
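The two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows mirror the first five columns of the sample table, and the function names and the page size (counted in values rather than bytes) are hypothetical, not taken from any real engine.

```python
# Sample rows mirroring the slide's table (ID, NAME, SEX, AGE, SALARY).
rows = [
    (1, "Sunil Sharma", "M", 40, 10000),
    (2, "Neha Agarwal", "F", 25, 12000),
    (3, "Anant Agarwal", "M", 28, 15000),
    (4, "Vishal Mehta", "M", 30, 8000),
]
columns = ("id", "name", "sex", "age", "salary")


def row_layout(rows):
    """Serialize row by row: all fields of row 1, then all fields of row 2, ..."""
    return "|".join(str(value) for row in rows for value in row)


def column_layout(rows, columns):
    """Build one store per column; position i in every store belongs to row i."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def paginate(values, page_size):
    """Split one column store into pages of page_size values (an assumed unit)."""
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]


row_page = row_layout(rows)
stores = column_layout(rows, columns)
age_pages = paginate(stores["age"], page_size=2)

print(row_page)       # fields of one row sit next to each other
print(stores["age"])  # values of one column sit next to each other
print(age_pages)      # each column store is paged independently
```

Because every store keeps its values in the same row order, a row can still be reassembled by reading the same position from each store.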

Query processing on row store
SELECT name, salary FROM employee WHERE age > 40
• Evaluate the condition age > 40, possibly using an index on age.
• Get a found set containing the row numbers/IDs of the rows that satisfy the condition.
• Retrieve all rows in the found set.
• Send only name and salary from those rows to the client as the result.
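The steps above can be sketched as follows. This is a minimal in-memory illustration, not how any particular engine works: the data, the names and the `values_read` counter are hypothetical, and the list comprehension that builds the found set stands in for an index lookup on age.

```python
# Field positions inside each row tuple.
ID, NAME, SEX, AGE, SALARY = range(5)

# Hypothetical employee table held row-wise: one complete tuple per row.
employee_rows = [
    (1, "Asha", "F", 45, 10000),
    (2, "Ben", "M", 25, 12000),
    (3, "Chen", "M", 52, 15000),
    (4, "Dev", "M", 30, 8000),
]


def query_row_store(rows, min_age):
    """SELECT name, salary FROM employee WHERE age > min_age on a row store."""
    # Steps 1-2: evaluate the condition and collect the found set of row positions
    # (a stand-in for an index lookup; its cost is not counted below).
    found_set = [pos for pos, row in enumerate(rows) if row[AGE] > min_age]
    # Step 3: fetch every matching row in full, including unwanted fields.
    fetched = [rows[pos] for pos in found_set]
    values_read = sum(len(row) for row in fetched)
    # Step 4: project only name and salary for the client.
    result = [(row[NAME], row[SALARY]) for row in fetched]
    return result, values_read


row_result, row_values_read = query_row_store(employee_rows, min_age=40)
print(row_result)
print(row_values_read)
```

The counter makes the waste visible: all five fields of each matching row are fetched even though the query needs only two of them.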

Query processing on a column store
SELECT name, salary FROM employee WHERE age > 40
• Evaluate the condition age > 40 on the age column, using an index if present.
• Get a found set containing the row numbers/IDs of the rows that satisfy the condition.
• Retrieve the names from the name column store for all rows in the found set.
• Retrieve the salaries from the salary column store for all rows in the found set.
• Associate each name with its salary by row ID/number for the final result.
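A comparable sketch for the column store steps above, again with hypothetical data and names. Each column is a plain Python list, and the list position plays the role of the row number.

```python
# Hypothetical employee table held column-wise: one list per column,
# where position i in every list belongs to the same row.
employee_columns = {
    "id": [1, 2, 3, 4],
    "name": ["Asha", "Ben", "Chen", "Dev"],
    "sex": ["F", "M", "M", "M"],
    "age": [45, 25, 52, 30],
    "salary": [10000, 12000, 15000, 8000],
}


def query_column_store(columns, min_age):
    """SELECT name, salary FROM employee WHERE age > min_age on a column store."""
    # Steps 1-2: evaluate the condition on the age column only and
    # collect the found set of row positions.
    found_set = [pos for pos, age in enumerate(columns["age"]) if age > min_age]
    # Step 3: read names for the found set from the name store.
    names = [columns["name"][pos] for pos in found_set]
    # Step 4: read salaries for the found set from the salary store.
    salaries = [columns["salary"][pos] for pos in found_set]
    # Step 5: associate name with salary by row position.
    result = list(zip(names, salaries))
    values_read = len(names) + len(salaries)
    return result, values_read


col_result, col_values_read = query_column_store(employee_columns, min_age=40)
print(col_result)
print(col_values_read)
```

The id and sex stores are never touched, so only the two requested values per matching row are read once the found set is known.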

A quick calculation of IO
• The table has 10 columns.
• It holds 1 million rows.
• Each row is 100 bytes, so the whole table is 100MB.
• 30% of employees are above age 40.
• Total amount of data read in a row based store = 100MB * 0.3 = 30MB
• Total amount of data read in a column based store = 100MB * 0.3 * 0.2 (only 2 of the 10 columns, assuming equal column widths) = 6MB
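The same arithmetic as a small Python function, so the inputs can be varied. The function and parameter names are hypothetical. It keeps the slide's simplifications: all columns are equally wide, only matching rows are read, and the cost of evaluating the condition itself is ignored. Integer arithmetic is used to avoid rounding noise.

```python
def io_estimate(rows, row_bytes, total_columns, selectivity_pct, columns_read):
    """Return (row_store_bytes, column_store_bytes) read for one query."""
    table_bytes = rows * row_bytes
    # Row store: every matching row is read in full.
    row_store_bytes = table_bytes * selectivity_pct // 100
    # Column store: only the requested columns of the matching rows are read.
    column_store_bytes = row_store_bytes * columns_read // total_columns
    return row_store_bytes, column_store_bytes


row_bytes_read, column_bytes_read = io_estimate(
    rows=1_000_000,
    row_bytes=100,
    total_columns=10,
    selectivity_pct=30,
    columns_read=2,
)
print(row_bytes_read / 1_000_000, "MB read by the row store")
print(column_bytes_read / 1_000_000, "MB read by the column store")
```

With these inputs the column store reads one fifth of the data, which is simply the ratio of columns requested to columns stored.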

Why is it important?
• Wide fact tables in data warehouses
• Analytics queries on a data warehouse tend to aggregate/analyse a few columns but a large number of rows.
• Analytics queries often become full table scans in row stores
• Normalization means more joins

Benefits of column based DB
• Fewer pages read = less IO = faster queries
• Processing becomes CPU bound instead of IO bound
• Compression
  • Page level compression
  • Column level compression (lookup tables)
• Natural intraquery parallelism for conditions on different columns
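One common reading of "column level compression (lookup tables)" is dictionary encoding: each distinct value is stored once in a lookup table and the column itself holds small integer codes. The sketch below assumes that reading. The function names and the sample city column are hypothetical and do not reflect any specific product's format.

```python
def dictionary_encode(values):
    """Replace each value by an integer code into a lookup table."""
    lookup = []          # code -> value
    code_by_value = {}   # value -> code
    encoded = []
    for value in values:
        if value not in code_by_value:
            code_by_value[value] = len(lookup)
            lookup.append(value)
        encoded.append(code_by_value[value])
    return lookup, encoded


def dictionary_decode(lookup, encoded):
    """Rebuild the original column from the lookup table and the codes."""
    return [lookup[code] for code in encoded]


# Hypothetical low-cardinality column: few distinct values, many repeats.
city = ["Pune", "Mumbai", "Pune", "Pune", "Delhi", "Mumbai"]
lookup, encoded = dictionary_encode(city)

print(lookup)
print(encoded)
print(dictionary_decode(lookup, encoded) == city)
```

This works well in a column store because all values of one column sit together and tend to repeat, whereas a row page mixes values of many different types.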

Row based equivalents
• Index every column?
  • Maintenance: updates/inserts/deletes
  • Storage
  • Most importantly: an index is value=>id, a column is id=>value
  • Useful for selective queries only
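The value=>id versus id=>value distinction can be made concrete with a small sketch. The data and helper names are hypothetical. A column is modelled as a list addressed by row position, and an index as a dictionary from value to the row positions holding it.

```python
from collections import defaultdict

# Column store for age: row position (id) -> value.
ages = [45, 25, 52, 30, 45]


def build_index(column):
    """Build an index on a column: value -> list of row positions."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def value_for_row_via_index(index, row_id):
    """Recover a row's value from the index alone; this needs a search."""
    for value, row_ids in index.items():
        if row_id in row_ids:
            return value
    raise KeyError(row_id)


age_index = build_index(ages)

print(age_index[45])                          # index: which rows have age 45?
print(ages[2])                                # column: what is the age of row 2?
print(value_for_row_via_index(age_index, 4))  # same question asked of the index
```

An index answers "which rows have this value" directly, but fetching the values for a known set of rows, which is what the retrieval steps of an analytics query need, is direct only in the column.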

Row based equivalents
• Vertical partitioning?
  • Joins (although fast ones)
  • Table overhead
  • Cannot use horizontal partitioning
  • A row based query engine is not geared up to make use of column based storage

Summary
• For ad hoc analytics queries, column based storage reduces IO and makes queries faster.
• Column based query engines written from the ground up for analytics queries make good use of this storage.
• Indexing every column, or vertical partitioning, is not the same as column based storage.

References
• Commercial products
  • Sybase IQ
  • Vertica
  • MySQL's Infobright storage engine
• To know more, read
  • http://databasecolumn.vertica.com/

Slide 1 (title slide): Column based Databases, by Shweta Agrawal

Slide 8 (figure only): An example star schema