With ever-increasing data volumes and growing analytics requirements, a new breed of databases is becoming popular: column-based databases. Some popular real-world examples are Sybase IQ, Vertica, and, to some degree, Infobright, a column-based storage engine for MySQL. These databases store data "column-wise" in pages instead of "row-wise". This reorientation claims to provide significant advantages over row-based storage for read-heavy analytics queries. In my talk, I will discuss the technicalities, benefits, and motivating use cases for column-based databases. We shall also see why more indexing or partitioning in a row-based store won't achieve the same effect.
2. What is a column based DB?
ID  NAME           SEX  AGE  SALARY  ADDRESS  PHONE  PAN...
1   Sunil Sharma   M    40   10,000  ...      ...    ...
2   Neha Agarwal   F    25   12,000  ...      ...    ...
3   Anant Agarwal  M    28   15,000  ...      ...    ...
4   Vishal Mehta   M    30   8,000   ...      ...    ...
One page of the table storage
Row based storage:
1|Shweta Agrawal|M|40|10000...|2|Neha Agrawal|F|25|12000...|3|Anant Agarwal|M|28|15000...|4|Vishal Mehta|M|30|8000...
Column based storage:
1|2|3|4...|Shweta Agrawal|Neha Agrawal|Anant Agarwal|Vishal Mehta...|M|F|M|M...|40|25|28|30...|10000|12000|15000|8000...
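The two page layouts can be reproduced with a small sketch. This is only an illustration, not how any particular engine serializes pages: the function names are made up for this example, the rows are the first five columns of the sample employee table, and real pages hold binary data rather than pipe-separated text.

```python
# Sample rows: the first five columns of the example employee table.
rows = [
    (1, "Sunil Sharma", "M", 40, 10000),
    (2, "Neha Agarwal", "F", 25, 12000),
    (3, "Anant Agarwal", "M", 28, 15000),
    (4, "Vishal Mehta", "M", 30, 8000),
]


def row_wise_page(rows):
    """Lay out the values one complete row after another."""
    return "|".join(str(value) for row in rows for value in row)


def column_wise_page(rows):
    """Lay out the values one complete column after another."""
    columns = zip(*rows)  # transpose: tuples of ids, names, sexes, ...
    return "|".join(str(value) for column in columns for value in column)


row_page = row_wise_page(rows)
column_page = column_wise_page(rows)
print(row_page)
print(column_page)
```

In the column-wise page all ages sit next to each other, so a query that touches only age never has to read names or salaries.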
4. Query processing on row store
SELECT name, salary FROM employee WHERE age > 40
Evaluate the condition age > 40, possibly using an index on age.
Get a foundset containing the row numbers/IDs of the rows that satisfy the condition.
Retrieve all rows in the foundset.
Send only name and salary from those rows to the client as the result.
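The row-store steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the table is a list of dictionaries, the function name and the `values_read` counter are invented, and the condition is checked by scanning rather than through an index.

```python
# Hypothetical row store: every row is stored, and read, as one unit.
EMPLOYEE_ROWS = [
    {"id": 1, "name": "Sunil Sharma", "sex": "M", "age": 40, "salary": 10000},
    {"id": 2, "name": "Neha Agarwal", "sex": "F", "age": 25, "salary": 12000},
    {"id": 3, "name": "Anant Agarwal", "sex": "M", "age": 28, "salary": 15000},
    {"id": 4, "name": "Vishal Mehta", "sex": "M", "age": 30, "salary": 8000},
]


def query_row_store(rows, min_age):
    """SELECT name, salary FROM employee WHERE age > min_age."""
    # Evaluate the condition and collect the foundset of row positions.
    # A real engine might use an index on age; this sketch simply scans.
    foundset = [pos for pos, row in enumerate(rows) if row["age"] > min_age]
    # Fetch every matching row in full, all columns included.
    fetched = [rows[pos] for pos in foundset]
    values_read = sum(len(row) for row in fetched)
    # Keep only name and salary for the client.
    result = [(row["name"], row["salary"]) for row in fetched]
    return result, values_read


result, values_read = query_row_store(EMPLOYEE_ROWS, min_age=28)
print(result, values_read)
```

The call uses `min_age=28` rather than the slide's 40 only because none of the four sample rows is older than 40. Note that `values_read` counts all five columns of each fetched row even though the client receives just two of them.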
5. Query processing on a column store
SELECT name, salary FROM employee WHERE age > 40
Evaluate the condition age > 40 on the age column, using an index if present.
Get a foundset containing the row numbers/IDs of the rows that satisfy the condition.
Retrieve the names from the name column store for all rows in the foundset.
Retrieve the salaries from the salary column store for all rows in the foundset.
Associate each name with its salary by row ID/number to form the final result.
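A comparable sketch for the column-store steps is shown here. Again, the structure is assumed for illustration: each column is a plain Python list, the row number is the list position, and the function name and `values_read` counter are hypothetical.

```python
# Hypothetical column store: one list per column, aligned by position.
EMPLOYEE_COLUMNS = {
    "id": [1, 2, 3, 4],
    "name": ["Sunil Sharma", "Neha Agarwal", "Anant Agarwal", "Vishal Mehta"],
    "sex": ["M", "F", "M", "M"],
    "age": [40, 25, 28, 30],
    "salary": [10000, 12000, 15000, 8000],
}


def query_column_store(columns, min_age):
    """SELECT name, salary FROM employee WHERE age > min_age."""
    # Evaluate the condition on the age column only and build the foundset.
    foundset = [pos for pos, age in enumerate(columns["age"]) if age > min_age]
    # Fetch names and salaries for the foundset from their own columns.
    names = [columns["name"][pos] for pos in foundset]
    salaries = [columns["salary"][pos] for pos in foundset]
    values_read = len(names) + len(salaries)
    # Associate name with salary by position (the row number).
    return list(zip(names, salaries)), values_read


result, values_read = query_column_store(EMPLOYEE_COLUMNS, min_age=28)
print(result, values_read)
```

With `min_age=28` (chosen because no sample row is older than 40), only the name and salary values of the two matching rows are fetched; the id and sex columns are never touched.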
6. A quick calculation of IO
Table has 10 columns and 1 million rows.
Each row is 100 bytes, so the table is 100MB in total.
30% of employees are above age 40.
Total amount of data read in row based store = 100MB * 0.3 = 30MB
Total amount of data read in column based store = 100MB * 0.3 * 0.2 = 6MB (only 2 of the 10 columns, assuming equal column widths)
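The same arithmetic can be checked with a few lines of code. The constant names are invented for this sketch, and it keeps the slide's simplifications: all columns are assumed to be equally wide, the cost of evaluating the age condition is ignored, and IO is counted in bytes rather than whole pages.

```python
NUM_COLUMNS = 10          # columns in the table
NUM_ROWS = 1_000_000      # rows in the table
ROW_BYTES = 100           # bytes per row
MATCH_PERCENT = 30        # percentage of employees above age 40
COLUMNS_NEEDED = 2        # name and salary
MB = 1_000_000            # decimal megabyte, as used on the slide

matching_rows = NUM_ROWS * MATCH_PERCENT // 100

# Row store: every matching row is read in full.
row_store_bytes = matching_rows * ROW_BYTES

# Column store: only the needed columns of each matching row are read
# (assumes equal column widths, i.e. 10 bytes per column).
column_store_bytes = matching_rows * ROW_BYTES * COLUMNS_NEEDED // NUM_COLUMNS

print(f"table size:   {NUM_ROWS * ROW_BYTES // MB} MB")
print(f"row store:    {row_store_bytes // MB} MB")
print(f"column store: {column_store_bytes // MB} MB")
```

Changing `COLUMNS_NEEDED` shows how the advantage shrinks as a query touches more columns; at 10 columns the two stores read the same amount.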
7. Why is it important?
Wide fact tables in data warehouses
Analytics queries on a data warehouse tend to aggregate/analyse a few columns but a large number of rows.
Full table scans for analytics queries in row stores
Normalization means more joins
9. Benefits of column based DB
Fewer pages read = less IO = faster queries
Queries become CPU bound instead of IO bound
Compression:
Page level compression
Column level compression (lookup tables)
Natural intra-query parallelism for conditions on different columns
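One plausible reading of "column level compression (lookup tables)" is dictionary encoding, where each distinct value in a column is stored once and the column itself holds small integer codes. The sketch below assumes that reading; the function names are hypothetical and no specific product's format is implied.

```python
def dictionary_encode(values):
    """Replace each value with an integer code into a lookup table."""
    lookup = []      # code -> distinct value, in order of first appearance
    code_of = {}     # distinct value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(lookup)
            lookup.append(value)
        codes.append(code_of[value])
    return lookup, codes


def dictionary_decode(lookup, codes):
    """Rebuild the original column from the lookup table and codes."""
    return [lookup[code] for code in codes]


sex_column = ["M", "F", "M", "M"]
lookup, codes = dictionary_encode(sex_column)
print(lookup, codes)
```

This works well on a column store because a single column usually has few distinct values of one type, whereas a row-wise page mixes names, ages, and salaries together.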
10. Row based equivalents
Index every column?
Maintenance: updates/inserts/deletes
Storage
Most importantly: an index maps value => ID, while a column maps ID => value
Useful for selective queries only
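The difference in direction between an index and a column can be made concrete with a small sketch. The data structures are assumptions chosen for illustration: the column is a list addressed by zero-based row position, and the index is a dictionary from value to row positions rather than a real B-tree.

```python
from collections import defaultdict

# Column: row position -> value.
age_column = [40, 25, 28, 30]


def build_index(column):
    """Index: value -> list of row positions holding that value."""
    index = defaultdict(list)
    for position, value in enumerate(column):
        index[value].append(position)
    return index


age_index = build_index(age_column)

rows_with_age_28 = age_index[28]   # index lookup: which rows have age 28?
age_of_row_2 = age_column[2]       # column lookup: what is the age of row 2?
print(rows_with_age_28, age_of_row_2)
```

The index is good at finding the few rows that match a value, which is why it helps selective queries. Fetching the value for a given row, which a query needs once it has a foundset, is the lookup a column provides directly.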
11. Row based equivalents
Vertical partitioning?
Joins (although fast ones)
Table overhead
Cannot use horizontal partitioning
Row based query engine is not geared to make use of the column based storage.
12. Summary
For ad hoc analytics queries, column based storage reduces IO and makes queries faster.
Column based query engines, written from the ground up for analytics queries, make good use of this storage.
Indexing every column or vertical partitioning is not the same as column based storage.