Big Data Developers Meetup #1 Aug 2014 
Andrey.vykhodtsev@ru.ibm.com 
Central & Eastern Europe BigData Tech Sales
First Meetup 2014 
•About SQL on Hadoop 
•An overview that is as objective as possible, and a constructive dialogue 
•Based on respect for other technologies, including competing ones 
•No holy wars 
•Light snacks: help yourself 
•Time: 19:00 to 22:00; the program ends at 21:00, and we must leave the building by 22:00
Agenda 
•What is this Hadoop thing? 
•Why SQL on Hadoop? 
•What is Hive? 
•SQL-on-Hadoop landscape 
•InfoSphere BigInsights for Hadoop with Big SQL 
•What is it? 
•SQL capabilities 
•Architecture 
•Application portability and integration 
•Enterprise capabilities 
•Performance 
•Conclusion
Big Data Scenarios Span Many Industries – and rely on Hadoop 
Data Warehouse Modernization 
•Optimize existing EDW environment – size, performance, and TCO 
•Capture, off-load, and analyze massive amounts of data to get new insights 
360 View of the Customer 
•Text analytics on social media commentary around life events 
•Link social media profiles to actual customers 
Cyber Security 
•Analyze massive volumes of data that can’t be handled by existing SIEM systems 
•Monitor web and email traffic to identify potential threats such as internet drug trafficking and prostitution
The Goal of Hadoop 
Manage large volumes of data 
Scalable to any volume 
Off-load from the warehouse 
Identify unique customers 
Reduce Costs 
Commodity hardware 
Common tools 
In-house skills 
Analyze new data types 
Improve business decisions 
Understand sentiment 
Analyze data-in-motion
What is Hadoop? 
[Figure: MapReduce data flow. A client submits work to the master; input files (splits 0 to 5) feed the map phase, which writes intermediate files that the reduce phase turns into output files 0 to 2.]
•Framework to process big data in parallel on a cluster 
•What's new/different? 
•Free, open source 
•Uses commodity hardware 
•“Move programs to the data” 
•Scale both processing and storage by simply adding nodes 
•Makes big data processing accessible to everyone 
•Two key things to understand Hadoop: 
•How files are stored 
•How files are processed
How files are stored: HDFS 
•Key ideas: 
•Divide big files in blocks and store blocks randomly across cluster 
•Provide API to ask: where are the pieces of this file? 
•=> Programs can be shipped to nodes for parallel distributed processing 
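The key ideas above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the function names, node names, and the tiny block size are assumptions made for illustration. Real HDFS uses much larger blocks and a rack-aware placement policy rather than purely random placement.

```python
import random
from collections import defaultdict

BLOCK_SIZE = 4     # bytes per block; tiny on purpose (assumption for the demo)
REPLICATION = 3    # copies kept of each block (3 is the HDFS default)

def store_file(data: bytes, nodes: list, seed: int = 0) -> dict:
    """Split data into fixed-size blocks and place each block on REPLICATION distinct nodes."""
    rng = random.Random(seed)
    placement = {}
    for block_id, _offset in enumerate(range(0, len(data), BLOCK_SIZE)):
        placement[block_id] = rng.sample(nodes, REPLICATION)
    return placement

def blocks_by_node(placement: dict) -> dict:
    """Answer "where are the pieces?" from the node's side: which blocks are local to each node."""
    local = defaultdict(list)
    for block_id, holders in placement.items():
        for node in holders:
            local[node].append(block_id)
    return dict(local)

nodes = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical cluster
placement = store_file(b"0123456789abcdef", nodes)
for block_id, holders in placement.items():
    print(f"block {block_id} -> {holders}")
print(blocks_by_node(placement))
```

A scheduler that knows `blocks_by_node` can send each task to a node that already holds the block, which is the "move programs to the data" idea.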
[Figure: a logical file is divided into blocks 1 to 4, and each block appears three times across the nodes of the cluster.]
How Files are Processed: MapReduce 
•Common pattern in data processing: apply a function, then aggregate 
grep "World Cup" *.txt | wc -l 
•User simply writes two pieces of code: “mapper” and “reducer” 
•Mapper code executes on every split of every file 
•Reducer consumes/aggregates mapper outputs 
•The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, temping of intermediate results, storage of final result on HDFS) 
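The grep-and-count pattern above can be written as a mapper and a reducer. The sketch below is plain Python, not the Hadoop API: the `mapper` and `reducer` names and the sample splits are assumptions, and the loop over splits stands in for the parallel execution that the framework would provide.

```python
def mapper(split: list, pattern: str) -> int:
    """Runs once per split: count the lines that contain the pattern (the grep step)."""
    return sum(1 for line in split if pattern in line)

def reducer(partial_counts: list) -> int:
    """Aggregates the mapper outputs into one result (the wc -l step)."""
    return sum(partial_counts)

# Hypothetical input, already divided into three splits.
splits = [
    ["World Cup final tonight", "weather report"],
    ["stock prices", "World Cup qualifiers", "World Cup draw"],
    ["local news"],
]

partials = [mapper(s, "World Cup") for s in splits]   # in Hadoop these run in parallel
total = reducer(partials)
print(partials, total)
```

Everything outside the two functions (splitting, scheduling, moving partial results) is what the framework takes care of.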
[Figure: a logical file is divided into splits 1 to 3 across the cluster; one map task runs per split, and a reduce task combines the map outputs into the result.]
SQL on Hadoop and Hive 
•Hadoop can process data of any kind (as long as it's splittable, etc) 
•A very common scenario: 
•Tabular data 
•Programs that “query” the data 
•Java Hadoop APIs are the wrong tool for this 
•Too low level, steep learning curve 
•Require strong programming expertise 
•Universally accepted solution: SQL 
•Enter Hive ... 
1.Impose relational structure on plain files 
2.Translate SELECT statements to MapReduce jobs 
3.Hide all the low level details
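To make step 2 concrete, here is a conceptual sketch of how a query such as SELECT ipaddress, COUNT(*) FROM logEvents GROUP BY ipaddress could be expressed as map, shuffle, and reduce steps. It is not Hive's actual planner; the rows and function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a table logEvents(ipaddress, message).
rows = [
    ("10.0.0.1", "login"),
    ("10.0.0.2", "login"),
    ("10.0.0.1", "logout"),
    ("10.0.0.3", "login"),
    ("10.0.0.1", "login"),
]

def map_phase(row):
    """Emit (group key, 1) for each row: the GROUP BY column becomes the key."""
    ipaddress, _message = row
    yield ipaddress, 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """COUNT(*) for one group is the sum of the emitted ones."""
    return key, sum(values)

pairs = [pair for row in rows for pair in map_phase(row)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```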
Why SQL on Hadoop? 
Hadoop stores large volumes and varieties of data 
SQL gets information and insight out of Hadoop 
SQL leverages existing IT skills resulting in quicker time to value and lower cost
Hive 
•One of the most popular Hadoop-related technologies 
•Ships with all major Hadoop distributions 
•Hive opens up Hadoop to anyone with SQL skills 
•Simplified and shortened development cycle 
•Little Java/MapReduce knowledge required 
•Three key concepts 
•Hive SerDe 
•Hive Table 
•Hive Metastore
Hive SerDes 
•SerDe = Serializer + Deserializer 
•Deserializer = Java code that implements mapping from Hadoop “record” to Hive “row” 
•A Hadoop record is just a byte array 
•A Hive row has columns with names and data types 
•Serializer maps Hive row to Hadoop record (for writing) 
•Many built-in SerDes 
•Delimited text files 
•JSON 
•XML 
•REGEX 
•AVRO 
•You can add your own custom SerDes
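The SerDe contract can be illustrated with a small Python sketch for pipe-delimited text. Real Hive SerDes are Java classes; the schema, delimiter, and function names below are assumptions chosen to mirror the idea of mapping a byte-array record to a typed row and back.

```python
from datetime import datetime

# Hypothetical schema: (column name, parser for reading, formatter for writing).
SCHEMA = [
    ("ipaddress", str, str),
    ("eventtime", datetime.fromisoformat, datetime.isoformat),
    ("message", str, str),
]
DELIM = "|"

def deserialize(record: bytes) -> dict:
    """Hadoop record (bytes) -> row with named, typed columns."""
    fields = record.decode("utf-8").split(DELIM)
    return {name: parse(raw) for (name, parse, _fmt), raw in zip(SCHEMA, fields)}

def serialize(row: dict) -> bytes:
    """Row -> Hadoop record (bytes), used when writing."""
    return DELIM.join(fmt(row[name]) for name, _parse, fmt in SCHEMA).encode("utf-8")

record = b"10.0.0.1|2014-08-01T19:00:00|user logged in"
row = deserialize(record)
print(row)
print(serialize(row) == record)
```

The file on disk never changes; only the interpretation applied while reading does. A sketch like this ignores escaping, so a message that contains the delimiter would be split incorrectly.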
Hive Tables 
•A Hive table imposes a relational “schema” (list of column names and types) on a file 
•Schema is purely logical 
•Data in the file is not altered in any way 
•“Schema on read” (as opposed to the “schema on write” of traditional RDBMSs) 
•Hive table = Metadata + Data 
•CREATE TABLE statement (metadata) 
•A directory containing one or more files (data) 
CREATE TABLE logEvents 
(ipaddress STRING, eventtime TIMESTAMP, message STRING) 
ROW FORMAT SERDE 'org.apache.hive…LazySimpleSerde' 
WITH SERDEPROPERTIES ( 'field.delim' = '|' ) 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.mapred.TextOutputFormat' 
LOCATION '/user/hive/warehouse/sample.db/logevents';
Hive MetaStore 
•The Hive metastore stores metadata about all the tables 
•Usually backed by a conventional relational db (not on HDFS) 
•Default: Derby 
•MySQL, DB2, Oracle 
•Table metadata 
•Schema (column names and types) 
•Location (directory on HDFS) 
•SerDe 
•Hadoop InputFormat/OutputFormat 
•Partition information 
•Properties (column and row delimiters, etc) 
•Security (access control)
Hadoop Latency and Hive SQL Features 
•Hive was not designed to be an RDBMS, but to hide the low-level details of MapReduce 
•But the inevitable questions came up … 
•Hadoop Latency 
•Why is my query so slow compared to XYZ? 
•Why does it take so long to retrieve a few rows? 
•Hive SQL Features 
•How do I define a view, stored procedure, …? 
•What’s wrong with this subquery ? 
•No DATE, DECIMAL, VARCHAR data types?
SQL-on-Hadoop landscape 
•The SQL-on-Hadoop landscape changes constantly! 
•Being relatively new to the SQL game, SQL-on-Hadoop engines have generally meant compromising on one or more of: 
•Speed 
•Robust SQL 
•Enterprise features 
•Interoperability with the Hadoop ecosystem 
•IBM InfoSphere BigInsights for Hadoop with Big SQL is based upon tried and true IBM relational technology, addressing all of these areas
Introducing Big SQL 3.0 
•Goal: bring SQL on Hadoop to the next level 
•Low-latency HDFS-based parallelism 
•Move programs to the data 
•No MapReduce 
=> MPP engine 
•Avoid unnecessary temping 
=> Message passing 
•Avoid process startup/teardown 
=> Daemon processes 
•Full SQL support 
[Figure: a SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP run-time), which reads data on HDFS in CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats.]
Big SQL 3.0 – Not just a faster, richer Hive
Big SQL highlights 
•Full support for subqueries 
•In SELECT, FROM, WHERE and HAVING clauses 
•Correlated and uncorrelated 
•Equality, non-equality subqueries 
•EXISTS, NOT EXISTS, IN, ANY, SOME, etc. 
•All standard join operations 
•Standard and ANSI join syntax 
•Inner, outer, and full outer joins 
•Equality, non-equality, cross join support 
•Multi-value join 
•UNION, INTERSECT, EXCEPT 
SELECT 
s_name, 
count(*) AS numwait 
FROM 
supplier, 
lineitem l1, 
orders, 
nation 
WHERE 
s_suppkey = l1.l_suppkey 
AND o_orderkey = l1.l_orderkey 
AND o_orderstatus = 'F' 
AND l1.l_receiptdate > l1.l_commitdate 
AND EXISTS ( 
SELECT 
* 
FROM 
lineitem l2 
WHERE 
l2.l_orderkey = l1.l_orderkey 
AND l2.l_suppkey <> l1.l_suppkey 
) 
AND NOT EXISTS ( 
SELECT 
* 
FROM 
lineitem l3 
WHERE 
l3.l_orderkey = l1.l_orderkey 
AND l3.l_suppkey <> l1.l_suppkey 
AND l3.l_receiptdate > 
l3.l_commitdate 
) 
AND s_nationkey = n_nationkey 
AND n_name = ':1' 
GROUP BY 
s_name 
ORDER BY 
numwait desc, 
s_name;
Big SQL in the Hadoop Ecosystem 
• Fully integrated with ecosystem 
– Hive Metastore 
– Hive Tables 
– Hive SerDes 
– Hive partitioning 
– Hive Statistics 
– Columnar formats 
• ORC 
• Parquet 
• RCFile 
• Completely open, without compromises 
• No proprietary storage format 
[Figure: Hive, Pig, Sqoop, and Big SQL all reach the Hive Metastore and the Hadoop cluster through the Hive APIs.]
Architected for performance 
•Architected from the ground up for low latency and high throughput 
•MapReduce replaced with a modern MPP architecture 
•Compiler and runtime are native code (not Java) 
•Big SQL worker daemons live directly on cluster 
•Continuously running (no startup latency) 
•Processing happens locally at the data 
•Message passing allows data to flow directly between nodes 
•Operations occur in memory with the ability to spill to disk 
•Supports aggregations and sorts larger than available RAM 
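One common way to sort more data than fits in memory is an external merge sort: sort bounded chunks in memory, spill each sorted run to disk, then merge the runs. The sketch below shows that general technique in Python. It is not Big SQL's implementation, and the function names and the `max_in_memory` parameter are assumptions.

```python
import heapq
import pickle
import tempfile

def _spill(sorted_run):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for item in sorted_run:
        pickle.dump(item, f)
    f.seek(0)
    return f

def _read_run(f):
    """Stream items back from a spilled run, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def external_sort(items, max_in_memory):
    """Sort an iterable while buffering at most max_in_memory items before spilling."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= max_in_memory:
            runs.append(_spill(sorted(buffer)))
            buffer = []
    # Merge the spilled runs with whatever is still in memory.
    streams = [_read_run(f) for f in runs] + [iter(sorted(buffer))]
    yield from heapq.merge(*streams)
    for f in runs:
        f.close()

print(list(external_sort([5, 3, 9, 1, 8, 2, 7], max_in_memory=3)))
```

The merge step reads one item per run at a time, so memory use depends on the number of runs rather than on the total data size.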
[Figure: cluster layout. One head node runs Big SQL and another runs the Hive Metastore; each of four compute nodes runs a Task Tracker, a Data Node, and Big SQL, all on top of HDFS/GPFS.]
Extreme parallelism 
•Massively parallel SQL engine that replaces MR 
•Shared-nothing architecture that eliminates scalability and networking issues 
•Engine pushes processing out to data nodes to maximize data locality. Hadoop data accessed natively via C++ and Java readers and writers. 
•Inter- and intra-node parallelism where work is distributed to multiple worker nodes and on each node multiple worker threads collaborate on the I/O and data processing (scale out horizontally and scale up vertically) 
•Intelligent data partition elimination based on SQL predicates 
•Fault tolerance through active health monitoring and management of parallel data and worker nodes
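Partition elimination, mentioned above, means skipping whole partitions whose key values cannot satisfy the query predicate. A minimal Python sketch of the idea follows; the table layout, paths, and names are hypothetical and do not reflect Big SQL internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    path: str    # hypothetical HDFS directory of the partition
    year: int    # partition key values
    month: int

PARTITIONS = [
    Partition("/warehouse/sales/year=2013/month=12", 2013, 12),
    Partition("/warehouse/sales/year=2014/month=1", 2014, 1),
    Partition("/warehouse/sales/year=2014/month=2", 2014, 2),
    Partition("/warehouse/sales/year=2014/month=3", 2014, 3),
]

def eliminate(partitions, predicate):
    """Keep only the partitions whose key values can satisfy the predicate."""
    return [p for p in partitions if predicate(p)]

# Equivalent of: WHERE year = 2014 AND month >= 2
to_scan = eliminate(PARTITIONS, lambda p: p.year == 2014 and p.month >= 2)
print([p.path for p in to_scan])
```

Only the surviving directories are read, so the files of the eliminated partitions are never opened.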
A process model view of Big SQL 3.0
Big SQL 3.0 – Architecture (cont.) 
•Big SQL's runtime execution engine is all native code 
•For common table formats a native I/O engine is utilized 
•e.g. delimited, RC, SEQ, Parquet, … 
•For all others, a Java I/O engine is used 
•Maximizes compatibility with existing tables 
•Allows for custom file formats and SerDes 
•All Big SQL built-in functions are native code 
•Customer-built UDxs can be developed in C++ or Java 
•Maximize performance without sacrificing extensibility 
[Figure: the management node runs Big SQL; on each compute node, next to the Task Tracker and Data Node, the Big SQL worker contains the runtime, a native I/O engine, a Java I/O engine (SerDe and I/O format), Java UDFs, and native UDFs.]
Resource management 
•Big SQL doesn't run in isolation 
•Nodes tend to be shared with a variety of Hadoop services 
•Task tracker 
•Data node 
•HBase region servers 
•MapReduce jobs 
•etc. 
•Big SQL can be constrained to limit its footprint on the cluster 
•% of CPU utilization 
•% of memory utilization 
•Resources are automatically adjusted based upon workload 
•Always fitting within constraints 
•Self-tuning memory manager that re-distributes resources across components dynamically 
•Default workload management (WLM) concurrency control for heavy queries 
[Figure: a compute node shared by a Task Tracker, a Data Node, Big SQL, HBase, and several MR tasks.]
Performance 
•Query rewrites 
•Exhaustive query rewrite capabilities 
•Leverages additional metadata such as constraints and nullability 
•Optimization 
•Statistics and heuristic driven query optimization 
•Query optimizer based upon decades of IBM RDBMS experience 
•Tools and metrics 
•Highly detailed explain plans and query diagnostic tools 
•Extensive number of available performance metrics 
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) 
FROM PERIOD, DAILY_SALES, PRODUCT, STORE 
WHERE 
PERIOD.PERKEY=DAILY_SALES.PERKEY AND 
PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND 
STORE.STOREKEY=DAILY_SALES.STOREKEY AND 
CALENDAR_DATE BETWEEN 
'01/01/2012' AND '04/28/2012' AND 
STORE_NUMBER='03' AND 
CATEGORY=72 
GROUP BY ITEM_DESC 
[Figure: query transformation (dozens of query transformations) followed by access plan generation (hundreds or thousands of access plan options), illustrated by alternative join trees over Store, Product, Daily Sales, and Period built from NLJOIN, HSJOIN, and ZZJOIN operators.]
Statistics are key to performance 
•Table statistics: 
•Cardinality (count) 
•Number of Files 
•Total File Size 
•Column statistics (this applies to column group stats also): 
•Minimum value 
•Maximum value 
•Cardinality (non-nulls) 
•Distribution (Number of Distinct Values) 
•Number of null values 
•Average Length of the column value (for string columns) 
•Histogram 
•Frequent Values (MFV)
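To see why these statistics matter, consider how an optimizer might turn them into row-count estimates. The sketch below uses textbook heuristics (uniform distribution, independent predicates); it is not the Big SQL optimizer. The class, function names, and all numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    min_value: float
    max_value: float
    distinct_values: int
    null_count: int

def equality_selectivity(stats: ColumnStats) -> float:
    """Estimated fraction of rows matching `col = constant`, assuming uniform values."""
    return 1.0 / stats.distinct_values

def range_selectivity(stats: ColumnStats, low: float, high: float) -> float:
    """Estimated fraction of rows with low <= col <= high, by linear interpolation."""
    span = stats.max_value - stats.min_value
    if span <= 0:
        return 1.0
    overlap = min(high, stats.max_value) - max(low, stats.min_value)
    return max(0.0, overlap / span)

# Hypothetical statistics for two filter columns of a 1,000,000-row join result.
table_cardinality = 1_000_000
store_number = ColumnStats(min_value=1, max_value=50, distinct_values=50, null_count=0)
category = ColumnStats(min_value=1, max_value=100, distinct_values=100, null_count=0)

# Treat the two predicates as independent and multiply their selectivities.
estimated_rows = (table_cardinality
                  * equality_selectivity(store_number)
                  * equality_selectivity(category))
print(round(estimated_rows))
```

With estimates like these an optimizer can compare plans, for example by choosing the smaller input as the build side of a hash join. Histograms and frequent-value lists refine the uniform assumption when the data is skewed.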
Application portability and integration 
•Big SQL 3.0 adopts IBM's standard Data Server Client Drivers 
•Robust, standards compliant ODBC, JDBC, and .NET drivers 
•Same driver used for DB2 LUW, DB2/z and Informix 
•Expands support to numerous languages (Python, Ruby, Perl, etc.) 
•Putting the story together…. 
•Big SQL shares a common SQL dialect with DB2 
•Big SQL shares the same client drivers with DB2 
•Data warehouse augmentation just got significantly easier 
Compatible SQL 
Compatible Drivers 
Portable Application
Application portability and integration (cont.) 
•This compatibility extends beyond your own applications 
•Open integration across Business Analytic Tools 
•IBM Optim Data Studio performance tool portfolio 
•Superior enablement for IBM Software – e.g. Cognos 
•Enhanced support by 3rd party software – e.g. MicroStrategy
Query federation 
•Data never lives in isolation 
•Whether Hadoop serves as a landing zone or a queryable archive, it is desirable to query data across Hadoop and active data warehouses 
•Big SQL provides the ability to query heterogeneous systems 
•Join Hadoop to other relational databases 
•Query optimizer understands capabilities of external system 
•Including available statistics 
•As much work as possible is pushed to each system to process 
[Figure: a head node running Big SQL and four compute nodes, each running a Task Tracker, a Data Node, and Big SQL.]
Enterprise security 
•Users may be authenticated via 
•Operating system 
•Lightweight directory access protocol (LDAP) 
•Kerberos 
•User authorization mechanisms include 
•Full GRANT/REVOKE based security 
•Group and role based hierarchical security 
•Object level, column level, or row level (fine-grained) access controls 
•Auditing 
•You may define audit policies and track user activity 
•Transport layer security (TLS) 
•Protect integrity and confidentiality of data between the client and Big SQL
Monitoring 
•Comprehensive runtime monitoring infrastructure that helps answer the question: what is going on in my system? 
•SQL interfaces to the monitoring data via table functions 
•Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis 
•Runtime statistics collected during the execution of the section for a (SQL) access plan 
•Support for event monitors to track specific types of operations and activities 
•Protect against and discover unknown or unacceptable behaviors by monitoring data access via Audit facility. 
[Figure: in Big SQL 3.0, worker threads collect metrics locally and push data up incrementally through connection control blocks to the reporting level (example: service class); a monitor query extracts data directly from the reporting level.]
Performance, Benchmarking, Benchmarketing 
•Performance matters to customers 
•Benchmarking appeals to engineers to drive product innovation 
•Benchmarketing used to convey performance in a memorable and appealing way 
•SQL over Hadoop is in the “Wild West” of Benchmarketing 
•100x claims! Compared to what? Conforming to what rules? 
•The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations 
•Formed in August, 1988 
•TPC-H and TPC-DS are the most relevant to SQL over Hadoop 
–R/W nature of workload not suitable for HDFS 
•Big Data Benchmarking Community (BDBC) formed
Power of Standard SQL 
•Everyone loves performance numbers, but that's not the whole story 
•How much work do you have to do to achieve those numbers? 
•A portion of our internal performance numbers are based upon industry standard benchmarks 
•Big SQL is capable of executing 
•All 22 TPC-H queries without modification 
•All 99 TPC-DS queries without modification 
Original Query 
SELECT s_name, count(*) AS numwait 
FROM supplier, lineitem l1, orders, nation 
WHERE s_suppkey = l1.l_suppkey 
AND o_orderkey = l1.l_orderkey 
AND o_orderstatus = 'F' 
AND l1.l_receiptdate > l1.l_commitdate 
AND EXISTS ( 
SELECT * 
FROM lineitem l2 
WHERE l2.l_orderkey = l1.l_orderkey 
AND l2.l_suppkey <> l1.l_suppkey) 
AND NOT EXISTS ( 
SELECT * 
FROM lineitem l3 
WHERE l3.l_orderkey = l1.l_orderkey 
AND l3.l_suppkey <> l1.l_suppkey 
AND l3.l_receiptdate > l3.l_commitdate) 
AND s_nationkey = n_nationkey 
AND n_name = ':1' 
GROUP BY s_name 
ORDER BY numwait desc, s_name 
Re-written for Hive 
SELECT s_name, count(1) AS numwait 
FROM 
(SELECT s_name FROM 
(SELECT s_name, t2.l_orderkey, l_suppkey, 
count_suppkey, max_suppkey 
FROM 
(SELECT l_orderkey, 
count(distinct l_suppkey) as count_suppkey, 
max(l_suppkey) as max_suppkey 
FROM lineitem 
WHERE l_receiptdate > l_commitdate 
GROUP BY l_orderkey) t2 
RIGHT OUTER JOIN 
(SELECT s_name, l_orderkey, l_suppkey 
FROM 
(SELECT s_name, t1.l_orderkey, l_suppkey, 
count_suppkey, max_suppkey 
FROM 
(SELECT l_orderkey, 
count(distinct l_suppkey) as count_suppkey, 
max(l_suppkey) as max_suppkey 
FROM lineitem 
GROUP BY l_orderkey) t1 
JOIN 
(SELECT s_name, l_orderkey, l_suppkey 
FROM orders o 
JOIN 
(SELECT s_name, l_orderkey, l_suppkey 
FROM nation n 
JOIN supplier s 
ON s.s_nationkey = n.n_nationkey 
AND n.n_name = 'INDONESIA' 
JOIN lineitem l 
ON s.s_suppkey = l.l_suppkey 
WHERE l.l_receiptdate > l.l_commitdate) l1 
ON o.o_orderkey = l1.l_orderkey 
AND o.o_orderstatus = 'F') l2 
ON l2.l_orderkey = t1.l_orderkey) a 
WHERE (count_suppkey > 1) or ((count_suppkey=1) 
AND (l_suppkey <> max_suppkey))) l3 
ON l3.l_orderkey = t2.l_orderkey) b 
WHERE (count_suppkey is null) 
OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c 
GROUP BY s_name 
ORDER BY numwait DESC, s_name
Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries 
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
Big SQL is 10x faster than Hive 0.12 
(total workload elapsed time) 
Comparing Big SQL and Hive 0.12 for Decision Support Queries 
* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). 
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
How many times faster is Big SQL than Hive 0.12? 
[Figure: queries sorted by speedup ratio (worst to best). Max speedup of 74x; average speedup of 20x.]
Conclusion 
•Today, it seems, performance numbers are the name of the game 
•But in reality there is so much more… 
•How rich is the SQL? 
•How difficult is it to (re-)use your existing SQL? 
•How secure is your data? 
•Is your data still open for other uses on Hadoop? 
•Can your queries span your enterprise? 
•Can other Hadoop workloads co-exist in harmony? 
•… 
•With Big SQL 3.0 performance doesn't mean compromise
Try it now! InfoSphere for BigInsights Quick Start 
Free, no limit, non-production version of BigInsights 
Features Big SQL, BigSheets, Text Analytics, Big R, management console, development tools 
Tutorials and education available 
ibm.co/QuickStart
Please Note 
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. 
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. 
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. 
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Topics for upcoming meetups 
•R on Hadoop 
•File systems 
•MapReduce/Spark/etc. engines 
•Hadoop Security 
•Spreadsheet analysis 
•Text analysis 
•?

Contenu connexe

Tendances

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsMark Rittman
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Databricks
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 

Tendances (20)

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...Improving Python and Spark Performance and Interoperability with Apache Arrow...
Improving Python and Spark Performance and Interoperability with Apache Arrow...
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 

Big Data Developers Moscow Meetup 1 - sql on hadoop

  • 1. Big Data Developers Meetup #1 Aug 2014 Andrey.vykhodtsev@ru.ibm.com Central & Eastern Europe BigData Tech Sales
  • 2. First Meetup 2014 •About SQL on Hadoop •As objective an overview as possible, and a constructive dialogue •Based on respect for other technologies, including competing ones •No holy wars •Modest snacks: help yourself •Time: 19:00 to 22:00; the program ends at 21:00, and the building must be vacated by 22:00
  • 3. Agenda •What is this Hadoop thing? •Why SQL on Hadoop? •What is Hive? •SQL-on-Hadoop landscape •InfoSphere BigInsights for Hadoop with Big SQL •What is it? •SQL capabilities •Architecture •Application portability and integration •Enterprise capabilities •Performance •Conclusion
  • 4. Big Data Scenarios Span Many Industries – and rely on Hadoop •Data Warehouse Modernization •Optimize existing EDW environment – size, performance, and TCO •Capture, off-load, and analyze massive amounts of data to get new insights •360 View of the Customer •Text analytics on social media commentary around life events •Link social media profiles to actual customers •Cyber Security •Analyze massive volumes of data that can’t be handled by existing SIEM systems •Internet drug trafficking, prostitution, monitoring all the web, email traffic to identify potential threats
  • 5. The Goal of Hadoop Manage large volumes of data Scalable to any volume Off-load from the warehouse Identify unique customers Reduce Costs Commodity hardware Common tools In-house skills Analyze new data types Improve business decisions Understand sentiment Analyze data-in-motion
  • 6. What is Hadoop? [Diagram: client (C), master (M); input files (splits 0 to 5) → map phase → intermediate files → reduce phase → output files (outputs 0 to 2)] •Framework to process big data in parallel on a cluster •What's new/different? •Free, open source •Uses commodity hardware •“Move programs to the data” •Scale both processing and storage by simply adding nodes •Makes big data processing accessible to everyone •Two key things to understand Hadoop: •How files are stored •How files are processed
  • 7. How files are stored: HDFS •Key ideas: •Divide big files into blocks and store the blocks randomly across the cluster •Provide an API to ask: where are the pieces of this file? •=> Programs can be shipped to nodes for parallel distributed processing [Diagram: a logical file is divided into blocks 1 to 4; each block is stored three times across the cluster]
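The key ideas on the HDFS slide can be illustrated with a toy model. This is a minimal sketch and not the HDFS API: the function names `split_into_blocks`, `place_blocks` and `locate`, the 256-byte block size, and the six node names are all assumptions made for illustration. The replication factor of 3 follows the slide's diagram, where each block appears three times.

```python
import random

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a logical file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes: list, replication: int, seed: int = 0) -> dict:
    """Assign every block to `replication` distinct, randomly chosen nodes."""
    rng = random.Random(seed)  # seeded only so that the example is repeatable
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}

def locate(placement: dict, block_id: int) -> list:
    """Answer the question 'where are the pieces of this file?' for one block."""
    return placement[block_id]

# Hypothetical cluster and file, for illustration only.
nodes = [f"node{i}" for i in range(1, 7)]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id in range(len(blocks)):
    print(f"block {block_id}: {locate(placement, block_id)}")
```

A scheduler that can call something like `locate` is what makes it possible to ship the program to a node that already holds the block, instead of moving the block to the program.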
  • 8. How Files are Processed: MapReduce •Common pattern in data processing: apply a function, then aggregate, e.g. grep "World Cup" *.txt | wc -l •User simply writes two pieces of code: “mapper” and “reducer” •Mapper code executes on every split of every file •Reducer consumes/aggregates mapper outputs •The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, temporary storage of intermediate results, storage of final result on HDFS) [Diagram: a logical file is divided into splits 1 to 3 across the cluster; one map task per split feeds a reduce task, which produces the result]
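The grep-and-count pattern from the MapReduce slide can be written as a mapper and a reducer. The sketch below runs in a single process and only imitates the division of labour; the names `mapper`, `reducer` and `run_job` and the sample splits are assumptions for illustration, not Hadoop APIs.

```python
from collections import defaultdict

def mapper(split):
    """Runs once per split: emit ("match", 1) for every line containing the phrase."""
    for line in split:
        if "World Cup" in line:
            yield ("match", 1)

def reducer(key, values):
    """Aggregates all mapper outputs that share a key."""
    return key, sum(values)

def run_job(splits):
    """Stand-in for the framework: run mappers, group by key, run reducers."""
    grouped = defaultdict(list)
    for split in splits:
        for key, value in mapper(split):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

# Three hypothetical splits of text lines.
splits = [
    ["World Cup final tonight", "weather report"],
    ["stock prices", "World Cup qualifiers", "World Cup draw"],
    ["nothing relevant here"],
]
result = run_job(splits)
print(result)
```

In a real cluster each mapper call would run on the node holding its split, and the grouping step would move intermediate results between nodes.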
  • 9. SQL on Hadoop and Hive •Hadoop can process data of any kind (as long as it's splittable, etc) •A very common scenario: •Tabular data •Programs that “query” the data •Java Hadoop APIs are the wrong tool for this •Too low level, steep learning curve •Require strong programming expertise •Universally accepted solution: SQL •Enter Hive ... 1.Impose relational structure on plain files 2.Translate SELECT statements to MapReduce jobs 3.Hide all the low level details
  • 10. Why SQL on Hadoop? •Hadoop stores large volumes and varieties of data •SQL gets information and insight out of Hadoop •SQL leverages existing IT skills, resulting in quicker time to value and lower cost
  • 11. Hive •One of the most popular Hadoop-related technologies •Ships with all major Hadoop distributions •Hive opens up Hadoop to anyone with SQL skills •Simplified and shortened development cycle •Little Java/MapReduce knowledge required •Three key concepts •Hive SerDe •Hive Table •Hive Metastore
  • 12. Hive SerDes •SerDe = Serializer + Deserializer •Deserializer = Java code that implements mapping from Hadoop “record” to Hive “row” •A Hadoop record is just a byte array •A Hive row has columns with names and data types •Serializer maps Hive row to Hadoop record (for writing) •Many built-in SerDes •Delimited text files •JSON •XML •REGEX •AVRO •Can add your own custom SerDes
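The record-to-row mapping described on the SerDe slide can be sketched as follows. Real Hive SerDes are Java classes; this Python version only illustrates the idea. The column names are borrowed from the `logEvents` table in this transcript, while `SCHEMA`, `deserialize`, `serialize`, the timestamp format and the sample record are assumptions.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout
DELIM = "|"                      # field delimiter, as in the logEvents example

# Hypothetical schema: (column name, converter from text to a typed value).
SCHEMA = [
    ("ipaddress", str),
    ("eventtime", lambda text: datetime.strptime(text, TS_FORMAT)),
    ("message", str),
]

def deserialize(record: bytes) -> dict:
    """Map a raw byte-array record to a row with named, typed columns."""
    fields = record.decode("utf-8").split(DELIM)
    return {name: convert(raw) for (name, convert), raw in zip(SCHEMA, fields)}

def serialize(row: dict) -> bytes:
    """Map a row back to a raw byte-array record (used when writing)."""
    parts = []
    for name, _ in SCHEMA:
        value = row[name]
        if isinstance(value, datetime):
            value = value.strftime(TS_FORMAT)
        parts.append(str(value))
    return DELIM.join(parts).encode("utf-8")

record = b"10.0.0.1|2014-08-01 19:00:00|meetup started"
row = deserialize(record)
print(row["ipaddress"], row["eventtime"].year, row["message"])
```

The file on disk is never changed by reading it this way, which is the sense in which the schema is applied on read.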
  • 13. Hive Tables •A Hive table imposes a relational “schema” (list of column names and types) on a file •Schema is purely logical •Data in the file is not altered in any way •“Schema on read” (as opposed to the schema on write of traditional RDBMSs) •Hive table = Metadata + Data •CREATE TABLE statement (metadata) •A directory containing one or more files (data) CREATE TABLE logEvents (ipaddress STRING, eventtime TIMESTAMP, message STRING) ROW FORMAT SERDE 'org.apache.hive…LazySimpleSerde' WITH SERDEPROPERTIES ( 'field.delim' = '|' ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.TextOutputFormat' LOCATION '/user/hive/warehouse/sample.db/logevents';
  • 14. Hive MetaStore •The Hive metastore stores metadata about all the tables •Usually backed by a conventional relational db (not on HDFS) •Default: Derby •MySQL, DB2, Oracle •Table metadata •Schema (column names and types) •Location (directory on HDFS) •SerDe •Hadoop InputFormat/OutputFormat •Partition information •Properties (column and row delimiters, etc) •Security (access control)
  • 15. Hadoop Latency and Hive SQL Features •Hive was not designed to be an RDBMS, but to hide the low-level details of MapReduce •But the inevitable questions came up … •Hadoop Latency •Why is my query so slow compared to XYZ? •Why does it take so long to retrieve a few rows? •Hive SQL Features •How do I define a view, stored procedure, …? •What’s wrong with this subquery ? •No DATE, DECIMAL, VARCHAR data types?
  • 16. SQL-on-Hadoop landscape •The SQL-on-Hadoop landscape changes constantly! •Being relatively new to the SQL game, SQL-on-Hadoop engines have generally meant compromising on one or more of: •Speed •Robust SQL •Enterprise features •Interoperability with the Hadoop ecosystem •IBM InfoSphere BigInsights for Hadoop with Big SQL is based upon tried and true IBM relational technology, addressing all of these areas
  • 17. Introducing Big SQL 3.0 •Goal: bring SQL on Hadoop to the next level •Low-latency HDFS-based parallelism •Move programs to the data •No MapReduce => MPP engine •Avoid unnecessary temping => Message passing •Avoid process startup/teardown => Daemon processes •Full SQL support [Diagram: a SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP run-time), which reads HDFS data in CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats]
  • 18. Big SQL 3.0 – Not just a faster, richer Hive
  • 19. Big SQL highlights •Full support for subqueries •In SELECT, FROM, WHERE and HAVING clauses •Correlated and uncorrelated •Equality, non-equality subqueries •EXISTS, NOT EXISTS, IN, ANY, SOME, etc. •All standard join operations •Standard and ANSI join syntax •Inner, outer, and full outer joins •Equality, non-equality, cross join support •Multi-value join •UNION, INTERSECT, EXCEPT SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 20. Big SQL in the Hadoop Ecosystem •Fully integrated with ecosystem: Hive Metastore, Hive Tables, Hive SerDes, Hive partitioning, Hive Statistics, columnar formats (ORC, Parquet, RCFile) •Completely open, without compromises •No proprietary storage format [Diagram: Hive, Pig, Sqoop, and Big SQL all reach the Hive Metastore on the Hadoop cluster through the Hive APIs]
  • 21. Architected for performance •Architected from the ground up for low latency and high throughput •MapReduce replaced with a modern MPP architecture •Compiler and runtime are native code (not Java) •Big SQL worker daemons live directly on cluster •Continuously running (no startup latency) •Processing happens locally at the data •Message passing allows data to flow directly between nodes •Operations occur in memory with the ability to spill to disk •Supports aggregations and sorts larger than available RAM [Diagram: a head node runs the Big SQL head node and the Hive Metastore; each of four compute nodes runs Task Tracker, Data Node, and Big SQL, on top of HDFS/GPFS]
  • 22. Extreme parallelism •Massively parallel SQL engine that replaces MR •Shared-nothing architecture that eliminates scalability and networking issues •Engine pushes processing out to data nodes to maximize data locality. Hadoop data accessed natively via C++ and Java readers and writers. •Inter- and intra-node parallelism where work is distributed to multiple worker nodes and on each node multiple worker threads collaborate on the I/O and data processing (scale out horizontally and scale up vertically) •Intelligent data partition elimination based on SQL predicates •Fault tolerance through active health monitoring and management of parallel data and worker nodes
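Partition elimination based on SQL predicates, mentioned on the slide above, can be illustrated with a small sketch. It is not Big SQL's implementation: the partition layout, the file names and the function `eliminate_partitions` are hypothetical, and the predicate stands in for something like `WHERE sale_month >= '2014-07'`.

```python
# Hypothetical table partitioned by month: partition key -> files in that directory.
PARTITIONS = {
    "2014-05": ["part-0000", "part-0001"],
    "2014-06": ["part-0000"],
    "2014-07": ["part-0000", "part-0001", "part-0002"],
    "2014-08": ["part-0000"],
}

def eliminate_partitions(partitions, predicate):
    """Keep only the partitions whose key satisfies the predicate."""
    return {key: files for key, files in partitions.items() if predicate(key)}

# Equivalent of: WHERE sale_month >= '2014-07'
to_scan = eliminate_partitions(PARTITIONS, lambda month: month >= "2014-07")

files_scanned = sum(len(files) for files in to_scan.values())
files_total = sum(len(files) for files in PARTITIONS.values())
print(f"scanning {files_scanned} of {files_total} files in partitions {sorted(to_scan)}")
```

The decision is made from metadata alone, so the files in the eliminated partitions are never opened.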
  • 23. A process model view of Big SQL 3.0
  • 24. Big SQL 3.0 – Architecture (cont.) •Big SQL's runtime execution engine is all native code •For common table formats a native I/O engine is utilized •e.g. delimited, RC, SEQ, Parquet, … •For all others, a Java I/O engine is used •Maximizes compatibility with existing tables •Allows for custom file formats and SerDes •All Big SQL built-in functions are native code •Customer-built UDxs can be developed in C++ or Java •Maximize performance without sacrificing extensibility [Diagram: a management node runs Big SQL; a compute node runs Task Tracker, Data Node, and a Big SQL worker made up of a native I/O engine, a Java I/O engine (SerDe, I/O format), and a runtime hosting Java UDFs and native UDFs]
  • 25. Resource management •Big SQL doesn't run in isolation •Nodes tend to be shared with a variety of Hadoop services •Task tracker •Data node •HBase region servers •MapReduce jobs •etc. •Big SQL can be constrained to limit its footprint on the cluster •% of CPU utilization •% of memory utilization •Resources are automatically adjusted based upon workload •Always fitting within constraints •Self-tuning memory manager that re-distributes resources across components dynamically •Default WLM concurrency control for heavy queries [Diagram: a compute node shared by Task Tracker, Data Node, Big SQL, HBase, and three MR tasks]
  • 26. Performance •Query rewrites •Exhaustive query rewrite capabilities •Leverages additional metadata such as constraints and nullability •Optimization •Statistics and heuristic driven query optimization •Query optimizer based upon decades of IBM RDBMS experience •Tools and metrics •Highly detailed explain plans and query diagnostic tools •Extensive number of available performance metrics SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC [Diagram: query transformation (dozens of query transformations) followed by access plan generation (hundreds or thousands of access plan options); the example plans join Store, Product, Daily Sales, and Period in different orders using NLJOIN, HSJOIN, and ZZJOIN operators]
  • 27. Statistics are key to performance •Table statistics: •Cardinality (count) •Number of Files •Total File Size •Column statistics (this applies to column group stats also): •Minimum value •Maximum value •Cardinality (non-nulls) •Distribution (Number of Distinct Values) •Number of null values •Average Length of the column value (for string columns) •Histogram •Frequent Values (MFV)
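To show why these statistics matter, the sketch below estimates how many rows a predicate will return, using two textbook approximations: an equality predicate selects roughly one row in every "number of distinct values", and a range predicate selects the fraction of the min-to-max span that it covers. These uniform-distribution formulas are an assumption for illustration and are not taken from Big SQL; the `ColumnStats` class, the function names and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    min_value: float
    max_value: float
    distinct_values: int
    null_count: int

def equality_selectivity(stats: ColumnStats, table_rows: int) -> float:
    """Estimated fraction of rows matching `column = constant` (uniform assumption)."""
    non_null_fraction = (table_rows - stats.null_count) / table_rows
    return non_null_fraction / stats.distinct_values

def range_selectivity(stats: ColumnStats, low: float, high: float, table_rows: int) -> float:
    """Estimated fraction of rows matching `column BETWEEN low AND high`."""
    span = stats.max_value - stats.min_value
    if span <= 0:
        return 0.0
    overlap = max(0.0, min(high, stats.max_value) - max(low, stats.min_value))
    non_null_fraction = (table_rows - stats.null_count) / table_rows
    return non_null_fraction * overlap / span

table_rows = 1_000_000  # table cardinality
store_number = ColumnStats(min_value=1, max_value=100, distinct_values=100, null_count=0)
price = ColumnStats(min_value=0, max_value=200, distinct_values=5000, null_count=0)

eq_rows = equality_selectivity(store_number, table_rows) * table_rows
range_rows = range_selectivity(price, 50, 100, table_rows) * table_rows
print(f"estimated rows: equality={eq_rows:.0f}, range={range_rows:.0f}")
```

An optimizer compares estimates like these across candidate join orders and join methods; histograms and frequent-value lists refine them when the data is not uniformly distributed.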
  • 28. Application portability and integration •Big SQL 3.0 adopts IBM's standard Data Server Client Drivers •Robust, standards compliant ODBC, JDBC, and .NET drivers •Same driver used for DB2 LUW, DB2/z and Informix •Expands support to numerous languages (Python, Ruby, Perl, etc.) •Putting the story together…. •Big SQL shares a common SQL dialect with DB2 •Big SQL shares the same client drivers with DB2 •Data warehouse augmentation just got significantly easier Compatible SQL Compatible Drivers Portable Application
  • 29. Application portability and integration (cont.) •This compatibility extends beyond your own applications •Open integration across Business Analytic Tools •IBM Optim Data Studio performance tool portfolio •Superior enablement for IBM Software – e.g. Cognos •Enhanced support by 3rd party software – e.g. Microstrategy
  • 30. Query federation •Data never lives in isolation •Either as a landing zone or a queryable archive it is desirable to query data across Hadoop and active data warehouses •Big SQL provides the ability to query heterogeneous systems •Join Hadoop to other relational databases •Query optimizer understands capabilities of external system •Including available statistics •As much work as possible is pushed to each system to process [Diagram: a Big SQL head node and four compute nodes, each running Task Tracker, Data Node, and Big SQL]
  • 31. Enterprise security •Users may be authenticated via •Operating system •Lightweight directory access protocol (LDAP) •Kerberos •User authorization mechanisms include •Full GRANT/REVOKE based security •Group and role based hierarchical security •Object level, column level, or row level (fine-grained) access controls •Auditing •You may define audit policies and track user activity •Transport layer security (TLS) •Protect integrity and confidentiality of data between the client and Big SQL
  • 32. Monitoring •Comprehensive runtime monitoring infrastructure that helps answer the question: what is going on in my system? •SQL interfaces to the monitoring data via table functions •Ability to drill down into more granular metrics for problem determination and/or detailed performance analysis •Runtime statistics collected during the execution of the section for a (SQL) access plan •Support for event monitors to track specific types of operations and activities •Protect against and discover unknown or unacceptable behaviors by monitoring data access via the audit facility [Diagram: in Big SQL 3.0, worker threads collect metrics locally and push the data up incrementally to connection control blocks and to the reporting level (example: service class); a monitor query extracts data directly from the reporting level]
  • 33. Performance, Benchmarking, Benchmarketing •Performance matters to customers •Benchmarking appeals to engineers to drive product innovation •Benchmarketing used to convey performance in a memorable and appealing way •SQL over Hadoop is in the “Wild West” of Benchmarketing •100x claims! Compared to what? Conforming to what rules? •The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations •Formed in August, 1988 •TPC-H and TPC-DS are the most relevant to SQL over Hadoop –R/W nature of workload not suitable for HDFS •Big Data Benchmarking Community (BDBC) formed
  • 34. Power of Standard SQL •Everyone loves performance numbers, but that's not the whole story •How much work do you have to do to achieve those numbers? •A portion of our internal performance numbers are based upon industry standard benchmarks •Big SQL is capable of executing •All 22 TPC-H queries without modification •All 99 TPC-DS queries without modification
    Original Query: SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name
    Re-written for Hive: SELECT s_name, count(1) AS numwait FROM (SELECT s_name FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem WHERE l_receiptdate > l_commitdate GROUP BY l_orderkey) t2 RIGHT OUTER JOIN (SELECT s_name, l_orderkey, l_suppkey FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem GROUP BY l_orderkey) t1 JOIN (SELECT s_name, l_orderkey, l_suppkey FROM orders o JOIN (SELECT s_name, l_orderkey, l_suppkey FROM nation n JOIN supplier s ON s.s_nationkey = n.n_nationkey AND n.n_name = 'INDONESIA' JOIN lineitem l ON s.s_suppkey = l.l_suppkey WHERE l.l_receiptdate > l.l_commitdate) l1 ON o.o_orderkey = l1.l_orderkey AND o.o_orderstatus = 'F') l2 ON l2.l_orderkey = t1.l_orderkey) a WHERE (count_suppkey > 1) or ((count_suppkey=1) AND (l_suppkey <> max_suppkey))) l3 ON l3.l_orderkey = t2.l_orderkey) b WHERE (count_suppkey is null) OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c GROUP BY s_name ORDER BY numwait DESC, s_name
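The slide's point, that a query written with EXISTS and NOT EXISTS has to be re-expressed with joins and aggregates for Hive 0.12, can be checked on a small scale. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module rather than Big SQL or Hive, a cut-down `lineitem` table with made-up rows, and a simplified form of the rewrite (a LEFT JOIN in place of the slide's RIGHT OUTER JOIN, and no `supplier`, `orders`, or `nation` tables).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lineitem (
    l_orderkey    INTEGER,
    l_suppkey     INTEGER,
    l_receiptdate TEXT,
    l_commitdate  TEXT
);
-- Made-up sample rows, not benchmark data.
INSERT INTO lineitem VALUES
    (1, 10, '2014-01-05', '2014-01-01'),  -- order 1: supplier 10 is late
    (1, 20, '2014-01-01', '2014-01-02'),  -- order 1: supplier 20 is on time
    (2, 10, '2014-01-05', '2014-01-01'),  -- order 2: both suppliers are late
    (2, 30, '2014-01-06', '2014-01-01'),
    (3, 20, '2014-01-05', '2014-01-01');  -- order 3: only one supplier
""")

# Subquery form, as in the original query: a late line item on a
# multi-supplier order where no other supplier was late.
QUERY_SUBQUERIES = """
SELECT l1.l_orderkey, l1.l_suppkey
FROM lineitem l1
WHERE l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (SELECT * FROM lineitem l2
              WHERE l2.l_orderkey = l1.l_orderkey
                AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (SELECT * FROM lineitem l3
                  WHERE l3.l_orderkey = l1.l_orderkey
                    AND l3.l_suppkey <> l1.l_suppkey
                    AND l3.l_receiptdate > l3.l_commitdate)
ORDER BY 1, 2
"""

# Join form: per-order aggregates (t1 over all line items, t2 over late
# ones) replace the correlated subqueries.
QUERY_JOINS = """
SELECT l1.l_orderkey, l1.l_suppkey
FROM lineitem l1
JOIN (SELECT l_orderkey,
             count(DISTINCT l_suppkey) AS count_suppkey,
             max(l_suppkey) AS max_suppkey
      FROM lineitem
      GROUP BY l_orderkey) t1 ON t1.l_orderkey = l1.l_orderkey
LEFT JOIN (SELECT l_orderkey,
                  count(DISTINCT l_suppkey) AS count_suppkey,
                  max(l_suppkey) AS max_suppkey
           FROM lineitem
           WHERE l_receiptdate > l_commitdate
           GROUP BY l_orderkey) t2 ON t2.l_orderkey = l1.l_orderkey
WHERE l1.l_receiptdate > l1.l_commitdate
  AND (t1.count_suppkey > 1
       OR (t1.count_suppkey = 1 AND l1.l_suppkey <> t1.max_suppkey))
  AND (t2.count_suppkey IS NULL
       OR (t2.count_suppkey = 1 AND l1.l_suppkey = t2.max_suppkey))
ORDER BY 1, 2
"""

rows_subqueries = conn.execute(QUERY_SUBQUERIES).fetchall()
rows_joins = conn.execute(QUERY_JOINS).fetchall()
print(rows_subqueries, rows_joins)
```

On this sample both forms select only supplier 10 on order 1. The join form is longer and harder to verify by eye, which is the "how much work do you have to do" cost the slide refers to.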
  • 35. Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  • 36. Comparing Big SQL and Hive 0.12 for Decision Support Queries •Big SQL is 10x faster than Hive 0.12 (total workload elapsed time) * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  • 37. How many times faster is Big SQL than Hive 0.12? •Max speedup of 74x •Avg speedup of 20x [Chart: queries sorted by speedup ratio (worst to best)] * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
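The slides quote three different speedup figures: a ratio of total workload elapsed time (10x on the previous slide), plus an average and a maximum of per-query ratios (20x and 74x here). These are different statistics and need not agree. The sketch below shows the arithmetic on hypothetical timings; the numbers are invented for illustration and are not the measured results behind the slides.

```python
# Hypothetical per-query elapsed times in seconds (NOT measured results).
hive_secs = [100.0, 400.0, 60.0, 900.0]
bigsql_secs = [50.0, 40.0, 2.0, 90.0]

# Per-query speedup: how many times faster each individual query ran.
per_query = [h / b for h, b in zip(hive_secs, bigsql_secs)]

# Average and maximum of the per-query ratios.
avg_speedup = sum(per_query) / len(per_query)
max_speedup = max(per_query)

# Ratio of total elapsed time: dominated by the longest-running queries.
total_speedup = sum(hive_secs) / sum(bigsql_secs)

# Same ordering as the chart: worst speedup first, best last.
sorted_speedups = sorted(per_query)

print(f"per-query (worst to best): {sorted_speedups}")
print(f"avg {avg_speedup:.1f}x, max {max_speedup:.1f}x, "
      f"total elapsed {total_speedup:.1f}x")
```

A short query with a very large speedup raises the per-query average but barely changes the total elapsed time, which is why an average of per-query ratios can be well above the total-workload ratio.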
  • 38. Conclusion •Today, it seems, performance numbers are the name of the game •But in reality there is so much more… •How rich is the SQL? •How difficult is it to (re-)use your existing SQL? •How secure is your data? •Is your data still open for other uses on Hadoop? •Can your queries span your enterprise? •Can other Hadoop workloads co-exist in harmony? •… •With Big SQL 3.0 performance doesn't mean compromise
  • 39. Try it now! •InfoSphere for BigInsights Quick Start •Free, no limit, non-production version of BigInsights •Features Big SQL, BigSheets, Text Analytics, Big R, management console, development tools •Tutorials and education available •ibm.co/QuickStart
  • 40. Please Note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 41. Topics for upcoming meetups •R on Hadoop •File systems •MapReduce/Spark/etc. engines •Hadoop Security •Spreadsheet analysis •Text analysis •?