Apache Sqoop

A Data Transfer Tool for Hadoop




         Arvind Prabhakar, Cloudera Inc., September 21, 2011
What is Sqoop?

● Allows easy import and export of data from structured
  data stores:
   ○ Relational Database
   ○ Enterprise Data Warehouse
   ○ NoSQL Datastore

● Allows easy integration with Hadoop based systems:
   ○ Hive
   ○ HBase
   ○ Oozie
Agenda

● Motivation

● Importing and exporting data using Sqoop

● Provisioning Hive Metastore

● Populating HBase tables

● Sqoop Connectors

● Current Status and Road Map
Motivation

● Structured data stored in Databases and EDW is not easily
  accessible for analysis in Hadoop

● Access to Databases and EDW from Hadoop Clusters is
  problematic.

● Forcing MapReduce to access data from Databases/EDWs is
  repetitive, error-prone, and non-trivial.

● Data preparation is often required for efficient consumption
  by Hadoop-based data pipelines.

● Current methods of transferring data are inefficient and
  ad hoc.
Enter: Sqoop

    A tool to automate data transfer between structured     
    datastores and Hadoop.

Highlights

 ● Uses datastore metadata to infer structure definitions
 ● Uses MapReduce framework to transfer data in parallel
 ● Allows structure definitions to be provisioned in Hive
   metastore
 ● Provides an extension mechanism to incorporate high
   performance connectors for external systems. 
Importing Data

mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)
Importing Data
$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password ****
 ...

INFO mapred.JobClient: Counters: 12
INFO mapred.JobClient:   Job Counters 
INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12873
...
INFO mapred.JobClient:     Launched map tasks=4
INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
INFO mapred.JobClient:   FileSystemCounters
INFO mapred.JobClient:     HDFS_BYTES_READ=505
INFO mapred.JobClient:     FILE_BYTES_WRITTEN=222848
INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=35098
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=326
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     Map output records=326
INFO mapred.JobClient:     SPLIT_RAW_BYTES=505
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.2754 seconds (3.0398 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
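
The run above launched four map tasks, Sqoop's default degree of parallelism. As a rough sketch (the mapper count, split column, and target directory below are illustrative choices, not values from this run), the same import can be tuned with standard options:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --num-mappers 8 --split-by ORDER_NUMBER \
    --target-dir /user/arvind/orders_raw

Here --num-mappers controls how many parallel map tasks run, --split-by names the column used to partition the table into splits, and --target-dir sets the HDFS output directory explicitly.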
Importing Data

$ hadoop fs -ls
Found 32 items
....
drwxr-xr-x - arvind staff 0 2011-09-13 19:12 /user/arvind/ORDERS
....

$ hadoop fs -ls /user/arvind/ORDERS
Found 6 items
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_SUCCESS
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_logs
... 8826 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00000
... 8760 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00001
... 8841 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00002
... 8671 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00003
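
The part-m-* files are plain text; with default settings each line is one row with comma-separated fields. A quick way to spot-check the imported data:

$ hadoop fs -cat /user/arvind/ORDERS/part-m-00000 | head -5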
Exporting Data

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS
...
INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$



  ● Default delimiters: ',' for fields, newlines for records
  ● Optionally specify an escape sequence
  ● Delimiters can be specified for both import and export, as shown
    in the sketch below
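
A minimal sketch of overriding the defaults; the tab field delimiter and backslash escape character here are illustrative choices, not requirements:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --fields-terminated-by '\t' \
    --escaped-by '\\' \
    --lines-terminated-by '\n'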
Exporting Data

Exports can optionally use Staging Tables

 ● Map tasks populate staging table

 ● Each map write is broken down into many transactions

 ● Staging table is then used to populate the target table in a
   single transaction

 ● In case of failure, the staging table insulates the target table
   from data corruption (see the sketch below).
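
A sketch of an export routed through a staging table; ORDERS_CLEAN_STAGE is an assumed name for a pre-created table with the same schema as ORDERS_CLEAN:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test --password **** \
    --export-dir /user/arvind/ORDERS \
    --staging-table ORDERS_CLEAN_STAGE --clear-staging-table

With --clear-staging-table, Sqoop empties the staging table before the export begins.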
Importing Data into Hive

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** --hive-import
 ...

INFO mapred.JobClient: Counters: 12
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.3995 seconds (3.0068 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
INFO hive.HiveImport: Removing temporary files from import process: ORDERS/_logs
INFO hive.HiveImport: Loading uploaded data into Hive
...
WARN hive.TableDefWriter: Column ORDER_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column REQUIRED_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column SHIP_DATE had to be cast to a less precise type in Hive
...
$
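
By default the Hive table takes the name of the source table. A hedged sketch of renaming it and replacing its contents on re-import (acme_orders is an assumed name):

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hive-import --hive-overwrite --hive-table acme_orders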
Importing Data into Hive

$ hive
hive> show tables;
OK
...
orders
...
hive> describe orders;
OK
order_number int
order_date string
required_date string
ship_date string
status string
comments string
customer_number int
Time taken: 0.236 seconds
hive>
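
Once registered in the metastore, the table is queryable with ordinary HiveQL; for example, a quick distribution of order statuses (an illustrative query, not part of the original deck):

hive> SELECT status, COUNT(*) FROM orders GROUP BY status;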
Importing Data into HBase

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** 
  --hbase-create-table --hbase-table ORDERS --column-family mysql
...
INFO mapreduce.HBaseImportJob: Creating missing HBase table ORDERS
...
INFO mapreduce.ImportJobBase: Retrieved 326 records.
$


  ● Sqoop creates the missing table if instructed
  ● If no row key is specified, the primary key column is used (see
    the sketch below)
  ● Each output column is placed in the same column family
  ● Every record read results in an HBase put operation
  ● All values are converted to their string representation and
    inserted as UTF-8 bytes.
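
A sketch of making the row key explicit rather than relying on the primary-key default; choosing ORDER_NUMBER here simply mirrors the table's primary key:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-create-table --hbase-table ORDERS \
    --column-family mysql --hbase-row-key ORDER_NUMBER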
Importing Data into HBase

hbase(main):001:0> list
TABLE 
ORDERS 
1 row(s) in 0.3650 seconds

hbase(main):002:0>  describe 'ORDERS'
DESCRIPTION                             ENABLED
{NAME => 'ORDERS', FAMILIES => [                true
 {NAME => 'mysql', BLOOMFILTER => 'NONE',
  REPLICATION_SCOPE => '0', COMPRESSION => 'NONE',
  VERSIONS => '3', TTL => '2147483647',
  BLOCKSIZE => '65536', IN_MEMORY => 'false',
  BLOCKCACHE => 'true'}]}
1 row(s) in 0.0310 seconds

hbase(main):003:0>
Importing Data into HBase

hbase(main):001:0> scan 'ORDERS', { LIMIT => 1 }
ROW COLUMN+CELL
10100 column=mysql:CUSTOMER_NUMBER,timestamp=1316036948264,
    value=363
10100 column=mysql:ORDER_DATE, timestamp=1316036948264,
    value=2003-01-06 00:00:00.0
10100 column=mysql:REQUIRED_DATE, timestamp=1316036948264,
    value=2003-01-13 00:00:00.0
10100 column=mysql:SHIP_DATE, timestamp=1316036948264,
    value=2003-01-10 00:00:00.0
10100 column=mysql:STATUS, timestamp=1316036948264,
    value=Shipped
1 row(s) in 0.0130 seconds

hbase(main):012:0>
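
Individual cells can then be fetched directly; for example, the STATUS of the row shown above (prompt abbreviated, output omitted):

hbase> get 'ORDERS', '10100', {COLUMN => 'mysql:STATUS'}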
Sqoop Connectors

● The connector mechanism allows creation of new connectors that
  improve or augment Sqoop functionality.

● Bundled connectors include:
   ○ MySQL, PostgreSQL, Oracle, SQL Server, JDBC
   ○ Direct MySQL, Direct PostgreSQL

● Regular connectors are JDBC based (see the sketch after this list).

● Direct connectors use native tools for high-performance data
  transfer.
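
For a database without a dedicated connector, the generic JDBC path can be selected by naming the driver class explicitly. A sketch only: the host, database, and driver class below are illustrative, and the driver JAR must be on Sqoop's classpath:

$ sqoop import --connect 'jdbc:sqlserver://dbhost:1433;databaseName=acmedb' \
    --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
    --table ORDERS --username test --password ****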
Import using Direct MySQL Connector

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
   --table ORDERS --username test --password **** --direct
...
manager.DirectMySQLManager: Beginning mysqldump fast
path import
...

Direct import works as follows:
 ● Data is partitioned into splits using JDBC
 ● Map tasks use mysqldump to perform the import with a conditional
   selection clause (-w 'ORDER_NUMBER > ...')
 ● Header and footer information is stripped out

Direct export similarly uses the mysqlimport utility (see the sketch
below).
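
A sketch of the corresponding direct export: the same --direct flag switches the export path to mysqlimport, and ORDERS_CLEAN is the target table from the earlier export example:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS_CLEAN --username test --password **** \
    --export-dir /user/arvind/ORDERS --direct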
Third Party Connectors

● Oracle - Developed by Quest Software

● Couchbase - Developed by Couchbase

● Netezza - Developed by Cloudera

● Teradata - Developed by Cloudera

● Microsoft SQL Server - Developed by Microsoft

● Microsoft PDW - Developed by Microsoft

● VoltDB - Developed by VoltDB
Current Status

Sqoop is currently in Apache Incubator

  ● Status Page
     http://incubator.apache.org/projects/sqoop.html

  ● Mailing Lists
     sqoop-user@incubator.apache.org
     sqoop-dev@incubator.apache.org

  ● Release
     Current shipping version is 1.3.0
Hadoop World 2011


A gathering of Hadoop practitioners, developers,
business executives, industry luminaries and
innovative companies in the Hadoop ecosystem.

    ● November 8-9, Sheraton New York Hotel & Towers, NYC
    ● Network: 1400 attendees, 25+ sponsors
    ● Learn: 60 sessions across 5 tracks for
         ○ Developers
         ○ IT Operations
         ○ Enterprise Architects
         ○ Data Scientists
         ○ Business Decision Makers
    ● Train: Cloudera training and certification (November 7, 10, 11)
    ● Learn more and register at www.hadoopworld.com
Sqoop Meetup



      Monday, November 7, 2011, 8pm - 9pm

                       at

     Sheraton New York Hotel & Towers, NYC
Thank you!

   Q&A
