Hive Quick Start

  © 2010 Cloudera, Inc.
Background
•  Started at Facebook
•  Data was collected by nightly cron jobs into an Oracle DB
•  “ETL” via hand-coded Python
•  Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that.


Hadoop as Enterprise Data Warehouse
•  Scribe and MySQL data loaded into Hadoop HDFS
•  Hadoop MapReduce jobs to process data
•  Missing components:
  –  Command-line interface for “end users”
  –  Ad-hoc query support
    •  … without writing full MapReduce jobs
  –  Schema information
Hive Applications
•  Log processing
•  Text mining
•  Document indexing
•  Customer-facing business intelligence (e.g., Google Analytics)
•  Predictive modeling, hypothesis testing
Hive Architecture




Data Model
•  Tables
  –  Typed columns (int, float, string, date, boolean)
  –  Also array/map/struct for JSON-like data
•  Partitions
  –  e.g., to range-partition tables by date
•  Buckets
  –  Hash partitions within ranges (useful for sampling, join optimization)
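As an illustrative HiveQL sketch tying the three concepts together (the table, columns, and bucket count are hypothetical, not from the slides):

```sql
-- Hypothetical table: range-partitioned by date, hash-bucketed by user_id
CREATE TABLE page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Bucketing makes sampling cheap: read only 1 of the 32 hash partitions
SELECT * FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```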

Column Data Types
                        
CREATE TABLE t (
  s STRING,
  f FLOAT,
  a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
);

SELECT s, f, a[0]['foobar'].p2 FROM t;




Metastore
•  Database: namespace containing a set of tables
•  Holds table/partition definitions (column types, mappings to HDFS directories)
•  Statistics
•  Implemented with the DataNucleus ORM; runs on Derby, MySQL, and many other relational databases
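For example, pointing the metastore at MySQL instead of the default embedded Derby is done through Hive's configuration; a minimal hive-site.xml sketch (hostname and credentials are placeholders):

```xml
<configuration>
  <!-- JDO/DataNucleus connection settings; db.example.com is a placeholder -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db.example.com/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>
```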
Physical Layout
•  Warehouse directory in HDFS
  –  e.g., /user/hive/warehouse
•  Table row data stored in subdirectories of the warehouse
•  Partitions form subdirectories of table directories
•  Actual data stored in flat files
  –  Control char-delimited text, or SequenceFiles
  –  With a custom SerDe, can use arbitrary formats
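A minimal HiveQL sketch of the non-default storage option mentioned above (table name is illustrative):

```sql
-- Same flat-file model, but stored as binary SequenceFiles
-- instead of delimited text
CREATE TABLE foo_seq (id INT, msg STRING)
STORED AS SEQUENCEFILE;
```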
Installing Hive
                         
From a Release Tarball:

 $ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz
 $ tar xvzf hive-0.5.0-bin.tar.gz
 $ cd hive-0.5.0-bin
 $ export HIVE_HOME=$PWD
 $ export PATH=$HIVE_HOME/bin:$PATH


Installing Hive
                          
Building from Source:

 $ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
 $ cd hive
 $ ant package
 $ cd build/dist
 $ export HIVE_HOME=$PWD
 $ export PATH=$HIVE_HOME/bin:$PATH



Installing Hive
                          
Other Options:
•  Use a Git mirror:
  –  git://github.com/apache/hive.git
•  Cloudera Hive packages
  –  Red Hat and Debian
  –  Packages include backported patches
  –  See archive.cloudera.com

Hive Dependencies
                      
•  Java 1.6
•  Hadoop 0.17-0.20
•  Hive *MUST* be able to find Hadoop:
  –  Set HADOOP_HOME=<hadoop-install-dir>
  –  Add $HADOOP_HOME/bin to $PATH




Hive Dependencies
                             
•  Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse



Hive Configuration
•  Default configuration in $HIVE_HOME/conf/hive-default.xml
  –  DO NOT TOUCH THIS FILE!
•  (Re)define properties in $HIVE_HOME/conf/hive-site.xml
•  Use $HIVE_CONF_DIR to specify an alternate conf dir location

Hive Configuration
•  You can override Hadoop configuration properties in Hive’s configuration, e.g.:
  – mapred.reduce.tasks=1




Logging
•  Hive uses log4j
•  Log4j configuration is located in $HIVE_HOME/conf/hive-log4j.properties
•  Logs are stored in /tmp/${user.name}/hive.log



Starting the Hive CLI
•  Start a terminal and run	
 $ hive


•  Should see a prompt like:	
 hive>




Hive CLI Commands
                      
•  Set a Hive or Hadoop conf prop:
  – hive> set propkey=value;
•  List all properties and values:
  – hive> set -v;
•  Add a resource to the DCache:
  – hive> add [ARCHIVE|FILE|JAR] filename;


Hive CLI Commands
                      
•  List tables:
  – hive> show tables;
•  Describe a table:
  – hive> describe <tablename>;
•  More information:
  – hive> describe extended <tablename>;


Hive CLI Commands
                      
•  List functions:
  – hive> show functions;
•  More information:
  – hive> describe function <functionname>;




Selecting data
hive> SELECT * FROM <tablename> LIMIT 10;

hive> SELECT * FROM <tablename>
  WHERE freq > 100 SORT BY freq ASC
  LIMIT 10;




Manipulating Tables
                         
•  DDL operations	
  –  SHOW TABLES
  –  CREATE TABLE
  –  ALTER TABLE
  –  DROP TABLE




Creating Tables in Hive
                             
•  Most straightforward:

CREATE TABLE foo(id INT, msg STRING);

•  Assumes default table layout
  –  Text files; fields terminated with ^A, lines terminated with \n




Changing Row Format
                       
•  Arbitrary field and record separators are possible, e.g. CSV format:

 CREATE TABLE foo(id INT, msg STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n';



Partitioning Data
                          
•  One or more partition columns may be specified:

 CREATE TABLE foo (id INT, msg STRING)
 PARTITIONED BY (dt STRING);

•  Creates a subdirectory for each value of the partition column, e.g.:
  /user/hive/warehouse/foo/dt=2009-03-20/

•  Queries with partition columns in the WHERE clause scan only a subset of the data
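Continuing the example above (the path and dates are illustrative): data lands in a partition via LOAD DATA ... PARTITION, and a query that filters on dt touches only that subdirectory:

```sql
-- Load one day's data into its own partition
LOAD DATA INPATH '/incoming/2009-03-20'
INTO TABLE foo PARTITION (dt='2009-03-20');

-- Scans only /user/hive/warehouse/foo/dt=2009-03-20/
SELECT id, msg FROM foo WHERE dt = '2009-03-20';
```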

Sqoop = SQL-to-Hadoop




Sqoop: Features
                             
•  JDBC-based interface (MySQL, Oracle, PostgreSQL, etc.)
•  Automatic datatype generation
  –  Reads column info from the table and generates Java classes
  –  Can be used in further MapReduce processing passes
•  Uses MapReduce to read tables from the database
  –  Can select an individual table (or a subset of columns)
  –  Can read all tables in a database
•  Supports most JDBC standard types and null values




Example input
                               
mysql> use corp;
Database changed
mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+




Loading into HDFS


$ sqoop --connect jdbc:mysql://db.foo.com/corp \
    --table employees

•  Imports the “employees” table into an HDFS directory




Hive Integration
$ sqoop --connect jdbc:mysql://db.foo.com/corp \
    --hive-import --table employees

•  Auto-generates CREATE TABLE / LOAD DATA INPATH statements for Hive
•  After data is imported to HDFS, auto-executes the Hive script
•  Follow-up step: loading into partitions
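A sketch of that follow-up step, assuming the Sqoop import created an unpartitioned employees table (the target table name and date are hypothetical):

```sql
-- Copy imported rows into a date-partitioned table
INSERT OVERWRITE TABLE employees_by_day PARTITION (dt='2010-01-01')
SELECT id, firstname, lastname, jobtitle, start_date, dept_id
FROM employees;
```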




Hive Project Status
•  Open source, Apache 2.0 license	
•  Official subproject of Apache Hadoop
•  Current version is 0.5.0	
•  Supports Hadoop 0.17-0.20	




Conclusions

•  Supports rapid iteration of ad-hoc queries
•  High-level interface (HiveQL) to low-level infrastructure (Hadoop)
•  Scales to handle much more data than many similar systems

Hive Resources
                      
Documentation	
•  wiki.apache.org/hadoop/Hive	

Mailing Lists	
•  hive-user@hadoop.apache.org	

IRC	
•  ##hive on Freenode	
Carl Steinbach
carl@cloudera.com