A brief presentation on an end-to-end (E2E) Hadoop and open-source data warehouse and BI stack, combining the power of Hadoop with online dashboards.
2. Background
I have been working and experimenting with Hadoop for some time now. Now that I know and
understand it, I remember how I started out and struggled to find the right direction for designing a
complete Hadoop stack. While Hadoop gives you a lot of flexibility, it also creates confusion,
because there are many directions you can choose and integrate with your core solution. So
here, I present my version of it. This is also my first publication on SlideShare, so apologies if
you find mistakes.
I have used only Apache Hadoop (version 2.6.2) and all open-source projects. To be able
to follow the technical solution, you need to understand Hadoop core concepts, NoSQL,
columnar databases, grid computing, CGI, Python, JSON and JavaScript.
3. Solution: My Hadoop Technology Stack
Data Acquisition
- File-based data using the Hadoop shell
- Web service data using Flume
- Structured data using Sqoop

Data Transformation
- Pig for loading
- Hive for query-based operations on structured data
- HBase for columnar data

Data Exploration & Prediction
- R (though there are challenges in using R with the full capacity of Hadoop)
- Python libraries like NumPy/SciPy for machine learning

Data Lake Output
- MongoDB, with the mongo-hadoop connectors, to store the output from the data transformations
- Pig moves the data from Hive to MongoDB

Data Exploitation
- Python CGI to create a web application
- jQuery to bind the JSON responses with D3
The whole stack/cluster can be built on 64-bit Ubuntu 14.04 and Hadoop 2.
5. Conclusion, Alternatives & Further Reading
I wanted to keep this simple and straightforward, so I have left out the finer details. If you would like
to know more, I would love to explain further with examples.
Spark, coming up in a big way ahead, would definitely give us more options for designing the
stack, especially the SparkR libraries; meanwhile, Spark is already compatible with Python.
Different R–Hadoop adapters are improving, and this will surely get better in the time to come.
The main idea of my solution is that Hive alone cannot suffice for a real-time dashboard, and
hence the output of the data processing and exploration needs to be taken out of Hadoop.
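A minimal sketch of that hand-off point: a Python CGI script that would shape pre-aggregated MongoDB documents into the JSON consumed by the jQuery/D3 front end. The database, collection and field names here are assumptions for illustration, not part of the original stack description.

```python
#!/usr/bin/env python
"""Sketch of the Data Exploitation layer: a CGI script that reads
pre-aggregated results out of MongoDB and emits JSON for the dashboard.
The collection and field names are hypothetical."""
import json


def build_payload(documents):
    # Shape the MongoDB documents into the flat rows the D3 charts expect.
    return json.dumps({"rows": [{"label": d["label"], "value": d["value"]}
                                for d in documents]})


if __name__ == "__main__":
    # In the live stack, `documents` would come from something like:
    #   pymongo.MongoClient()["dwh"]["daily_summary"].find()
    documents = [{"label": "2015-11-01", "value": 42},
                 {"label": "2015-11-02", "value": 57}]
    print("Content-Type: application/json")
    print("")
    print(build_payload(documents))
```

The jQuery side then only has to fetch this URL and hand `rows` to D3.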
The reason for using MongoDB in my stack is its flexibility, and the fact that MongoDB offers an
automatic REST interface over all its collections (when mongod is started with the --rest option).
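That simple REST interface serves each collection at a predictable URL on MongoDB's HTTP status port (28017). A small helper to build such URLs; the database and collection names in the example are hypothetical.

```python
# Build the URL at which MongoDB's simple REST interface (mongod --rest)
# exposes a collection: http://<host>:28017/<db>/<collection>/?limit=N
# The database/collection names used below are hypothetical.

def mongo_rest_url(host, db, collection, limit=10):
    return "http://%s:28017/%s/%s/?limit=%d" % (host, db, collection, limit)


if __name__ == "__main__":
    print(mongo_rest_url("localhost", "dwh", "daily_summary"))
    # http://localhost:28017/dwh/daily_summary/?limit=10
```

This is what lets the dashboard read collections without writing any server-side query code at all.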