A brief presentation on an end-to-end (E2E) Hadoop and open-source data warehouse and BI stack, combining the power of Hadoop with online dashboards.
2. Background
I have been working and experimenting with Hadoop for some time now. Now that I know and
understand it, I remember how I started out and struggled to find the right direction for designing a
complete Hadoop stack. While Hadoop gives you a lot of flexibility, it also creates confusion,
because there are many directions you can choose and integrate with your core solution. So
here, I present my version of it. This is also my first publication on SlideShare, so apologies if
you find mistakes.
I have used only Apache Hadoop (version 2.6.2) and all open-source projects. To be able
to follow the technical solution, you need to understand Hadoop core concepts, NoSQL,
columnar databases, grid computing, CGI, Python, JSON and JavaScript.
3. Solution: My Hadoop Technology Stack
Data Acquisition
- File-based data using the Hadoop shell
- Web service data using Flume
- Structured data using Sqoop

Data Transformation
- Pig for loading
- Hive for query-based operations on structured data
- HBase for columnar data

Data Exploration & Prediction
- R (though there are challenges in using R with the full capacity of Hadoop)
- Python libraries like NumPy/SciPy for machine learning

Data Lake Output
- MongoDB, with the mongo-hadoop connectors, to store the output from the data transformations
- Pig moves the data from Hive to MongoDB

Data Exploitation
- Python CGI to create a web application
- jQuery to bind the JSON responses with D3
The whole stack/cluster can be built on 64-bit Ubuntu 14.04 and Hadoop 2.
5. Conclusion, Alternatives & Further Reading
I wanted to keep this simple and straightforward, so I have left out the finer details. If you would like
to know more, I would love to explain further with examples.
Spark, coming up in a big way ahead, would definitely give us more options for designing the
stack, especially the SparkR libraries; meanwhile, Spark is already compatible with Python.
Different R–Hadoop adapters are improving, and this will surely get better in the time to come.
The main idea of my solution is that Hive alone cannot suffice for a real-time dashboard, and
hence the output of the data processing and exploration needs to be taken out of Hadoop.
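A minimal sketch of that hand-off point: a Python CGI script that would shape pre-aggregated MongoDB documents into the JSON consumed by the jQuery/D3 front end. The database, collection and field names here are assumptions for illustration, not part of the original stack description.

```python
#!/usr/bin/env python
"""Sketch of the Data Exploitation layer: a CGI script that reads
pre-aggregated results out of MongoDB and emits JSON for the dashboard.
The collection and field names are hypothetical."""
import json


def build_payload(documents):
    # Shape the MongoDB documents into the flat rows the D3 charts expect.
    return json.dumps({"rows": [{"label": d["label"], "value": d["value"]}
                                for d in documents]})


if __name__ == "__main__":
    # In the live stack, `documents` would come from something like:
    #   pymongo.MongoClient()["dwh"]["daily_summary"].find()
    documents = [{"label": "2015-11-01", "value": 42},
                 {"label": "2015-11-02", "value": 57}]
    print("Content-Type: application/json")
    print("")
    print(build_payload(documents))
```

The jQuery side then only has to fetch this URL and hand `rows` to D3.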
The reason for using MongoDB in my stack is its flexibility, and the fact that MongoDB offers an
automatic REST interface over all its collections (when mongod is started with the --rest option).
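That simple REST interface serves each collection at a predictable URL on MongoDB's HTTP status port (28017). A small helper to build such URLs; the database and collection names in the example are hypothetical.

```python
# Build the URL at which MongoDB's simple REST interface (mongod --rest)
# exposes a collection: http://<host>:28017/<db>/<collection>/?limit=N
# The database/collection names used below are hypothetical.

def mongo_rest_url(host, db, collection, limit=10):
    return "http://%s:28017/%s/%s/?limit=%d" % (host, db, collection, limit)


if __name__ == "__main__":
    print(mongo_rest_url("localhost", "dwh", "daily_summary"))
    # http://localhost:28017/dwh/daily_summary/?limit=10
```

This is what lets the dashboard read collections without writing any server-side query code at all.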