Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Eric Roch, Principal &
Ben Hahn, Senior Technical Architect

About Perficient
Perficient is a leading information technology consulting firm serving clients throughout
North America.
We help clients implement business-driven technology solutions that integrate business
processes, improve worker productivity, increase customer loyalty and create a more agile
enterprise to better respond to new business opportunities.

Perficient Profile
• Founded in 1997
• Public, NASDAQ: PRFT
• 2013 revenue $373 million
• Major market locations throughout North America
• Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus,
Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis,
Los Angeles, Minneapolis, New Orleans, New York City,
Northern California, Philadelphia, Southern California,
St. Louis, Toronto and Washington, D.C.
• Global delivery centers in China, Europe and India
• >2,100 colleagues
• Dedicated solution practices
• ~90% repeat business rate
• Alliance partnerships with major technology vendors
• Multiple vendor/industry technology and growth awards

Our Solutions Expertise
BUSINESS SOLUTIONS
Business Intelligence
Business Process Management
Customer Experience and CRM
Enterprise Performance Management
Enterprise Resource Planning
Experience Design (XD)
Management Consulting
TECHNOLOGY SOLUTIONS
Business Integration/SOA
Cloud Services
Commerce
Content Management
Custom Application Development
Education
Information Management
Mobile Platforms
Platform Integration
Portal & Social

Speakers
Eric Roch
Principal
Eric leads Perficient's
national connected solutions
practice
• Includes focus on SOA/integration,
cloud, mobile and Big Data
• Author & industry speaker
• 25+ years of experience in various
aspects of information technology
including:
• Executive-level management
• Enterprise architecture
• Application development
Ben Hahn
Sr. Technical Architect
Ben Hahn is a Sr.
Technical Architect
• Includes focus on transactions, logging &
exception processing
• Author & speaker
• 20+ years of experience in various
aspects of information technology
including:
• Software solutions
• Enterprise infrastructure
• Product management
• Open Source software community
contributor

What is Big Data?
• Often defined as data that exceeds the capacities of
conventional database systems because it’s too large
and moves too fast for those systems to
handle in an architecturally cohesive way. The three V’s
of Big Data are:
• Volume
• Most companies have 100 TB of data
• Facebook ingests 500 TB in a single day
• 40 ZettaBytes (that’s 43 trillion GB) of data by
2020
• Velocity
• NYSE captures 4-5 TB of data in a single day
• A Boeing 737 generates 243 TB in a single flight
• The Google self-driving car generates 750 MB of
data per second!
• Variety
• Twitter, Clickstreams, Audio, Video
• GPS, Sensor data, Facebook content
• Infrastructure and application logs

POLL QUESTION:
What is your current adoption level for big data?
• Evaluation
• Prototype
• Production

But Not Everyone is Google!
Where’s the Big Data coming from?

POLL QUESTION
Have you used open source software for big data solutions?
• Yes
• No

Machine Data is Big Data
Machine Data definitely has the three V’s of Big Data

What Can We Gain From Machine Data?
Valuable information can be mined from
machine data, including:
• Transaction monitoring
• Error detection
• Behavior trends
• Audit logging
• Infrastructure states
• Anomaly detection
• Geospatial analysis
• Network analysis

Log Analysis vs. Business Analytics
• Ingest - Versus ETL
• Big Data - Bidirectional integration with Hadoop
• Query language - MapReduce function on unstructured
data
• Drill anywhere - Investigate on all the data versus a
predefined schema or cube
• Information discovery - Discover relationships based on
patterns in the data
• Ad-hoc versus dimensional - Log analysis is not based on a
predefined structure derived from a point-in-time set of
requirements
• Explicit logging - Versus implicit correlation

Polling Question:
Do you mine machine data for business
insights?
• Yes
• No

Innovations From Cloud and OSS
• Hadoop and MapReduce - Derived from Google's
MapReduce and Google File System
• Storm – Distributed event processor open sourced by
Twitter
• Presto - SQL query engine open sourced by Facebook,
built to work with petabyte-sized data
warehouses
• Google BigQuery - Run SQL-like queries against terabytes
of data in seconds
• Amazon DynamoDB - NoSQL database service to store
and retrieve any amount of data, and serve any level of
request traffic
• Elasticsearch – Distributed, open source full-text search engine

POLLING QUESTION
Do you plan to use cloud based solutions for
big data?
• Yes
• No

Big Data Processing: MapReduce
• 2004 - Google published a paper called MapReduce: Simplified Data
Processing on Large Clusters, describing a model characterized by:
• Map and shuffle key-value pairs, then aggregate/reduce these
intermediate pairs
• Origins in the map and reduce primitives of functional languages
• Massive parallelism and elasticity via commodity hardware
• Fault tolerance via master-worker nodes
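The map, shuffle and reduce steps can be sketched in a few lines of Python. This is a single-process illustration only, not Google's or Hadoop's implementation; the sample log lines and function names are assumptions made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample log lines; the format is an assumption for illustration.
LOG_LINES = [
    "2014-04-01 10:00:01 ERROR payment timeout",
    "2014-04-01 10:00:02 INFO login ok",
    "2014-04-01 10:00:03 ERROR payment declined",
    "2014-04-01 10:00:04 WARN slow response",
]

def map_phase(line):
    """Map: emit one (key, value) pair per line, here (log level, 1)."""
    level = line.split()[2]
    return (level, 1)

def shuffle_phase(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

def count_levels(lines):
    return reduce_phase(shuffle_phase(map(map_phase, lines)))

print(count_levels(LOG_LINES))
```

In a real cluster the map calls and the per-key reduce calls run on different worker nodes; the structure of the computation stays the same.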

What are functional languages?
And MapReduce is Better with
Functional Languages
• Based on Lambda (λ) calculus
• ALL computational functions and data can be expressed as
a series of functions and predicates of functions
• Declarative rather than imperative
• First-class functions – Functions can be passed as arguments
and returned as results, just like values. This also
allows currying and partial functions.
• Call by name – Function expressions are not evaluated
until they are actually used.
• Recursion – Functions can call themselves, potentially in an
endless loop.
• Immutable state and values – Pure functional programming
does not use mutable variables but rather immutable values as
they appear at a moment in time. This has big effects on
scalability and concurrency.
• Referential transparency – A function call can be replaced by
its value because functions have no side effects.
• Pattern matching – Data type matching as well as data
structure composition and deep object type matching
• Examples: Erlang, Haskell, Lisp, Clojure, Scala
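Several of these ideas can be approximated in Python, which is not a pure functional language. The sketch below is illustrative only: the Event type, the 500 ms threshold and all function names are assumptions, and a generator stands in for call-by-name evaluation.

```python
from functools import partial
from typing import Iterator, NamedTuple

class Event(NamedTuple):
    """Immutable value: fields cannot be reassigned after creation."""
    level: str
    millis: int

def compose(f, g):
    """First-class functions: takes two functions, returns a new one."""
    return lambda x: g(f(x))

def slower_than(threshold: int, event: Event) -> bool:
    return event.millis > threshold

# Partial application: fix the threshold to get a one-argument predicate.
is_slow = partial(slower_than, 500)

# Composition: extract the level, then lower-case it.
describe = compose(lambda e: e.level, str.lower)

def lazy_events() -> Iterator[Event]:
    """Deferred evaluation: each event is produced only when requested."""
    yield Event("INFO", 120)
    yield Event("ERROR", 900)
    yield Event("WARN", 650)

def total_millis(events: tuple) -> int:
    """Recursion: the function is defined in terms of itself."""
    if not events:
        return 0
    head, *tail = events
    return head.millis + total_millis(tuple(tail))

slow = tuple(filter(is_slow, lazy_events()))
print([describe(e) for e in slow])
print(total_millis(slow))
```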

Evolution (or Devolution?) of Databases
Imperative Model: Pascal, C, Basic, etc.

Object Oriented Programming Model: Java,
C++, C#.

Functional Programming Model:
Scala, Clojure, F#
• Because commodity hardware in the cloud is highly
elastic, the resources needed to query and run transactions
can be scaled in response to the data volumes at the
store level.
• Data is stored using the functional programming concept of
immutability by only appending data as point-in-time
values.
• MapReduce functions can be balanced and distributed
across machines as nodes fail or new nodes are added.
• First-class functions and call by name allow functions and
lambda expressions to be passed into MapReduce calls
as arguments, so ad-hoc functionality can be added.
• Pattern matching allows very complex pattern matches
on complex structures like XML.
• Transactions use functional expressions like compare-
and-swap operations to ensure ACID properties.
• SQL or query expressions can be reduced to
MapReduce functions or lambda expressions and/or
patterns and distributed in parallel across the nodes.
• Using recursion, complex structures like XML can be
mapped and reduced from a single expression.
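The append-only, point-in-time storage and compare-and-swap ideas can be sketched as a small in-memory store. This is a hypothetical illustration, not the design of any particular database; the class name, the versioning scheme and the use of a lock are all assumptions.

```python
import threading
from typing import Any, List, Optional, Tuple

class AppendOnlyStore:
    """Hypothetical store: values are never overwritten, only appended."""

    def __init__(self) -> None:
        self._log: List[Tuple[int, str, Any]] = []  # (version, key, value)
        self._lock = threading.Lock()

    def latest(self, key: str) -> Optional[Tuple[int, Any]]:
        """Return (version, value) of the newest entry for key, if any."""
        for version, k, value in reversed(self._log):
            if k == key:
                return version, value
        return None

    def as_of(self, key: str, version: int) -> Any:
        """Point-in-time read: the value key had at the given version."""
        for v, k, value in reversed(self._log):
            if k == key and v <= version:
                return value
        return None

    def compare_and_swap(self, key: str, expected_version: int,
                         new_value: Any) -> bool:
        """Append new_value only if key is still at expected_version.

        A version of 0 means "key has never been written".
        """
        with self._lock:
            current = self.latest(key)
            current_version = current[0] if current else 0
            if current_version != expected_version:
                return False  # someone else wrote first; caller must retry
            self._log.append((len(self._log) + 1, key, new_value))
            return True

store = AppendOnlyStore()
store.compare_and_swap("order-42", 0, "received")
store.compare_and_swap("order-42", 1, "shipped")
print(store.latest("order-42"), store.as_of("order-42", 1))
```

Because old entries are never modified, readers can run without locks against any past version while writers append new ones.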

MapReduce Machine Data:
What Do We Need?
• A dynamic process for parsing
and mapping unstructured data
to structured data in real-time
• Wide range of data formats
(text, XML, JSON, CSV, EDI,
etc.)
• Need intelligent pattern
matching capabilities
• Ability to correlate meaningful
transactional data and metrics
from disparate data (reducing)
• Machine data is static and
immutable. Append-only fast
writes with eventual
consistency are ideal
• Need fast filter, search, query
capabilities to display results

Open Source Big Data Landscape
Source: www.bigdata-startup.com

Apache Hadoop: The Elephant in the Room
• What about Apache Hadoop?
• Apache Hadoop comprises HDFS and
Hadoop MapReduce, both based on Google’s GFS
and MapReduce
• It runs batch-oriented MapReduce jobs through
Schedulers and JobTrackers
• Machine data requires real-time MapReduce processes
• We need to index, query and search data in real time
with a well-defined interface
• We can use Hadoop for secondary storage of long-term
persistent logs – Lambda Architecture (Batch vs
Speed Layer)
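The batch versus speed layer split can be sketched as two views that are merged at query time. This is a toy, single-process illustration of the Lambda Architecture idea, with hypothetical names; in practice the batch view would come from a Hadoop job and the speed layer from a stream processor.

```python
from collections import Counter

def batch_view(archived_events):
    """Batch layer: recompute totals from the full, immutable archive."""
    return Counter(event["level"] for event in archived_events)

class SpeedLayer:
    """Speed layer: counts events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["level"]] += 1

def query(batch, speed, level):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[level] + speed.counts[level]

# Hypothetical data: three archived events, two recent ones.
archive = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "ERROR"}]
batch = batch_view(archive)

speed = SpeedLayer()
speed.on_event({"level": "ERROR"})
speed.on_event({"level": "WARN"})

print(query(batch, speed, "ERROR"))
```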

Apache Storm: Use Real-time
MapReduce for Machine Data Streams
• Developed by BackType, which was acquired by Twitter
• Distributed computational framework that allows real-
time MapReduce functionality on any data
stream using the concepts of Spouts and Bolts
• Read from any data stream using Spouts (Kafka,
JMS, HTTP, etc.)
• Transactional and guaranteed message processing
• Parallelism and scalability
• Fault tolerance (Master-Worker for MapReduce)
• MapReduce topologies
• Offers real-time MapReduce jobs (or Bolts)
• Other tools: Apache Spark
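The spout and bolt concept can be imitated with plain Python generators. This sketch does not use the Storm API and runs in one process; the function names and the log format are assumptions, and it only shows how tuples flow from a source through chained processing steps.

```python
from collections import Counter

def log_spout(lines):
    """Spout stand-in: turns a data source into a stream of tuples."""
    for line in lines:
        yield (line,)

def parse_bolt(stream):
    """Bolt stand-in (map step): split each line into (level, message)."""
    for (line,) in stream:
        parts = line.split(" ", 1)
        if len(parts) == 2:  # silently drop malformed lines
            yield (parts[0], parts[1])

def count_bolt(stream):
    """Bolt stand-in (reduce step): emit a running count per level."""
    counts = Counter()
    for level, _message in stream:
        counts[level] += 1
        yield (level, counts[level])

def run_topology(lines):
    """Wire spout -> parse bolt -> count bolt and keep the latest counts."""
    latest = {}
    for level, count in count_bolt(parse_bolt(log_spout(lines))):
        latest[level] = count
    return latest

print(run_topology(["ERROR disk full", "INFO started", "ERROR disk full"]))
```

Because each stage consumes its input one tuple at a time, results are updated as events arrive rather than after a batch completes.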

Apache Storm: Use Real-time
MapReduce for Machine Data Streams
Figure: MapReduce within Storm, with the declarative style and simplicity of
functional languages

Elasticsearch: Distributed
Document Search
• Distributed search server built on Apache Lucene
• It’s a schema-less document store using JSON as its
document format. New fields can be added dynamically.
All fields are indexed by default
• Uses index shards to distribute queries and searches
across clusters. Queries and searches are run in parallel
• A cluster can host multiple indexes, which can be queried as
a group or singly. Index aliases allow indexes to be
added or dropped dynamically
• Append-only model using versioning. Writes are very fast
depending on the wait model (wait for all shards to be written,
or a quorum, or none)
• Well-defined RESTful API interface. Very powerful query
features
• Other tools: Apache Solr
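How shards let a search run in parallel can be sketched without a real cluster. The Python below is a toy model, not Elasticsearch code: documents are routed to a shard by hash modulo the shard count (Elasticsearch documents a similar routing rule, but uses its own hash function, so crc32 here is a stand-in), and a query is fanned out to every shard and the partial results merged. The shard count and document fields are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # assumed number of primary shards

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard: hash(id) modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def index_documents(docs):
    """Distribute JSON-like documents across the shards."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for doc in docs:
        shards[shard_for(doc["id"])].append(doc)
    return shards

def search_shard(shard, term):
    """Search a single shard for documents containing the term."""
    return [doc for doc in shard if term in doc["message"].lower().split()]

def search(shards, term):
    """Scatter the query to all shards in parallel, then gather the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    hits = [doc for partial in partials for doc in partial]
    return sorted(hits, key=lambda doc: doc["id"])

DOCS = [
    {"id": "1", "message": "Payment error for order 17"},
    {"id": "2", "message": "User login ok"},
    {"id": "3", "message": "Disk error on node 4"},
    {"id": "4", "message": "Order shipped"},
]

shards = index_documents(DOCS)
print([doc["id"] for doc in search(shards, "error")])
```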

Elasticsearch: Distributed
Document Search
Figure: Elasticsearch distributed queries and searches using index shards and replicas

A Really Cool UI to Show This Off
• Kibana – Works seamlessly with Elasticsearch, queries Elasticsearch
directly from JavaScript
• Everything is user driven, with very little coding except some configuration
settings in YAML
• Very dynamic screen interface
• Screen layout, queries, filters, graphs, histograms are saved directly to
Elasticsearch
• Great design and user interface

As a reminder, please submit your
questions in the chat box
We will get to as many as possible!
4/1/2014

Daily unique content
about content
management, user
experience, portals
and other enterprise
information technology
solutions across a
variety of industries.
Perficient.com/SocialMedia
Facebook.com/Perficient
Twitter.com/Perficient

Thank you for your participation today.
Please fill out the survey at the close of this session.
