2. 2 ScaleOut Software, Inc.
• The Need for Operational Intelligence (OI)
• Operational Intelligence vs. Business Intelligence
• Implementing OI Using In-Memory Computing:
• In-Memory Data Grid
• Data-Parallel Computation: “Parallel Method Invocation”
• Implementing MapReduce Unchanged on an IMDG
• A Detailed Example in Financial Services
• Video Demo
• Examples of Applications in Operational Intelligence
Agenda
3. 3 ScaleOut Software, Inc.
Goal: Provide immediate feedback to a system handling live data.
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
Online Systems Need Operational
Intelligence
4. 4 ScaleOut Software, Inc.
• To keep up with fast-growing “live” workloads & maintain fast response times:
• Ex.: Handle incoming data streams in real time.
• Ex.: Process updates to a data set based on incoming data.
• To identify and respond to trends in fast-changing data:
• Ex.: Evaluate data set changes in real time.
• Ex.: Respond to identified patterns within seconds.
Challenges for Operational Intelligence
[Chart: Growth in Web Servers, in millions, 1995–2010. Source: Netcraft]
[Chart: Growth in “Big Data”, in exabytes, 2000–2013]
“More data has been created in the past three years than in the past 40,000.”
5. 5 ScaleOut Software, Inc.
Big Data Analytics
Real-Time vs. Batch Analytics
Batch (“Business Intelligence”):
Static data sets
Petabytes
Disk storage
Hours to minutes
Best uses:
• Analyzing warehoused data
• Mining for long-term trends
Examples: Hadoop, IBM, Teradata, Oracle, SAP
Real-time (“Operational Intelligence”):
Live data sets
Gigabytes to terabytes
In-memory storage
Minutes to seconds
Best uses:
• Tracking live data
• Immediately identifying trends and capturing opportunities
• Providing immediate feedback
Examples: IMDGs, Spark, Storm, CEP
6. 6 ScaleOut Software, Inc.
• Traditional Hadoop MapReduce
platforms analyze offline data:
• Very large, disk-based datasets
• Data repeatedly copied from disk to
memory.
• Batch-scheduled (multi-tenant)
• IMDGs store and analyze live data:
• Fast-changing, operational data
integrated with live updates
• Data kept memory-resident (data
motion is minimized)
• Inline-scheduled (single tenant)
Design Goals for Hadoop vs. IMDGs
7. 7 ScaleOut Software, Inc.
• Operational intelligence can co-exist with business intelligence:
• Processes streaming data close to its sources.
• Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
• Translates data for storage in the data warehouse (ETL).
• Data warehouse provides “strategic” guidance.
• Using the same tool set (i.e., Hadoop MapReduce) lowers TCO:
• Leverages common skill set.
• Simplifies design (e.g., loading data into HDFS).
Integrated View of Analytics
8. 8 ScaleOut Software, Inc.
• In-memory data
grid (IMDG) holds
active entities
undergoing state
changes in memory.
• IMDG updates entities
with incoming stream
of state changes.
• Backing store
optionally holds large
population of entities.
• Analytics engine
examines entities in
real time and
generates alerts
within seconds as
needed.
In-Memory Architecture for
Operational Intelligence
9. 9 ScaleOut Software, Inc.
In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
• Follows object-oriented view of data
(vs. relational view).
• Stores unstructured collections of
Java/.NET objects.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
In-Memory Data Grid for Live Data
10. 10 ScaleOut Software, Inc.
• IMDG’s collections of objects behave like
in-memory collections:
• Unstructured, typically instances of a class
(stored as serialized blobs)
• Individually accessible / update-able
• IMDG adds attributes:
• Accessible by global key
• Query-able by properties
• Highly available
• Optional timeouts
• Distributed locking
• Integration with a backing store
• Optional dependency relationships
• Asynchronous event handling
IMDG Stores “Live” Data
Basic “CRUD” APIs:
• Create(key, obj, tout)
• Read(key)
• Update(key, obj)
• Delete(key)
and…
• Lock(key)
• Unlock(key)
Object
key
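The CRUD and locking calls listed on this slide can be pictured with a minimal single-process sketch. It is only an illustration of the semantics: the class name `LocalGridSketch` and its method signatures are assumptions made for this example, not the IMDG product's API, and a plain `ConcurrentHashMap` stands in for the distributed, highly available store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-process stand-in for an IMDG object store (not a vendor API).
class LocalGridSketch<V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis; // Long.MAX_VALUE means "no timeout"
        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry<V>> objects = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Create(key, obj, tout): stores the object only if the key is absent.
    // A timeout of zero or less disables expiry.
    boolean create(String key, V obj, long timeoutMillis) {
        purgeIfExpired(key);
        long expires = timeoutMillis <= 0
                ? Long.MAX_VALUE
                : System.currentTimeMillis() + timeoutMillis;
        return objects.putIfAbsent(key, new Entry<>(obj, expires)) == null;
    }

    // Read(key): returns the object, or null if it is missing or has expired.
    V read(String key) {
        purgeIfExpired(key);
        Entry<V> e = objects.get(key);
        return e == null ? null : e.value;
    }

    // Update(key, obj): replaces an existing object and keeps its original expiry.
    boolean update(String key, V obj) {
        purgeIfExpired(key);
        return objects.computeIfPresent(
                key, (k, old) -> new Entry<>(obj, old.expiresAtMillis)) != null;
    }

    // Delete(key): removes the object; returns false if nothing was stored.
    boolean delete(String key) {
        return objects.remove(key) != null;
    }

    // Lock(key) / Unlock(key): a local lock standing in for a distributed lock.
    void lock(String key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    void unlock(String key) {
        ReentrantLock l = locks.get(key);
        if (l != null && l.isHeldByCurrentThread()) {
            l.unlock();
        }
    }

    private void purgeIfExpired(String key) {
        long now = System.currentTimeMillis();
        objects.computeIfPresent(key, (k, e) -> e.expiresAtMillis <= now ? null : e);
    }

    public static void main(String[] args) {
        LocalGridSketch<String> grid = new LocalGridSketch<>();
        grid.create("GOOG", "price=100", 0);
        grid.lock("GOOG");
        grid.update("GOOG", "price=101");
        grid.unlock("GOOG");
        System.out.println(grid.read("GOOG"));
    }
}
```

Wrapping the update in lock/unlock mirrors how an application would guard a read-modify-write on one object; a real grid would enforce this across servers rather than across threads.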
11. 11 ScaleOut Software, Inc.
IMDG Analyzes Live Data
• Integrated execution engine: “Parallel Method Invocation” (PMI)
• Object-oriented version of HPC
data-parallel computing model
• Serves as a platform for implementing
MapReduce and other data-parallel
operators.
• Runs user-defined methods in
parallel across the cluster.
• Globally merges results.
• Benefits:
• Simple, well understood model
• Fast startup time
• Fast global barrier
• Minimum data motion
• Automatic code shipping
Analyze Data
(Eval)
Combine Results
(Merge)
12. 12 ScaleOut Software, Inc.
PMI Enables Linear Speedup
Avoids data motion (network or disk I/O) which limits throughput:
13. 13 ScaleOut Software, Inc.
Spark / Spark Streaming from U.C. Berkeley AMPLab:
• In-memory computing to accelerate and extend
Hadoop MapReduce using data-parallel
operators in Scala.
• Stores data as “resilient
distributed datasets” (RDDs):
• Distributed across cluster
• Immutable
• Hold data from/output to HDFS.
• Store data stream as a sequence of RDDs.
• Comparison to IMDG:
• Not designed for “live” data:
• Lacks CRUD on individual objects.
• Lacks high availability.
• Designed for “data parallel” transformations.
Comparison: IMDGs to Spark
14. 14 ScaleOut Software, Inc.
Run MapReduce as two PMI
phases:
• Data can be input from either the
IMDG or an external data source.
• Works with any input/output format
compatible with the Apache
distribution.
• IMDG uses its data-parallel
execution engine (PMI) to invoke
the mappers and the reducers.
• Eliminates batch scheduling
overhead.
• Intermediate results are stored
within the IMDG.
• Minimizes data motion between the
mappers and reducers.
• Allows optional sorting.
• Output of a single reducer/combiner
optionally can be globally merged.
Implementing MapReduce on IMDG
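The two-phase structure described on this slide can be sketched in plain Java. This is only an in-process illustration under stated assumptions: `TwoPhaseWordCount` is a hypothetical name, Java parallel streams stand in for the grid's PMI engine, and an in-memory map stands in for the intermediate results held in the IMDG. No sorting is performed, matching the slide's note that sorting is optional.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration: word count run as a map phase followed by a reduce phase.
class TwoPhaseWordCount {

    // Phase 1 ("map"): each line is processed independently and emits (word, 1)
    // pairs. The pairs are grouped by key in memory instead of being written to disk.
    static Map<String, List<Integer>> mapPhase(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingByConcurrent(
                        word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));
    }

    // Phase 2 ("reduce"): the values for each key are reduced independently of
    // every other key, so keys can be processed in parallel.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        return grouped.entrySet().parallelStream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do");
        System.out.println(reducePhase(mapPhase(lines)));
    }
}
```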
15. 15 ScaleOut Software, Inc.
// This job will run using the Hadoop
// job tracker:
public static void main(String[] args)
        throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
// This job will run using ScaleOut hServer:
public static void main(String[] args)
        throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}
Configuring Application for the IMDG
• Without YARN, just replace the Hadoop Job class with its HServerJob subclass, a one-line change.
16. 16 ScaleOut Software, Inc.
Running Under YARN
• With YARN, just replace the MapReduce execution framework:
• Example of running MapReduce on
IMDG using Hortonworks YARN:
• YARN directs jobs
to IMDG.
• IMDG accelerates
execution.
$ hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.framework.name=hserver-yarn in out
17. 17 ScaleOut Software, Inc.
• With YARN, the IMDG can run Apache Hive or another Hive distribution unchanged.
• Accelerates queries for datasets hosted
in HDFS or the IMDG.
• Limitation: Intermediate data must fit
within the IMDG.
• Implementation note:
• Hive not thread-safe
• Requires multiple JVMs per server for one
Hive query
• Currently seeing 3X speedup (tuning in
progress)
• More optimizations possible, but…
• Limited by “unchanged” approach
Using YARN to Run Hive on IMDG
18. 18 ScaleOut Software, Inc.
• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the
grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and
returned to the remote workstation.
Running MapReduce on an IMDG
19. 19 ScaleOut Software, Inc.
The invocation grid can be re-used across MapReduce jobs:
Accelerating Start-Up Times
// Configure and load the invocation grid
InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
    // Add JAR files as IG dependencies
    .addJar("main-job.jar").addJar("mylib.jar")
    // Add classes as IG dependencies
    .addClass(MyMap.class).addClass(MyRed.class)
    // Define custom JVM parameters
    .setJVMParameters("-Xms512M -Xmx1024M")
    .load();
// Run 10 jobs on the same invocation grid
for (int i = 0; i < 10; i++) {
    Configuration conf = new Configuration();
    // The preloaded invocation grid is passed to the job.
    Job job = new HServerJob(conf, "Job number " + i, false, grid);
    // ...... Configure the job here .........
    // Run the job
    job.waitForCompletion(true);
}
// Unload the invocation grid when we are done
grid.unload();
20. 20 ScaleOut Software, Inc.
• IMDG adds grid input format for
accessing key/value pairs held in the
IMDG.
• MapReduce programs optionally can
output results to IMDG with grid
output format.
• Grid Record Reader optimizes access
to key/value pairs to eliminate
network overhead.
• Applications can access and update
key/value pairs as operational data
during analysis.
Accessing In-Memory Data
21. 21 ScaleOut Software, Inc.
IMDG needs multiple in-memory
storage models:
• Named cache, optimized for
rich semantics on large
objects:
• Property-based query
• Distributed locking
• Access from remote grids
• Named map, optimized for
efficient storage and bulk
analysis (e.g., MapReduce):
• Highly efficient object storage
• Pipelined, bulk-access
mechanisms
• Follows Java Named Map
semantics.
Optimizing In-Memory Storage for M/R
22. 22 ScaleOut Software, Inc.
In-Memory Named Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on kvps.
• Automatically organizes chunks into
splits.
• Uses per-split hash table to access
keys and manage multi-valued
keys.
• Stores shuffled data set between
mappers and reducers.
• Pipelines chunks to mappers and
from reducers.
• Optionally uses memory mapped
files to reduce access latency.
• Provides support for sorting keys.
Named Map Optimizations
23. 23 ScaleOut Software, Inc.
• IMDG adds Dataset Record Reader (wrapper) to cache HDFS data
during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent runs.
• Dataset Record Reader
stores and retrieves data
with minimum network
and memory overheads.
• Tests with the TeraSort benchmark have demonstrated 11X lower access latency than HDFS without the IMDG.
Optional Caching of HDFS Data
24. 24 ScaleOut Software, Inc.
• IMDG caches “chunks” of key/value pairs instead of HDFS records.
• Serves key/value pairs directly to mappers on cache access.
• Avoids overhead of reparsing records.
Details of HDFS Caching
“Record” Phase “Playback” Phase
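The record/playback idea above can be sketched as a small wrapper that parses raw records once and afterwards serves the already-parsed key/value pairs. Everything here is an assumption made for illustration: the class `RecordPlaybackSketch`, the tab-separated `key<TAB>count` record layout, and the in-process `HashMap` that stands in for the IMDG. It is not the Dataset Record Reader itself.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration of record/playback caching (single-threaded sketch).
class RecordPlaybackSketch {
    // One cached chunk of parsed key/value pairs per input split.
    private final Map<String, List<Map.Entry<String, Integer>>> chunksBySplit =
            new HashMap<>();
    private int recordsParsed = 0;

    // "Record" phase: on the first access to a split, read and parse the raw
    // records and keep the resulting pairs.
    // "Playback" phase: on later accesses, return the cached pairs without
    // touching the raw source or reparsing anything.
    List<Map.Entry<String, Integer>> read(String splitId,
                                          Supplier<List<String>> rawSource) {
        List<Map.Entry<String, Integer>> cached = chunksBySplit.get(splitId);
        if (cached != null) {
            return cached;
        }
        List<Map.Entry<String, Integer>> chunk = new ArrayList<>();
        for (String record : rawSource.get()) {
            String[] fields = record.split("\t", 2); // assumed layout: key<TAB>count
            chunk.add(new AbstractMap.SimpleImmutableEntry<>(
                    fields[0], Integer.parseInt(fields[1].trim())));
            recordsParsed++;
        }
        List<Map.Entry<String, Integer>> frozen = Collections.unmodifiableList(chunk);
        chunksBySplit.put(splitId, frozen);
        return frozen;
    }

    int recordsParsed() {
        return recordsParsed;
    }

    public static void main(String[] args) {
        RecordPlaybackSketch cache = new RecordPlaybackSketch();
        Supplier<List<String>> source = () -> Arrays.asList("apple\t3", "pear\t5");
        cache.read("split-0", source); // record phase: parses two records
        cache.read("split-0", source); // playback phase: parses nothing
        System.out.println("Records parsed: " + cache.recordsParsed());
    }
}
```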
25. 25 ScaleOut Software, Inc.
• Measured performance:
• Startup times reduced to a few milliseconds
• Word count benchmark shows 20X speedup.
• Real-world example shows >40X speedup.
• MapReduce optimizations:
• Optional sorting
• Optional multicast of parameters to mappers
• Optional O(logN) global combining (avoids
single reducer)
• Optional HDFS caching
• Optional reuse of JVMs across jobs
• Current limitations:
• No specific security for multi-tenancy
• Intermediate data must fit in the IMDG
Performance & Optimizations
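The "O(logN) global combining" item can be illustrated with a pairwise tree merge: in each round, neighboring partial results are merged two at a time, so N partial results need about log2(N) rounds instead of N-1 sequential merges in a single reducer. The sketch below is an assumption-laden illustration (`TreeCombineSketch` and `Result` are hypothetical names, and the rounds run sequentially here, whereas on a cluster the merges within one round could run in parallel).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical illustration of logarithmic-depth pairwise combining.
class TreeCombineSketch {

    static final class Result<T> {
        final T value;    // the single combined result
        final int rounds; // number of merge rounds that were needed
        Result(T value, int rounds) {
            this.value = value;
            this.rounds = rounds;
        }
    }

    static <T> Result<T> combine(List<T> partials, BinaryOperator<T> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("at least one partial result is required");
        }
        List<T> current = new ArrayList<>(partials);
        int rounds = 0;
        while (current.size() > 1) {
            List<T> next = new ArrayList<>((current.size() + 1) / 2);
            for (int i = 0; i < current.size(); i += 2) {
                if (i + 1 < current.size()) {
                    // Merge a pair of neighbors.
                    next.add(merge.apply(current.get(i), current.get(i + 1)));
                } else {
                    // An unpaired last element moves on to the next round unchanged.
                    next.add(current.get(i));
                }
            }
            current = next;
            rounds++;
        }
        return new Result<>(current.get(0), rounds);
    }

    public static void main(String[] args) {
        Result<Integer> r = combine(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), Integer::sum);
        System.out.println("sum=" + r.value + ", rounds=" + r.rounds);
    }
}
```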
26. 26 ScaleOut Software, Inc.
In-Memory MapReduce:
• Enables use of Hadoop MapReduce for operational intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard
Hadoop distributions:
• Batch scheduling
• Disk access
• Data shuffling
• Mandatory key sorting
• Avoids vendor-specific APIs:
• Leverages Hadoop skill sets.
Summary of Benefits
27. 27 ScaleOut Software, Inc.
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated data-parallel
analysis on hedging
strategies and alerts
traders in real time.
• IMDG automatically and dynamically
scales its throughput to handle new
hedging strategies by adding servers.
• Measured >40X speedup over Apache Hadoop 1.2.
Example in Financial Services
28. 28 ScaleOut Software, Inc.
The Challenge: Quickly evaluate and respond to sub-second
market changes:
• Hedge fund tracks a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• Challenge: The hedge fund must be able to quickly update and
analyze its hedging strategies and provide alerts to traders.
Demo of the Finserv Application
29. 29 ScaleOut Software, Inc.
• Delivers a stream of alerts to traders
within a few seconds.
• Enables the trader to examine strategy details in real time:
Output: Real-Time Alerts
31. 31 ScaleOut Software, Inc.
• Measured a similar financial services application (back testing stock
trading strategies on stock histories)
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock
history data in memory
• IMDG handled a continuous stream of updates (1.1 GB/s)
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
Example of Performance Scaling
32. 32 ScaleOut Software, Inc.
Fast map/reduce reconciles inventory and order systems for an
online retailer:
• Challenge: Inventory and online
order management are handled
by different applications.
• Reconciled once per day.
• Inaccurate orders reduce margins.
• Solution:
• Host SKUs in IMDG updated in real
time by order & inventory systems.
• Use MapReduce to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.
Example in Ecommerce: Inventory
Management
33. 33 ScaleOut Software, Inc.
• IMDG holds customer
information for active
Web users.
• IMDG saves/retrieves
customer information
from backing store.
• Web browsers send
activity information to
analytics engine.
• IMDG updates customer history and
preferences.
• Analytics engine identifies browsing and
buying patterns.
• Analytics engine makes suggestions in real time and sends email follow-ups.
Example: Web Shopping
34. 34 ScaleOut Software, Inc.
• Track
connectivity
issues.
• Obtain time-
sensitive
business data.
• Offer enhanced
services.
• Increase
security.
Example: Telecommunications
Optimize Operations
Customer Experience
Historical queries
for real-time data
enrichment
Stream
persistence for
future analysis
Network
Elements
35. 35 ScaleOut Software, Inc.
• Online systems need operational
intelligence on “live” data for
immediate feedback.
• Operational intelligence can be
implemented using Hadoop
MapReduce unchanged.
• In-memory data grid provides
an excellent platform for MR-
based operational intelligence:
• Hosts and updates “live” data.
• Implements high availability.
• Offers fast MapReduce execution
for immediate results.
• Leverages Hadoop skill sets.
Recap
37. 37 ScaleOut Software, Inc.
• Storm implements pipelined, task-parallel execution by “bolts” on
incoming data streams.
• Streams can be distributed to bolts with configurable mappings.
• Developer controls the number of tasks per bolt.
• Storm uses a centralized master node and ZooKeeper for fault tolerance.
• Key strength: continuous
processing of input
streams
• Issues:
• Complexity / tuning
• Minimizing data motion
• Managing global state
Comparison to Storm
38. 38 ScaleOut Software, Inc.
• Create method to analyze a queried stock object and another
method to pair-wise merge the results:
Java Example: Parallel Method Invocation
public class StockAnalysis implements
        Invokable<Stock, StockCalcParams, Double>
{
    public Double eval(Stock stock, StockCalcParams param)
            throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }
    public Double merge(Double first, Double second)
            throws InvokeException {
        return first + second;
    }
}
39. 39 ScaleOut Software, Inc.
• Run a parallel method invocation on a queried set of objects:
Java Example: Parallel Method Invocation
NamedCache cache = CacheFactory.getCache("Stocks");
InvokeResult valueOfSelectedStocks =
    cache.invoke(
        StockAnalysis.class,
        Stock.class,
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
        new StockCalcParams());
System.out.println("The value of selected stocks is " +
    valueOfSelectedStocks.getResult());
40. 40 ScaleOut Software, Inc.
• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid
servers and cores.
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
PMI: Running the Analysis
41. 41 ScaleOut Software, Inc.
• The IMDG automatically merges all analysis results.
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
invoking application as
one object.
PMI: Merging the Results
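The two-level merge described on this slide can be sketched without any grid at all. In this hypothetical, in-process illustration, a list of lists stands in for the objects stored on each grid server, and the `EvalMerge` interface only mirrors the eval/merge shape of the `Invokable` example shown earlier; neither is the product's API. Each "server" evaluates and merges its own objects first, and only the per-server results are merged globally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical illustration of eval + local merge + global merge.
class PmiMergeSketch {

    // Mirrors the eval/merge shape of the slide's example (assumed, simplified).
    interface EvalMerge<O, P, R> {
        R eval(O obj, P param);
        R merge(R first, R second);
    }

    static final class Stock {
        final String ticker;
        final double price;
        final long totalShares;
        Stock(String ticker, double price, long totalShares) {
            this.ticker = ticker;
            this.price = price;
            this.totalShares = totalShares;
        }
    }

    // Computes the market value of one stock and sums values pairwise.
    static final class StockValue implements EvalMerge<Stock, Void, Double> {
        public Double eval(Stock stock, Void param) {
            return stock.price * stock.totalShares;
        }
        public Double merge(Double first, Double second) {
            return first + second;
        }
    }

    static <O, P, R> Optional<R> invoke(List<List<O>> servers,
                                        EvalMerge<O, P, R> op, P param) {
        // Step 1: every "server" evaluates its local objects and merges them
        // locally; the servers are processed in parallel.
        List<R> perServer = servers.parallelStream()
                .map(local -> local.stream()
                        .map(o -> op.eval(o, param))
                        .reduce(op::merge))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
        // Step 2: merge the per-server results into one combined result.
        return perServer.stream().reduce(op::merge);
    }

    public static void main(String[] args) {
        List<List<Stock>> servers = Arrays.asList(
                Arrays.asList(new Stock("GOOG", 10.0, 2), new Stock("ORCL", 5.0, 4)),
                Arrays.asList(new Stock("IBM", 2.5, 4)));
        System.out.println("Total value: "
                + invoke(servers, new StockValue(), null).orElse(0.0));
    }
}
```

Because only one value per server crosses the "network" in step 2, the sketch also shows why merging locally first keeps data motion small.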