Though insights from Big Data enable better business decisions, Big Data poses its own set of challenges. This paper addresses the gap around the Variety problem and suggests a way to handle data processing seamlessly even when the data type or processing algorithm changes. It explores the various Map Reduce design patterns and arrives at a unified working solution (a library). The library has the potential to ‘adapt’ itself to any data processing need that can be met with Map Reduce, saving many man-hours and enforcing good practices in code.
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
WHITE PAPER
www.hcltech.com
Adaptive Map Reduce
TABLE OF CONTENTS
Abstract
Abbreviations
Market Trends and Challenges
Solution
Case Study
Revenue Benchmarking
MR Latency Benchmarking
Word Count with Combiner
Word Count without Combiner
Best Practices
Conclusion
Reference
Author Info
Abstract
This paper explores the various Map Reduce design patterns and arrives at a unified working solution (a library). The library has the potential to ‘adapt’ itself to any data processing need that can be met with Map Reduce. This would not only save HCL and its clients many man-hours but also enforce the ‘good practices’ of Map Reduce design patterns in the code by default. HCL Technologies has been working actively with multiple clients for the last couple of years in verticals such as ISPs, Aero, Banking & Finance, and Media & Entertainment, delivering services and solutions in the Big Data/Data Analytics domain. One of the fundamental problems that all of these leading companies brought to HCLT was processing big data that arrives in different formats, is spread across multiple sources, and has one or more correlating mapping parameters. This fuels the need for a unified library that can act as a bridge for solving these varied cross-domain problems while applying the good practices of Map Reduce.
Market Trends and Challenges
Hadoop efficiently solved the Volume and Velocity problems of Big Data; however, there is a gap that calls for a solution which uses existing frameworks to solve the Variety problem. Solving the third V (Variety) boils down to handling data processing seamlessly even when the data type or processing algorithm changes. Clients generally come to us with ad-hoc data source, processing, and mapping problems, and we have to implement the appropriate MR programs. However, because the problems and data sources are treated in isolation, one-off programs are written, resulting in redundant effort within and across teams and projects. Most of the time, clients initially lack clear visibility of the full requirements and may ask midway to include another data source. In most cases this involves a lot of rework, which means a scope change from a project management perspective, and clients generally do not want to reschedule much.
The project we are currently implementing for the largest aerospace company is a pre-production application that will expand into a full production environment in the near future. We currently have visibility into only 3 data sources, and in production the number of data sources will be at least 5 times higher. The client has asked us to deliver this with minimal code changes and no change at all in the architecture. This challenge is in line with the problem described above.
Fig. 1. MPP report highlighting the effort in man-days
ID | Task Name | Duration | Start | Finish | Predecessors
49 | Data processor/MR job for | 5 days | Tue 11/25/14 | Mon 12/1/14 | 33
50 | Data processor/MR job for | 5 days | Tue 11/25/14 | Mon 12/1/14 | 33
51 | Data processor/MR job for | 5 days | Tue 12/2/14 | Mon 12/8/14 | 49
52 | Unit test with representative data | 5 days | Tue 12/9/14 | Mon 12/15/14 | 51
53 | Report & Dashboard development | 35 days | Mon 11/3/14 | Fri 12/19/14 |
54 | Tool evaluation for reports and dashboard | 5 days | Mon 11/3/14 | Fri 11/7/14 |
55 | Develop reports (3 reports) | 15 days | Mon 11/10/14 | Fri 11/28/14 | 54
56 | Develop dashboard | 10 days | Mon 12/1/14 | Fri 12/12/14 | 55
Abbreviations
Sl. No. | Acronym | Full form
1 | AMR | Adaptive Map Reduce
As Fig. 1 shows, supporting each data processing algorithm requires about 5 man-days for development alone. With AMR, the need for such cycles can be eliminated.
Like any programming paradigm, MR has its own set of design patterns. Design patterns are generally based on ‘good practices’ that evolve out of years of research and implementation in the industry. Currently, MR programs are not always written using these patterns, although a considerable improvement in performance has been observed when patterns are used. By introducing a library/framework, we would make projects follow the good practices of MR. This would also enable projects to map their processing logic to a pattern quickly, without much research, and would ease the development effort considerably.
The HCLT Analytics group has many customizable off-the-shelf solutions for Data Ingestion, Data Persistence, and Multi-Tenancy; however, we do not have a framework/library for the core data processing in Hadoop.
Fig. 2 depicts the fact that the degree to which software is customized plays an important role in project acquisitions. Hence a highly customizable solution for the Big Data processing module can add great value to HCLT as a company. It will enable us to pursue project acquisitions with overall solutions for every aspect of Data Analytics.
Fig. 2. Major variables affecting software acquisition
(a) Degree to which acquired software is customized: entirely off-the-shelf software; off-the-shelf software, partly customized; entirely custom software
(b) Scale of acquisition, or degree to which the overall acquisition is acquired as separate components: full system; several components; single component
Solution
We decided to approach this problem by first analysing the Map Reduce design patterns. There are 23 patterns as of now, grouped into six categories:
Summarization: Numerical Summarization, Inverted Index Summarization, Counting with Counters
Filtering: Filtering, Bloom Filtering, Top Ten Items, Distinct
Data Organization: Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling
Join Patterns: Reduce Side Join, Replicated Join, Composite Join, Cartesian Products
Meta Patterns: Job Chaining, Chain Folding, Job Merging
Input and Output: Generating Data, External Source Output, External Source Input, Partition Pruning
The idea was to identify the commonality across these patterns and to understand the level of dependency among the implementation details of each pattern. We found that each pattern requires at least the following:
Input and Output Paths: Which dataset should be processed? Where should the output be written?
Class of Action: What kind of processing is required, for example Filtering or Aggregation?
Processing Details: Which set of fields is required? How should they be processed?
Input and Output Data Types: What is to be processed?
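The four requirements listed above can be captured in a single job description. The following is a minimal sketch in Python, not the AMR implementation: the `JobSpec` class, its field names, and the `KNOWN_ACTIONS` set are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical set of action classes; the text names Filtering and Aggregation as examples.
KNOWN_ACTIONS = {"filtering", "aggregation"}


@dataclass(frozen=True)
class JobSpec:
    """Everything a pattern needs before an MR job can be built (illustrative only)."""
    input_path: str            # which dataset to process
    output_path: str           # where the output should be written
    action: str                # class of action, e.g. "filtering"
    fields: List[str] = field(default_factory=list)  # processing details: fields to use
    input_type: str = "text"   # input data type
    output_type: str = "text"  # output data type

    def validate(self) -> None:
        """Reject a description that is missing a path or names an unknown action."""
        if not self.input_path or not self.output_path:
            raise ValueError("input and output paths are required")
        if self.action not in KNOWN_ACTIONS:
            raise ValueError(f"unknown class of action: {self.action}")


spec = JobSpec("/data/in", "/data/out", "filtering", ["user_id", "status"])
spec.validate()  # passes silently for a complete, known description
```

Keeping all four items in one immutable object means a job can be checked for completeness before any cluster resources are used.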
The question we asked ourselves was how to create a library/framework that can instantiate the MR Job objects required to serve any MR pattern. The well-established ‘Factory’ design pattern was used for this purpose.
As the Factory Pattern diagram depicts, different shapes are created from ‘Concrete Classes’: the Factory is passed the information needed to create an object, it instantiates the concrete class that matches that information, and a shape object is created.
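The same Factory idea, applied to MR jobs instead of shapes, might look like the sketch below. It is written in plain Python with no Hadoop dependency, and every name in it (`MRJob`, `FilteringJob`, `DistinctJob`, `MRJobFactory`) is a hypothetical stand-in: the actual AMR class design is not disclosed in this paper.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class MRJob(ABC):
    """Common interface for every concrete job class (hypothetical)."""

    def __init__(self, input_path: str, output_path: str) -> None:
        self.input_path = input_path
        self.output_path = output_path

    @abstractmethod
    def run(self, records: List[str]) -> List[str]:
        """Process records in memory; a real job would be submitted to the cluster."""


class FilteringJob(MRJob):
    def run(self, records: List[str]) -> List[str]:
        # Keep only non-blank records.
        return [r for r in records if r.strip()]


class DistinctJob(MRJob):
    def run(self, records: List[str]) -> List[str]:
        # Drop duplicates while preserving first-seen order.
        return list(dict.fromkeys(records))


class MRJobFactory:
    """Maps a pattern name to the concrete class that implements it."""

    _registry: Dict[str, Type[MRJob]] = {
        "filtering": FilteringJob,
        "distinct": DistinctJob,
    }

    @classmethod
    def create(cls, pattern: str, input_path: str, output_path: str) -> MRJob:
        try:
            job_class = cls._registry[pattern.lower()]
        except KeyError:
            raise ValueError(f"no job class registered for pattern {pattern!r}") from None
        return job_class(input_path, output_path)


job = MRJobFactory.create("Distinct", "/data/in", "/data/out")
print(job.run(["a", "b", "a"]))  # ['a', 'b']
```

The caller names a pattern and never references a concrete class, so supporting a further pattern only requires one new class and one registry entry.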
In AMR we created concrete MR classes for every MR design pattern. The information about which class to instantiate is passed to the Factory through an XML configuration file, as shown in the diagram above. When data comes into the system, the appropriate object is instantiated according to the rules set for the source/algorithm, and the MR Job is started.
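A configuration-driven selection of this kind can be sketched as follows. Since the real configuration file is not published, the XML layout (`<amr>`, `<rule>`, and its attributes), the source names, and the class names below are all assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: one rule per data source, naming the pattern to apply.
CONFIG_XML = """
<amr>
  <rule source="flight_logs" pattern="filtering" input="/in/logs" output="/out/logs"/>
  <rule source="part_catalog" pattern="distinct" input="/in/parts" output="/out/parts"/>
</amr>
"""


class Job:
    """Minimal stand-in for a concrete MR job class."""

    def __init__(self, input_path: str, output_path: str) -> None:
        self.input_path = input_path
        self.output_path = output_path


class FilteringJob(Job):
    pass


class DistinctJob(Job):
    pass


# Pattern name -> concrete class; the factory's only knowledge of implementations.
JOB_CLASSES = {"filtering": FilteringJob, "distinct": DistinctJob}


def load_rules(xml_text: str) -> dict:
    """Parse the configuration into {source name: rule attributes}."""
    root = ET.fromstring(xml_text.strip())
    return {rule.get("source"): dict(rule.attrib) for rule in root.findall("rule")}


def job_for_source(source: str, rules: dict) -> Job:
    """Instantiate the job class that the configuration assigns to a data source."""
    rule = rules.get(source)
    if rule is None:
        raise LookupError(f"no rule configured for source {source!r}")
    job_class = JOB_CLASSES[rule["pattern"]]
    return job_class(rule["input"], rule["output"])


rules = load_rules(CONFIG_XML)
job = job_for_source("part_catalog", rules)
print(type(job).__name__, job.input_path)  # DistinctJob /in/parts
```

In this sketch, onboarding another data source that uses an existing pattern means adding one `<rule>` element, with no change to the code.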
The design is in its nascent stages: although we currently use the Factory pattern, we can gradually evolve towards a Builder pattern when we want finer granularity in the data processing. As of now, the generic version of the library is a work in progress.
* We cannot currently reveal the original class diagrams and full config file details due to an NDA.
Case Study
The quantitative benefits of AMR are mostly measurable; in addition, the framework/library has the potential to win us some project acquisitions. We have not yet taken the solution to our sales teams, who are likely to give us those figures. Through latency and cost benchmarking we can illustrate the measurable parameters as follows:
Revenue Benchmarking
Let us assume an average effort of 5 man-days for onboarding a data source. If the proposed AMR reduces this to 4 man-days (on average) per data source, we can claim a 20% reduction in the development effort to onboard a new data source.
MR Latency Benchmarking
The showcased example is the simplest MR example, Word Count, but the benchmarks clearly highlight the advantages of using a design pattern.
Data Set:
NY Times news articles (Source: ldc.upenn.edu)
Documents = 300000
No. of Words = 102660
Size of Data = 1 GB
Word Count with Combiner
The MR Job with the Combiner takes about 40 min to complete, as evident in the screenshot. The CPU time taken is about 1964120 ms. One can notice that the Combine input/output records counters are non-zero in the screenshot.
Now, as a control measure, we comment out the Combiner class and run the program again.
Word Count without Combiner
The MR Job without the Combiner takes about 42.5 min to complete. The CPU time taken is about 1853760 ms. One can notice that the Combine input/output records counters are 0 in the screenshot.
We can deduce the following from the above:
With the Combiner there is a gain of about 2.5 min in processing latency (40 min versus 42.5 min).
With the Combiner there is an increase of about 6% in CPU time (1964120 ms versus 1853760 ms) and about 2% in physical memory utilization. This shows greater use of the machine resources, which is generally preferable in a distributed environment.
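Why a Combiner helps Word Count can be illustrated with a small in-memory simulation. The sketch below is not the benchmarked Hadoop job: it only mimics the map, combine, and reduce steps in plain Python, and the two sample input splits and all function names are made up for illustration. It counts how many key/value pairs would have to be shuffled to the reducers in each case.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_words(split: str) -> List[Pair]:
    """Map step: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.lower().split()]


def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Combine step: pre-aggregate counts locally, on the mapper's own output."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reduce_counts(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Reduce step: sum all counts that arrive for each word."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(splits: List[str], use_combiner: bool) -> Tuple[Dict[str, int], int]:
    """Return the final counts and the number of pairs sent to the reducers."""
    shuffled: List[Pair] = []
    for split in splits:
        pairs = map_words(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_counts(shuffled), len(shuffled)


splits = ["to be or not to be", "to see or not to see"]
counts_with, shuffled_with = word_count(splits, use_combiner=True)
counts_without, shuffled_without = word_count(splits, use_combiner=False)

print(counts_with == counts_without)    # True
print(shuffled_without, shuffled_with)  # 12 8
```

The result is identical either way because summation is associative and commutative; the Combiner only reduces the volume of intermediate data, at the price of some extra work on the map side.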
Best Practices
We are utilizing the best practices of the industry and bringing them all under one umbrella. This would result in large qualitative benefits in terms of program code and processes.
The quality principle/objective of HCL as an organization is “We shall satisfy our customers by delivering quality products and services that meet their requirements on time, every time”. As a framework, AMR ensures the highest level of quality in the products/services we develop for Big Data processing.
We also believe that “The quality of a product is largely determined by the quality of the process that is used to develop and maintain it”. By introducing AMR we would be able to enforce a standardized MR process across the organization, based on the industry’s best practices in terms of design patterns, thus ensuring the highest level of quality in the process itself.
“On time Delivery, Cost Control, Enhance Customer Satisfaction and Continual Service Improvement” are the key quality objectives of HCLT; AMR would allow us to realize most of these goals effectively. One of the core principles of quality is REUSE, which AMR promotes by reusing MR code.
The tools used for developing the library are free, open-source tools, none of which is proprietary to the client or any other company. However, it may be noted that the AMR concept and the library developed are proprietary to HCLT as a whole.
Conclusion
Key domains where Big Data is in use today include Aero, Auto, Manufacturing, Public Sector, Governance, Health Care, and Media, and the list goes on. Each of these domains has unique processing needs for its data sources and algorithms, which AMR can address. The solution is also domain independent: the only modification required is to the configuration file used to run the program. The solution can be used as-is, as a library, in any scenario where MR is used for processing data.
The solution is not dependent on a particular library version or tool. It can support upgrades or modifications in the supporting libraries as long as there is no major change in the implementation of the Map Reduce algorithms themselves. We are currently using it with Cloudera Hadoop 4/5 releases as well as vanilla Apache Hadoop.
Reference
http://www.byzantinereality.com/2009/4/History-of-MapReduce-Part-2w
http://www.maxwideman.com/papers/acquisition/involve.htm
http://www.slideshare.net/zhengwenshen/20130201-mapreduce-design-patterns
https://qualitydiva.hcl.com/Other_Links/OMS_Overview.ppt
http://www.tutorialspoint.com/design_pattern/factory_pattern.htm
Author Info
Kinnar Kumar Sen
HCL Engineering and R&D Services