1. © Cloudera, Inc. All rights reserved.
Is your Big Data journey stalling?
Take the Leap with Capgemini and Cloudera
Industrializing your transition to the Modern Data Landscape
2.
Speakers
• Andrea Capodicasa, Senior Solution Architect, Insights & Data
• Goutham Belliappa, Big Data practice leader, Insights & Data
• Alex Gutow, Senior Manager, Product Marketing
3.
Agenda
• The Case for Change
• Industrializing the Change
• Adoption
• Q&A
4.
Capgemini Insights & Data Global Practice
Global reach with over 13,000 professionals across 40+ countries, with over 500 Big Data & Data Science professionals, including 100+ Hadoop certified consultants
• We employ >13,000 information management specialist practitioners, deployed across Capgemini’s global network
• We were recognised again by Gartner as one of the 4 leading information service providers globally
• Capgemini Insights & Data Global Practice since 2015, delivering business & IT insights and data services
• Capgemini has a global reach and local presence in 44 countries and over 100 languages
[Map: Capgemini global presence by region (Western Europe, Eastern Europe, Middle East & Africa, Latin America, North America, Asia Pacific, India), with Centers of Excellence in Mumbai and Bangalore. Countries labelled: Canada, USA, Mexico, Brazil, Argentina, Morocco, Saudi Arabia, South Africa, China, Australia; in Europe: Austria, Finland, France, Italy, Germany, Norway, Sweden, Netherlands, Poland, Spain, Switzerland, UK. Per-region figures as extracted: 4500, 400, 70300, 1200, 5000.]
6.
Information Trends: What are we seeing in the marketplace?
Recent years have brought unprecedented changes to the information landscape. Each of these “disruptors” has individual momentum, and collectively they represent a significant opportunity to improve an organization’s effectiveness.
Successful CIOs and leaders consciously take these trends into consideration when planning the evolution of their information architecture.
Empower the business by focusing from the “user down”, not the “system up”.
• Ms. Agility killed Mr. Waterfall: The days of modeling business requirements months or even years in advance, with IT delivering a multi-year plan to roll out a solution that may no longer apply in a fast-changing business environment, are long gone.
• Cloud Computing: The availability of “finished” business functions within the cloud provides organizations with tremendous opportunities while increasing IT information challenges.
• Open Source: Open source architecture provides substantial development and complexity cost savings vs. legacy software packages.
• As a Service: Software as a Service offerings in Big Data, data transformation and finished analytics are removing the infrastructure bottlenecks of servers, software and maintenance that obstruct speed to market.
• Security & Privacy: The proliferation of web-connected IP devices creates a “hyper-evolving” cyber breach potential for organizations; privacy laws create compliance challenges with mobile devices.
• Business Meta-Data: Traditionally, data dictionaries have been single-purpose and technically focused. As data becomes more valuable and the same information is used in multiple ways, the need for business meta-data will become critical.
• Social Computing: Has resulted in data where segments are loosely connected and correlations are at times non-intuitive, requiring new ways to mine and derive insights.
• In Memory Analytics: Massive in-memory databases with intensely complex analytics are highly scalable: change anything, anytime, and simultaneously compare the results of multiple scenarios in seconds.
• “Real” Analytics: Describes the transition from historical or hindsight indicators to insight and foresight indicators and visualizations.
8.
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it
• Fast for business
• Easy to manage
• Secure without compromise
Hybrid Deployment Flexibility
• Public Cloud
• Private Cloud
• Hybrid Environments
[Diagram: platform layers. PROCESS, ANALYZE, SERVE (BATCH, STREAM, SQL, SEARCH, OTHER); UNIFIED SERVICES (RESOURCE MANAGEMENT, SECURITY); STORE (FILESYSTEM, RELATIONAL, NoSQL, OTHER); INTEGRATE (STRUCTURED, UNSTRUCTURED); flanked by OPERATIONS and DATA MANAGEMENT.]
9.
The traditional approach to BI & Analytics is a bottleneck in the operational value chain
Traditional BI & Analytics approach
• Centralised BI teams too monolithic and divorced from the business operations
• Insights latency
• Reporting on the past, limited ability to predict and prescribe what is needed now
• Each new business question asked = more time required to crunch the right data
• Heavy duplication in operational data throughout the BI layers & systems
• Diluted data quality & governance create risks of security breach, compliance issues & risk exposure
• Significant costs: infrastructure and people
• Limited ability to scale, either from organic data volumes growth or increasing data complexity
10.
The Insights-driven enterprise puts information at the centre and insights “at the point of action”
Next Generation approach
• Next-generation data management platform enabling a pervasive, real-time “insights & data fabric” serving operations
• Standardized & cost effective data management, allowing high agility on insights and the ability to “ask any questions”
• Operational applications provide data and integrate insights back in a continuous improvement loop
• Operations integrate predicted best outcomes to optimise business processes, automatically where possible
• Ability to detect and catch events on the fly that will require immediate action (e.g. fraud detection) for optimal reaction or proactive action
• Coherent management of platforms & data management processes, with insights & data science skills embedded directly in the operational units for maximum impact
• Optimized total cost of ownership (TCO) with a rationalized and simplified data landscape
11.
Key challenges blur the vision on both the target and the journey to the Insights-driven enterprise
Challenges addressed
• “Which data should we retain and/or which data could we archive?”
• “I don’t know how to drive value from my data”
• “Can I decrease costs by moving my data (landscape) to the cloud or As-A-Service?”
• “How mature is my data landscape in comparison to the best industry trends?”
• “I have been told to ‘do something’ about big data analytics but don’t know where to start”
• “Can the Business Intelligence landscape be optimized to derive the maximum value out of it?”
• “Our data landscape is scattered, complex and very expensive, can we fix it?”
Value created
A modern data strategy will enable:
• Reduced complexity: Rationalizing the data strategy to meet demand
• Lower cost: Reduce the operating cost of your data strategy
• Increased agility and better time to market: More speed in the development of new information applications
• More/better insights and return on intelligence: Easier derivation of meaningful insights, enabling business transformation
• Less risk: Reduce complexity of the data strategy
• Data security & privacy: Make your data strategy compliant with rules and regulations
13.
Capgemini’s Leap Data Transformation Framework
Modules overview
Lifecycle phases: Estimation, Discovery, Design/Build, Testing
• Misura: Estimate the transformation effort
• Diligent: Optimize/reduce transformation scope
• Idem: Optimize ETL semantic design
• Blend: Optimize reporting design
• Papillon: Optimize SQL
• Virtu: Industrialize end to end testing
Essence (Semantic Layer consolidation)
• Analyze existing semantic layer of architecture
• Identify potential functional overlap and produce recommendations for consolidation
Data concierge
• Business Information Catalog
• Self service ingestion, distillation, analytics
Data Operations Services
• Agile environment provisioning
• Continuous Integration lifecycle
• One-Click leap
14.
Diligent / Blend Applications
Business Problem
• Large and complex DW estates have been built over the last 20 years or so, and the infrastructure hosting them might need updating
• A number of reports and underlying tables will be duplicated or no longer used; they can be decommissioned, saving valuable resources
• Users are reluctant to give up “their” reports/data when migration programmes occur
Solution
A scientific and objective approach to measure which data are actually used:
• Diligent BO Audit data explorer to identify interactions between users and Universes / Reports and tables
• Diligent BO Meta data gathering Module to extract Universe and report information
• Blend Report merger to identify report reductions
• Blend XML Generator to create Pentaho reporting cubes from Diligent-gathered metadata
Accelerator Results
• Scope reduction through identifying current BO reports that are not used: up to 40% discovered with one of our customers
• Scope reduction through identifying reports that are duplicates or share a number of data items
• Automated method to migrate BO reports to Pentaho, hence reduced workload and reduced errors
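The scope-reduction idea on this slide can be illustrated with a minimal sketch. It is not Diligent or Blend code: it assumes a hypothetical audit log of (report, user, access date) events, and the function names, the 180-day idle cutoff and the sample data are all illustrative assumptions.

```python
from datetime import date, timedelta


def find_unused_reports(all_reports, audit_events, today, max_idle_days=180):
    """Return reports with no access event in the last max_idle_days (assumed cutoff)."""
    cutoff = today - timedelta(days=max_idle_days)
    recently_used = {
        report for report, _user, accessed in audit_events if accessed >= cutoff
    }
    return sorted(set(all_reports) - recently_used)


def column_overlap(columns_a, columns_b):
    """Jaccard overlap of the data items used by two reports (1.0 means identical sets)."""
    a, b = set(columns_a), set(columns_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical audit data: (report name, user, last access date).
reports = ["sales_daily", "sales_daily_copy", "stock_levels", "hr_headcount"]
events = [
    ("sales_daily", "alice", date(2016, 5, 30)),
    ("stock_levels", "bob", date(2015, 1, 12)),
    ("hr_headcount", "carol", date(2016, 4, 2)),
]

unused = find_unused_reports(reports, events, today=date(2016, 6, 1))
overlap = column_overlap(
    ["date", "region", "revenue"],
    ["date", "region", "revenue", "margin"],
)
print("Candidates for decommissioning:", unused)
print("Shared data items ratio:", overlap)
```

Reports flagged this way are only candidates: as the slide notes, users are often reluctant to give up “their” reports, so the list would still need business review.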
15.
IDEM-DA
Business Problem
• The customer has very strict security and normalisation requirements when loading their data; they need different obfuscation types for different “semantic types”, e.g. names, phone numbers, social security numbers, etc.
• Left as a manual activity, this would imply a laborious and time-consuming identification across hundreds of thousands of columns, a costly and error-prone activity
Solution
• Automated identification of table columns for encryption and standardisation
• Automated creation of ETL meta-data spreadsheets which drive Data Acquisition Pentaho jobs for data migration
Accelerator Results
• Manual generation of meta-data spreadsheet: several days to weeks. With IDEM-DA: 15 minutes to 2 hours
• Manual eyeballing of data: prone to human error, can take hours to several days. With IDEM-DA: approximately 70% reduction and more accurate identification of known types
• Project manager of a Data Migration project: “IDEM-DA is the only way forward”
16.
IDEM-DA
IDEM-DA is a module used to support the ETL from legacy data warehouses into modern architecture
Example table
Column Name | Dataset | Semantic Type
mob_no | 07710232931, 07083210302 | MOBILE_NO
email | example@hotmail.com, hello@gmail.com | EMAIL
free_text_field | My address is 12 lucky street, London, E12 2TF | Address
serial_id | 11234, 22313, 3231313 | UNKNOWN
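To make the example table concrete, here is a minimal sketch of how a column's semantic type could be inferred from sample values. This is an assumption-laden illustration, not how IDEM-DA works internally: the regular expressions (UK-style mobile numbers and postcodes), the 80% threshold and the function name are all hypothetical.

```python
import re

# Hypothetical patterns; a real deployment would need locale-specific rules.
PATTERNS = {
    "MOBILE_NO": re.compile(r"^07\d{9}$"),
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    # Treats free text containing a UK-style postcode as an address.
    "Address": re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s*\d[A-Z]{2}\b"),
}


def classify_column(values, threshold=0.8):
    """Return the first semantic type matched by at least `threshold` of the values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "UNKNOWN"
    for semantic_type, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.search(v))
        if hits / len(cleaned) >= threshold:
            return semantic_type
    return "UNKNOWN"


# Sample values taken from the example table on the slide.
sample = {
    "mob_no": ["07710232931", "07083210302"],
    "email": ["example@hotmail.com", "hello@gmail.com"],
    "free_text_field": ["My address is 12 lucky street, London, E12 2TF"],
    "serial_id": ["11234", "22313", "3231313"],
}

for column, values in sample.items():
    print(column, "->", classify_column(values))
```

The detected type could then drive the choice of obfuscation per column, which is the use the slide describes for the generated ETL meta-data.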
17.
IDEM-ES
Business Problem
• The customer has a load pattern called “cutover+delta”: historical tables are updated with daily files
• Although many tables have most of their columns with similar names, matching them as a manual activity would imply a time-consuming identification across hundreds of thousands of columns, an error-prone activity
Solution
• Machine learning based solution to automatically identify similarity between columns (human-supervised), using:
– Column name similarity (ngrams)
– Column content similarity (ngrams)
– Column content agnostic distribution (hist)
• Open architecture to automatically evaluate the best model (tested 600+)
• Automated creation of INSERT INTO ETL scripts
Accelerator Results
• Acceleration expected around 30-50%
• Can automatically generate SQL insert statements to create the current view
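The column-name similarity step can be sketched as follows. This is a simplified stand-in, not IDEM-ES itself: it uses plain character trigram Jaccard similarity rather than a trained model, covers names only (not content or histograms), and the table names, column names and 0.4 threshold are hypothetical.

```python
def ngrams(text, n=3):
    """Character n-grams of a lower-cased name, padded so short names still yield grams."""
    pad = "_" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def name_similarity(col_a, col_b, n=3):
    """Jaccard similarity between the n-gram sets of two column names."""
    a, b = ngrams(col_a, n), ngrams(col_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(source_cols, target_cols, min_score=0.4):
    """Map each target column to its most similar source column above min_score."""
    mapping = {}
    for target in target_cols:
        best = max(source_cols, key=lambda s: name_similarity(s, target))
        if name_similarity(best, target) >= min_score:
            mapping[target] = best
    return mapping


def build_insert(target_table, source_table, mapping):
    """Generate an INSERT INTO ... SELECT statement from a reviewed column mapping."""
    targets = list(mapping)
    sources = [mapping[t] for t in targets]
    return (
        f"INSERT INTO {target_table} ({', '.join(targets)}) "
        f"SELECT {', '.join(sources)} FROM {source_table}"
    )


# Hypothetical daily delta file columns and historical table columns.
delta_cols = ["cust_id", "cust_name", "order_dt"]
history_cols = ["customer_id", "customer_name", "order_date"]

mapping = match_columns(delta_cols, history_cols)
sql = build_insert("customer_current", "customer_delta", mapping)
print(sql)
```

Because the slide describes the matching as human-supervised, a mapping like this would be reviewed before any generated statement is run.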
20.
Virtu – Data testing Framework
Business Problem
• Testing data migrations, and more generally the integrity of data transformations in large scale BI/DW estates, is complicated
• Thousands of objects are moved during the migration and, once in production, loaded every day; this might lead to hundreds of defects, and keeping track of them all without an automated system can become a daunting task
• Continuous monitoring of DQ performance and regression error history is essential to maintain acceptable levels of quality
Solution
A complete e2e testing framework that accelerates the configuration, execution and evaluation of tests for large scale BI domains, comprised of:
• Web UI for maximum user friendliness in configuration
• Scheduler engine to launch configurable batches of tests
• Real time Defect manager for timely defect issuing and progress checks
• DQ dashboard for monitoring state and progress
Benefits
• Customer can easily plan and execute a large number of checks, completely controlling their lifecycle (creation, modification, decommissioning)
• Configurable engine to store details of defects, for maximum visibility and transparency on errors and their resolutions
• Native connection to modern defect management systems (Jira), and easily expandable to any system with a reachable API
• DQ dashboard gives real time and drillable information on current DQ state
• Compatible with 3 system types: Oracle, Impala & MySQL
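A reconciliation check of the kind such a framework might schedule can be sketched in a few lines. This is an illustrative assumption, not Virtu's design: it uses Python's built-in sqlite3 as a stand-in for Oracle, Impala or MySQL, and the table names, check names and defect record layout are hypothetical.

```python
import sqlite3


def run_checks(conn, checks):
    """Run each check; it passes when the source and target queries return the same scalar."""
    defects = []
    for name, source_sql, target_sql in checks:
        expected = conn.execute(source_sql).fetchone()[0]
        actual = conn.execute(target_sql).fetchone()[0]
        if expected != actual:
            defects.append({"check": name, "expected": expected, "actual": actual})
    return defects


# In-memory stand-in for a legacy warehouse table and its migrated copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE legacy_orders (id INTEGER, amount INTEGER);
    CREATE TABLE lake_orders (id INTEGER, amount INTEGER);
    INSERT INTO legacy_orders VALUES (1, 100), (2, 250), (3, 75);
    INSERT INTO lake_orders VALUES (1, 100), (2, 250);
""")

# Each check is (name, query on the source, query on the target).
checks = [
    ("row_count", "SELECT COUNT(*) FROM legacy_orders", "SELECT COUNT(*) FROM lake_orders"),
    ("amount_total", "SELECT SUM(amount) FROM legacy_orders", "SELECT SUM(amount) FROM lake_orders"),
    ("min_id", "SELECT MIN(id) FROM legacy_orders", "SELECT MIN(id) FROM lake_orders"),
]

defects = run_checks(conn, checks)
for defect in defects:
    print(defect)
```

Keeping checks as data rather than code is one way to support the lifecycle control the slide mentions (creation, modification, decommissioning), and the defect records could be forwarded to a tracker such as Jira.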
24.
Leap Data Transformation Framework is the result of a client co-innovation process and delivered efficiencies on large projects
An end to end, fact-based transformation framework to deliver IT Rationalization on top of Big Data architectures
Approach
• Capgemini client in Public Sector is building a Business Data Lake (BDL) to support all digital channel interactions as well as rationalize/optimize its IT Business Intelligence legacy landscape on top of the new Big Data architecture
• In the scope of the IT Rationalization project, 10+ data warehouses, hundreds of analytical business services, and thousands of BO reports must be moved on top of the BDL, for thousands of business users throughout the organization
• In this context, Leap Data Transformation Framework was used on a 1st business scope
Key Outcomes
Leap is a framework consisting of a transformation methodology and accelerators across the transformation lifecycle which can operate at scale:
• The methodology is modular and covers all phases of transformations
• Elements of the Discovery phase were automated
• Design and Build process automation (metadata driven) and application deployment controls delivered development efficiencies and scalability
• A metadata driven test automation framework reduced initial test effort and subsequent regression test activities
• A Continuous Development process
• Platform application stack deployment efficiencies
Accelerator Results
• Diligent: 40% reduction of the transformation scope
• Idem, Papillon, Blend: 15% efficiency in the design/build process through use of a semi-automated ETL code optimizer, a semi-automated SQL optimizer and a semi-automated report optimizer
• Virtu: 10% efficiency in the test development process (1st pass) & 30% efficiency in regression testing through an automated test & assurance framework
25.
Use cases for Capgemini’s Leap Data Transformation Framework for optimized business data lakes
Replatforming
• For advanced clients embracing the potential of modern architectures
• Opportunity to transform, simplify and rationalize an organization’s data landscape for optimized TCO
• Leap Data Transformation full suite enables risk and cost reduction, working well in an agile approach
Legacy Discovery/DW optimization
• For clients in need of better visibility of their current data assets before moving to Big Data
• Leap Data Transformation Framework can help optimize current data management processes, substantially reduce transformation scope, identify the optimal platform for the workloads and shape a future project for success
Managing existing BI & move to modern architectures
• Capgemini takes over current BI estate and modernizes it through its NextGen BISC approach
• For clients with redundant and expensive DW estates concerned about the risks of moving to modern architectures
• Leap Data Transformation Framework full suite is a key element to optimize the TCO and ensure quality in the transformation process
Testing
• For clients needing to automate their data testing in big data environments or large relational environments
• Tools can automate the testing lifecycle for both big data and traditional relational DW estates
26.
Replatforming legacy BI applications requires strong strategies for user adoption and decommissioning
THE USER ADOPTION STRATEGY
Strong user adoption strategy
• End users understand the new value they will get out of the new system
• They are empowered to use it
• Their success is spreading to new initiatives
• They forget all about the old & slow stuff fairly quickly
Weak user adoption strategy
• End users fear the new system will impact their capacity to do their jobs
• The known is safer than the new
• First tests on the new systems disappoint, any failure goes viral
• Evolutions still run on the old system, “just in case”
THE KILL STRATEGY
Strong kill strategy
• Systems are killed according to roadmap, costs linked to unused HW & SW are recovered
• IT & Business impacts are anticipated, managed and communicated
• The energy is focused on the new
Weak kill strategy
• First systems are shut down ignoring business constraints, impacting operations
• Endless hours spent to compare the old and the new and explain differences
• Unprepared board escalations when unplanned impacts arise
27.
Sample Table of contents for the output of a 4 week Data Warehouse Optimization roadmap based on LEAP
Data Extract & Staging
Data Management & EDW
Semantic Layer
Sandbox & Analytics
Operational Analytics
Data Virtualization Layer
Master Data Management
Metadata Management
Data Distribution Layer
Our Understanding
Big Data Trends in Heavy Equipment/Farm Industry
Technology Principles
Reference Architecture
– Conceptual Architecture
– Architecture Components
Technology Choice Points
– ETL tool comparison
– EMR vs. Hadoop
ETL & Data Offloading Plan
– Project Structure, Sequence, Sprints
– Assumptions
– Collaborative Planning & Prep
Logical Architecture
Business Value Proposition
Current State Architecture
End State Architecture
Current State + 6 months Architecture
Current State + 12 months Architecture
Current State + 18 months Architecture
29.
Contact our experts
Schedule a discovery session with our experts
Schedule a first assessment of the value of Leap for your organization
Goutham Belliappa
Goutham.belliappa@capgemini.com
https://www.linkedin.com/in/gouthambelliappa
Andrea CAPODICASA
Andrea.capodicasa@capgemini.com
Duane Garrett
duane@cloudera.com
Editor's notes
Speaker: Goutham
Speaker: Alexandra
Let’s talk a bit about this new architecture that complements and extends existing investments.
An enterprise data hub can store unlimited data, cost-effectively and reliably, for as long as you need, and lets users access that data in a variety of ways. Data can be collected, stored, processed, explored, modeled, and served in one unified platform.
Cloudera’s enterprise data hub, powered by Apache Hadoop, the popular open source distributed data platform, is differentiated in several crucial areas. We provide:
Leading query performance.
The enterprise management and governance that you require of all of your mission-critical infrastructure.
Comprehensive, transparent, compliance-ready security at the core.
An open source platform that is also built of open standards – projects that are supported by multiple vendors to ensure sustainability, portability, and compatibility.
Our platform offers flexible deployment options, whether on-premises or in the cloud.
===
Cheat Sheet version: Our enterprise data hub is:
One place for unlimited data
Accessible to anyone
Connected to the systems you already depend on
Secure, governed, managed & compliant
Built on open source and open standards
Deployed however you want
Coupled with the support and enablement you need to succeed.
Important Note: Our EDH emphasizes “unified analytics” over “unified data”: it’s not practical or probable that customers will actually unify all their data. Much of it lives in the cloud, on storage (e.g. Isilon) or in remote datacenters; is of uncertain value relative to the cost of moving it to a hub; or is subject to security mandates that preclude collocation. We enable customers to gather unlimited data, while bringing diverse processing and analytics to that data.
Speaker: Alexandra
Value drivers!
Speaker: Alexandra
• How can I get value from data?
• What data do I keep?
• Lots of separate, complex, expensive systems: do I need them?
• Is my business set up to be competitive?
• Compliant and productionalize using real data
Speaker: Goutham
Speaker: Andrea
Speaker: Goutham