See what's new in our latest version - http://www.talend.com/products
Why do you need a data lake? How do you keep it governed? Data lakes are enterprise value-added data management platforms that are transforming the way data is analyzed and accessed. Here's how you can take full advantage.
3. #TalendConnect
Why Do We Need a Data Lake?
“Data lakes are enterprise-wide data management platforms for analyzing disparate sources of data in its native format.” Gartner
Business Value
Reducing cost
• ETL offload
• EDW offload/optimization
• Data archiving
Generating new opportunities
• Customer acquisition, retention…
• Real-time engagement
• Pricing optimization
• Demand forecasting
• Risk and fraud
• Predictive maintenance
• Smart products…
4. #TalendConnect
But Data Lakes Bring New Challenges
The rest of us
High-end users
Complexity, poor governance and control, no reuse
5. #TalendConnect
Data Lake – Conceptual Architecture
Acquire
Ingest
Understand & Improve
Curate & Govern
Deliver
Self-service
SCALE
6. #TalendConnect
Best Practices to a Successful Data Lake
Accelerate Data Ingestion
Understand & Govern Your Data
Remove Silos, Unify Data Management
Deliver Data to a Wide Audience
Continuously refreshed data
Continuous data delivery and data processes
7. #TalendConnect
Best Practices to a Successful Data Lake
Accelerate Data Ingestion
Wide connectivity
Batch & streaming ubiquity
Scale with volume and variety
Pitfalls:
• Hand coding
• Fragmented tools
8. #TalendConnect
Best Practices to a Successful Data Lake
Understand & Govern Your Data
Add context on data (provenance, semantics…)
Optimize data with curation, stewardship, preparation…
Use a collaborative process
Pitfalls:
• Authoritative governance
• Inconsistent framework
9. #TalendConnect
Best Practices to a Successful Data Lake
Remove Silos, Unify Data Management
Pervasive DQ, masking…
Consistent operationalization
Single platform for all use cases & personas
Pitfalls:
• Fragmented tools
• Hand coding
• Shadow IT
10. #TalendConnect
Best Practices to a Successful Data Lake
Deliver Data to a Wide Audience
Make data accessible
Governed self-service
Scalable operationalization
Pitfalls:
• Unmanaged autonomy
• Self-service tools only for the tech savvy
11. #TalendConnect
Best Practices to a Successful Data Lake
GET READY FOR CHANGE
12. #TalendConnect
Ingestion Best Practices
[Diagram] Sources: Transactions, Messages & Events, Logs, Sensors
[Diagram] Uses: Data Analytics & Data Science, Real-time Data Visualization, Real-time Indicators / Scorecard
[Diagram] Stages: Collect - Distribute, Track, Streaming, Windowing, Alert
NYC Taxi Data Streaming
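The windowing and alert stages named on this slide can be illustrated with a small plain-Python sketch. This is only an illustration under stated assumptions, not how Talend or Spark Streaming implement it: the event shape (timestamp, zone), the 60-second tumbling window, the threshold, and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (epoch_seconds, zone), e.g. one taxi pickup.
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, zone) using fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, zone in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, zone)] += 1
    return dict(counts)


def alerts(
    counts: Dict[Tuple[int, str], int], threshold: int
) -> List[Tuple[int, str, int]]:
    """Return (window_start, zone, count) for windows at or above the threshold."""
    return sorted(
        (start, zone, n) for (start, zone), n in counts.items() if n >= threshold
    )


if __name__ == "__main__":
    sample = [(0, "midtown"), (10, "midtown"), (59, "midtown"),
              (61, "midtown"), (65, "soho")]
    per_window = tumbling_window_counts(sample, window_seconds=60)
    print(alerts(per_window, threshold=3))
```

In a real streaming engine the events would arrive continuously and windows would be emitted as they close; here the whole sample is processed at once to keep the idea visible.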
14. #TalendConnect
Disclaimer
• The future features described in this presentation are under consideration by Talend and are not commitments for future products, technologies, or services.
• The roadmap is subject to change and Talend does not guarantee the features or release dates.
15. #TalendConnect
Roadmap 2017
Addressing the needs of large enterprises
Big Data: 1st on Spark 2.0 & Data Prep on Big Data
Cloud: Data Prep & Data Ingestion
Self-service: Data Stewardship & Self-service connectors
Governance: Apache Atlas
16. #TalendConnect
Key Takeaways
Analyze way more data to find more opportunities for innovations and transformations
Real-time data streaming brings increased agility
To unleash data lakes, data governance is essential
17. #TalendConnect
Free Trial: Talend Big Data Sandbox
• A ready-to-run Docker environment
• A step-by-step expert guide
• Real-world scenarios using Spark, Kafka, MapReduce & NoSQL
www.talend.com/BigDataSandbox
Hit the Easy Button for Hadoop, Spark and Machine Learning
#TalendConnect
Event theme: Unlock your data for unlimited possibilities
An enterprise data lake is often viewed as a panacea for all data ills, and as the ‘holy grail’ for those trying to spur digital transformation.
Yet many IT teams are still struggling to see the payoffs from such data lake investments. In this session we will present best practices and a reference architecture for building a data lake with Talend Big Data. We will talk about some new big data streaming capabilities coming in Talend Winter.
As a reference, JM’s session, to avoid too much overlap:
Data Prep: How to enable sustainable IT & Business collaboration around self-service data
Data is everywhere and everyone needs it. Only a modern data platform that combines self-service data access with Big Data and the Data Lake can turn data into a "liquid asset" that anyone can consume and use. But major roadblocks exist, particularly when it comes to data governance and security. To tackle these challenges, traditional authoritative governance approaches are being morphed into collaborative practices that turn existing business users into "data workers". Come to this session to learn how Talend Data Fabric addresses these issues and also get a sneak peek at our new self-service data capabilities coming in Talend Winter.
Unleashing the data lake to 22K people in 80+ countries for machine and equipment health, reliability management, and maintenance optimization.
Convey that a data warehouse (DW) is different from a data lake (DL)
Refer to: http://www.ge.com/digital/industries/power-utility/power-generation
Real-time: revenue generation
Prolong lifetime value of gas turbines
Keep it running for the next 10 years, depreciate in a better way
Sell the energy it creates
Batch meets RT world
Changing the GE culture
Before the data lake, they were only able to analyze 2% of the gas turbine data
GE is a massive company; 7 of its departments use Talend
GE Power is one of them
The Challenge
GE Power needed to operationalize and optimize their business, operations and asset performance management (machine & equipment health, reliability management, and maintenance optimization)
Only 2% of the turbine data was used for analytics; 98% of the data was never tapped
Needed a new data strategy: the cost of the traditional approach to data was rising
Why Talend
Provide data as a service, cafeteria style
Integrate diverse data sets and compute at Big Data scale
Lower cost to operate and reduced development efforts
The Result
130+ applications feeding the system, 7 ERPs, >12M transactions/day
68 change data capture real-time streaming systems (from Sales, ERP systems) for real-time analytics
22,000 users on Big Data in 86+ countries, in self-service mode
Convey the meaning
You built the lake and you can get the value but you’re struggling
Value of the DL
The data lake metaphor arose because ‘lakes’ are a great concept to explain one of the basic tenets of Big Data: the need to collect all of the data in the ecosystem so that it is ready to be analyzed for pertinent patterns using all kinds of analysis, including autonomous machine learning. This is because one of the basic tenets of data science is that the more data you can get, the better your analysis will ultimately be.
Data Lake vs data warehouse
Less construct vs more construct
Cost reduction:
EDW: 100 versus 1 for Hadoop
For the same cost, organizations can now store 50 times as much data in a Hadoop data lake as in a data warehouse.
The IT team can only do so much (data ingestion, security, DQ…)
True value comes when the business can access the data (self-service, improvements in DQ, lineage, governance)
Hadoop vs Data Lake (less construct, freedom to store anything, more volume, history, velocity)
Provide a definition
“Data lakes typically begin as ungoverned data stores. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency, and access controls.” Gartner
Left idle and overgrown, the data lake will quickly become a stagnant data swamp. But organizations can avoid data swamps by adding semantics to a data lake.
Semantics provides us with a highly usable and consistent taxonomy model for data lakes
But… the data lake brings new challenges…
This has led to a new set of problems and challenges that focus on trustworthiness and ubiquity…
Overwhelming amount of data
Concerns about data being accessed by individuals who shouldn’t have access, due to a lack of tools
Confusion around what data lies where
Limited number of people able to access the data
Limited understanding of where the data came from or what has been done with it
Limited data quality
These lead to data gridlock, causing data lakes to fail to deliver on their true potential
Example: BetVictor = real-time customer engagement
MUST HAVES:
Wide connectivity
Ubiquity of batch and streaming
Uncomplicated management for a wide range of data types
For discovery & prep we have Data Preparation. For curation we will add Data Stewardship in Winter.
We can handle all formats including the most complex hierarchical ones with the Talend Data Mapper, which runs on Spark.
PITFALLS:
Hand coding: development is 20% of the cost, maintenance 40%, support 40% (blog by Ashley, Sept 5, based on “Does Custom-Coded Data Integration Stack Up to Tools?” [1])
ALSO: 20% initial but 200% increase of maintenance cost
Standalone “specialized” disconnected tools
Air France = DQ + Talend Metadata Manager
MUST HAVES:
Capture metadata, provenance & lineage
Automate data tagging = data semantics & ML
Collaborative stewardship & curation
Preparation & improvement of the data
Control data accessibility
PITFALLS:
Top-down governance
Fragmented tooling = inconsistent governance framework
Must have integration with the distros & Apache
Top-down governance = hard & slow, impractical
Too fragmented tooling = inconsistent governance framework
Major regulatory obligations e.g. GDPR
Data accessibility = security. Beyond Kerberos: Kerberos is a given, and we of course support it everywhere, but Kerberos is not enough. You must plan for different granularity of access control, such as with Sentry, or policy-based rules with Ranger; for data encryption, such as HDFS encryption; and for data masking of PII or other sensitive data. TALEND SUPPORTS ALL OF THE ABOVE!
The Profiler and Data Prep both use semantic analysis to understand the ‘meaning’ of the data and help identify sensitive data
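To make the idea of detecting and masking sensitive data concrete, here is a minimal Python sketch. It is not how the Talend Profiler or Data Prep work internally: the regular-expression patterns, the 80% match ratio, the salt, and every function name below are assumptions chosen purely for illustration.

```python
import hashlib
import re
from typing import Dict, List, Optional

# Hypothetical semantic types and patterns, for illustration only.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}


def detect_semantic_type(values: List[str], min_ratio: float = 0.8) -> Optional[str]:
    """Return the first type matched by at least min_ratio of the non-empty values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for name, pattern in SEMANTIC_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if matches / len(non_empty) >= min_ratio:
            return name
    return None


def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a prefix of its salted SHA-256 digest (deterministic)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def mask_sensitive_columns(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Mask every column whose values look like a known sensitive type."""
    return {
        column: [mask(v) for v in values] if detect_semantic_type(values) else list(values)
        for column, values in table.items()
    }


if __name__ == "__main__":
    table = {"contact": ["a@example.com", "b@example.org"], "city": ["Paris", "Lyon"]}
    print(mask_sensitive_columns(table))
```

Deterministic masking keeps joins on the masked column possible, at the cost of weaker protection than encryption or tokenization with a managed secret.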
For auditability and lineage: 1) our Studio is, and always has been, metadata driven; 2) we are fully integrated with Navigator and Atlas; 3) we can do enterprise metadata management beyond Hadoop with TMM
MUST HAVES:
Unified framework for all data management tasks
Single point of operationalization
Scalable business model
PITFALLS:
Fragmented tools & hand-coding
Isolated initiatives, shadow IT
Unpredictable and exponential costs
RingCentral = they don’t use Talend Data Prep (yet) but they deliver data in self-service
MUST HAVES:
Data accessibility for everyone
Self-service tools for everyone
Scalable operationalization
PITFALLS:
Isolated, unmanaged tools
Self-service tools only for the tech savvy
GET READY FOR CHANGE
Use future proof frameworks
Continuous delivery
Hybrid cloud
Pitfalls:
Hand-coding (again!) tied to a particular language / framework
Fragmented, tactical approaches
MUST HAVES:
Abstraction layer to manage and leverage diversity, evolutions, innovations
Continuous delivery of data and processes
Hybrid on-prem / cloud
PITFALLS:
Hand-coding tied to a particular technology or language
Fragmented approaches to leverage the diverse big data frameworks
Big Data
Big data and cloud innovations including Spark 2.0 = toward unification of batch & streaming
Staying on the cutting edge of big data innovation, processing big data at the fastest speeds possible.
Operationalize ML on Spark
Data Preparation for Big Data lets anyone access and improve data
Enables the information worker to turn data into insight at scale
Enables the entire organization to access “trusted” data in the lake
Cloud
Data Prep as a Service
Democratize Data ingestion via tools accessible to Data scientists, with a similar UX to what we presented earlier
Self-Service
New Data Stewardship App helps users make decisions on data and orchestrate data governance between IT and business.
It empowers the business to ensure data integrity at the source.
Data Prep Self-service connectors: Big Data, Cloud, but applicative connectors too (SFDC, MKTO, …)
Governance
We talked a lot about it
In Big Data, you have a lot of data about which you have very little knowledge. Atlas integration provides traceability & lineage
Free, zero-risk environment
Evaluate
The pros & cons of the various technologies
Spark Batch & Spark Streaming
Talend vs. hand coding
Real-world scenarios
Help plan for your data lake