Engineering Patterns for Implementing
Data Science Models on Big Data Platforms
Hisham Arafat
Digital Transformation Lead Consultant
Solutions Architect, Technology Strategist & Researcher
Riyadh, KSA – 31 January 2017
http://www.visualcapitalist.com/what-happens-internet-minute-2016/
Big Data…Practical Definition!
• Big Data is the challenge, not the solution
• Big Data technologies address that challenge
• Practically:
• Massive Streams
• Unstructured
• Complex Processing
Let’s Have a Use Case…Social Marketing
Social Marketing…Looks Simple!
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What people are saying about our new brand “LemaTea”?
It’s NOT as Easy as It Looks!
Not only building an appropriate model, but much more
Designing a Solution…Engineering Factors
• Interfacing with sources: REST APIs, source HTML,… (text is assumed)
• Parsing to extract: queries, Regular Expressions,…
• Crawling frequency: every 1 minute, 1 hour, on event,…
• Document structure: post, post + comments, #, Reach, Retweets,…
• Metadata: time, date, source, tags, authoritativeness,…
• Transformations: canonicalization, weights, tokenization,…
- Size: average of 2 KB / doc
- Initial load: 1.5B docs
- Frequency: every 5 minutes
- Throughput: 2 KB * 60,000 docs = 120 MB / load
- Growth per day: ~34 GB
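The sizing figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 MB = 1,000 KB) and a load every 5 minutes around the clock, and the function and variable names are hypothetical.

```python
# Ingestion sizing, using the figures quoted on the slide.
DOC_SIZE_KB = 2            # average document size
DOCS_PER_LOAD = 60_000     # documents fetched per load
LOAD_INTERVAL_MIN = 5      # one load every 5 minutes


def ingestion_sizing(doc_size_kb, docs_per_load, interval_min):
    """Return per-load and per-day volumes (decimal units assumed)."""
    loads_per_day = 24 * 60 // interval_min
    mb_per_load = doc_size_kb * docs_per_load / 1000
    return {
        "loads_per_day": loads_per_day,
        "mb_per_load": mb_per_load,
        "gb_per_day": mb_per_load * loads_per_day / 1000,
        "docs_per_day": docs_per_load * loads_per_day,
    }


sizing = ingestion_sizing(DOC_SIZE_KB, DOCS_PER_LOAD, LOAD_INTERVAL_MIN)
print(sizing)
```

With these inputs the daily growth comes out at roughly 34.6 GB and about 17.3M documents, which matches the rounded figures used on the slides.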
Engineering Factors
• Input format: text, encoded text,…
• Document representation: bag of words, ontology…
• Corpus structures: indexes, reverse indexes,…
• Corpus metrics: doc frequency, inverse doc frequency,…
• Preprocessing: annotation, tagging,…
• Files structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Processing window: 60K per 3 mins
- Processing rate: 20K doc per min
- Final doc size = 2KB * 5 ~ 10KB
- Scan rate: 20K docs/min * 10 KB ~ 200 MB/min
- Many overheads need to be added
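The processing-rate estimate can be checked the same way. This is a minimal sketch under the slide's own numbers; the 5x expansion factor is read off the "2KB * 5" line, decimal units are assumed, and the function name is hypothetical.

```python
def scan_rate(docs_per_window, window_min, raw_doc_kb, expansion_factor):
    """Return (docs per minute, final doc size in KB, scan rate in MB/min)."""
    docs_per_min = docs_per_window / window_min
    final_doc_kb = raw_doc_kb * expansion_factor
    mb_per_min = docs_per_min * final_doc_kb / 1000
    return docs_per_min, final_doc_kb, mb_per_min


# 60K docs must be processed within a 3-minute window; each 2 KB doc
# grows about 5x once corpus structures and annotations are added.
docs_per_min, final_doc_kb, mb_per_min = scan_rate(60_000, 3, 2, 5)
print(docs_per_min, final_doc_kb, mb_per_min)
```

The result excludes the overheads mentioned in the last bullet, so it is best read as a lower bound.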
Engineering Factors
• Dimensionality reduction: stemming, lemmatization, noisy words…
• Type of applications: search/retrieval, sentiment analysis…
• Modeling methods: classifiers, topic modeling, relevance…
• Model efficiency: confusion matrix, precision, recall…
• Overheads: intermediate processing, pre-aggregation,…
• Files structure: tables, text files, files-day,…
- No of docs: 1.5B + 17M / day
- Search for “LemaTea sweet taste”
- No of tf to calculate ~ 1.5B * 3 ~ 4.5B
- No of idf to calculate ~ 1.5B
- Total calculations for 1 search ~ 6 B
- Consider daily growth
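To make the search estimate concrete, the sketch below pairs a toy tf-idf scorer with the slide's back-of-envelope count (one tf per document per query term, plus one idf-related pass over the corpus). The scoring formula is a plain textbook variant chosen for illustration, not necessarily the model used in the talk, and the sample documents are made up.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Score tokenized docs against a query with a simple sum of tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if doc and df[term]:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


def calculations_per_search(n_docs, n_terms):
    """Slide-style estimate: n_docs * n_terms tf values plus n_docs idf work."""
    return n_docs * n_terms + n_docs


# Hypothetical three-document corpus.
docs = [
    ["lematea", "sweet", "taste"],
    ["lematea", "bitter"],
    ["coffee", "sweet"],
]
scores = tfidf_scores(["lematea", "sweet", "taste"], docs)
estimate = calculations_per_search(1_500_000_000, 3)
print(scores, estimate)
```

A real deployment would precompute document frequencies in an index instead of scanning the corpus per query; the count simply shows why that matters at 1.5B documents.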
Engineering Factors
• Files structure: tables, text files, files-day,…
• Files formats: HDFS, parquet, avro…
• Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,…
• Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML…
• Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming…
• Ingestion pattern: real-time, micro batches,…
- Overall Storage
- Processing capacity per node
- No of nodes
- Tables → Hive, HBase, Greenplum
- Individual files → Spark, Flink
- Files-day → Hadoop HDFS
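The three sizing items (overall storage, capacity per node, number of nodes) are linked by simple arithmetic. The sketch below is an assumption-laden illustration: the 10 KB final document size and the 1.5B initial load come from the earlier slides, while the replication factor of 3 and the 8 TB of usable storage per node are hypothetical values chosen only to show the calculation.

```python
import math


def nodes_needed(total_docs, doc_kb, replication, usable_tb_per_node):
    """Return (logical storage in TB, node count), storage-driven only."""
    storage_tb = total_docs * doc_kb / 1e9          # KB -> TB, decimal units
    raw_tb = storage_tb * replication               # copies kept by the platform
    return storage_tb, math.ceil(raw_tb / usable_tb_per_node)


# replication=3 and usable_tb_per_node=8 are hypothetical inputs.
storage_tb, nodes = nodes_needed(1_500_000_000, 10, 3, 8)
print(storage_tb, nodes)
```

This covers storage only; processing capacity per node and daily growth would push the node count higher.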
Engineering Factors
• Workload: no of requests, request size,…
• Application performance: response time, concurrent requests…
• Applications interfacing: REST APIs, native, messaging,…
• Application implementation: integration, model scoring,…
• Security model: application level, platform level,…
- For 3 search terms ~ 6B calculations
- For 5 search terms ~ 9B calculations
- For 10 concurrent requests ~ 75B
- Resource queuing / prioritization
- Search options like date range
- Access control model
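The workload figures follow from the per-search estimate used earlier (one tf per document per term, plus one idf pass). In the sketch below, the 75B total is reproduced by assuming the ten concurrent requests split evenly between 3-term and 5-term searches; that split is an assumption, since the slide does not state it.

```python
CORPUS_DOCS = 1_500_000_000  # initial load from the earlier slide


def calculations(n_terms, n_docs=CORPUS_DOCS):
    """Per-search estimate: n_docs * n_terms tf values plus n_docs idf work."""
    return n_docs * (n_terms + 1)


def concurrent_load(term_counts):
    """Total calculations for a batch of concurrent searches."""
    return sum(calculations(t) for t in term_counts)


three_terms = calculations(3)
five_terms = calculations(5)
# Assumed mix: five 3-term searches and five 5-term searches.
ten_requests = concurrent_load([3] * 5 + [5] * 5)
print(three_terms, five_terms, ten_requests)
```

Ten 3-term searches alone would come to 60B and ten 5-term searches to 90B, so resource queuing and prioritization matter regardless of the exact mix.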
Engineering Factors
Ongoing Process…Growing Requirements
What if?
• New sources are included
• Wider parsing criteria
• Advanced modeling: POS, Word Co-occurrence, Co-referencing, Named Entity, Relationship Extraction,…
• Better response time is needed
• More frequent ingestion
Dynamic Platform: Ingestion, Corpus Processing, Model Processing, Requests Processing
• Larger number of docs
• Increased processing requirements
• Platform expansion
• Overall architecture reconsidered
Some Building Blocks
What is a Data Science Model?
• Type & format of input data
• Data ingestion
• Transformations and feature engineering
• Modeling methods and algorithms
• Model evaluation and scoring
• Application implementation considerations
• In-Memory vs. In-Database
Key Challenges for Data Science Models
Volume → Growth → Scale-out Performance
Stationary → Streams → Data Flow Engines
Batches → Real-time → Event Processing
Structured → Unstructured → Complex Formats
Insights → Responsive → Prescriptive / Deep Models
Traditional Data Management Systems
• Shared I/O
• Shared Processing
• Limited Scalability
• Service Bottlenecks
• High Cost Factor
[Diagram: a database service with shared buffers and data files on a database cluster; all I/O passes through the shared service and the network]
Abstraction of Big Data Platforms
• Parallel Processing
• Shared Nothing
• Linear Scalability
• Distributed Services
• Lower Cost Factor
[Diagram: master nodes holding metadata, connected through a network interconnect to data nodes 1, 2, 3, …, n; each data node has its own I/O and holds user data / replicas]
In a Nutshell
Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/
• Very huge.
• Overlaps.
• Overloading.
• You need to start with a use case to be able to get your solutions well engineered.
Engineered Systems
• Packaged: Hortonworks – Pivotal – Cloudera
• Appliances: EMC DCA – Dell DSSD – Dell VxRack
• Cloud offerings: Azure – AWS – IBM – Google Cloud
Engineering Patterns in Implementation
Lambda Architecture…Social Marketing
• Generic, scalable and fault-tolerant data processing architecture.
• Keeps a master immutable dataset while serving low latency requests.
• Aims at providing linear scalability.
Source: http://lambda-architecture.net/
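The three bullets above can be illustrated with a toy, single-process sketch of the Lambda idea: an append-only master dataset, a batch view recomputed from it, a speed view covering events since the last batch run, and a query that merges both. The class and method names are hypothetical, and a real deployment would spread these layers across systems such as Kafka, Hadoop and Spark.

```python
from collections import Counter


class LambdaMentionCounts:
    """Toy Lambda Architecture for counting brand mentions."""

    def __init__(self):
        self.master = []             # immutable, append-only master dataset
        self.batch_view = Counter()  # recomputed from the full master dataset
        self.speed_view = Counter()  # incremental, covers events since last batch

    def ingest(self, brand):
        # New events are appended to the master log and reflected
        # immediately in the low-latency speed view.
        self.master.append(brand)
        self.speed_view[brand] += 1

    def run_batch(self):
        # The batch layer recomputes its view from scratch, after which
        # the speed view can be discarded.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, brand):
        # The serving layer merges the batch and speed views.
        return self.batch_view[brand] + self.speed_view[brand]


counts = LambdaMentionCounts()
for mention in ["LemaTea", "OtherTea", "LemaTea"]:
    counts.ingest(mention)
before_batch = counts.query("LemaTea")
counts.run_batch()
counts.ingest("LemaTea")
after_batch = counts.query("LemaTea")
print(before_batch, after_batch)
```

Because the batch view is always rebuilt from the master dataset, a bug in the speed layer is corrected at the next batch run, which is where the fault tolerance in the first bullet comes from.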
Social Marketing…Revisited
Ingest Social Feeds
Build Corpus Metrics
Design Text Mining Model
Deploy All to a Big Data Platform
Application for Marketing Users
What people are saying about our new brand “LemaTea”?
Lambda Architecture (cont.)
Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Sequence Files
Apache Spark / MLlib
• In-memory distributed processing
• Scala, Python, Java and R
• Resilient Distributed Dataset (RDD)
• MLlib: Machine Learning Algorithms
• SQL and DataFrames / Pipelines
• Streaming
• Big Graph analytics
Spark Cluster Mesos HDFS/YARN
Apache Spark
• Supports different types of Cluster Managers
• HDFS / YARN, Mesos, Amazon S3, Stand Alone, HBase, Cassandra…
• Interactive vs Application Mode
• Memory Optimization
Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
Apache Spark
Apache Spark MLlib
Apache Spark…The Big Picture
Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
Greenplum / MADlib
• Massively Parallel Processing
• Shared Nothing
• Table distribution
  • By Key
  • By Round Robin
• Massively Parallel Data Loading
• Integration with Hadoop
• Native MapReduce
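The two table distribution options can be sketched in a few lines. This is an illustration of the general idea only: CRC32 is used here as a stand-in hash and is not Greenplum's actual hash function, and the function names, row layout and segment count are hypothetical.

```python
import zlib


def distribute_by_key(rows, key, n_segments):
    """Place each row on the segment chosen by hashing its distribution key."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        segments[digest % n_segments].append(row)
    return segments


def distribute_round_robin(rows, n_segments):
    """Deal rows to segments in turn, ignoring their content."""
    segments = [[] for _ in range(n_segments)]
    for i, row in enumerate(rows):
        segments[i % n_segments].append(row)
    return segments


# Hypothetical rows: 10 posts spread over 3 sources.
rows = [{"post_id": i, "source": f"feed-{i % 3}"} for i in range(10)]
by_key = distribute_by_key(rows, "source", 4)
round_robin = distribute_round_robin(rows, 4)
print([len(s) for s in by_key], [len(s) for s in round_robin])
```

Distribution by key keeps rows with the same key on one segment, which helps joins and aggregations on that key but can skew if the key has few distinct values; round robin gives even segments at the cost of co-location.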
Apache MADlib
Image Processing…Unusual Way
Massively Parallel, In-Database Image Processing
Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
Take Aways
• Data science is not just the algorithms; it includes an end-to-end solution.
• The implementation should consider engineering factors and quantify them so that appropriate components can be selected.
• The Big Data technology landscape is huge and growing; start with a solid use case to identify potential components.
• Abstracting a specific technology helps you get a grip on its pros and cons.
• Apply creativity in solution design and technology selection, case by case.
• Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
Q & A
Email: hiarafat@hotmail.com
Skype: hichawy
LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230
Thank You

Contenu connexe

Tendances

MongoDB and RDBMS: Using Polyglot Persistence at Equifax
MongoDB and RDBMS: Using Polyglot Persistence at Equifax MongoDB and RDBMS: Using Polyglot Persistence at Equifax
MongoDB and RDBMS: Using Polyglot Persistence at Equifax MongoDB
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Zaloni
 
Architecture of Big Data Solutions
Architecture of Big Data SolutionsArchitecture of Big Data Solutions
Architecture of Big Data SolutionsGuido Schmutz
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Mark Rittman
 
Architecture for Real-Time and Batch Big Data Analytics
Architecture for Real-Time and Batch Big Data AnalyticsArchitecture for Real-Time and Batch Big Data Analytics
Architecture for Real-Time and Batch Big Data AnalyticsNir Rubinstein
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Mark Rittman
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Mark Rittman
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsOracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsMark Rittman
 
The importance of efficient data management for Digital Transformation
The importance of efficient data management for Digital TransformationThe importance of efficient data management for Digital Transformation
The importance of efficient data management for Digital TransformationMongoDB
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...Avinash Ramineni
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...Dataconomy Media
 
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...Mark Rittman
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsStéphane Fréchette
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data ArchitectureSplunk
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 

Tendances (20)

MongoDB and RDBMS: Using Polyglot Persistence at Equifax
MongoDB and RDBMS: Using Polyglot Persistence at Equifax MongoDB and RDBMS: Using Polyglot Persistence at Equifax
MongoDB and RDBMS: Using Polyglot Persistence at Equifax
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe? Webinar: Is Spark Hadoop's Friend or Foe?
Webinar: Is Spark Hadoop's Friend or Foe?
 
Architecture of Big Data Solutions
Architecture of Big Data SolutionsArchitecture of Big Data Solutions
Architecture of Big Data Solutions
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
 
Architecture for Real-Time and Batch Big Data Analytics
Architecture for Real-Time and Batch Big Data AnalyticsArchitecture for Real-Time and Batch Big Data Analytics
Architecture for Real-Time and Batch Big Data Analytics
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
 
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsOracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
 
The importance of efficient data management for Digital Transformation
The importance of efficient data management for Digital TransformationThe importance of efficient data management for Digital Transformation
The importance of efficient data management for Digital Transformation
 
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...Practical guide to architecting data lakes -  Avinash Ramineni - Phoenix Data...
Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data...
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
 
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 

En vedette

Building new business models through big data dec 06 2012
Building new business models through big data   dec 06 2012Building new business models through big data   dec 06 2012
Building new business models through big data dec 06 2012Aki Balogh
 
Data Science Highlights
Data Science Highlights Data Science Highlights
Data Science Highlights Joe Lamantia
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...ETCenter
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data scienceBrad Klingenberg
 
Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeCloudera, Inc.
 
From Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationFrom Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationCloudera, Inc.
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at DevoxxNathan Bijnens
 
DataScience and BigData Cebu 1st meetup
DataScience and BigData Cebu 1st meetupDataScience and BigData Cebu 1st meetup
DataScience and BigData Cebu 1st meetupFrancisco Liwa
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
How to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsHow to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsAki Balogh
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Pivotal Cloud Foundry: A Technical Overview
Pivotal Cloud Foundry: A Technical OverviewPivotal Cloud Foundry: A Technical Overview
Pivotal Cloud Foundry: A Technical OverviewVMware Tanzu
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsDarius Barušauskas
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Big Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionBig Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionDavid Pittman
 
Analytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionAnalytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionDeloitte United States
 

En vedette (19)

Complex Models for Big Data
Complex Models for Big DataComplex Models for Big Data
Complex Models for Big Data
 
Building new business models through big data dec 06 2012
Building new business models through big data   dec 06 2012Building new business models through big data   dec 06 2012
Building new business models through big data dec 06 2012
 
Data Science Highlights
Data Science Highlights Data Science Highlights
Data Science Highlights
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
 
Linear models for data science
Linear models for data scienceLinear models for data science
Linear models for data science
 
Becoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural ChangeBecoming Data-Driven Through Cultural Change
Becoming Data-Driven Through Cultural Change
 
From Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationFrom Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your Organization
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxx
 
DataScience and BigData Cebu 1st meetup
DataScience and BigData Cebu 1st meetupDataScience and BigData Cebu 1st meetup
DataScience and BigData Cebu 1st meetup
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
How to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsHow to create new business models with Big Data and Analytics
How to create new business models with Big Data and Analytics
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
The Ecosystem is too damn big
The Ecosystem is too damn big The Ecosystem is too damn big
The Ecosystem is too damn big
 
Pivotal Cloud Foundry: A Technical Overview
Pivotal Cloud Foundry: A Technical OverviewPivotal Cloud Foundry: A Technical Overview
Pivotal Cloud Foundry: A Technical Overview
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Big Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionBig Data in Retail - Examples in Action
Big Data in Retail - Examples in Action
 
Analytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolutionAnalytics Trends 2016: The next evolution
Analytics Trends 2016: The next evolution
 

Similaire à Engineering patterns for implementing data science models on big data platforms

The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliData Driven Innovation
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...MongoDB
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneMongoDB
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsAbhishekKumarAgrahar2
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesMongoDB
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
 

Engineering patterns for implementing data science models on big data platforms

  • 1. Data Science Models on Big Data Platforms Engineering Patterns for Implementing Hisham Arafat Digital Transformation Lead Consultant Solutions Architect, Technology Strategist & Researcher Riyadh, KSA – 31 January 2017
  • 2. http://www.visualcapitalist.com/what-happens-internet-minute-2016/ Big Data…Practical Definition! • Big Data is the challenge not the solution • Big Data technologies address that challenge • Practically: • Massive Streams • Unstructured • Complex Processing
  • 3. Let’s Have a Use Case…Social Marketing
  • 4. Social Marketing…Looks Simple! Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users What people are saying about our new brand “LemaTea”?
  • 5. Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users
  • 6. It’s NOT as Easy as It Looks!
  • 7. Not Only Building an Appropriate Model, but Also Designing a Solution…Engineering Factors
  • 8. Engineering Factors • Interfacing with sources: REST APIs, source HTML,… (text is assumed) • Parsing to extract: queries, regular expressions,… • Crawling frequency: every 1 minute, 1 hour, on event,… • Document structure: post, post + comments, #, Reach, Retweets,… • Metadata: time, date, source, tags, authoritativeness,… • Transformations: canonicalization, weights, tokenization,… - Size: average of 2 KB / doc - Initial load: 1.5B docs - Frequency: every 5 minutes - Throughput: 2 KB * 60,000 docs = 120 MB / load - Growth per day ~ 34 GB
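The sizing figures on slide 8 can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `ingestion_estimate` is hypothetical, and decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB) are assumed because they match the slide's rounded numbers.

```python
# Back-of-the-envelope ingestion sizing using the figures from slide 8.
# Assumption: decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB).
AVG_DOC_KB = 2            # average document size
DOCS_PER_LOAD = 60_000    # documents fetched per ingestion run
LOAD_INTERVAL_MIN = 5     # one ingestion run every 5 minutes


def ingestion_estimate(avg_doc_kb, docs_per_load, interval_min):
    """Return per-load and per-day volume for a fixed-interval ingestion."""
    loads_per_day = 24 * 60 // interval_min
    mb_per_load = avg_doc_kb * docs_per_load / 1_000
    return {
        "loads_per_day": loads_per_day,
        "mb_per_load": mb_per_load,
        "gb_per_day": mb_per_load * loads_per_day / 1_000,
        "docs_per_day": docs_per_load * loads_per_day,
    }


estimate = ingestion_estimate(AVG_DOC_KB, DOCS_PER_LOAD, LOAD_INTERVAL_MIN)
print(estimate)
```

Under these assumptions the daily growth comes out at about 34.6 GB and roughly 17 million documents, in line with the slide's rounded values.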
  • 9. Engineering Factors • Input format: text, encoded text,… • Document representation: bag of words, ontology… • Corpus structures: indexes, inverted indexes,… • Corpus metrics: doc frequency, inverse doc frequency,… • Preprocessing: annotation, tagging,… • File structure: tables, text files, files-day,… - No. of docs: 1.5B + 17M / day - Processing window: 60K docs per 3 mins - Processing rate: 20K docs per min - Final doc size = 2 KB * 5 ~ 10 KB - Scan rate: 20K docs/min * 10 KB ~ 200 MB/min - Many overheads need to be added
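The processing figures on slide 9 follow the same pattern. The sketch below assumes decimal units and treats the slide's "2KB * 5" as a fixed expansion factor for annotations and metrics; the helper `corpus_docs` is a hypothetical name used only to project corpus growth.

```python
# Corpus processing estimates using the figures from slide 9.
# Assumption: decimal units (1 MB = 1,000 KB).
DOCS_PER_WINDOW = 60_000   # documents to process per window
WINDOW_MIN = 3             # processing window in minutes
RAW_DOC_KB = 2             # raw document size
EXPANSION_FACTOR = 5       # the slide's "2KB * 5" after preprocessing

docs_per_min = DOCS_PER_WINDOW // WINDOW_MIN            # documents per minute
final_doc_kb = RAW_DOC_KB * EXPANSION_FACTOR            # KB per processed doc
scan_mb_per_min = docs_per_min * final_doc_kb / 1_000   # MB scanned per minute


def corpus_docs(days, initial=1_500_000_000, per_day=17_000_000):
    """Project the corpus size in documents after a number of days."""
    return initial + per_day * days


print(docs_per_min, final_doc_kb, scan_mb_per_min, corpus_docs(30))
```

These are lower bounds only; as the slide notes, further overheads have to be added on top.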
  • 10. Engineering Factors • Dimensionality reduction: stemming, lemmatization, noisy words… • Type of applications: search/retrieval, sentiment analysis… • Modeling methods: classifiers, topic modeling, relevance… • Model efficiency: confusion matrix, precision, recall… • Overheads: intermediate processing, pre-aggregation,… • File structure: tables, text files, files-day,… - No. of docs: 1.5B + 17M / day - Search for “LemaTea sweet taste” - No. of tf values to calculate ~ 1.5B * 3 ~ 4.5B - No. of idf values to calculate ~ 1.5B - Total calculations for 1 search ~ 6B - Consider daily growth
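To make the tf / idf counts on slide 10 concrete, here is a minimal sketch of a tf-idf relevance score over a toy corpus, together with the slide's counting rule. The function names, the toy documents, whitespace tokenization and the smoothed idf formula are all assumptions made for illustration; the deck does not specify a weighting scheme.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms):
    """Score each document against the query with a plain tf-idf sum."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in query_terms}
    # Smoothed idf (an assumption) so unseen terms do not divide by zero.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in query_terms}
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        scores.append(sum(counts[t] / length * idf[t] for t in query_terms))
    return scores


def calculations_per_search(n_docs, n_terms):
    """Slide 10 estimate: one tf per (doc, term) pair plus n_docs for idf."""
    return n_docs * n_terms + n_docs


docs = [
    "LemaTea has a sweet taste",
    "I prefer coffee in the morning",
    "sweet tea is not my taste",
]
query = ["lematea", "sweet", "taste"]
scores = tf_idf_scores(docs, query)
print(scores)
print(calculations_per_search(1_500_000_000, 3))
```

On the toy corpus the first document ranks highest and the unrelated one scores zero; at 1.5B documents and three terms, the counting rule gives the slide's figure of about 6 billion calculations per search.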
  • 11. Engineering Factors • File structure: tables, text files, files-day,… • File formats: HDFS, Parquet, Avro… • Platform technology: Hadoop/YARN, Spark, Greenplum, Flink,… • Model deployment: Java/Scala, Mahout, MLlib, MADlib, PL/R, FlinkML… • Data ingestion: Spring XD, Flume, Sqoop, G. Data Flow, Kafka/Streaming… • Ingestion pattern: real-time, micro batches,… - Overall storage - Processing capacity per node - No. of nodes - Tables → Hive, HBase, Greenplum - Individual files → Spark, Flink - Files-day → Hadoop HDFS
  • 12. Engineering Factors • Workload: no. of requests, request size,… • Application performance: response time, concurrent requests… • Application interfacing: REST APIs, native, messaging,… • Application implementation: integration, model scoring,… • Security model: application level, platform level,… - For 3 search terms ~ 6B calculations - For 5 search terms ~ 9B calculations - For 10 concurrent requests ~ 75B - Resource queuing / prioritization - Search options like date range - Access control model
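The workload numbers on slide 12 extend the per-search estimate to concurrent requests. The slide does not say how the 75B figure is derived; one reading that reproduces it, assumed here, is an even mix of five 3-term and five 5-term requests. The names `search_cost` and `concurrent_load` are hypothetical.

```python
CORPUS_DOCS = 1_500_000_000  # initial corpus size taken from the slides


def search_cost(n_terms, n_docs=CORPUS_DOCS):
    """One tf per (doc, term) pair plus n_docs for idf, as on slide 10."""
    return n_docs * n_terms + n_docs


def concurrent_load(term_counts, n_docs=CORPUS_DOCS):
    """Total calculations for requests running at the same time."""
    return sum(search_cost(t, n_docs) for t in term_counts)


# Assumed mix: five 3-term searches and five 5-term searches.
mixed_batch = [3] * 5 + [5] * 5

print(search_cost(3))                 # 3 search terms
print(search_cost(5))                 # 5 search terms
print(concurrent_load(mixed_batch))   # 10 concurrent requests
```

Estimates like this are what drive the resource queuing and prioritization decisions listed on the slide, and they grow with the corpus every day.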
  • 13. Ongoing Process…Growing Requirements What if? • New sources are included • Wider parsing criteria • Advanced modeling: POS, Word Co-occurrence, Co-referencing, Named Entity, Relationship Extraction,… • Better response time is needed • More frequent ingestion Dynamic Platform Ingestion Corpus Processing Model Processing Requests Processing • Larger number of docs • Increased processing requirements • Platform expansion • Overall architecture reconsidered
  • 15. What is a Data Science Model? • Type & format of input data • Data ingestion • Transformations and feature engineering • Modeling methods and algorithms • Model evaluation and scoring • Application implementation considerations • In-Memory vs. In-Database
  • 16. Key Challenges for Data Science Models Volume Stationary Batches Structured Insights Growth Streams Real-time Unstructured Responsive Scale out Performance Data Flow Engines Event Processing Complex Formats Perspective / Deep Models
  • 17. Traditional Data Management Systems • Shared I/O • Shared Processing • Limited Scalability • Service Bottlenecks • High Cost Factor Shared Buffers Data Files Database Cluster I/O I/O I/O Network Database Service
  • 18. Abstraction of Big Data Platforms Data Nodes Master Nodes I/O Network Interconnect • Parallel Processing • Shared Nothing • Linear Scalability • Distributed Services • Lower Cost Factor I/O I/O I/O … Metadata 1 2 3 n Metadata User data / Replicas User data / Replicas User data / Replicas User data / Replicas
  • 19. In a Nutshell Source: http://dataconomy.com/2014/06/understanding-big-data-ecosystem/ • Very huge. • Overlaps. • Overloading. • You need to start with a use case to be able to get your solutions well engineered.
  • 20. Engineered Systems • Packaged: Hortonworks – Pivotal – Cloudera • Appliances: EMC DCA – Dell DSSD – Dell VxRack • Cloud offerings: Azure – AWS – IBM – Google Cloud
  • 22. Lambda Architecture…Social Marketing • Generic, scalable and fault-tolerant data processing architecture. • Keeps a master immutable dataset while serving low latency requests. • Aims at providing linear scalability. Source: http://lambda-architecture.net/
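The three properties on slide 22 can be illustrated with a toy, single-process sketch of the Lambda pattern: an append-only master dataset, a batch view recomputed from it, and a speed view that covers events since the last batch run. The class `LambdaCounter` and its methods are hypothetical names, and brand-mention counting is chosen only to match the deck's LemaTea use case; a real deployment would spread these layers across a platform such as those named in the deck.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts brand mentions."""

    def __init__(self):
        self.master = []              # immutable, append-only master dataset
        self.batch_view = Counter()   # precomputed by the batch layer
        self.speed_view = Counter()   # events since the last batch run

    def ingest(self, brand):
        """New events go to the master dataset and to the speed layer."""
        self.master.append(brand)
        self.speed_view[brand] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, brand):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[brand] + self.speed_view[brand]


mentions = LambdaCounter()
mentions.ingest("LemaTea")
mentions.ingest("LemaTea")
mentions.run_batch()          # batch view now holds 2 mentions
mentions.ingest("LemaTea")    # arrives after the batch run
print(mentions.query("LemaTea"))
```

Because the batch view is always rebuilt from the untouched master dataset, an error in the speed layer is corrected by the next batch run, which is where the fault tolerance of the pattern comes from.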
  • 23. Social Marketing…Revisited Ingest Social Feeds Build Corpus Metrics Design Text Mining Model Deploy All to a Big Data Platform Application for Marketing Users What people are saying about our new brand “LemaTea”?
  • 24. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  • 25. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  • 26. Lambda Architecture (cont.) Source: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark Sequence Files
  • 27. Apache Spark / MLlib • In-memory distributed processing • Scala, Python, Java and R • Resilient Distributed Dataset (RDD) • MLlib – Machine Learning Algorithms • SQL and Data Frames / Pipelines • Streaming • Big Graph analytics Spark Cluster Mesos HDFS/YARN
  • 28. Apache Spark • Supports different types of cluster managers and storage systems • HDFS / YARN, Mesos, Amazon S3, Standalone, HBase, Cassandra… • Interactive vs Application Mode • Memory Optimization Source: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
  • 31. Apache Spark…The Big Picture Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
  • 32. Greenplum / MADlib • Massively Parallel Processing • Shared Nothing • Table distribution • By Key • By Round Robin • Massively Parallel Data Loading • Integration with Hadoop • Native MapReduce
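The two table distribution options on slide 32 can be simulated in a few lines. This is a conceptual sketch only: the function names are hypothetical, and `zlib.crc32` stands in for whatever hash function the database uses internally. In Greenplum itself the choice is made in the table definition, with `DISTRIBUTED BY (column)` for key distribution and `DISTRIBUTED RANDOMLY` for the keyless option.

```python
import zlib
from itertools import cycle


def distribute_by_key(rows, key, n_segments):
    """Hash the distribution key so equal keys land on the same segment."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        # crc32 is an illustrative stand-in for the database's own hash.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n_segments
        segments[idx].append(row)
    return segments


def distribute_round_robin(rows, n_segments):
    """Deal rows to segments in turn, ignoring their content."""
    segments = [[] for _ in range(n_segments)]
    for row, idx in zip(rows, cycle(range(n_segments))):
        segments[idx].append(row)
    return segments


rows = [{"doc_id": i, "source": s}
        for i, s in enumerate(["twitter", "facebook", "blog"] * 4)]

by_key = distribute_by_key(rows, "source", n_segments=4)
by_rr = distribute_round_robin(rows, n_segments=4)
print([len(seg) for seg in by_key])
print([len(seg) for seg in by_rr])
```

Key distribution keeps all rows with the same key on one segment, which helps joins and aggregations on that key but can skew the load; round robin spreads rows evenly at the cost of moving data when rows must be grouped.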
  • 34. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 35. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 36. Image Processing…Unusual Way Massively Parallel, In-Database Image Processing Source: https://content.pivotal.io/blog/data-science-how-to-massively-parallel-in-database-image-processing-part-1
  • 37. Takeaways • Data science is not just the algorithms; it includes an end-to-end solution. • The implementation should consider engineering factors and quantify them so that appropriate components can be selected. • The Big Data technology landscape is huge and growing – start with a solid use case to identify potential components. • Abstracting specific technologies lets you weigh their pros and cons. • Be creative in solution design and technology selection, case by case. • Lambda Architecture, Spark, Spark MLlib, Spark Streaming, Spark SQL, Kafka, Hadoop / YARN, Greenplum, MADlib.
  • 38. Q & A
  • 39. Email: hiarafat@hotmail.com Skype: hichawy LinkedIn: https://eg.linkedin.com/in/hisham-arafat-a7a69230 Thank You