Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera, Inc.
Hadoop Summit 2010
2. Session Agenda
- About Me
- About Cloudera
- Bayesian (Probabilistic) Networks
- BN Inference 101
- CPCS Network
- Why BN Inference
- Inference with MR
- Results
- Conclusions
3. About Me
- Worked on BN inference in 1995-1998 (for Ph.D.)
- Published the fastest implementation at the time
- Have worked in the DM/BI field since then
- Recently joined Cloudera, Inc.
- Started looking at how to solve the world's hardest problems
4. About Cloudera
Founded in the summer of 2008, Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera's platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
5. Bayesian Networks
- Nodes
- Edges
- Probabilities
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances, published posthumously by his friend. Philosophical Transactions of the Royal Society of London, 53:370-418.
6. Applications
- Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis)
- Medicine
- Document classification, information retrieval
- Image processing
- Data fusion
- Gaming
- Law
- On-line advertising!
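7. A Simple BN Network
Pr(Rain):
  T 0.2   F 0.8
Pr(Sprinkler | Rain):
  Rain = F:  T 0.4   F 0.6
  Rain = T:  T 0.1   F 0.9
Pr(Wet Driveway | Sprinkler, Rain):
  Sprinkler, Rain = F, F:  T 0.01   F 0.99
  Sprinkler, Rain = F, T:  T 0.8    F 0.2
  Sprinkler, Rain = T, F:  T 0.9    F 0.1
  Sprinkler, Rain = T, T:  T 0.99   F 0.01
Example queries:
- Pr(Rain | Wet Driveway)
- Pr(Sprinkler Broken | !Wet Driveway & !Rain)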
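As a rough illustration of the first query on this slide, the sketch below computes Pr(Rain | Wet Driveway) by brute-force enumeration of the joint distribution. It assumes the flattened tables read as Pr(Rain=T) = 0.2, Pr(Sprinkler=T | Rain=F) = 0.4, Pr(Sprinkler=T | Rain=T) = 0.1, and Pr(Wet=T | Sprinkler, Rain) = 0.01, 0.8, 0.9, 0.99 for (F,F), (F,T), (T,F), (T,T). That reading, and all names in the code (`joint`, `posterior`, and so on), are assumptions made for this example, not something the slides specify.

```python
from itertools import product

# Assumed reading of the slide's tables (see lead-in); probabilities of "True".
P_RAIN_TRUE = 0.2
P_SPRINKLER_TRUE_GIVEN_RAIN = {False: 0.4, True: 0.1}
P_WET_TRUE_GIVEN = {  # keyed by (sprinkler, rain)
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}

NAMES = ("rain", "sprinkler", "wet")


def bernoulli(p_true, value):
    """Probability of a boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(rain, sprinkler, wet):
    """Joint probability as the product of the three tables."""
    return (
        bernoulli(P_RAIN_TRUE, rain)
        * bernoulli(P_SPRINKLER_TRUE_GIVEN_RAIN[rain], sprinkler)
        * bernoulli(P_WET_TRUE_GIVEN[(sprinkler, rain)], wet)
    )


def posterior(query, evidence):
    """Pr(query | evidence); both arguments map variable name -> bool."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(**world)
        denominator += p
        if all(world[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_rain_given_wet = posterior({"rain": True}, {"wet": True})
print(f"Pr(Rain | Wet Driveway) = {p_rain_given_wet:.4f}")
```

Enumeration touches every row of the joint table, which is fine for three binary variables but is exactly what stops scaling for larger networks.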
8. Asia Network
- Pr(Visit to Asia)
- Pr(Smoking)
- Pr(Lung Cancer | Smoking)
- Pr(Tuberculosis | Visit to Asia)
- Pr(Bronchitis | Smoking)
- Pr(C | BE)
- Pr(X-Ray | Lung Cancer or Tuberculosis)
- Pr(Dyspnea | CG)
Example query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
9. BN Inference 101 (in Hive)
JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H)
PAB = SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) (Bayes' rule)
CPCS has 422 nodes, so its JPD is a table of at least 2^422 rows!
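The GROUP BY marginalization on this slide can be mimicked in a few lines of Python. This is only a sketch on a made-up three-variable joint table: the weights, the helper `group_by_sum`, and the variable names are all invented for illustration and do not come from the talk.

```python
from collections import defaultdict
from itertools import product


def group_by_sum(columns, table, keep):
    """Equivalent of SELECT keep, SUM(PROB) FROM table GROUP BY keep."""
    positions = [columns.index(name) for name in keep]
    sums = defaultdict(float)
    for key, prob in table.items():
        sums[tuple(key[i] for i in positions)] += prob
    return dict(sums)


# Toy joint distribution over binary A, B, C (arbitrary weights, normalized).
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
jpd = {
    key: weight / total
    for key, weight in zip(product([0, 1], repeat=3), weights)
}

p_ab = group_by_sum(["A", "B", "C"], jpd, ["A", "B"])  # PAB
p_b = group_by_sum(["A", "B"], p_ab, ["B"])            # PB
p_a_given_b = {(a, b): p / p_b[(b,)] for (a, b), p in p_ab.items()}

# Lower bound on the joint table size for 422 binary nodes.
CPCS_ROWS = 2 ** 422
print(len(str(CPCS_ROWS)), "decimal digits in 2^422")
```

The same two-step aggregation works for any subset of variables, but the size of the input table grows exponentially with the number of nodes, which is why the full joint table cannot be materialized for a network like CPCS.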
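10. Junction Tree
[Figure: junction tree for the Asia network. Cliques: AB, BCE, EFG, CEGH, CD. Separators: B, CE, EG, C. Probabilities shown on the tree: Pr(Visit to Asia), Pr(Tuberculosis | Visit to Asia), Pr(F), Pr(E | F), Pr(G | F), Pr(C | BE), Pr(H | CG), Pr(D | C).]
Example query: Pr(Lung Cancer | Dyspnea) = Pr(E|H)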
11. CPCS Networks
- 422 nodes
- 14 nodes describe diseases
- 33 risk factors
- 375 various findings related to diseases
12. CPCS Networks
13. Why Bayesian Network Inference?
Choose the right tool for the right job!
- Easy to incorporate human insight and intuitions
- Very general, no specific 'label' node
- Easy to do 'what-if', strength-of-influence, and value-of-information analysis
- Immune to Gaussian assumptions
It's all just a joint probability distribution
14. Map & Reduces
[Figure: map and reduce steps for clique BCE with child cliques AB and CD. Aggregation 1 (+), Map: child entries A1B1, A2B1, A1B2, A2B2 are summed under keys B1, B2, and C1D1, C1D2, C2D1, C2D2 under keys C1, C2, giving terms such as ∑ Pr(B1| A) and ∑ Pr(D| C1). Aggregation 2 (x), Reduce: each BCE entry (B1C1E1 through B2C2E2) is formed as Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1).]
15. MapReduce Implementation
for each clique in depth-first order:
  MAP:
  - Sum over the variables to get the 'clique message' (requires state, custom partitioner and input format)
  - Emit factors for the next clique
  REDUCE:
  - Multiply the factors from all children
  - Include probabilities assigned to the clique
  - Form the new clique values
The MAP is done over all child cliques
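The map and reduce steps for a single clique can be sketched in plain Python, without any Hadoop API. The example below uses clique BCE with child cliques AB and CD; the probability values and the function names `map_child` and `reduce_clique` are hypothetical, and the custom partitioner and input format mentioned on the slide are not modelled. The dictionary kept inside `map_child` stands in for the mapper-side state.

```python
from collections import defaultdict
from itertools import product


def map_child(child_vars, child_table, parent_vars):
    """MAP: sum the child clique over variables the parent does not share.

    Returns the separator variables and the summed 'clique message'.
    """
    separator = tuple(v for v in child_vars if v in parent_vars)
    positions = [child_vars.index(v) for v in separator]
    sums = defaultdict(float)  # mapper-side state (Aggregation 1, +)
    for key, value in child_table.items():
        sums[tuple(key[i] for i in positions)] += value
    return separator, dict(sums)


def reduce_clique(clique_vars, clique_table, messages):
    """REDUCE: multiply the clique's own probabilities by all child messages."""
    new_values = {}
    for key, value in clique_table.items():
        row = dict(zip(clique_vars, key))
        for separator, message in messages:  # Aggregation 2, x
            value *= message[tuple(row[v] for v in separator)]
        new_values[key] = value
    return new_values


# Hypothetical numbers. Child AB holds Pr(A) * Pr(B | A).
p_a1 = 0.3
p_b1_given_a = {0: 0.2, 1: 0.6}
ab = {
    (a, b): (p_a1 if a else 1 - p_a1)
    * (p_b1_given_a[a] if b else 1 - p_b1_given_a[a])
    for a, b in product([0, 1], repeat=2)
}
# Child CD holds Pr(D | C), keyed by (c, d).
p_d1_given_c = {0: 0.1, 1: 0.7}
cd = {
    (c, d): (p_d1_given_c[c] if d else 1 - p_d1_given_c[c])
    for c, d in product([0, 1], repeat=2)
}
# Clique BCE holds Pr(C | B, E), keyed by (b, c, e).
p_c1_given_be = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}
bce = {
    (b, c, e): (p_c1_given_be[(b, e)] if c else 1 - p_c1_given_be[(b, e)])
    for b, c, e in product([0, 1], repeat=3)
}

bce_vars = ["B", "C", "E"]
messages = [
    map_child(["A", "B"], ab, bce_vars),
    map_child(["C", "D"], cd, bce_vars),
]
new_bce = reduce_clique(bce_vars, bce, messages)
print(new_bce[(1, 1, 1)])
```

In a real job the messages would be keyed by separator value so that the shuffle routes each partial sum to the reducer holding the matching clique entries; here both steps run in one process purely to show the arithmetic.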
[object Object]

Contenu connexe

Tendances

Water jug problem ai part 6
Water jug problem ai part 6Water jug problem ai part 6
Water jug problem ai part 6Kirti Verma
 
Lecture 16 KL Transform in Image Processing
Lecture 16 KL Transform in Image ProcessingLecture 16 KL Transform in Image Processing
Lecture 16 KL Transform in Image ProcessingVARUN KUMAR
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine LearningVARUN KUMAR
 
The Digital Image Processing Q@A
The Digital Image Processing Q@AThe Digital Image Processing Q@A
The Digital Image Processing Q@AChung Hua Universit
 
Planning in Artificial Intelligence
Planning in Artificial IntelligencePlanning in Artificial Intelligence
Planning in Artificial Intelligencekitsenthilkumarcse
 
Lecture 16 memory bounded search
Lecture 16 memory bounded searchLecture 16 memory bounded search
Lecture 16 memory bounded searchHema Kashyap
 
Advanced topics in artificial neural networks
Advanced topics in artificial neural networksAdvanced topics in artificial neural networks
Advanced topics in artificial neural networksswapnac12
 
State Space Representation and Search
State Space Representation and SearchState Space Representation and Search
State Space Representation and SearchHitesh Mohapatra
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1Srinivasan R
 
Image Representation & Descriptors
Image Representation & DescriptorsImage Representation & Descriptors
Image Representation & DescriptorsPundrikPatel
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 

Tendances (20)

Planning
PlanningPlanning
Planning
 
Water jug problem ai part 6
Water jug problem ai part 6Water jug problem ai part 6
Water jug problem ai part 6
 
Lecture 16 KL Transform in Image Processing
Lecture 16 KL Transform in Image ProcessingLecture 16 KL Transform in Image Processing
Lecture 16 KL Transform in Image Processing
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
The Digital Image Processing Q@A
The Digital Image Processing Q@AThe Digital Image Processing Q@A
The Digital Image Processing Q@A
 
Planning in Artificial Intelligence
Planning in Artificial IntelligencePlanning in Artificial Intelligence
Planning in Artificial Intelligence
 
Lecture 16 memory bounded search
Lecture 16 memory bounded searchLecture 16 memory bounded search
Lecture 16 memory bounded search
 
Advanced topics in artificial neural networks
Advanced topics in artificial neural networksAdvanced topics in artificial neural networks
Advanced topics in artificial neural networks
 
Turbo prolog 2.0 basics
Turbo prolog 2.0 basicsTurbo prolog 2.0 basics
Turbo prolog 2.0 basics
 
Ai lab manual
Ai lab manualAi lab manual
Ai lab manual
 
State Space Representation and Search
State Space Representation and SearchState Space Representation and Search
State Space Representation and Search
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1
 
Image Representation & Descriptors
Image Representation & DescriptorsImage Representation & Descriptors
Image Representation & Descriptors
 
AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)AI Lecture 7 (uncertainty)
AI Lecture 7 (uncertainty)
 
And or graph
And or graphAnd or graph
And or graph
 
predicate logic example
predicate logic examplepredicate logic example
predicate logic example
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
convex hull
convex hullconvex hull
convex hull
 
Chapter 1 (final)
Chapter 1 (final)Chapter 1 (final)
Chapter 1 (final)
 
strong slot and filler
strong slot and fillerstrong slot and filler
strong slot and filler
 

En vedette

Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Cloudera, Inc.
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesGilad Barkan
 
An Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachAn Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachCOST action BM1006
 
Bayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopBayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopOfer Mendelevitch
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionAdnan Masood
 

En vedette (10)

Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummies
 
Lecture11 xing
Lecture11 xingLecture11 xing
Lecture11 xing
 
An Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachAn Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network Approach
 
Bayesian Networks with R and Hadoop
Bayesian Networks with R and HadoopBayesian Networks with R and Hadoop
Bayesian Networks with R and Hadoop
 
Bayesian networks
Bayesian networksBayesian networks
Bayesian networks
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 

Similaire à Exact Inference in Bayesian Networks using MapReduce__HadoopSummit2010

Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Alex Kozlov
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreDavid Gleich
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013StampedeCon
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programminginside-BigData.com
 
Towards controlling evolutionary dynamics through network geometry: some very...
Towards controlling evolutionary dynamics through network geometry: some very...Towards controlling evolutionary dynamics through network geometry: some very...
Towards controlling evolutionary dynamics through network geometry: some very...Kolja Kleineberg
 
Privacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and ApplicationsPrivacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and ApplicationsEmiliano De Cristofaro
 
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...Ravi Kiran B.
 
Programming with Relaxed Synchronization
Programming with Relaxed SynchronizationProgramming with Relaxed Synchronization
Programming with Relaxed Synchronizationracesworkshop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterExperiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterAntti Vanne
 
Hw09 Protein Alignment
Hw09   Protein AlignmentHw09   Protein Alignment
Hw09 Protein AlignmentCloudera, Inc.
 
How to Prepare Weather and Climate Models for Future HPC Hardware
How to Prepare Weather and Climate Models for Future HPC HardwareHow to Prepare Weather and Climate Models for Future HPC Hardware
How to Prepare Weather and Climate Models for Future HPC Hardwareinside-BigData.com
 
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...Roberto Casadei
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataAlessandro Adamou
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetup
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetupLucas Theis - Compressing Images with Neural Networks - Creative AI meetup
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetupLuba Elliott
 
Overlapping community detection in Large-Scale Networks using BigCLAM model b...
Overlapping community detection in Large-Scale Networks using BigCLAM model b...Overlapping community detection in Large-Scale Networks using BigCLAM model b...
Overlapping community detection in Large-Scale Networks using BigCLAM model b...Thang Nguyen
 

Similaire à Exact Inference in Bayesian Networks using MapReduce__HadoopSummit2010 (20)

Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)
 
Fast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and moreFast matrix primitives for ranking, link-prediction and more
Fast matrix primitives for ranking, link-prediction and more
 
Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013Thinking in MapReduce - StampedeCon 2013
Thinking in MapReduce - StampedeCon 2013
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programming
 
Towards controlling evolutionary dynamics through network geometry: some very...
Towards controlling evolutionary dynamics through network geometry: some very...Towards controlling evolutionary dynamics through network geometry: some very...
Towards controlling evolutionary dynamics through network geometry: some very...
 
Privacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and ApplicationsPrivacy-preserving Information Sharing: Tools and Applications
Privacy-preserving Information Sharing: Tools and Applications
 
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
 
Programming with Relaxed Synchronization
Programming with Relaxed SynchronizationProgramming with Relaxed Synchronization
Programming with Relaxed Synchronization
 
Network predictive analysis
Network predictive analysisNetwork predictive analysis
Network predictive analysis
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Experiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC clusterExperiences of numerical simulations on a PC cluster
Experiences of numerical simulations on a PC cluster
 
Hw09 Protein Alignment
Hw09   Protein AlignmentHw09   Protein Alignment
Hw09 Protein Alignment
 
How to Prepare Weather and Climate Models for Future HPC Hardware
How to Prepare Weather and Climate Models for Future HPC HardwareHow to Prepare Weather and Climate Models for Future HPC Hardware
How to Prepare Weather and Climate Models for Future HPC Hardware
 
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...
Pulverisation in Cyber-Physical Systems: Engineering the Self-Organising Logi...
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Relaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked dataRelaxing global-as-view in mediated data integration from linked data
Relaxing global-as-view in mediated data integration from linked data
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetup
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetupLucas Theis - Compressing Images with Neural Networks - Creative AI meetup
Lucas Theis - Compressing Images with Neural Networks - Creative AI meetup
 
Overlapping community detection in Large-Scale Networks using BigCLAM model b...
Overlapping community detection in Large-Scale Networks using BigCLAM model b...Overlapping community detection in Large-Scale Networks using BigCLAM model b...
Overlapping community detection in Large-Scale Networks using BigCLAM model b...
 

Plus de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Plus de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 


Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

  • 1. Exact Inference in Bayesian Networks using MapReduce Alex Kozlov Cloudera, Inc.
  • 2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions
  • 3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.). Published the fastest implementation at the time. Worked in the DM/BI field since then. Recently joined Cloudera, Inc. Started looking at how to solve the world’s hardest problems.
  • 4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
  • 5. Bayesian Networks: Nodes; Edges; Probabilities. Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances, published posthumously by his friend. Philosophical Transactions of the Royal Society of London, 53:370-418.
  • 6. Applications: Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis); Medicine; Document classification, information retrieval; Image processing; Data fusion; Gaming; Law; On-line advertising!
  • 7. A Simple BN Network. Pr(Rain), columns T, F: 0.2, 0.8. Pr(Sprinkler | Rain), columns T, F: Rain = F: 0.4, 0.6; Rain = T: 0.1, 0.9. Pr(Wet Driveway | Sprinkler, Rain), columns T, F: F, F: 0.01, 0.99; F, T: 0.8, 0.2; T, F: 0.9, 0.1; T, T: 0.99, 0.01. Queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain)
  • 8. Asia Network: Pr(Visit to Asia); Pr(Smoking); Pr(Lung Cancer | Smoking); Pr(Tuberculosis | Visit to Asia); Pr(Bronchitis | Smoking); Pr(C | BE); Pr(X-Ray | Lung Cancer or Tuberculosis); Pr(Dyspnea | CG). Query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
  • 9. BN Inference 101 (in Hive): JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H). PAB = SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B; PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B; Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule). CPCS is 422 nodes, a table of at least 2^422 rows!
  • 10. Junction Tree (diagram). Clique and separator labels: AB, B, BCE, C, CD, CE, CEGH, EG, EFG. Probabilities shown on the tree: Pr(Visit to Asia); Pr(Tuberculosis | Visit to Asia); Pr(F); Pr(E | F); Pr(G | F); Pr(C | BE); Pr(H | CG); Pr(D | C). Query: Pr(Lung Cancer | Dyspnea) = Pr(E|H)
  • 11. CPCS Networks: 422 nodes; 14 nodes describe diseases; 33 risk factors; 375 various findings related to diseases
  • 13. Why Bayesian Network Inference?
  • 14. Easy to incorporate human insight and intuitions
  • 15. Very general, no specific ‘label’ node
  • 16. Easy to do ‘what-if’, strength of influence, value of information, analysis
  • 17. Immune to Gaussian assumptions. It’s all just a joint probability distribution.
  • 18. Map & Reduces (slide 14; diagram). Keys: the states of clique BCE (B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2). Map, aggregation 1 (+): child clique AB (A1B1, A2B1, A1B2, A2B2) is summed into B1, B2, e.g. ∑ Pr(B1| A); child clique CD (C1D1, C2D1, C1D2, C2D2) is summed into C1, C2, e.g. ∑ Pr(D| C1). Reduce, aggregation 2 (x): Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)
  • 19. MapReduce Implementation (slide 15). For each clique in depth-first order: MAP: sum over the variables to get the ‘clique message’ (requires state, custom partitioner and input format); emit factors for the next clique. REDUCE: multiply the factors from all children; include probabilities assigned to the clique; form the new clique values. The MAP is done over all child cliques.
  • 20. Cliques, Trees, and Parallelism (slide 16; diagram of cliques C6, C5, C4, C3, C2, C1). Cliques may be larger than they appear!
  • 21. Clique parallelism: divide computation of each clique into maps/reducers
  • 22. Fall back into optimal factoring if a corresponding subtree is small
  • 24. Reduce replication level
  • 25. CPCS Inference (slide 17): The 360-node subnet of CPCS has a largest ‘clique’ of 11,739,896 floats (fits into 2GB). The full 422-node version (absent, mild, moderate, severe) has 3,377,699,720,527,872 floats (or 12 PB of storage, but we do not need it for all queries). In most cases we do not need to do inference on the full network.
  • 26. Results (slide 18). Footnotes: (1) ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997; (2) MacBook Pro, 4 GB DDR3, 2.53 GHz; (3) 10-node Linux Xeon cluster, 24 GB, quad 2-core
  • 27. Conclusions (slide 19): Exact probabilistic inference is finally in sight for the full 422-node CPCS network. Hadoop helps to solve the world’s hardest problems. What you should know after this talk: a BN is a DAG and represents a joint probability distribution (JPD); we can compute conditional probabilities by multiplying and summing the JPD; for large networks, this may be PBytes of intermediate data, but it’s MR.
  • 30. Optimizing BN Inference 101 (slide 22): Conditioning nodes (evidence): do not need to be summed. Bare child nodes’ values sum to one (barren node): can be dropped from the network. Noisy-OR (conditional independence of parents). Context specific independence (based on the specific value of one of the parents). Wet grass table, columns T, F: FF: 0.01, 0.99; FT: 0.8, 0.2; TF: 0.9, 0.1; TT: 0.99, 0.01
  • 32. MapReduce Implementation (slide 24): No updates: have to compute clique potentials from all children and assigned probabilities. Tree structure. The key encodes the full set of variable values (LongWritable or composite). The value encodes partial sums (proportional to probabilities). No need for TotalOrderPartitioning (we know the key distribution). Need a custom Partitioner and WritableComparator (next slide). Need to do the aggregation in the Mapper (sum, next slide).
  • 33. Current implementation (slide 25): Built on top of an old 1997 C program with a few modifications. An interactive command-line program for interactive analysis. Estimates the running time from the optimal factoring plan and either executes it locally or ships a jar to a Hadoop cluster to execute.
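
The ‘BN Inference 101’ recipe from slide 9 (build the joint table, GROUP BY and SUM, then divide) can be sketched in a few lines of Python on the three-node network from slide 7. This is only an illustration, not the talk's implementation: the CPT values are an assumed reading of the flattened slide table, and the names `P_RAIN`, `P_SPRINKLER`, `P_WET`, `joint` and `marginal` are hypothetical.

```python
from itertools import product

# Assumed reading of the slide 7 tables (all three variables are Boolean).
P_RAIN = {True: 0.2, False: 0.8}            # Pr(Rain)
P_SPRINKLER = {True: 0.1, False: 0.4}       # Pr(Sprinkler = T | Rain)
P_WET = {                                   # Pr(Wet = T | Sprinkler, Rain)
    (False, False): 0.01,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def bernoulli(p_true, value):
    """Probability of a Boolean value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint():
    """Full joint table: one row per (rain, sprinkler, wet) assignment."""
    table = {}
    for rain, sprinkler, wet in product([True, False], repeat=3):
        table[(rain, sprinkler, wet)] = (
            P_RAIN[rain]
            * bernoulli(P_SPRINKLER[rain], sprinkler)
            * bernoulli(P_WET[(sprinkler, rain)], wet)
        )
    return table


def marginal(table, keep):
    """Equivalent of GROUP BY the kept column positions with SUM(PROB)."""
    out = {}
    for row, prob in table.items():
        key = tuple(row[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


JPD = joint()
p_rain_wet = marginal(JPD, keep=(0, 2))     # Pr(Rain, Wet)
p_wet = marginal(JPD, keep=(2,))            # Pr(Wet)
p_rain_given_wet = p_rain_wet[(True, True)] / p_wet[(True,)]
print(f"Pr(Rain | Wet Driveway) = {p_rain_given_wet:.4f}")
```

The joint table has 2^n rows for n Boolean nodes, which is why this direct approach stops being practical long before 422 nodes and the junction tree is used instead.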

Editor's Notes

  1. Abstract: Probabilistic inference is a way of obtaining values of unobservable variables out of incomplete data. Probabilistic inference is used in robotics, medical diagnostics, image recognition, finance and other fields. One of the tools for inference and a way to represent knowledge is the 'Bayesian Network', where nodes represent variables and edges represent probabilistic dependencies between variables. The advantage of exact probabilistic inference using BN is that it does not involve the traditional 'Gaussian distribution' assumptions and the results are immune to Taleb's distributions, or distributions with a high probability of outliers. A typical application of probabilistic inference is to infer the probability of one or several dependent variables, like the probability that a person has a certain disease, given other observations, like presence of abdominal pain. In exact probabilistic inference, variables are clustered in groups, called cliques, and probabilistic inference can be carried out by manipulating more or less complex data structures on top of the cliques, which leads to high computational and space complexity of the inference: the data structures can become very complex and large. The advantage: one can encode arbitrarily complex distributions and dependencies. While a lot of research has been devoted to devising schemes to approximate the solution, Hadoop allows performing exact inference on the whole network. We present an approach for performing large-scale probabilistic inference in probabilistic networks in a Hadoop cluster. Probabilistic inference is reduced to a number of MR jobs over the data structures representing clique potentials. One of the applications is the CPCS BN, one of the biggest models created at Stanford Medical Informatics Center (now The Stanford Center for Biomedical Informatics Research) in 1994, never solved exactly.
In this specific network containing 422 nodes representing states of different variables, 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.
  2. Here is what I am going to talk about. (1) I will not be able to delve into every detail, and the implementation is not complete. (2) BN Inference is not a Cloudera product today, therefore it’s not a product announcement! (3) This is not a research paper either! Promise: no formulas or complicated math. I promise there will be at least one photo and an SQL statement. CPCS: (Computer-based Patient Case Study) model [Pradhan et al. 1994]. Pradhan et al. 1994: Malcolm Pradhan, Gregory Provan, Blackford Middleton, and Max Henrion. Knowledge engineering for large belief networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94), pages 484-490, San Francisco, CA, 1994. Morgan Kaufmann Publishers.
  3. I have been doing probabilistic inference since 1994! There is a resurgence of interest in parallel computations; see the 2008-2010 papers by Yinglong Xia and Viktor K. Prasanna.
  4. Interest in Hadoop is surging… Hadoop is: ‘A scalable fault-tolerant distributed system for data storage and processing’. Hadoop History: 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch; 2003-2004: Google publishes GFS and MapReduce papers; 2004: Cutting adds DFS & MapReduce support to Nutch; 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch; 2007: NY Times converts 4TB of archives over 100 EC2s; 2008: Web-scale deployments at Y!, Facebook, Last.fm; April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes; May 2009: Yahoo does fastest sort of a TB, 62 secs over 1460 nodes, and sorts a PB in 16.25 hours over 3658 nodes; June 2009, Oct 2009: Hadoop Summit, Hadoop World; September 2009: Doug Cutting joins Cloudera.
  5. A gentle introduction to BNs. A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG). Formally, Bayesian networks are directed acyclic graphs whose nodes represent random variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes which are not connected represent variables which are conditionally independent of each other. Each node is associated with a probability function that takes as input a particular set of values for the node's parent variables and gives the probability of the variable represented by the node. For example, if the parents are m Boolean variables then the probability function could be represented by a table of 2^m entries, one entry for each of the 2^m possible combinations of its parents being true or false. Efficient algorithms exist that perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks; Markov chains are the simplest case. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. Bayes did not invent BNs; he did not even have a publication on probabilities during his lifetime.
  6. As you can notice, BNs are used anywhere where data are a bit more complex (‘unstructured data’ in RDBMS terms). Like Hadoop! Naïve Bayes is the most famous incarnation of a BN (conditional independence of attribute variables given the class label). Let’s look at some examples of BNs.
  7. A reasoning tool: people think they are good with probabilities. One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than a complete joint distribution. Wind blows, trees move? May be extended to ‘causal networks’.
  8. A more complex network. Visit to Asia: predisposing factor. Tuberculosis, Lung Cancer, Bronchitis: diseases. X-Ray, Dyspnea: findings. Lung Cancer or Tuberculosis: hidden node.
  9. CS 221 at Stanford. BN Inference in Hive. For example, Pr(Lung Cancer | Dyspnea). Can sum intelligently: see the optimal factoring approach in my Ph.D. thesis. The largest clique size is the max width in CSP terms (did I mention it’s NP-hard?). Approximate and sampling algorithms exist. Formally, it can be represented as ‘variable elimination’ or ‘belief propagation’ up and down a join tree. Let’s have a look.
  10. Junction tree: each probability is assigned to one of the cliques in the junction tree. When we sum, the result is a message (M). When we multiply, the result is a (R). Already looks like MapReduce! MapReduce existed long before it was invented. But before we delve into the MR implementation, let’s talk about the CPCS (Computer-based Patient Case Study) network. Did I mention BN Inference is NP-hard? It can be mapped to a CSP problem.
  11. A typical query is Pr(diseases | findings, risk factors). One big mess!
  12. A typical query is Pr(diseases | risk factors, findings). Interactive analysis; what-if analysis; strength of influence; sensitivity analysis; value of information; value of additional evidence (tests); cost of not taking a specific decision. By now you are wondering: why inference?
  13. Let’s have a break and discuss why you should use BN Inference. If the current tools work for you, continue using them. If you run a company that underestimates risk and loses $1T as a result, you probably need to innovate: there should be some technology that can handle it. Now, let’s delve into the MapReduce implementation and results.
  14. That’s a bit more complicated slide, but bear with me. Map: summation, generate multiple keys/records per 1 original record. Reduce: multiplication. The key encodes the full set of variable values (LongWritable or composite). The value encodes partial sums (proportional to probabilities). No need for TotalOrderPartitioning (we know the key distribution). Need a custom Partitioner and WritableComparator (next slide). Need to do an aggregation in the Mapper (sum, next slide). By arranging the node order in the cliques we can optimize data locality. Sorting helps!
  15. Preserves data locality by specifying node order in a certain way (need for a custom WritableComparator and Partitioner)
  16. The computation is C6 -> C5 -> C4 -> C3 -> C2 -> C1. Topological parallelism is usually limited. Most of the work is done in reducers (indices remapping, summation). Let’s look at the actual clique sizes in CPCS!
  17. Doing inference on the ‘full’ network is of academic interest only. However, to understand how the simplifications in the network affect results, we need to be able to perform exact inference. What is the scoop?
  18. The first three are for the ‘full’ propagation up and down the tree. Random A, B are randomly generated BNs used for the 1995 paper. Cpcs360 is a subset of cpcs422 used for interactive analysis. Cpcs422 on a 5-node subquery.
  19. Doing inference on the ‘full’ network is of academic interest only. However, to understand how the simplifications in the network affect results, we need to be able to perform exact inference.
  20. git@github.com:alexvk/BN-Inference.git Goals: inference is an interesting application; we have an interactive program to perform inference; all questions to Cloudera, Inc. Need: implementors (to help); large cluster (to have 10s of PB of storage).
  21. Doing inference on the ‘full’ network is of academic interest only. However, to understand how the simplifications in the network affect results, we need to be able to perform exact inference.
  22. Summation: pure MR+ (M-R-M-R-...-M-R) job. Normalization: requires an update (or a copy) operation. Each key can encode the set of values (odometer). No need for PartialOrder (we know the key distribution). Can optimize data locality.
  23. Many tools for interactive analysis: sensitivity analysis; strength of influence; value of information; hybrid networks (with some continuous parents).
  24. As opposed to traditional MR, aggregation is made in the map phase (summation)
  25. Very few modifications: * In file included from utils.c:12: /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include/varargs.h:4:2: error: #error "GCC no longer implements <varargs.h>." /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include/varargs.h:5:2: error: #error "Revise your code to use <stdarg.h>." * Convert ints to longs. Implement MR logic and code generation.
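
The map and reduce roles described in notes 14, 22 and 24 (summation in the mapper, multiplication in the reducer, keys that encode a full set of clique variable values) can be illustrated with plain Python standing in for Hadoop. This is a minimal sketch under stated assumptions, not the code from the talk: all variables are taken to be binary, the probability values are made up for illustration, Pr(C | B, E) is modeled as a logical OR, and `map_child` and `reduce_clique` are hypothetical names. It handles the single step of the slide 14 example, where child cliques AB and CD send messages to clique BCE.

```python
from collections import defaultdict
from itertools import product
from math import prod

STATES = (0, 1)  # assumption: every variable is binary


def map_child(child_vars, child_table, parent_vars):
    """MAP: sum out the child's private variables inside the mapper,
    then emit the message once for every compatible parent-clique key."""
    shared = [v for v in child_vars if v in parent_vars]
    message = defaultdict(float)
    for assignment, value in child_table.items():
        row = dict(zip(child_vars, assignment))
        message[tuple(row[v] for v in shared)] += value  # aggregation 1 (+)
    for parent_key in product(STATES, repeat=len(parent_vars)):
        row = dict(zip(parent_vars, parent_key))
        yield parent_key, message[tuple(row[v] for v in shared)]


def reduce_clique(local_table, emitted):
    """REDUCE: multiply the clique's assigned probabilities by child factors."""
    grouped = defaultdict(list)
    for key, value in emitted:  # grouping by key stands in for shuffle/sort
        grouped[key].append(value)
    # aggregation 2 (x)
    return {key: p * prod(grouped[key]) for key, p in local_table.items()}


# Child clique AB holds Pr(A) * Pr(B | A); the numbers are illustrative.
p_a = {0: 0.99, 1: 0.01}
p_b_given_a = {0: 0.01, 1: 0.05}  # Pr(B = 1 | A)
ab = {(a, b): p_a[a] * (p_b_given_a[a] if b else 1 - p_b_given_a[a])
      for a, b in product(STATES, repeat=2)}

# Child clique CD holds Pr(D | C); summing out D gives 1 for each C.
p_d_given_c = {0: 0.05, 1: 0.98}  # Pr(D = 1 | C)
cd = {(c, d): (p_d_given_c[c] if d else 1 - p_d_given_c[c])
      for c, d in product(STATES, repeat=2)}

# Parent clique BCE holds Pr(C | B, E), modeled here as a logical OR.
bce = {(b, c, e): float(c == (b | e)) for b, c, e in product(STATES, repeat=3)}

emitted = list(map_child("AB", ab, "BCE")) + list(map_child("CD", cd, "BCE"))
new_bce = reduce_clique(bce, emitted)
```

In a real Hadoop job the tuple keys would presumably be packed into a LongWritable or composite key, as the notes describe, and the grouping done here in `reduce_clique` would be handled by the shuffle together with the custom Partitioner and WritableComparator.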