Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

1. Exact Inference in Bayesian Networks using MapReduce. Alex Kozlov, Cloudera, Inc.

2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions

3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.). Published the fastest implementation at the time. Worked in the DM/BI field since then. Recently joined Cloudera, Inc. Started looking at how to solve the world’s hardest problems.

4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.

5. Bayesian Networks: Nodes, Edges, Probabilities. Reference: Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances (published posthumously by his friend). Philosophical Transactions of the Royal Society of London, 53:370-418.

6. Applications: Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis); Medicine; Document classification, information retrieval; Image processing; Data fusion; Gaming; Law; On-line advertising!

7. A Simple BN Network. Pr(Rain): T = 0.2, F = 0.8. Pr(Sprinkler | Rain): Rain = F: T = 0.4, F = 0.6; Rain = T: T = 0.1, F = 0.9. Pr(Wet Driveway | Sprinkler, Rain): F, F: T = 0.01, F = 0.99; F, T: T = 0.8, F = 0.2; T, F: T = 0.9, F = 0.1; T, T: T = 0.99, F = 0.01. Example queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain).
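
The first query on this slide, Pr(Rain | Wet Driveway), can be worked out by brute-force enumeration of the joint distribution. The following is a minimal Python sketch, not the talk's implementation. It assumes the flattened tables read as Pr(Rain = T) = 0.2, Pr(Sprinkler = T | Rain) = 0.4 (no rain) or 0.1 (rain), and Pr(Wet Driveway = T | Sprinkler, Rain) = 0.01, 0.8, 0.9, 0.99 for (F, F), (F, T), (T, F), (T, T). The names `joint` and `prob` are illustrative.

```python
from itertools import product

# CPT values as read from the slide; how the flattened numbers map onto
# rows is an assumption of this sketch.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_ON = {True: 0.1, False: 0.4}          # Pr(Sprinkler=T | Rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}  # Pr(Wet=T | Sprinkler, Rain)


def joint(rain, sprinkler, wet):
    """One row of the joint distribution: the product of the three CPT entries."""
    p_s = P_SPRINKLER_ON[rain] if sprinkler else 1.0 - P_SPRINKLER_ON[rain]
    p_w = P_WET[(sprinkler, rain)] if wet else 1.0 - P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * p_s * p_w


def prob(query, evidence=None):
    """Pr(query | evidence); both are dicts over 'rain', 'sprinkler', 'wet'."""
    evidence = evidence or {}
    num = den = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[k] != v for k, v in evidence.items()):
            continue  # row is inconsistent with the evidence
        p = joint(rain, sprinkler, wet)
        den += p      # accumulates Pr(evidence)
        if all(world[k] == v for k, v in query.items()):
            num += p  # accumulates Pr(query, evidence)
    return num / den


p_rain_given_wet = prob({"rain": True}, {"wet": True})
print(p_rain_given_wet)
```

Enumeration touches every row of the joint table, which is fine for three binary nodes and hopeless for hundreds of nodes.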

9. BN Inference 101 (in Hive): JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H). PAB = SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B; PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B; Pr(A|B) = Pr(A,B)/Pr(B) (Bayes’ rule). CPCS has 422 nodes, so its JPD is a table of at least 2^422 rows!
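
The same marginalize-then-divide recipe can be sketched outside Hive. The Python below is an illustration only: the three-variable JPD and its weights are made up, and `group_by_sum` is a hypothetical helper that plays the role of `SELECT ..., SUM(PROB) ... GROUP BY ...`.

```python
from collections import defaultdict
from itertools import product


def group_by_sum(rows, columns, keep):
    """Rough equivalent of: SELECT keep, SUM(PROB) FROM rows GROUP BY keep."""
    idx = [columns.index(c) for c in keep]
    out = defaultdict(float)
    for values, p in rows.items():
        out[tuple(values[i] for i in idx)] += p
    return dict(out)


# Hypothetical JPD over binary A, B, C: weights 1..8, normalized by 36.
jpd = {vals: (i + 1) / 36.0
       for i, vals in enumerate(product((0, 1), repeat=3))}

pab = group_by_sum(jpd, ("A", "B", "C"), ("A", "B"))  # sums out C
pb = group_by_sum(pab, ("A", "B"), ("B",))            # sums out A

# Pr(A | B) = Pr(A, B) / Pr(B)
pa_given_b = {(a, b): p / pb[(b,)] for (a, b), p in pab.items()}

# Lower bound on the CPCS table size quoted on the slide (all-binary case).
cpcs_rows_lower_bound = 2 ** 422
print(pa_given_b[(1, 1)], len(str(cpcs_rows_lower_bound)), "decimal digits")
```

The group-by itself is cheap; the problem is that the input table grows exponentially with the number of nodes, which is what motivates working clique by clique.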

11. CPCS Networks: 422 nodes; 14 nodes describe diseases; 33 risk factors; 375 various findings related to diseases

12. CPCS Networks

14. Why Bayesian Network Inference? Easy to incorporate human insight and intuitions

15. Very general, no specific ‘label’ node

16. Easy to do ‘what-if’, strength-of-influence, and value-of-information analysis

17. Immune to Gaussian assumptions. It’s all just a joint probability distribution.

18. Map & Reduces (diagram). Map keys: child clique AB (A1B1, A2B1, A1B2, A2B2) is summed to B1, B2; child clique CD (C1D1, C2D1, C1D2, C2D2) is summed to C1, C2; parent clique BCE has keys B1C1E1 through B2C2E2. Aggregation 1 (+), in the map: ∑ Pr(B1| A) and ∑ Pr(D| C1). Aggregation 2 (x), in the reduce: Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1).

19. MapReduce Implementation: for each clique in depth-first order: MAP: sum over the variables to get the ‘clique message’ (requires state, a custom partitioner and input format); emit factors for the next clique. REDUCE: multiply the factors from all children; include the probabilities assigned to the clique; form the new clique values. The MAP is done over all child cliques.
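
One round of this map (sum) and reduce (multiply) scheme can be simulated in memory. The sketch below is plain Python, not Hadoop code and not the talk's implementation. It reuses the clique names AB, CD and BCE from the diagram slide, but the network structure (A -> B, B and E -> C, C -> D), all probabilities, and the function names `map_sum` and `reduce_multiply` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def map_sum(child_vars, child_table, separator):
    """MAP: sum a child clique's table down to the separator variables (+)."""
    idx = [child_vars.index(v) for v in separator]
    message = defaultdict(float)
    for values, p in child_table.items():
        message[tuple(values[i] for i in idx)] += p
    return dict(message)


def reduce_multiply(clique_vars, own_table, messages):
    """REDUCE: multiply the clique's own probabilities by each child message (x)."""
    new_table = {}
    for values, p in own_table.items():
        for sep_vars, message in messages:
            key = tuple(values[clique_vars.index(v)] for v in sep_vars)
            p *= message[key]
        new_table[values] = p
    return new_table


# Hypothetical probabilities for a toy network A -> B, (B, E) -> C, C -> D.
pr_a = {0: 0.7, 1: 0.3}
pr_b1_given_a = {0: 0.2, 1: 0.9}
pr_d1_given_c = {0: 0.4, 1: 0.6}
pr_e = {0: 0.5, 1: 0.5}
pr_c1_given_be = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}

# Probabilities assigned to each clique.
ab = {(a, b): pr_a[a] * (pr_b1_given_a[a] if b else 1 - pr_b1_given_a[a])
      for a, b in product((0, 1), repeat=2)}
cd = {(c, d): (pr_d1_given_c[c] if d else 1 - pr_d1_given_c[c])
      for c, d in product((0, 1), repeat=2)}
bce = {(b, c, e): pr_e[e] * (pr_c1_given_be[(b, e)] if c
                             else 1 - pr_c1_given_be[(b, e)])
       for b, c, e in product((0, 1), repeat=3)}

msg_b = map_sum(("A", "B"), ab, ("B",))  # sum over A
msg_c = map_sum(("C", "D"), cd, ("C",))  # sum over D
bce_new = reduce_multiply(("B", "C", "E"), bce,
                          [(("B",), msg_b), (("C",), msg_c)])
print(msg_b, msg_c, sum(bce_new.values()))
```

In this toy case the message from CD is all ones, because D is a childless node with no evidence; the new BCE table is then the marginal Pr(B, C, E).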

21. Cliques, Trees, and Parallelism (clique tree diagram: C1 through C6; “Cliques may be larger than they appear!”). Clique parallelism: divide computation of each clique into maps/reducers

22. Fall back to optimal factoring if a corresponding subtree is small

23. Combine multiple phases together

24. Reduce replication level

25. CPCS Inference: The 360-node subnet has a largest ‘clique’ of 11,739,896 floats (fits into 2 GB). The full 422-node version (absent, mild, moderate, severe) has a largest clique of 3,377,699,720,527,872 floats (or 12 PB of storage, though not all queries need it). In most cases there is no need to do inference on the full network.
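
The storage figures can be checked with simple arithmetic. The Python sketch below assumes 4-byte single-precision floats, which the slide does not state; under that assumption the full-network clique comes to exactly 12 x 2^50 bytes, which matches the quoted 12 PB if PB is read as a binary unit.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats


def clique_bytes(n_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Storage needed for a clique table with n_floats entries."""
    return n_floats * bytes_per_float


subnet_bytes = clique_bytes(11_739_896)             # 360-node subnet
full_bytes = clique_bytes(3_377_699_720_527_872)    # full 422-node network

print(subnet_bytes / 2**20, "MiB")
print(full_bytes / 2**50, "PiB")
```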

26. Results. Footnotes: (1) ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997; (2) MacBook Pro, 4 GB DDR3, 2.53 GHz; (3) 10-node Linux Xeon cluster, 24 GB, quad 2-core.

27. Conclusions: Exact probabilistic inference is finally in sight for the full 422-node CPCS network. Hadoop helps to solve the world’s hardest problems. What you should know after this talk: a BN is a DAG and represents a joint probability distribution (JPD); conditional probabilities can be computed by multiplying and summing the JPD; for large networks, this may be PBytes of intermediate data, but it’s MR.

28. Questions? alexvk@{cloudera,gmail}.com

29. BACKUP

30. Optimizing BN Inference 101: Conditioning nodes (evidence) do not need to be summed. Bare child nodes’ values sum to one (barren node), so they can be dropped from the network. Noisy-OR (conditional independence of parents). Context-specific independence (based on the specific value of one of the parents). Example table for Wet grass (columns T, F; rows by parent values): FF: 0.01, 0.99; FT: 0.8, 0.2; TF: 0.9, 0.1; TT: 0.99, 0.01.
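
To illustrate why noisy-OR helps, here is a minimal Python sketch of the standard noisy-OR parameterization without a leak term: each active parent independently fails to trigger the child, so n link probabilities stand in for a table of 2^n rows. The link probabilities used are made up and are not fitted to the table on the slide; `noisy_or` and `expand_cpt` are illustrative names.

```python
from itertools import product


def noisy_or(link_probs, parent_states):
    """Pr(child = T | parents) under noisy-OR with no leak term."""
    p_all_fail = 1.0
    for p, active in zip(link_probs, parent_states):
        if active:
            p_all_fail *= 1.0 - p  # this active parent fails to trigger the child
    return 1.0 - p_all_fail


def expand_cpt(link_probs):
    """Expand n link probabilities into the full table of 2**n rows."""
    return {states: noisy_or(link_probs, states)
            for states in product([False, True], repeat=len(link_probs))}


cpt = expand_cpt([0.8, 0.9])  # hypothetical link probabilities
for states, p in cpt.items():
    print(states, p)
```

An inference engine that understands this structure can work with the n parameters directly instead of materializing the expanded table.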

31. GeNIe package

32. MapReduce Implementation: No updates: clique potentials have to be computed from all children and assigned probabilities. Tree structure. The key encodes the full set of variable values (LongWritable or composite). The value encodes partial sums (proportional to probabilities). No need for TotalOrderPartitioning (we know the key distribution). Need a custom Partitioner and WritableComparator. Need to do the aggregation (sum) in the Mapper.
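
One way to pack a full set of variable values into a single integer key is a mixed-radix encoding; because such keys densely cover a known range, a range partitioner needs no sampling. The Python sketch below only illustrates that idea. The encoding, the partitioning rule, and the names `encode`, `decode` and `partition` are assumptions, not the scheme used in the talk or any Hadoop API.

```python
def encode(values, cardinalities):
    """Pack one state per variable into a single integer (mixed radix)."""
    key = 0
    for v, card in zip(values, cardinalities):
        assert 0 <= v < card
        key = key * card + v
    return key


def decode(key, cardinalities):
    """Inverse of encode: recover the tuple of variable states."""
    values = []
    for card in reversed(cardinalities):
        values.append(key % card)
        key //= card
    return tuple(reversed(values))


def partition(key, total_states, num_partitions):
    """Range partitioner: keys are dense in [0, total_states), so equal
    ranges give balanced partitions without sampling the data."""
    return key * num_partitions // total_states


cards = (2, 2, 4)  # hypothetical clique: two binary nodes, one 4-state node
total = 1
for c in cards:
    total *= c

key = encode((1, 0, 3), cards)
print(key, decode(key, cards), partition(key, total, 4))
```

Since the first variable occupies the most significant position, each range partition holds a contiguous block of its values, so variables to be kept should come first in the ordering and variables to be summed out last.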

33. Current implementation: Built on top of an old 1997 C program with a few modifications. A command-line program for interactive analysis. Estimates running time from the optimal factoring plan and either executes it locally or ships a jar to a Hadoop cluster to execute.
