Product  Decisions
through
Big  Data
Center  for  Data  Science
Ankur  Teredesai
University  of  Washington  Tacoma
1
March  14th,  2015
• Bioinformatics
• Health  and  
Wellness
• Predictive  Analytics
Health  
Informatics
• Distributed  Systems
• Databases
• Geo-­‐Spatial
• Embedded  Systems
Geo-­‐Spatial  Data  
Management
• Machine  Learning
• Data  Mining
• Computational Intelligence
• Computer  Vision
Intelligent  
Systems
• Web
• Devices
• Mobile  Networks
• UX  /  UI
Social  Computing
• Cryptology
• Secure  Machine  
Learning
Big  Data  Security
• Engineering
• Dev-­‐Ops
Big  Data  
Infrastructure
Center  for  Data  Science:  Societal  Impact
Machine  Learning
Analytics
Engineering
Features
Algorithm
Scalability
ELT
Integrate  
Sources
Constraints
Deploy  Models
APIs
Apps
Data  Struggles
A  Big  Data  Project  Blueprint:
3
Data  Mining:  1989  -­‐ 2010  
• Data  Science  and  
Applications  move  and  
transform  sizeable  amounts  
of  data  out  of  the  native  
database  or  file  systems.
Applications
SQL/ODBC/JDBC  Data  Access
Distributed  Database
Multi-­Core,  Columnar,  
Key-­Value
Distributed  Database
Multi-­Core,  Columnar,  
Key-­Value
Distributed  Database
Multi-­Core,  Columnar,  
Key-­Value
Distributed  Database
Multi-­Core,  Columnar,  
Key-­Value
Data  Science  using  R,  
SAS,  SPSS,  Weka,  MAHOUT
[Diagram labels: HIGH VOLUME, HIGH LATENCY, HIGH VOLUME]
Application  Ecosystem  Integration
Data  Science  uses  native  data  
representation  and  inherent  distribution  
and  parallelism
Minimal  data  movement
Rapid  Application  development  using  
data  science  constructs
5
Big  Data  Science
Application  Ecosystem  Integration
Applications
SQL/ODBC/JDBC  Data  Access
Data  Science
•Internal  Algorithms  for  clustering,  
•classification,    regression
Distributed  Database
Multi-­Core,  Columnar,  Key-­Value
[Diagram labels: LOWER VOLUME, LOWER LATENCY, HIGH VOLUME, LOW LATENCY]
Big Data Science Components
A  Short  History  of  (Big)  Data  Technology
• 1970: Codd publishes "A Relational Model of Data for Large Shared Data Banks"
• 1985: Copeland, Decomposition Storage Model (essentially the first columnar store)
• 1989: Shared-nothing architecture
• 2004: Google, MapReduce
• 2005: C-Store (eventually Vertica), layers WS/RS
• 2007: Materialization optimizations in columnar stores and Hadoop implementation
• 2005-07: Star-Schema Benchmark + Hadoop
• 2008: Attempts to backport columnar advances to row storage, not very effective
• Today: BIG DATA
Technology  Decisions
7
Columnar  Vs Relational  Storage  
Technologies
Infinite  scale  using  commodity  
hardware
Private  or  Public  Cloud
Massively  Distributed  and  
Parallel  Architecture:  Hadoop
Stream  Query  Processing  for  
trillions  of  events  and  petabytes  of  
data
Real-­time  classification and  
clustering:  Approximate  scoring  
and  segmentation  +  Reporting  
and  Data  Visualization
Flat  Files  CSV Claims  X12 Clinical    HL7
Distance  Compute  Library
Instance  Selection  
RNGE Drop  3
Fuzzy  Rough  Set  
Approximation
CHF  Risk  of  
Readmission
Geo  
Routing
Random  Forests KNN
Industry  Partners  and  Domain  Experts
Other  
Solutions
HDFS NUMA
MPI Grappa
Census  US  Gov Unstructured  CCD
Bayesian  
Networks
Support  Vector
Machines
8
Cost  of  Chronic  
Interventions
Age/Gender  
Prediction
Malware  
Analytics
Personalized  
Cancer  Therapy
ETL  Tools
Raw  Data  from  Sources    (SID,  OSHPD,  HCUP,  Edifecs,  MHS,  CMS,  LINCS,  Industry)
Sqoop
iTornado
Routing  Service  With  Real  World  Severe  
Weather
Demo  Paper  in  ACM  SIGSPATIAL 2014
(Best  Demo  paper  award)
Fatality statistics by weather-related hazards,
http://www.nws.noaa.gov, June 2014.
COMA
Road  Network  Compression  For  Map  
Matching
ACM SIGSPATIAL IWGS 2014
PreGo
Dynamic  Multi-­‐Preference  Routing
                   Single Attribute    Multiple Attribute
Time-Homogeneous   Dijkstra, A*        Stewart et al. 91
Time-Variant       Betsy et al. 07     ?
[Figure: example road network with nodes s, a, b, c, d, e, f, g, h. Edges carry per-time-slot cost vectors T and R, e.g. T=[1,2,3,4,5], R=[1,2,3,4,5], and the figure shows cost-pair labels such as <0,0>, <2,2>, <3,4> and <5,7>.]
Special Needs Education: Teacher Trainer Effectiveness Analysis
Customized Surveys
Training Registration
Survey Management
To  support  streamlined  data  collection  and  
performance  evaluation  across  the  State  Needs  
Projects.
Project Stakeholders
Office of the Superintendent of Public
Instruction
Center for Data Science
Data Dashboard Purpose Report Generation
Geographic Distribution Maps
Demographic Reports
Brad Porter, Aniruddha Desai, Yitao Li, David Hazel,
Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green
Systems  Biology
13
Predictive  Models  
and  software
Applications:  Personalized  
medicine,  drug  discovery
Focus:  Develop  machine  learning  
methods  and  tools  to  effectively  
integrate  multiple  big  data  sources  in  
biology.
A  Flying  Hadoop Cluster
14
Detecting  Malware  Activity  based  on  
Automatically  Generated  Domains
Command  &  Control  
xyz.com xyz.com
Infected  node
Partnering  with  NIARA  we  obtained  a  large  dataset  of  Automatically  Generated  Domains.  
Based    on  the  intercepted  domain  features  we  
are  able  to  identify  the  malware  infecting  a  
network.  
(March  2012)
• Will  this  Heart  Failure  patient  
get  readmitted  within  30  days?
• Yes  or  No  (Binary  Classification)
16
Reduce  CHF  
Readmission
Readmission  ?
Machine  Learning?
Joint  NSF  /  NIH  Solicitation  on  Health  Care  and  Big  Data
Affordable  Care  Act  =>  Avoidable  Costs
Readmissions  are  AVOIDABLE
20%
32%
30  days
60  days
75%
25% Non  CHF
CHF
• Readmissions  national  cost  $17  billion  
annually
• 76  %  considered  avoidable  
17
Readmissions
Congestive  Heart  Failure  (CHF)
Source:  www.presidency.ucsb.edu,  cdc.gov,  tmz.com
Patient
Class
Labels
No  
readmission
Readmission
CHF  ROR:  30-­‐Day  Hospital  Readmission  Risk  
Prediction
Machine  
Learning    
Algorithms
18
Building  
the  
model
Scoring  
the  
tuple
Features
Vector
Features
Vectors
New  patient
No  readmission
Readmission
19
Some of the Steps
Data  
Understanding
And  Integration
Data  
Cleaning
Data  
Transformation
Extracting    data  from  Epic  -­‐
16  data  marts  and  200  views:
Heart Failure  Inpatient  Summary
Encounter.Flowsheet
PatientEncounterHospital
vs  
Public  Data:
State  Inpatient  Dataset  2009-­‐2012
20
AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
52   98122  1     3      12        3    0       153     212     56,511
87   98109  1     3      7         1    1       162     -       12,687
26   98028  4     3      1         30   1       139     195     127,300
• Washington  State  Inpatient  Data
• Admission  level  Claims  
• ~400  attributes  
• Demographics
• ICD9  Diagnosis  codes
• ICD9  Procedure  codes
• Charges
• Admissions  by  year
• 2009  – 652702
• 2010  – 651783
• 2011  – 648079
• 2012  – 648092
Variety  and  Volume  (2/3  V’s  of  Big  Data)
Pre  Admission Post  Admission Pre-­‐ Discharge Discharge
-­‐ Demographics
-­‐ Vital  Sign
-­‐Prior  Hospitalization
Pulse  rate            
Blood  pressure  
Respiration  rate  
BMI
Number  of    prior  admissions
Prior  length  of  stay
+ Demographics
Sodium  level
Glucose  level
Hemoglobin  level
Creatinine  level
Hematocrit  level
Neutrophils  level
Ejection  Fraction  
BUN  level
+ Vital  Sign
+ Prior  Hospitalization
-­‐ Lab  Test
+ Vital  Sign
+ Prior  Hospitalization
+ Demographics
+  Lab  Test
-­‐ Diagnosis  Information
Number  of  secondary  diagnosis
Chronic  systolic  heart  failure  
Acute  kidney  failure    
Chest  pain
Hyper  potassemia  
Bronchopneumonia
Other  chronic  pulmonary  heart  diseases  
Syncope  and  collapse        …
+ Prior  Hospitalization
+ Demographics
-­‐ Comorbidities
Acute  coronary  syndrome    Asthma
COPD    Ulcer    Dialysis    Dementia
Arrhythmias    Malnutrition
Vascular    Depression
-­‐ Discharge/Admit  codes
Admit  /Discharge  type
Severity  Of  illness    Risk  Of  Mortality  
-­‐ Utilization  Information
Operating  room  CTSCAN
Emergency  Room        CCU
Marital  status          Age
Racial  group      
Gender
(Dec  2012)  Initial  Models  
22
Data  integration
Feature  Construction
Predictive  modeling
• Logistic  Regression
• Naïve  Bayes
• Support  Vector  Machines
[Bar chart: Area Under the Curve (AUC). Yale Model (comparative baseline): 0.60; Amarasingham et al.: 0.72; our current result: 0.64]
Several  Rejects:  
KDD  Industry  Track  
2013
AMIA  2013
JAMIA  2013
2012
(July  2013)  (much  better)   &  Some  Papers
§ Improved  data  exploration
§ S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover -- Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014
§ N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013
23
[Bar chart: Area Under the Curve (AUC). Yale Model (comparative baseline): 0.60; Amarasingham et al.: 0.72; our 2012 result: 0.64; our current result: 0.74]
§Improved  Modeling Effort
(Dec  2013)  Prototype  or  a  possible  Product?  
&  yes,  More  Papers
§ Successful  Deployment
24
§K. Zolfaghar, J. Agarwal, D. Sistla, S.-­‐C. Chin, S. Basu Roy, and N. Verbiest, "Risk-­‐O-­‐Meter: An Intelligent
Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
(KDD), Chicago, IL, 2013
§Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-­‐Chi Chin, Brian Muckian: Big
data solutions for predicting risk-­‐of-­‐readmission for congestive heart failure patients. BigData
Conference 2013: 64-­‐71
25
Multi  Layer  Classifier  :  Automatically  Detecting  
Classification  Windows
Will  patient ever readmit?
Will  patient readmit
within 30  days?
YES NO
YES NO
KNN
LR
NB
SVM
KNN
32%  of  all  data
Only 5% of patients who return
within 30 days are filtered out
Generalizing  the  30,60,90  Day  readmission
§ Automatic  design  of  time  prediction  hierarchy
§ Feature  selection  and  factor  analysis  at  each  layer
§ Different  classification  algorithms  in  each  layer  and  satisfying  different  
quality  metrics
26
Automatic  design  of  prediction  hierarchy
27
Simple  3  Layer  Example
• Stage  1:  Design  a  predictive  model  for  the  patients  who  are  likely  to  
come  back  within  a  time  window  of  (X,  K),  where  X  is  the  maximum  
number  of  days  until  next  readmission
• Stage  2:  Design  a  predictive  model  for  time  window  of  (K,  30)
• Stage  3:  Design  a  predictive  model  for  time  window  of  <30  days  of  
readmission
HOW  TO  AUTOMATICALLY  DETECT  THE  MIDDLE  CUTPOINT  K?
28
Hill  Climbing  Algorithm  to  Detect  K
§ Generate  a  random  number    K  between  X  and  30
§ Compute   C1=  Centroid(X,K)  ,  C2=  Centroid(K+1,30)
§ Compute  the  KLCurrent =  KLDiv(C1,C2)
§ K' = K + i,  K'' = K - i
§ Find  a  point  K2  between  (K’,K’’)  ,  and  check
§ If  KLDiv(  Centroid(X,K2),  Centroid(K2,30))  >  KLCurrent
§ If  the  above  condition  is  satisfied,  then  K=K2
§ KLCurrent =  KLDiv(  Centroid(X,K2),  Centroid(K2,30))  
§ Repeat  the  above  steps  until  no  further  check  is  possible
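The hill-climbing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the windows are [30, K] and [K+1, X], that a centroid is the mean feature vector of the patients readmitted inside a window, and that centroids are smoothed and normalised before the KL divergence is taken. The names `centroid`, `kl_div`, `score` and `find_cutpoint` are hypothetical.

```python
import math
import random

def centroid(rows, days, lo, hi):
    """Mean feature vector of patients readmitted between day lo and day hi (inclusive)."""
    picked = [r for r, d in zip(rows, days) if lo <= d <= hi]
    if not picked:
        return None
    return [sum(col) / len(picked) for col in zip(*picked)]

def kl_div(p, q, eps=1e-9):
    """KL divergence after smoothing and normalising both vectors (an assumption)."""
    p = [v + eps for v in p]
    q = [v + eps for v in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def score(rows, days, k, lo, hi):
    """KL divergence between the centroids of the windows [lo, k] and [k+1, hi]."""
    c1 = centroid(rows, days, lo, k)
    c2 = centroid(rows, days, k + 1, hi)
    if c1 is None or c2 is None:
        return float("-inf")
    return kl_div(c1, c2)

def find_cutpoint(rows, days, lo, hi, step=1, seed=0):
    """Start from a random K and move to a neighbour while the divergence improves."""
    rng = random.Random(seed)
    k = rng.randint(lo + 1, hi - 1)
    best = score(rows, days, k, lo, hi)
    improved = True
    while improved:
        improved = False
        for cand in (k + step, k - step):
            if lo < cand < hi:
                s = score(rows, days, cand, lo, hi)
                if s > best:
                    k, best, improved = cand, s, True
    return k, best
```

Like any hill climber, this can stop at a local maximum, so restarting from several random seeds is a common safeguard.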
29
30
Calculating  the  Probability  of  30  day  RoR
P(readmit ≤ 30) = P(≤ 30 |≤ K)× P(≤ K |Y)P(Y)
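One way to read the formula above is as a chain-rule factorisation. Assume T is the number of days until readmission, Y is the event that the patient is ever readmitted, and K ≥ 30, so that the events are nested: {T ≤ 30} ⊆ {T ≤ K} ⊆ Y. The numbers below are purely illustrative and do not come from the study.

```latex
\begin{align*}
P(T \le 30) &= P(T \le 30 \mid T \le K)\; P(T \le K \mid Y)\; P(Y) \\
            &= 0.50 \times 0.60 \times 0.40 = 0.12
\end{align*}
```

Each factor corresponds to one layer of the classifier hierarchy, so each layer only has to estimate one conditional probability.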
Risk-­‐O-­‐Meter
Distinguishing  Features
31
Risk-­‐O-­‐Meter
Users
Current  Systems
Healthcare  provider
and  Patients
Only  
healthcare  providers
Result  explanation
and  exploration
Need  deep  domain  
Knowledge
Handle  incomplete  patient  
input
All  in  one  Package  – Risk-­‐O-­‐Meter  (KDD  2013)
32
Pre  Admission Post  Admission Pre  -­‐ Discharge Discharge
Post-­‐Discharge  
Care  
Management  
Pipeline
“White  Gap”PCP HF  Service
Care  
Management
Payer
ChroniRisk Continuous  Readmission  Risk  Assessment  Across  Continuum  of  Care*
78%*
42%*
Service  Line  EMRPCP  Tools
Psycho-­‐social  risk  
scoring
2013  HF  Readmission  Statistics
• 7.1  M  Readmits
• 5.3  M  Avoidable
• $13,000  each
• $13  B  opportunity  cost
Patient  Encounters  Scored
+18,000 (HF  cohort)
Risk  – Done
Cost  – Done
Next?  
Actionable  Interventions
If  we  can  predict  can  we  recommend?
34
Rui Liu, Kiyana Zolfaghar, S.-C. Chin, Senjuti Basu Roy, Ankur Teredesai, "A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk," 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 911-916. DOI: 10.1109/ICDM.2014.89
A  real  and common Chronic  Readmission
75-­‐year  old,  female
Chronic  pulmonary  disease,  
depression,  hypertension
and  diastolic  heart  failure  
High Risk
Medium Risk
Low Risk
35
Readmit!
Intervention  Plan  1
Major  Operating  Room,  Chest  X-­‐ray  and  others
Intervention  Plan  2
Echocardiology,  CCU  and  others
Intervention  Plan  3
Emergency  Room  and  others
Risk  will  be  
lower  when  the  
interventions  
are  performed
The  patient  is  
not  readmitted
Intervention  Rule  Generation
Readmission
Age Gender
Pneumonia
DX486
Acute
respiratory
failure
DX51881
CHF
DX4280
Cont inv mec ven
<96 hrs
PR9671
Venous cath NEC
PR3893
Packed cell
transfusion
PR9904 Rule  
Repository
Valid  Rule 1
Female, Diabetes,  Major  Operating  Room,  
Chest  X-­‐ray  and  others
Valid  Rule 2
Male, Hypertension, Echocardiology,  CCU  and  
others
Invalid Rule 3
Female,  Depression,  Emergency  Room  and  
others
Invalid  Rule  4
Male,  COPD,  Emergency  Room  and  others
36
Bayesian Network
Construction
Intervention  Rule  
Generation
Intervention  
Recommendation
Evaluation
Compute patient risk using only non-­‐
procedural attributes
Compute patient risk using procedural
attributes
Compare the difference between the two
probabilities
Store the rules where the risk is
reduced after introducing the
procedures
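The rule-generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_risk` is a hypothetical stand-in for inference on the learned Bayesian network, and the attribute and procedure names are invented for the example.

```python
def generate_rules(profiles, procedure_sets, risk):
    """Keep (profile, procedures, risk reduction) whenever the procedures lower the risk."""
    rules = []
    for profile in profiles:
        baseline = risk(profile, frozenset())       # non-procedural attributes only
        for procs in procedure_sets:
            with_procs = risk(profile, procs)       # procedural attributes added
            if with_procs < baseline:               # store only risk-reducing rules
                rules.append((profile, procs, round(baseline - with_procs, 3)))
    return rules

def toy_risk(profile, procs):
    """Hypothetical risk function; a real system would query the Bayesian network."""
    base = 0.30 + 0.10 * ("diabetes" in profile) + 0.05 * ("depression" in profile)
    effect = -0.08 * ("chest_xray" in procs) + 0.04 * ("emergency_room" in procs)
    return base + effect

profiles = [frozenset({"female", "diabetes"}), frozenset({"female", "depression"})]
procedure_sets = [frozenset({"chest_xray"}), frozenset({"emergency_room"})]
rules = generate_rules(profiles, procedure_sets, toy_risk)
```

In this toy setup only the chest X-ray rules survive, because the emergency-room procedure raises the estimated risk instead of lowering it.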
Recommendation  for  New  Patient
Intervention  Plan  1
Major  Operating  Room,  Chest  X-­‐ray  and  others
Intervention  Plan  2
Echocardiology,  CCU  and  others
Intervention  Plan  3
Emergency  Room  and  others
Top 3 intervention plans
Rule  Repository
New  Patient  Attributes
Summarized  Intervention  Plan
Major  Operating  Room,  Echocardiology ,  Chest  
X-­‐ray  and  others
37
Summarize
The Rule Repository is  HUGE!  (over  
30k  rules)
Parallel Solution!
Bayesian Network
Construction
Intervention  Rule  
Generation
Intervention  
Recommendation
Evaluation
Compute similarity between established
attribute profile and a given patient profile
Identify rules where the established
attribute is most similar to the patient
input
Recommend interventions extracted
from the established rules
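The matching step above can be sketched as follows. The slides do not name the similarity measure, so Jaccard similarity over attribute sets is an assumption here, and `recommend` and the example rules are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(patient, rules, top_n=3):
    """Return the interventions of the top_n rules whose profile best matches the patient."""
    ranked = sorted(rules, key=lambda rule: jaccard(patient, rule[0]), reverse=True)
    return [procs for _, procs in ranked[:top_n]]

# Each rule is (established attribute profile, recommended interventions).
rules = [
    (frozenset({"female", "diabetes"}), ("major operating room", "chest x-ray")),
    (frozenset({"male", "hypertension"}), ("echocardiology", "CCU")),
]
patient = frozenset({"female", "diabetes", "hypertension"})
plans = recommend(patient, rules, top_n=1)
```

With more than 30k rules, this scan over the repository is the part that benefits from the parallel solution mentioned on the slide.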
Validation  – Data  Highlights
• State  Inpatient  Database  (SID) of  Washington  State  heart  failure  cohort  in  year  2010  
(67967  patients) for training and 2011 (52021 patients)  for  testing
• 3908  diagnosis  and  2049  procedure  codes  are  involved.
• Feature  Selection  is  performed  using  chi-­‐square  test.
Demographics Age,  Gender,  Race
Comorbidity  &  Diagnosis 21  comorbidities  and  90  diagnosis
Utilization  &  Interventions 21 health  service  utilization  flags  and  70  interventions
Others Length of  Stay,  #  of  diagnosis  and  interventions
38
High Dimensional
Bayesian Network
Construction
Intervention  Rule  
Generation
Intervention  
Recommendation
Evaluation
Extract patients from the test set who were not
readmitted within 30 days
Compute the evaluation metrics between the recommended interventions
and the actual interventions
Validation – Experiment Results
39
[Four bar charts comparing Linear Regression, Hill-Climbing, Grow-Shrink and Hybrid: Hits (axis 0 to 400), Jaccard Index (axis 0.34 to 0.40), Accuracy (axis 0.930 to 0.942), True Positive Rate (axis 0.45 to 0.65)]
Bayesian Network
Construction
Intervention  Rule  
Generation
Intervention  
Recommendation
Evaluation
Back  to  the  Chronic  Readmission  Case
75-­‐year  old,  female
Chronic  pulmonary  disease,  
depression,  hypertension
and  diastolic  heart  failure  
40
No-­‐readmit!
Cardiac  catheterization  lab,  CT  scan,  echo-­‐
cardiology,  echo-­‐cardiogram,  
Cardiac  catheterization  lab,  CT  scan,  echo-­‐
cardiology,  echo-­‐cardiogram
Accountable  Care  Organizations
Cost/Charge  Prediction
41
J. Marquardt, S. Newman, D. Hattarki, R. Srinivasan, S. Sushmita, P. Ram, V. Prasad, D. Hazel, A. Ramesh, M. De Cock, A. Teredesai, "HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs," IEEE International Conference on Data Mining, Demo Track, 2014, pp. 1227-1230. DOI: 10.1109/ICDMW.2014.45
42
What  are  healthcare  
costs  for  assigned  
population  in  2015  ?
Why  is  the  cost  so  
high  or  low  ?
How  does  the  cost  
distribute  across  
demographics  ?
QUESTIONS
DATA  
SCIENCE
DATA
APPLICATIONS
Motivation:  
ACO  Cost  Prediction
Demographics
Diagnosis  
Codes
Procedure  
Codes
Drugs
Lab  Results
Clinical
Claims
Sources  :  SID,  OSHPD,  MEPS Source  :  MultiCare  Collaboration
Charges
Vitals
Population Predictive  
Modeling
Feature  Prioritization
Health  Prediction
Care  Management
Individual Predictive  
Modeling
Chandola et.  al,  KDD  2013  
Cost/Charge  Prediction:  Problem  Description
• Goal: predict the future healthcare cost of individuals based on
their past medical and cost information.
• Supervised  machine  learning  problem.
• Input:
• Previous  health  information  (e.g.  diagnosis,  comorbidities,  etc).  
• General  demographics  (age,  gender,  race)
• Previous  healthcare  cost
• X = (x1, x2, x3, ..., xp)
• Output:
• Y  =  Future  healthcare  cost
Four  Scenarios  for  predicting  cost  
• Three months of historical data (medical, demographic and cost)
→ Cost of following nine months (1Q)
• Six months of historical data (medical, demographic and cost)
→ Cost of following six months (2Q)
• Nine months of historical data (medical, demographic and cost)
→ Cost of following three months (3Q)
• Twelve months of historical data (medical, demographic and cost)
→ Cost of following twelve months (4Q)
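The four scenarios differ only in how a member's timeline is cut into a history window and a prediction window. A minimal sketch, assuming costs are available as one number per month (the `SCENARIOS` table and `split_costs` are hypothetical names):

```python
# (months of history used as input, months of future cost to predict)
SCENARIOS = {"1Q": (3, 9), "2Q": (6, 6), "3Q": (9, 3), "4Q": (12, 12)}

def split_costs(monthly_costs, scenario):
    """Return (historical cost, future cost to predict) for one member."""
    history, horizon = SCENARIOS[scenario]
    if len(monthly_costs) < history + horizon:
        raise ValueError("not enough months for this scenario")
    past = monthly_costs[:history]
    future = monthly_costs[history:history + horizon]
    return sum(past), sum(future)
```

For example, with a flat 100 per month over 24 months, `split_costs(costs, "1Q")` gives 300 of history and a 900 target, while `"4Q"` gives 1,200 for both.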
Non-­‐ Gaussian  Distribution  of  Healthcare  Costs
Makes it a challenging and interesting research problem
Existing  Cost  prediction  Methods
• Limited  to  Rule  based  or  Multiple  Linear  Regression  methods
• Rule  Based  methods  
• Requires  domain  knowledge
• Expensive
• Multiple  Linear  Regression
• Multi-­‐collinearity Issue
• Sensitive  to  extreme  values  (outliers)
• Evaluation
• Estimate    the    mean    cost    of    the    given    sampling    distribution.
• Often  in-­‐sample  data  used  to  report  predictive  performance.
• R2   evaluation  metric (not  a  true  indicator)
Our  Contributions
• Investigate the utility of state-of-the-art machine learning
algorithms for the cost prediction problem.
• We  empirically  evaluate  three  algorithms:
• Regression  Trees
• M5  Model  Trees
• Random  Forest
Regression  Tree
48
Age  >  60?
Has  
Asthma?
Gender  =  
Female?
21,000  46,000  62,000  85,000
Yes
Yes Yes
No
No No
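A regression tree like the one pictured predicts a constant cost at each leaf. The sketch below encodes the three splits from the slide as nested conditions; the extracted text does not show which leaf value belongs to which branch, so the assignment here is an assumption made only for illustration.

```python
def predict_cost(age, has_asthma, is_female):
    """Walk the tree from the root split (Age > 60?) down to a leaf cost."""
    if age > 60:
        # Assumed leaf assignment for the "Has Asthma?" subtree.
        return 85_000 if has_asthma else 62_000
    # Assumed leaf assignment for the "Gender = Female?" subtree.
    return 46_000 if is_female else 21_000
```

An M5 model tree keeps the same split structure but replaces each constant leaf with a linear model fitted to the instances that reach it.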
M5  Model  Tree
Has  
Asthma?
Gender  =  
Female?
Yes
Yes Yes
No
No No
Age  >  60?
Random  Forest
50
Had  
Procedure  
X?
Age  >  18?
Gender  =  
Male?
21,000  46,000  62,000  85,000
Yes
Yes Yes
#  Admits  
>  3?
No No
Race  =  
White?
Has  CHF?
21,000  46,000  62,000  85,000
Yes
Yes
YesNo No
No
NoAge  >  
60?
Has  
Asthma?
21,000
Gender  =  
Female?
46,000  62,000  85,000
Yes
Yes
YesNo No
No
51
Evaluation  Metrics
• Mean  Absolute  Error  (MAE)
• Root  Mean  Squared  Error  (RMSE)
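Both metrics can be computed directly from paired actual and predicted costs. A minimal sketch (the function names are illustrative):

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average size of the error, in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: squaring makes large errors count more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because healthcare costs are heavily skewed, a few very expensive patients can dominate RMSE, which is one reason to report MAE alongside it.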
52
MAE  Results  – SID  Data  (3Q  Scenario)
[Bar chart: MAE ($), axis 0 to 30,000. Bars: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression Tree, Random Forest, Model Tree, grouped as Baselines and Advanced Models]
53
MAE  Results  – MEPS  Data
[Bar chart: MAE ($), axis 0 to 14,000. Bars: Average Baseline, Previous Cost Regression, Multiple Linear Regression, Regression Tree, Random Forest, Model Tree, grouped as Baselines and Advanced Models]
54
Prediction  Error  Results  – M5  Model  Trees
[Chart: Error ($), axis 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q and 4Q scenarios]
Error  Distribution:  WA  State  SID  Data
For a large fraction of the population (75%), we were able to
predict with higher accuracy using these algorithms
[Chart: Maximum Prediction Error ($), axis 0 to 18,000, against Portion of Population (0% to 75%) for Multiple Linear Regression, Regression Tree, Random Forest and Model Tree]
Sub-­‐Population  Cost  Prediction
Prediction
Prediction
Prediction
Population
Sub-­‐Population
Future
Healthcare
Cost
Congestive  heart  failure  (CHF)
Diabetes
COPD
Asthma
Coronary  artery  disease  (CAD)
Age  65+
Most  difficult  cohort  to  predict
[Bar chart: MAE ($), axis 0 to 35,000, by cohort (Asthma, Diabetes, CHF, COPD, Coronary, Over 65) for model trees and linear regression]
Engineering  the  Solutions:  
Risk-­‐O-­‐Readmission  &  Cost-­‐As-­‐a  
Service
58
Thu,  Nov  7,  2013  at  10:50  AM
59
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ Forwarded  message  -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐
From:  Windows  Azure  Pass  System  Admin  <wapadmin@microsoft.com>
Date:  Thu,  Nov  7,  2013  at  10:50  AM
Subject:  Gifting  Letter  for  Windows  Azure  Research  Pass
To:  "Ankur  M.  Teredesai"  <ankurt@uw.edu>
Cc:  "Azure4Research  (RFP  External)"  <azurerfp@microsoft.com>
Dear  Ankur  M.  Teredesai  ,
We  have  approved  your  application  for  a  Windows  Azure  Research  Pass  Grant.  In  
order  to  receive  your  pass,  download  the  Microsoft  gifting  letter  from  the  following  
link:
Risk-­‐of-­‐Readmission  as  a  Service
60
Web  App  
for  ACOs
Model  
Selector
Cost  Prediction  API
Beneficiary    Claims
Population  Batch/Individual
A
B
Linear  Regression
Regression  Trees
Individual  Beneficiary
Feature  Vector
Individual  Beneficiary
Predicted  Cost
Predicted,  Previous  year,  Historic    
population  Costs  +  population  statistics
④
①
②
③
Scale  Issues:
Cost  Prediction  as  a  Service
R
Big  Data  Stack
Cost  Prediction  Engine
Model  Bank  deployed  on  
ADAPA
Spark
Beneficiary    Claims  for  individual
①
Predicted cost  for  the  individual
④
Web App  
for
Individual
WA-­‐SID  Claims  /  MEPS  
Survey  (for  training)
Data  Sources
C M5  Model  Trees
Apache  Spark
Apache  Spark
HDFS
Slave  1
Slave  1
Master
Driver RDD
In  Memory  Data
Partition  1
In  Memory  Data
Partition  2
Spark
Spark
Spark
Data  Partition1
Replica  Data  
Partition2
Data  Partition2
Replica  Data  
Partition2
Weighted  k-­‐NN  for  Regression
Data  
Partition  1
kNN1
Predicted  Cost
kNN2
2k  NN
kNN
Node  1
Data  
Partition  2
Node  2
Test  
Instance Top  k
Group  
&  Sort
Group  &  Sort
Weighted  
Average
Compute  
kNN
Compute  
kNN
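The diagram above describes a two-stage search: each data partition finds its own k nearest neighbours, the candidates are merged and sorted, the global top k are kept, and the prediction is a weighted average of their costs. A minimal single-machine sketch follows, assuming one numeric feature and inverse-distance weights (neither is stated on the slide); `local_knn` and `predict` are hypothetical names.

```python
import heapq

def local_knn(partition, query, k):
    """k nearest (distance, cost) pairs within one data partition."""
    scored = [(abs(x - query), cost) for x, cost in partition]
    return heapq.nsmallest(k, scored)

def predict(partitions, query, k, eps=1e-9):
    """Merge per-partition candidates, keep the global top k, average by inverse distance."""
    candidates = []
    for part in partitions:          # on a cluster, each partition would run on its own node
        candidates.extend(local_knn(part, query, k))
    top_k = heapq.nsmallest(k, candidates)      # "group and sort", then top k
    weights = [1.0 / (d + eps) for d, _ in top_k]
    return sum(w * cost for w, (_, cost) in zip(weights, top_k)) / sum(weights)

partitions = [
    [(50, 1000.0), (60, 3000.0)],    # (feature value, cost) pairs on node 1
    [(52, 2000.0), (30, 500.0)],     # (feature value, cost) pairs on node 2
]
estimate = predict(partitions, query=51, k=2)
```

Only k candidates per partition cross the network, which is why the merge step sees at most 2k neighbours for two nodes.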
Rough  Set
• Rough set theory is an ML framework that
is especially suitable for information
systems with inconsistencies.
• Rough set theory handles discrete
attributes.
• Lower approximation: instances that
necessarily belong to the class
• Upper approximation: instances that
possibly belong to the class
Patient  Age ≥ 50  Alcohol Disorder Visit  Cost
P1       Yes       Yes                     High
P2       Yes       Yes                     High
P3       Yes       No                      Low
P4       Yes       No                      High
P5       No        No                      Low
P6       No        Yes                     High
Similar  Patients  but  belong  to  
different  classes!
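The lower and upper approximations for the six-patient table can be computed by grouping patients with identical attribute values. A minimal sketch (the function name `approximations` is illustrative):

```python
from collections import defaultdict

# (Age >= 50, Alcohol Disorder Visit) -> Cost, as in the table above.
patients = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}

def approximations(data, target):
    """Lower and upper approximation of the class `target`."""
    blocks = defaultdict(list)
    for name, (attrs, _) in data.items():
        blocks[attrs].append(name)           # indiscernible patients share a block
    lower, upper = set(), set()
    for members in blocks.values():
        labels = {data[m][1] for m in members}
        if labels == {target}:
            lower.update(members)            # necessarily in the class
        if target in labels:
            upper.update(members)            # possibly in the class
    return lower, upper

lower, upper = approximations(patients, "High")
```

P3 and P4 share the same attribute values but differ in cost, so they fall in the upper approximation of "High" but not in the lower one.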
Fuzzy  Rough  Set
• Uses  fuzzy  logic  to  handle  continuous  
attributes.
• Similarity  matrix  contains  values  
between  0  and  1.  
• Inconsistent  instances  are  highly  
related  but  have  a  different  class.
Patient  Age  Alcohol Disorder Visits  Cost
P1       52   1                        $13335
P2       59   4                        $277966
P3       55   0                        $8139
P4       50   0                        $66058
P5       34   0                        $5815
P6       26   1                        $38526

     P1    P2    P3    P4    P5    P6
P1   1     0.52  0.83  0.84  0.60  0.61
P2   0.52  1.00  0.44  0.36  0.12  0.13
P3   0.83  0.44  1     0.92  0.68  0.44
P4   0.84  0.36  0.92  1     0.76  0.51
P5   0.60  0.12  0.68  0.76  1     0.75
P6   0.61  0.13  0.44  0.51  0.75  1
Fuzzy  Rough  Set
• Let $r_{j,i}$ be the degree of similarity of instances $i$ and $j$.
• Let $c_i$ be the degree to which instance $i$ belongs to the class.
• Then the degree to which instance $j$ belongs to the:
• Lower approximation of the class is: $\min\{\max(1 - r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
• Upper approximation of the class is: $\max\{\min(r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
• Current implementations can handle only up to 100,000 instances
because they keep the similarity matrix in memory.
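The two formulas can be evaluated directly on the six-patient similarity matrix shown earlier. The slides do not give the class memberships, so the sketch below assumes a crisp "high cost" class for P2, P4 and P6; that labelling is hypothetical.

```python
# Similarity matrix from the slide (rows and columns are P1..P6).
R = [
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
]
# Assumed class membership c_i: 1 for P2, P4, P6 (treated as high cost), else 0.
C = [0, 1, 0, 1, 0, 1]

def lower(j, r, c):
    """min over i of max(1 - r[j][i], c[i])."""
    return min(max(1 - r[j][i], c[i]) for i in range(len(c)))

def upper(j, r, c):
    """max over i of min(r[j][i], c[i])."""
    return max(min(r[j][i], c[i]) for i in range(len(c)))
```

Under this assumed labelling, P2 has a lower-approximation membership of about 0.48, because P1 is fairly similar to it (0.52) yet sits outside the class.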
Fuzzy Rough Set (upper approximation)
$\max\{\min(r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
Fuzzy Rough Set (lower approximation)
$\min\{\max(1 - r_{j,i},\, c_i) \mid i = 1,\dots,n\}$
Implementation
• The construction of the similarity matrix can be done in a parallel manner, making each of K compute nodes calculate n/K columns of the similarity matrix.
• No need to store the similarity matrix as a whole.
• The construction of the similarity matrix does not have to be finished before (partial) computation of the lower and upper approximations can begin.
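The column-wise decomposition works because min and max can be combined across chunks. The sketch below simulates it on one machine: each chunk computes similarities on the fly for its own columns, reduces them to partial bounds, and the partial results are merged. The single-attribute similarity function and all names are assumptions made for illustration; this is not the Spark or MPI code.

```python
def similarity(a, b, span):
    """Assumed similarity for one numeric attribute: 1 minus the scaled distance."""
    return 1.0 - abs(a - b) / span

def partial_bounds(values, classes, cols, span):
    """Partial lower/upper memberships from one chunk of similarity-matrix columns."""
    lows, highs = [], []
    for vj in values:
        sims = [(similarity(vj, values[i], span), classes[i]) for i in cols]
        lows.append(min(max(1 - r, c) for r, c in sims))
        highs.append(max(min(r, c) for r, c in sims))
    return lows, highs

def chunked_approximations(values, classes, n_chunks):
    """Split the columns into n_chunks and merge the partial bounds with min and max."""
    n = len(values)
    span = (max(values) - min(values)) or 1.0
    chunks = [range(start, n, n_chunks) for start in range(n_chunks)]  # strided column sets
    parts = [partial_bounds(values, classes, cols, span) for cols in chunks if len(cols)]
    lower = [min(p[0][j] for p in parts) for j in range(n)]
    upper = [max(p[1][j] for p in parts) for j in range(n)]
    return lower, upper
```

Each chunk holds only its own columns at any time, so the full n-by-n matrix is never stored, and a chunk's partial bounds are available as soon as its columns are done.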
Node  1 Node  2
Implementation  -­‐
Lower  Approximation
Upper  Approximation
Spark  vs MPI              
Fuzzy  Rough  Set
Readmission  Application
• Android
• Windows  Phone
• Patient  View
• what  is  my  risk
• Doctor  View  
• who  are  my  risky  patients?
• alerts
• Interventions
74
http://healthscope.cloudapp.net/hscope-­‐dev/aco/
Healthcare  Scalable  COst  Prediction  Engine  (HealthSCOPE)
0.6  AUC
Yale  Model
(Baseline)
76
Milestones:  Readmission  Risk
0.64  AUC
UW  2012  
Result
Ensemble  
method,  
Hierarchical  
classification
Dec  2012
0.74  AUC
UW 2014 Result
Lab  results
+
New  
Algorithm  
(Adaboost)
Feb    2014
QlikView
Readmission  
App
Dec  2013
Machine  Learning  
Process  to  Target  
New  Chronic  
Diseases
Aug  2014  -­‐>  Moving  Forward
Integrating  
care  pathway  
March  2014
Bayesian  
Network  
Learning
AUC  – Accuracy  measure  
(Area  Under  Curve)
Real  Time  
Care  
Factors  &  
Pathways
July  2014
with  
EPIC
Post-­‐Discharge
(Clinical    data)
June  2013
Risk-­‐o-­‐Meter
Development
+  
Big  data  Efforts
Pre-­‐Admission
(Clinical    data)
Post-­‐Discharge
(Claim  data)
Post-­‐Admission
(Clinical  data)
IEEE  Big  Data
REF  #3
KDD
REF  #1  &  2
HEALTHINF
REF  #4  &  5
KDD
REF  #6
ICDM  2014
REF  #6
Problem
Exploration
77
Milestones:  Cost  Prediction
H-­‐SCOPE  I
SID  Data
June  2014
H-­‐SCOPE  IV
SID  +  MEPS  
data
Nov.  2014
H-­‐SCOPE  III
Adapa Scoring  
Engine
Spark  
Framework
Sept.  2014 Aug  2015  -­‐>  Moving  Forward
H-­‐SCOPE  V
Five  Cohort
Dec.  2014
M5  Model  
Trees
Random  
Forest
Regression  
Trees
Health
SCOPE  VI
July  2015
Admit  Level
August  2014
H-­‐SCOPE  II
Population  View  
(ACO)
OSHPD  Data  
Application
Beneficiary  
Level
Beneficiary  
View
Four  Future  
Scenario  
ICDM  2014 KDD-­‐2015 AMIA-­‐2015
Sub-­‐
Population
Deep
Learning
Time  &
Cost  Of
Hospital
readmission
H-­‐SCOPE  VII
AHRQ  Private
data
WWW-­‐Digital
Health-­‐2015
Time,  Cost
And  
Illness  (Alignment)
Prediction  
78
AUC  – Accuracy  measure  
(Area  Under  Curve)
2012
78
Milestones:  Merging  Threads
2016  and  beyond2013 2014 2015
Risk  of  Readmission  (Clinical,  Sociological  &  Claims)
2014 2015
Cost  Prediction  (Claims  and  secondary  data  sources)
2015
Risk  &  Cost  Convergence
Flat  Files  CSV Claims  X12 Clinical    HL7
Distance  Compute  Library
Instance  Selection  
RNGE Drop  3
Fuzzy  Rough  Set  
Approximation
CHF  Risk  of  
Readmission
Geo  
Routing
Random  Forests KNN
Industry  Partners  and  Domain  Experts
Other  
Solutions
HDFS NUMA
MPI Grappa
Census  US  Gov Unstructured  CCD
Bayesian  
Networks
Support  Vector
Machines
79
Cost  of  Chronic  
Interventions
Age/Gender  
Prediction
Malware  
Analytics
Personalized  
Cancer  Therapy
ETL  Tools
Raw  Data  from  Sources    (SID,  OSHPD,  HCUP,  Edifecs,  MHS,  CMS,  LINCS,  Industry)
Sqoop
Our  Sincere  Thanks for  Your  Support!

Societal Impact of Applied Data Science on the Big Data Stack

  • 1. Product Decisions through Big Data; Center for Data Science; Ankur Teredesai, University of Washington Tacoma; March 14th, 2015
  • 2. Center for Data Science: Societal Impact. Health Informatics: Bioinformatics, Health and Wellness, Predictive Analytics. Geo-Spatial Data Management: Distributed Systems, Databases, Geo-Spatial, Embedded Systems. Intelligent Systems: Machine Learning, Data Mining, Computational Intelligence, Computer Vision. Social Computing: Web, Devices, Mobile Networks, UX / UI. Big Data Security: Cryptology, Secure Machine Learning. Big Data Infrastructure: Engineering, Dev-Ops.
  • 3. A Big Data Project Blueprint: Machine Learning; Analytics; Engineering; Features; Algorithm; Scalability; ELT; Integrate Sources; Constraints; Deploy Models; APIs; Apps; Data Struggles
  • 4. Data Mining: 1989-2010. Data Science and Applications move and transform sizeable amounts of data out of the native database or file systems. [Diagram labels: Applications; SQL/ODBC/JDBC Data Access; four Distributed Database nodes (Multi-Core, Columnar, Key-Value); Data Science using R, SAS, SPSS, Weka, MAHOUT; HIGH VOLUME; HIGH LATENCY; HIGH VOLUME; Application Ecosystem Integration]
  • 5. Big Data Science: Data Science uses native data representation and inherent distribution and parallelism; minimal data movement; rapid application development using data science constructs. [Diagram labels: Application Ecosystem Integration; Applications; SQL/ODBC/JDBC Data Access; Data Science (internal algorithms for clustering, classification, regression); Distributed Database (Multi-Core, Columnar, Key-Value); LOWER VOLUME; LOWER LATENCY; HIGH VOLUME; LOW LATENCY; Big Data Science Components]
  • 6. A Short History of (Big) Data Technology. 1970: Codd publishes “A Relational Model of Data for Large Shared Data Banks”. 1985: Copeland – Decomposition Storage Model (essentially the first Columnar Store). 1989: Shared-Nothing Architecture. 2004: Google – MapReduce. 2005: C-Store (eventually Vertica), layers WS/RS. 2007: Materialization Optimizations in Columnar Stores and Hadoop Implementation. 2005-07: Star-Schema Benchmark + Hadoop. 2008: Attempts to backport columnar advances to row storage, not very effective. Today: BIG DATA
  • 7. Technology Decisions: Columnar vs. Relational Storage Technologies; Infinite scale using commodity hardware; Private or Public Cloud; Massively Distributed and Parallel Architecture: Hadoop; Stream Query Processing for trillions of events and petabytes of data; Real-time classification and clustering: Approximate scoring and segmentation + Reporting and Data Visualization
  • 8. [Diagram labels] Flat Files CSV; Claims X12; Clinical HL7; Distance Compute Library; Instance Selection; RNGE; Drop 3; Fuzzy Rough Set Approximation; CHF Risk of Readmission; Geo Routing; Random Forests; KNN; Industry Partners and Domain Experts; Other Solutions; HDFS; NUMA; MPI; Grappa; Census US Gov; Unstructured CCD; Bayesian Networks; Support Vector Machines; Cost of Chronic Interventions; Age/Gender Prediction; Malware Analytics; Personalized Cancer Therapy; ETL Tools; Raw Data from Sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry); Sqoop
  • 9. iTornado: Routing Service With Real World Severe Weather. Demo Paper in ACM SIGSPATIAL 2014 (Best Demo paper award). Fatalities Stats by Weather Related Hazards, http://www.nws.noaa.gov, June 2014.
  • 10. COMA: Road Network Compression For Map Matching. ACM SIGSPATIAL IWGS 2014
  • 11. PreGo: Dynamic Multi-Preference Routing. Prior work by problem type: Time-Homogeneous, Single Attribute: Dijkstra, A*; Time-Homogeneous, Multiple Attribute: Stewart et al 91; Time-Variant, Single Attribute: Betsy et al 07; Time-Variant, Multiple Attribute: ? [Figure: example road network with nodes s, a, b, c, d, e, f, g, h; each edge is annotated with two five-slot vectors labelled T and R, and nodes with cost pairs such as <0,0> and <3,4>]
  • 12. Special Needs Education: Teacher Trainer Effectiveness Analysis. Purpose: To support streamlined data collection and performance evaluation across the State Needs Projects. Components: Customized Surveys; Training Registration; Survey Management; Data Dashboard; Report Generation; Geographic Distribution Maps; Demographic Reports. Project Stakeholders: Office of the Superintendent of Public Instruction; Center for Data Science. Team: Brad Porter, Aniruddha Desai, Yitao Li, David Hazel, Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green
  • 13. Systems Biology. Focus: Develop machine learning methods and tools to effectively integrate multiple big data sources in biology. Predictive models and software. Applications: Personalized medicine, drug discovery.
  • 14. A Flying Hadoop Cluster
  • 15. Detecting Malware Activity based on Automatically Generated Domains. Partnering with NIARA we obtained a large dataset of Automatically Generated Domains. Based on the intercepted domain features we are able to identify the malware infecting a network. [Diagram labels: Command & Control; xyz.com; Infected node]
  • 16. (March 2012) Reduce CHF Readmission. Will this Heart Failure patient get readmitted within 30 days? Yes or No (Binary Classification). Readmission? Machine Learning? Joint NSF / NIH Solicitation on Health Care and Big Data
  • 17. Affordable Care Act => Avoidable Costs. Readmissions are AVOIDABLE. Readmissions: 20% within 30 days, 32% within 60 days. Congestive Heart Failure (CHF): 75% Non CHF, 25% CHF. Readmissions national cost: $17 billion annually; 76% considered avoidable. Source: www.presidency.ucsb.edu, cdc.gov, tmz.com
  • 18. CHF ROR: 30-Day Hospital Readmission Risk Prediction. Building the model: patient feature vectors with class labels (No readmission, Readmission) are fed to Machine Learning Algorithms. Scoring the tuple: a new patient's feature vector is scored as No readmission or Readmission.
  • 19. Some of the Steps: Data Understanding and Integration; Data Cleaning; Data Transformation. Extracting data from Epic - 16 data marts and 200 views: Heart Failure Inpatient Summary; Encounter.Flowsheet; PatientEncounterHospital vs
  • 20. Public Data: State Inpatient Dataset 2009-2012
      AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
      52   98122  1     3      12        3    0       153     212     56,511
      87   98109  1     3      7         1    1       162     -       12,687
      26   98028  4     3      1         30   1       139     195     127,300
      Washington State Inpatient Data; Admission level Claims; ~400 attributes; Demographics; ICD9 Diagnosis codes; ICD9 Procedure codes; Charges. Admissions by year: 2009: 652702; 2010: 651783; 2011: 648079; 2012: 648092
  • 21. Variety and Volume (2/3 V's of Big Data). Stages: Pre Admission; Post Admission; Pre-Discharge; Discharge. - Demographics; - Vital Sign; - Prior Hospitalization. Pulse rate; Blood pressure; Respiration rate; BMI. Number of prior admissions; Prior length of stay. + Demographics. Sodium level; Glucose level; Hemoglobin level; Creatinine level; Hematocrit level; Neutrophils level; Ejection Fraction; BUN level. + Vital Sign; + Prior Hospitalization; - Lab Test. + Vital Sign; + Prior Hospitalization; + Demographics; + Lab Test; - Diagnosis Information. Number of secondary diagnosis; Chronic systolic heart failure; Acute kidney failure; Chest pain; Hyper potassemia; Bronchopneumonia; Other chronic pulmonary heart diseases; Syncope and collapse; … + Prior Hospitalization; + Demographics; - Comorbidities. Acute coronary syndrome; Asthma; COPD; Ulcer; Dialysis; Dementia; Arrhythmias; Mal Nutrition; Vascular; Depression. - Discharge/Admit codes: Admit/Discharge type; Severity Of illness; Risk Of Mortality. - Utilization Information: Operating room; CTSCAN; Emergency Room; CCU. Marital status; Age; Racial group; Gender
  • 22. (Dec 2012) Initial Models. Data integration; Feature Construction; Predictive modeling: Logistic Regression, Naïve Bayes, Support Vector Machines. [Chart: Area Under the Curve (AUC): Yale Model (Comparative …) 0.6; Amarasingham et al. 0.72; Our current Result 0.64] Several Rejects: KDD Industry Track 2013; AMIA 2013; JAMIA 2013
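The slide compares models by Area Under the Curve (AUC). As a reminder of what that number measures, here is a minimal sketch in plain Python: AUC is the probability that a randomly chosen readmitted patient receives a higher risk score than a randomly chosen non-readmitted patient. The function name `auc` and the toy labels and scores are illustrative assumptions, not taken from the deck.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for readmitted, 0 for not readmitted (illustrative encoding).
    scores: model risk scores, higher meaning more likely to be readmitted.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.6 to 0.72 figures above should be read.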
  • 23. (July 2013) (much better) & Some Papers. § Improved data exploration § Improved Modeling Effort. [Chart: Area Under the Curve (AUC): Yale Model (Comparative Baseline) 0.6; Amarasingham et al. 0.72; Our 2012 Result 0.64; Our current Result 0.74] § S.-C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-n-Discover -- Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014 § N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013
  • 24. (Dec 2013) Prototype or a possible Product? & yes, More Papers. § Successful Deployment § K. Zolfaghar, J. Agarwal, D. Sistla, S.-C. Chin, S. Basu Roy, and N. Verbiest, "Risk-O-Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013 § Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-Chi Chin, Brian Muckian: Big data solutions for predicting risk-of-readmission for congestive heart failure patients. BigData Conference 2013: 64-71
  • 25. Multi-Layer Classifier: Automatically Detecting Classification Windows. Layer 1: will the patient ever readmit? (yes/no). Layer 2: will the patient readmit within 30 days? (yes/no). Candidate classifiers at the layers: KNN, LR, NB, SVM. Annotations: 32% of all data; only 5% of the patients who return within 30 days are filtered out.
  • 26. Generalizing the 30-, 60-, 90-Day Readmission
      • Automatic design of the time prediction hierarchy
      • Feature selection and factor analysis at each layer
      • Different classification algorithms in each layer, each satisfying different quality metrics
  • 27. Automatic design of prediction hierarchy
  • 28. Simple 3-Layer Example
      • Stage 1: Design a predictive model for the patients who are likely to come back within a time window of (X, K), where X is the maximum number of days until next readmission
      • Stage 2: Design a predictive model for the time window of (K, 30)
      • Stage 3: Design a predictive model for the time window of <30 days of readmission
      • How to automatically detect the middle cutpoint K?
  • 29. Hill Climbing Algorithm to Detect K
      • Generate a random number K between X and 30
      • Compute C1 = Centroid(X, K) and C2 = Centroid(K+1, 30)
      • Compute KLCurrent = KLDiv(C1, C2)
      • Set K' = K + i and K'' = K - i
      • Find a point K2 between K' and K'' and check whether KLDiv(Centroid(X, K2), Centroid(K2, 30)) > KLCurrent
      • If the condition is satisfied, set K = K2 and KLCurrent = KLDiv(Centroid(X, K2), Centroid(K2, 30))
      • Repeat the above steps until no further check is possible
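The hill-climbing steps on this slide can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. It assumes each patient has a days-to-readmission value and a non-negative feature vector, that Centroid(a, b) is the mean feature vector of the patients whose readmission day falls in that window, and that the centroids are normalized to sum to 1 before the KL divergence is taken. The names `window_score`, `detect_cutpoint`, `step`, and the synthetic data are hypothetical.

```python
import numpy as np

def kl_div(p, q, eps=1e-9):
    """KL divergence between two non-negative vectors, normalized to sum to 1."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def window_score(days, feats, k, x, short=30):
    """KL divergence between the centroids of the windows (k, x] and (short, k]."""
    late = feats[(days > k) & (days <= x)]
    mid = feats[(days > short) & (days <= k)]
    if len(late) == 0 or len(mid) == 0:
        return float("-inf")  # a cutpoint that empties a window is never chosen
    return kl_div(late.mean(axis=0), mid.mean(axis=0))

def detect_cutpoint(days, feats, x, short=30, step=5, seed=0):
    """Hill climbing: move K to a neighbor within +/- step while the divergence grows."""
    rng = np.random.default_rng(seed)
    k = int(rng.integers(short + 1, x))  # random start with short < k < x
    best = window_score(days, feats, k, x, short)
    improved = True
    while improved:
        improved = False
        for k2 in range(max(short + 1, k - step), min(x - 1, k + step) + 1):
            s = window_score(days, feats, k2, x, short)
            if s > best:
                k, best, improved = k2, s, True
    return k, best

# Synthetic cohort (assumption): one patient per day from day 31 to day 120,
# with a different feature profile before and after day 60.
days = np.arange(31, 121)
feats = np.where(days[:, None] <= 60, [0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
k, score = detect_cutpoint(days, feats, x=120)
print(k, round(score, 3))
```

On this synthetic cohort the divergence grows steadily as K approaches the day where the profile changes, so the search settles there. On real data hill climbing can stop at a local maximum, which is why the slide starts from a random K.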
  • 30. Calculating the Probability of 30-Day RoR: P(readmit ≤ 30) = P(≤ 30 | ≤ K) × P(≤ K | Y) × P(Y)
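As a worked illustration of this chain of conditional probabilities, write T for the number of days until readmission and read Y as the event that the patient readmits at all (an interpretation based on the multi-layer classifier slide, not stated on this slide). The numbers below are purely hypothetical and are not results from the study.

```latex
\begin{align*}
P(T \le 30) &= P(T \le 30 \mid T \le K)\; P(T \le K \mid Y)\; P(Y) \\
            &= 0.4 \times 0.6 \times 0.5 \\
            &= 0.12
\end{align*}
```

Each factor would come from one layer of the hierarchy, so the 30-day risk is the product of the per-layer outputs.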
  • 31. Risk-O-Meter Distinguishing Features
      • Risk-O-Meter: users are healthcare providers and patients; offers result explanation and exploration; handles incomplete patient input
      • Current systems: users are only healthcare providers; need deep domain knowledge
  • 32. All in one Package: Risk-O-Meter (KDD 2013)
  • 33. ChroniRisk: Continuous Readmission Risk Assessment Across Continuum of Care*
      • Stages: Pre-Admission, Post-Admission, Pre-Discharge, Discharge, Post-Discharge
      • Care management pipeline: PCP, HF Service, Care Management, Payer, with a "White Gap"; Service Line EMR; PCP Tools; psycho-social risk scoring
      • 78%*, 42%*
      • 2013 HF readmission statistics: 7.1 M readmits; 5.3 M avoidable; $13,000 each; $13 B opportunity cost
      • Patient encounters scored: +18,000 (HF cohort)
  • 34. Risk: Done. Cost: Done. Next? Actionable Interventions. If we can predict, can we recommend? Reference: R. Liu, K. Zolfaghar, S.-C. Chin, S. Basu Roy, and A. Teredesai, "A Framework to Recommend Interventions for 30-Day Heart Failure Readmission Risk," 2014 IEEE International Conference on Data Mining (ICDM), pp. 911-916, DOI: 10.1109/ICDM.2014.89
  • 35. A real and common Chronic Readmission. Patient: 75-year-old female with chronic pulmonary disease, depression, hypertension, and diastolic heart failure. Risk levels: High, Medium, Low. Outcome: Readmit!
      • Intervention Plan 1: Major Operating Room, Chest X-ray, and others
      • Intervention Plan 2: Echocardiology, CCU, and others
      • Intervention Plan 3: Emergency Room and others
  • 36. Intervention Rule Generation (pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation)
      • Bayesian network nodes shown: Readmission, Age, Gender, Pneumonia (DX486), Acute respiratory failure (DX51881), CHF (DX4280), Cont inv mec ven <96 hrs (PR9671), Venous cath NEC (PR3893), Packed cell transfusion (PR9904)
      • A rule is valid when the patient is not readmitted and the risk is lower when the interventions are performed
      • Steps: compute patient risk using only non-procedural attributes; compute patient risk using procedural attributes; compare the difference between the two probabilities; store the rules where the risk is reduced after introducing the procedures
      • Rule repository examples: Valid Rule 1: Female, Diabetes, Major Operating Room, Chest X-ray, and others. Valid Rule 2: Male, Hypertension, Echocardiology, CCU, and others. Invalid Rule 3: Female, Depression, Emergency Room, and others. Invalid Rule 4: Male, COPD, Emergency Room, and others
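The rule-generation steps above (risk without procedures, risk with procedures, keep the rule if risk drops) can be sketched as follows. This is an illustration only: the slide obtains both probabilities from a learned Bayesian network, whereas this sketch substitutes plain empirical frequencies over a tiny made-up record set. The record layout and the names `readmit_rate` and `generate_rules` are hypothetical.

```python
def readmit_rate(records, profile, procedures=frozenset()):
    """Empirical P(readmit | profile, procedures); None if nothing matches."""
    hits = [r["readmitted"] for r in records
            if profile <= r["attrs"] and procedures <= r["procs"]]
    return sum(hits) / len(hits) if hits else None

def generate_rules(records):
    """Keep (profile, procedures) pairs whose risk is lower than the profile-only risk."""
    rules, seen = [], set()
    for r in records:
        key = (r["attrs"], r["procs"])
        if not r["procs"] or key in seen:
            continue
        seen.add(key)
        base = readmit_rate(records, r["attrs"])                    # non-procedural attributes only
        with_procs = readmit_rate(records, r["attrs"], r["procs"])  # procedures included
        if with_procs < base:                                       # risk reduced: store the rule
            rules.append({"profile": r["attrs"],
                          "interventions": r["procs"],
                          "risk_drop": base - with_procs})
    return rules

def rec(attrs, procs, readmitted):
    return {"attrs": frozenset(attrs), "procs": frozenset(procs), "readmitted": readmitted}

# Made-up records for illustration only.
records = [
    rec({"female", "diabetes"}, {"chest x-ray"}, False),
    rec({"female", "diabetes"}, {"chest x-ray"}, False),
    rec({"female", "diabetes"}, set(), True),
    rec({"female", "diabetes"}, set(), True),
    rec({"male", "copd"}, {"emergency room"}, True),
    rec({"male", "copd"}, set(), False),
]
rules = generate_rules(records)
print(rules)
```

In this toy data only the chest X-ray rule survives, because the emergency-room pair is associated with a higher readmission rate than its profile alone.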
  • 37. Recommendation for New Patient (pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation)
      • Steps: compute the similarity between the established attribute profile and a given patient profile; identify the rules whose established attributes are most similar to the patient input; recommend the interventions extracted from those rules
      • Inputs: new patient attributes and the rule repository. The rule repository is HUGE (over 30k rules), so a parallel solution is used
      • Top 3 intervention plans: Plan 1: Major Operating Room, Chest X-ray, and others. Plan 2: Echocardiology, CCU, and others. Plan 3: Emergency Room and others
      • Summarized intervention plan: Major Operating Room, Echocardiology, Chest X-ray, and others
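A minimal sketch of the recommendation step follows. The slide does not say which similarity measure or summarization method is used, so Jaccard similarity over attribute sets and a simple union of the top plans are assumptions here. The rules, the patient, and the name `recommend` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(rules, patient, top_n=3):
    """Rank rules by profile similarity to the patient and merge the top plans."""
    ranked = sorted(rules, key=lambda r: jaccard(r["profile"], patient), reverse=True)
    top = ranked[:top_n]
    summary = set().union(*(r["interventions"] for r in top))
    return top, summary

# Hypothetical rule repository and patient profile.
rules = [
    {"profile": {"female", "diabetes"}, "interventions": {"major operating room", "chest x-ray"}},
    {"profile": {"male", "hypertension"}, "interventions": {"echocardiology", "ccu"}},
    {"profile": {"female", "depression"}, "interventions": {"cardiac catheterization lab"}},
    {"profile": {"male", "copd"}, "interventions": {"ct scan"}},
]
patient = {"female", "depression", "hypertension"}
top, summary = recommend(rules, patient)
print([sorted(r["profile"]) for r in top])
print(sorted(summary))
```

Because every rule is scored independently, the ranking step splits naturally across workers, which fits the slide's remark that a repository of over 30k rules calls for a parallel solution.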
  • 38. Validation: Data Highlights (pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation)
      • State Inpatient Database (SID) of Washington State, heart failure cohort: year 2010 (67,967 patients) for training and 2011 (52,021 patients) for testing
      • 3,908 diagnosis codes and 2,049 procedure codes are involved (high dimensional)
      • Feature selection is performed using the chi-square test
      • Demographics: age, gender, race
      • Comorbidity & Diagnosis: 21 comorbidities and 90 diagnoses
      • Utilization & Interventions: 21 health service utilization flags and 70 interventions
      • Others: length of stay, number of diagnoses and interventions
      • Evaluation: extract the patients from the test set who were not readmitted within 30 days; compute the evaluation metrics between the recommended interventions and the actual interventions
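The comparison between recommended and actual interventions can be illustrated with set-based metrics. The sketch below computes the Jaccard index and a true positive rate for one patient. The exact metric definitions used in the study are not given on the slide, so these are the standard set definitions and the intervention names are made up.

```python
def jaccard_index(recommended, actual):
    """Size of the overlap divided by size of the union of the two sets."""
    union = recommended | actual
    return len(recommended & actual) / len(union) if union else 1.0

def true_positive_rate(recommended, actual):
    """Fraction of the actual interventions that were also recommended."""
    return len(recommended & actual) / len(actual) if actual else 1.0

# Hypothetical non-readmitted test patient.
recommended = {"ct scan", "echocardiology", "ccu"}
actual = {"ct scan", "echocardiology", "chest x-ray"}

print(jaccard_index(recommended, actual))
print(true_positive_rate(recommended, actual))
```

Averaging these values over all non-readmitted test patients would give cohort-level figures of the kind plotted on the results slide.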
  • 39. Validation: Experiment Results (pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation). [Four bar charts compare Linear Regression, Hill-Climbing, Grow-Shrink, and Hybrid on: Hits (axis 0 to 400), Jaccard Index (axis 0.34 to 0.40), Accuracy (axis 0.93 to 0.942), and True Positive Rate (axis 0.45 to 0.65).]
  • 40. Back to the Chronic Readmission Case. Patient: 75-year-old female with chronic pulmonary disease, depression, hypertension, and diastolic heart failure. Outcome: No readmit! Two intervention lists are shown, both reading: cardiac catheterization lab, CT scan, echocardiology, echocardiogram.
  • 41. Accountable Care Organizations: Cost/Charge Prediction. Reference: J. Marquardt, S. Newman, D. Hattarki, R. Srinivasan, S. Sushmita, P. Ram, V. Prasad, D. Hazel, A. Ramesh, M. De Cock, and A. Teredesai, "HealthSCOPE: An Interactive Distributed Data Mining Framework for Scalable Prediction of Healthcare Costs," IEEE Data Mining Conference Demo Track, 2014 IEEE International Conference on Data Mining Workshops, pp. 1227-1230, DOI: 10.1109/ICDMW.2014.45
  • 42. Motivation: ACO Cost Prediction
      • Questions: What are the healthcare costs for the assigned population in 2015? Why is the cost so high or low? How is the cost distributed across demographics?
      • Data: demographics, diagnosis codes, procedure codes, drugs, lab results, charges, vitals. Claims sources: SID, OSHPD, MEPS. Clinical source: MultiCare collaboration
      • Data science: population predictive modeling, individual predictive modeling
      • Applications: feature prioritization, health prediction, care management
      • Chandola et al., KDD 2013
  • 43. Cost/Charge Prediction: Problem Description
      • Goal: predict the future healthcare cost of individuals based on their past medical and cost information
      • Supervised machine learning problem
      • Input: previous health information (e.g. diagnoses, comorbidities); general demographics (age, gender, race); previous healthcare cost; {X} = (x1, x2, x3, ..., xp)
      • Output: Y = future healthcare cost
  • 44. Four Scenarios for Predicting Cost
      • Three months of historical data (medical, demographic, and cost) → cost of following nine months (1Q)
      • Six months of historical data (medical, demographic, and cost) → cost of following six months (2Q)
      • Nine months of historical data (medical, demographic, and cost) → cost of following three months (3Q)
      • Twelve months of historical data (medical, demographic, and cost) → cost of following twelve months (4Q)
  • 45. Non-Gaussian Distribution of Healthcare Costs. This makes cost prediction a challenging and interesting research problem.
  • 46. Existing Cost Prediction Methods
      • Limited to rule-based or multiple linear regression methods
      • Rule-based methods: require domain knowledge; expensive
      • Multiple linear regression: multicollinearity issue; sensitive to extreme values (outliers)
      • Evaluation: estimates the mean cost of the given sampling distribution; often in-sample data is used to report predictive performance; R² as evaluation metric (not a true indicator)
  • 47. Our Contributions
      • Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem
      • We empirically evaluate three algorithms: Regression Trees, M5 Model Trees, Random Forest
  • 48. Regression Tree. [Diagram: a tree with yes/no splits on "Age > 60?", "Has Asthma?", and "Gender = Female?"; leaf predictions are 21,000, 46,000, 62,000, and 85,000.]
  • 49. M5 Model Tree. [Diagram: a tree with yes/no splits on "Age > 60?", "Has Asthma?", and "Gender = Female?".]
  • 50. Random Forest. [Diagram: three trees with yes/no splits, each with leaf predictions 21,000, 46,000, 62,000, and 85,000. Tree 1 splits on "Had Procedure X?", "Age > 18?", "Gender = Male?"; tree 2 on "# Admits > 3?", "Race = White?", "Has CHF?"; tree 3 on "Age > 60?", "Has Asthma?", "Gender = Female?".]
  • 51. Evaluation Metrics
      • Mean Absolute Error (MAE)
      • Root Mean Squared Error (RMSE)
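For reference, the two metrics have the standard definitions sketched below. The cost figures are illustrative values only, not data from the study.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Illustrative annual costs in dollars (made up).
actual = [21000, 46000, 62000, 85000]
predicted = [20000, 50000, 62000, 75000]

print(mae(actual, predicted))
print(round(rmse(actual, predicted), 2))
```

RMSE weights large misses more heavily than MAE, so with skewed cost data a few very expensive patients can push RMSE well above MAE.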
  • 52. MAE Results: SID Data (3Q Scenario). [Bar chart: MAE ($), axis 0 to 30,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced models: Regression Tree, Random Forest, Model Tree.]
  • 53. MAE Results: MEPS Data. [Bar chart: MAE ($), axis 0 to 14,000. Baselines: Average Baseline, Previous Cost Regression, Multiple Linear Regression. Advanced models: Regression Tree, Random Forest, Model Tree.]
  • 54. Prediction Error Results: M5 Model Trees. [Chart: Error ($), axis 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q, and 4Q scenarios.]
  • 55. Error Distribution: WA State SID Data. For a large fraction of the population (75%), these algorithms predicted cost with higher accuracy. [Plot: maximum prediction error ($), axis 0 to 18,000, against portion of population (0% to 75%) for Multiple Linear Regression, Regression Tree, Random Forest, and Model Tree.]
  • 56. Sub-Population Cost Prediction. Prediction of future healthcare cost for the population and for sub-populations: congestive heart failure (CHF), diabetes, COPD, asthma, coronary artery disease (CAD), age 65+.
  • 57. Most difficult cohort to predict. [Bar chart: MAE ($), axis 0 to 35,000, for model trees and linear regression across the cohorts Asthma, Diabetes, CHF, COPD, Coronary, and Over 65.]
  • 58. Engineering the Solutions: Risk-O-Readmission & Cost-as-a-Service
  • 59. ---------- Forwarded message ----------
      From: Windows Azure Pass System Admin <wapadmin@microsoft.com>
      Date: Thu, Nov 7, 2013 at 10:50 AM
      Subject: Gifting Letter for Windows Azure Research Pass
      To: "Ankur M. Teredesai" <ankurt@uw.edu>
      Cc: "Azure4Research (RFP External)" <azurerfp@microsoft.com>
      Dear Ankur M. Teredesai, We have approved your application for a Windows Azure Research Pass Grant. In order to receive your pass, download the Microsoft gifting letter from the following link:
  • 61. Scale Issues: Cost Prediction as a Service
      • Data sources: WA-SID claims / MEPS survey (for training)
      • Cost Prediction Engine: model bank deployed on ADAPA (A: Linear Regression, B: Regression Trees, C: M5 Model Trees); big data stack: R, Spark
      • Cost Prediction API with model selector
      • Web app for ACOs: ① beneficiary claims (population batch/individual) → ② individual beneficiary feature vector → ③ individual beneficiary predicted cost → ④ predicted, previous-year, and historic population costs plus population statistics
      • Web app for individuals: ① beneficiary claims for the individual → ④ predicted cost for the individual
  • 62. Cost Prediction as a Service
      • Data sources: WA-SID claims / MEPS survey (for training)
      • Cost Prediction Engine: model bank deployed on ADAPA (A: Linear Regression, B: Regression Trees, C: M5 Model Trees); big data stack: R, Spark
      • Cost Prediction API with model selector
      • Web app for ACOs: ① beneficiary claims (population batch/individual) → ② individual beneficiary feature vector → ③ individual beneficiary predicted cost → ④ predicted, previous-year, and historic population costs plus population statistics
      • Web app for individuals: ① beneficiary claims for the individual → ④ predicted cost for the individual
  • 63. Apache Spark. [Diagram: a Spark driver and master coordinate two slave nodes. Each slave runs Spark and holds one in-memory partition of the RDD (partition 1 and partition 2). Beneath them, HDFS stores the data partitions together with replica partitions.]
  • 64. Weighted k-NN for Regression. [Diagram: the test instance is sent to node 1 (data partition 1) and node 2 (data partition 2). Each node computes its local k nearest neighbors (kNN1, kNN2). The resulting 2k neighbors are grouped and sorted, the top k are kept, and their weighted average is the predicted cost.]
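The data flow on this slide can be sketched on a single machine, with a list of arrays standing in for the Spark partitions. Euclidean distance and inverse-distance weights are assumptions, since the slide does not specify either. The names `local_knn` and `predict_cost` and the toy data are hypothetical.

```python
import heapq
import numpy as np

def local_knn(X, y, query, k):
    """k nearest neighbors of `query` within one partition, as (distance, cost) pairs."""
    d = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(d)[:k]
    return [(float(d[i]), float(y[i])) for i in idx]

def predict_cost(partitions, query, k, eps=1e-9):
    """Merge per-partition candidates, keep the global top k, and average by inverse distance."""
    candidates = []
    for X, y in partitions:                 # on a cluster, each partition runs on its own node
        candidates.extend(local_knn(X, y, query, k))
    top_k = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    weights = np.array([1.0 / (d + eps) for d, _ in top_k])
    costs = np.array([c for _, c in top_k])
    return float(np.sum(weights * costs) / np.sum(weights))

# Toy data: one feature per patient, split across two partitions.
part1 = (np.array([[0.0], [10.0]]), np.array([100.0, 1000.0]))
part2 = (np.array([[2.0], [20.0]]), np.array([300.0, 2000.0]))
query = np.array([1.0])

pred = predict_cost([part1, part2], query, k=2)
print(pred)
```

Taking k neighbors from every partition guarantees that the global top k are among the merged candidates, so the result matches a single-machine k-NN over the full data.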
  • 65. Rough Set
      • Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies.
      • Rough set theory handles discrete attributes.
      • Lower approximation: instances that necessarily belong to the class
      • Upper approximation: instances that possibly belong to the class
      • Example table (columns: Patient, Age ≥ 50, Alcohol Disorder Visit, Cost):
          P1, Yes, Yes, High
          P2, Yes, Yes, High
          P3, Yes, No, Low
          P4, Yes, No, High
          P5, No, No, Low
          P6, No, Yes, High
      • P3 and P4 are similar patients but belong to different classes!
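The lower and upper approximations for the six-patient table on this slide can be computed directly. The sketch below groups patients with identical attribute values and applies the standard rough-set definitions. The function name `approximations` is hypothetical.

```python
from collections import defaultdict

# (Age >= 50, Alcohol Disorder Visit) -> Cost class, as in the slide's table.
patients = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}

def approximations(data, target):
    """Lower and upper approximation of the class `target`."""
    blocks = defaultdict(set)               # patients that are indistinguishable by attributes
    for name, (attrs, _) in data.items():
        blocks[attrs].add(name)
    members = {n for n, (_, label) in data.items() if label == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:                # every patient in the block has the class
            lower |= block
        if block & members:                 # at least one patient in the block has the class
            upper |= block
    return lower, upper

lower, upper = approximations(patients, "High")
print(sorted(lower))   # ['P1', 'P2', 'P6']
print(sorted(upper))   # ['P1', 'P2', 'P3', 'P4', 'P6']
```

P3 and P4 share the same attribute values but differ in cost class, so they fall in the upper approximation of "High" only. The gap between the two approximations is the inconsistency the slide points out.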
  • 66. Fuzzy Rough Set
      • Uses fuzzy logic to handle continuous attributes.
      • The similarity matrix contains values between 0 and 1.
      • Inconsistent instances are highly related but have a different class.
      • Example table (columns: Patient, Age, Alcohol Disorder Visits, Cost):
          P1, 52, 1, $13335
          P2, 59, 4, $277966
          P3, 55, 0, $8139
          P4, 50, 0, $66058
          P5, 34, 0, $5815
          P6, 26, 1, $38526
      • Similarity matrix (rows and columns P1 to P6):
          P1: 1.00 0.52 0.83 0.84 0.60 0.61
          P2: 0.52 1.00 0.44 0.36 0.12 0.13
          P3: 0.83 0.44 1.00 0.92 0.68 0.44
          P4: 0.84 0.36 0.92 1.00 0.76 0.51
          P5: 0.60 0.12 0.68 0.76 1.00 0.75
          P6: 0.61 0.13 0.44 0.51 0.75 1.00
  • 67. Fuzzy Rough Set
      • Let r_{j,i} be the degree of similarity of instances i and j.
      • Let c_i be the degree to which instance i belongs to the class.
      • Then the degree to which instance j belongs to the lower approximation of the class is min{max(1 - r_{j,i}, c_i) | i = 1, ..., n}
      • and the degree to which it belongs to the upper approximation of the class is max{min(r_{j,i}, c_i) | i = 1, ..., n}
      • Current implementations can handle only up to 100,000 instances because they keep the similarity matrix in memory.
  • 68. Fuzzy Rough Set: upper approximation max{min(r_{j,i}, c_i) | i = 1, ..., n}
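The two formulas can be evaluated with NumPy on the 6 x 6 similarity matrix shown on the earlier Fuzzy Rough Set slide. The class membership vector `c` below is an assumption made for illustration: it marks P2, P4, and P6 as "high cost" with crisp 0/1 memberships, which the slides do not specify.

```python
import numpy as np

# Similarity matrix r[j, i] for patients P1..P6, copied from the slide.
R = np.array([
    [1.00, 0.52, 0.83, 0.84, 0.60, 0.61],
    [0.52, 1.00, 0.44, 0.36, 0.12, 0.13],
    [0.83, 0.44, 1.00, 0.92, 0.68, 0.44],
    [0.84, 0.36, 0.92, 1.00, 0.76, 0.51],
    [0.60, 0.12, 0.68, 0.76, 1.00, 0.75],
    [0.61, 0.13, 0.44, 0.51, 0.75, 1.00],
])

# Hypothetical class memberships c_i ("high cost"): P2, P4, P6 belong, the rest do not.
c = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# lower_j = min_i max(1 - r[j, i], c[i]);  upper_j = max_i min(r[j, i], c[i])
lower = np.min(np.maximum(1.0 - R, c[None, :]), axis=1)
upper = np.max(np.minimum(R, c[None, :]), axis=1)

for name, lo, up in zip(["P1", "P2", "P3", "P4", "P5", "P6"], lower, upper):
    print(name, round(float(lo), 2), round(float(up), 2))
```

Under this assumed labeling, P1 has lower membership 0 because it is not itself in the class, and upper membership 0.84 because its most similar class member, P4, has similarity 0.84.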
  • 70. Implementation
      • The construction of the similarity matrix can be done in parallel, with each of K compute nodes calculating n/K columns of the similarity matrix.
      • There is no need to store the similarity matrix as a whole.
      • The construction of the similarity matrix does not have to be finished before (partial) computation of the lower and upper approximations can begin.
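The column-wise scheme can be sketched as follows, with a sequential loop standing in for the K compute nodes. Each block of columns is built, folded into running minima and maxima, and then discarded, so the full matrix is never held in memory. The similarity function here (1 minus the mean range-normalized absolute difference) is an assumption. It happens to reproduce the values on the slide's example matrix to two decimals, but the slides do not state which function was used. The class memberships `c` are hypothetical.

```python
import numpy as np

def similarity_block(X, cols):
    """Similarity of every instance to the instances in `cols` (an n x len(cols) block)."""
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                       # avoid division by zero for constant attributes
    diff = np.abs(X[:, None, :] - X[cols][None, :, :]) / span
    return 1.0 - diff.mean(axis=2)

def approximations_chunked(X, c, n_nodes):
    """Fuzzy rough lower/upper approximations computed one column block at a time."""
    n = len(X)
    lower, upper = np.ones(n), np.zeros(n)
    for cols in np.array_split(np.arange(n), n_nodes):   # one block per compute node
        if cols.size == 0:
            continue
        R = similarity_block(X, cols)
        lower = np.minimum(lower, np.maximum(1.0 - R, c[cols]).min(axis=1))
        upper = np.maximum(upper, np.minimum(R, c[cols]).max(axis=1))
    return lower, upper

# Age and alcohol disorder visits for P1..P6 from the Fuzzy Rough Set example table.
X = np.array([[52, 1], [59, 4], [55, 0], [50, 0], [34, 0], [26, 1]], dtype=float)
c = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # hypothetical class memberships

lower, upper = approximations_chunked(X, c, n_nodes=3)
print(np.round(lower, 2), np.round(upper, 2))
```

Because min and max are associative, the per-block partial results can be combined in any order, which is what lets the approximation step start before the whole matrix has been built.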
  • 72. Spark  vs MPI               Fuzzy  Rough  Set
  • 73. Cost Prediction as a Service
      • Data sources: WA-SID claims / MEPS survey (for training)
      • Cost Prediction Engine: model bank deployed on ADAPA (A: Linear Regression, B: Regression Trees, C: M5 Model Trees); big data stack: R, Spark
      • Cost Prediction API with model selector
      • Web app for ACOs: ① beneficiary claims (population batch/individual) → ② individual beneficiary feature vector → ③ individual beneficiary predicted cost → ④ predicted, previous-year, and historic population costs plus population statistics
      • Web app for individuals: ① beneficiary claims for the individual → ④ predicted cost for the individual
  • 74. Readmission Application
      • Platforms: Android, Windows Phone
      • Patient view: what is my risk?
      • Doctor view: who are my risky patients? Alerts; interventions
  • 76. Milestones: Readmission Risk (AUC: Area Under Curve, used as the accuracy measure). Timeline entries as shown: 0.6 AUC, Yale Model (baseline); 0.64 AUC, UW 2012 result, ensemble method, hierarchical classification, Dec 2012; 0.74 AUC, UW 2014 result, lab results + new algorithm (AdaBoost), Feb 2014; QlikView readmission app, Dec 2013; machine learning process to target new chronic diseases, Aug 2014 -> moving forward; integrating care pathway, March 2014; Bayesian network learning; real-time care factors & pathways with EPIC, July 2014; Risk-o-Meter development + big data efforts, June 2013. Data stages: pre-admission (clinical data); post-admission (clinical data); post-discharge (clinical data); post-discharge (claim data). Publications: IEEE Big Data (REF #3); KDD (REF #1 & 2); HEALTHINF (REF #4 & 5); KDD (REF #6); ICDM 2014 (REF #6).
  • 77. Milestones: Cost Prediction. Timeline entries as shown: problem exploration; H-SCOPE I, SID data, June 2014; H-SCOPE II, August 2014; H-SCOPE III, ADAPA scoring engine, Spark framework, Sept. 2014; H-SCOPE IV, SID + MEPS data, Nov. 2014; H-SCOPE V, five cohorts, Dec. 2014; HealthSCOPE VI, July 2015; H-SCOPE VII; Aug 2015 -> moving forward. Other labels: admit level; beneficiary level; beneficiary view; population view (ACO); OSHPD data; application; four future scenarios; M5 model trees, random forest, regression trees; sub-population; deep learning; time & cost of hospital readmission; AHRQ private data; time, cost, and illness (alignment) prediction. Venues: ICDM 2014; KDD-2015; AMIA-2015; WWW-Digital Health-2015.
  • 78. Milestones: Merging Threads. Timeline from 2012 to 2016 and beyond: Risk of Readmission (clinical, sociological & claims), 2013 to 2015; Cost Prediction (claims and secondary data sources), 2014 to 2015; Risk & Cost Convergence, 2015.
  • 79. Raw data from sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, industry). Formats: flat files (CSV), claims (X12), clinical (HL7), unstructured CCD, Census (US Gov). Ingestion: ETL tools, Sqoop. Platform: HDFS, NUMA, MPI, Grappa. Distance Compute Library: instance selection (RNGE, Drop 3), fuzzy rough set approximation. Algorithms: random forests, KNN, Bayesian networks, support vector machines. Solutions: CHF risk of readmission, cost of chronic interventions, age/gender prediction, malware analytics, personalized cancer therapy, geo routing, other solutions. Industry partners and domain experts.
  • 80. Raw data from sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, industry). Formats: flat files (CSV), claims (X12), clinical (HL7), unstructured CCD, Census (US Gov). Ingestion: ETL tools, Sqoop. Platform: HDFS, NUMA, MPI, Grappa. Distance Compute Library: instance selection (RNGE, Drop 3), fuzzy rough set approximation. Algorithms: random forests, KNN, Bayesian networks, support vector machines. Solutions: personalized cancer therapy, cost of chronic interventions, age/gender prediction, malware analytics, CHF risk of readmission, geo routing, other solutions. Industry partners and domain experts.
  • 81. Our Sincere Thanks for Your Support!