2. © 2014 MapR Technologies 2
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
3. © 2014 MapR Technologies 3
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
4. © 2014 MapR Technologies 4
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real-world business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
5. © 2014 MapR Technologies 5
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
6. © 2014 MapR Technologies 6
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monty opens one of the remaining doors to show that it is empty
• You can switch at this point to the other door or not
• Should you switch?
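A quick simulation is often more convincing than the math. The sketch below is not the code from the live demo; it assumes Monty always opens an empty door that you did not pick, and the function names `play` and `win_rate` are made up for illustration.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: Monty always opens a door that is neither the pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the original pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Fraction of rounds won when always switching (or always staying)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying should win about one time in three and switching about two times in three.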
10. © 2014 MapR Technologies 10
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
11. © 2014 MapR Technologies 11
Live Coding
With REAL Chaos
13. © 2014 MapR Technologies 13
Geo-coding
• Some databases have disk locality that follows key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
14. © 2014 MapR Technologies 14
Space Filling Curve
[Figure: a square split into quadrants numbered 0 to 3, with each quadrant split again into sub-quadrants numbered 0 to 3]
15. © 2014 MapR Technologies 15
Space Filling Curve
[Figure: the quadrant numbers 0 to 3 repeated at a finer level of subdivision]
16. © 2014 MapR Technologies 16
Z-coding – Interleave Bits
[Figure: grid cells labeled with two-bit codes (00, 01, 10, 11) at each level of subdivision]
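Bit interleaving can be sketched in a few lines. This is an illustration only: the slides do not say whether the x bit or the y bit comes first in each pair, so the order used here (y, then x) is an assumption, as are the names `z_encode` and `z_key`.

```python
def z_encode(x, y, bits):
    """Interleave the low `bits` bits of x and y, most significant bit first."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((y >> i) & 1)  # assumed order: y bit first
        code = (code << 1) | ((x >> i) & 1)  # then the x bit
    return code


def z_key(x, y, bits):
    """Format the code as dotted two-bit groups, one group per subdivision level."""
    code = z_encode(x, y, bits)
    groups = [(code >> (2 * i)) & 3 for i in reversed(range(bits))]
    return ".".join(format(g, "02b") for g in groups)


if __name__ == "__main__":
    print(z_key(4, 4, 3), z_key(5, 4, 3))  # adjacent cells, shared prefix
    print(z_key(3, 0, 3), z_key(4, 0, 3))  # adjacent cells, no shared prefix
```

The second pair of cells is adjacent in the plane but sits on either side of a top-level quadrant boundary, so the keys differ in the first group.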
17. © 2014 MapR Technologies 17
Neighbors Often Share Prefix
[Figure: the same grid with three points labeled by key: 00.11.11, 10.01.01, 00.11.01]
18. © 2014 MapR Technologies 18
Often, not always
[Figure labels: Close, Far]
19. © 2014 MapR Technologies 19
Random Sampling to Derive Keys
[Figure: the Z-code grid]
20. © 2014 MapR Technologies 20
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10"
[Figure: the Z-code grid]
22. © 2014 MapR Technologies 22
"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10"
[Figure: the Z-code grid]
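One way to read the last few slides: sample random points inside the search region, turn each into a Z-code key, then coalesce the sorted keys into ranges to scan. The sketch below follows that reading. The unit-square coordinates, the y-then-x bit order, the `max_gap` bridging rule, and all function names are assumptions made for illustration, not details taken from the talk.

```python
import random


def z_encode(x, y, bits):
    """Interleave the bits of x and y (assumed order: y bit, then x bit)."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((y >> i) & 1)
        code = (code << 1) | ((x >> i) & 1)
    return code


def format_key(code, bits):
    """Render a code as dotted two-bit groups, e.g. 27 -> '01.10.11'."""
    groups = [(code >> (2 * i)) & 3 for i in reversed(range(bits))]
    return ".".join(format(g, "02b") for g in groups)


def sample_keys(inside, bits, samples, rng):
    """Sample points in the unit square; keep the codes of those inside the region."""
    n = 1 << bits
    codes = set()
    for _ in range(samples):
        px, py = rng.random(), rng.random()
        if inside(px, py):
            codes.add(z_encode(int(px * n), int(py * n), bits))
    return sorted(codes)


def to_ranges(codes, max_gap=1):
    """Coalesce sorted codes into (low, high) ranges, bridging gaps up to max_gap."""
    ranges = []
    for c in codes:
        if ranges and c - ranges[-1][1] <= max_gap:
            ranges[-1][1] = c
        else:
            ranges.append([c, c])
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical search region: a disc of radius 0.2 centered at (0.3, 0.3).
    in_disc = lambda px, py: (px - 0.3) ** 2 + (py - 0.3) ** 2 < 0.04
    for low, high in to_ranges(sample_keys(in_disc, 3, 2000, rng)):
        print(format_key(low, 3), "-", format_key(high, 3))
```

A larger `max_gap` gives fewer, wider ranges at the cost of scanning some keys outside the region, so the results still need a final distance check.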
24. © 2014 MapR Technologies 24
• 4-bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
25. © 2014 MapR Technologies 25
[Figure: signal plotted against Time (0 to 6), vertical axis from −4 to 4]
26. © 2014 MapR Technologies 26
The Shape of the Noise
[Figure: Frequency (0 to 3000) plotted against Noise (−0.4 to 0.4)]
27. © 2014 MapR Technologies 27
The Effect After Averaging
[Figure: signal plotted against Time (0 to 6), vertical axis from −4 to 4]
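The averaging effect can be reproduced with a toy quantizer. The sketch below is not the signal or noise from the slides: it assumes a quantizer step of 1, a sine wave of amplitude 0.4 (too quiet to survive rounding), and uniform dither one step wide. All names are illustrative.

```python
import math
import random

STEP = 1.0  # quantizer step size (assumed)


def quantize(v):
    """Round to the nearest multiple of STEP."""
    return STEP * math.floor(v / STEP + 0.5)


def signal(t):
    """A sine wave whose amplitude is below half a quantizer step."""
    return 0.4 * STEP * math.sin(t)


def averaged_output(ts, repeats, dither, rng):
    """Quantize the signal `repeats` times per sample and average the results."""
    out = []
    for t in ts:
        total = 0.0
        for _ in range(repeats):
            # Uniform noise one step wide, added before quantization.
            noise = rng.uniform(-STEP / 2, STEP / 2) if dither else 0.0
            total += quantize(signal(t) + noise)
        out.append(total / repeats)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    ts = [i * 0.5 for i in range(13)]
    plain = averaged_output(ts, 1, False, rng)
    dithered = averaged_output(ts, 2000, True, rng)
    for t, p, d in zip(ts, plain, dithered):
        print(f"t={t:4.1f}  true={signal(t):+.3f}  plain={p:+.3f}  dithered={d:+.3f}")
```

Without dither every sample rounds to zero and the signal is lost. With dither each individual sample is noisier, but the average tracks the true value.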
29. © 2014 MapR Technologies 29
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
30. © 2014 MapR Technologies 30
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) – mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
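The agenda names Thompson sampling, so here is a minimal sketch of it for this kind of problem. It assumes each option pays 0 or 1 (Bernoulli payoffs) and starts from a uniform Beta(1, 1) prior; the payoff rates and function names are invented for illustration.

```python
import random


def thompson_pick(wins, losses, rng):
    """Draw a plausible payoff rate for each option from its Beta posterior
    and pick the option with the largest draw."""
    draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(draws)), key=draws.__getitem__)


def run(true_rates, rounds, seed=0):
    """Simulate the bandit; return win counts, loss counts, and total regret."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins, losses = [0] * k, [0] * k
    best = max(true_rates)
    regret = 0.0
    for _ in range(rounds):
        i = thompson_pick(wins, losses, rng)
        if rng.random() < true_rates[i]:
            wins[i] += 1
        else:
            losses[i] += 1
        # Regret as defined on the slide: mean(best) - mean(yours).
        regret += best - true_rates[i]
    return wins, losses, regret


if __name__ == "__main__":
    # Hypothetical payoff rates for purple, blue, green, red, yellow.
    wins, losses, regret = run([0.1, 0.2, 0.3, 0.4, 0.6], rounds=5000)
    print("pulls per option:", [w + l for w, l in zip(wins, losses)])
    print("total regret:", round(regret, 1))
```

The randomness does the exploring: options with little data have wide posteriors and occasionally produce the largest draw, while options that are clearly bad are rarely picked again.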
33. © 2014 MapR Technologies 33
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDOS ,IR.TDOS ,IR.member_Nbr
,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
bt.bill_type, cnt.contract_nbr as CONTR_,
ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
derr.error_desc as DG_ERRDESC
FROM oicpcuhg.ir_encounter enc
Can You See the Problem?
34. © 2014 MapR Technologies 34
INNER JOIN oicpcuhg.ir_encountertype typ
ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
ON (
m.hp_cd = enc.hp_cd
AND m.member_source_cd = enc.member_source_cd
AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
ON (eerror.encounter_key = enc.encounter_key and
eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
ON (prv.hp_cd = enc.hp_cd and
prv.provider_source_cd = enc.provider_source_cd and
prv.provider_nbr = enc.provider_nbr)
35. © 2014 MapR Technologies 35
LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
ON (derror.diagnosis_key = diag.diagnosis_key and
derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
ON (trim(umr.member_nbr) = IR.member_Nbr AND
trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
trim(umr.diag1) = IR.diag_cd)
36. © 2014 MapR Technologies 36
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
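A summary like that is enough to generate stand-in data with the same shape. The sketch below assumes a hypothetical summary format (column name, type, and a few statistics); the `SUMMARY` contents and function names are invented, and independent simple distributions are only a starting point.

```python
import random

# Hypothetical summary a customer might share: no real rows, only shape and statistics.
SUMMARY = [
    {"name": "member_id", "type": "int", "min": 1, "max": 99999},
    {"name": "amount", "type": "float", "mean": 120.0, "stddev": 35.0},
    {"name": "status", "type": "category", "values": {"A": 0.7, "B": 0.2, "C": 0.1}},
]


def fake_value(col, rng):
    """Draw one value that matches the summarized type and statistics."""
    if col["type"] == "int":
        return rng.randint(col["min"], col["max"])
    if col["type"] == "float":
        return rng.gauss(col["mean"], col["stddev"])
    if col["type"] == "category":
        labels = list(col["values"])
        weights = list(col["values"].values())
        return rng.choices(labels, weights=weights, k=1)[0]
    raise ValueError("unknown column type: " + col["type"])


def fake_rows(summary, n, seed=0):
    """Generate n fake rows, one dict per row, keyed by column name."""
    rng = random.Random(seed)
    return [{c["name"]: fake_value(c, rng) for c in summary} for _ in range(n)]


if __name__ == "__main__":
    for row in fake_rows(SUMMARY, 3):
        print(row)
```

Each column is drawn independently here, so any correlation between columns in the real data is lost.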
37. © 2014 MapR Technologies 37
Bug Replication Without Security Violation
[Figure: two columns, Customer and You; the Customer has the Data, You have Fake data, and both sides are labeled with the same parameters x, y, α, ξ]
38. © 2014 MapR Technologies 38
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
39. © 2014 MapR Technologies 39
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
40. © 2014 MapR Technologies 40
Last October: Time Series Databases
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
41. © 2014 MapR Technologies 41
Coming in February: Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
42. © 2014 MapR Technologies 42
Thank you for coming today!
Editor's notes
Talk track: 2nd in the series; the first was on how to build a simple recommender. This one, on anomaly detection, is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
Talk track: ELLEN: New ways to do it that take into account real-world business goals, realistic resources, new types of data and best time to value…