November 2017 version of the talk 'How Artificial Intelligence Will Innovate the Medicine of the Future'
Yoon Sup Choi, PhD (Director/Founder, Digital Healthcare Institute)
yoonsup.choi@gmail.com
1. How Artificial Intelligence Will Innovate Medicine in the Future
Professor, SAHIST, Sungkyunkwan University
Director, Digital Healthcare Institute
Yoon Sup Choi, Ph.D.
How will artificial intelligence innovate medicine?
2. “It's in Apple's DNA that technology alone is not enough.
It's technology married with liberal arts.”
16. •AP: robots write news articles in place of human journalists
•Capable of producing 2,000 articles per second
•Coverage expanded from the earnings of 300 companies ➞ 3,000 companies
17. • 1978
• As part of the obscure task of “discovery” —
providing documents relevant to a lawsuit — the
studios examined six million documents at a
cost of more than $2.2 million, much of it to pay
for a platoon of lawyers and paralegals who
worked for months at high hourly rates.
• 2011
• Now, thanks to advances in artificial intelligence,
“e-discovery” software can analyze documents
in a fraction of the time for a fraction of the
cost.
• In January, for example, Blackstone Discovery of
Palo Alto, Calif., helped analyze 1.5 million
documents for less than $100,000.
18. “At its height back in 2000, the U.S. cash equities trading desk at
Goldman Sachs’s New York headquarters employed 600 traders,
buying and selling stock on the orders of the investment bank’s
large clients. Today there are just two equity traders left”
19. •Fukoku Mutual Life Insurance in Japan decided to lay off more than 30 employees
who assessed insurance payouts and hand the work to IBM Watson Explorer
•Watson judges whether a claim should be paid based on the medical records
•The switch to AI is expected to raise productivity by 30%
•ROI expected within 2 years
•Year 1: 140M yen
•Year 2: 200M yen
24. •Artificial Narrow Intelligence (ANI)
• AI that excels in one specific domain
• Chess, quiz shows, e-mail filtering, product recommendation, autonomous driving
•Artificial General Intelligence (AGI)
• Human-level AI across all domains
• Reasoning, planning, problem solving, abstraction, learning complex concepts
•Artificial Super Intelligence (ASI)
• AI that surpasses humans in every domain, including science, technology, and social skills
• "Any sufficiently advanced technology is indistinguishable from magic." - Arthur C. Clarke
26. When will machines achieve human-level intelligence?
[Chart: cumulative proportion of survey respondents (10% / 50% / 90% thresholds) by predicted year, 2010-2100, for each survey and their combination]
Surveys: Philosophy and Theory of AI (PT-AI, 2011); Artificial General Intelligence (AGI, 2012); Greek Association for Artificial Intelligence (EETN); survey of the 100 most frequently cited authors (TOP100, 2013); Combined
Superintelligence, Nick Bostrom (2014)
27. Superintelligence: Science or Fiction?
Panelists: Elon Musk (Tesla, SpaceX), Bart Selman (Cornell), Ray Kurzweil (Google),
David Chalmers (NYU), Nick Bostrom (FHI), Demis Hassabis (DeepMind), Stuart
Russell (Berkeley), Sam Harris, and Jaan Tallinn (CSER/FLI)
January 6-8, 2017, Asilomar, CA
https://brunch.co.kr/@kakao-it/49
https://www.youtube.com/watch?v=h0962biiZa4
28. Superintelligence: Science or Fiction?
Panelists: Elon Musk (Tesla, SpaceX), Bart Selman (Cornell), Ray Kurzweil (Google),
David Chalmers (NYU), Nick Bostrom (FHI), Demis Hassabis (DeepMind), Stuart
Russell (Berkeley), Sam Harris, and Jaan Tallinn (CSER/FLI)
January 6-8, 2017, Asilomar, CA
Q: Is superintelligence a reachable domain? - All nine panelists: YES
Q: Do you think an entity with superintelligence can actually emerge? - All nine panelists: YES
Q: Do you hope superintelligence will be realized?
- YES: Ray Kurzweil, Nick Bostrom, Demis Hassabis
- It's complicated: Elon Musk, Stuart Russell, Bart Selman, David Chalmers, Sam Harris, Jaan Tallinn
https://brunch.co.kr/@kakao-it/49
https://www.youtube.com/watch?v=h0962biiZa4
31. Superintelligence, Nick Bostrom (2014)
Once strong AI at the human baseline is achieved,
the subsequent take-off to superintelligence
may take an extremely short time.
How far to superintelligence?
32. •Artificial Narrow Intelligence (ANI)
• AI that excels in one specific domain
• Chess, quiz shows, e-mail filtering, product recommendation, autonomous driving
•Artificial General Intelligence (AGI)
• Human-level AI across all domains
• Reasoning, planning, problem solving, abstraction, learning complex concepts
•Artificial Super Intelligence (ASI)
• AI that surpasses humans in every domain, including science, technology, and social skills
• "Any sufficiently advanced technology is indistinguishable from magic." - Arthur C. Clarke
40. •Analyzing complex medical data and deriving insights
•Analyzing/reading medical imaging and pathology data
•Monitoring continuous data for prevention/prediction
Medical applications of artificial intelligence
41. •Analyzing complex medical data and deriving insights
•Analyzing/reading medical imaging and pathology data
•Monitoring continuous data for prevention/prediction
Medical applications of artificial intelligence
44. 600,000 pieces of medical evidence
2 million pages of text from 42 medical journals and clinical trials
69 guidelines, 61,540 clinical trials
IBM Watson on Medicine
Watson learned...
+
1,500 lung cancer cases
physician notes, lab results and clinical research
+
14,700 hours of hands-on training
50. Annals of Oncology (2016) 27 (suppl_9): ix179-ix180. 10.1093/annonc/mdw601
Validation study to assess performance of IBM cognitive
computing system Watson for oncology with Manipal
multidisciplinary tumour board for 1000 consecutive cases:
An Indian experience
• MMDT (Manipal multidisciplinary tumour board) treatment recommendations and
data for 1,000 cases of 4 different cancers - breast (638), colon (126), rectum (124)
and lung (112) - treated over the last 3 years were collected.
• Of the treatment recommendations given by the MMDT, WFO classified
50% as REC, 28% as FC, and 17% as NREC
• Nearly 80% of the recommendations fell in the WFO REC and FC groups
• 5% of the treatments given by the MMDT were not available in WFO
• The degree of concordance varied depending on the type of cancer
• WFO-REC was highest in rectum (85%) and lowest in lung (17.8%)
• high with TNBC (67.9%); lower with HER2-negative (35%)
• WFO took a median of 40 sec to capture, analyze, and give the treatment
(vs a median of 15 min for the MMDT)
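As a sanity check on the figures above, the concordance buckets combine in a couple of lines; a minimal sketch in Python (the category shares are taken from the bullets, everything else is illustrative):

    # Share of MMDT recommendations falling in each Watson for Oncology bucket.
    buckets = {"REC": 0.50, "FC": 0.28, "NREC": 0.17, "not_available": 0.05}

    # "Concordant" is commonly defined as REC + FC (recommended or for consideration).
    concordance = buckets["REC"] + buckets["FC"]
    print(f"REC + FC concordance: {concordance:.0%}")  # 78%, i.e. "nearly 80%"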
51. San Antonio Breast Cancer Symposium, December 6-10, 2016
Concordance of WFO (@T2) and MMDT (@T1* vs T2**) (N = 638 breast cancer cases)
Time point | REC n (%) | REC + FC n (%)
T1*        | 296 (46%) | 463 (73%)
T2**       | 381 (60%) | 574 (90%)
* T1: time of the original treatment decision by the MMDT in the past (last 1-3 years)
** T2: time (2016) of WFO's treatment advice and of the MMDT's treatment decision upon blinded re-review of non-concordant cases
52. Sung Won Park, APFCP, 2017
Assessing the performance of Watson for Oncology using colon
cancer cases treated with surgery and adjuvant chemotherapy
at Gachon University Gil Medical Center
• Stage II (high-risk) and stage III colon cancer patients (N = 162)
• Retrospective study: September 1, 2014 to August 31, 2016
• Gachon University Gil Medical Center (GMC)
• GMC's actual treatments were generally concordant with WFO (REC + FC) in 83.3% of cases:
• WFO-REC: 53.1%
• WFO-FC: 30.2%
• WFO-NREC: 13.0%
• Not included in WFO: 3.7%
53. WFO in ASCO 2017
• Concordance assessment of a cognitive computing system in Thailand (Bumrungrad International Hospital)
• 211 patients, 2015-2016 (92 retrospective; 119 prospective)
• Concordance
• Overall: 83%
• Colorectal 89%, lung 91%, breast 76%, gastric 78%
54. WFO in ASCO 2017
• Early experience with the IBM WFO cognitive computing system for lung
and colorectal cancer treatment (Manipal Hospitals)
• Over the past 3 years: lung cancer (112), colon cancer (126), rectal cancer (124)
• Lung cancer: localized 88.9%, metastatic 97.9%
• Colon cancer: localized 85.5%, metastatic 76.6%
• Rectal cancer: localized 96.8%, metastatic 80.6%
55. WFO in ASCO 2017
• Early experience with the IBM WFO cognitive computing system for lung
and colorectal cancer treatment (Manipal Hospitals)
• Over the past 3 years: lung cancer (112), colon cancer (126), rectal cancer (124)
• Lung cancer: localized 88.9%, metastatic 97.9%
• Colon cancer: localized 85.5%, metastatic 76.6%
• Rectal cancer: localized 96.8%, metastatic 80.6%
Performance of WFO in India
2017 ASCO Annual Meeting, J Clin Oncol 35, 2017 (suppl; abstr 8527)
56. WFO in ASCO 2017
•Use of a cognitive computing system for treatment of colon and gastric
cancer in South Korea (Gachon University Gil Medical Center)
• 2012-2016
• 340 colon cancer patients (stage II-IV)
• 185 advanced gastric cancer patients (retrospective)
• Concordance
• All colon cancer patients (340): 73%
• 250 patients who received adjuvant chemotherapy: 85%
• 90 metastatic patients: 40%
• All gastric cancer patients: 49%
• Trastuzumab/FOLFOX is not reimbursed by the Korean national health insurance
• S-1 (tegafur, gimeracil and oteracil) + cisplatin:
• very routine in Korea; not used in the US
57. Tentative conclusions
•The concordance between Watson for Oncology and physicians:
•differs by cancer type.
•differs by stage within the same cancer type.
•differs by hospital and country for the same cancer type.
•may change over time.
58. WHY?
•Differences in national guidelines
• WFO is fundamentally based on MSKCC practice
• Differences in ethnicity, approved drugs, and insurance systems
•Updates to the NCCN guidelines
•Differences in the diversity of treatment options by cancer type
• Lung cancer: many options vs rectal cancer: few
• TNBC: few options vs HER2 (-): many
60. •How should WFO's accuracy be proven? A clinical trial!
Watson for Oncology vs. medical oncologist(s)
10,000 cancer patients in each arm
Would such a clinical trial be feasible?
• Prospective, single-blind randomized trial
• Primary endpoint: overall survival (OS)
• Secondary endpoint: progression-free survival (PFS)
61. •How should WFO's accuracy be proven? A clinical trial!
Medical oncologist(s) vs. Watson for Oncology
10,000 cancer patients in each arm
• Prospective, single-blind randomized trial
• Primary endpoint: overall survival (OS)
• Secondary endpoint: progression-free survival (PFS)
Objections:
•Treating patients with WFO alone would be unethical (new drugs are first validated preclinically).
•Physicians' skills are heterogeneous, so the trial's results would be hard to generalize.
•Watson keeps evolving, so past trial results would be hard to apply to the present.
62. •How should WFO's accuracy be proven? A clinical trial!
NCCN guidelines vs. NCCN guidelines + Watson for Oncology
10,000 cancer patients in each arm
• Prospective, single-blind randomized trial
• Primary endpoint: overall survival (OS)
• Secondary endpoint: progression-free survival (PFS)
Such a study would be feasible, but its outcome is hard to predict.
•Treating patients with WFO alone would be unethical (new drugs are first validated preclinically).
•Physicians' skills are heterogeneous, so the trial's results would be hard to generalize.
•Watson keeps evolving, so past trial results would be hard to apply to the present.
•Would IBM actually want to run this study?
63. Meeting with Dr. Kyu Rhee,
Chief Health Officer of Watson Health
(July 4, 2017)
64. Four factors to validate
clinical benefits of WFO
•Patient Outcomes
•human doctor vs human + WFO
•mortality and morbidity
•Cost (reduction in healthcare spending)
•Does it cut costs through lower recurrence and readmission rates?
•Doctor's Satisfaction
•Does WFO improve physicians' clinical workflow?
•What is the user experience of clinicians using WFO?
•Patient Satisfaction
•Do patients want WFO to be used in their care?
65. Principles are needed
•For which patients should Watson be consulted?
•How much should Watson be trusted (by cancer type)?
•Should Watson's recommendation be disclosed to the patient?
•What should happen when Watson and the care team disagree?
•Can Watson's advice be covered by insurance reimbursement?
Quality of care and treatment outcomes may depend on these criteria,
yet today each hospital applies its own ad hoc standards.
67. "Empowering the Oncology Community for Cancer Care"
Genomics / Oncology / Clinical Trial Matching
Watson Health's oncology clients span more than 35 hospital systems
Andrew Norden, KOTRA Conference, March 2017, "The Future of Health is Cognitive"
69. •Over 16 weeks, 2,620 lung and breast cancer patients at HOG (Highlands Oncology Group) were screened
•90 patients were matched against three Novartis breast cancer trial protocols
•Clinical trial coordinator: 1 hour 50 minutes
•Watson CTM: 24 minutes (a 78% time reduction)
•Watson CTM automatically screened out the 94% of patients who did not meet trial eligibility criteria
70. Watson Genomics Overview
Watson Genomics Content
• 20+ content sources, including:
• Medical articles (23 million)
• Drug information
• Clinical trial information
• Genomic information
Pipeline: Case sequenced (VCF / MAF, Log2, Dge) ➞ Encryption ➞ Molecular profile
analysis ➞ Pathway analysis ➞ Drug analysis ➞ Service analysis, reports, & visualizations
75. IBM Watson Health
Organizations Leveraging Watson
Watson for Oncology
Best Doctors (second opinion)
Bumrungrad International Hospital
Confidential client (Bangladesh and Nepal)
Gachon University Gil Medical Center (Korea)
Hangzhou Cognitive Care – 50+ Chinese hospitals
Jupiter Medical Center
Manipal Hospitals – 16 Indian Hospitals
MD Anderson (**Oncology Expert Advisor)
Memorial Sloan Kettering Cancer Center
MRDM - Zorg (Netherlands)
Pusan National University Hospital
Clinical Trial Matching
Best Doctors (second opinion)
Confidential – Major Academic Center
Highlands Oncology Group
Froedtert & Medical College of Wisconsin
Mayo Clinic
Multiple Life Sciences pilots
Watson Genomic Analytics
Ann & Robert H Lurie Children’s Hospital of Chicago
BC Cancer Agency
City of Hope
Cleveland Clinic
Columbia University, Irving Cancer Center
Duke Cancer Institute
Fred & Pamela Buffett Cancer Center
Fleury (Brazil)
Illumina 170 Gene Panel
NIH Japan
McDonnell Institute at Washington University in St. Louis
New York Genome Center
Pusan National University Hospital
Quest Diagnostics
Stanford Health
University of Kansas Cancer Center
University of North Carolina Lineberger Cancer Center
University of Southern California
University of Washington Medical Center
University of Tokyo
Yale Cancer Center
Andrew Norden, KOTRA Conference, March 2017, “The Future of Health is Cognitive”
82. • Erosion of human physicians' authority due to AI
• Greater patient self-determination and empowerment
• Need to change how doctors practice and how they are trained
http://news.donga.com/3/all/20170320/83400087/1
83. • What happens when the doctor and Watson disagree?
• Watson appears to sometimes give recommendations that differ from the NCCN guidelines
• About 5 cases out of some 100
• Can the patients' choices be considered rational?
• Watson's accuracy has not been validated
• Likely influenced by buzzwords such as the 'Fourth Industrial Revolution'
• Doesn't this call for a clinical trial?
• Patient preference affects the adoption rate of AI
• Factors affecting hospital adoption:
• analytical validity
• clinical validity/utility
• physicians' perceptions/psychological factors
• patients' perceptions/psychological factors
• regulatory environment (approval, reimbursement, etc.)
• Ultimately, if patients want it (regardless of whether it is medically sound),
hospital adoption can only keep growing
84. • Patient response to Watson has been much better than expected
• 85 cancer patients seen within 2 months of adoption
• Likely faster uptake than Gil Medical Center itself projected
• Reportedly, transfer inquiries from the Big 5 hospitals to Gil Medical Center are increasing
• Professors are said to discuss cases more thoroughly when seeing patients
85. • Pusan National University Hospital (January 2017)
• Adopted two Watson solutions:
• Watson for Oncology
• Watson for Genomics
86. • Konyang University Hospital adopted Watson for Oncology
• March 2017
• "Local patients will no longer need to shuttle among hospitals in the capital
region," said Choi Won-jun, director of Konyang University Hospital. "Building
on the synergy between our excellent multidisciplinary team and the AI medical
system, we will provide cancer patients with the best possible care."
89. •Predicting "whether a first cardiovascular event will occur within the next 10 years"
•Prospective cohort study: 378,256 patients in the UK
•The first large-scale study to predict disease with machine learning from routine clinical data
•Compared the accuracy of the established ACC/AHA guideline against four machine-learning algorithms (see the sketch below):
•Random forest; logistic regression; gradient boosting; neural network
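A minimal sketch of this kind of four-way comparison in Python with scikit-learn; the synthetic data here merely stands in for the study's routine clinical variables and 10-year CVD outcomes, and all hyperparameters are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in: 30 clinical variables, ~7% event rate.
    X, y = make_classification(n_samples=5000, n_features=30, weights=[0.93], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.22, random_state=0)

    models = {
        "Random Forest": RandomForestClassifier(n_estimators=500, random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Gradient Boosting": GradientBoostingClassifier(random_state=0),
        "Neural Network": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    }

    # Compare discrimination by AUC, as the paper does with the c-statistic.
    for name, model in models.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")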
90. Can machine-learning improve cardiovascular
risk prediction using routine clinical data?
Stephen F. Weng et al., PLoS ONE 2017
[Excerpt] "...resulting in a sensitivity of 62.7% and PPV of 17.1%. The random forest
algorithm resulted in a net increase of 191 CVD cases from the baseline model,
increasing the sensitivity to 65.3% and PPV to 17.8%, while logistic regression
resulted in a net increase of 324 CVD cases (sensitivity 67.1%; PPV 18.3%). Gradient
boosting machines and neural networks performed best, resulting in a net increase
of 354 (sensitivity 67.5%; PPV 18.4%) and 355 (sensitivity 67.5%; PPV 18.4%) CVD
cases correctly predicted, respectively. The ACC/AHA baseline model correctly
predicted 53,106 non-cases from 75,585 total non-cases, resulting in a specificity
of 70.3% and NPV of 95.1%. The net increase in non-cases..."
[Table 3 (not reproduced): top 10 risk factor variables for the CVD algorithms, listed in
descending order of coefficient effect size (ACC/AHA; logistic regression), weighting
(neural networks), or selection frequency (random forest, gradient boosting machines);
the algorithms were derived from a training cohort of 295,267 patients. Beyond the
classic ACC/AHA factors (age, total and HDL cholesterol, smoking, blood pressure,
diabetes), the machine-learning algorithms selected variables such as ethnicity,
Townsend deprivation index, atrial fibrillation, chronic kidney disease, severe mental
illness, oral corticosteroid prescription, HbA1c, triglycerides, BMI, COPD, and family
history of premature CHD; italics in the original mark protective factors.]
https://doi.org/10.1371/journal.pone.0174944.t003
•Only some of the risk factors in the ACC/AHA guideline were also selected by the machine-learning algorithms
•Notably, diabetes was not included in any of the four models
•New factors absent from existing risk-prediction tools were included, such as:
•COPD, severe mental illness, prescription of oral corticosteroids
•biomarkers such as triglyceride level
91. Can machine-learning improve cardiovascular
risk prediction using routine clinical data?
Stephen F. Weng et al., PLoS ONE 2017
[Excerpt] "...correctly predicted compared to the baseline ACC/AHA model ranged from
191 non-cases for the random forest algorithm to 355 non-cases for the neural networks.
... Compared to an established AHA/ACC risk prediction algorithm, we found all
machine-learning algorithms tested were better at identifying individuals who will
develop CVD and those that will not. Unlike established approaches to risk prediction,
the machine-learning methods used were not limited to a small set of risk factors, and
incorporated more pre-existing..."
Table 4. Performance of the machine-learning (ML) algorithms predicting 10-year
cardiovascular disease (CVD) risk, derived by applying the training algorithms to the
validation cohort of 82,989 patients. A higher c-statistic means better discrimination;
the baseline (BL) ACC/AHA algorithm is shown for comparison. Standard errors were
estimated by a jack-knife procedure.
Algorithm | AUC c-statistic | SE | 95% CI | Absolute change from baseline
BL: ACC/AHA | 0.728 | 0.002 | 0.723-0.735 | -
ML: Random Forest | 0.745 | 0.003 | 0.739-0.750 | +1.7%
ML: Logistic Regression | 0.760 | 0.003 | 0.755-0.766 | +3.2%
ML: Gradient Boosting Machines | 0.761 | 0.002 | 0.755-0.766 | +3.3%
ML: Neural Networks | 0.764 | 0.002 | 0.759-0.769 | +3.6%
https://doi.org/10.1371/journal.pone.0174944.t004
•All four machine-learning models were more accurate than the existing ACC/AHA guideline
•Neural networks were the most accurate, with AUC = 0.764
•"Using this model would have correctly predicted 355 additional cardiovascular events"
•Accuracy could improve further with deep learning
•Additional risk factors such as genetic information could be incorporated
92. Deepr: A Convolutional Net for Medical Records
[Figure 1. Overview of Deepr for predicting future risk from a medical record. The
top-left box depicts an example record with multiple visits, each containing multiple
coded objects (diagnoses & procedures); the future risk is unknown (question mark).
Steps: (1) the medical record is sequenced into phrases separated by coded
time-gaps/transfers; then, bottom to top: (2) words are embedded into continuous
vectors, (3) local word vectors are convolved to detect local motifs, (4) max-pooling
derives a record-level vector, (5) a classifier predicts an output, which is a future event.]
Sequencing the EMR transforms a record into a sentence - a sequence of words in which
diagnoses are letter-prefixed codes and procedures are digits, e.g.:
1910 Z83 911 1008 D12 K31 1-3m R94 RAREWORD H53 Y83 M62 Y92 E87 T81 RAREWORD RAREWORD 1893 D12
•Predicts "will a discharged patient be readmitted within 6 months?"
•Prediction from EMR data
•Validated on 300,000 patients in Australia
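A minimal sketch of the Deepr-style architecture described above - embed the coded visit sequence, convolve to detect local motifs, max-pool to a record-level vector, then classify. This is written in Python with PyTorch; the vocabulary size, dimensions, and dummy inputs are illustrative, not from the paper:

    import torch
    import torch.nn as nn

    class Deepr(nn.Module):
        """Embedding -> 1D convolution (motif detection) -> max-pooling -> classifier."""
        def __init__(self, vocab_size=2000, embed_dim=100, n_filters=100, kernel_size=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size)
            self.fc = nn.Linear(n_filters, 1)

        def forward(self, codes):   # codes: (batch, seq of diagnosis/procedure/time-gap tokens)
            x = self.embed(codes)                          # (batch, seq_len, embed_dim)
            x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, seq_len - k + 1)
            x = x.max(dim=2).values                        # record-level vector via max-pooling
            return torch.sigmoid(self.fc(x))               # P(readmission within 6 months)

    model = Deepr()
    record = torch.randint(1, 2000, (4, 50))   # 4 dummy records, 50 tokens each
    print(model(record).shape)                 # torch.Size([4, 1])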
93. •Analyzing complex medical data and deriving insights
•Analyzing/reading medical imaging and pathology data
•Monitoring continuous data for prevention/prediction
Medical applications of artificial intelligence
96. [Fig. 4 from Russakovsky et al.: a random selection of images in the ILSVRC
detection validation set. The top rows were taken from the ILSVRC2012 single-object
localization validation set; the bottom rows were collected from Flickr using
scene-level queries.]
http://arxiv.org/pdf/1409.0575.pdf
97. • Main competition
• Classification: classify the objects in an image
• Localization: classify and localize 'one' object in the image
• Object detection: classify and localize 'all' objects in the image
[Fig. 7 from Russakovsky et al.: tasks in ILSVRC. The first column shows the
ground-truth labeling on an example image; the next three show sample outputs
with the corresponding evaluation scores.]
http://arxiv.org/pdf/1409.0575.pdf
98. Performance of winning entries in the ILSVRC2010-2015 competitions
in each of the three tasks
[Charts: image classification error (2010-2015), single-object localization error
(2011-2015), and object detection average precision (2013-2015), all improving
sharply year over year]
http://image-net.org/challenges/LSVRC/2015/results#loc
100. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
How deep is deep?
104. DeepFace: Closing the Gap to Human-Level
Performance in Face Verification
Taigman, Y. et al. (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR'14.
[Figure 2. Outline of the DeepFace architecture: a front-end of a single
convolution-pooling-convolution filtering on the rectified input, followed by three
locally-connected layers and two fully-connected layers. The net includes more than
120 million parameters, over 95% of which come from the local and fully connected
layers. The locally connected layers apply a filter bank like a convolutional layer,
but every location in the feature map learns a different set of filters, since
different regions of an aligned face have different local statistics. Training
maximizes the probability of the correct class (face id) by minimizing the
cross-entropy loss L = -log p_k, where k is the index of the true label.]
Human: 95% vs. DeepFace (Facebook): 97.35%
Recognition accuracy on the Labeled Faces in the Wild (LFW) dataset (13,233 images, 5,749 people)
105. FaceNet: A Unified Embedding for Face
Recognition and Clustering
Schroff, F. et al. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering
Human: 95% vs. FaceNet (Google): 99.63%
Recognition accuracy on the Labeled Faces in the Wild (LFW) dataset (13,233 images, 5,749 people)
[Paper highlights: of the 13 LFW errors shown in the paper's Figure 6, only eight are
actual errors - the other four images are mislabeled in LFW. On the YouTube Faces DB,
averaging the similarity of all pairs of the first 100 frames the face detector finds
in each video gives 95.12% ± 0.39 classification accuracy (95.18% with the first
1,000 frames), roughly halving the error rate of prior work (91.4%) and cutting
DeepId2+'s error (93.2%) by 30%. The method learns an embedding directly into a
Euclidean space for face verification, end to end, without CNN bottleneck layers or
post-processing such as model concatenation, PCA, or SVM classification. The compact
embedding also supports face clustering, e.g. grouping a user's personal photos by
identity with striking invariance to occlusion, lighting, pose, and even age.]
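What these accuracy numbers measure is, at bottom, a verification decision: are two face embeddings closer than a tuned threshold? A minimal sketch of that step in Python, assuming a trained embedding network is available (faked here with random vectors; the 128-D size matches FaceNet, the threshold value is illustrative):

    import numpy as np

    def l2_normalize(v):
        return v / np.linalg.norm(v)

    def same_person(emb_a, emb_b, threshold=1.1):
        """FaceNet-style verification: squared L2 distance between unit-length
        embeddings below a tuned threshold means 'same identity'."""
        d = np.sum((l2_normalize(emb_a) - l2_normalize(emb_b)) ** 2)
        return d < threshold

    # Stand-ins for the 128-D embeddings a trained network would produce.
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=128), rng.normal(size=128)
    print(same_person(a, b))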
106. Show and Tell:
A Neural Image Caption Generator
Vinyals, O. et al. (2015). Show and Tell: A Neural Image Caption Generator, arXiv:1411.4555
[Figure 1. NIC is an end-to-end neural network consisting of a vision CNN followed by
a language-generating RNN; it maps an image to a caption such as "A group of people
shopping at an outdoor market. There are many vegetables at the fruit stand."]
107. Show and Tell:
A Neural Image Caption Generator
Vinyals, O. et al. (2015). Show and Tell:A Neural Image Caption Generator, arXiv:1411.4555
Figure 5. A selection of evaluation results, grouped by human rating.
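A minimal sketch of this encoder-decoder idea in Python with PyTorch: the CNN encodes the image into a feature vector that seeds an RNN generating the caption word by word. The paper pairs an Inception-style CNN with an LSTM; here a small ResNet stands in, and the vocabulary size and dimensions are illustrative:

    import torch
    import torch.nn as nn
    from torchvision import models

    class CaptionNet(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
            super().__init__()
            cnn = models.resnet18(weights=None)      # vision CNN (Inception in the paper)
            cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
            self.cnn = cnn
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image, caption_tokens):
            img_feat = self.cnn(image).unsqueeze(1)  # image features act as the first "word"
            words = self.embed(caption_tokens)
            seq = torch.cat([img_feat, words], dim=1)
            h, _ = self.lstm(seq)
            return self.out(h)                       # next-word logits at each step

    model = CaptionNet()
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 13, 10000])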
112. Business Area
Medical Image Analysis
VUNOnet and our machine learning technology will help doctors and hospitals manage
medical scans and images intelligently, making diagnosis faster and more accurate.
[Example: original image ➞ automatic segmentation into Normal / Emphysema / Reticular Opacity]
Our system finds DILDs with the highest accuracy (*DILDs: diffuse interstitial lung disease)
Digital Radiologist
Collaboration with Prof. Joon Beom Seo (Asan Medical Center)
Analysed 1,200 patients over 3 months
114. Digital Radiologist
Med Phys. 2013 May;40(5):051912. doi: 10.1118/1.4802214.
Collaboration with Prof. Joon Beom Seo (Asan Medical Center)
Analysed 1,200 patients over 3 months
115. Digital Radiologist
Med Phys. 2013 May;40(5):051912. doi: 10.1118/1.4802214.
Collaboration with Prof. Joon Beom Seo (Asan Medical Center)
Analysed 1,200 patients over 3 months
116. Digital Radiologist
Med Phys. 2013 May;40(5):051912. doi: 10.1118/1.4802214.
Collaboration with Prof. Joon Beom Seo (Asan Medical Center)
Analysed 1,200 patients over 3 months
Feature Engineering vs Feature Learning
• Visualization of hand-crafted features vs learned features in 2D
117. Bench to Bedside: Practical Applications
• Contents-based Case Retrieval
–Finding similar cases with a clinically matching context - a search engine for medical images.
–Clinicians can refer to the diagnoses and prognoses of past similar patients to make better clinical decisions.
–Accepted for presentation at RSNA 2017
Digital Radiologist
119. •Zebra Medical Vision launched a service that reads radiology scans for $1 each (October 2017)
•The final menu is not yet confirmed, but is expected to include Pulmonary Hypertension, Lung Nodule, Fatty Liver, Emphysema,
Coronary Calcium Scoring, Bone Mineral Density, and Aortic Aneurysm
https://www.zebra-med.com/aione/
120. Zebra Medical Vision's AI1: AI at Your Fingertips
https://www.youtube.com/watch?v=0PGgCpXa-Fs
122. Diabetic retinopathy
• A major complication of diabetes: develops in 90% of patients with a diabetes history of 30+ years
• Ophthalmologists photograph the fundus (the inside of the eye) and read the images
• Diagnosis is based on the degree of retinal microvascular proliferation, hemorrhage, and exudates
123. Development and Validation of a Deep Learning Algorithm
for Detection of Diabetic Retinopathy
in Retinal Fundus Photographs
Varun Gulshan, PhD; Lily Peng, MD, PhD; Marc Coram, PhD; Martin C. Stumpe, PhD; Derek Wu, BS; Arunachalam Narayanaswamy, PhD;
Subhashini Venugopalan, MS; Kasumi Widner, MS; Tom Madams, MEng; Jorge Cuadros, OD, PhD; Ramasamy Kim, OD, DNB;
Rajiv Raman, MS, DNB; Philip C. Nelson, BS; Jessica L. Mega, MD, MPH; Dale R. Webster, PhD
IMPORTANCE Deep learning is a family of computational methods that allow an algorithm to
program itself by learning from a large set of examples that demonstrate the desired
behavior, removing the need to specify rules explicitly. Application of these methods to
medical imaging requires further assessment and validation.
OBJECTIVE To apply deep learning to create an algorithm for automated detection of diabetic
retinopathy and diabetic macular edema in retinal fundus photographs.
DESIGN AND SETTING A specific type of neural network optimized for image classification
called a deep convolutional neural network was trained using a retrospective development
data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy,
diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists
and ophthalmology senior residents between May and December 2015. The resultant
algorithm was validated in January and February 2016 using 2 separate data sets, both
graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.
EXPOSURE Deep learning–trained algorithm.
MAIN OUTCOMES AND MEASURES The sensitivity and specificity of the algorithm for detecting
referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy,
referable diabetic macular edema, or both, were generated based on the reference standard
of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2
operating points selected from the development set, one selected for high specificity and
another for high sensitivity.
RESULTS The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4
years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the
Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women;
prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm
had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and
0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high
specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity
was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-
91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point
with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and
specificity was 93.4% and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.
CONCLUSIONS AND RELEVANCE In this evaluation of retinal fundus photographs from adults
with diabetes, an algorithm based on deep machine learning had high sensitivity and
specificity for detecting referable diabetic retinopathy. Further research is necessary to
determine the feasibility of applying this algorithm in the clinical setting and to determine
whether use of the algorithm could lead to improved care and outcomes compared with
current ophthalmologic assessment.
JAMA. doi:10.1001/jama.2016.17216
Published online November 29, 2016.
126. Training Set / Test Set
• A CNN was retrospectively trained on 128,175 fundus images
• Each image was graded 3-7 times by a panel of 54 US-licensed ophthalmologists
• The algorithm's readings were compared against those of 7-8 top ophthalmologists
• EyePACS-1 (9,963 images), Messidor-2 (1,748 images)
[eFigure 2. Screenshot of the second screen of the grading tool, which asks graders
to assess the image for DR, DME, and other notable conditions or findings:
a) fullscreen mode; b) a reset button that reloads the image and clears all grading;
c) a comment box for other pathologies seen.]
127. • AUC = 0.991 for EyePACS-1 and 0.990 for Messidor-2
• Sensitivity and specificity on par with the panel of 7-8 ophthalmologists
• F-score: 0.95 (vs. 0.91 for the human ophthalmologists)
Additional sensitivity analyses were conducted for several subcategories, e.g.
detecting moderate or worse diabetic retinopathy; the effect of data set size on
algorithm performance was examined and shown to plateau at around 60,000 images.
Figure 2. Validation Set Performance for Referable Diabetic Retinopathy
[ROC curves: A, EyePACS-1 (AUC, 99.1%; 95% CI, 98.8%-99.3%) and B, Messidor-2
(AUC, 99.0%; 95% CI, 98.6%-99.5%), each marking the high-sensitivity and
high-specificity operating points.]
Performance of the algorithm (black curve) and ophthalmologists (colored
circles) for the presence of referable diabetic retinopathy (moderate or worse
diabetic retinopathy or referable diabetic macular edema) on A, EyePACS-1
(8788 fully gradable images) and B, Messidor-2 (1745 fully gradable images).
The black diamonds on the graph correspond to the sensitivity and specificity of
the algorithm at the high-sensitivity and high-specificity operating points.
In A, for the high-sensitivity operating point, specificity was 93.4% (95% CI,
92.8%-94.0%) and sensitivity was 97.5% (95% CI, 95.8%-98.7%); for the
high-specificity operating point, specificity was 98.1% (95% CI, 97.8%-98.5%)
and sensitivity was 90.3% (95% CI, 87.5%-92.7%). In B, for the high-sensitivity
operating point, specificity was 93.9% (95% CI, 92.4%-95.3%) and sensitivity
was 96.1% (95% CI, 92.4%-98.3%); for the high-specificity operating point,
specificity was 98.5% (95% CI, 97.7%-99.1%) and sensitivity was 87.0% (95%
CI, 81.1%-91.0%). There were 8 ophthalmologists who graded EyePACS-1 and 7
ophthalmologists who graded Messidor-2. AUC indicates area under the
receiver operating characteristic curve.
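For reference, the F-score quoted above is the harmonic mean of precision (PPV) and recall (sensitivity); a minimal Python example with illustrative values (the paper does not report the exact precision/recall pair behind its 0.95):

    def f_score(precision, recall):
        # Harmonic mean of precision (PPV) and recall (sensitivity)
        return 2 * precision * recall / (precision + recall)

    print(round(f_score(0.93, 0.97), 2))  # 0.95 with these illustrative values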
131. [Esteva et al., Nature 2017 (Letter). On a three-way classification task, the
CNN achieves 72.1 ± 0.9% (mean ± s.d.) overall accuracy (the average of individual
inference-class accuracies), while two dermatologists attain 65.56% and 66.0% on a
subset of the validation set. The algorithm was also validated on a nine-class disease
partition (second-level taxonomy nodes grouping diseases with similar treatment
plans). Two trials compared CNN and dermatologists, one using standard images and the
other dermoscopy images, reflecting the two steps a dermatologist takes to form a
clinical impression; the same CNN is used for all tasks, and the comparison metrics
are sensitivity and specificity.]
[Figure 2a. Deep CNN layout: an image of a skin lesion (e.g., melanoma) is
sequentially warped into a probability distribution over clinical classes of skin
disease using the Google Inception v3 architecture, pretrained on ImageNet
(1.28 million images, 1,000 generic object classes) and fine-tuned on 129,450 skin
lesions comprising 2,032 different diseases. Training classes (757) are defined by a
novel taxonomy of skin disease and a partitioning algorithm that maps diseases into
training classes (e.g., acrolentiginous melanoma, amelanotic melanoma, lentigo
melanoma). Inference classes are more general, combining one or more training classes
(e.g., malignant melanocytic lesions, the class of melanomas); the probability of an
inference class is computed by summing the probabilities of its training classes over
the taxonomy. Example output: 92% malignant melanocytic lesion, 8% benign melanocytic
lesion.]
GoogleNet Inception v3
• A dataset of 129,450 skin lesion images built in-house
• Data curated by 18 US dermatologists
• Images learned with a CNN (Inception v3; see the fine-tuning sketch below)
• The algorithm's readings compared with those of 21 dermatologists on three tasks:
• Distinguishing keratinocyte carcinoma from benign seborrheic keratosis
• Distinguishing malignant melanoma from benign lesions (standard images)
• Distinguishing malignant melanoma from benign lesions (dermoscopy images)
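A minimal sketch of the transfer-learning recipe named above - an ImageNet-pretrained Inception v3 with its classification head replaced and fine-tuned on lesion images. The paper used Google's TensorFlow implementation; this sketch uses torchvision's, and the training-loop details are illustrative:

    import torch
    import torch.nn as nn
    from torchvision import models

    n_classes = 757                                        # training classes in the paper's taxonomy
    model = models.inception_v3(weights="IMAGENET1K_V1")   # pretrained on ImageNet
    model.fc = nn.Linear(model.fc.in_features, n_classes)  # replace the classification head
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, n_classes)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):    # images: (batch, 3, 299, 299) lesion photos
        model.train()
        optimizer.zero_grad()
        logits, aux_logits = model(images)                 # Inception v3 returns two heads in train mode
        loss = criterion(logits, labels) + 0.4 * criterion(aux_logits, labels)
        loss.backward()
        optimizer.step()
        return loss.item()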
132. Skin cancer classification performance of the CNN and dermatologists.
[ROC-style plots of specificity vs sensitivity. Panel a (tested against
dermatologists): carcinoma, 135 images, algorithm AUC = 0.96 (25 dermatologists);
melanoma, 130 images, algorithm AUC = 0.94 (22 dermatologists); melanoma under
dermoscopy, 111 images, algorithm AUC = 0.91 (21 dermatologists); each plot marks
the individual dermatologists and their average. Panel b (full test sets):
carcinoma, 707 images, AUC = 0.96; melanoma, 225 images, AUC = 0.96; melanoma
under dermoscopy, 1,010 images, AUC = 0.94.]
Quite a few of the 21 dermatologists were less accurate than the algorithm,
and the dermatologists' average performance was also below the algorithm's.
133. Skin cancer classification performance of the CNN and dermatologists.
[Same figure as the previous slide.]
134. Skin Cancer Image Classification (TensorFlow Dev Summit 2017)
Skin cancer classification performance of
the CNN and dermatologists.
https://www.youtube.com/watch?v=toK1OSLep3s&t=419s
136. Diagnostic Concordance Among Pathologists
Interpreting Breast Biopsy Specimens
A / B / C / D
Benign without atypia / Atypia / DCIS (ductal carcinoma in situ) / Invasive carcinoma
Interpretation?
Elmore et al. JAMA 2015
137. [Figure 4. Participating pathologists' interpretations of each of the 240 breast
biopsy test cases, shown as the percentage of interpretations per case falling into
each category (benign without atypia, atypia, DCIS, invasive carcinoma). Panels:
A, benign without atypia (72 cases, 2,070 total interpretations); B, atypia (72 cases,
2,070 interpretations); C, DCIS (73 cases, 2,097 interpretations); D, invasive
carcinoma (23 cases, 663 interpretations). DCIS indicates ductal carcinoma in situ.]
Elmore et al. JAMA 2015
Diagnostic Concordance Among Pathologists
Interpreting Breast Biopsy Specimens
138. [Figure 4 repeated from the previous slide.]
Elmore et al. JAMA 2015
Diagnostic Concordance Among Pathologists
Interpreting Breast Biopsy Specimens
Diagnostic Concordance Among Pathologists
Interpreting Breast Biopsy Specimens
139. Elmore etl al. JAMA 2015
Diagnostic Concordance Among Pathologists
Interpreting Breast Biopsy Specimens
• Concordance noted in 5194 of 6900 case interpretations or 75.3%.
• Reference diagnosis was obtained from consensus of 3 experienced breast pathologists.
spentonthisactivitywas16(95%CI,15-17);43participantswere
awarded the maximum 20 hours.
Pathologists’ Diagnoses Compared With Consensus-Derived
Reference Diagnoses
The 115 participants each interpreted 60 cases, providing 6900
total individual interpretations for comparison with the con-
sensus-derived reference diagnoses (Figure 3). Participants
agreed with the consensus-derived reference diagnosis for
75.3% of the interpretations (95% CI, 73.4%-77.0%). Partici-
pants (n = 94) who completed the CME activity reported that
Patient and Pathologist Characteristics Associated With
Overinterpretation and Underinterpretation
The association of breast density with overall pathologists’
concordance (as well as both overinterpretation and under-
interpretation rates) was statistically significant, as shown
in Table 3 when comparing mammographic density grouped
into 2 categories (low density vs high density). The overall
concordance estimates also decreased consistently with
increasing breast density across all 4 Breast Imaging-
Reporting and Data System (BI-RADS) density categories:
BI-RADS A, 81% (95% CI, 75%-86%); BI-RADS B, 77% (95%
Figure 3. Comparison of 115 Participating Pathologists’ Interpretations vs the Consensus-Derived Reference
Diagnosis for 6900 Total Case Interpretationsa
Participating Pathologists’ Interpretation
ConsensusReference
Diagnosisb
Benign
without atypia Atypia DCIS
Invasive
carcinoma Total
Benign without atypia 1803 200 46 21 2070
Atypia 719 990 353 8 2070
DCIS 133 146 1764 54 2097
Invasive carcinoma 3 0 23 637 663
Total 2658 1336 2186 720 6900
DCIS indicates ductal carcinoma
in situ.
a
Concordance noted in 5194 of
6900 case interpretations or
75.3%.
b
Reference diagnosis was obtained
from consensus of 3 experienced
breast pathologists.
Diagnostic Concordance in Interpreting Breast Biopsies Original Investigation Research
Comparison of 115 Participating Pathologists’ Interpretations vs
the Consensus-Derived Reference Diagnosis for 6900 Total Case Interpretations
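The headline 75.3% is simply the diagonal of this matrix over the total; a minimal Python sketch reproducing it, with per-class agreement as a bonus (counts copied from the table above):

    import numpy as np

    # Rows: consensus reference; columns: participant interpretation
    # Order: benign without atypia, atypia, DCIS, invasive carcinoma
    m = np.array([[1803, 200,   46,  21],
                  [ 719, 990,  353,   8],
                  [ 133, 146, 1764,  54],
                  [   3,   0,   23, 637]])

    print(f"overall concordance: {np.trace(m) / m.sum():.1%}")   # 75.3%
    per_class = np.diag(m) / m.sum(axis=1)
    print(dict(zip(["benign", "atypia", "DCIS", "invasive"], per_class.round(3))))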
140. [Figure 1. The C-Path image-analysis pipeline:
A. Basic image processing and feature construction: the H&E image is broken into
superpixels, and nuclei are identified within each superpixel.
B. Building an epithelial/stromal classifier: characteristics of epithelial nuclei
and cytoplasm and of stromal nuclei and matrix feed a classifier that separates
epithelium from stroma.
C. Constructing higher-level contextual/relational features: relationships between
epithelial nuclear neighbors; between morphologically regular and irregular nuclei;
between epithelial and stromal objects; between epithelial nuclei and cytoplasm;
and of contiguous epithelial regions with underlying nuclear objects.
D. Learning an image-based model to predict survival: processed images from patients
alive vs deceased at 5 years after surgery are used to build an L1-regularized
logistic regression model (the 5YS predictive model), which is then applied to
unlabeled test images to classify patients as at high or low risk of death by 5 years.]
[Study context: the TMAs contain 0.6-mm-diameter cores (median of two per case) that
sample only a small part of the full tumor; data came from two separate, independent
cohorts, Netherlands Cancer Institute (NKI; 248 patients) and Vancouver General
Hospital (VGH; 328 patients). Unlike previous work in cancer morphometry, the pipeline
is not limited to a predefined set of pathologist-selected morphometric features: an
automated, hierarchical scene segmentation generates thousands of measurements,
including standard morphometric descriptors of image objects plus higher-level
contextual, relational, and global image features.]
Digital Pathologist
Sci Transl Med. 2011 Nov 9;3(108):108ra113
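The 5YS model in step D above is, at its core, a sparse (L1-regularized) logistic regression over thousands of image-derived features. A minimal Python sketch with synthetic stand-ins for the morphometric features and 5-year outcomes (the real pipeline feeds in the epithelial/stromal measurements extracted from the images):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(248, 6000))   # 248 NKI patients x thousands of image features (synthetic)
    y = rng.integers(0, 2, size=248)   # 1 = deceased at 5 years after surgery (synthetic labels)

    # The L1 penalty drives most feature weights to zero, selecting a sparse prognostic signature.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    risk = model.predict_proba(X)[:, 1]  # P(death by 5 years) -> threshold into high/low risk
    print("features kept:", np.count_nonzero(model.coef_))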
141. Digital Pathologist
Sci Transl Med. 2011 Nov 9;3(108):108ra113
Top stromal features associated with survival (Fig. 4):
• Variability in the absolute difference in intensity between stromal matrix regions
and their neighbors.
• The sum of the minimum green intensity value of stromal-contiguous regions: zero
when stromal regions contain dark pixels (such as inflammatory nuclei), positive when
they are devoid of them - suggesting that the presence of inflammatory cells in the
stroma is associated with poor prognosis, consistent with previous observations.
• The relative border between spindled and round stromal nuclei: an increased
relative border of spindled to round stromal nuclei is associated with worse overall
survival, suggesting that spatial relationships between different populations of
stromal cell types are associated with breast cancer progression.
Top epithelial features (Fig. 5, A-H; left panels improved prognosis, right panels
worse) include: the SD of the (SD of intensity / mean intensity) ratio for pixels
within a ring of the center of epithelial nuclei; the sum of the number of
unclassified objects; the SD of the maximum blue pixel value for atypical epithelial
nuclei; the maximum distance between atypical epithelial nuclei; the minimum elliptic
fit of epithelial contiguous regions; the SD of distance between epithelial
cytoplasmic and nuclear objects; the average border between epithelial cytoplasmic
objects; and the maximum value of the minimum green pixel intensity value in
epithelial contiguous regions.
Reproducibility of C-Path 5YS model predictions on samples with multiple TMA cores:
for the 190 VGH patients who contributed two images with complete data, the binary
high/low-risk predictions on the individual images agreed with each other in 69%
(131 of 190) of cases and with the prediction on the averaged data for 84% (319 of
380) of images. On the continuous prediction score (0-100), the median absolute
difference among replicates was 5%, with a Spearman correlation of 0.27 (P = 0.0002).
This only moderate intrapatient agreement points to significant intratumor
heterogeneity, a cardinal feature of breast carcinomas; accordingly, the 5YS model
was significantly more accurate on VGH cases that contributed multiple images than on
those with a single core, indicating that increased tumor sampling improves model
performance.
147. Clinical study on ISBI dataset
Error rates:
• Pathologist in competition setting: 3.5%
• Pathologists in clinical practice (n = 12): 13–26%
• Pathologists on micro-metastases (small tumors): 23–42%
• Beck Lab deep learning model: 0.65%
Beck Lab's deep learning model now outperforms pathologists.
Andrew Beck, Machine Learning for Healthcare, MIT 2017
149. Assisting Pathologists in Detecting
Cancer with Deep Learning
• The localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint.
150. Assisting Pathologists in Detecting
Cancer with Deep Learning
• Algorithms need to be incorporated in a way that complements the pathologist’s workflow.
• Algorithms could improve the efficiency and consistency of pathologists.
• For example, pathologists could reduce their false negative rates (percentage of
undetected tumors) by reviewing the top ranked predicted tumor regions
including up to 8 false positive regions per slide.
151. Assisting Pathologists in Detecting
Cancer with Deep Learning
Input & model size         Validation                   Test
                           FROC   @8FP   AUC           FROC                 @8FP                 AUC
40X                        98.1   100    99.0          87.3 (83.2, 91.1)    91.1 (87.2, 94.5)    96.7 (92.6, 99.6)
40X-pretrained             99.3   100    100           85.5 (81.0, 89.5)    91.1 (86.8, 94.6)    97.5 (93.8, 99.8)
40X-small                  99.3   100    100           86.4 (82.2, 90.4)    92.4 (88.8, 95.7)    97.1 (93.2, 99.8)
ensemble-of-3              -      -      -             88.5 (84.3, 92.2)    92.4 (88.7, 95.6)    97.7 (93.0, 100)
20X-small                  94.7   100    99.6          85.5 (81.0, 89.7)    91.1 (86.9, 94.8)    98.6 (96.7, 100)
10X-small                  88.7   97.2   97.7          79.3 (74.2, 84.1)    84.9 (80.0, 89.4)    96.5 (91.9, 99.7)
40X+20X-small              94.9   98.6   99.0          85.9 (81.6, 89.9)    92.9 (89.3, 96.1)    97.0 (93.1, 99.9)
40X+10X-small              93.8   98.6   100           82.2 (77.0, 86.7)    87.6 (83.2, 91.7)    98.6 (96.2, 99.9)
Pathologist [1]            -      -      -             73.3*                73.3*                96.6
Camelyon16 winner [1, 23]  -      -      -             80.7                 82.7                 99.4
Table 1. Results on Camelyon16 dataset (95% confidence intervals, CI). Bold indicates results within the CI of the best model. “Small” models contain 300K parameters per Inception tower instead of 20M. -: not reported. *A pathologist achieved this sensitivity (with no FP) using 30 hours.
(up to 10–20% variance), and can confound evaluation of model improvements by grouping multiple nearby tumors as one. By contrast, our non-maxima suppression approach is relatively insensitive to r between 4 and 6, although less accurate models benefited from tuning r using the validation set (e.g., 8).
The FROC evaluates tumor detection and localization.
The FROC is defined as the average sensitivity at 0.25, 0.5, 1, 2, 4, and 8 average FPs per tumor-negative slide.
@8FP: sensitivity at 8 false positives per image.
Yun Liu et al. Detecting Cancer Metastases on Gigapixel Pathology Images (2017)
152. Assisting Pathologists in Detecting
Cancer with Deep Learning
Yun Liu et al. Detecting Cancer Metastases on Gigapixel Pathology Images (2017)
• Google's AI showed large improvements in @8FP and FROC (92.9% and 88.5%, respectively)
• @8FP: the sensitivity achievable when up to 8 false positives per slide are tolerated
• FROC: the average sensitivity when 1/4, 1/2, 1, 2, 4, and 8 FPs per slide are allowed (computed in the sketch below)
• In other words, if a few false positives are tolerated, the AI reaches very high sensitivity
• Human pathologists, by contrast, achieved 73% sensitivity but nearly 100% specificity
• Human pathologists and AI pathologists are good at different things
• If the two collaborate, improvements in reading efficiency, consistency, and sensitivity can be expected
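To make the two metrics concrete, here is a minimal Python sketch of how the @8FP sensitivity and a Camelyon16-style FROC could be computed from a list of scored candidate detections. The data layout and function name are illustrative assumptions, not code from Liu et al.

# Illustrative FROC sketch (assumed data layout): each detection is a tuple
# (slide_id, confidence score, matched tumor id, or None for a false positive).
import numpy as np

def froc_sensitivities(detections, n_tumors, n_negative_slides,
                       fp_budgets=(0.25, 0.5, 1, 2, 4, 8)):
    """Sensitivity reached within each allowed number of FPs per tumor-negative slide."""
    detections = sorted(detections, key=lambda d: -d[1])  # descending confidence
    found, fp_count, sens_at = set(), 0, {}
    for slide_id, score, tumor_id in detections:
        if tumor_id is None:
            fp_count += 1
        else:
            found.add(tumor_id)
        for budget in fp_budgets:
            # Freeze the sensitivity the moment a budget is exhausted.
            if budget not in sens_at and fp_count > budget * n_negative_slides:
                sens_at[budget] = len(found) / n_tumors
    for budget in fp_budgets:       # budgets never exhausted keep the final sensitivity
        sens_at.setdefault(budget, len(found) / n_tumors)
    return sens_at

sens = froc_sensitivities([("s1", 0.9, "t1"), ("s1", 0.7, None), ("s2", 0.4, "t2")],
                          n_tumors=2, n_negative_slides=10)
froc = np.mean(list(sens.values()))   # FROC: average sensitivity over the six budgets
sens_at_8fp = sens[8]                 # the "@8FP" number quoted on the slide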
157. Fig 1. What can consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographic sensor embedded into clothing [4], stress with an electrodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an accelerometer in a watch [6,7]. In addition, a female's most fertile period can be identified with detailed body temperature tracking [8], while levels of mental attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9].
PLOS Medicine 2016
158. Medical applications of artificial intelligence:
• Analyzing complex medical data and deriving insights
• Analyzing and reading medical imaging and pathology data
• Monitoring continuous data for prevention and prediction
163. SEPSIS
A targeted real-time early warning score (TREWScore) for septic shock
Katharine E. Henry, David N. Hager, Peter J. Pronovost, Suchi Saria
Sepsis is a leading cause of death in the United States, with mortality highest among patients who develop septic shock. Early aggressive treatment decreases morbidity and mortality. Although automated screening tools can detect patients currently experiencing severe sepsis and septic shock, none predict those at greatest risk of developing shock. We analyzed routinely available physiological and laboratory data from intensive care unit patients and developed “TREWScore,” a targeted real-time early warning score that predicts which patients will develop septic shock. TREWScore identified patients before the onset of septic shock with an area under the ROC (receiver operating characteristic) curve (AUC) of 0.83 [95% confidence interval (CI), 0.81 to 0.85]. At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85 and identified patients a median of 28.2 [interquartile range (IQR), 10.6 to 94.2] hours before onset. Of those identified, two-thirds were identified before any sepsis-related organ dysfunction. In comparison, the Modified Early Warning Score, which has been used clinically for septic shock prediction, achieved a lower AUC of 0.73 (95% CI, 0.71 to 0.76). A routine screening protocol based on the presence of two of the systemic inflammatory response syndrome criteria, suspicion of infection, and either hypotension or hyperlactatemia achieved a lower sensitivity of 0.74 at a comparable specificity of 0.64. Continuous sampling of data from the electronic health records and calculation of TREWScore may allow clinicians to identify patients at risk for septic shock and provide earlier interventions that would prevent or mitigate the associated morbidity and mortality.
INTRODUCTION
Seven hundred fifty thousand patients develop severe sepsis and septic shock in the United States each year. More than half of them are admitted to an intensive care unit (ICU), accounting for 10% of all ICU admissions, 20 to 30% of hospital deaths, and $15.4 billion in annual health care costs (1–3). Several studies have demonstrated that morbidity, mortality, and length of stay are decreased when severe sepsis and septic shock are identified and treated early (4–8). In particular, one study showed that mortality from septic shock increased by 7.6% with every hour that treatment was delayed after the onset of hypotension (9).
More recent studies comparing protocolized care, usual care, and early goal-directed therapy (EGDT) for patients with septic shock suggest that usual care is as effective as EGDT (10–12). Some have interpreted this to mean that usual care has improved over time and reflects important aspects of EGDT, such as early antibiotics and early aggressive fluid resuscitation (13). It is likely that continued early identification and treatment will further improve outcomes. However, the Simplified Acute Physiology Score (SAPS II), Sequential Organ Failure Assessment (SOFA) scores, Modified Early Warning Score (MEWS), and Simple Clinical Score (SCS) have been validated to assess illness severity and risk of death among septic patients (14–17). Although these scores are useful for predicting general deterioration or mortality, they typically cannot distinguish with high sensitivity and specificity which patients are at highest risk of developing a specific acute condition.
The increased use of electronic health records (EHRs), which can be queried in real time, has generated interest in automating tools that identify patients at risk for septic shock (18–20). A number of “early warning systems,” “track and trigger” initiatives, “listening applications,” and “sniffers” have been implemented to improve detection and timeliness of therapy for patients with severe sepsis and septic shock (18, 20–23). Although these tools have been successful at detecting patients currently experiencing severe sepsis or septic shock, none predict which patients are at highest risk of developing septic shock.
The adoption of the Affordable Care Act has added to the growing excitement around predictive models derived from electronic health records.
164.
Fig. 2. ROC for detection of septic shock before onset in the validation
set. The ROC curve for TREWScore is shown in blue, with the ROC curve for
MEWS in red. The sensitivity and specificity performance of the routine
screening criteria is indicated by the purple dot. Normal 95% CIs are shown
for TREWScore and MEWS. TPR, true-positive rate; FPR, false-positive rate.
A targeted real-time early warning score (TREWScore) for septic shock
AUC = 0.83
At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85 and identified patients a median of 28.2 hours before onset.
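To make the quoted operating point concrete, the minimal sketch below (toy data; variable names are assumptions, not the TREWScore implementation) shows how an AUC and a sensitivity at a target specificity are read off any continuous risk score with scikit-learn:

# Toy example: compute AUC and the sensitivity at a target specificity
# for a continuous risk score (illustrative only, not TREWScore itself).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)               # 1 = went on to develop septic shock
score = y * 1.0 + rng.normal(0.0, 1.0, 1000)    # toy risk score, higher for true cases

fpr, tpr, thr = roc_curve(y, score)
print("AUC =", round(roc_auc_score(y, score), 2))

target_spec = 0.67
i = np.argmin(np.abs((1 - fpr) - target_spec))  # threshold closest to the target
print(f"sensitivity {tpr[i]:.2f} at specificity {1 - fpr[i]:.2f} (threshold {thr[i]:.2f})")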
167. In an early research project involving 600 patient cases, the team was able to
predict near-term hypoglycemic events up to 3 hours in advance of the symptoms.
IBM Watson-Medtronic
Jan 7, 2016
168. Sugar.IQ
Based on past records of the user's food intake, the resulting blood-glucose changes, insulin injections, and so on, Watson predicts how the user's blood glucose will change after a meal.
169. ADA 2017, San Diego, Courtesy of Taeho Kim (Seoul Medical Center)
170. ADA 2017, San Diego, Courtesy of Taeho Kim (Seoul Medical Center)
171. ADA 2017, San Diego, Courtesy of Taeho Kim (Seoul Medical Center)
172. ADA 2017, San Diego, Courtesy of Taeho Kim (Seoul Medical Center)
174. Prediction of Ventricular Arrhythmia
Collaboration with Prof. Segyeong Joo (Asan Medical Center)
Analyzed the “Physionet Spontaneous Ventricular Tachyarrhythmia Database” for 2.5 months (ongoing project)
Joo S, Choi KJ, Huh SJ, 2012, Expert Systems with Applications (Vol 39, Issue 3)
▪ Recurrent Neural Network with Only Frequency-Domain Transform
• Input: spectrogram with 129 features, obtained after ectopic-beat removal
• Stack of LSTM networks, with dropout between layers
• Binary cross-entropy loss
• Trained with RMSprop
• Prediction accuracy: 76.6% ➞ 89.6% (a sketch of this architecture follows)
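The slide fixes the ingredients (stacked LSTMs with dropout, 129 spectrogram features per frame, binary cross-entropy, RMSprop); everything else below, such as the number of frames, layer widths, and dropout rate, is an assumption. A minimal Keras sketch, not the authors' code:

# Minimal Keras sketch of the described architecture: stacked LSTMs with dropout,
# binary cross-entropy loss, RMSprop optimizer. Frame count and widths are assumed.
from tensorflow import keras
from tensorflow.keras import layers

n_frames, n_features = 30, 129   # 129 spectrogram features per frame (frame count assumed)

model = keras.Sequential([
    layers.Input(shape=(n_frames, n_features)),
    layers.LSTM(64, return_sequences=True),   # first LSTM in the stack
    layers.Dropout(0.5),
    layers.LSTM(64),                          # second LSTM
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),    # P(ventricular tachyarrhythmia ahead)
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])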
175. Prediction of Ventricular Tachycardia One Hour before Occurrence Using Artificial Neural Networks
Hyojeong Lee, Soo-Yong Shin, Myeongsook Seo, Gi-Byoung Nam & Segyeong Joo
Ventricular tachycardia (VT) is a potentially fatal tachyarrhythmia, which causes a rapid heartbeat as a result of improper electrical activity of the heart. This is a potentially life-threatening arrhythmia because it can cause low blood pressure and may lead to ventricular fibrillation, asystole, and sudden cardiac death. To prevent VT, we developed an early prediction model that can predict this event one hour before its onset using an artificial neural network (ANN) generated using 14 parameters obtained from heart rate variability (HRV) and respiratory rate variability (RRV) analysis. De-identified raw data from the monitors of patients admitted to the cardiovascular intensive care unit at Asan Medical Center between September 2013 and April 2015 were collected. The dataset consisted of 52 recordings obtained one hour prior to VT events and 52 control recordings. Two-thirds of the extracted parameters were used to train the ANN, and the remaining third was used to evaluate performance of the learned ANN. The developed VT prediction model proved its performance by achieving a sensitivity of 0.88, specificity of 0.82, and AUC of 0.93.
Sudden cardiac death (SCD) causes more than 300,000 deaths annually in the United States [1]. Coronary artery disease, cardiomyopathy, structural heart problems, Brugada syndrome, and long QT syndrome are well-known causes of SCD [1–4]. In addition, spontaneous ventricular tachyarrhythmia (VTA) is a main cause of SCD, contributing to about 80% of SCDs [5]. Ventricular tachycardia (VT) and ventricular fibrillation (VF) comprise VTA. VT is defined as a very rapid heartbeat (more than 100 times per minute), which does not allow enough time for the ventricles to fill with blood between beats. VT may terminate spontaneously after a few seconds; however, in some cases, VT can progress to a more dangerous or fatal arrhythmia, VF. Accordingly, early prediction of VT will help in reducing mortality from SCD by allowing for preventive care of VTA.
Several studies have reported attempts at predicting VTAs by assessing the occurrence of syncope, left ventricular systolic dysfunction, QRS (Q, R, and S wave in electrocardiogram) duration, QT (Q and T wave) dispersion, Holter monitoring, signal-averaged electrocardiograms (ECGs), heart rate variability (HRV), T-wave alternans, electrophysiologic testing, B-type natriuretic peptides, and other parameters or methods [6–10]. Among these studies, prediction of VTAs based on HRV analysis has recently emerged and shown potential for predicting VTA [11–13].
Previous studies have focused on the prediction of VT using HRV analysis. In addition, most studies assessed the statistical value of each parameter calculated on or prior to the VT event and parameters of control data, which were collected from Holter recordings and implantable cardioverter defibrillators (ICDs) [12,14,15]. However, the results were not satisfactory in predicting fatal events like VT.
To make a better prediction model of VT, it is essential to utilize multiple parameters from various methods of HRV analysis and to generate a classifier that can deal with complex patterns composed of such parameters [7]. An artificial neural network (ANN) is a valuable tool for classification of a database with multiple parameters. An ANN is a kind of machine learning algorithm that can be trained using data with multiple parameters [16]. After training, the ANN calculates an output value according to the input parameters, and this output value can be used
Lee H. et al., Scientific Reports, 2016
176. Prediction of Ventricular Tachycardia One Hour before Occurrence Using Artificial Neural Networks
…in pattern recognition or classification. ANN has not been widely used in medical analysis since the algorithm is not intuitive for physicians. However, utilization of ANN in medical research has recently emerged [17–19].
Parameter       Control dataset (n=110), Mean±SD   VT dataset (n=110), Mean±SD   p-Value
Mean NN (ms)    0.709±0.149                        0.718±0.158                   0.304
SDNN (ms)       0.061±0.042                        0.073±0.045                   0.013
RMSSD (ms)      0.068±0.053                        0.081±0.057                   0.031
pNN50 (%)       0.209±0.224                        0.239±0.205                   0.067
VLF (ms²)       4.1E-05±6.54E-05                   6.23E-05±9.81E-05             0.057
LF (ms²)        7.61E-04±1.16E-03                  1.04E-03±1.15E-03             0.084
HF (ms²)        1.53E-03±2.02E-03                  1.96E-03±2.16E-03             0.088
LF/HF           0.498±0.372                        0.533±0.435                   0.315
SD1 (ms)        0.039±0.029                        0.047±0.032                   0.031
SD2 (ms)        0.081±0.057                        0.098±0.06                    0.012
SD1/SD2         0.466±0.169                        0.469±0.164                   0.426
RPdM (ms)       2.73±0.817                         2.95±0.871                    0.038
RPdSD (ms)      0.721±0.578                        0.915±0.868                   0.075
RPdV            28.4±5.31                          25.4±3.56                     <0.002
Table 1. Comparison of HRV and RRV parameters between the control and VT datasets.
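For reference, the first few time-domain HRV parameters in Table 1 are simple functions of the NN (normal-to-normal) interval series. The sketch below uses the standard definitions; it is not the authors' implementation:

# Standard time-domain HRV features from an NN-interval series in seconds
# (textbook definitions matching the first rows of Table 1).
import numpy as np

def hrv_time_domain(nn_intervals):
    nn = np.asarray(nn_intervals, dtype=float)
    d = np.diff(nn)
    return {
        "MeanNN": nn.mean(),                  # mean NN interval
        "SDNN": nn.std(ddof=1),               # standard deviation of NN intervals
        "RMSSD": np.sqrt(np.mean(d ** 2)),    # RMS of successive differences
        "pNN50": np.mean(np.abs(d) > 0.050),  # fraction of successive diffs > 50 ms
    }

print(hrv_time_domain([0.71, 0.74, 0.69, 0.72, 0.78, 0.70]))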
ANN input            # Inputs  Sensitivity (%)  Specificity (%)  Accuracy (%)  PPV (%)       NPV (%)       AUC
HRV parameters       11        70.6 (12/17)     76.5 (13/17)     73.5 (25/34)  75.0 (12/16)  72.2 (13/18)  0.75
RRV parameters       3         82.4 (14/17)     82.4 (14/17)     82.4 (28/34)  82.4 (14/17)  82.4 (14/17)  0.83
HRV+RRV parameters   14        88.2 (15/17)     82.4 (14/17)     85.3 (29/34)  83.3 (15/18)  87.5 (14/16)  0.93
Table 2. Performance of three ANNs in predicting a VT event 1 hour before onset for the test dataset.
Lee H. et al., Scientific Reports, 2016
This ANN with 13 hidden neurons in one hidden layer showed the best performance (sketched below).
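The best model is specified exactly: 14 inputs (11 HRV + 3 RRV parameters), one hidden layer of 13 neurons, and one output. A minimal sketch of that topology; the activations and optimizer are assumptions, since the excerpt does not state them:

# Sketch of the best-performing ANN: 14 inputs, 13 hidden neurons, 1 output.
# Activations and optimizer are assumed, not taken from the paper.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(14,)),              # 11 HRV + 3 RRV parameters
    layers.Dense(13, activation="tanh"),    # single hidden layer of 13 neurons
    layers.Dense(1, activation="sigmoid"),  # P(VT within one hour)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])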
177.
ROC curve of three ANNs (dashed line, with only HRV parameters; dash-dot line, with only RRV parameters; solid line, with HRV and RRV parameters; dotted line, reference) used in the prediction of a VT event one hour before onset.
Prediction of Ventricular Tachycardia One Hour before
Occurrence Using Artificial Neural Networks
Lee H. et al., Scientific Reports, 2016
178. • 80 beds in three units of Ajou University Hospital: the trauma center, the emergency room, and the medical ICU
• Eight kinds of patient vital-sign data, including oxygen saturation, blood pressure, pulse, EEG, and body temperature, are integrated into a single store
• AI monitors and analyzes these vital signs in real time to predict events 1–3 hours in advance
• Target conditions include arrhythmia, sepsis, acute respiratory distress syndrome (ARDS), and unplanned intubation
183. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
Pranav Rajpurkar*, Awni Y. Hannun*, Masoumeh Haghpanahi, Codie Bourn, Andrew Y. Ng
Abstract
We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).
1. Introduction
We develop a model which can diagnose irregular heart
Figure 1. Our trained convolutional neural network correctly detecting the sinus rhythm (SINUS) and Atrial Fibrillation (AFIB) from this ECG recorded with a single-lead wearable heart monitor.
Arrhythmia detection from ECG recordings is usually performed by expert technicians and cardiologists given the
arXiv:1707.01836v1 [cs.CV] 6 Jul 2017
185. Cardiologist-Level Arrhythmia Detection
with Convolutional Neural Networks
• Training set
• About 64,000 ECG records collected from roughly 30,000 patients
• A 34-layer-deep CNN trained on these data
• Test set
• Zio Patch ECG data from 336 patients
• Ground truth set by consensus of three cardiologists
• Classified into a total of 12 rhythm classes
• 6 cardiologists vs. the AI, compared on:
• whether an arrhythmia occurred
• the type of arrhythmia
(A toy sketch of the sequence-to-sequence CNN idea follows.)
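As a rough illustration of the sequence-to-sequence idea (an ECG sample sequence mapped to per-segment rhythm classes), here is a deliberately tiny Keras sketch. The sampling rate, layer widths, and depth are assumptions; the actual model is a 34-layer residual network:

# Toy 1-D CNN in the spirit of the slide: a single-lead ECG sequence in, a
# sequence of 12-way rhythm-class predictions out. Far smaller than the paper's
# 34-layer residual network; all shapes and widths are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

n_samples, n_classes = 256 * 30, 12   # 30 s at an assumed 256 Hz; 12 rhythm classes

inputs = keras.Input(shape=(n_samples, 1))
x = inputs
for filters in (32, 64, 128, 256):    # strided conv blocks downsample over time
    x = layers.Conv1D(filters, kernel_size=16, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
outputs = layers.Conv1D(n_classes, 1, activation="softmax")(x)  # per-segment rhythm labels

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")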
186. Cardiologist-Level Arrhythmia Detection
with Convolutional Neural Networks
Figure 3. Evaluated on the test set, the model outperforms the average cardiologist.