The rapid spread of COVID-19 worldwide has claimed thousands of lives and has put unprecedented pressure on the healthcare systems around the world. The World Health Organization has emphasised the need for comprehensive testing in order to fight the virus [1]. With the lack of testing kits available worldwide, there is a call for novel testing methods that can help arrest the spread faster [18]. Every health care worker exposed to the virus puts additional pressure on an already outstretched infrastructure. Here, in this study we propose a machine learning approach towards predicting Covid-19 cases among a sample population who have undergone other clinical tests and blood spectrum tests. The patient data used in this effort has been donated by Hospital Israelita Albert Einstein, at São Paulo, Brazil [6] for the purpose of research. The problem at hand is divided into two parts: Predict confirmed COVID-19 cases amongst suspected cases based on the laboratory tests of their clinical samples. Predict admission to general, semi-ICU, and ICU wards among those who predicted positive for COVID-19 in the first task. Our approach uses Classification from Supervised Learning techniques to solve this problem. The efficacy of this approach could be used to scale and develop automated systems that could predict the likeliness of Covid-19 based on laboratory tests that are readily accessible. From the features presented to us in the dataset, we are able to predict with 87.0 - 97.4 percent accuracy at a 95 percent confidence level that a patient is suffering from Covid-19 when biomarkers are taken into consideration. Among those that tested positive, we were able to demonstrate that our model could predict with 87.0 - 100 percent accuracy at a 95 percent confidence that whether the patient would be admitted to a particular ward.
A Machine Learning Approach to Predicting Covid-19 Cases Amongst Suspected Cases and Their Category of Admission
1. A MACHINE LEARNING APPROACH
TO PREDICTING COVID-19 CASES
AMONGST SUSPECTED CASES
AND THEIR CATEGORY OF
ADMISSION
Dr. D Narayana, Amanpreet Singh,
Aparna Ranjith, Divya Dixit, Abhay,
Suparna Khan, Anwesh Reddy
Presented by
Amanpreet Singh
2. COVID-19 timeline
DEC’19
First documented
Covid-19 hospital
admission in
Wuhan, Hubei
JAN ‘20
Multiple cases
registered from
countries around
the globe.
FEB ‘20
WHO named the
disease Coronavirus
Disease - 2019
(COVID-19)
MAR ‘20
UN global response
launched by WHO.
Complete
lockdown imposed
in several countries
APR ‘20
Evidence of
transmission from
asymptomatic to
symptomatic and
pre-asymptomatic
people infected
with Covid-19 were
reported
4. Objective
● Predict confirmed COVID-19 cases amongst suspected cases
based on the laboratory tests of their clinical samples.
● Predict admission to general, semi-ICU, and ICU wards among
those who predicted positive for COVID-19 in the first task.
The World Health Organization has emphasized the need for
comprehensive testing in order to fight the virus . With the lack of
testing kits available worldwide, there is a call for novel testing methods
that can help arrest the spread effectively.
5. About the data
Source Hospital Israelita Albert Einstein, at São Paulo, Brazil
Date Range The data has been collected between March 28 - April 3, 2020
Total records 5644
Attributes 111
Data points used
for prediction
Subset I Subset II
Attributes: 20 Attributes:48
Records: 598 Records: 254
Standard
Deviation
1
Mean 0
Challenges Missing values and imbalanced data
6. Clinical Parameters
● Complete Blood Count Tests
● Liver Function Tests
● Renal Analysis (Kidney)
● Influenza Tests
● Blood gas analysis
● Salt tests
7. Logical Grouping of Attributes
Blood Liver Function Influenza Renal(Urine)
Haemoglobin Alanine
Transaminase
RSV Urine-ph
Red Blood Cells Aspartate
Transaminase
Influenza A Urine-Aspect
Lymphocytes Total Bilirubin Influenza B Urine-Color
C-Reactive
Protein
Indirect Bilirubin Rhinovirus/Enteroviru
s
Urine-RBC
Creatinine Alkaline
Phosphatase
Adenovirus Urine-Density
Potassium GGT Parainfluenza 3 Urine-Haemoglobin
9. Ward Distribution of Patients
Of all the positive patients
● 1.4% were admitted to ICU
● 1.4% were admitted to Semi-ICU
● 6.5% were admitted to Regular
Ward
10. Methodology
● Data Cleaning
● Imbalanced Data – Sampling(UpSampling and Downsampling)
● Imputation
● Classification
● Results
11. Feature Selection Criteria
The first subset has been made taking the features related to CBC along with age
quantile and three categorical variables representing the hospital admission and
the target variable (‘sars_cov2’). The information in these features is related to
RBC, WBC and platelets. A total of 20 features have been chosen with 598
records that have no missing values.
The second subset includes features related to blood, flu and liver parameters
along with age quantile and three categorical variables representing the hospital
admission and the target variable (‘sars_cov2’). It contains 48 attributes with 254
records. The larger dataset after careful filtering had missing values which were
eventually treated using a placeholder value.
For the ward prediction task, the ICU, semi-icu, regular ward were
merged into one categorical column with ordinal values 0,1,2,3
01
02
03
14. Model Performances
Model Performances
Model Training Accuracy(%) Testing Accuracy(%)
Logistic Regression 91.52 89.61
Decision Tree 90.39 83.11
Random Forest 97.17 94.80
Bagging 100 96.10
Gradient Boost 100 94.80
ADA Boosting 98.30 93.50
15. Conclusion
● Our results demonstrate that adding more bio-markers can help in
increasing prediction of covid-19 using machine learning approach.
● Further it can help in predicting severity of the disease (ward).
● Adding these methods to testing protocols can minimize contact for
healthcare workers.
● The model’s output can be used as a tool for prioritization and to support
further medical decision-making processes.
16. Future Enhancements
● Larger Dataset
● More ethnically representative sample data
● Capturing critical features like D-dimer, CRP
● Including X-ray
● Considering co-morbidities
Hello everyone, my name is Amanpreet Singh
In this study we propose a machine learning approach towards
predicting Covid-19 cases among a sample population who
have undergone other clinical tests and blood spectrum tests.
Before moving forward lets talk a little about the covid-19
–December ‘19 - First documented COVID-19 patient was admitted in the hospital in Wuhan, Hubei Province, China
–January ‘20 - Multiple cases registered from countries around the globe and WHO declared the COVID-19 outbreak as the sixth public health emergency of international concern
–February ‘20 - WHO named the disease caused by 2019-nCoV: Coronavirus Disease-2019 (COVID-19)
–March ‘20 - The UN Global Humanitarian Response Plan was launched by the WHO. Complete Lockdown was introduced in many countries.
–April ‘20 – Evidence of transmission from symptomatic, pre-symptomatic and asymptomatic people infected with COVID-19 were reported. WHO published a draft landscape of COVID-19 candidate vaccines
Developing countries with fewer testing resources have adopted conservative testing methods in order to conserve
testing kits on symptomatic and mild to severe cases. As more evidence surfaces around symptoms and associated
risks of Covid-19, it has been observed that blood examination can play a key role in the route to extensive testing.
The patient data used in this effort has been donated by Hospital Israelita Albert Einstein, at São Paulo, Brazil for
the purpose of research. The blood and clinical samples were collected of patients who visited the hospital
for a suspected infection of Covid.
The data was made available in a standardized and normalized form.
It has a unit standard deviation and a mean of zero
There was an imbalance in class distribution across both target columns ‘sars_cov2’ and ‘Ward’
The features which comprise various clinical parameters can
be broadly categorized under CBC, liver function, renal
analysis, salt tests, blood gas analysis (arterial and venous),
And influenza tests.
The CBC is indicative of an individual’s overall health and detects
a range of potential disorders but is not limited to anaemia, leukaemia, or inflammation
The attributes which have more than 98% of missing values were eliminated.
After Data Cleaning, it was observed that there was an imbalance in class distribution across both target columns
‘sars_cov2’ and ‘Ward’ so we have performed sampling there, both up sampling and down sampling.
The missing data was not imputed as imputations may render the results compromised.
Binary classification is used in the first problem statement. And in the second problem we have used multi-class classification.
In the results we have used auc/roc metrics with prediction accuracy of our model performances.
Scenario-1: Area under the ROC-curve for Covid-19 Predictions was observed to be 87.58%.
Scenario-2: Area under the ROC-curve for Covid-19 Predictions was observed to be 97.82%.
Prediction accuracy of the model is also an important factor in determining the efficiency of the model.
we are able to predict with 87.0 - 97.4 percent accuracy at a 95 percent confidence level that a patient is suffering from Covid-19
When biomarkers are taken into consideration.
Among those that tested positive, we were able to demonstrate that our model
could predict with 87.0 - 100 percent accuracy at a 95 percent
confidence that whether the patient would be admitted to a
particular ward.
We performed classification using different machine learning algorithms and compared the testing and
training accuracy of the different classifiers used.
The best accuracy was obtained using Random Forest Classifier with a training accuracy of 97.17% and testing
accuracy of 94.80 % .