Data science landscape in the insurance industry

12-10-2017
1
Data Science Landscape in the Insurance Industry
Stefano Perfetti
ETH Zurich
Author Note
This thesis for the Master of Advanced Studies in Management, Technology and Economics (MTEC) at ETH Zurich
was conducted in collaboration with the Mobiliar Lab for Analytics, an ETH Zurich research group funded by Swiss
Insurer ‘Die Mobiliar’ and focused on exploring the potential of advanced analytics over large volumes of data in
the domain of the insurance industry. Die Mobiliar kindly supported this study by granting the time and effort of
several employees who participated in preliminary interviews and acted as test respondents for the survey which
was one of the research methods used.
However, for my thesis work I was not rewarded financially in any way by any entity and I paid myself the small
costs originated by the research. More importantly, no entity received or will receive any advantage from my thesis
beyond the publicly available results.
For any question, request for data or feedback, I am permanently reachable at:
perfetti dot stefano tiger at gmail dot com (remove the animal to prove you are human)
After the work was completed and submitted on 12-10-2017, this report was revised on 06-12-2017 to fix typos.

2
Table of Contents
Abstract .......................................................................................................................3
Introduction..................................................................................................................4
What Is Data Science? ............................................................................................5
Conceptual Map of Data Science ............................................................................8
Data Scientists Across All Industries .....................................................................11
Data Science in Insurance.....................................................................................12
Actuaries vs. Data Scientists .................................................................................13
Research Methods ....................................................................................................14
Survey....................................................................................................................15
Preliminary research ..........................................................................................15
Survey design ....................................................................................................20
Survey distribution..............................................................................................21
Survey analysis ..................................................................................................23
Clustering analysis .............................................................................................46
Discussion of survey limitations .........................................................................47
Analysis of Job Advertisements.............................................................................49
Choice and collection of document corpus ........................................................49
Text analysis ......................................................................................................51
Conclusions...............................................................................................................54
Main findings..........................................................................................................54
Managerial implications .........................................................................................55
Limitations and open questions .............................................................................56
Acknowledgements ...................................................................................................57
References ................................................................................................................58
Appendix: Full Survey Report from Qualtrics.........................................................61

3
Abstract
Data analytics is among the core challenges for insurance companies today, with possible use cases
spanning all business functions. Consequently, the insurance industry is investing significantly in data
science. Despite this, little research exists about data science in insurance, with studies generally
targeting the wider data science universe instead. This study explores the current landscape of data
science in insurance, with worldwide scope. The two juxtaposed research methods are a survey of data
scientists employed by insurers and a text analysis of a corpus of job advertisements for data scientists
in insurance. Insurance data scientist are found to be extremely concentrated with few big insurers
worldwide. An insurer’s propensity to enact partial externalizations of data science capabilities is found
to be positively correlated to, and plausibly be an instrument for an insurer’s outperformance in data
science capabilities. Insurers collectively are found to potentially have a problem in retaining data
science talent against the lure of career chances in other industries.
Keywords: insurance, “data science”, analytics

4
Introduction
According to a recent study by BCG (2017), digitalization and data analytics are amongst the core
challenges of the insurance industry today. Insurance companies collect and store large amounts
of data that have a huge potential of being transformed into actionable insights about risk and,
ultimately, society. From a data science perspective, the insurance industry can be regarded as a
greenfield; be it about automating repetitive manual data handling tasks, finding better ways to
quantify and price risk, detecting fraudulent claims more efficiently, or improving customer services
through technologies such as machine vision and speech recognition, the opportunities to apply
data science in the insurance industry are vast. Recognizing these opportunities and the need to
remain competitive, many insurance companies have built their own data science divisions in the
past few years. Teams ranging from a handful to several hundred data scientists are now working
in the context of the insurance business. Data scientists, often coming from an academic
background, bring with them state-of-the-art technology and novel approaches, which are slowly
changing a business that is hundreds of years old.
A comprehensive overview of all the aspects of the insurance industry that are being changed or
could be changed by data science is provided in “Analytics for Insurance: The Real Business of
Big Data” by former worldwide IBM executive and Insurance Analytics Expert Tony Boobier (2016),
which is based on the author’s extensive business experience as a consultant to the insurance
industry. Virtually all business functions of the insurance industry can be improved or, often,
radically transformed by the adoption of data science and other related recent technologies.
Boobier (2016) goes on to strongly advocate in favor of this change and warns about the risks of
lagging behind, while at the same time being aware of the potential business and ethical challenges
implied.
Unfortunately, despite the large investment by the insurance industry in data science, too little is
known about what problems are being tackled and which approaches have proved successful so
far. There is, at present, no comprehensive empirical study of data science in the insurance
industry, while studies on data science in general and surveys of data scientists across all
industries are not hard to come by. This thesis aims at filling this gap by taking a snapshot of the
current situation across multiple dimensions, with a worldwide scope.

5
What Is Data Science?
A 2012 article in the Harvard Business Review famously declared that data scientist is the sexiest job
of the 21st century (Patil & Davenport, 2012), but what exactly do data scientists do?
A definition of data science is needed as the very start of this research, but already this simple step
leads to difficulties. Namely, data science is a very new field. Therefore, it necessarily has blurry borders
and there exist divergent views about how to define it. An added challenge is that some popular terms
related to data science are so vaguely defined they can be suspected to be just catchy labels used by
businesspeople and data science advocates to popularize data science techniques; consequently,
opinions will legitimately differ on the topic of which data science-related terms are meaningful and
which are buzzwords.
A few of the different views on what constitutes data science are briefly reviewed here, without any
unrealistic ambition to reach a precise and uncontroversial definition for such a new, manifold and fluid
field.
In 2010, data scientist Drew Conway elaborated a visualization of the macro-skillsets that define data
science in the form of the Venn diagram shown in Figure 1 (Conway, 2010).
Figure 1: Conway, D. (2010). The data science Venn diagram.
While “Math and Statistics Knowledge” has a straightforward meaning, the other two of Conway’s
skillsets need some explanation. Conway (2013) defines hacking skills as applied, practical skills in
data manipulation, not necessarily acquired through an academic education; in his words “For better or
worse, data is a commodity traded electronically; therefore, in order to be in this market you need to
speak hacker. This, however, does not require a background in computer science—in fact—many of
the most impressive hackers I have met never took a single CS course. Being able to manipulate text
files at the command-line, understanding vectorized operations, thinking algorithmically; these are the
hacking skills that make for a successful data hacker” (Conway, 2013).
Conway’s skillset “Substantive Expertise” is substantially synonymous with the much more common
phrase “domain knowledge; in Conway’s words, this category is about “motivating questions about the
world and hypotheses that can be brought to data and tested with statistical methods” (Conway, 2013).
There exist today university study programs dedicated to data science, all of them recent creations, like
for example the master in data science offered by ETH Zurich (D-INFK ETH, 2017).
Comparing Conway’s Venn diagram with the curriculum of the ETH Zurich Master’s Program in Data
Science (ETH, 2017), and interpreting “hacking skills” as programming skills, then the skills taught in

6
the master correspond to the skillset “hacking skills” AND “math & statistics knowledge” = “machine
learning” of Conway’s Venn diagram, i.e. all of Conway’s data science skills which are transversal and
transferable across application domains. The concept that data science entails both domain-specific
and transferable skills gives rise to the research question of whether data science in insurance, or
insurance data science for short, is a separate career path as compared to the wider data science area;
this question will be among those investigated in this study.
Another attempt to map out the skills entailed in data science was made by Rachel Schutt (Schutt &
O’Neil, 2013) by polling the students of her course “Introduction to Data Science” at Columbia University
about their skills. Therefore, the following profile corresponds to the typical student who is attracted to
a data science course (Schutt & O’Neil, 2013).
Figure 2: Survey by Prof. Rachel Schutt among her data science students (Schutt & O’Neil, 2013)
Schutt’s simple approach is notable for using statistics, i.e. a skillset strongly suspected to belong to
data science, in order to investigate the definition of data science itself. A similar attempt, but using
more sophisticated data science methods, was made by Harlan Harris in mid-2012 by running a
clustering analysis on the responses of a survey of several hundred data science practitioners (Harris,
Murphy & Vaisman, 2013), which is visualized in Figure 3.

7
Figure 3: Harlan Harris’s clustering and visualization of subfields of data science
(Harris, Murphy & Vaisman, 2013)
This research will follow an approach similar to Harris’s, i.e. using data science methods to search
for the profile of insurance data scientists within their responses to a survey and within the text of
job advertisements for insurance data scientists.

8
Conceptual Map of Data Science
All the previously shown tentative definitions of data science suffer from two shortcomings: first, they
concern themselves with the macro-skillsets that data scientists possess or should possess, without
articulating which analytical methods belong to data science; secondly, none of them is more recent
than 2013, which is a long time ago for a rapidly evolving new field like data science.
A relatively recent conceptual map of data science methods can be found in a March 2016 blog post
on KDnuggets, the seminal online data science resource, authored by Matthew Mayo, a data scientist
and an editor of KDnuggets itself.
In the blog post, Mayo proposes the conceptual map of data science methods, techniques and concepts
which is visualized in Figure 4.
Figure 4: Conceptual map of data science (Mayo, 2016)
Mayo chooses the concepts to put on the map in part by using Google trends to track the search
frequency of various data science-related phrases (Mayo, 2016). Of course, the concepts involved
here are heavily overlapping; therefore, observed changes in search popularity will reflect not only
changes in interest for the searched concepts, due to technological or social evolution, but also
changes in the popularity of using different phrases to refer to those concepts, due to language
use changing over time. It is necessary to remember this inevitable confusing factor when looking
at Figure 5, which is derived from Mayo’s original Google trend search (Mayo, 2016) by updating
it till 22-Sep-2017. The graph in the figure shows how searches for “data mining” have declined
while all other searches have increased, with the three searches growing most rapidly on a
proportional basis being, in order: “deep learning”, “machine learning” and “data science”.

9
Search keywords
Relative popularity
% change@ period
start
@ period
end
"deep learning" 1 40 +3900%
"machine learning" 17 97 +471%
"data science" 11 61 455%
"artificial intelligence" 28 62 +121%
"data mining" 39 30 -23%
Figure and table 5: comparison of trends (Google Trends, 2017), following the original idea in (Mayo, 2016)
The explosive growth in interest for “deep learning” may have been plausibly caused by deep learning
being the subset of machine learning methods that has experienced groundbreaking progress in the
last few years (LeCun, Bengio, & Hinton, 2015), enabling such impressive advances as computer vision
algorithms now surpassing humans in benchmark tests for the quintessentially human task of face
recognition (He, Zhang, Ren, & Sun, 2015).

10
The phrase “machine learning” is now the most searched among those phrases and the phrase “data
science” has risen proportionally with it. This suggests that a definition of data science today cannot
omit a definition of the closely related concept of machine learning; therefore, it is opportune to produce
a taxonomy of machine learning methods.
Following the survey paper of machine learning methods by Qiu, Wu, Ding, Xu, & Feng (2016), machine
learning methods can be classified as shown in Figure 6.
Figure 6: taxonomy of machine learning methods, my own diagram
visualizing concepts by Qiu, Wu, Ding, Xu, & Feng (2016)
In this taxonomy, machine learning methods are classified based on the high level-way in
which information is provided and processed by machine learning algorithms. By and large,
all the same methods can be applied to different data formats for different purposes, giving
rise to technical areas such as text mining, natural language processing, speech processing,
computer vision, etc. Therefore, in each of these areas, it will be possible to find applications
of supervised, unsupervised, reinforcement and deep learning methods.
While machine learning has come to currently have an important role in defining data science,
machine learning methods are by no means the only arrows with which data scientists can
hunt for information and insights in the thick forests of data. In this study, a data scientist’s
quiver of arrows will be thought to contain also methods and techniques outside the realm of
machine learning, such as traditional statistics – like already mentioned – but also operational
research and econometrics, data mining, data visualization, and potentially yet many more.
However, it will also be assumed that machine learning methods are the essential, defining
tools of a data scientist.

11
Data Scientists Across All Industries
In a recent report on the strong demand for data science skills in the US job market, co-sponsored by
IBM and aptly titled “The Quant Crunch”, Data Science and Analytics (DSA) jobs are defined as a
stratification of job categories, ordered by increasing analytical rigor, as shown in Table 7; at the top of
the pyramid, we find the category of “Data Scientists & Advanced Analytics”, whose functional role is to
“create sophisticated analytical models used to build new datasets and derive new insights from data”,
and “Data Scientist” is one sample occupation within this job category. In other words, the occupation
of data scientist is classified into a set of jobs called “Data Scientists & Advanced Analytics”, which in
turn is collocated into a larger superset named “Data Science and Analytics” (Miller & Hughes, 2017).
Table 7: Description of DSA Jobs (Miller & Hughes, 2017)
Leaving aside the somewhat confusing similar names, it is illuminating to think of this stratification as
of a pyramid, because Miller & Hughes report job market statistics showing2 how the above mentioned
job category with top analytical rigor, “Data Scientists & Advanced Analytics, is much rarer, much faster
growing and significantly better paid than any other category except for Analytics Managers, as shown
in Table 8.
Table 8: Summary Demand Statistics (Miller & Hughes, 2017)

12
In this research, the group of interest is the occupation of data scientist, i.e. the smallest group based
on the most restrictive and most specific definition. In other words: a small professional elite.
Two interesting questions to ask about this occupation are: how many are they worldwide, and how fast
is their worldwide number growing? Here Miller & Hughes (2017) are of no help, as their data is only
for the USA and does not disaggregate below the job category level.
For the first question, we find some help in LinkedIn which, as of 11-Oct-17, counts 51'491 profiles with
a present or past job title containing the phrase “data scientist”. This is a naïve way of counting data
scientists, because it ignores data scientists by any other job titles, but let us accept it for the moment;
we will go back to this issue later.
In a Quora answer dated 08-Mar-2014, Peter Skomoroch, then Sr. Data Scientist at LinkedIn,
approached the question of how many data scientists existed by any other name at that time using a
method that I deem too lax for the purposes of this research. However, in the process, he quoted an
interesting data point, i.e. that on that day LinkedIn had 6'896 profiles with a current or past title
containing the phrase "data scientist" across any industry (Skomoroch, 2014). This number can be
combined with the result count of 51'491 for the same query on 11-Oct-17, allowing to estimate a
compound annual growth rate of +75% for data science jobs during the last 3.5 years ending in October
2017. While this number might conceivably have been inflated over that period by some people
changing their job title on LinkedIn to “data scientist” as a self-marketing tactic, it is anyway impressive.
Data Science in Insurance
Of course, this thesis is about data science not in the whole economy, but just in the insurance industry.
Therefore, out of the small worldwide elite of data science professionals, only the share belonging to
the insurance sector will be considered. How large is this share? Knowing that in 2015 there were
2'540'000 insurance jobs in the USA (Statista, 2017a) out of 121'490'000 total jobs (Statista, 2017b),
we can derive that insurance represents about 2% of overall jobs in developed economies, which are
the economies where the insurance industry has a significant presence according to Boobier (2016).
Therefore, even if data scientists were disproportionately concentrated in insurance, we should not
expect a large population.
But enough has been said for now about the number of insurance data scientists; let us briefly consider
how data science is applied in insurance today. Writing a comprehensive analysis of this topic is beyond
the goal and possibilities of this work, and it would also be a needless duplication of the already cited
book by Tony Boobier, having the full title of Analytics for insurance: The real business of Big Data and
which I can only recommend heartily (full disclosure: Tony Boobier is also one of the subject matter
experts who were consulted for this study and was one of the endorsers of the survey being part of this
research). So, let it suffice here to extract a few bullet points from Boobier (2016):
- data science, together with various complementary or enabling technologies like cloud computing
and IoT, has a very long and growing list of applications in the insurance industry, pervading all its
business functions;
- some particularly noteworthy high-level applications and consequences of data science in
insurance are as follows:
o insurers gradually become able to practice active risk management, i.e. enact measures to
prevent the negative outcomes that have been insured against, which is potentially a huge
win-win scenario for insurers and their clients;
o risk estimation and pricing gradually become more and more granular, up to case by case
customization, with ambiguous social and economic consequences;
o many instances of fraud, no matter if perpetrated by clients, suppliers or dishonest
employees, can be prevented;
o claims processing can be made more efficient and more effective;
o new business models can be developed;
o in marketing, all the usual, non-industry-specific applications to acquisition, retention and
upselling can be implemented;
- use-based telematics car insurance is notable as an example of advanced application, combining
several of the above aspects, which has already become reality with many insurers;

13
- there can be not only upsides, but also downsides for individuals and society, such as:
o possible de facto abolition of the right to privacy;
o increased granularity of risk estimation potentially leading to some risks becoming
economically uninsurable;
o algorithmic bias, which, simplifying, means discrimination against groups of people enacted
by algorithms;
o moral hazard, i.e. the human tendency to compensate increased security with increased
risk taking, possibly voiding the benefits of active risk management;
o the usual, non-insurance-specific risks of unethical manipulations that appear whenever
marketing is made more effective;
- finally, there can also be downsides for insurers, such as:
o insurers as a whole category could lose entire businesses, for example car insurance could
be taken over by car manufacturers;
o a single insurer could slide into oblivion by missing some train of innovation;
(Boobier, 2016).
In many fewer words, the applications of data science in insurance have deep transformative potential
both for the insurance sector and for society at large.
Boobier (2016) is the most important source of this research, because it provides the overall conceptual
framework to think about all this ongoing transformation, as well as the direct basis or the inspiration
for segmenting the phenomenon in various dimensions.
Actuaries vs. Data Scientists
Data scientists are not the only data professionals typically working in modern insurance.
Boobier (2016) notes that the insurance industry embraced data much earlier than most other
industries, in the form of statistics and actuarial science, then in recent times it has lagged other industry
sectors in adopting data science. It would not be implausible if the former fact had caused the latter.
The statistical concepts needed to scientifically measure and mitigate risks have their origins in the 17th
century studies of probability and annuities (Heywood, 1985). What actuaries do today for insurers is
to manage risk, assets and liabilities, as well as to determine insurance premiums (BeAnActuary, 2017)
and a very important toolset for them are generalized linear models (Haberman & Renshaw, 1996).
Whereas data science is vaguely defined and universities have started offering data science curricula
only in recent years, in almost all countries the processes to become actuaries share a rigorous
schooling or examination structure and take many years to complete (Feldblum, 2001). As recently as
2010, actuary was ranked to be the best job based on the criteria of environment, income, employment
outlook, physical demands and stress (Needleman, 2010).
Unsurprisingly, Miller & Hughes (2017) classifies the occupation of actuaries within the stratification of
“Data Science and Analytics” jobs; surprisingly, it includes them in the job category of “Functional
Analysts”, which is 4 levels of analytical rigor below the job category including data scientists.
All things considered, actuaries appear to be a different professional role compared to data scientists,
with sharp borders, though of course career transitions are possible. After all, recalling Conway’s Venn
diagram of data science, actuaries share with insurance data scientists at least two of the three
ingredients of data science, namely: math and statistical knowledge, plus substantive knowledge in the
insurance domain.

14
Research Methods
This is an exploratory study on the data science landscape in the insurance industry, on which not much
was known beforehand. There were no hypotheses to test, the goal was just to learn as much as
possible about the study object.
Data on the data science landscape in insurance was collected in two independent ways: on the one
hand, executing a traditional survey tailored to data scientists employed by insurers; on the other hand,
collecting a corpus of job advertisements for data scientists in insurance.
The study was largely exploratory also on a meta-level, as the exact methods to be used were not fully
defined in advance, but adjusted on the go as more information on their effectiveness was gained.
To design the survey, interviews and discussions with several subject matter experts – thereby including
my thesis supervisors – were used for discovering the main descriptive variables and their domains of
definition, so as to learn what should be asked.
The survey distribution strategy was fine-tuned by trial and error, except for the pre-defined feature of
using the promise to publish all anonymized survey data as an incentive to participation.
The survey data was then explored with several statistical and data science methods.
For the job ad corpus, it was known from research literature - such as Szabó (2011) and Wowczko
(2015) – that, in general, it is possible to extract meaningful information from job ad corpora with text
analysis methods. Unfortunately, the failure to find large and legally collectible sources of job ads forced
to adopt simple text analysis methods.
There were no strong a priori expectations for how exactly the results of the two research methods
should relate to each other, except that they should be somehow comparable, due to their underlying
connection. Again, the research approach was exploratory on a meta-level, desiring to check out the
potentialities of juxtaposing the two methods. In the end, survey and corpus were compared with simple
text analysis methods and qualitative observations.
Since one image is worth more than 1’000 words, the whole exploration process is schematized below.
Figure 9: Exploratory research roadmap

15
Survey
Preliminary research
Various forms of preliminary research were employed to explore the research topic with the goal to
improve the design of the planned survey targeted at data scientists. These included: brainstorming
with my thesis supervisors, both of whom are subject matter experts on data science in insurance;
interviewing and discussing with several other subject matter experts, including four insurance data
scientists, a data science manager and a data science business analyst in insurance, three consultants
specialized in data science for insurance, and an ETH researcher with a background in data science.
Thus, a total of twelve people kindly gave me input that I considered and factored into the design of the
survey.
Most of these people also accepted to act as test respondents in a pilot survey. Moreover, two more
data scientists and one more data science manager helpfully gave me input for how to analyze the
survey data while the survey was running, in addition to my thesis supervisors.
Therefore, this research benefitted to various degrees from the advice of a diverse group of fifteen
subject matter experts, employed with different roles at six different private companies active in the
insurance sector and at one university, residing in four different countries and representing different
viewpoints. Crediting each of them for their exact contribution would be too complex and it is not even
allowable because most conversations were held under my promise of confidentiality, but I believe their
contributions were valuable and I am thankful for them.
The subject matter experts were sourced through the professional networks of my thesis supervisors,
as well as through my personal networking at data science-related meetups in Zurich.
The most relevant information deriving from preliminary interviews was that the returns on investment
of a successful data science project in insurance can be very high.

16
Characteristics of the target population
Identification and size estimation
The total size of the target population, i.e. data scientists working in insurance in any world country,
was estimated with a two-step method.
The first step, designed to find data scientists using any job title, was a bottom-up search for LinkedIn
profiles worldwide registered with industry = “insurance” and containing appropriate keywords in the job
titles. The keywords were collected partly by asking subject matter experts and partly iteratively from
previously identified keywords, i.e. by looking at what other words were contained in job titles of already
found profiles, as well as by trying with linguistic variations. For each set of profiles matching a keyword,
10 to 30 profiles were randomly sampled in order to estimate the probability of profiles within that set
being data scientists. Keywords were searched only within the job title, and not in the full profiles,
because that would not have been discriminating enough.
The results are shown in the table below.
Table 10: Estimated potential survey targets on LinkedIn, last updated as of 22-Aug-17
Job profile keywords,
by decreasing
specificity
Nr of profiles Estimated
probability
of being
on target
Estimated Nr
of added
target
respondents
containing
keywords
added by
keywords*
"data scientist" 874 874 99% 865
"data science" 305 294 99% 291
“machine learning” 12 9 99% 9
"predictive modeler" OR
"predictive modeller"
45 41 90% 37
"predictive modeling" OR
"predictive modelling"
66 63 90% 57
"research scientist" 83 81 60% 49
"predictive analyst" 13 11 50% 6
"predictive analytics” 79 68 50% 34
"data engineer" OR
"data developer"
124 114 30% 34
"data consultant" 155 150 30% 45
Profiles above
this threshold
were then indeed
targeted
"data analytics" 636 633 30% 190
“big data” 181 133 30% 40
"data analysis" 188 173 10% 17
"data analyst" 3’549 3’293 10% 329
"analytics" 4’699 4’217 5% 211
"data" 14’993 8’956 1% 90
"data consulting" 1 0 - 0
“text analytics" 1 0 - 0
"text mining" 0 0 - 0
"predictive models" 0 0 - 0
TOTALS - 19’110 - 2'304
* i.e. eliminating profiles already found with any of the previous keywords
The above estimate suffers from various limitations.

17
First, some other job titles of data scientists working in insurance may have been missed. A
mitigating factor is that this research has chosen a restrictive definition of data science, i.e. just the
top of the data science and analytics job pyramid; these people, being an international elite, are
very likely to mostly have only a few well recognizable job titles. Despite this, many possible job
titles have been considered, assessing their association with data science in a rather lax way, thus
potentially introducing even false positives into the bottom-up estimate.
Secondly, there are country-specific cultural and social factors in play. On the one hand, LinkedIn
does not have the same penetration rate all over the world. In some countries, like Russia, it is
even forbidden (WEF, 2016). In other countries, competing professional social networks have
considerable market shares: for example, Xing is popular in German-speaking countries. However,
LinkedIn is by far the most dominant social network worldwide, especially so among advanced
economies (Alianzo, 2014) where insurance data scientists can be expected to cluster.
Another cultural factor is that, in some countries, insurance data scientists may be present on
LinkedIn with some of the identified job titles, but in local languages other than English; however,
this second limitation is mitigated by English being the language of science and of mobile elites.
Thirdly, even in countries where LinkedIn is the de facto standard professional social network,
some data scientists will decide not to be registered on it, due lack of interest in alternative job
opportunities, dislike of social networks in general, or any other personal reason. However, this
study will assume that data scientists are, by and large, a highly digital and career-mobile
demographic.
In conclusion, I estimate the shares of profiles missed due to these issues as follows.
Table 11: Estimates of missed shares
Source of inaccuracy Missed
share
Estimate Reasoning
job titles
missed in search
𝑀𝑖𝑠𝑠𝑒𝑑𝑆ℎ𝑎𝑟𝑒* 0% - data scientists are a well-recognizable
international elite mostly identified by
well-recognizable job titles in English
- many job titles were considered
- hopefully, missed profiles and false positives
will compensate each other
country-related
cultural factors
𝑀𝑖𝑠𝑠𝑒𝑑𝑆ℎ𝑎𝑟𝑒+ 30% - affected countries are mostly developing
economies, while insurance data science is
practiced mostly in advanced economies
- English is the international language of
science and of mobile elites
data scientists
voluntarily missing
from LinkedIn
𝑀𝑖𝑠𝑠𝑒𝑑𝑆ℎ𝑎𝑟𝑒, 10% - data scientists are highly digital
- data scientists are highly career-mobile
Plugging those numbers into the equation
𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝑒𝑑𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒 = 𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒 𝐹𝑜𝑢𝑛𝑑𝑆ℎ𝑎𝑟𝑒 = 𝐸𝑠𝑡𝑖𝑚𝑎𝑡𝑒 1 − 𝑀𝑖𝑠𝑠𝑒𝑑𝑆ℎ𝑎𝑟𝑒:
,
:;*
yields a FoundShare of 63% and a worldwide estimate of 3’700 data scientists working in insurance,
based on data from August 2017.
However, it must be noted that surely not all of them work for insurers. A person can deal professionally
with insurance also while being employed by other types of companies, such as:
- Insurance brokers;
- Providers of software, consulting and services for insurers;
- Non-insurance companies that offer some form of insurance, such as warranties, as a side business
or complementary service;
- Large insurance buyers, who need the advice of buy-side experts;
- Insurance comparison websites;
- InsurTech startups potentially having some business model which does not fit any of the above
categories.

18
Based on viewing hundreds of LinkedIn profiles included in the earlier counts, I estimate that about 30%
of insurance data scientists work for companies other than insurers. Different company types would
require different sets of survey questions; the survey in this study concentrated on insurers, who employ
about 70% of insurance data scientists.
Finally, LinkedIn users have the option to make their profile “private”, i.e. neither fully visible nor
contactable for people who are not close enough to them in social network topology. Based on trying
to view many of these profiles, I estimated that about 10% of insurance data scientists present on
LinkedIn were unreachable from my own position in the network topology at the time when I launched
the survey. Since I chose LinkedIn as the main distribution channel for the survey, this fact was a further
restriction on the number of insurance data scientists potentially reachable by the survey. Table 12
summarizes the situation.
Table 12: Summary of population estimates
Population estimates
as of 22-Aug-17
a. Overall population
b. Subset of a. present
on LinkedIn
= 63% of a.
c. Subset of b.
theoretically
reachable on
LinkedIn
= 90% of b.
A) Data scientists
working in insurance
industry ~3'700 ~2'300 ~2'100
B) Survey core target =
Subset of A) working for
insurers
= 70% of A)
~2'600 ~1'600 ~1'500
These worldwide totals may seem small, but they are not implausible if one remembers 1) that a very
restrictive interpretation of who is a data scientist has been chosen for this research, linking the role to
high analytical rigor; 2) that insurance employees are only about 2% of all employees in the USA, taken
as a representative of advanced economies; 3) that people bearing the phrase “data scientist” in their
job title on LinkedIn were around 50’000 worldwide on 11-Oct-17. In fact, if data scientists are assumed
to work in insurance at the same rate as all employees, then a worldwide total of 3'700 insurance data
scientists implies the existence of 185'000 data scientists worldwide by any name, i.e. almost 4 times
those that are found on LinkedIn by querying for the job title “data scientist”, which sounds to be
reasonable and most certainly is in the correct order of magnitude.
As a final observation on estimating these totals, it is necessary to note that, if the worldwide number
of data scientists really grew by +75% yearly in the 3.5 years till October 2017 and if it is still growing
so explosively, then any estimated total would age fast, as a +75% yearly growth rate corresponds to
+4.8% with monthly compounding.

19
Distribution of the target population
The below table was derived on 22-Aug-2017 from LinkedIn by searching all employees having the
keywords “data science” or “data scientist” in their current job titles and attributed by LinkedIn to the
insurance industry, then manually eliminating from the list all companies that are not actually insurers.
Based on the earlier classification of keyword suitability, using only those two search phrases is
equivalent to selecting only the employees that have an extreme likelihood of really being data scientists
according to this study’s restrictive definition.
Therefore, this table approximates the “state of the race” among global insurers or insurance groups
for achieving data science capabilities, without considering outsourcing. Worryingly for small to
medium-sized insurers, there emerges an extreme concentration of these capabilities within the
industry, with just the top 14 companies in this list accounting already for more than half of all “core”
data science employees.
Table 13: distribution of data scientists by insurer
Rank Insurer or insurance group Nr. of entities
Data
scientists
found Cumulative %
1 AXA 1 66 8.85%
2 Allstate 1 53 15.95%
3 Liberty Mutual Insurance 1 38 21.05%
4 Aetna 1 37 26.01%
5 The Hartford 1 32 30.29%
6 Zurich 1 27 33.91%
7 AIG 1 23 37.00%
8 Allianz 1 20 39.68%
9 Generali 1 17 41.96%
10 Humana 1 17 44.24%
11 Blue Cross and Blue Shield 1 14 46.11%
12 MetLife 1 14 47.99%
13 Aviva 1 12 49.60%
14 Nationwide Insurance 1 11 51.07%
15 insurers with 10 DSs 0 0 51.07%
- TOTAL 220 746 100.00%

20
Survey design
Design process
The survey was implemented in the web-based survey software Qualtrics.
The process of survey design started with brainstorming sessions between my thesis supervisors and
me, where we identified numerous dimensions and topics that data scientists could be asked about,
without considering priorities yet. The question topics included both topics at the individual scope, i.e.
referring to a single data scientist, and questions at the company scope, i.e. referring to the employer.
To articulate several questions, it was necessary to produce approximate taxonomies, for example of
insurance products, of data science methods, of data types and formats, etc.
This first design phase resulted in a first draft of survey that was judged to be considerably longer and
more complex than what an average respondent might accept.
The first draft was pared down into a shorter and simpler pilot survey by cutting or simplifying those
questions that appeared too prone to identify individual respondents; or too subjective; or too complex;
or just of too low priority. I aimed for a survey that could be completed within ten minutes and which
would not impose too high a cognitive load on respondents, even accounting for the fact that the
targeted demographics is highly educated.
Based on the precious feedback given by test respondents, several further improvements were
implemented: the survey overall was further simplified and made more homogenous; questions were
improved in their structure, wording and in the quality of the taxonomies used; user-friendliness was
upgraded. This phase resulted in the final survey.
Design principles
Throughout the design process, basic survey design principles were carefully followed, such as
avoiding the pitfalls of leading, unacceptable and double-barreled questions. The wording of questions
and instructions was chosen to be simple, precise and neutral. Different type of questions were chosen
for different subtopics based on best fit and alternated so as to provide some variety, while taking care
not to overburden respondents with too many different measurement scales or with slightly different
numbers of measurement levels; thus, the same or analogous measurement scales were reused for
multiple questions.
In order to collect as much information as possible in an efficient and user-friendly way, i.e. limiting the
quantity of instructional phrases and the overall cognitive load involved in interpreting the survey,
questions were phrased so that they could be grouped into just a few homogeneous grids.
Wherever applicable and practical, the survey asked not only about the present, but also about the
expected or planned future situation of the respondents and of their companies, with the goal to enable
the detection of current trends. Questions about the future were phrased in terms of plans for the next
twelve months, using that somewhat arbitrary time horizon to distinguish concrete plans from vague
aspirations.
To get rich information about use cases of data science in insurance, it was desirable to learn how
several dimensions combine with each other, i.e. for example which data science methods are used on
which data types in which data formats for which business functions, in the present and in the planned
future. To reduce complexity, it was decided to ask only one of such combinatory questions, namely
which data types are used for which business functions in currently implemented use cases.
Questions were arranged in a logical flow between a brief introduction presenting the research and a
short thanks message at the end, as customary and necessary.
No question was made mandatory to respond, deferring to popular survey service Survey Monkey’s
warning (SurveyMonkey, 2016) that mandatory questions increase survey abandonment and incorrect
responses. Furthermore, along the same line of thinking, every factual question included a “don’t know”
option, thus making sure that respondents could leave it unanswered by either skipping it or answering
“don’t know”, and that they could also void an answer given by mistake; finally, every evaluative
question using a bipolar scale deliberately included a neutral measurement level, to allow the explicit
expression of a neutral opinion.

21
Survey distribution
Due to the limited size of the target population of the survey, it was not possible to reach a good sample
size just by contacting more target respondents, and it was fundamental to achieve a high response
rate instead. The situation was further complicated by the lack of a database of contacts for those target
respondents. Therefore, an appropriate design of the distribution and promotion campaign for the
survey was a key step of this research; moreover, tradeoffs had to be accepted in order to increase the
response rate.
First of all, it was decided to distribute the survey through an anonymous link, so that engaged
respondents could actively redistribute it, at the cost of losing control on the selection of participants. In
fact, the survey promotion was designed so as to engage survey participants, in the hope to make it go
viral within the global community of insurance data scientists.
The Qualtrics survey was made anonymous, collecting no information from respondents’ browsers
except for the IP geolocation-inferred country of origin of each response. This choice was made
specifically so that respondents could be informed that strong measures were being taken to protect
their anonymity, so as to remove one possible reason of refusing to participate to the survey.
During the four weeks between 10-Aug-17 and 07-Sep-17, I progressively contacted all the previously
identified target respondents over LinkedIn, asking them at the same time to add me as a contact and
to take part in the survey. I made no manual effort to double-check LinkedIn’s classification of the target
respondents within the insurance industry based on their employers; I made this choice for practicality’s
sake – i.e. for time management reason, but also out of trust in the quality of LinkedIn’s classification
algorithms and, finally, in recognition that non-insurance companies may sometimes offer insurance
products as a side business or complementary product, or they may be buyers of insurance and thus
need buy-side experts.
In order to be able to access as many LinkedIn profiles as needed, I had to upgrade my LinkedIn
premium plan for the whole period of survey distribution, at a cost of about 70 CHF.
Survey promotion deliberately made use of all the six key principles of persuasion identified by Robert
Cialdini’s classic work on the psychology of persuasion: reciprocity, commitment and consistency,
social proof, authority, liking, scarcity (Cialdini, 2001).
So, it was decided that all survey results, including the anonymized raw data, would be shared publicly
on a web page after this study, unconditionally on survey participation and in line with the spirit of
research transparency as well as of GitHub and of the open source movement – i.e. in line with ideals
presumably popular in the community of data scientists. This was suitable to activate the persuasion
principles of liking and reciprocation. Liking was amplified by always addressing target respondents
respectfully and politely (as demanded by good manners, too) and using German, French or Italian
whenever one of these languages could be presumed to be the target respondent’s mother tongue
(while everybody else was addressed in English and the survey itself was in English for everybody).
The survey introduction mentioned the endorsement of three reputable testimonials, namely an
international book author and two heads of data science teams, plus my affiliation with highly ranked
university ETH Zurich, so as to signal survey quality through the approval by credible authorities.
All target respondents expressing appreciation for the survey were personally and politely asked to
commit to taking more actions to spread the survey further among suitable respondents, and one
suggested action was to “like” or share the survey link on the social network LinkedIn. This way, both
persuasion principles of social proof and of commitment and consistency were put in service of the
survey’s promotion. Social proof further benefitted survey promotion thanks to the fact that every data
scientist accepting to add me to their professional network contributed to make me closer, in social
network topology, to the remaining target respondents (by reducing our degree of separation and/or
increasing the number of shared contacts), thus contributing to convince the remaining target
respondents.
Finally, the survey was initially announced to close on 31-Aug-17, then extended till 07-Sep-17, and in
both cases, during the preceding seven days my communication emphasized the scarcity of the
remaining time. This way, none of Cialdini’s persuasion principles remained unused.
During the last 14 days, the survey promotion was further reinforced with an ad campaign on LinkedIn
in favor of the survey, targeted at worldwide insurance employees whose job descriptions contain “data
science” or “data scientist”, i.e. the very core of the target population, counting few more than 1’000
people. For a small cost of just 76 USD, this campaign generated over 33’000 impressions, i.e.
potentially enough to reach each ad target 30 times, so the message was strengthened by a robust

22
dose of repetition. LinkedIn further reported that those impressions resulted in 68 clicks on the survey
link, i.e. 18% of all 377 survey accesses registered by Qualtrics, although unfortunately no information
is available on the survey completion rate resulting from these clicks, due to the impossibility to integrate
LinkedIn’s ad tracking features with the anonymous survey run in Qualtrics. In fact, LinkedIn must be
relied upon for all information about the efficacy of the ad campaign, despite their obvious conflict of
interest in reporting such statistics.
All survey promotion was carefully worded to motivate the target respondents’ participation while at the
same time evoking their freedom to not participate, thus avoiding the triggering of reactance, following
the advice in the research of Guéguen, Joule, Halimi‐Falkowicz, Pascual, Fischer‐Lokou, & Dufourcq‐
Brana (2013) on how to obtain higher compliance.
In fact, I went one step further in this direction and created two separate versions of the final thanks
message, randomly chosen with equal probability, and worded one of them in such a way as to try to
trigger reactance against my stated assumption that respondents would be inclined not to help,
following the indications of Guéguen (2016). However, due to lack of time and of relevance to the
research topic, I did not try to monitor whether either version of the thanks message was leading to
more referrals, so this last step remained only a diversification in the messaging.

23
Survey analysis
Standards for survey analysis
Out of the 377 times that the survey link was clicked, the survey was started only 300 times and then it
also suffered from a high abandonment rate, with only 181 respondents reaching the end and with a
minimum of only 174 responses for the most complex question, Q07. This is not totally negative,
because it might have been caused, at least in part, by off-target or unmotivated respondents being
driven away by the complexity and target-specificity of the survey questions.
The full results, as reported from Qualtrics and without any statistical treatment, are reported in
Appendix: Survey Report from Qualtrics.
In this section, an 𝛼 level of 0.05 will be used for all analyses and confidence intervals will be indicated
whenever possible. The lower and upper bound for confidence intervals will be computed according to
standard statistical methods; in particular:
Table 14: Summary of confidence interval calculations in frequently occurring cases
For means For frequencies For t-tests
𝑥 ± 𝑧*@∝/+
𝑠
𝑁 𝑝 ± 𝑧*@∝/+
𝑝 1 − 𝑝
𝑁
Directly from the t-tests
The number of cases N will be estimated for each sub-question in the most conservative possible way,
i.e. no missing answer will be considered as a negative answer even where it is plausible that
respondents may have sought to minimize the number of clicks needed to complete a question. This
choice aims at avoiding an underestimation of confidence interval amplitude.
Considering the earlier estimated small size of 2’600 for the target population, the estimated amplitude
of confidence intervals could be driven down modestly by applying the finite population correction factor:
N − n
N − 1
where N is the population size and n is the sample size. This would make confidence intervals from 3%
to 6% narrower, depending on the question. However, I will conservatively abstain from doing so, in
consideration of the various limitations affecting this survey, which are listed in a later dedicated section.
For each statistic being examined, not only the statistical significance, but also the effect size will be
considered, to make sure that only results statistically significant and of meaningful size are reported.
Data visualization will be used to make this analysis easier and results more interpretable in general.
All “other” answer options provided by the respondents as free text will be treated separately and their
frequency will not be compared with those of the “standard” answer options, because too few
respondents provided such input to allow for meaningful comparisons with the other non-free-text
options.
Before comparing different variables explicitly or implicitly through mathematical methods or graphically,
care will be taken that they are appropriately expressed in comparable scales. To make this easier,
most variables have been rescaled beforehand to a unitary range, as is appropriate for most variables
involved in this survey.

24
Response counts
The following table summarizes answer statistics by question, including the number of free-text answers
given by respondents where that possibility was offered.
The segmentations underlying many sub-questions and answer options seem to have performed rather
well, as respondents did not feel the need to input additional categories too often. The few partial
exceptions will be briefly discussed question by question.
Table 15: Response counts
Question Topic
Nr of
answers
to any
sub-
question
% on
answers
to Q01
"Other" answers
given in form of free
text
N %
Q01
Insurance product types
offered by the company
300 100.00% 44 15%
Q02
Characteristics of the
organization of data science
244 81.33% - -
Q03
Externalizations of
data science capabilities
239 79.67% - -
Q04
Data science project
success factors
248 82.67% 30 12%
Q05
Business functions
using data science
206 68.67% 17 8%
Q06
Data types
used by data science
209 69.67% 10 5%
Q07
Combinations of business
functions and data types
174 58.00% - -
Q08
Data formats
used by data science
181 60.33% 2 1%
Q09
Data science
method types used
181 60.33% 9 5%
Q10
Career background
and aspirations
178 59.33% - -
Q11 Educational level 182 60.67% 6 3%
Q12 Educational field 182 60.67% 29 16%
Q13
Time consuming job activities
and duties
182 60.67% 10 5%
Q14
Opinions on the future of
data science in insurance
181 60.33% 30 17%

25
Countries of origin of responses
During the survey, the survey software was set to not collect any potentially identifying data from
respondents’ browsers, except for the country of origin of responses. This information was monitored
to check for a minimum of geographical balance and was later separated from the main dataset, as part
of the measures to protect the respondents’ privacy, leaving only the country group information in the
main dataset.
The figures below visualize the 40 different countries that the responses came from and how they were
aggregated into groups of countries.
Figure 16: Map of response countries, created with mapchart.net Ó
Figure 17: Countries and country groups
The country of origin of a response is not necessarily the same as the respondent works in, especially
considering that the survey was distributed in August; however, in many cases it will be it. I have reason
to believe that some responses arrived from China, too, despite Qualtrics being blocked in China (WEF,
2016) – which forced motivated respondents to use a VPN.
With later analyses in view, I grouped together European countries, on the one hand, and the USA with
Canada, on the other hand, based on an expected modicum of cultural and economic homogeneity.
The remaining 19 countries, from which 15% of responses were recorded, are unfortunately very
heterogeneous.

26
Q01: Insurance product types offered
Question Q01 simply aimed at mapping which types of insurance products are offered more frequently
and in which combinations among insurance companies using data science.
The following table indicates all frequencies with their 95% confidence interval. Categories are in order
by decreasing frequency.
Figure 18: Q01 frequencies
Due to the great number, heterogeneity and complexity of existing insurance products, the proposed
taxonomy of types is not totally rigorous; however, it seems to be comprehensive enough, because no
obvious cluster appears in the respondents’ 44 free-text submissions.
It is important to stress that this data is not from a sample of insurers, but from a sample of data
scientists working in insurance. Therefore, these frequency of product types are implicitly weighted by
the tendency of each product type to generate the employment of data scientists. So, for example,
finding commercial insurance at the top of the list is probably due to this product generating a
disproportionate high amount of work for insurance data scientists, as it emerged in preliminary
interviews with subject matter experts (e.g. for classifying business types and estimating risks,
classifying complex or even bespoke contracts by conditions, etc.).

27
Q02: Characteristics of the organization of data science activities
Question Q02 aimed at discovering the main characteristics of how insurance companies organize data
science activities.
Each sub-question of question Q02 asked respondents to rate one organizational dimension on a
bipolar scale having 5 levels, which were coded as −0.5, −0.25, 0, +0.25, +0.5 from leftmost to
rightmost, with zero being neutrality. Therefore, for each sub-question it is possible to calculate the
mean and then run a one-sample t-test to examine the null hypothesis that the mean is zero against
the alternative hypothesis that that mean is different than zero. The following table shows all the means
and the respective t-tests, with 95% confidence intervals, ordered by decreasing statistical significance;
means that are non-zero at the 95% confidence level are underlined.
Figure 19: Q02 means
Four means are non-zero with statistical significance at the 95% confidence level, but three of them are
very close to zero anyway. Therefore, these results can be summarized as follows:
1. data scientists working in an insurance company are more likely to have been hired from outside,
rather than to be previous employees who have been retrained as data scientists;
2. on all 5 other dimensions, the average insurance company lies about in the middle of the scale.
A look at the answer distributions in Appendix: Survey Report from Qualtrics further shows that the
dimension of centralization VS. decentralization of data scientists is peculiar compared to all others,
because insurance companies are polarized towards the two extremes in this dimension; which is
unsurprising, as striking a compromise in the middle seems hard and counterintuitive. For convenience’
sake, the distribution of centralization vs. decentralization is also reported below here.
Figure 20: distribution of answers (counts) on centralization VS. decentralization of data science activities

28
Q03: Externalizations of data science capabilities
Question Q03 investigated whether the respondents’ employers have already implemented or plan to
implement various types of externalizations of data science capabilities in the next 12 months.
The following table indicates all frequencies with their 95% confidence intervals. Categories are in order
by decreasing frequency of the case “implemented already”.
Here, the notable facts are that collaborating with consultants on data science capabilities is very
widespread among respondents’ employers and that externalizations of data science capabilities are
rather widespread overall, with even the least popular type, i.e. collaboration with other insurers, being
reported by more than 30% of respondents.
It must be noted that all cases reported here are not full, but partial externalizations of data science
capabilities, if full externalization of data science capabilities is defined as an insurer getting its data
science capabilities from external providers without employing even one single data scientist liaising
with those providers. If such cases of full externalization exist, they will not emerge in a survey targeted
to data scientists employed by insurers, like this. However, based on the complexity of data science
and on the need to use a company’s internal data, it would be surprising if many cases of full
externalizations existed.

29
Q04: Success factors for data science projects
Q04 aimed at discovering the main success factors of data science projects in insurance.
Each sub-question asked respondents to rate the importance of a possible factor, including a free-text
“other” factor that respondents could input themselves, on a 101-level scale from 0 to 100 which was
then rescaled to a range 0, 1 . To avoid an excessive cognitive load on the respondents, the factor
ratings were not required to sum up to 100 or any other constant sum, thus accepting the risk that some
respondents might not really discriminate between the factors.
The following table shows the means of all factor ratings, in decreasing order, with their 95% confidence
intervals.
The average respondent rated all factors quite highly; this could be explained by all factors being quite
important, or by respondents being unwilling or unable to choose. What stands out the most is the
relatively low importance that respondents attributed to agile development and prototyping; this is also
surprising, considering industry trends for software development and many other activities. The four
success factors that are basically tied at the top are support from management, good communication,
and then two factors deeply endogenous to data scientists, i.e. their technical skills and their
understanding of insurance business issues.
It is also noteworthy that 12% of respondents deemed it necessary to manually input additional factors,
with a plurality of them relating to the quality and sophistication of the IT infrastructure; therefore, a
good IT infrastructure may be an important success factor that was not captured by the proposed
options.

30
Q05: Business functions using data science
Question Q05 investigated in which business functions the respondents’ employers have already
implemented or plan to implement data science use cases in the next 12 months.
The following table indicates all frequencies with their 95% confidence intervals. Categories are in
order by decreasing frequency of the case “implemented already”.
What is most notable here is that no business function is being left behind by the penetration of
data science; and that the penetration, already around 50% across the board, is to grow strongly
in the next 12 months, according to insurance companies’ plans.

31
Q06: Data types used by data science
Question Q06 investigated with which data types the respondents’ employers have already
implemented or plan to implement any data science use cases in the next 12 months.
The variation across data types is very high. At one extreme, large majorities of insurance companies
already use internal data on clients, claims and contact center data, and demographic data from
governmental sources. At the other extreme, only rather small minorities already use building
information data, personal fitness tracking data and mobility data other than car telematics. The data
types most likely to start being used in the next 12 months seem to be data from social media, car
telematics, news and trends and personal fitness tracking, though the many overlapping confidence
intervals do not allow a precise ranking.

32
Q07: Combinations of business functions and data types
Q07 was the most complex question in the survey, both to design and to respond. It asked respondents
to list which data types their companies use in which business functions for implemented use cases.
This comes as close as possible to asking to describe the use cases themselves, while still having
regard for the fact that they are trade secrets of the insurance companies.
The tables below indicate the frequencies of all combinations and the semi-amplitudes of their
confidence intervals respectively. Both dimensions of the matrix are ordered by decreasing average
frequency.
The only interesting fact that I could notice from this table is that the business functions consuming the
most heterogeneous sets of data types, on average, are product development and risk management
1
.
Arguably, these are crucial functions for an insurer.
However, the main value of this information potentially is that it might give insurance companies and
insurance consultants some clue on which data science use cases they may be missing.
1
both at underwriting time and afterwards

33
Q08: Data formats used by data science
Question Q06 investigated with which data formats the respondents’ employers have already
implemented or plan to implement any data science use cases in the next 12 months.
The following table indicates all frequencies with their 95% confidence intervals. Categories are in order
by decreasing frequency of the case “implemented already”.
The variation across data formats is very high. At one extreme, almost all the respondents’ employers
already use numeric and structured data, with text not far behind. At the other extreme, only very small
minorities already use video and speech, while a few more use images. The data formats most likely to
start being used in the next 12 months are clearly identifiable as speech, images and text, with only few
companies planning to start using video so soon.
This snapshot is compatible with the abilities to process text, images, speech and video having reached
different stages in their technology adoption curves, in order from the ones most advanced towards
market saturation to the ones still in the early adopters’ stage.

34
Q09: Types of data science methods used
Question Q09 investigated with which types of data science methods the respondents’ employers
have already implemented or plan to implement any data science use cases in the next 12 months.
The variation across method types is rather high. At one extreme, large majorities of the
respondents’ employers already use classification, clustering and generalized linear models.
However, even the least popular method type, i.e. graph-based models, are already used in at
least about 30% of cases.
The method types most likely to start being used in the next 12 months are, more or less, the same
as the method types least used now; in other words, laggards seem to be planning to catch up
across the board. There are probably no hard barriers to doing that, since these methods are
mostly based on public research literature.
Due to the great number, heterogeneity and complexity of data science methods, the proposed
taxonomy of types is not rigorous, but it seems to be comprehensive, as respondents’ free-text
submissions do not include any other significant type.

35
Q10: Career background and aspirations of insurance data scientists
Question Q10 asked insurance data scientists for very high-level information about their past career
background and their future career aspirations.
The following table indicates all frequencies with their 95% confidence intervals.
A plurality of about 45% of respondents, with statistical significance at the 95% confidence level,
have previous professional experience in data science for industries other than insurance. The fact
that the insurance industry has hired a significant number of data scientists from other industries
appears to confirm the assertion in Boobier (2016) that the insurance industry has been lagging
other industries in exploiting data science.
Also interesting is that the two favorite career directions (with no evidence of a statistically
significant difference between them) are data science in insurance or data science outside of
insurance: based on this, data scientists currently working in insurance do not appear committed
to the industry, although they clearly do not dislike it.

36
Q11: Educational level of insurance data scientists
Question Q11 inquired about the level of the respondents’ most advanced educational achievement.
The following table and figure show all frequencies with their 95% confidence intervals.
Figure 30: Educational level of insurance data scientists
Among educational levels, Master’s degree is the top-ranked with a margin that is statistically significant
at the 95% confidence level, followed by PhD and Bachelor’s degree at about the same level.
Interestingly, all other categories are very rare, including online degrees despite their growing
popularity, though this survey result might have been affected by question wording explicitly mentioning
a “degree”.

37
Q12: Educational field of insurance data scientists
Question Q12 inquired about the field of the respondents’ most advanced educational achievement.
The following table and figure show all frequencies with their 95% confidence intervals.
Figure 32: Educational field of insurance data scientists
In the educational fields, there is a lot of dispersion and, based on these statistics alone, it is not possible
to be confident at the 95% level in the above ranking. What is certain is that a frequent field of study is
statistics or actuarial science.
Though variety was expected, the survey question clearly did not offer enough categories to choose
from, so 29 respondents (i.e. 16%) submitted free-text answers under the category “other”. The above
statistics are based on new categories that were created to classify all these free-text answers, too.
The fact that most current data scientists have not specifically studied data science, coupled with the
explosive growth in data science jobs, suggests that there may still exist strong demand for good
educational curricula in data science.

38
Q13: Time-consuming job activities and duties
Q13 aimed at discovering which job activities and duties take up most of an insurance data scientist’s
working time.
Each sub-question asked respondents to rate the time consumption of an activity, including an “other”
activity that respondents could input as free text, on a 101-level scale from 0 to 100 which was then
rescaled to a range 0, 1 . To avoid an excessive cognitive load on the respondents, the activity ratings
were not required to sum up to 100 or any other constant sum, thus accepting the risk that some
respondents might not really discriminate between the activities.
The following table shows the means of all activity ratings, in decreasing order, with their 95%
confidence intervals.
In line with expectations based on the interviews with subject matter experts’, a data scientist’s most
time-consuming activity is collecting and pre-processing data. After that activity, the next most time-
consuming activities are, at similar levels:
- building models and analyzing data;
- communications of results and reporting to management;
- analyzing business processes and finding use cases.
While tasks for communication and coordination within the organization were expected to absorb
significant time, it is interesting that insurance data scientists do indeed spend a significant amount of
their time building models and analyzing data, i.e. practicing the defining core of their profession.

39
Q14: Opinions on the future of data science in insurance
Question Q14 aimed at discovering what insurance data scientist think about the future of their
professional field and industry.
Each sub-question of question Q14 asked respondents to rate their agreement with a statement on a
bipolar scale having 5 levels, which were coded as −0.5, −0.25, 0, +0.25, +0.5 from full disagreement
to full agreement, with zero being neutrality. Therefore, for each sub-question it is possible to calculate
the mean and then run a one-sample t-test to examine whether it is different than zero. The following
table shows all the means and the respective t-tests, with 95% confidence intervals, ordered by
decreasing statistical significance; means that are non-zero at the 95% confidence level are underlined.
Abstracting from individual positions, which of course include many contrarians, insurance data
scientists’ average outlook could be summarized as follows:
1) data scientists in insurance are considerably optimistic about the overall future of their professional
field, because:
a) they strongly agree that data science should and will become more important in the industry,
and that it should and will be used to the ultimate benefit of insurance clients; and
b) they feel less strongly about the need to face possible social and ethical problems connected
to the rise of data science in insurance, as can be seen by the lower agreement on these topics;
2) despite their considerable optimism, data scientists are only lukewarm in recommending that
insurers should consider radical changes rather than gradual steps;
3) data scientists favor the introduction of more flexible working practices by insurers;
4) data scientists have no opinion as to whether they should be certified, just like actuaries are (one
respondent submitted an interesting argument against it, as will be quoted next).

40
Qualitative analysis of freely submitted opinions
To complete question Q14 and the whole survey, respondents were offered the opportunity to submit
their own free-text opinion or forecast about the future of data science in insurance. Out of the 181
respondents who carried through to the very end of the survey, 30 (i.e. 17%) took the time to write their
own free-text submission. After analyzing these submissions carefully, as it behooved me to do, here I
will briefly showcase and comment on a select few.
A few respondents wrote interesting reflections on the possible consequences for the insurance industry
from the advent of data science. They range widely from consolidation among insurers or the rise of
startups, through the birth of radically new business models, up to changing roles for actors like large
insurance buyers or even non-insurance companies, like car manufacturers, gaining some role in the
provision of insurance. These thoughts find echoes in Boobier (2016).
Here are a few of those submissions, reported verbatim:
- “Massive consolidation is coming”
- “Data science and statistical modeling in the credit card business resulted in a major shake-up and
consolidation of that industry. Credit card pricing, rewards, and availability are now tailored to
individual risk and attributes, with rapid decision making on line. As data science expands in
insurance, pricing will similarly become faster, more finely tuned, and more competitive, resulting
in consolidation and the demise of those slow to adopt more advanced data technologies.”
- “I could envisage some of the data driver insurance start ups to start taking a share of business,
but the most successful will be those which work closely together with an established name so that
data and money is less of an issue.”
- “In the absence of robust government regulation I think you will see the further segmenting of
populations for coverage purposes and increasingly tailored products for different risk profiles. I
am unsure whether this push will come from the large established players in the market or smaller
entrants .”
- “In commercial lines insurance, I anticipate using big data collected by individual large insureds
(rather than by the insurer) to help them with their specific risk management concerns and more
accurately price their policies.”
- “Cars gives a open IPA. You jump into your, your cell phone connects to the car and you decide
what company you what to use for the next drive, week month etc ... But car manufacturer must
open a car interface”
Another interesting angle emerging from free submissions is the relationship between insurance data
scientists and actuaries. This topic was not totally absent from the survey, as the preliminary interviews
with subject matter experts had already cast light on it. The overall impression is that insurance data
scientists are clearly a different profession than actuaries, with different educational needs and different
roles within insurers; that actuarial science is static and well-defined, while data science is fluid and
evolving fast; and that, over time, data scientists and similar roles may gain in importance for insurers,
relative to actuaries.
Below are two related submissions, also reported verbatim:
- "Data science changes too rapidly for certifications like those of the SOA or CAS to keep up (opinion
of a former Actuary). A data scientist's time would be better spent keeping up with these changes.”
- “The availability of softwares that remove the need for coding and time-consuming manual
computation is rapidly increasing. I think this is going to be an undeniable threat to statisticians who
are too submerged in traditional methods. On the other hand, this could open wider door of
opportunities to professionals specializing in business intelligence who are highly intuitive and
creative.”

41
Definition of auxiliary measures
The available dataset is based on a relatively small sample, but very high in dimensionality and
complexity. To manage this complexity, I chose to perform only analyses targeted at answering
questions inspired by subject matter experts’ information. To reduce the dimensionality, I defined a few
simple auxiliary variables summarizing some key characteristics both of insurers and of insurance data
scientists. Precisely, for an insurer I considered the following synthetic characteristics:
- product breadth, measured as the number of product types offered;
- breadth of data science externalizations, measured as the number of types of externalizations of data
science capabilities already enacted;
- breadth of data science capabilities measured in four different dimensions, i.e. the numbers of
business functions, of data types, of data formats and of method types involved.
Since simpler data science capabilities can be assumed to be more widespread than more
sophisticated ones, the breadth of data science capabilities is a proxy for data science sophistication;
and since implementing each type of externalization of data science capabilities requires effort and
commitment, the breadth of data science externalizations can be considered as a proxy of an insurance
company’s propensity to invest into searching for new data science capabilities outside the firm.
For an insurance data scientist, I defined these measures:
- the level of educational degree, considering only Bachelor, Master and PhD, for which there is a
clear ranking from lowest to highest (thus excluding the 6% of cases having another type of
degree);
- the propensity to changes in career direction, defined as either moving between data science
and another function, or moving between insurance and another industry. I considered actual
past changes from other industries or roles into insurance data science and stated preferences
for potential future changes from insurance data science to other fields or roles, yielding a total
of four dimensions.
The following table lists all 11 measures defined.
Table 35: Summary of additional synthetic measures
Measured
object
Measured
characteristic
Measure or measures Measure range
an insurer’s product breadth Nr of product types offered normalized to
the range
[0, 1]
through
min-max
normalization
breadth of
data science
externalizations
Nr of types of externalizations of data science
capabilities already implemented
breadth of data
science capabilities
(multidimensional)
Nr of business functions already using data science
Nr of data types already used by data science
Nr of data formats already used by data science
Nr of already used data science method types
an insurance
data
scientist’s
level of “standard”
educational degree
Bachelor = 0, Master = 0.5, PhD = 1;
undefined otherwise (i.e. eliminating 6% of cases)
[0, 1]
by construction
propensity to
changes in career
direction
(multidimensional)
Existence of some past career background
outside data science, regardless of industry
boolean:
0 = false,
1 = trueExistence of some past career background
outside insurance, regardless of function
Existence of some future career aspiration
outside data science, regardless of industry
Existence of some future career aspiration
outside insurance, regardless of function

42
Differences between country groups
Based on the subject matter experts’ information, there were reasons to expect differences between
country groups; therefore, such differences were searched between Europe, on the one hand, and USA
and Canada on the other hand. The group “Other countries” was omitted from this analysis because it
is senseless to look for its common characteristics, due to its small sample size and to its extreme
heterogeneity.
The variables that might depend on country group, due to economic or cultural differences, were
hypothesized to be the following 17:
- for insurance companies, all 6 characteristics of the organization of data science that were asked
about in Q02, plus all 6 measures of product breadth, breadth of data science externalizations and
breadth of data science capabilities;
- for insurance data scientists, their propensities to changes in career direction, in all 4 dimensions,
and finally their educational level.
Since these variables are all ordinal and normalized to be defined over a unitary range, it is possible to
use a battery of t-tests to look for mean differences between the two groups that are both statistically
significant and of appreciable effect size. The following Table 36 visualizes all these t-tests, ordered by
decreasing statistical significance, with mean differences statistically significant at the 95% level
underlined. Differences are calculated as USA and Canada minus Europe and t-tests are shown as
executed either assuming or not assuming equal variances in the two groups, depending on the result
of a previous Levene’s test.
There are 7 characteristics for which a statistically significant difference is found; however, after the
closeness to zero of the lower or upper bound of the confidence interval is also considered, only two
differences are worth reporting, namely:
- insurance data scientists’ educational level tends to be somewhat higher in Europe;
- insurers’ product breadth tends to be slightly higher in Europe.
However, the main overall conclusion from this analysis is that data science at European insurers is not
so different than at US or Canadian insurers, at least under the dimensions examined here. Based on
interviews with subject matter experts, a significant difference would be found between Europe and the
USA in regulations and consumers’ attitudes on data protection and privacy, with both regulation and
consumers’ attitude being significantly stricter in Europe. The existence of cultural differences between
USA and European countries in consumers’ attitude to privacy concerns is confirmed for example by
the research of Bellman, Johnson, Kobrin, & Lohse (2004), while in the European Union a General Data
Protection Regulation (GDPR), aiming at strengthening data protection, is scheduled to take effect in
2018 (Council of the European Union, 2015).

43
Table 36: t-tests on differences between Europe vs. USA and Canada

44
Correlations between organizational characteristics
An interesting question is how the main characteristics of data science organization depend on
each other, if at all. I identified these characteristics as the 6 synthetic measures of breadth of
product range, of data science externalizations and data science capabilities, plus the 6
organizational dimensions asked about by question Q02.
A simple method to analyze the dependency structure between variables is calculating their Pearson’s
correlations, but since none of these variables is a proper interval scale, it is questionable whether this
is the most appropriate tool here.
Parametric statistics such as Pearson’s correlation are known to be widely robust to violations of their
assumptions: for example, Norman (2010) finds that “parametric statistics can be used with Likert data,
with small sample sizes, with unequal variances, and with non-normal distributions, with no fear of
“coming to the wrong conclusion”. These findings are consistent with empirical literature dating back
nearly 80 years.”
However, not wanting to take any chances to misinterpret the data, for each couple of variables I
computed both Pearson’s correlation and the non-parametric Spearman’s correlation, finding
remarkably similar values across the board. The agreement between the two statistics confirms that
Pearson’s correlation is sufficiently robust here and further tells us that all dependencies are linear,
since Pearson’s correlation is a measure of linear dependency, while Spearman’s captures monotonic
dependencies of any shape. Pearson’s correlations are reported here with all values color-coded by
positive or negative intensity (green or red respectively) and values statistically significant at the 95%
underlined.
Figure 36: Pearson’s correlations between main organizational characteristics

45
Many correlations are found to be statistically significant at the 95%; in fact, several have p-values
<0.01 or even <0.001, as shown in the following figure.
Figure 37: 2-tailed significances of Pearson's correlations between main organizational characteristics
However, when effect size is also considered, most of these correlations are too small to be interesting,
and there are only three main facts worth reporting:
1. all 4 measures of breadth of data science capabilities, in terms of business functions, data types,
data formats and method types used, have medium or large positive correlations with each other;
2. breadth of data science externalizations also has medium or large positive correlations with all
measures of breadth of data science capabilities;
3. no other medium or large correlations exist.
While point 1. is hardly surprising, point 2. is interesting, as it appears that insurers with the largest data
science capabilities tend to be those that invest more into acquiring data science capabilities from
outside the firm, be it from consultants, or from academia, or from other insurers, or by using data
science APIs. While correlation does not imply causation in either direction, research on innovation by
Love & Mansury (2010) found that “external linkages, particularly with customers, suppliers and
strategic alliances, significantly enhance innovation performance” (p. 1), a conclusion also strongly
supported by IBM (2006); therefore, it is plausible that the breadth of data science externalizations may
be instrumental to achieving a higher breadth of data science capabilities.
Point 3. is also partly interesting, as it suggests that there may be different ways to organize data
science activities with similar outcomes; particularly surprising is the relative lack of influence of an
insurer’s product range on its data science capabilities.

46
Clustering analysis
It is interesting to look for the existence of natural and interpretable clusters in the dataset, at least in
some sets of features. After making numerous attempts with different algorithms, settings and sets of
features, the best clustering I could find was the one illustrated below, obtained using the Euclidean
distance over only three dimensions and choosing the number of clusters with the Bayesian information
criterion, which results in 2 clusters.
The clustering is of relatively good quality, with a silhouette measure of 0.5004. The intuitive
interpretation is that cluster 2, which is four times smaller than cluster 1, is the group of data scientists
employed by the insurers having the broadest data science capabilities – as they use the most data
types in the most business functions – and the broadest externalizations of data science capabilities,
too.
Figure 38: Clustering predictor importance Figure 39: Relative distribution of features

47
Discussion of survey limitations
The validity of survey results is affected by several limitations. Here, they are listed roughly in order by
decreasing estimated severity and weighed against their mitigating factors:
1. Self-selection of participants implies that, ceteris paribus, people with certain socio-economic-
demographic characteristics (younger, lower-ranking on the job, with more free time) or with certain
psychological characteristics (more helpful, curious or sociable, etc.) were more likely to participate.
Mitigating factors are having adopted a carefully studied campaign to persuade people to
participate, which did result in a rather satisfactory number of participants in the end; and the fact
that possibly having a sample that skews towards being more helpful, being younger and having
more time to answer the survey would not be all bad, in terms of information gain.
2. Anonymous distribution of the survey link implied that the responsibility for judging people’s
belonging to the target was outsourced to the respondents themselves, and the respondents could
have made either type I or type II mistakes. Mitigating factors are that the survey target was clearly
defined in the survey introduction as “data scientists employed by insurers” and that questions were
complex and specific, such that they would have encouraged off-target respondents to drop off the
survey.
3. Since most respondents were probably invited through LinkedIn after being found through keyword
searches there, using a wrong set of keywords might have spoiled the results by missing large
numbers of target respondents or by inviting large numbers of off-target people. A mitigating factor
against possibly missed respondents is that the keywords were chosen with great effort to cover
most job titles, while a mitigating factor against possibly inviting off-target people is that questions
were complex and specific, such that they would have encouraged off-target respondents to drop
off the survey.
4. Since the survey contained questions at the scope of a data scientist’s employer, but every
insurance company employing more than one data scientist could potentially have been sampled
more than once, answers at the company scope are not statistically independent from each other,
undermining the assumptions of many common statistical methods and making it impossible to
interpret company-scope answers as representative of the average insurer. Mitigating factors are
that, in fact, it makes economic sense to study the characteristics of the average insurer weighted
by size; and that commonly used statistics are generally very robust to violations to their
assumptions (Norma, 2010).
5. Since survey questions were tailored for traditional insurers (as opposed to insurtech companies)
having acquired data science capabilities, several other types of organizations participating in the
insurance industry were explicitly or implicitly excluded from the research, as their employees would
not have found appropriate questions for them; for example, questions were not suitable for
traditional insurers possibly still lacking data science capabilities, for large insurance buyers, or for
providers servicing insurers, and probably also not very suitable to insurtech startups. Therefore,
information is missing on a large part of the insurance industry. Mitigating factors are that traditional
insurers are still a majority of the industry, and that focusing only on this group with a specific survey
allows to understand it very accurately.
6. Since the survey was distributed from 10-Aug-17 to 08-Sep-17, many target respondents may have
been missed due to their being on holiday; mitigating factors are that few people take holidays for
four full weeks, and that such randomly missed respondents, while being a net loss of information,
are very unlikely to bias the results.
7. Using LinkedIn as almost exclusive distribution channel might have spoiled the survey by excluding
people outside of LinkedIn. A mitigating factor is that LinkedIn can be assumed to have a very high
overall worldwide penetration rate among data scientists.
8. Since the survey was distributed in a totally anonymous way, it would have been possible for
careless or malicious respondents to respond multiple time. Mitigating factors are that there are no
reasons to assume carelessness of malice from respondents, and that the survey was in fact
protected against attempts at “ballot-stuffing” by humans or bots through the hurdle of a
NoCAPTCHA reCAPTCHA placed before the first question.
In conclusion, for every limitation there exist rather sound counterarguments based on mitigating
factors; therefore, the survey can be considered reasonably reliable in giving information about its
designated target population of data scientists employed by traditional (=non-insurtech) insurers.

48
Anonymization of survey data
As part of the promotion campaign for the survey, I committed to publish all results and all anonymized
raw data, too. The qualification ‘anonymized’ is necessary because there exists tension between
distributing all results in full transparency and protecting the respondents’ privacy through anonymity.
In fact, if the full raw data were published without any anonymization step, then a determined malicious
villain could conceivably piece together information about respondents across questions and combine
it with publicly available social network information till many respondents can be identified.
Though the existence of such a villain is unlikely, I consider it important to keep my promises and also
to protect those few respondents who might potentially have trouble with their employers, were they
identified.
For these reasons, I have enacted these privacy-protecting measures on the survey dataset:
- unbundling all free-text answers from the main dataset, to publish them reordered alphabetically in a
separate file;
- unbundling also the country of origin of each response, i.e. a highly identifying data point which was
collected intransparently from respondents’ browsers purely as a control variable, and leaving in the
main dataset only a CountryGroup variable identifying the three large groups “Europe”, “USA and
Canada” and “Other countries”;
- grouping together all categories containing less than 10 respondents each into an “other” category,
when dealing with the sensitive data that might lead to an easy identification of individuals, i.e. education
level & degree.
With these measures, I believe to have reasonably conciliated the need to protect privacy with the
desire to not destroy information.

49
Analysis of Job Advertisements
Choice and collection of document corpus
Job advertisements looking for data scientists in the insurance sector were chosen as the type of
documents to be collected because their content can be plausibly expected to be related to the survey
topics, thus potentially offering an opportunity for cross-checking or complementing the findings of the
survey.
Since the survey aimed at taking the current snapshot of the data science landscape in insurance, only
current or recently expired job ads were collected.
Collection was limited to websites not requiring any login in order to access the job ads, so as to remain
within the law. Where a login is needed, the act of logging in implies the acceptance of the Terms of
Service, which usually strictly prohibit scraping data, as in the ToS of LinkedIn (2017).
This constraint put LinkedIn’s treasure trove of 1’000+ job ads off limits and restricted the set of potential
source websites to Kaggle, the popular website for data science contests, and the websites of major
insurance groups; websites of smaller companies active in the insurance sector were not searched for
reasons of time management, as they could not be expected to yield a number of job ads sufficient to
reward the effort involved in scraping them. Based on the previously generated list of the insurers
employing the highest number of data scientists, the websites of all top employers were searched.
I scraped the job ads manually from those websites containing only very few of them; for the other
websites, I used the Chrome extension plug-in “Data Scraper - Easy Web Scraping 3.278.0” for scraping
together with its companion plug-in “Recipe Creator 3.277.5” for generating the needed data schemas
of web pages. This operation required me to buy Data Scraper premium with the entry-level subscription
plan “Solo” for one month, at a cost of 20 USD.
This effort yielded a total of 215 job ads. Of these, 149 originate from large insurers, while the remaining
66 originating from Kaggle belong to a more varied group of 41 companies active in insurance, including
9 providers to insurers, 5 non-insurance companies offering insurance products as side businesses and
3 insurtech startups (based on a rough manual classification). Therefore, the distribution of employers
in the corpus is only imperfectly similar to the distribution of insurance data scientists’ employers,
probably having a bias towards bigger insurers, among other things.
The following table summarizes the sources of the job ad corpus.

Data science landscape in the insurance industry

Data science landscape in the insurance industry

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Data science landscape in the insurance industry

Similar to Data science landscape in the insurance industry (20)

Recently uploaded

Recently uploaded (20)

Data science landscape in the insurance industry