This document analyzes the skills required to become a data scientist by examining close to 8,000 job listings from Dice.com. It finds that the most commonly required skills are Python, SQL, R, Java, Hadoop, SAS, Spark, C/C++, Scala, NoSQL, Tableau, MATLAB, Hive, Excel, Cassandra, MapReduce, TensorFlow, and Pig. Python is popular due to its libraries for machine learning and data analytics. SQL is essential for querying databases. R and Java are useful for statistical analysis and integrating models. Hadoop, Spark, and Hive allow distributed processing of large datasets.
3. Data science - a
multidisciplinary career
drawing on statistics,
computer science, and
social science.
4. This job also happens to be the
fastest growing job in the United
States, according to LinkedIn.
5. It also commands a lucrative median
salary of $113,000 among other
fast-growing career paths.
6. However, there is a
shortage of workers.
As per a report
by McKinsey, we
might soon see a
shortage of up to
250,000 data
scientists.
7. Hence, it would be very interesting
to look at the type of skills that
someone needs to master in order to
become a data scientist.
8. Since JobsPikr extracts job data from some
of the popular job boards, we selected the
job listings posted in March 2018 on
Dice.com (a leading U.S.-based job portal).
9. The next step involved filtering the job
ads with the job title “Data Scientist”. This
gave us a data set of close to 8,000 job
listings for data scientists in the US.
10. In order to analyze the
skills required for this
role, we extracted the
terms present in the
“job requirement”
section of each job ad.
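As a rough illustration, the term extraction described above can be sketched in a few lines of Python; the listing texts and skill list below are made-up stand-ins for the real Dice.com data:

```python
from collections import Counter

# Hypothetical "job requirement" texts; the real analysis used the
# requirement sections of roughly 8,000 Dice.com listings.
requirements = [
    "Strong Python and SQL skills; experience with Spark a plus",
    "Expertise in R, Python and statistical modeling",
    "Hands-on experience with Hadoop, Hive and SQL",
]

# Skill terms to count (a small subset of those in the document).
skills = ["python", "sql", "r", "spark", "hadoop", "hive"]

counts = Counter()
for text in requirements:
    # Normalize to lowercase tokens so "Python" and "python," match.
    tokens = set(text.lower().replace(";", " ").replace(",", " ").split())
    for skill in skills:
        if skill in tokens:
            counts[skill] += 1

print(counts.most_common())
```

Counting each skill once per listing (rather than once per mention) keeps the numbers comparable across job ads of different lengths.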
13. Python
Python has amassed a lot of interest
recently as a choice of language for
data scientists because of the following
factors:
• Open Source
• Rich community
• Lower learning curve
• Powerful libraries for data analytics
• Easier integration with databases
14. For example, scikit-learn is used for machine learning
algorithms, PyBrain for building neural networks, matplotlib for
plotting, and IPython notebooks to present the analyses.
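To make the library mention concrete, here is a minimal scikit-learn sketch on a tiny invented data set (the feature, labels, and model choice are illustrative assumptions, not from the document):

```python
# Minimal scikit-learn workflow: fit a classifier on toy data
# and predict on new points.
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. pass/fail (made-up values).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Predict for two new students: one low, one high.
predictions = model.predict([[1.5], [5.5]])
print(predictions)
```

The same fit/predict pattern carries over to most scikit-learn estimators, which is part of the lower learning curve the slide mentions.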
15. SQL
Structured Query Language
(SQL) is essential for data
scientists as it is the standard
language to communicate with
relational database
management systems
(RDBMS).
16. As a data scientist, one has to write both simple and
complex queries to select data from tables, in addition to
understanding different data formats for data
management and filtering.
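The simple-versus-complex query distinction can be sketched with Python's built-in sqlite3 module; the table and values here are hypothetical:

```python
# Simple vs. complex SQL queries against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
)

# Simple query: select rows matching a condition.
simple = cur.execute(
    "SELECT amount FROM sales WHERE region = 'east'"
).fetchall()

# Complex query: aggregate, filter the groups, and order the result.
complex_result = cur.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region HAVING total > 100 ORDER BY total DESC"
).fetchall()

print(simple)          # [(100.0,), (250.0,)]
print(complex_result)  # [('east', 350.0)]
conn.close()
```

The same SQL travels largely unchanged to the production RDBMSs the slide refers to, which is why it is such a portable skill.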
17. R
R is a powerful language
developed in the early ’90s;
currently it is widely used for
data science, analysis, and
statistical computing.
18. Its popularity can
be largely
attributed to the
following:
• Wide range of libraries
• Strong online community
• Open source
• Lower learning curve
19. Java
Since Java is an old
programming language, many
enterprises already have
systems developed in this
language. This makes models
written in Java easier to
integrate.
20. Apart from that, leading Big Data frameworks/tools like
Hadoop and Hive are written in Java, and Spark runs on the
JVM. It is also a great choice for scalability and speed.
21. Hadoop
As a framework Hadoop has
gained massive popularity and
has become the de facto open
source software for reliable,
scalable, distributed computing
involving big data analytics.
22. SAS
This tool is a leader in the
commercial analytics space. It
has a huge set of in-built
statistical functions, good UI
(Enterprise Guide & Miner) for
any user to quickly learn and
delivers superior technical
support. However, it is
expensive and its certification
programs can also cost a lot.
23. Spark
Apache Spark is open source and it has
the ability to keep data resident in
memory, which can lead to faster
iterative machine learning workloads.
24. In addition to this, what makes its adoption stronger in the
data science community is its Scala base and its built-in
machine-learning library, MLlib.
25. C/C++
Similar to Java, C/C++ is also
used to write models, and it is
critical for writing algorithmic
extensions for R and Python.
26. Scala
Any data scientist looking to
work on large data sets in a
JVM-centric stack will be using
Scala. Many high-performance
data science frameworks are
written in Scala owing to its
strong concurrency support.
27. NoSQL
Unlike SQL, NoSQL offers an
architectural approach with
fewer constraints. In general, it
is easier to scale out NoSQL
data stores, but more
complicated to query them for
complex results.
28. For data scientists, NoSQL can be somewhat tricky:
although the technology makes it easy to rapidly
accumulate massive data sets and rapidly scale data
stores to meet demand, it requires denormalization
of data.
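The denormalization trade-off can be sketched with plain Python dicts standing in for rows and documents (all data here is invented):

```python
# Normalized (relational style): users and orders live in separate
# tables linked by user_id; flexible to query, but reads need a join.
users = {1: {"name": "Ada"}}
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"}]

def orders_for(user_id):
    # The "join": scan the orders table for matching rows.
    return [o["item"] for o in orders if o["user_id"] == user_id]

# Denormalized (document style): orders embedded inside the user
# document, so a single read returns everything, at the cost of
# duplicating structure and complicating updates.
user_doc = {"name": "Ada", "orders": ["book", "pen"]}

print(orders_for(1))       # ['book', 'pen']
print(user_doc["orders"])  # ['book', 'pen']
```

Both shapes answer the same question; the document form trades query flexibility for fast single-key reads, which is the trade-off the slide describes.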
29. Tableau
VizQL (Visual Query Language)
is Tableau’s visual query
language, which queries
relational databases, cubes,
cloud databases, and
spreadsheets, and then
generates a wide range of
graphs and charts.
30. MATLAB
Although MATLAB is not as
popular as R or Python in the
data science space, it still has a
lot of traction in academia.
Also, it is a commercial
application with a high cost but
good customer support.
31. Hive
This is a popular data warehouse
software in the Hadoop
ecosystem that helps data
scientists in data transformation
and analysis.
32. It provides an SQL-like interface to query data stored
in various databases and file systems that integrate
with Hadoop.
33. Excel
Microsoft Excel can be
considered a bridge
application for very quick
filtering and data analysis using
built-in statistical methods.
However, it becomes more
powerful when combined with
Visual Basic. Check out the
examples for building your own
Excel-based neural network and
Monte Carlo simulations.
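The slide points to Excel/VBA examples; as a language-neutral illustration of the same Monte Carlo idea, here is a short Python sketch that estimates pi by random sampling (the sample size and seed are arbitrary choices):

```python
# Monte Carlo estimate of pi: sample random points in the unit square
# and count the fraction landing inside the quarter circle.
import random

random.seed(42)  # fixed seed so the run is reproducible
n = 100_000
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
pi_estimate = 4 * inside / n
print(pi_estimate)
```

The same pattern, repeated random draws feeding a simple aggregate, underlies the spreadsheet-based simulations the slide mentions.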
34. Cassandra
Apache Cassandra is an open source
distributed NoSQL database
management system designed to
handle large amounts of data across
many commodity servers.
35. As this database was developed at Facebook, where
millions of reads and writes happen every second, its
performance is far superior.
36. MapReduce
It is a programming model that
allows for massive scalability
across hundreds or thousands
of servers in a Hadoop cluster.
37. Simply going by the name, MapReduce
consists of two steps, mapping and
reducing the data:
• Mapping sorts and filters a data set
• Reducing performs a calculation on
the resulting information
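The two steps above can be sketched in plain Python with the classic word-count example (toy input, no cluster involved):

```python
# Word count in the MapReduce style: map, shuffle/sort, then reduce.
from itertools import groupby

documents = ["big data big cluster", "data pipeline"]

# Map: emit a (key, value) pair, here (word, 1), for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group the pairs by key, as the framework would
# between the map and reduce phases.
mapped.sort(key=lambda kv: kv[0])

# Reduce: run a calculation (a sum) over each group of values.
counts = {
    word: sum(v for _, v in group)
    for word, group in groupby(mapped, key=lambda kv: kv[0])
}
print(counts)
```

On a real Hadoop cluster the mapped pairs are partitioned across many machines, but the map/shuffle/reduce structure is exactly this.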
38. TensorFlow
This is the open source
framework developed by
Google Brain Team for machine
learning and deep neural
networks research.
39. Pig
It is a high-level scripting
language used for operating on
large data sets inside Hadoop.
It is primarily used to apply
schemas and transform data.
40. JobsPikr
Clean and up-to-date job feeds directly from company websites and job
boards
www.jobspikr.com | sales@promptcloud.com