What is a Data Scientist?
Data Science is a fairly new discipline that has garnered a wealth of attention in recent years, as workers from many different backgrounds descend upon the field to take advantage of the favorable labor market and carve out a reputation in this high-demand area. The motivation for writing this post lies in defining what we mean by the field of Data Science and by the Data Scientists who practice this discipline.
We work with a number of companies that are new to the idea of using data to enhance their decisions, and they often ask, "What exactly is a data scientist?" Even my parents have asked what exactly it is that I do, so this post is dedicated to answering this question in a way that many different people can understand.
The objective here is to offer these two terms and their definitions so that our clients, prospects and others understand what we mean when we use them. We want to offer clarity in the face of a lot of ambiguity. Let's face it, there are a lot of people who use these terms loosely, interchangeably with other terms, or inaccurately. Further, there are those who practice under the guise of the Data Scientist role who are aspirational at best.
Definitions
Why not cut right to the chase and define these terms:
* Data Science is "a professional discipline that uses scientific methods, processes, algorithms and computer systems to extract knowledge, understanding and insight from data to improve a business or other system."
* A Data Scientist is "a person employed to identify and understand problems, extract and manipulate raw data, conduct analysis and deliver insight to assist business stakeholders in their decision-making."
What is Data Science?
I believe that our definition of data science must have a few important characteristics, so I will break down each of the components here and address why they are important for a meaningful definition. Data Science is a professional discipline. People practice this discipline and make money in the process. I think this element is critical, as economics is an important factor to consider when thinking about data science efforts. Additionally, improving a business or other system is included in the tail of the definition. This implies that a benefit is the rationale behind undertaking such an effort. For the economics to work, this benefit should outweigh the costs of implementing the solution.
Another characteristic of data science efforts is the way that practitioners solve problems. We utilize an adaptation of the scientific method. Most of us learned about the scientific method years ago in a middle school science class somewhere. As a reminder, this approach consists of observation, experimentation and testing. It is iterative, so we repeat it, making adjustments and retesting while monitoring outcomes. Data science is also very much a process-driven affair. A process-driven approach means there are defined, generally accepted procedures for executing a data science project. Additionally, we want to generate a consistent outcome in terms of quality. This means that we follow similar procedures each time we solve a problem, but do this in a way that is flexible enough to solve many different types of problems.
Next, we use mathematical algorithms to generate outcomes that solve our clearly defined business problems. These algorithms optimize some outcome and deliver insight that enables people to make better decisions. Finally, computer systems are used to generate and execute programs that utilize data to deliver these insights. You can certainly work through these math problems by hand (I've done a lot of this, especially when learning a new method or algorithm), but it is much more efficient to have computer systems solve these optimization problems. Again, we've got the economics to consider here, and doing things by hand is time-consuming. And time equals money.
What is a Data Scientist?
Similarly, we will deconstruct the definition of a data scientist phrase by phrase, as each phrase is important for a full understanding of what a data scientist is and does. A person employed indicates again that this individual worker is compensated for their efforts. There is a cost element to this role, as there is with all roles. This role, however, is quite a bit more costly than other professional roles. High demand for these services and an undersupply in the labor market mean that compensation for data scientists is, on average, higher than for most other roles.
To make this point a little more clear, let's look at the median household income in January of 2018. That figure was $59,055. Now, the average salary for a "data scientist" per Glassdoor.com is $139,840 per annum at the time of this writing. This is roughly 137% more than the median household income, or significantly more than twice the income of a typical American household. That is a very large gap as far as salaries go. No wonder so many folks are taking online data science courses. Even if we consider a somewhat comparable role such as a business analyst, the difference is still quite significant. Glassdoor says that the average business analyst makes $77,712 per year. That means the average data scientist makes 80% more than the average business analyst (according to Glassdoor anyway). We'll take all of those figures with a grain of salt. Regardless, the differences are obvious.
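For readers who want to check the arithmetic, here is a minimal sketch that recomputes the percentage differences from the three figures quoted above. The helper name `percent_more` is just an illustrative choice, not anything from Glassdoor or another source.

```python
# Figures quoted in this post (US dollars per year).
median_household_income = 59_055
data_scientist_salary = 139_840
business_analyst_salary = 77_712


def percent_more(value, baseline):
    """Return how much larger `value` is than `baseline`, as a percentage."""
    return (value - baseline) / baseline * 100


vs_household = percent_more(data_scientist_salary, median_household_income)
vs_analyst = percent_more(data_scientist_salary, business_analyst_salary)

print(f"Data scientist vs. median household: {vs_household:.1f}% more")
print(f"Data scientist vs. business analyst: {vs_analyst:.1f}% more")
```

Note that this compares an individual salary with a household income, so it is a rough yardstick rather than a like-for-like comparison.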
Now that we know that data scientists are well-compensated, let's take a look at what they do. For starters, they identify and understand problems. They live in the world of business where they are expected to communicate with other important business folks (we'll call them stakeholders) and use data to solve their problems. This is how they add value. They serve others and help them to solve critical business problems using data.
Furthermore, they extract and manipulate raw data. This implies that they take data from some place and do something meaningful to it. This is actually the most fundamental element of this definition, in my opinion. Traditionally, database professionals would take requests from business people and write queries to extract this information from a database. These requests were put into a queue, and inevitably the person who needed to write the query was either at lunch or out for the week. This meant that there was significant lag between requesting the data and doing whatever the business person needed to do with the data. Now a data scientist comes along. This is someone who has studied advanced math and statistics and has the skillset of a business analyst as well. They can write the query themselves because they have SQL or other querying skills.
No waiting for the database person to get back from lunch...or back from their vacation to Florida.
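To make "writing the query yourself" a little more concrete, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. The `sales` table, its columns, the store names and the rows are all made up for illustration; a real project would connect to the company's own database and schema.

```python
import sqlite3

# An in-memory database stands in for a real company database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table of individual purchases.
conn.execute("CREATE TABLE sales (store TEXT, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Downtown", 1, 100.0),
        ("Downtown", 2, 140.0),
        ("Uptown", 3, 90.0),
    ],
)

# The kind of query a business person might otherwise wait in a queue for:
# customers and average spend per store.
rows = conn.execute(
    """
    SELECT store,
           COUNT(DISTINCT customer_id) AS customers,
           AVG(amount) AS avg_spend
    FROM sales
    GROUP BY store
    ORDER BY store
    """
).fetchall()
conn.close()

for store, customers, avg_spend in rows:
    print(f"{store}: {customers} customers, ${avg_spend:.2f} average spend")
```

The same extract-and-summarize step is usually the starting point for the analysis described next.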
They conduct analysis. In other words, they do all of this math and statistical analysis and develop and deliver insights to those business stakeholders and managers. The data scientist might tell these managers that they can expect to have approximately 230 customers at a specific store location this Thursday and that these customers will spend $120 on average. And you know what? Over 190 of these customers are likely to buy light beer, so they'd better adjust their decision-making and order more if they haven't already.
Otherwise, there is a decent amount of revenue that could be left on the table. And this adds up over the weeks and months. And this benefit also covers the cost of having a fractional data scientist work on this effort.
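As a back-of-the-envelope sketch of the example above, the numbers can be turned into an expected revenue figure and a light-beer share. The forecast values are the ones from this post; treating "over 190" as exactly 190 is a simplifying assumption, so the share shown is a lower bound.

```python
# Forecast figures from the example in this post.
expected_customers = 230
average_spend = 120            # dollars per customer
light_beer_buyers = 190        # "over 190", so treated here as a lower bound

# Expected revenue for the day at this store.
expected_revenue = expected_customers * average_spend

# Share of customers expected to buy light beer.
light_beer_share = light_beer_buyers / expected_customers

print(f"Expected revenue: ${expected_revenue:,}")
print(f"At least {light_beer_share:.1%} of customers are likely to buy light beer")
```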
Last Thoughts
So as I've mentioned, there is a lot of demand for this new-ish data scientist role and there is expected to be some significant shortfall in qualified talent to effectively satisfy this demand. This means that compensation will remain high, but will likely contract to some degree. There is currently a lot of hype as a result. Everyone seems to be implementing some new data science, or artificial intelligence (AI), or machine learning solution that will revolutionize their industry. People are doing cool and impactful things.
However, not all "data scientists" are created equal. There are a lot of data scientists and aspiring data scientists who are chasing this compensation. Some of these folks are less qualified than others and many simply lack experience. There are a lot of new data science graduate or undergraduate programs as universities see the demand, there are online courses of varied quality and there is a lot of misdirected energy caught up in this hype. For instance, folks from many different backgrounds are taking advanced Deep Learning courses online and are using abstracted software packages to solve these problems, but they have little experience in the industry or with manipulating data (we sometimes call this "data munging").
I'm certain that these courses teach a lot of valuable analytical tools and procedures, but many of them overlook the most fundamental element of data science: the "extract and manipulate raw data" part of things. In our experience, this data engineering work can consume around 80% of a data science project timeline. Data science is nearly all data engineering.
This means that SQL and Python skills are vastly more important than these courses suggest. I've seen graduate programs that graduate "data scientists" with absolutely no SQL skills. This is a problem and a potential pitfall.
Be discriminating about whom you hire, and understand which skills and experience will translate most to success. We'll cover how to hire a good data scientist in a future post, but for now at least you know what we mean when we talk about Data Science and the Data Scientists who practice it.