The Data Science Spectrum:
From Analyst to Machine Learning
The role of a data scientist has become narrower and more specialized as the demand for them has increased. In my last post, The Data Science Split, I talked about why I think this happened. In this post, I will walk through a few of the most common roles in the data ecosystem and cover what they do and what their skill sets are.
It is useful to know where you prefer to be on the data science spectrum, as it will determine what roles you should apply for. “What are the responsibilities of this position and what are the key skills to be successful in it?” is one of the first questions I ask when applying to a new position. The answer lets me map that specific role to its place in the ecosystem and helps me determine if I would be interested in the job.
You could define the spectrum of data science along multiple axes, but I find using just one works pretty well[1]:
Engineeriness: Roughly, how close the role is to a traditional software engineering role.
On the “low engineeriness” side of the spectrum you have roles that work almost entirely with the contents of the data and with domain-specific languages for data access, processing, and plotting. As you move towards the other end you start working with “lower-level” languages, and often less on the data itself and more on the supporting tooling around it.
A particular job at a company could fall anywhere on the spectrum, but the title gives a good idea of where it fits. Below are five common job titles in rough order from least to most “engineery”. Of course, in the real world, the responsibilities of these jobs overlap heavily with those of their neighbors on the spectrum.
Business Analyst
A business analyst[2] uses data to help the company understand what has happened, what is happening, and what will likely happen so they can make better decisions. Their primary deliverables are internal-facing reports, dashboards, and presentations. They are generally really adept at SQL and making data visualizations, but are less likely to use general-purpose languages like Python.
Data Scientist
A data scientist[3] is an expert at statistics and experimental design. They don’t just plot trends; they understand what causes them and how you can influence them. They can clean a dataset, find biases, and then use it to power decisions and products. They work with more general programming languages like R or Python.
Machine Learning Modeler
Machine learning modeler[4] is a rarer title, but I include it because I feel it fills the hole between data scientist and machine learning engineer. This role focuses on building models that directly impact customers. Modelers find customer problems, build machine learning models to solve them, and own those models all the way from first iteration through hosting them in production. They use Python, machine learning frameworks like TensorFlow, sometimes Scala and Spark, Docker, and REST APIs.
This is the role I feel most comfortable in, with its mix of software development, machine learning, and direct impact on customers.
Machine Learning Engineer
A machine learning engineer[5] focuses on the platforms underlying machine learning modeling and hosting. They often build ML tooling, hosting, and pieces of ML-specific infrastructure like feature stores. They focus on making sure the machine learning models can scale to meet the demands of running in production and return answers fast enough to be used. They generally work with lower-level languages than the modelers do, such as Scala or Java. Many MLEs come from a software engineering background.
Data Engineer
Data engineers build the infrastructure the data flows through: all the SQL and NoSQL databases, queues, streams, and so on that power the business and allow the other data roles to make use of its data. They’re experts in cloud services (where these systems are mostly hosted) and in scaling systems to meet the demands of millions or billions of users while collecting and organizing their data.
[1] If I were to add a second axis, it would probably be Researchiness, to differentiate the product-focused data roles covered in this post from the more academic roles present at some large companies. The biggest difference is that “publishing papers” is a metric the more researchy roles track.
[2] Sometimes data analyst, business intelligence analyst, or even data scientist.
[4] This role is sometimes called data scientist, sometimes machine learning engineer; often those two roles split the responsibility. These roles are closer to Michael Hochster’s Type B Data Scientists.
[5] Sometimes machine learning infrastructure engineer.