As two very popular tech roles for 2022, the Data Scientist and Machine Learning Engineer can overlap or be entirely distinct, depending on the organization you work for. However, general differences between these positions require certain skill sets that you must be prepared for when applying for jobs.
Overlap between these two popular tech roles is sure to happen, so let’s dive deep into what skills are required for both roles and what makes them different. In general, data scientists can expect to work on the modeling side more, while machine learning engineers tend to focus on the deployment of that same model. Data scientists focus on the ins and outs of the algorithms, while machine learning engineers work to ship the model into a production environment that will interact with its users.
I will be describing these top skills from my personal experience in 2021. I have seen a lot of articles list other skills and tools that data scientists use, but I want to describe the ones that most people I know, including myself, use daily. While new skills are always emerging, these three, in my experience, have been the most prominent ones worth investing in, whether that investment is time or money.
Data scientists can expect to use the popular programming language Python nearly every day, while some use R instead. The two languages serve the same purpose: the goal is to ingest data, explore it, process it, engineer features, build models, and communicate results, all within a single language.
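To make that workflow concrete, here is a minimal sketch of the steps in plain Python. The housing data, column names, and the simple least-squares fit are all made up for illustration, and in practice most data scientists would reach for libraries rather than the standard library used here.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a real file or database export.
RAW_CSV = """sqft,bedrooms,price
800,2,160000
1000,2,200000
1200,3,240000
1500,3,300000
"""

# 1. Ingest: parse the CSV text into a list of dictionaries with numeric values.
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(RAW_CSV))
]

# 2. Explore: look at a simple summary statistic.
average_price = mean(row["price"] for row in rows)

# 3. Feature engineering: derive a new column from existing ones.
for row in rows:
    row["sqft_per_bedroom"] = row["sqft"] / row["bedrooms"]

# 4. Model: fit price = slope * sqft + intercept with ordinary least squares.
xs = [row["sqft"] for row in rows]
ys = [row["price"] for row in rows]
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# 5. Communicate: report the result in plain language.
print(f"Average price: {average_price:,.0f}")
print(f"Each extra square foot adds about {slope:,.0f} to the price")
```

The point is less the model itself and more that every stage, from reading the data to reporting the result, happens in one language.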
Jupyter Notebook or a popular IDE
Data scientists often use this tool because it serves as one central place to code, write text, and view various outputs like results and visualizations. Jupyter Notebook is a go-to for data scientists, and I do not think that will change any time soon. There are also some extensions that you can add in order to make your coding a little easier. Some other popular IDEs that are more focused on coding include PyCharm and Atom.
SQL (Structured Query Language) is essential for data scientists because data is the foundation of the machine learning algorithm that will ultimately be part of the final data science model. Data scientists use SQL at the start of the data science process, for querying the initial data and creating new features. Then, at the end of the process, after the model has been run and deployed, the results are saved in your company database, which typically uses SQL as well. There are plenty of SQL databases and platforms, such as MySQL, PostgreSQL, and Microsoft SQL Server, and the company you work for usually decides which one you will use. However, all are very similar.
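Here is a small sketch of both ends of that process, using Python's built-in sqlite3 module as a stand-in for a company database. The orders and predictions tables, their columns, and the scoring formula are hypothetical and only there to give the queries something to work on.

```python
import sqlite3

# An in-memory SQLite database stands in for the company database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 20.0), (1, 40.0), (2, 15.0);
    CREATE TABLE predictions (customer_id INTEGER, score REAL);
    """
)

# Start of the process: query the raw data and create features in SQL.
features = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)    AS order_count,
           AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

# Placeholder "model": a made-up score, used only so there is something to save.
scored = [(cid, order_count * avg_amount) for cid, order_count, avg_amount in features]

# End of the process: write the model output back to the database.
conn.executemany("INSERT INTO predictions VALUES (?, ?)", scored)
conn.commit()

saved = conn.execute(
    "SELECT customer_id, score FROM predictions ORDER BY customer_id"
).fetchall()
print(saved)
```

The same SELECT and INSERT statements would look nearly identical on MySQL, PostgreSQL, or SQL Server, which is why the specific platform matters less than knowing SQL itself.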
If you can master these three base skills, you will be well on your way to being a great data scientist. There are, of course, more skills you can learn as a data scientist, but it is not uncommon to learn skills on the job, as companies use different tools and require different skills. The main thing to consider is that, in general, you will need to know the following:
- a programming language
- an IDE/visualization platform
- a querying language
Machine learning engineers often come into play after the model has been built by the data scientist. Their main focus is to dive deeper into the code and ship it, a process also called deployment. For example, a machine learning engineer does not necessarily need to know how random forest works, but they do need to know how to automatically save and load a model file so that it can make predictions within a production environment. Overall, they tend to be more software engineering-focused.
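As a rough illustration of that save-and-load step, the sketch below stores a model's parameters in a JSON file and reads them back before predicting. The parameter values, file name, and function names are hypothetical, and JSON is chosen here only to keep the example dependency-free. Real projects often serialize the fitted model object itself with tools such as pickle or joblib.

```python
import json
import tempfile
from pathlib import Path


def save_model(params: dict, path: Path) -> None:
    """Write the fitted parameters to disk so another process can load them."""
    path.write_text(json.dumps(params))


def load_model(path: Path) -> dict:
    """Read the parameters back, for example when a service starts up."""
    return json.loads(path.read_text())


def predict(params: dict, sqft: float) -> float:
    """Apply the loaded linear model to one input value."""
    return params["slope"] * sqft + params["intercept"]


# Hypothetical parameters, as if handed over by a data scientist.
trained = {"slope": 200.0, "intercept": 5000.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.json"
    save_model(trained, model_path)
    loaded = load_model(model_path)

prediction = predict(loaded, 1000)
print(prediction)
```

Notice that nothing here requires knowing how the parameters were estimated, which is the division of labor described above.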
Both data scientists and machine learning engineers should know Python. However, even though the two roles share this programming language, machine learning engineers will need to be more thoroughly trained in Python overall. Machine learning engineers focus more on object-oriented programming (OOP) in Python, whereas data scientists tend not to be as OOP-heavy, mainly because their job is to build the model and focus on the analytics and statistics involved, not necessarily all of the code. Of course, there are data scientists and machine learning engineers who are great at both, and some companies will make this a requirement. You will need to confirm this with them so that you know whether you will be a more statistics-focused data scientist or a more software engineering and machine learning-focused data scientist.
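To show what "more OOP" can look like in practice, here is a minimal sketch that wraps a tiny linear model in a class with fit and predict methods. The class name and interface are my own illustration, not a specific library's API, but the idea is that the model becomes a reusable, testable object rather than a sequence of notebook cells.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LinearModel:
    """A tiny one-feature linear model with a fit/predict interface."""

    slope: float = 0.0
    intercept: float = 0.0

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> "LinearModel":
        """Estimate slope and intercept with ordinary least squares."""
        if len(xs) != len(ys) or len(xs) < 2:
            raise ValueError("xs and ys must have the same length (at least 2)")
        x_bar = sum(xs) / len(xs)
        y_bar = sum(ys) / len(ys)
        spread = sum((x - x_bar) ** 2 for x in xs)
        if spread == 0:
            raise ValueError("xs must contain at least two distinct values")
        self.slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / spread
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, x: float) -> float:
        """Return the model's prediction for a single input."""
        return self.slope * x + self.intercept


# Usage: fit on toy data that follows y = 2x + 1, then predict a new point.
model = LinearModel().fit([1, 2, 3], [3, 5, 7])
print(model.predict(4))
```

The input validation in fit is the kind of detail that matters more once code leaves the notebook and runs unattended.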
Most engineers use git and GitHub to version and store code repositories. This code management tool and platform are essential for machine learning engineers, who rely on them for things like code changes and pull requests. Oftentimes, both data scientists and machine learning engineers are well-versed in this skill. However, machine learning engineers usually work more heavily with git and GitHub.
This skill is perhaps where machine learning engineers and data scientists differ the most. Yes, some data scientists know how to deploy a model, and some companies require it. But if the role is machine learning engineer, you can expect the main part of your job to focus on deploying data science models. There are plenty of tools for this, such as AWS, Google Cloud, Azure, Docker, Flask, MLflow, and Airflow, just to name a few.
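For a feel of what deploying a model means at the smallest scale, here is a sketch of a prediction endpoint. It uses Python's built-in http.server instead of Flask or a cloud platform, purely to stay dependency-free, and the model parameters, port, and request format are assumptions for illustration. It is not meant as production-ready code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Hypothetical parameters; a real service would load them from a saved model file.
MODEL = {"slope": 200.0, "intercept": 5000.0}


def handle_predict(body: bytes) -> Tuple[int, dict]:
    """Turn a JSON request body into an HTTP status code and a JSON-ready reply."""
    try:
        sqft = float(json.loads(body)["sqft"])
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected a JSON body like {"sqft": 1000}'}
    return 200, {"prediction": MODEL["slope"] * sqft + MODEL["intercept"]}


class PredictHandler(BaseHTTPRequestHandler):
    """Answer POST requests by running the request body through the model."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)


# Exercise the request logic directly, without opening a network port.
print(handle_predict(b'{"sqft": 1000}'))
```

Keeping the prediction logic in a plain function, separate from the HTTP handler, makes it easy to test without starting a server. Calling `make_server().serve_forever()` would start listening on the assumed port.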
I find that when the title is machine learning engineer, it often really means machine learning operations engineer. This can be misleading, as you might expect a machine learning engineer to focus only on how machine learning algorithms work. So make sure you know whether the role you are applying to is algorithm-focused or operations-focused (MLOps).
While some companies prefer a well-rounded scientist who is capable of both data science and machine learning (operations), a lot of companies will prefer a specialist in one area, as they will have the two roles separated out on their team. It can be a lot for one person to do everything from start to end, so having two designated people where one is focused on model building and one is focused on model deployment is a more efficient approach oftentimes.
To summarize, here are the key skills for each role. Keep in mind there are plenty more. However, these are skills that are important nonetheless:
Data scientist:
- Python (or R)
- Jupyter Notebook/IDE
- SQL
Machine learning engineer:
- Python
- Git and GitHub
- Deployment Tools