Do you want to learn data science too? I’m a self-taught data scientist. My data science journey started in 2018. Those who follow my writing may know that I like sharing my experience of learning data science. I write about the mistakes I made, the challenges I faced, the tools I frequently use, and so on.
In this article, I would like to share 3 suggestions for those who plan to become a data scientist or have just started learning data science. These are based on my own experience and what I observe in the data science ecosystem.
Without further ado, let’s get started.
1. Be agile
More and more businesses invest in data science with the aim of converting data into value. The form of this value depends on the business and the industry.
- Retailers use data science to manage their inventory more efficiently by creating accurate and robust machine learning models.
- Factories collect and analyze large amounts of sensor data for predictive maintenance.
- I have seen some restaurants implement image processing systems to detect which items are put in the trash. This allows them to better manage how much they should cook.
The list goes on. There are lots of data science applications and products used in a variety of industries.
The tools behind these applications evolve quickly. Once you are comfortable working with a particular tool, learning a new one can feel like a waste of time. However, newer tools often offer better performance or productivity. Follow the advancements in technology and research as much as you can, and don’t hesitate to try out new tools.
This, of course, does not mean that you should learn how to use everything available out there. That is neither possible nor necessary. As you gain more experience in the field, you will develop a sense of what is promising and has potential worth exploring. The fundamental requirement, though, is to be ready for change.
2. Certificates matter, but don’t count on them
The amount and variety of the resources for learning data science is immense. You can read books, watch tutorials, take online courses, and so on.
Then there is the abundance of certificates. You can find a certificate on pretty much any data science topic. Some cover a broad range, while others focus on a specific task such as data cleaning with Pandas.
If you follow a self-learning path, certificates come in handy at first. I started off by collecting a couple of them. Certificates have two important advantages:
- They are much cheaper than traditional learning methods such as a master’s degree.
- They are usually well-organized and structured, so you can familiarize yourself with the field quickly.
While I agree that certificates are beneficial, I suggest not focusing too much on them. Having 20 certificates will not have a significant impact on hiring managers or recruiters; I doubt they will read through a list of 20 certificates.
Also, what you learn from certificates is limited. Most of them involve watching tutorials and solving simple exercises. You can understand a topic by watching a tutorial, but in order to actually learn it, you need hands-on experience and active involvement.
3. Do a project that imitates an entire workflow
From the outside, the job of a data scientist seems to be analyzing data to extract insights and create models. This was what I thought, at least.
Now that I’m in the field, my view of what data scientists do is very different. Extracting insights from data and creating models is, of course, an important part of it. However, in most cases, what is expected from a data scientist goes beyond that.
It heavily involves what is traditionally the job of data engineers. For instance, as a data scientist, you will probably need to take part in ETL (extract, transform, load) processes. Depending on the company, you may also have to handle some software engineering tasks.
The most challenging part, I think, is doing machine learning in production and at scale.
Let’s say you are assigned to create a machine learning model for sales forecasting. When learning data science, we usually work in Jupyter notebooks. In real life, though, your model needs to be deployed to production. Deployment may be your responsibility, or you may need to take part in it.
In any case, I suggest familiarizing yourself with the tools used for machine learning in production. The work does not have to involve machine learning, by the way. It could instead be collecting data from a few different sources, cleaning and combining them, and performing some analysis. The common thread is that it needs to run in production.
What helps the most with these tasks is doing a project that covers the steps of a typical data science workflow in production.
Here is a suggestion of a project:
- Collect data stored in the cloud (e.g. an S3 bucket)
- Run a script that cleans and preprocesses the data
- Create a machine learning model and train it on the preprocessed data
- Make predictions
- Write the predictions back to the cloud
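The steps above can be sketched in Python. The S3 paths, column names, and model choice below are illustrative assumptions, not part of the project description; a small in-memory DataFrame stands in for the cloud read so the sketch is self-contained.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Collect: in production this would read from the cloud, e.g.
#    df = pd.read_csv("s3://my-bucket/sales.csv")  # hypothetical bucket
# Here a small in-memory stand-in keeps the sketch runnable locally.
df = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 1, 2],
    "ad_spend": [100.0, 150.0, 80.0, None, 120.0, 90.0],
    "sales":    [1000.0, 1400.0, 850.0, 900.0, 1150.0, 950.0],
})

# 2. Clean and preprocess: drop rows with missing values.
clean = df.dropna()

# 3. Train a model on the preprocessed data.
X, y = clean[["store_id", "ad_spend"]], clean["sales"]
model = LinearRegression().fit(X, y)

# 4. Make predictions.
clean = clean.assign(predicted_sales=model.predict(X))

# 5. Write back: in production, e.g.
#    clean.to_csv("s3://my-bucket/predictions.csv")  # hypothetical bucket
print(clean[["sales", "predicted_sales"]])
```

In a real project, each numbered step would typically live in its own script so they can be scheduled and retried independently.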
This entire process can be run on an EC2 instance and orchestrated using Airflow. By successfully completing this project, you will gain hands-on experience in the following areas:
- Cloud computing
- Data cleaning and preprocessing
- Machine learning
- Orchestration of data pipelines and workflows
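To illustrate the orchestration step, the project could be wired up as a minimal Airflow DAG like the one below. This is a sketch assuming Airflow 2.4+ (for the `schedule` argument); the DAG id and the task functions are hypothetical placeholders for the scripts described above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task functions; each would call one step of the pipeline.
def collect():            # read raw data from the cloud (e.g. S3)
    pass

def preprocess():         # clean and preprocess the raw data
    pass

def train_and_predict():  # fit the model and make predictions
    pass

def write_results():      # write predictions back to the cloud
    pass

with DAG(
    dag_id="sales_forecast_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="collect", python_callable=collect)
    t2 = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t3 = PythonOperator(task_id="train_and_predict",
                        python_callable=train_and_predict)
    t4 = PythonOperator(task_id="write_results", python_callable=write_results)

    # Run the steps in order: collect -> preprocess -> model -> write back.
    t1 >> t2 >> t3 >> t4
```

Even a toy DAG like this teaches the key production habits: splitting work into retryable tasks, declaring dependencies explicitly, and letting a scheduler run the pipeline instead of a notebook.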