How Not to Fail at Data Science
List of Data Science Best Practices that Will Set You Apart from the Crowd
As of July 2020, more than 900,000 LinkedIn profiles contain the words “Data Science”. Out of these 900K profiles, the majority are either beginners – aspiring Data Scientists working their way toward the “sexiest job of the 21st century” – or self-learners gradually finding their way through the immense world of Data Science, getting better in their respective niches such as Natural Language Processing, Computer Vision, etc.
Now the thing is, learning is a long (and not so easy) process that requires a lot of hard work and perseverance. And if you are learning by yourself, things get even harder.
Most self-learners in Data Science and Machine Learning generally get started in their journey by taking up online courses or via YouTube tutorials. Not saying that these online courses are bad; in fact, a lot of these courses (such as Stanford’s Machine Learning courses by Andrew Ng) are exceptionally good in helping you form a strong foundation over the basic concepts.
However, there’s a huge restriction on the amount of content that such MOOCs can offer, because it is simply not possible to cram Ph.D.-level Machine Learning concepts into a 30-hour course meant for beginners.
A lot of self-learners and beginners get stuck in a never-ending cycle of tutorials. Instead of reading books, working on research projects, practicing and implementing their skills in real-world solutions, and diving deeper into different topics to strengthen their conceptual foundations, many students start a new beginners’ tutorial as soon as they finish their current one, under the impression that they are getting better – while in reality, they are learning the same things over and over without even realizing it.
This article is for self-learners and beginners in the field of Data Science: a guide to some of the best practices that one can follow to become better than just “mediocre”. It is also meant to encourage you to do some research of your own, as a way out of the tutorial loop.
So, let’s look at these best practices one by one.
Know Your Data!
Diving headfirst into building a deep learning or machine learning model as soon as you are handed the data is the equivalent of wandering into the middle of a desert and trying to find your way out. You might make it out, but the odds in your favor are slim.
The first and foremost thing to do when you get the data is to understand what the data is about. The better you understand the kind of data you are working with (structured or unstructured), and the type of problem statement your project aims to deal with, the better your chances of coming up with an effective solution. Not only this, but it also helps you make a proper plan for how your project will be structured and what tasks you will need to perform.
So, now that we know why knowing your data is important, let us also look at what exactly knowing the data involves.
- Understanding what kind of problem you are working with- Firstly, you should understand what kind of problem you are dealing with. Is it a regression problem, where you have to predict a continuous target variable based on the various features available to you? Or a classification problem, where you aim to predict the target class a data instance belongs to? In some cases, the problem can be a clustering one, where you have to group the various data instances based on the similarity of their features.
Understanding the problem statement will allow you to decide upon a certain set of model architectures that you can experiment with instead of going an unplanned route, just hoping for the best.
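To make the three problem types above concrete, here is a minimal scikit-learn sketch (the synthetic data and coefficients are invented purely for illustration): the same feature matrix can feed a regressor, a classifier, or a clustering algorithm, depending on what the problem statement asks for.

```python
# Sketch: one feature matrix, three problem types (data is synthetic).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features

# Regression: predict a continuous target
y_reg = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)

# Classification: predict a discrete class label
y_clf = (y_reg > 0).astype(int)
clf = LogisticRegression().fit(X, y_clf)

# Clustering: no target at all; group similar rows instead
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(reg.score(X, y_reg))  # R^2 of the regressor
print(clf.score(X, y_clf))  # accuracy of the classifier
print(km.labels_[:5])       # cluster assignments of the first rows
```

Notice that the clustering step never sees a target: that is what makes it an unsupervised problem, and why identifying the problem type first narrows down which estimators are even applicable.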
- Data Wrangling- Also commonly known as data preprocessing, this is a key step in any Data Science project. Data wrangling involves cleaning your data (i.e., removing any unnecessary data), dealing with null values, feature engineering (i.e., creating new features from the existing set of features), scaling the data (i.e., using different normalization and standardization techniques to balance the feature weight distribution), etc.
The reason why data wrangling is so important is that the data you are working with might contain a lot of noise, or, in simpler words, several undesired values. Let’s say a certain column in your dataset has a lot of null values. In such a case, using that column as a feature for training might give really poor inference performance.
Let’s consider one more example. Assume that the values in one column of your feature set are of the order 1e+5, while in another column they are of the order 1e-3. If you train the model without scaling the values, the first column will most probably overshadow any effect of the second column on the target prediction. In such a case, the contribution of the second feature will effectively be nullified by the weight imbalance.
Thus, before you start working on modeling, preprocessing your data is a must! You might want to check out our Pandas tutorial if you want to learn more about data preprocessing.
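The wrangling steps described above can be sketched in a few lines of pandas and scikit-learn. The column names and values below are entirely made up for illustration; the point is the pattern: impute nulls, derive a feature, then scale so that a 1e+5 column cannot drown out a 1e-3 column.

```python
# Preprocessing sketch with hypothetical columns (names/values invented).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [152_000, 161_000, None, 148_000, 175_000],  # order ~1e5
    "rate":   [0.003, 0.001, 0.004, None, 0.002],          # order ~1e-3
})

# Impute null values instead of training on them directly
df["income"] = df["income"].fillna(df["income"].median())
df["rate"] = df["rate"].fillna(df["rate"].median())

# Simple feature engineering: derive a new feature from existing ones
df["income_x_rate"] = df["income"] * df["rate"]

# Scale numeric columns so no feature dominates purely by magnitude
num_cols = ["income", "rate", "income_x_rate"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

print(df[num_cols].mean().round(6))  # each column now centered near 0
```

After standardization, every column has zero mean and unit variance, so the model sees features on comparable scales regardless of their original units.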
- Exploratory Data Analysis (EDA) and Data Visualization- Let’s say you worked on a project, created a model that makes pretty accurate predictions, or derived some very useful insights from the data and now it’s time to present your results to the stakeholders or the consumers. Now, the thing is, most stakeholders aren’t Data Science experts. This is the reason they hired you.
So, your responsibility is to present your observations and project conclusions to the stakeholders in a format that they can perceive and interpret. And the best way to do that is through data visualization: insightful graphs displaying the trends in the data, summarizing all your work in a way that one can understand the output without having to read your code.
Therefore, if you want to be a good Data Scientist, you should be able to tell a story with the data using your data visualization skills. If you want to learn more about Data Visualization, you may check out our Seaborn tutorial.
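As a taste of what “a figure instead of code” looks like, here is a tiny seaborn sketch. The sales numbers are invented for the example; the output is a single labeled PNG a stakeholder can read without touching a notebook.

```python
# Minimal stakeholder-friendly chart (data is hypothetical).
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales": [120, 135, 150, 145, 170, 190],
})

ax = sns.barplot(data=df, x="month", y="sales", color="steelblue")
ax.set_title("Monthly sales")     # a headline the stakeholder can read
ax.set_ylabel("Units sold")
plt.tight_layout()
plt.savefig("monthly_sales.png")  # shareable artifact, no code required
```

A plain bar chart with a clear title and axis labels often communicates more in a meeting than the model that produced the numbers.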
A Good Data Scientist Knows His Math
A lot of people argue that you don’t have to be a mathematician to be a Data Scientist. Well, we can’t outright discard that argument; it is correct, but only partially.
There are a ton of Machine Learning and Deep Learning frameworks (like scikit-learn, TensorFlow, PyTorch) that will help you build models (or use pre-trained models to make things even easier) without understanding any of the math behind them.
While this might get you through the beginner phase pretty easily, once you start working on more advanced projects, you’ll find that most concepts will straight up stop making sense.
While reading research papers and working on research projects, you won’t be able to understand the math used for a custom loss or optimization function. You won’t be able to answer the mathematical questions asked during Data Science interviews, which will most certainly result in you losing a job opportunity.
So, if you want to be good at Data Science and are considering a career in it, it’s high time you start learning the math behind it. Here’s a list of resources that you can use to work on those mathematics skills of yours-
- Linear Algebra– Khan Academy (FREE)
- Differential Calculus– Khan Academy (FREE)
- Integral Calculus– Khan Academy (FREE)
- Multivariate Calculus– Khan Academy (FREE)
- Probability and Statistics– Khan Academy (FREE)
Advanced Level Courses-
- Mathematics for Data Science– Coursera (PAID)
- Linear Algebra (by Prof. Gilbert Strang)- MIT OpenCourseWare (FREE)
- Single Variable Calculus (by Prof. David Jerison)- MIT OpenCourseWare (FREE)
- Multivariate Calculus– MIT OpenCourseWare (FREE)
- NPTEL- Machine Learning Courses (FREE)
Data Science is All About Experimenting
Data Science, and the modeling phase in particular, is an iterative process. To solve a problem, you can’t just create a single model and be done with it. Data Science is all about experimenting. For a given problem type, you should try different model architectures and combinations of hyperparameters, evaluate the performance of each of these combinations, and then choose the architecture or algorithm that best serves your purpose.
Now, one might argue: why waste time experimenting with several models and algorithms when you can simply go for a complex neural network and be done with it? This line of thinking is wrong and considered a bad practice.
First of all, just because you are opting for a complex architecture for your model doesn’t guarantee good inference performance. Deep neural networks with very complex architectures suffer from several problems, such as vanishing gradients and overfitting, which can severely affect the model’s performance.
Secondly, it is very important to consider the performance-to-price ratio of your model. Deploying a complex architecture on a server will require a lot more computational resources than a simpler architecture, which simply means a higher cost of operation. So, if you are getting nearly equal or only slightly worse performance from a simpler architecture, you should go with the simpler one due to its better P2P ratio.
Other than this, a model may perform poorly on one set of hyperparameter values but exceptionally well on another. For example, let’s say we are working on a K-means clustering model. For the k-value 4 (i.e., 4 clusters), we get a very high loss. But with the k-value 6, we not only get a significantly lower loss but also much better-separated clusters. Therefore, tweaking the hyperparameters can sometimes result in a major performance boost!
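The k-sweep described above can be sketched in a few lines. Here we generate synthetic data with 6 true clusters (a choice made just for this illustration) and compare K-means inertia, scikit-learn’s name for the clustering loss, across several k values.

```python
# Sketch: sweep k and compare K-means inertia (the clustering loss).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 6 well-separated clusters (illustrative choice)
X, _ = make_blobs(n_samples=600, centers=6, random_state=0)

inertias = {}
for k in (2, 4, 6, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inertia drops sharply up to the true cluster count, then flattens
for k, loss in inertias.items():
    print(k, round(loss, 1))
```

Plotting inertia against k and looking for the “elbow” where the curve flattens is a common way to pick the hyperparameter, exactly the kind of experiment the section argues for.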
I hope this made it clear why experimenting with different model architectures and hyperparameters is important.
Now onto the last part.
Learn to Deploy Your Models
If you are a Data Scientist in the current job market, you must know how to deploy your models. A Data Scientist who knows MLOps and can deploy their model on a server so that the client can use it is instantly better than a ton of other, generic Data Scientists. Some of the highest-paying Data Science jobs require good knowledge and experience of MLOps. So, it’ll help you land a good job if you have a strong grasp of concepts like deploying your models behind a Django/Flask server, performing inference from a remote server via REST APIs, and working with technologies like Docker and Kubernetes.
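To show how small the first step of deployment can be, here is a minimal Flask inference server. The route name and JSON payload shape are our own choices, not a standard, and a tiny model is trained inline so the sketch is self-contained; in practice you would load a model saved during training instead.

```python
# Minimal Flask inference API (a sketch; route and payload are our choices).
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# Stand-in for a real trained model; normally you would load a saved one
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[0.2], [2.8]]}
    preds = model.predict(payload["features"])
    return jsonify({"predictions": preds.tolist()})

if __name__ == "__main__":
    app.run(port=5000)  # then POST JSON to http://localhost:5000/predict
```

From here, containerizing this script with Docker and putting it behind a proper REST gateway is the natural next step toward the MLOps skills mentioned above.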
At the beginning of this article, I said it was meant for beginners and self-learners, yet, as expected, many of the terms used here were not exactly beginner-friendly. So, I’d advise you to start researching, on our website itself, any of the terms mentioned above that you don’t know. These are some of the most sought-after Data Science skills and best practices in the current job scenario. So if you want to stand out from the rest of the crowd, make sure you get proficient in all of them.