Understanding EDA with FIFA-2021 Data Set
The article discusses the importance of EDA in Machine Learning by performing it on a popular EA Sports FIFA dataset.
Exploratory data analysis, also known as EDA is a very important term when it comes to Data Science or Machine Learning. In a real-world scenario, you will find that the data you are working with will seldom be perfect.
Either there will be some missing values in your dataset, or in some cases, some data in your dataset might not make any sense at all. You cannot just import your data set and start making your Machine Learning model. Instead, you will have to first “explore” your data set to have a better understanding of it.
So, what exactly is EDA? As the name suggests, it’s simply exploring your data using different techniques to fetch some meaningful results or observations out of it. It helps us in getting better insight from our data and getting some underlying information that we would not get otherwise. So let’s see how EDA works and how it is beneficial.
For this analysis, we will be using the FIFA-2021 players’ dataset. The project notebook with the code is linked at the bottom of the article for your reference.
The first step of this analysis will be importing all the necessary libraries and the dataset itself. Here’s an overview of the dataset after importing.
Now that we have our data imported, we can start by checking the null values and the data types of the columns. Our model would under-perform massively if we have a lot of null/missing values in our data or any missing value.
As you can see from the image above, we have zero null values in our data, but this won’t be the case always. We also checked the data types of the columns, and now we are clear with the type of data each column consists of. Now, we can continue analyzing our data further.
Now let’s rank the nations based on the criteria of the number of players in FIFA-21 from each country, and then we will view the top-50 of these countries.
As we can see, most players in the game have England as their nationality, followed by Germany, Spain, and Argentina. Now go ahead and take a look at the age distribution of the players in the game. Here, we will find the number of players in each age group.
As expected, most players fall between the age of 20 to 30 years. Most teams prefer to have young players as they have better fitness as compared to some of the older players. Many players are 25–26 years old. The reason for this can be that apart from their fitness, these players also have some experience to offer to their teams. Hence, these players are of great value. Next up, let’s take a look at the youngest players in the game.
Similarly, let’s have a look at some of the oldest players according to our data.
Now that we know the youngest and oldest players present in our data, let’s take a look at the youngest and the oldest squads in the game.
Now that we are aware of the youngest and oldest squads in the game, We will now have a look at the top-10 players present in the game as per the data.
As we can see Lionel Messi is the highest-rated player in the game followed by Cristiano Ronaldo and Neymar Jr. Messi had a great season earlier so him being the highest rated player is acceptable. So far we have explored the nationalities and ages of the players as well as the squads. We also checked the top 10 players in the Game. Let’s go ahead and explore a few players and the positions that they play. We will now check the top center backs, strikers, and goalkeepers, in the game.
And just like that, we can see which players are the best for a specific position. Having this knowledge might help a person put up a good team while competing. Let’s also try and see the best players for the position of the right-wing. most of the people would have guessed Messi to be the top right-winger in the game but there’s a catch to this.
Oh, wait a minute. There is not a single famous name in here! But why? In our data, some players play in more than one position. Messi plays multiple positions and not just right-wing, hence we would have to segregate the position values in our data accordingly. Now here comes the importance of exploring your data! Had we not performed an EDA, we would never be known about the different position values and this might have affected our model if we would have tried to make a classification-based model. Moving on, let’s take a quick look at how different columns correlate with each other.
Next up, we will define a few methods in Python to perform the following tasks-
1. Fetch individual player’s information.
2. Fetch information regarding players representing a particular country.
3. find the squad of any particular club.
4. Find players with a similar overall score.
Now, let us observe some of the results that we’ve got here.
In the analysis given above, we started by checking Cristiano’s stats present in our data. We went on to check the Indian squad and Real Madrid’s club stats. One thing to be noted is that this is not the complete data but just a subset of it, as it was not possible to fit all the rows and columns in one snippet. And Finally, we compared the players that have a similar overall rating, in our case we found that to be 86. Next, let’s compare the age of the players with their overall and potential Ratings.
From the graph given above, we can see that as the age of the players increases their overall rating increases too and if you have played the game chances are that you must be aware of the trend. This is because as you grow you gain more experience and that helps in an increase in the overall rating. But, at one point in time, the player’s fitness starts to drop, and after a certain rating has been achieved, there is no room for further increase in the overall rating, and hence, it starts to drop. There are some extreme points in our graph and that is due to players like Neymar Jr. who is 28 years old and has a 90+ overall rating and Cristiano Ronaldo who is 35 years old and has an Overall Rating of 92.
Let’s carry out the same thing for the potential of a player.
As we can see, the older the player, the lesser the chance of an increase in the player’s stats. Players between the age of 17 to 22 have a greater potential depending on individual player’s stats. Here too we have some extreme ups and downs and this is because some players are in Club Academies and have the potential of being 90+ rated players if trained well. Let’s take a look at the potential distribution of players in the game.
There are very few players that have the potential to have an overall rating of 90+. Most players have potential in the range 65–80, and after that, the numbers just keep going down.
Next, we will check the teams that have the best overall player scores and we will do the same thing with the potential of the players.
From the above graph, we can see that Barcelona has the best team in the game according to our data but this might change later as changes are made with the transfer window still open.
PSG is the team that has the players with the highest potential scores and its understandable as players like Neymar, Kylian Mbappe, Mauro Icardi, and several others that have a great potential score.
To wrap things up, we will check out the distribution of players on the world maps, comparing their potential and overall scores, and taking a look at the regions they belong to.
It’s very clear, Most players in our data that have good overall Ratings are from European and South American countries.
Let’s do the same thing for the potential of the players.
To sum things out, in this article, we explored the FIFA-2021 dataset. We checked different factors as the distribution of the age of players, their potentials, overalls, etc. We also checked the individual statistics of the players and even had a look at club data. All these factors play a key role if you are trying to make a classification model based on a player’s positions, or say, a regression model to predict a player’s market value.
It’s very important to explore your data well before you take any further steps as it will help in revealing some things that you wouldn’t notice otherwise. With datasets like these, having a bit of domain knowledge is also important. Domain knowledge combined with good exploratory data analysis will yield great results. Many pro esports teams depend on this crucial data to choose their teams. This data is important for the company as well. It gives them an idea of what changes they need to make in the future to keep the game engaging and interesting.
Link for the complete notebook:- https://www.kaggle.com/aayushmishra1512/fifa-2021-detailed-eda
Link for the Data set:- https://www.kaggle.com/aayushmishra1512/fifa-2021-complete-player-data
My Github Profile:- https://github.com/AM1CODES
About the Author:
Aayush is a creative Machine Learning & Data Science Enthusiast who loves to play around with data. Skilled in Python, Frontend Web Development, Machine Learning, SQL & Data analysis, I recently ventured into the world of Deep Learning. One of his future goals is to share the knowledge that he has gained throughout his journey with data enthusiasts so that he can introduce them to this beautiful world of Data Science and Machine Learning.