Assumptions of Logistic Regression
In this article, we explain the assumptions of logistic regression in depth and discuss techniques for checking them on a given dataset.
Most people build logistic regression models without understanding the basic assumptions of logistic regression.
Hold on and think about this. How many times have you used a logistic regression model without understanding the assumptions behind it? In case you are new to it, logistic regression is a well-known supervised machine learning algorithm used for classification problems. It predicts a categorical dependent variable from a set of independent variables.
It’s very important to understand the assumptions of logistic regression to get the expected outcome. There are certain assumptions you have to take care of to improve the model’s performance.
Logistic Regression Algorithm
Before addressing the algorithm, let me tell you what regression is. Regression is a technique used to estimate the relationship between a dependent variable (y) and one or more independent variables (x). Logistic regression is one of the most popular and easy-to-implement classification algorithms. The term “logistic” comes from the logistic (sigmoid) function, which the model uses to turn a linear combination of the features into a probability; the logit function discussed later in this article is its inverse.
For the given input features (independent variables), the target variable takes discrete values only.
For example, classifying an animal as a cat or a dog is a classification problem, which can be solved using logistic regression.
Now, let’s focus on the main topic of this article i.e., assumptions.
Here are the assumptions of logistic regression; we will then discuss each of them in detail.
- The target variable is Binary
- Sample Independence
- Multicollinearity
- Outliers
- A linear relationship between the independent variable and logit of the target variable
- Large sample size
The target variable is Binary
This is the first assumption of logistic regression. According to this assumption, the target variable takes only two categorical values.
For example – yes or no, male or female, pass or fail, spam or not spam
How to check this assumption
Count the number of unique values present in the dependent (target) variable. If it has more than two unique values, binary logistic regression does not apply: use multinomial logistic regression when the categories have no natural order, or ordinal regression when they do.
To make things clearer, I am using the Titanic dataset. In this dataset, we have to determine the chances of survival of a person based on certain input features.
Here, Survived is the dependent (target) variable.
Checking if the target variable is binary
From the plot, we can see that the target variable is binary.
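The unique-value check described above can be sketched in a few lines. The `survived` list below is a made-up stand-in for the Survived column (it is not the real Titanic data), and `is_binary` is a hypothetical helper name used only for this illustration.

```python
from collections import Counter

# Hypothetical sample of the Survived column (1 = survived, 0 = did not).
# These values are invented for illustration, not read from the real dataset.
survived = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]

def is_binary(values):
    """Return True when the column holds exactly two distinct values."""
    return len(set(values)) == 2

counts = Counter(survived)  # how often each class occurs
print(dict(counts))
print("Binary target:", is_binary(survived))
```

If the data lives in a pandas DataFrame, `df["Survived"].nunique()` and `df["Survived"].value_counts()` give the same information.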
Sample Independence
The next assumption of logistic regression is that the observations of the dataset are independent of each other. The dataset should not contain duplicate rows, repeated measurements of the same subject, or matched observations. If the observations are correlated with each other, it can affect the performance of our model.
How to check this assumption
This assumption is easy to check. We can sort our data points and compare each pair of consecutive rows to find duplicates. Another way to check this assumption is to plot the residuals against time (or the order of the observations) and check whether the pattern is random. The assumption is violated if the pattern is not random. An in-depth exploratory analysis can also help detect such deviations in the data.
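The sort-and-compare idea can be sketched as follows. The rows are invented (age, fare, survived) tuples and `find_duplicates` is a hypothetical helper, so treat this as an illustration of the technique rather than a complete independence check; it only catches exact repeats.

```python
# Hypothetical rows of (age, fare, survived); the values are illustrative only.
rows = [
    (22, 7.25, 0),
    (38, 71.28, 1),
    (26, 7.92, 1),
    (22, 7.25, 0),  # exact repeat of the first row
    (35, 8.05, 0),
]

def find_duplicates(records):
    """Sort the records, then compare each one with its predecessor."""
    ordered = sorted(records)
    return [cur for prev, cur in zip(ordered, ordered[1:]) if prev == cur]

duplicates = find_duplicates(rows)
print("Duplicate rows:", duplicates)
```

Sorting puts identical rows next to each other, so a single pass over consecutive pairs is enough to find them.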
Multicollinearity
A critical assumption of logistic regression is that there should be little or no multicollinearity in the provided dataset. Multicollinearity occurs when the features (independent variables) of the dataset are so highly correlated with each other that they do not contribute unique or independent information to the regression model.
If a model has correlated variables, it becomes hard to determine which variable contributes to estimating the target variable. If the level of correlation is high between variables, it leads to problems while fitting and interpreting the model.
How to check this assumption
The most popular approach to detecting multicollinearity is the correlation matrix, which shows the direction and strength of the pairwise correlation between the independent variables in a given dataset.
Another technique is the Variance Inflation Factor (VIF). The VIF of a variable measures how much the variance of its estimated coefficient is inflated by its correlation with the other independent variables. Unlike the correlation matrix, it also detects a variable that is correlated with a combination of several other variables.
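Both checks can be sketched with NumPy alone. The features below are synthetic (an assumption made for illustration: `x3` is built to be almost `x1 + x2`), and `vif` is a hypothetical helper that computes 1 / (1 - R²), where R² comes from regressing one feature on all the others.

```python
import numpy as np

def vif(X):
    """VIF of each column: 1 / (1 - R^2), with R^2 obtained by regressing
    that column on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic features, invented for this example only.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)  # nearly a sum of x1 and x2
x4 = rng.normal(size=500)                       # unrelated feature
X = np.column_stack([x1, x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix
vifs = vif(X)
print(np.round(corr, 2))
print(np.round(vifs, 1))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity. In this sketch the first three features should be flagged, while the unrelated one stays close to 1.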
Outliers
Logistic regression is very sensitive to outliers. It assumes that there are no outliers or influential observations in the given dataset. We can get unexpected outcomes due to the presence of just one outlier in our data. Outliers affect the performance of our model.
How to check this assumption
A simple way to test for extreme outliers in a given dataset is to calculate Cook’s distance for each observation, which measures how much the fitted model changes when that observation is removed.
What to do if there are outliers in your dataset:
- Drop them
- Replace them with the mean or median
It is easy to detect these data points if we have one or two independent variables. But if the number of independent variables is large, we can use an anomaly detection technique.
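To show what the Cook’s distance check looks like in practice, here is a minimal NumPy sketch. It fits the model with iteratively reweighted least squares and uses the usual one-step approximation of Cook’s distance for logistic regression (squared Pearson residual times leverage, scaled by the number of parameters). The data is synthetic with one planted outlier, `fit_logistic` and `cooks_distance` are hypothetical helper names, and the 4/n cut-off is only a common rule of thumb.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(mu * (1.0 - mu), 1e-10, None)
        z = eta + (y - mu) / w  # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < 1e-8
        beta = beta_new
        if converged:
            break
    return beta

def cooks_distance(X, y, beta):
    """Approximate Cook's distance for each observation."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = np.clip(mu * (1.0 - mu), 1e-10, None)
    pearson = (y - mu) / np.sqrt(w)
    # Leverage: diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2)
    xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    h = w * np.einsum("ij,jk,ik->i", X, xtwx_inv, X)
    return pearson**2 * h / (X.shape[1] * (1.0 - h) ** 2)

# Synthetic data, invented for this example only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(200) < p).astype(float)

# Planted outlier: a very large x with the "wrong" label.
x = np.append(x, 6.0)
y = np.append(y, 0.0)

X = np.column_stack([np.ones_like(x), x])  # intercept + feature
beta = fit_logistic(X, y)
d = cooks_distance(X, y, beta)

threshold = 4 / len(d)  # rule-of-thumb cut-off
flagged = np.where(d > threshold)[0]
print("Most influential observation:", int(np.argmax(d)))
print("Flagged observations:", flagged)
```

Here the planted point is the last observation, so it should come out as the most influential one. On real data you would normally rely on a statistics library for the fit and the influence measures instead of hand-written helpers like these.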
A linear relationship between the independent variable and logit of the target variable
Logistic regression assumes that there is a linear relationship between the independent variable(s) and the logit of the target variable.
Mathematically, the logit function is represented as –
Logit(p) = log(p / (1-p))
Where p denotes the probability of success.
The logit function is also known as the log-odds function. The term p/(1-p) is known as the odds: the ratio of the probability of a positive outcome to the probability of a negative outcome.
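A quick numeric sketch makes the formula concrete. The function names `odds` and `logit` are chosen here for illustration, and the probabilities are arbitrary examples.

```python
import math

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of success: log(p / (1 - p))."""
    return math.log(odds(p))

for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(logit(p), 3))
```

A probability of 0.5 gives odds of 1 and a logit of 0. Probabilities above 0.5 give a positive logit and probabilities below 0.5 give a negative one, symmetric around zero.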
How to check this assumption
We can draw a scatter plot of each continuous independent variable against the logit of the predicted probability. If the data points roughly follow a straight line, the linearity assumption holds. We can also use the Box-Tidwell test to check this assumption.
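One simple way to approximate the scatter-plot check without fitting a model is to bin a feature and compute the empirical logit in each bin. This is a sketch of that idea, not the Box-Tidwell test. The data is synthetic and generated from a model that is linear in the logit, and `empirical_logits` is a hypothetical helper name.

```python
import numpy as np

def empirical_logits(x, y, n_bins=10):
    """Split x into equal-sized bins (by rank) and return, for each bin,
    the mean of x and the empirical logit of y."""
    order = np.argsort(x)
    x_bins = np.array_split(x[order], n_bins)
    y_bins = np.array_split(y[order], n_bins)
    centers = np.array([b.mean() for b in x_bins])
    # Adding 0.5 to both counts keeps the log finite if a bin is all 0s or all 1s.
    logits = np.array([
        np.log((b.sum() + 0.5) / (len(b) - b.sum() + 0.5)) for b in y_bins
    ])
    return centers, logits

# Synthetic data, invented for this example: the true logit is 0.3 + 1.0 * x.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=5000)
p = 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x)))
y = (rng.random(5000) < p).astype(int)

centers, logits = empirical_logits(x, y)
r = np.corrcoef(centers, logits)[0, 1]  # close to 1 when the trend is linear
print(np.round(centers, 2))
print(np.round(logits, 2))
print("Correlation between bin centers and empirical logits:", round(r, 3))
```

Plotting `centers` against `logits` gives the scatter plot described above. A clear curve in that plot suggests that the feature needs a transformation.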
Large sample size
The next assumption of logistic regression is that the dataset should be large enough to draw reliable conclusions from the logistic regression model.
How to check this assumption
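As a rule of thumb, you should have at least 10 cases of the least frequent outcome for each independent variable. This gives a minimum sample size of 10 * k / p, where k is the number of independent variables and p is the proportion of the least frequent outcome.
We have 5 independent variables, so we need at least 50 cases of the least frequent outcome. According to the rule, we should have at least 300 records in this dataset, a figure that corresponds to the least frequent outcome making up roughly one sixth of the records.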
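The rule of thumb is easy to turn into a small check. In this sketch, `min_sample_size` and `has_enough_events` are hypothetical helper names, the labels are invented, and the default of 10 events per variable is the rule quoted above rather than a hard requirement.

```python
import math

def min_sample_size(n_features, p_least_frequent, events_per_variable=10):
    """Smallest n such that the least frequent outcome is expected to
    supply `events_per_variable` cases per independent variable."""
    return math.ceil(events_per_variable * n_features / p_least_frequent)

def has_enough_events(y, n_features, events_per_variable=10):
    """Check the rule directly on the observed class counts."""
    counts = [sum(1 for v in y if v == c) for c in set(y)]
    return min(counts) >= events_per_variable * n_features

# Hypothetical labels: 60 positives and 140 negatives.
y = [1] * 60 + [0] * 140

print(min_sample_size(n_features=5, p_least_frequent=0.25))
print(has_enough_events(y, n_features=5))
```

With 5 independent variables the rule asks for at least 50 cases of the rarer class. The hypothetical labels above have 60, so the check passes, but it would fail if we wanted to use 7 variables.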
Conclusion
This is it for this article. We discussed the assumptions of logistic regression analysis and methods to check if these assumptions are met or not. If you liked the article, please upvote and share with others.
Thanks for reading!