Assumptions of Logistic Regression

In this article, we explain the assumptions of logistic regression in depth and discuss techniques for checking them on a given dataset.

Most people build logistic regression models without understanding the basic assumptions of logistic regression.

Hold on and think about this. How many times have you used logistic regression models without understanding the assumptions behind them? If you are new to logistic regression, it is a well-known supervised machine learning algorithm used for classification problems. It predicts a categorical dependent variable from a set of independent variables.

It’s very important to understand the assumptions of logistic regression to get the expected outcome. There are certain assumptions you have to take care of to improve the model’s performance.  

Logistic Regression Algorithm

Before addressing the algorithm, let me tell you what regression is. Regression is a technique used to model the relationship between a dependent variable (y) and one or more independent variables (x). Logistic regression is one of the most popular and easy-to-implement classification algorithms. The term “logistic” comes from the logistic (sigmoid) function; its inverse, the logit function, is what links the predicted probability to the input features in this method.

In logistic regression, the target variable takes only discrete values for the given input features (independent variables).

For example, classifying an animal as a cat or a dog is a classification problem that can be solved using logistic regression.

Now, let’s focus on the main topic of this article i.e., assumptions. 

I will first list the assumptions of logistic regression, and then we will discuss each of them in detail.

  • The target variable is binary
  • Sample independence
  • Little or no multicollinearity
  • No extreme outliers
  • A linear relationship between the independent variables and the logit of the target variable
  • Large sample size

The target variable is Binary

This is the first assumption of logistic regression. According to this assumption, the target variable takes only two categorical values. 

For example – yes or no, male or female, pass or fail, spam or not spam 

How to check this assumption

Count the number of unique values in the dependent (target) variable. If it has more than two unique values, binary logistic regression does not apply; use multinomial logistic regression for unordered categories or ordinal regression for ordered ones.

To make things clearer, I am using the Titanic dataset. In this dataset, we have to determine a person's chances of survival based on certain input features.

Here, Survived is the dependent (target) variable.

Checking if the target variable is binary

From the plot, we can see that the target variable is binary. 
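
The same check can also be sketched in a few lines of pandas. The small DataFrame below is made-up stand-in data, not the actual Titanic dataset; only the column name Survived comes from the article, and the helper name is_binary is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: only the column name
# "Survived" comes from the article, the values are made up.
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})


def is_binary(target: pd.Series) -> bool:
    """Return True if the column holds exactly two distinct non-missing values."""
    return target.nunique(dropna=True) == 2


# Frequency of each class, followed by the binary check itself.
print(df["Survived"].value_counts())
print("Binary target:", is_binary(df["Survived"]))
```

A bar plot of the same value counts gives the visual version of this check.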

Sample Independence

The next assumption of logistic regression is that the observations in the dataset are independent of each other. The dataset should not contain duplicated rows, repeated measurements of the same subject, or matched pairs. If the observations are correlated, the standard errors and significance tests of the model become unreliable.

How to check this assumption

This assumption is easy to check. We can sort the data points and compare each pair of consecutive rows for duplicates. Another way is to plot the residuals against time (or the order in which the observations were collected) and check whether they scatter randomly. The assumption is violated if the residuals show a systematic pattern instead of random scatter. An in-depth exploratory analysis can also help detect such deviations in the data.
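
As a rough sketch of the duplicate check, pandas can flag exact duplicate rows and identifiers that occur more than once. The data and the PassengerId identifier column below are assumptions made for illustration; this covers only the duplicate part of the check, not the residual plot.

```python
import pandas as pd

# Made-up records; "PassengerId" is a hypothetical identifier column.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 3, 4],
    "Age": [22, 38, 26, 26, 35],
    "Survived": [0, 1, 1, 1, 0],
})

# Rows that are exact copies of an earlier row.
n_duplicate_rows = int(df.duplicated().sum())

# Identifiers that appear more than once (possible repeated measurements).
id_counts = df["PassengerId"].value_counts()
repeated_ids = sorted(id_counts[id_counts > 1].index.tolist())

print("Exact duplicate rows:", n_duplicate_rows)
print("Repeated identifiers:", repeated_ids)
```

If either result is non-empty, it is worth finding out why before fitting the model.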

Multicollinearity

Another important assumption of logistic regression is that there should be little or no multicollinearity in the dataset. Multicollinearity occurs when the features (independent variables) are so highly correlated with each other that they do not contribute unique or independent information to the regression model.

If a model has correlated variables, it becomes hard to determine which variable contributes to estimating the target variable. If the level of correlation is high between variables, it leads to problems while fitting and interpreting the model. 

How to check this assumption

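The most popular approach to detect multicollinearity is the correlation matrix, which shows the direction and strength of the pairwise correlation between the independent variables in a given dataset.

Another technique is the Variance Inflation Factor (VIF). The VIF of a variable measures how much the variance of its estimated coefficient is inflated because that variable can be predicted from the other independent variables, so it also catches collinearity that involves more than two variables. A VIF of 1 means no correlation with the other variables; values above 5 or 10 are commonly treated as a sign of problematic multicollinearity.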
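
Both checks can be sketched on synthetic data. The feature names x1, x2, and x3 are made up, and x3 is deliberately built as a near copy of x1. Instead of a library routine, the sketch uses the identity that the VIF of each feature equals the corresponding diagonal element of the inverse of the correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: "x3" is built to be almost a copy of "x1".
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.1, size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise Pearson correlations between the features.
corr = X.corr()

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j
# on the other features. This equals the j-th diagonal element of the
# inverse of the correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=X.columns)

print(corr.round(2))
print(vif.round(1))
```

In this sketch, x1 and x3 should show a very high VIF while x2 stays close to 1; a common remedy is to drop or combine one variable of a highly correlated pair.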

Outliers

Logistic regression can be sensitive to outliers. It assumes that there are no extreme outliers or highly influential observations in the given dataset. Even a single such point can noticeably shift the estimated coefficients and hurt the model’s performance.

How to check this assumption

A simple way to test for extreme outliers and influential observations in a given dataset is to calculate Cook’s distance for each observation.

What to do if there are outliers in your dataset:

  • Drop them.
  • Replace them with the mean or median.

It is easy to detect these data points visually if we have one or two independent variables. But if the number of independent variables is large, we can use an anomaly detection technique.
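
The two remedies can be sketched for a single numeric feature. Note that this example flags outliers with the simpler interquartile range (IQR) rule rather than Cook’s distance, and the values are made up for illustration.

```python
import numpy as np

# Made-up values for a single numeric feature; 250.0 is an obvious outlier.
ages = np.array([22.0, 25.0, 27.0, 30.0, 31.0, 35.0, 38.0, 250.0])


def iqr_outlier_mask(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


mask = iqr_outlier_mask(ages)

# Option 1: drop the flagged values.
dropped = ages[~mask]

# Option 2: replace flagged values with the median of the remaining values.
replaced = ages.copy()
replaced[mask] = np.median(ages[~mask])

print("Flagged:", ages[mask])
print("After dropping:", dropped)
print("After replacing:", replaced)
```

The median is computed from the non-flagged values so that the outlier itself does not influence its own replacement.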

A linear relationship between the independent variable and logit of the target variable

Logistic regression assumes that there is a linear relationship between the independent variable(s) and the logit of the target variable.

Mathematically, the logit function is written as:

logit(p) = log(p / (1 - p))

Where p denotes the probability of success. 

The logit function is also known as the log-odds function. The term p/(1-p) is known as the odds: the ratio of the probability of a positive outcome to the probability of a negative outcome.
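
The definitions above translate directly into code. This is a minimal sketch; the function names odds, logit, and inverse_logit are chosen here for illustration.

```python
import math


def odds(p: float) -> float:
    """Odds of success: probability of success over probability of failure."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(odds(p))


def inverse_logit(z: float) -> float:
    """Logistic (sigmoid) function, which maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# A probability of 0.5 corresponds to odds of 1 and a logit of 0.
for p in (0.2, 0.5, 0.8):
    print(p, odds(p), logit(p))
```

Probabilities below 0.5 give negative log-odds and probabilities above 0.5 give positive log-odds, which is why a linear predictor that ranges over all real numbers can be mapped back to a valid probability.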

How to check this assumption

We can draw a scatter plot of each continuous independent variable against the logit of the predicted probability. If the data points fall roughly along a straight line, the linearity assumption holds. We can also use the Box-Tidwell test to check this assumption.
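
One hedged way to approximate the scatter plot check without fitting a model is a binned empirical logit: group a feature into bins, compute the log-odds of the outcome in each bin, and see whether those values line up against the bin means. The sketch below uses synthetic data generated with a truly linear logit, so the coefficients 0.5 and 1.2 are assumptions. It is not the Box-Tidwell test.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic data whose true logit is linear in x: logit(p) = 0.5 + 1.2 * x.
x = rng.uniform(-3, 3, size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)


def empirical_logits(x, y, n_bins=10):
    """Return the mean of x and the empirical log-odds of y in each quantile bin."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    # Assign each observation to a bin index in 0 .. n_bins - 1.
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    centers, logits = [], []
    for b in range(n_bins):
        in_bin = bins == b
        events = y[in_bin].sum()
        non_events = in_bin.sum() - events
        centers.append(x[in_bin].mean())
        # Adding 0.5 avoids log(0) when a bin contains only one outcome.
        logits.append(np.log((events + 0.5) / (non_events + 0.5)))
    return np.array(centers), np.array(logits)


centers, logits = empirical_logits(x, y)

# A correlation close to 1 (or -1) suggests an approximately linear relationship.
r = np.corrcoef(centers, logits)[0, 1]
print("Correlation between bin means and empirical logits:", round(r, 3))
```

Plotting centers against logits gives the scatter plot described above; clear curvature in that plot would suggest transforming the feature or adding a polynomial term.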

Large sample size

The next assumption of logistic regression is that the size of the dataset should be large enough to make suitable conclusions from the logistic regression model. 

How to check this assumption

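A common rule of thumb is to have at least 10 cases of the least frequent outcome for each independent variable. Equivalently, with k independent variables and p the expected proportion of the least frequent outcome, the minimum sample size is about 10 × k / p.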

We have 5 independent variables. According to the rule, we should have at least 300 records in this dataset. 
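
The rule of thumb can be turned into a quick calculation. In the sketch below, the function names, the proportion 0.25, and the label counts are assumptions picked only for illustration; they are not taken from the dataset used in this article.

```python
import math
from collections import Counter


def min_sample_size(n_predictors: int, p_least_frequent: float,
                    events_per_variable: int = 10) -> int:
    """Smallest sample size that satisfies the events-per-variable rule of thumb."""
    return math.ceil(events_per_variable * n_predictors / p_least_frequent)


def has_enough_events(y, n_predictors: int, events_per_variable: int = 10) -> bool:
    """Check whether the rarest class has enough cases for the given predictors."""
    least_frequent_count = min(Counter(y).values())
    return least_frequent_count >= events_per_variable * n_predictors


# Hypothetical labels: 60 positive and 140 negative cases.
y = [1] * 60 + [0] * 140

print(min_sample_size(n_predictors=5, p_least_frequent=0.25))
print(has_enough_events(y, n_predictors=5))
```

The rarer the least frequent outcome, the more records are needed for the same number of independent variables.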


Conclusion 

This is it for this article. We discussed the assumptions of logistic regression analysis and methods to check if these assumptions are met or not. If you liked the article, please upvote and share with others. 

Thanks for reading! 
