Linear vs Logistic Regression

Choosing the Right Model for Your Data

Picking the wrong regression model is one of the most common and most damaging mistakes in quantitative research. And it usually happens not because the researcher does not care, but because the choice between linear and logistic regression seems less important than it is. Both involve regression. Both have predictors. Both produce output. So, what is the big deal?

The big deal is this: these two models answer fundamentally different questions. Use the wrong one and your entire analysis, no matter how carefully executed, rests on a broken foundation. This guide will make sure that never happens to you.

Linear regression predicts a number. Logistic regression predicts a probability.

That single distinction drives every other difference between the two models their assumptions, their outputs, their interpretation, and the kind of research questions they are designed to answer.

If your dependent variable is something you can measure on a scale income, temperature, test scores, GDP linear regression is your starting point. If your dependent variable is a category whether someone defaults on a loan, whether a patient survives, whether a student passes or fails logistic regression is what you need.

Common mistake: Many researchers use linear regression with a binary outcome (coded 0 and 1) because it is simpler. This is technically wrong. Linear regression can produce predicted probabilities above 1 and below 0, which are mathematically impossible. Always use logistic regression for binary outcomes.

Understanding Linear Regression

Linear regression models the relationship between one or more predictors and a continuous outcome variable. It fits a straight line (or hyperplane in multiple dimensions) through the data that minimises the sum of squared residuals — the difference between what the model predicts and what actually happened.

The equation

Y = a + b1X1 + b2X2 + … + bnXn + e

Where Y is your continuous outcome, a is the intercept, b values are the slope coefficients, X values are your predictors, and e is the error term.

What the output tells you

Each coefficient b tells you how much Y changes for every one-unit increase in that predictor, holding all other predictors constant. This is a direct, intuitive interpretation — one that your thesis readers will understand without needing a statistics background.

Key assumptions

1. Linearity: The relationship between X and Y must be linear.

2. Independence: Observations must be independent of each other.

3. Homoscedasticity: The variance of residuals must be constant across all levels of X.

4. Normality: Residuals should be approximately normally distributed.

5. No multicollinearity: Predictors should not be highly correlated with each other.

Example research questions for linear regression

What is the effect of study hours on exam scores?

How does advertising spend affect monthly revenue?

What is the relationship between age and blood pressure?

Understanding Logistic Regression

Logistic regression models the probability that an observation belongs to a particular category. Rather than fitting a straight line, it fits an S-shaped curve (called the sigmoid function) that maps any input value to a probability between 0 and 1.

The equation

log(p / 1-p) = a + b1X1 + b2X2 + … + bnXn

The left side, log(p / 1-p), is called the log-odds or logit. It transforms a probability into a value that can range from negative infinity to positive infinity, which is what makes the linear equation on the right side valid.

What the output tells you

Logistic regression coefficients are interpreted in terms of log-odds, which most people find unintuitive. The standard practice is to exponentiate them into odds ratios, which are easier to communicate. An odds ratio above 1 means the predictor increases the likelihood of the outcome. An odds ratio below 1 means it decreases it.

Tip: In your thesis results section, always report both the log-odds coefficient and the exponentiated odds ratio. Odds ratios make your findings accessible to a broader audience including non-statisticians.

Key assumptions

1. Binary outcome: The dependent variable must be categorical (binary logistic) or ordinal (ordinal logistic).

2. Independence: Observations must be independent.

3. No multicollinearity: Predictors should not be highly correlated.

4. Large sample: Logistic regression needs a reasonably large sample. A common rule is at least 10 events per predictor.

5. No extreme outliers: Influential outliers can distort the model significantly.

Example research questions for logistic regression

What factors predict whether a student passes or fails an exam?

Does smoking status predict the probability of developing a disease?

What variables predict whether a customer will churn?

Side-by-Side Comparison

Feature	Linear Regression	Logistic Regression
Outcome variable	Continuous (e.g. salary, score)	Categorical (e.g. yes/no, pass/fail)
Output	Predicted numeric value	Probability (0 to 1)
Equation	Y = a + bX	log(p/1-p) = a + bX
Assumption	Normally distributed residuals	Binary or ordinal outcome
Error measure	RMSE, R-squared	Log-loss, AUC-ROC
Interpretation	One unit increase in X changes Y by b	One unit increase in X changes log-odds by b
Example use	Predicting house prices	Predicting loan default (yes/no)

How to Choose: A Decision Framework

Before you run any model, ask yourself three questions in order. The answers will tell you exactly which regression to use.

Question 1: What type of variable is your outcome?

This is the most important question and the answer almost always decides the model. If your dependent variable is continuous and numeric, start with linear regression. If it is binary, nominal, or ordinal, use logistic regression.

Question 2: What is your research question asking?

How much and by how much are linear regression questions. Will it happen and what is the probability are logistic regression questions. If your thesis asks whether certain factors predict the likelihood of an event, you need logistic regression regardless of how the data is coded.

Question 3: What does your data distribution look like?

Even with a continuous outcome, if your data is heavily skewed or bounded (for example, proportions that only range from 0 to 1), a standard linear model may not be appropriate. Similarly, if your outcome is count data, a Poisson regression may be more suitable than either.

Use this quick-reference table when making your decision:

Situation	Use Linear	Use Logistic
Outcome type	Continuous numeric	Binary or categorical
Research question	How much? By how much?	Will it happen? What is the probability?
Residuals	Normally distributed	Not required
Predicted values	Any number	Between 0 and 1
Model fit measure	R-squared	Pseudo R-squared, AUC

Running the Models: Code Reference

Linear regression in R

model_linear <- lm(outcome ~ predictor1 + predictor2, data = dataset)

summary(model_linear)

Logistic regression in R

model_logistic <- glm(outcome ~ predictor1 + predictor2,

data = dataset, family = binomial)

summary(model_logistic)

exp(coef(model_logistic)) # Convert to odds ratios

Linear regression in SPSS

Analyze > Regression > Linear > Move outcome to Dependent, predictors to Independent

Logistic regression in SPSS

Analyze > Regression > Binary Logistic > Move outcome to Dependent, predictors to Covariates

In Stata

regress outcome predictor1 predictor2 // Linear

logit outcome predictor1 predictor2 // Logistic

logistic outcome predictor1 predictor2 // Logistic with odds ratios

How to Report Results in Your Thesis

Reporting linear regression

A multiple linear regression was conducted to examine the effect of study hours (X1) and attendance rate (X2) on final exam scores (Y). The model was statistically significant, F(2, 147) = 34.21, p < .001, and explained 31.8% of the variance in exam scores (R-squared = .318, adjusted R-squared = .309). Study hours emerged as a significant predictor (b = 4.32, SE = 0.61, p < .001), indicating that each additional hour of study was associated with a 4.32-point increase in exam scores.

Reporting logistic regression

A binary logistic regression was performed to assess the predictors of exam pass or fail. The model was statistically significant, chi-square(2) = 47.83, p < .001, and correctly classified 78.4% of cases. Study hours significantly predicted pass or fail status (b = 0.87, SE = 0.19, p < .001, OR = 2.39), suggesting that each additional hour of study was associated with 2.39 times greater odds of passing.

Pro tip: Always report model fit statistics, individual predictor statistics, and effect sizes together. Reporting just p-values without coefficients or effect sizes is no longer considered acceptable in most academic journals and thesis committees.

Conclusion

Choosing between linear and logistic regression is not a matter of preference or convenience. It is a matter of alignment between your research question, your data type, and your model. Get this wrong and the rest of your analysis no matter how carefully executed is built on a cracked foundation.

The rule is simple: continuous outcome means linear regression. Categorical outcome means logistic regression. When in doubt, go back to your research question and ask what it is actually trying to predict. The answer is almost always in the question itself.

Tagged Data Science, Linear Regression, Logistic Regression, Regression Analysis, Statistical Modeling

Choosing the Right Model for Your Data

Understanding Linear Regression

Understanding Logistic Regression

Side-by-Side Comparison

How to Choose: A Decision Framework

Running the Models: Code Reference

How to Report Results in Your Thesis

Leave a Reply Cancel reply

Contact Us

Navigation

Support

Office Hours