Feb 22, 2026

Outliers, Leverage, and Influence in Regression

When one data point changes everything, and how to know if it should.

You are fitting a regression of exam scores on hours studied. You have 49 well-behaved data points that produce a slope of about 4 points per hour. Then you add the 50th data point: a student who studied 0 hours and scored 97. Your slope drops to 2.1. One observation nearly halved your estimated effect. What do you do?

Three Distinct Concepts

An outlier is an observation with an unusually large residual: its actual value is far from what the model predicts. A high-leverage point is an observation with an unusual value on the predictor variable (x), far from the mean of x. An influential point is an observation whose removal would substantially change the estimated coefficients.

These three are related but not the same. An outlier in y that sits near the mean of x has little leverage and may not be influential. A point with an extreme x value that falls on the regression line has high leverage but little actual influence (the line passes through it anyway). The most dangerous points are those that are both outliers in y and high leverage in x, because they pull the line toward themselves from a position of high mechanical advantage.
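To make the distinction concrete, here is a minimal sketch in plain Python. The data are made up for illustration (eight well-behaved students plus one extra point at a time), and the helper names fit_line, leverages, and residuals are assumptions of this sketch, not part of any library. For a single predictor, leverage has the closed form 1/n + (x_i - mean of x)^2 / Sxx, which is what the code uses.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def leverages(x):
    """Hat values for simple regression: 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def residuals(x, y):
    """Observed minus fitted values from the least-squares line."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical data: hours studied and exam scores, roughly 40 + 4 * hours.
base_hours = [1, 2, 3, 4, 5, 6, 7, 8]
base_scores = [45, 47, 53, 55, 61, 63, 69, 71]

# Three hypothetical extra students, added one at a time.
candidates = {
    "outlier, low leverage": (4.5, 90),      # typical x, surprising y
    "high leverage, on the line": (14, 96),  # extreme x, unsurprising y
    "outlier and high leverage": (14, 50),   # extreme x, surprising y
}

summary = {}
for label, (h, s) in candidates.items():
    x = base_hours + [h]
    y = base_scores + [s]
    lev = leverages(x)[-1]
    res = residuals(x, y)[-1]
    slope = fit_line(x, y)[1]
    summary[label] = (lev, res, slope)
    print(f"{label:28s} leverage={lev:.2f} residual={res:+.1f} slope={slope:.2f}")
```

In this toy data the first point leaves the slope untouched despite its large residual, the second barely moves it despite its high leverage, and only the third drags the slope far from the value fitted on the other eight students.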

Cook's Distance

Cook's distance is a single statistic that measures influence: how much do all the fitted values change when you remove observation i? A large Cook's distance indicates an influential point.
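In symbols, one standard way to write this is shown below. The notation is an assumption of this sketch rather than something the article defines: \hat{y}_j is the fitted value for observation j from the full fit, \hat{y}_{j(i)} is the fitted value for j after refitting without observation i, p is the number of estimated coefficients (including the intercept), s^2 is the residual mean square of the full fit, e_i is the residual of observation i, and h_{ii} is its leverage.

```latex
D_i \;=\; \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
    \;=\; \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{\left( 1 - h_{ii} \right)^2}
```

The first form is the definition in terms of shifted fitted values. The second is an algebraically equivalent form for ordinary least squares that avoids refitting the model n times.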

Cook's distance combines leverage and residual size. A point with high leverage and a large residual has a large Cook's distance. A point with high leverage but a small residual (it sits on the line) has a small Cook's distance. A point with a large residual but low leverage also has a small Cook's distance because it cannot pull the line much from its central position.
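The following sketch computes Cook's distance for a single-predictor regression in two ways: from the residual and leverage directly, and by brute force (refit without the point and measure how far the fitted values move). The data and function names are hypothetical, chosen only for illustration, and p = 2 assumes a model with an intercept and one slope.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def cooks_distances(x, y):
    """Closed form: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2."""
    n, p = len(x), 2  # p counts the intercept and the slope
    a, b = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual mean square
    dists = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        dists.append(e * e / (p * s2) * h / (1 - h) ** 2)
    return dists


def cooks_distance_by_deletion(x, y, i):
    """Brute-force check: refit without point i, compare all fitted values."""
    n, p = len(x), 2
    a, b = fit_line(x, y)
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - p)
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum(((a + b * xj) - (a_i + b_i * xj)) ** 2 for xj in x)
    return shift / (p * s2)


# Hypothetical data: eight ordinary students plus one extra point at a time.
base_hours = [1, 2, 3, 4, 5, 6, 7, 8]
base_scores = [45, 47, 53, 55, 61, 63, 69, 71]
candidates = {
    "large residual, low leverage": (4.5, 90),
    "small residual, high leverage": (14, 96),
    "large residual, high leverage": (14, 50),
}

results = {}
for label, (h, s) in candidates.items():
    x = base_hours + [h]
    y = base_scores + [s]
    closed = cooks_distances(x, y)[-1]
    brute = cooks_distance_by_deletion(x, y, len(x) - 1)
    results[label] = (closed, brute)
    print(f"{label:30s} D={closed:.3f} (by deletion: {brute:.3f})")
```

The two calculations agree up to floating-point error, and in this toy data the point with both a large residual and high leverage has a Cook's distance many times larger than either of the other two.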

What to Do With Influential Observations

The first step is always to understand why the point is influential. In the exam score example, a student who studied zero hours and scored 97 might be a misentry (the hours were actually 9, not 0), a student who already knew the material perfectly, or a genuine outlier in the population. Each of these calls for a different response.

A data error should be corrected or removed, with documentation. A genuine outlier that represents a real phenomenon should be kept, but the analysis should be reported both with and without the point so readers can see its effect. If removing a single observation reverses your conclusion, that conclusion is not robust and needs to be communicated as such.
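Reporting both fits can be as simple as the sketch below. The data are hypothetical (eight ordinary students plus one who studied 0 hours and scored 97, a much smaller sample than the 50 in the opening example), and compare_with_and_without is an illustrative helper name, not a library function.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def compare_with_and_without(x, y, i):
    """Fit once on all points and once with point i left out."""
    _, slope_all = fit_line(x, y)
    _, slope_without = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return {
        "slope_all": slope_all,
        "slope_without": slope_without,
        # A sign change means the single point reverses the direction of the effect.
        "sign_flips": (slope_all > 0) != (slope_without > 0),
    }


# Hypothetical data; the last student studied 0 hours and scored 97.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 0]
scores = [45, 47, 53, 55, 61, 63, 69, 71, 97]

report = compare_with_and_without(hours, scores, len(hours) - 1)
print(f"slope with all points:      {report['slope_all']:.2f}")
print(f"slope without flagged point: {report['slope_without']:.2f}")
print(f"sign of the slope flips:     {report['sign_flips']}")
```

With this toy data the slope is about 3.9 points per hour without the flagged student and about 0.13 with them. The sign does not flip, but the estimated effect nearly vanishes, which is exactly the kind of sensitivity a reader needs to see.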

Removing influential points simply because they are inconvenient is not valid. It is a form of selecting data to support a predetermined conclusion, even if done without intent.

Robust Regression

If your data contains outliers that are real and should be retained, but you do not want them to dominate the fit, robust regression methods automatically downweight observations with large residuals. M-estimators, such as Huber's, are a common approach and are usually fit by iteratively reweighted least squares (IRLS). These produce coefficients that are more representative of the bulk of the data without requiring you to delete any observations.
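As an illustration of the downweighting idea, here is a minimal sketch of a Huber-type fit by iterative reweighting, written in plain Python for a single predictor. Several choices here are assumptions of the sketch rather than anything the article specifies: the tuning constant 1.345, the scale estimate (median absolute residual divided by 0.6745), the fixed iteration cap, and the hypothetical data. A production analysis would normally use an established statistics library instead.

```python
import statistics


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def huber_irls(x, y, c=1.345, max_iter=50, tol=1e-8):
    """Huber-type fit by iteratively reweighted least squares (a sketch)."""
    w = [1.0] * len(x)
    a, b = weighted_fit(x, y, w)  # equal weights: the ordinary least-squares start
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate (assumed choice): median |residual| / 0.6745.
        scale = statistics.median(abs(e) for e in resid) / 0.6745
        if scale == 0:
            break
        cutoff = c * scale
        # Huber weights: 1 inside the cutoff, shrinking as cutoff/|e| outside it.
        w = [1.0 if abs(e) <= cutoff else cutoff / abs(e) for e in resid]
        a_new, b_new = weighted_fit(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b, w


# Hypothetical data; the last student studied 0 hours and scored 97.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 0]
scores = [45, 47, 53, 55, 61, 63, 69, 71, 97]

_, ols_slope = weighted_fit(hours, scores, [1.0] * len(hours))
_, robust_slope, weights = huber_irls(hours, scores)
print(f"OLS slope:    {ols_slope:.2f}")
print(f"robust slope: {robust_slope:.2f}")
print(f"weight given to the unusual student: {weights[-1]:.3f}")
```

No observation is deleted: the unusual student stays in the data but receives a small weight, so the fitted slope stays much closer to the one implied by the other eight students.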

The choice between OLS and robust regression should be made based on what you believe the data-generating process is. If outliers are suspected measurement errors that cannot be individually verified and corrected, robust regression is a reasonable default. If outliers are real and informative, robust regression may hide important features of the data. There is no universally correct choice, only choices that are more or less appropriate for the question at hand.

Mark Leschinsky

PRESIDENT & FOUNDER
