Extreme outlier in real data

Question 1

I'm looking at the amount of carbon in seven forest pools. For dead trees left on the landscape across many locations and over several harvest retention (logging) treatments, there is an extreme value that I happen to know is real.

The data are somewhat zero-inflated (10 of 190 observations are zero) and right-skewed.

Min. = 0.0000
1st Qu. = 0.1733
Median = 0.6664
Mean = 7.0793
3rd Qu. = 3.2283
Max. = 468.9519

Histogram with outlier:

Histogram of data without outlier:

A massive coastal old-growth snag results in a plot having 469 Mg C ha⁻¹ in the dead trees when the next most C-rich measurement is 83 Mg C ha⁻¹. This is a real tree in my actual plot, but it completely skews the estimates of my GLMMs away from meaningful inference about the rest of the data. It's random that this tree wound up in this particular treatment, as plots were randomly assigned treatments. It is not random that such a tree is at this location, because it is our southernmost and most humid research forest.
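To put a rough number on how much that single plot pulls the summary, the statistics quoted above are enough for a back-of-the-envelope check. This is only a sketch: it assumes the reported mean of 7.0793 was computed over all 190 observations, snag plot included, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check using only the summary statistics quoted above.
# Assumption: the mean of 7.0793 was computed over all 190 observations,
# including the 468.9519 Mg C/ha snag plot.
n_obs = 190
mean_all = 7.0793
outlier = 468.9519

# Total carbon implied by the mean, then the mean after dropping one plot.
total_carbon = n_obs * mean_all
mean_without = (total_carbon - outlier) / (n_obs - 1)

# Fraction of the summed carbon contributed by that single plot.
share_of_total = outlier / total_carbon

print(f"mean without the snag plot: {mean_without:.2f} Mg C/ha")
print(f"share of summed carbon from that one plot: {share_of_total:.1%}")
```

Under that assumption the mean drops from about 7.1 to roughly 4.6 Mg C ha⁻¹, and the one plot carries about a third of all the dead-tree carbon in the data set.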

How do you handle a totally real but seriously destructive outlier?

Comment 1

Welcome. What problems are you having with the model?

Comment 2

And what are you trying to achieve with your modelling? That is the most important question :)

jginestet · Answer 1 · 2025-11-28 18:11:45Z

If you were to read some of my past answers and comments about outliers and outlier removal, you would note that I am very critical of people who give very little thought to removing so-called outliers.

So first let me commend you for at least having some scruples.

And second, maybe surprisingly coming from me, I see nothing wrong with you simply ignoring said "anomaly". As long as you clearly disclose this (as you did in the question), and note that you know it is "true data" (and not an error) but that it is so exceptional as to bias your model, then simply ignore it.
And you may also want to ignore the datapoints at 83 and 61 (only one observation each in those ranges of your data).
And if you then clearly state that your model is only valid in the range of 0 to ~40 Mg C ha⁻¹ (it really would be pushing it to claim up to 50, as you have essentially no observations in that range), then there is no issue. That is the range your model can claim to be valid in, and you are simply excluding datapoints beyond that range.
And trying to use "robust models" or other such tools is fool's gold; you have basically no observations beyond ~40, so claiming that you are modelling beyond this value is pointless.
Now, if you had observed a lot of values between ~40 and ~400, you could make a broader claim, but then you would not have an outlier, would you? You would have an extreme value, and it would be fair to account for it in your (broader) model.
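As a minimal sketch of what "restrict the range and disclose it" could look like in practice, the snippet below splits a set of values at a stated upper bound and keeps the excluded ones around for reporting. The function name `restrict_to_range`, the 40 Mg C ha⁻¹ cutoff as a default, and the sample values are all illustrative assumptions; only 61, 83 and roughly 469 echo the question.

```python
def restrict_to_range(values, upper=40.0):
    """Split values into those inside the stated validity range and the rest.

    Hypothetical helper: `upper` is the analyst's declared upper bound,
    not something estimated from the data.
    """
    kept = [v for v in values if v <= upper]
    excluded = sorted(v for v in values if v > upper)
    return kept, excluded


# Made-up carbon values (Mg C/ha); only 61, 83 and 468.95 mirror the question.
carbon = [0.0, 0.17, 0.67, 3.23, 12.4, 38.9, 61.0, 83.0, 468.95]

kept, excluded = restrict_to_range(carbon)

# The excluded values are reported, not silently dropped.
print(f"model fitted on {len(kept)} plots in the range 0 to 40 Mg C/ha")
print(f"excluded as beyond the model's range: {excluded}")
```

The point of returning the excluded values rather than discarding them is that they go into the write-up alongside the stated range.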

Peter Flom · Answer 2 · 2025-11-28 00:00:59Z

One possibility is to use multilevel quantile regression and look at the median (or other quantiles); see e.g. this thread, which is somewhat old but seems on point. A paper that may be helpful is Galarza, Lachos, & Bandyopadhyay, "Quantile regression in linear mixed models".
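To illustrate why a median-based fit is less sensitive to one huge value, here is a small sketch of the check (pinball) loss that quantile regression minimises, applied to a plain sample with no covariates and no random effects. It is not a multilevel quantile regression, and the data and function names are hypothetical.

```python
import statistics


def pinball_loss(q, values, tau=0.5):
    """Mean check (pinball) loss of candidate quantile q at level tau."""
    return sum(
        tau * (v - q) if v >= q else (1 - tau) * (q - v)
        for v in values
    ) / len(values)


def best_quantile(values, tau=0.5):
    """Observed value that minimises the pinball loss (search over the data)."""
    return min(sorted(set(values)), key=lambda q: pinball_loss(q, values, tau))


# Made-up right-skewed sample (Mg C/ha), then the same sample with its
# largest value replaced by a snag-sized one.
sample = [0.0, 0.2, 0.7, 1.5, 3.2, 9.0, 83.0]
with_snag = sample[:-1] + [469.0]

print("median-type estimate:", best_quantile(sample), best_quantile(with_snag))
print("mean:", statistics.mean(sample), statistics.mean(with_snag))
```

In this toy sample the loss-minimising value stays at the sample median when the largest observation is inflated, while the mean moves a great deal; a mixed-model version would add covariates and random effects on top of the same loss.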

Another possibility is robust linear models; see e.g. the documentation for robustlmm and the works cited there.
