4
$\begingroup$

I'm looking at the amount of carbon in seven forest pools. For dead trees left on the landscape across many locations and over several harvest retention (logging) treatments, there is an extreme value that I happen to know is real.

The data are somewhat zero-inflated (10 of 190 observations are zero) and strongly right-skewed.

Min. = 0.0000
1st Qu. = 0.1733
Median = 0.6664
Mean = 7.0793
3rd Qu. = 3.2283
Max. = 468.9519

[Figure: histogram of the data including the outlier]

[Figure: histogram of the data with the outlier removed]

A massive coastal old-growth snag results in one plot having 469 Mg C ha−1 in dead trees, when the next most C-rich measurement is 83 Mg C ha−1. This is a real tree in my actual plot, but it pulls the estimates of my GLMMs so far that I cannot make meaningful inferences about the rest of the data. Which treatment this tree wound up in is random, because plots were randomly assigned to treatments. That such a tree occurs at this location is not random, because it is our most southern and most humid research forest.
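To put a number on how much this one plot moves things, here is a rough back-of-the-envelope check that uses only the summary statistics above. It is a sketch, not part of the GLMM: the variable names are made up for illustration, and because the reported mean is rounded the results are approximate.

```python
# Summary statistics reported above (Mg C per hectare).
n_obs = 190
mean_all = 7.0793
max_value = 468.9519

# Recover the (approximate) sum of all observations from the rounded mean.
total = mean_all * n_obs

# Mean of the remaining 189 plots once the largest plot is set aside.
mean_without_max = (total - max_value) / (n_obs - 1)

# Fraction of the summed dead-tree carbon that sits in the single largest plot.
share_of_total = max_value / total

print(f"mean without the largest plot: {mean_without_max:.2f}")
print(f"share of total from the largest plot: {share_of_total:.1%}")
```

By this arithmetic the mean drops from about 7.1 to roughly 4.6 when the single plot is set aside, and that plot alone holds about a third of the summed dead-tree carbon.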

How do you handle a totally real but seriously destructive outlier?

asked yesterday by Declan
$\endgroup$
2
  • 4
    $\begingroup$Welcome. What problems are you having with the model?$\endgroup$ Commented yesterday
  • 3
    $\begingroup$And what are you trying to achieve from your modelling? The most important question :)$\endgroup$ Commented yesterday

2 Answers

6
$\begingroup$

If you were to read some of my past answers and comments about outliers and outlier removal, you would note that I am far from sanguine about people who remove so-called outliers with very little thought.

So first let me commend you for at least having some scruples.

And second, maybe surprisingly coming from me, I see nothing wrong with you simply ignoring said "anomaly". As long as you clearly disclose this (as you did in the question), and note that you know it is "true data" (and not an error) but that it is so exceptional as to bias your model, then simply ignore it.
And you may also want to ignore the datapoints at 83 and 61 (only 1 observation in these ranges of your data).
And if you then clearly state that your model is only valid in the range of 0 to ~40 Mg C ha−1 (it really would be pushing it to claim up to 50, as you have essentially no observations in that range), then there is no issue. That is the range your model can claim to be valid in, and you are simply excluding datapoints beyond that range.
And trying to use "robust models" or other such tools is fool's gold; you have basically no observations beyond ~40, so any claim that you are modelling beyond this value is pointless.
Now, if you had observed a lot of values between ~40 and ~400, you could make a broader claim, but then you would not have an outlier, would you? You would have an extreme value, and it would be fair to account for it in your (broader) model.
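As a minimal sketch of the bookkeeping this implies, the snippet below splits the observations at a stated upper limit and keeps the excluded values so they can be reported. The function name `split_by_range` and the data are hypothetical; only 61, 83 and ~469 echo values mentioned in this thread, and the 40 Mg C ha−1 cutoff is the range suggested above.

```python
def split_by_range(values, upper):
    """Return (kept, excluded): values within the stated range, and those beyond it."""
    kept = [v for v in values if v <= upper]
    excluded = sorted(v for v in values if v > upper)
    return kept, excluded


# Hypothetical dead-tree carbon values (Mg C per hectare), for illustration only.
dead_tree_c = [0.0, 0.17, 0.67, 3.2, 12.5, 38.0, 61.0, 83.0, 468.95]

kept, excluded = split_by_range(dead_tree_c, upper=40.0)

# Fit the model on `kept`, and disclose `excluded` alongside the stated range.
print(f"modelled range: 0 to 40 Mg C/ha, n = {len(kept)}")
print(f"excluded (real, but beyond the modelled range): {excluded}")
```

Keeping the excluded values in a separate list, rather than dropping them silently, makes the disclosure part of the analysis itself.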

answered yesterday by jginestet
$\endgroup$
4
$\begingroup$

One possibility is to use multilevel quantile regression and look at the median (or other quantiles); see e.g. this thread, which is somewhat old but seems on point. A paper that may be helpful is Galarza, Lachos, & Bandyopadhyay, "Quantile regression in linear mixed models".
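To see why a quantile target is less affected by the snag, here is a small sketch of the check (pinball) loss that quantile regression minimizes. It is not a multilevel model, only an intercept-only illustration; the data are made up and the function names are hypothetical.

```python
import statistics


def pinball_loss(q, values, tau=0.5):
    """Average check loss of predicting the constant q at quantile level tau."""
    return sum((tau if v >= q else tau - 1) * (v - q) for v in values) / len(values)


def best_constant(values, tau=0.5):
    """Brute-force search over the observed values for the check-loss minimizer."""
    return min(sorted(values), key=lambda q: pinball_loss(q, values, tau))


# Made-up values: identical except for the size of the largest observation.
moderate = [0.0, 0.2, 0.7, 3.2, 83.0]
extreme = [0.0, 0.2, 0.7, 3.2, 468.95]

# The median fit is the same in both cases, while the mean moves a lot.
print(best_constant(moderate), best_constant(extreme))
print(statistics.mean(moderate), statistics.mean(extreme))
```

The minimizer only depends on how many observations fall on each side of it, not on how far away the largest one is, which is the property the quantile approach relies on.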

Another possibility is robust linear models; see e.g. the documentation for robustlmm and the works cited there.
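For intuition about what "robust" means here, below is a rough sketch of a Huber-type location estimate computed by iteratively reweighted averaging. This is not the robustlmm algorithm (robustlmm is an R package for mixed models); it is a plain-Python, intercept-only illustration of down-weighting large residuals. The function name, the tuning constant `k=1.345`, and the data are assumptions for the example.

```python
import statistics


def huber_location(values, k=1.345, tol=1e-8, max_iter=200):
    """Huber M-estimate of location with a fixed MAD-based scale."""
    mu = statistics.median(values)
    # MAD rescaled to match the standard deviation under normality.
    scale = 1.4826 * statistics.median([abs(v - mu) for v in values])
    if scale == 0:
        return mu
    for _ in range(max_iter):
        # Full weight for small residuals, shrinking weight beyond k * scale.
        weights = [
            1.0 if v == mu else min(1.0, k * scale / abs(v - mu))
            for v in values
        ]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Made-up values with one extreme observation.
values = [0.0, 0.2, 0.7, 3.2, 5.0, 468.95]

print(f"mean:  {statistics.mean(values):.2f}")
print(f"huber: {huber_location(values):.2f}")
```

The extreme value still counts, but with a weight that shrinks with its distance from the bulk, so the estimate stays near the other observations instead of following the mean.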

answered yesterday by Peter Flom
$\endgroup$
