This is an example of exploratory latent class analysis (LCA) with continuous indicators, otherwise known as latent profile analysis (LPA) or finite Gaussian mixture modeling, using tidySEM. See Van Lissa, C. J., Garnier-Villarreal, M., & Anadria, D. (2023). Recommended Practices in Latent Class Analysis using the Open-Source R-Package tidySEM. Structural Equation Modeling. https://doi.org/10.1080/10705511.2023.2250920. The present example uses data collected by Alkema as part of a study on ocean microplastics. The purpose of this study was to provide a more nuanced model for the distribution of different sizes of ocean microplastics than the commonly used normal distribution. To this end, a mixture of normals was used. Since there is no theoretical reason to expect a certain number of classes, this is an exploratory LCA. To view the data documentation, run the command `?tidySEM::alkema_microplastics` in the R console. The original analyses are available at https://github.com/cjvanlissa/lise_microplastics; in this vignette, we take a different approach to the analysis to showcase other possibilities.
To load the data, simply attach the tidySEM package. For convenience, we assign the variables used for analysis to an object called `df`. As explained in the paper, the classes are quite different for lines, films, and fragments. For this reason, we here only use data from fragments. The indicators are fragments' length and width in millimeters. The sample size was not planned.
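A minimal sketch of this step follows; the column names `category`, `length`, and `width` are assumptions based on the description above, so confirm them via `?tidySEM::alkema_microplastics`:

```r
library(tidySEM)
# Retain only fragments (the column name 'category' is an assumption)
df_analyze <- alkema_microplastics[alkema_microplastics$category == "Fragment", ]
# Keep only the indicator variables: length and width in millimeters
df <- df_analyze[, c("length", "width")]
```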
As per the best practices, the first step in LCA is examining the observed data. We use `tidySEM::descriptives()` to describe the data numerically. Because all items are continuous, we remove columns for categorical data to de-clutter the table:
```r
desc <- tidySEM::descriptives(df)
desc <- desc[, c("name", "type", "n", "unique", "mean", "median",
                 "sd", "min", "max", "skew_2se", "kurt_2se")]
knitr::kable(desc, caption = "Descriptive statistics")
```

The data are correctly coded as numeric, and the distributional characteristics match the intended measurement level. The variable scales are comparable (both are in millimeters, and there are no large discrepancies between variances). There are no missing values; if any variables had missing values, we would report an MCAR test with `mice::mcar()` and explain that missing data are accounted for using FIML. Additionally, we can plot the data. The ggplot2 function `geom_density()` is useful to visualize continuous data:
```r
library(ggplot2)
df_plot <- df
names(df_plot) <- paste0("Value.", names(df_plot))
df_plot <- reshape(df_plot, varying = names(df_plot),
                   direction = "long", timevar = "Variable")
ggplot(df_plot, aes(x = Value)) +
  geom_density() +
  facet_wrap(~Variable) +
  theme_bw()
```

Both the table above and the density plot indicate that the data are extremely right-skewed and kurtotic. With this in mind, it can be useful to transform and rescale the data. We will use a log transformation.
```r
df_plot$Value <- log(df_plot$Value)
ggplot(df_plot, aes(x = Value)) +
  geom_density() +
  facet_wrap(~Variable) +
  theme_bw()
```

The log transformation addresses the aforementioned concerns regarding skew and kurtosis. To confirm this, reshape the data to wide format and examine a scatterplot:
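One way to do this, assuming the indicators are named `length` and `width` as above (when called on its own output, `reshape()` reverses the long transformation):

```r
# Back-transform the long data to wide format
df_wide <- reshape(df_plot)
# Scatterplot of log-transformed length against width
ggplot(df_wide, aes(x = Value.length, y = Value.width)) +
  geom_point(alpha = .1) +
  theme_bw()
```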
As all variables are continuous, we can use the convenience function `tidySEM::mx_profiles()`, which is a wrapper for the generic function `mx_mixture()` optimized for continuous indicators. Its default settings are appropriate for LPA, assuming fixed variances across classes and zero covariances. Its arguments are `data` and number of `classes`. All variables in `data` are included in the analysis, which is why we first selected the indicator variables. The models are estimated using simulated annealing, with start values determined via initial K-means clustering.
As this is an exploratory LCA, we will conduct a rather extensive search across model specifications and numbers of classes. We will set the maximum number of classes \(K\) to four to limit computational demands. We set a seed to ensure replicable results.
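A sketch of this search follows; the seed value is arbitrary, and we assume that supplying vectors for the `variances` and `covariances` arguments estimates all combinations of these specifications:

```r
set.seed(123)  # arbitrary value; any fixed seed makes results replicable
res <- mx_profiles(data = df,
                   classes = 1:4,  # 1- through 4-class solutions
                   variances = c("equal", "varying"),
                   covariances = c("zero", "varying"))
```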
As the analysis takes a long time to compute, it is prudent to save the results to disk immediately, so as not to lose them. For this, we use the function `saveRDS()`. We can later use `res <- readRDS("res_gmm.RData")` to load the analysis from the file.
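For example (the file name matches the `readRDS()` call above):

```r
saveRDS(res, "res_gmm.RData")
```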
To compare the fit of the estimated models, we create a model fit table using `table_fit()`. We will use the BIC for class enumeration.
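We store the fit table in an object called `fit`, which is used below:

```r
fit <- table_fit(res)
```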
First, we determine whether any models can be disqualified. There were no indications of convergence problems during estimation, so this is not a reason to disqualify solutions. Next, we check for global and local identifiability. The global ratio of observations per parameter is large, as the minimum `np_ratio` is 244. The smallest ratio of class size to class-specific parameters is 18 (see `np_local`), which is no cause for concern.
```r
tbl <- fit[, c("Name", "LL", "Parameters", "BIC", "Entropy",
               "prob_min", "n_min", "np_ratio", "np_local")]
names(tbl) <- c("Name", "LL", "p", "BIC", "Ent.",
                "p_min", "n_min", "np_ratio", "np_local")
knitr::kable(tbl, caption = "Model fit table.")
```

However, note that we have a very large sample, and for many models, the smallest class comprises only a very small percentage of the total sample. Since the purpose of this analysis is to better represent the distribution of ocean microplastics, we can wonder whether it makes sense to allow for classes that only describe a small percentage of the cases. We therefore only consider solutions that capture at least 10% of the sample.
Another interesting characteristic of these data is that the BIC and the entropy are strongly correlated. The raw correlation between these two metrics is .66, `cor(fit$BIC, fit$Entropy)`. If we omit the 1-class models, for which entropy is technically not defined, the correlation is even higher, at .85: `cor(fit$BIC[!fit$Classes == 1], fit$Entropy[!fit$Classes == 1])`.
This strong correlation indicates that an increase in fit comes witha decrease in class separability. This illustrates why entropy shouldnot be treated as a model fit criterion. It also illustrates thatcriteria for class enumeration should be explicit, because we willlikely come to a different decision depending on which criteria areused.
As mentioned before, we drop models with < 10% of cases in thesmallest class:
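Assuming the `n_min` column of the fit table holds the proportion of the sample in the smallest class (consistent with the fit table above), one way to do this is:

```r
# Keep only solutions whose smallest class contains at least 10% of cases
fit <- fit[fit$n_min > .10, ]
```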
If our strategy is to optimize fit, we can examine the fit table above, or plot a scree plot for the BIC by calling `plot(fit)`. Note that, due to the large sample size, all information criteria give identical conclusions.
Within the blocks of 1-4 class models for each model specification, the BIC keeps decreasing as classes are added. Across blocks, the BIC also decreases as the model specifications become more complex.
Of the 16 estimated models, after removing those with < 10% of cases in the smallest class, one model stands out: the 2-class model with free (co)variances. We thus select this as our final model.
We here request the estimates (`est`) and standardized estimates (`std_est`), because the latter allows us to interpret the correlations between length and width. Note that standard errors and p-values are relatively uninformative: with a sample size of 5606, every parameter is significantly different from zero.
```r
res_bic <- res[["free var, free cov 2"]]
cp <- class_prob(res_bic)
results <- table_results(res_bic, columns = c("label", "est", "std_est"))
results
```

Interpreting the results is facilitated by examining a plot of the model and data. Relevant plot functions are `plot_bivariate()`, `plot_density()`, and `plot_profiles()`. However, we omit the density plots, because `plot_bivariate()` also includes them.
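For example:

```r
plot_bivariate(res_bic)
```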
On the diagonal of the bivariate plot are weighted density plots:normal approximations of the density function of observed data, weighedby class probability. On the off-diagonal are plots for each pair ofindicators, with the class means indicated by a point, class standarddeviations indicated by lines, and covariances indicated by circles.
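The class means per indicator, shown by the remaining relevant plot function listed above, can be requested with:

```r
plot_profiles(res_bic)
```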
The bivariate and marginal plots show that the classes are not clearly separable, as is also evident from the low entropy. At the same time, however, it is clear that the observed distributions are non-normal, and the second class accounts for some of this non-normality (there is a smaller 'bump' to the right of the mode, which could be the mean of a second normal distribution). The first class (57%) accounts for smaller fragments, and the second class (43%) accounts for some of the right-skew in fragments' length and width. We label class 1 as *small fragments*, and class 2 as *larger fragments*.
It also appears that the correlation between length and width is stronger for small fragments than for large fragments. To test the difference, use `wald_test(res_bic, hypothesis = "c11 = c21")`. Results indicate that the correlation is indeed significantly larger for small fragments (\(r = .92\)) than for larger fragments (\(r = .85\)), \(\chi^2(1) = 11.56, p < .001\). Thus, length and width are more closely related for small fragments than for large fragments.
There are, however, concerns about the interpretability of this solution: the entropy is .56 and the minimum classification probability is .81. This is because of substantial overlap in the distributions of the two classes.
Finally, we may want to compare the different classes on auxiliary variables or models. The `BCH()` function applies three-step analysis, which compares the classes using a multi-group model, controlling for classification error. For example, we can test whether polymer type differs between the two classes. Because polymer type is a nominal variable, we must convert it to dummies and estimate a threshold for each dummy:
```r
df_pt <- mx_dummies(df_analyze$poly_type)
aux_pt <- BCH(res_bic,
              model = "poly_typeOther | t1
                       poly_typePE | t1
                       poly_typePP | t1",
              data = df_pt)
aux_pt <- mxTryHardOrdinal(aux_pt)
```

To obtain an omnibus likelihood ratio test of the significance of the differences in polymer type across classes, use `lr_test(aux_pt)`. The results indicate that there are significant differences in polymer types across classes, \(\Delta LL(3) = 17.14, p < .001\). The results can be reported on the probability scale using `table_prob(aux_pt)`. To test differences for specific polymer types, we can use Wald tests:
```r
wald_test(aux_pt,
          "class1.Thresholds[1,1] = class2.Thresholds[1,1];
           class1.Thresholds[1,2] = class2.Thresholds[1,2];
           class1.Thresholds[1,3] = class2.Thresholds[1,3]")
```

The results indicate that there is no significant difference in the prevalence of "Other" polymer types across classes. However, PE is significantly more prevalent in class 1, and PP is significantly more prevalent in class 2.