2- The Path coefficient model

In a path coefficient analysis, descriptive statistics and Pearson correlation coefficients (double-headed arrows) between variables are estimated first, and this package does both. It also fits the simple or multiple linear regression of the dependent (or endogenous) variable(s) on the independent variable(s). In a sequential path coefficient analysis, intervening (endogenous) variables exist and the analyses are performed step by step with this package. In a simple path coefficient analysis one step is enough, and the package draws the path diagram automatically; a complicated or sequential path model needs some additional work, which is discussed later in this manual. In a path model, the path coefficients or direct effects (\(P_i\)) indicate the direct effect of one variable on another. They are standardized partial regression coefficients (in Wright’s terminology), because they are estimated from correlations or from the transformed (standardized) data as: \(P_i = \beta_i\frac{\sigma_{X_i}}{\sigma_Y}\). The path equations are as follows:

One dependent variable:

\[\begin{aligned}
P_1 + P_2r_{12} + P_3r_{13} + \dots + P_nr_{1n} &= r_{Y1} \\
P_1r_{21} + P_2 + P_3r_{23} + \dots + P_nr_{2n} &= r_{Y2} \\
P_1r_{31} + P_2r_{32} + P_3 + \dots + P_nr_{3n} &= r_{Y3} \\
&\;\;\vdots \\
P_1r_{n1} + P_2r_{n2} + P_3r_{n3} + \dots + P_n &= r_{Yn}
\end{aligned}\]
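As a minimal sketch of how the system above can be solved, the following base R code uses simulated data (the variables `x1` to `x3` and `y` are hypothetical and are not one of the package data sets). It obtains the path coefficients in two ways: by rescaling the ordinary regression coefficients with \(\sigma_{X_i}/\sigma_Y\), and by solving the path equations \(\mathbf{R}_{XX}\mathbf{P} = \mathbf{r}_{XY}\) directly. This only illustrates the algebra and is not necessarily how the package computes its results internally.

```r
# Hypothetical data: three correlated predictors and one response
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 1.0 * x1 - 0.7 * x2 + 0.3 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# (a) Rescale the unstandardized regression coefficients
fit <- lm(y ~ x1 + x2 + x3, data = dat)
b   <- coef(fit)[-1]                       # drop the intercept
p_a <- b * sapply(dat[names(b)], sd) / sd(dat$y)

# (b) Solve R_xx %*% P = r_xy from the correlation matrix
R   <- cor(dat)
p_b <- solve(R[-1, -1], R[-1, 1])

# Both routes should agree up to numerical error
print(round(cbind(rescaled = p_a, solved = p_b), 4))
```

Route (b) needs only the correlation matrix, which is why path coefficients can be estimated even when the raw data are not available.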

Extension to more dependent variables:

Our package can carry out this extension in a straightforward way, as explained in detail below. As stated by Bondari (1990), for two dependent variables \(Y_1\) and \(Y_2\):

\[\begin{aligned}
Y_1 &= p_1X_1+p_2X_2+p_3X_3+\dots+p_nX_n\\
Y_2 &= p'_1X_1+p'_2X_2+p'_3X_3+\dots+p'_nX_n\\
&\;\;\vdots
\end{aligned}\]

where (with \(r_{ii}=1\)):

\[r_{Y_1Y_2}=p_1p'_1+p_2p'_2+p_3p'_3+\dots+p_np'_n+\sum_{i\neq j}p_ip'_jr_{ij}=\sum_{i,j}p_ip'_jr_{ij}\]
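The double sum can be checked numerically. The sketch below uses simulated predictors and two responses that are exact linear functions of them, so no residual terms enter (an assumption made only to keep the identity exact; all names and values are hypothetical). The sum \(\sum_{i,j}p_ip'_jr_{ij}\) is written as a quadratic form in the correlation matrix of the predictors.

```r
# Hypothetical setting: both responses are exact linear functions of the
# predictors, so there are no residual terms
set.seed(2)
n <- 150
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
X[, "x2"] <- X[, "x2"] + 0.6 * X[, "x1"]   # induce correlation among predictors
b1 <- c(0.8, -0.4,  0.3)
b2 <- c(0.2,  0.5, -0.6)
y1 <- drop(X %*% b1)
y2 <- drop(X %*% b2)

# Path coefficients p_i and p'_i (standardized partial regression coefficients)
sx <- apply(X, 2, sd)
p  <- b1 * sx / sd(y1)
pp <- b2 * sx / sd(y2)

# Double sum over i and j of p_i * p'_j * r_ij, written as a quadratic form
R   <- cor(X)
rhs <- drop(t(p) %*% R %*% pp)

print(c(from_paths = rhs, direct = cor(y1, y2)))
```

When the responses also carry residuals, the correlation between those residuals contributes an additional term, which is why it is estimated separately later in this manual.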

The equations above are illustrated in Figures 1 and 2. The simple path diagram:

Fig. 1: A simple path diagram (courtesy of Sewall Wright)

Fig. 2: A multivariate path diagram (courtesy of Bondari, 1990)

The opening part of this vignette (instruction manual) provides a brief introduction to the concepts underpinning path coefficient analysis. The subsequent part showcases two practical demonstrations. In a path coefficient analysis, the Pearson correlation coefficients between dependent variables and their related independent variables are decomposed, as previously mentioned.

Our Path.Analysis package can be applied in two cases: simple and sequential path coefficient analysis. If it is not installed, install the package first with:

if(!require('Path.Analysis')){install.packages('Path.Analysis')}
#> Loading required package: Path.Analysis
#> Registered S3 method overwritten by 'GGally':
#>   method from
#>   +.gg   ggplot2
library('Path.Analysis')

The analyses require the following R packages:

library(car)
library(stats)
library(Hmisc)
library(pastecs)
library(devtools)
library(usethis)
library(testthat)
library(knitr)
library(rmarkdown)
## For graphical displays
library(metan)
library(ComplexHeatmap)
library(grDevices)
library(DiagrammeR)

2-1- Simple path coefficient analysis

2-1-1- worked example 1:

The dtsimp data set is stored in the data folder of the Path.Analysis package. This is the simplest dataset in this package, consisting of a dependent variable called Y and three independent variables called X1, X2 and X3. Type the following commands at the command prompt and run the analyses:

data(dtsimp)

head(dtsimp[1:3, ])

Correlation between variables:

corr(dtsimp, verbose = FALSE)

Linear regression of Y on the X1-X3 variables:

reg(dtsimp, 1, verbose = FALSE)

Plot the main path diagram:

matdiag(dtsimp, 1)

#> [[1]]
#>        y    x1    x2    x3
#> y   1.00  0.43 -0.12  0.03
#> x1  0.43  1.00 -0.14  0.08
#> x2 -0.12 -0.14  1.00 -0.08
#> x3  0.03  0.08 -0.08  1.00
#>
#> n= 105
#>
#>
#> P
#>    y      x1     x2     x3
#> y         0.0000 0.2226 0.7772
#> x1 0.0000        0.1682 0.4333
#> x2 0.2226 0.1682        0.4329
#> x3 0.7772 0.4333 0.4329
#>
#> [[2]]
#> [[2]]$p
#>               y           x1        x2        x3
#> y  0.000000e+00 4.281686e-06 0.2225777 0.7772096
#> x1 4.281686e-06 0.000000e+00 0.1682316 0.4333210
#> x2 2.225777e-01 1.682316e-01 0.0000000 0.4328677
#> x3 7.772096e-01 4.333210e-01 0.4328677 0.0000000
#>
#> [[2]]$lowCI
#>             y         x1         x2         x3
#> y   1.0000000  0.2616079 -0.3046920 -0.1646039
#> x1  0.2616079  1.0000000 -0.3188570 -0.1161105
#> x2 -0.3046920 -0.3188570  1.0000000 -0.2650856
#> x3 -0.1646039 -0.1161105 -0.2650856  1.0000000
#>
#> [[2]]$uppCI
#>             y         x1         x2        x3
#> y  1.00000000 0.57567143 0.07331527 0.2184383
#> x1 0.57567143 1.00000000 0.05769229 0.2650146
#> x2 0.07331527 0.05769229 1.00000000 0.1160351
#> x3 0.21843826 0.26501461 0.11603511 1.0000000
#> Warning in summary.lm(mlreg): essentially perfect fit: summary may be
#> unreliable
#> [[1]]
#>
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#>
#> Coefficients:
#> (Intercept)            y           x1           x2           x3
#>   1.109e-14    1.000e+00    3.295e-18   -1.762e-17    7.056e-17
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#>
#> Residuals:
#>        Min         1Q     Median         3Q        Max
#> -3.397e-15 -1.069e-15 -1.786e-16  9.219e-16  1.232e-14
#>
#> Coefficients:
#>               Estimate Std. Error   t value Pr(>|t|)
#> (Intercept)  1.109e-14  1.644e-15  6.75e+00 9.81e-10 ***
#> y            1.000e+00  1.065e-17  9.39e+16  < 2e-16 ***
#> x1           3.295e-18  1.457e-17  2.26e-01    0.821
#> x2          -1.762e-17  9.490e-17 -1.86e-01    0.853
#> x3           7.056e-17  2.392e-16  2.95e-01    0.769
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.845e-15 on 100 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1
#> F-statistic: 2.722e+33 on 4 and 100 DF,  p-value: < 2.2e-16

Fig. 3: Diagram of the path coefficient analysis of the ‘dtsimp’ sample dataset.

> Note: when the user has external data: suppose the data are stored on a hard drive at the path Path/to/data in a file called mydata.xls. To load them in the RStudio console, follow these instructions:

library(readxl)  # if the readxl package is installed

dtraw <- read_excel("Path/to/data/mydata.xls")
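To make the decomposition in worked example 1 concrete, the sketch below takes the rounded correlations printed for dtsimp earlier in this example and splits each correlation with y into a direct effect and indirect effects. It uses base R only. The residual path coefficient \(\sqrt{1-R^2}\) with \(R^2=\sum_i P_ir_{Yi}\) is the usual convention in path analysis and is included here as an assumption about what one would report; because the inputs are rounded to two decimals, the results may differ slightly from those drawn in the package diagram.

```r
# Correlations copied (rounded to two decimals) from the dtsimp output above
vars <- c("x1", "x2", "x3")
Rxx <- matrix(c( 1.00, -0.14,  0.08,
                -0.14,  1.00, -0.08,
                 0.08, -0.08,  1.00),
              nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
ry <- c(x1 = 0.43, x2 = -0.12, x3 = 0.03)

# Direct effects: solve the path equations Rxx %*% P = ry
P <- solve(Rxx, ry)

# effects[i, j] = P_j * r_ij: the diagonal holds the direct effects,
# the off-diagonal entries the indirect effects of x_i via x_j
effects <- sweep(Rxx, 2, P, `*`)

# Each row sums back to the correlation of that predictor with y
print(round(cbind(effects, total = rowSums(effects)), 3))

# Residual path coefficient, using R^2 = sum(P_i * r_Yi)
residual <- sqrt(1 - sum(P * ry))
print(round(residual, 3))
```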

2-1-2- worked example 2:

The next dataset, called dtraw, is used in this part. It is also a built-in dataset in Path.Analysis and contains nine variables: one dependent variable called Y and eight independent variables labeled X1 through X8. The data come from a population of the Camelina oil crop in which seed oil (Y) and the fatty acids C18, C18.1, C18.2, C18.3, C20.0, C20.1, C20.2 and C22.1 (marked as X1-X8) were measured. Type the following commands in the RStudio console and run them:

data(dtraw)

rownames(dtraw) <- dtraw[, 1]

dtraw[, 1] <- NULL

head(dtraw[1:4, ])

The output is as follows:

data(dtraw)
dtraw <- as.data.frame(dtraw)
rownames(dtraw) <- dtraw[, 1]
dtraw[, 1] <- NULL
head(dtraw[1:4, ])
#>         Y   X1    X2    X3    X4   X5    X6   X7   X8
#> DH1 38.58 2.20 15.61 15.05 35.37 1.29 14.16 1.49 3.20
#> DH2 38.73 2.23 15.34 15.56 34.50 1.23 14.46 1.47 3.33
#> DH3 38.87 2.14 16.66 15.41 36.82 1.24 14.06 2.07 3.19
#> DH4 36.72 2.84 14.46 16.42 34.33 1.27 14.07 1.38 3.13

This dataset can be analyzed via the Path.Analysis package as follows, using the ‘corr_plot’ function of the ‘metan’ package (Olivoto and Dal’Col Lúcio 2020).

Running the ‘cor_plot’ function for ‘dtsimp’:

data(dtsimp)
cor_plot(dtsimp)

Fig. 4: Correlogram of dtsimp dataset, a built-in sample data.

Running the ‘matdiag’ function for the ‘dtraw’ dataset, ignoring the first column from the left, which holds the row names:

Fig. 5: Diagram of the path coefficient analysis of dtraw

The most significant part of my Path.Analysis package is fitting such a diagram, which is produced with the assistance of the DiagrammeR package.

It is important to exercise caution when encountering a short Plot Window in RStudio. To resolve this issue, position the cursor at the top of the graph window in RStudio until the four-way arrows appear, then drag the top of the plot region upwards towards the variable list. If the figure region problem originated from this, running the code again without any modifications will generate the anticipated graph. Additionally, ensure that your outer default margins are correctly sized and that your R plot area labels are not truncated. See https://www.programmingr.com/r-error-messages/r-figure-margins-too-large/

When the response is among the variables but is not the first column from the left:

data(heart)
desc(heart, 2)
#> $`Descriptive statistics:`
#>      Biking       Heart.disease        Smoking
#>  Min.   : 1.119   Min.   : 0.5519   Min.   : 0.5259
#>  1st Qu.:20.205   1st Qu.: 6.5137   1st Qu.: 8.2798
#>  Median :35.824   Median :10.3853   Median :15.8146
#>  Mean   :37.788   Mean   :10.1745   Mean   :15.4350
#>  3rd Qu.:57.853   3rd Qu.:13.7240   3rd Qu.:22.5689
#>  Max.   :74.907   Max.   :20.4535   Max.   :29.9467
#>
#> $`Descriptive statistics:`
#>                    Biking Heart.disease      Smoking
#> nbr.val      4.980000e+02   498.0000000  498.0000000
#> nbr.null     0.000000e+00     0.0000000    0.0000000
#> nbr.na       0.000000e+00     0.0000000    0.0000000
#> min          1.119154e+00     0.5518982    0.5258500
#> max          7.490711e+01    20.4534962   29.9467431
#> range        7.378796e+01    19.9015981   29.4208931
#> sum          1.881863e+04  5066.9199578 7686.6471384
#> median       3.582446e+01    10.3852547   15.8146139
#> mean         3.778841e+01    10.1745381   15.4350344
#> SE.mean      9.626099e-01     0.2048706    0.3714820
#> CI.mean.0.95 1.891286e+00     0.4025192    0.7298687
#> var          4.614556e+02    20.9020349   68.7234260
#> std.dev      2.148152e+01     4.5718743    8.2899593
#> coef.var     5.684684e-01     0.4493447    0.5370872
#>
#> $`Correlation coefficients:`
#>                    Biking Heart.disease    Smoking
#> Biking         1.00000000    -0.9354555 0.01513618
#> Heart.disease -0.93545547     1.0000000 0.30913098
#> Smoking        0.01513618     0.3091310 1.00000000
# matdiag(heart, 2)

*Please note that the diagram is only produced automatically when there is only one dependent variable with its related independent (causative) variables. In the data set, the dependent variable (Y) should be the first variable from the left, and the other variables should be ordered from left to right, as in dtsimp or dtraw. In other words, when the target is simple path coefficient analysis, you can call the package via: matdiag(dtsimp, 1). The package produces textual outputs (without graphs) under any conditions, even when there is missing data.*
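One way to satisfy the column-order requirement is to move the response to the first position before drawing the diagram. The sketch below is base R only; `heart_like` is a simulated stand-in with the same column names as the heart data, and `response_first()` is a hypothetical helper that is not part of the package.

```r
# Hypothetical stand-in for a data set whose response is in column 2
set.seed(3)
heart_like <- data.frame(Biking        = runif(20, 1, 75),
                         Heart.disease = runif(20, 0.5, 20),
                         Smoking       = runif(20, 0.5, 30))

# Hypothetical helper: move column number `resp` to the first position
response_first <- function(df, resp) {
  df[, c(resp, setdiff(seq_along(df), resp)), drop = FALSE]
}

reordered <- response_first(heart_like, 2)
print(names(reordered))

# With the response first, the documented call form applies:
# matdiag(reordered, 1)
```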

2-2- Sequential path coefficient analysis

2-2-1- worked example:

As mentioned earlier, there are two types of path diagrams or methodologies: simple and multivariate. The multivariate form requires more steps and work, but the relationships between variables are the same and easy to understand. A sequential path diagram is more complex because it includes intervening variables that need to be accounted for. Let’s consider a specific scenario with a dataset. For more information see (Arminian et al. 2008). Regarding the dataset, let’s assume our data is stored on a hard drive at the path “~path_to_data/” and is named ‘dtseq.xls’. To load this dataset into the RStudio console, follow these steps:

library(readxl)  # after installing the readxl package

dtseq <- read_excel("~path_to_data/dtseq.xls")

Methods like ‘Pearson’ or ‘Spearman’ can be used to analyze the correlation between variables. A correlogram is a tool that combines scatterplots and histograms, making it possible to examine the relationship between each pair of numeric variables in a matrix. The correlation is visually depicted in scatterplots, while the diagonal of the correlogram showcases the distribution of each variable using a histogram or density plot. (Source: https://python-graph-gallery.com/correlogram/) This analysis can be presented in the form of tables or matrices, which can be generated using the ‘PerformanceAnalytics’ and ‘metan’ packages.
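For readers who want to see the idea without extra packages, here is a small base R sketch. It uses the built-in `mtcars` data as a stand-in (not one of the package data sets), compares the Pearson and Spearman matrices, and draws a basic correlogram with `pairs()`. The `panel_hist()` helper is written for this illustration, following the pattern shown in the `pairs()` help page.

```r
# Built-in example data (not one of the package data sets)
num <- mtcars[, c("mpg", "disp", "hp", "wt")]

r_pearson  <- cor(num, method = "pearson")
r_spearman <- cor(num, method = "spearman")
print(round(r_pearson, 2))
print(round(r_spearman, 2))

# A basic correlogram: scatterplots off the diagonal, histograms on it
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))              # restore the panel coordinates
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1],
       h$counts / max(h$counts), col = "grey80")
}
pairs(num, diag.panel = panel_hist)
```

Spearman correlations are computed on ranks, so they are less sensitive to outliers and to monotone but non-linear relationships than Pearson correlations.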

Step 1: YLD vs. FS, DFT, FW, FV:

library(metan)

data(dtseq)

dtseq1 <- dtseq[, c(2, 4, 3, 6, 5)]

head(dtseq1)

matdiag(dtseq1, 1)

#> # A tibble: 6 × 5
#>     YLD    FS   DFT    FW    FV
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 410.   31.6  51.7  8.69 10.0
#> 2  84.7  38.5  52    8.16  8.33
#> 3 360.   25.3  54.7  7.48  5.65
#> 4 380.   33.9  49   10.0   9.33
#> 5 311.   24.7  50    9.19 10.0
#> 6 404.   19.1  52.5  9.97  9.87

Fig. 6: Diagram of dtseq1, derived from the dtseq data.

Network diagrams, also known as graphs, visually depict the connections between a group of entities. Each entity is represented as a node or vertex, and the connections between nodes are shown as links or edges (source: https://www.data-to-viz.com/graph/network.html). In R, you can create network plots of the connections between objects using the ‘corrr’ package. This package draws colored links that are thin or thick depending on the strength of the correlation between objects. Take a look at the graph that illustrates the correlations for ‘dtseq1’. It showcases a larger number of variables, making it visually appealing and informative.

#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Fig. 7: Network plot of the dtraw2.
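A network plot of this kind can be sketched as follows. The example uses the built-in `mtcars` data as a stand-in, and it assumes the ‘corrr’ package is installed; if it is not, the plain correlation matrix is printed instead. The `min_cor` value of 0.3 is an arbitrary choice for illustration.

```r
# Built-in data used as a stand-in for the package data sets
num <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

if (requireNamespace("corrr", quietly = TRUE)) {
  rdf <- corrr::correlate(num, method = "pearson")
  # Links weaker than |r| = 0.3 are hidden
  p <- corrr::network_plot(rdf, min_cor = 0.3)
  print(p)
} else {
  message("Package 'corrr' is not installed; showing the plain matrix instead.")
  print(round(cor(num), 2))
}
```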

Fig. 8: Heatmap of the dtraw2 dataset.

Attractive heatmaps

For plotting heatmaps and clustering observations and variables simultaneously, we can use packages such as ComplexHeatmap (Gu Z (2022). “Complex Heatmap Visualization.” iMeta. doi:10.1002/imt2.43) and pheatmap. Here we introduce the application of the ComplexHeatmap package in clustering the dtraw2 dataset, measured on 35 genotypes of a plant with 9 traits.
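As a rough sketch of such a clustered heatmap, the code below builds a simulated 35 × 9 matrix (a hypothetical stand-in for dtraw2, with made-up genotype and trait labels), standardizes the traits, and clusters rows and columns. It assumes ComplexHeatmap is installed and falls back to the base R `heatmap()` function otherwise.

```r
# Hypothetical 35 x 9 matrix standing in for 'dtraw2'
set.seed(4)
mat <- matrix(rnorm(35 * 9), nrow = 35,
              dimnames = list(paste0("G", 1:35), paste0("T", 1:9)))
z <- scale(mat)   # put traits on a common scale before clustering

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  ht <- ComplexHeatmap::Heatmap(z, name = "z-score",
                                cluster_rows = TRUE, cluster_columns = TRUE)
  ComplexHeatmap::draw(ht)
} else {
  heatmap(z)      # base R fallback with row and column dendrograms
}
```

Standardizing first keeps traits measured on large scales from dominating the distance computation that drives the clustering.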

Fig. 9: Complex heatmap plot1 of the dtraw2.

Step 2: FS vs. FLP, DFL:

#> # A tibble: 6 × 3
#>      FS   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  31.6  55.2  16.7
#> 2  38.5  55.7  18.3
#> 3  25.3  49.8  17.6
#> 4  33.9  59.1  17.9
#> 5  24.7  49.5  15.5
#> 6  19.1  67.3  17.3

Fig. 10: Diagram of the path coefficient analysis of the dtseq2 (part of dtseq)

Step 3: DFT vs. FLP, DFL:

#> # A tibble: 6 × 3
#>     DFT   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  51.7  55.2  16.7
#> 2  52    55.7  18.3
#> 3  54.7  49.8  17.6
#> 4  49    59.1  17.9
#> 5  50    49.5  15.5
#> 6  52.5  67.3  17.3

Fig. 11: Diagram of the path coefficient analysis of dtseq3 (part of dtseq)

Step 4: FW vs. FLP, DFL:

#> # A tibble: 6 × 3
#>      FW   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  8.69  55.2  16.7
#> 2  8.16  55.7  18.3
#> 3  7.48  49.8  17.6
#> 4 10.0   59.1  17.9
#> 5  9.19  49.5  15.5
#> 6  9.97  67.3  17.3

Fig. 12: Diagram of the path coefficient analysis of dtseq4 (part of dtseq)

Step 5: FV vs. FLP, DFL:

#> # A tibble: 6 × 3
#>      FV   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1 10.0   55.2  16.7
#> 2  8.33  55.7  18.3
#> 3  5.65  49.8  17.6
#> 4  9.33  59.1  17.9
#> 5 10.0   49.5  15.5
#> 6  9.87  67.3  17.3

Fig. 13: Correlation plot of the dtseq5 (part of dtseq)

Step 6: DFL vs. FLP:

#> # A tibble: 6 × 2
#>     DFL   FLP
#>   <dbl> <dbl>
#> 1  16.7  55.2
#> 2  18.3  55.7
#> 3  17.6  49.8
#> 4  17.9  59.1
#> 5  15.5  49.5
#> 6  17.3  67.3

Fig. 14: Network plot of the dtseq6 (part of dtseq).
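The six steps above share one pattern: pick a response, pick its predictors, and estimate the direct effects. The sketch below expresses that pattern in base R. The data frame `sim` is simulated with the same column names as dtseq (its values are hypothetical), and `path_coefs()` is a helper written for this illustration that estimates direct effects as regression coefficients on standardized variables. It is not the package's `matdiag()` function and draws no diagram.

```r
# Hypothetical data with the same column names as 'dtseq'
set.seed(5)
n   <- 40
FLP <- rnorm(n, 55, 6)
DFL <- 0.1 * FLP + rnorm(n, 12, 1)
sim <- data.frame(FLP, DFL,
                  FS  = 0.3  * FLP + 0.5 * DFL + rnorm(n, 0, 5),
                  DFT = 0.2  * FLP + 0.4 * DFL + rnorm(n, 30, 2),
                  FW  = 0.05 * FLP + 0.1 * DFL + rnorm(n, 4, 1),
                  FV  = 0.05 * FLP + 0.1 * DFL + rnorm(n, 4, 1))
sim$YLD <- with(sim, 5 * FS + 3 * DFT + 10 * FW + 8 * FV + rnorm(n, 0, 40))

# One entry per step: response = predictors
steps <- list(YLD = c("FS", "DFT", "FW", "FV"),
              FS  = c("FLP", "DFL"),
              DFT = c("FLP", "DFL"),
              FW  = c("FLP", "DFL"),
              FV  = c("FLP", "DFL"),
              DFL = "FLP")

# Direct effects = regression coefficients on standardized variables
path_coefs <- function(df, resp, preds) {
  z <- as.data.frame(scale(df[, c(resp, preds)]))
  coef(lm(reformulate(preds, response = resp), data = z))[-1]
}

direct <- Map(function(resp, preds) path_coefs(sim, resp, preds),
              names(steps), steps)
print(lapply(direct, round, 3))
```

In the last step there is a single predictor, so the direct effect of FLP on DFL equals their Pearson correlation.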

Multivariate analysis of variance (MANOVA) is used to estimate the SSCP matrices and related quantities. This requires the car package, which provides Anova(), to be installed:

data(dtseqr)
dtseqr <- as.data.frame(dtseqr)
dtseqr[, 1] <- as.factor(dtseqr[, 1])  # Rep
dtseqr[, 2] <- as.factor(dtseqr[, 2])  # Genotypes
f <- lm(cbind(YLD, DFT, FS, FV, FW, DFL, FLP) ~ Rep + Genotypes, dtseqr)
summary(Anova(f))  # all results for MANOVA
#>
#> Type II MANOVA Tests:
#>
#> Sum of squares and products for error:
#>           YLD        DFT          FS        FV        FW       DFL        FLP
#> YLD 30872.750 1305.82650 -1284.15150 -402.8620 -420.6260 -213.0040 1418.89700
#> DFT  1305.827  677.12718    86.45078   -1.4006   50.1155  118.7334   69.47155
#> FS  -1284.151   86.45078   254.45997  -23.1646   24.3031   43.1752   41.54635
#> FV   -402.862   -1.40060   -23.16460   28.2584    6.8566   18.5540  -39.72960
#> FW   -420.626   50.11550    24.30310    6.8566   24.0384   22.1664  -42.02800
#> DFL  -213.004  118.73340    43.17520   18.5540   22.1664   56.4220  -35.56200
#> FLP  1418.897   69.47155    41.54635  -39.7296  -42.0280  -35.5620  271.59110
#>
#> ------------------------------------------
#>
#> Term: Rep
#>
#> Sum of squares and products for the hypothesis:
#>           YLD        DFT         FS      FV       FW    DFL       FLP
#> YLD  327.6100  195.02750  124.79950  10.860 -53.2140  7.240 -336.8410
#> DFT  195.0275  116.10063   74.29363   6.465 -31.6785  4.310 -200.5228
#> FS   124.7995   74.29363   47.54103   4.137 -20.2713  2.758 -128.3159
#> FV    10.8600    6.46500    4.13700   0.360  -1.7640  0.240  -11.1660
#> FW   -53.2140  -31.67850  -20.27130  -1.764   8.6436 -1.176   54.7134
#> DFL    7.2400    4.31000    2.75800   0.240  -1.1760  0.160   -7.4440
#> FLP -336.8410 -200.52275 -128.31595 -11.166  54.7134 -7.444  346.3321
#>
#> Multivariate Tests: Rep
#>                  Df test stat  approx F num Df den Df     Pr(>F)
#> Pillai            2   0.98891   1.25750     14     18    0.31903
#> Wilks             2   0.01109   9.70834     14     16 2.5259e-05 ***
#> Hotelling-Lawley  2  89.15113  44.57557     14     14 3.7404e-09 ***
#> Roy               2  89.15113 114.62289      7      9 4.5158e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------
#>
#> Term: Genotypes
#>
#> Sum of squares and products for the hypothesis:
#>            YLD        DFT           FS         FV         FW         DFL
#> YLD 697553.096 5033.04937 -11241.57825 4343.21250 2541.13650 -2906.41425
#> DFT   5033.049  116.01356    -91.18987    7.00395   -1.07955   -11.41207
#> FS  -11241.578  -91.18987    926.14245    3.20730  -24.60870   134.06535
#> FV    4343.213    7.00395      3.20730   83.14020   37.53540   -13.49070
#> FW    2541.136   -1.07955    -24.60870   37.53540   23.42280    -5.21070
#> DFL  -2906.414  -11.41207    134.06535  -13.49070   -5.21070    54.84165
#> FLP   5863.115   22.55888    -56.60295  126.96570   90.78030    72.84285
#>            FLP
#> YLD 5863.11525
#> DFT   22.55888
#> FS   -56.60295
#> FV   126.96570
#> FW    90.78030
#> DFL   72.84285
#> FLP  769.10145
#>
#> Multivariate Tests: Genotypes
#>                  Df test stat approx F num Df   den Df     Pr(>F)
#> Pillai            7     3.896    2.510     49 98.00000 5.5418e-05 ***
#> Wilks             7     0.000   16.018     49 45.03719 < 2.22e-16 ***
#> Hotelling-Lawley  7  4971.966  637.803     49 44.00000 < 2.22e-16 ***
#> Roy               7  4949.541 9899.081      7 14.00000 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Anova(f)$SSPE    # print the full error SSCP matrix
Anova(f)$SSPE[4:5, 4:5]  # error SSCP matrix for the two dependent variables FV and FW
#>         FV      FW
#> FV 28.2584  6.8566
#> FW  6.8566 24.0384
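To show what the error SSCP matrix contains, the following sketch fits a small multivariate linear model to simulated data (hypothetical values; only the names FV, FW and Rep are borrowed from the example above) and computes the matrix as the cross-product of the residuals. When the car package is available, the result is compared with `Anova(fit)$SSPE`.

```r
# Hypothetical two-response data with a blocking factor and a treatment factor
set.seed(6)
d <- expand.grid(Rep = factor(1:3), Genotype = factor(1:8))
d$FV <- rnorm(nrow(d), 10, 1)
d$FW <- 0.4 * d$FV + rnorm(nrow(d), 5, 1)

fit <- lm(cbind(FV, FW) ~ Rep + Genotype, data = d)

# Error SSCP matrix: cross-products of the multivariate residuals
E    <- residuals(fit)
SSPE <- crossprod(E)
print(SSPE)

# If 'car' is available, its error matrix should match
if (requireNamespace("car", quietly = TRUE)) {
  print(all.equal(unname(SSPE), unname(car::Anova(fit)$SSPE)))
}
```

The diagonal entries are the residual sums of squares of each response, and the off-diagonal entries are the residual sums of cross-products.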

After performing a multivariate path coefficient analysis, it is necessary to estimate the correlation coefficient between the residuals (here FV and FW are the final dependent variables), as follows. To do this, the error matrix from the MANOVA is needed.

ru1u2 <- Anova(f)$SSPE[4, 5] / (sqrt(Anova(f)$SSPE[4, 4]) * sqrt(Anova(f)$SSPE[5, 5]))
cat("\nCorrelation coefficient between residuals is:\n", ru1u2)
#>
#> Correlation coefficient between residuals is:
#>  0.2630766
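The same quantity can be obtained with the base R function `cov2cor()`, which divides each entry of a covariance-like matrix by the square roots of the corresponding diagonal entries. The sketch below re-enters the FV and FW error SSCP values printed above, so it runs without the package data.

```r
# Error SSCP sub-matrix for FV and FW, copied from the output above
SSPE <- matrix(c(28.2584,  6.8566,
                  6.8566, 24.0384),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("FV", "FW"), c("FV", "FW")))

# cov2cor() rescales by the square roots of the diagonal, which is the
# same computation as SSPE[1, 2] / sqrt(SSPE[1, 1] * SSPE[2, 2])
ru1u2 <- cov2cor(SSPE)["FV", "FW"]
print(round(ru1u2, 4))   # about 0.2631, in line with the value reported above
```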

After performing the sequential path analyses step by step, it is time to create a sequential path diagram, which includes a multivariate path diagram. To do this, one can use the program Graphviz together with DiagrammeR. If a specific section of the sequential model is considered as a multivariate path, one can draw a multivariate path diagram (Arminian et al. 2008) and estimate the correlation coefficient between residuals (as previously estimated) as follows:

Fig. 15: Sequential univariate path diagram. It is important to note that residuals, which are estimated in steps 1 to 6 above, can be added to each endogenous variable.

For the full list of color names and other symbols in DiagrammeR, the many node/edge attributes, and the Graphviz options that can be used in the diagrams, see manuals and guides such as: https://rich-iannone.github.io/DiagrammeR/articles/graphviz-mermaid.html. Also: vignettes/graphviz-mermaid.Rmd
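As a starting point for such a diagram, the sketch below builds a Graphviz DOT description from the arrows implied by steps 1 to 6 and renders it with `DiagrammeR::grViz()` when DiagrammeR is installed. The layout direction, node shape and fill color are arbitrary illustrative choices, and path coefficients and residual terms are left out; they could be added as edge labels and extra nodes.

```r
# Edges of the sequential model, taken from steps 1 to 6 above
edges <- rbind(
  expand.grid(from = "FLP", to = "DFL", stringsAsFactors = FALSE),
  expand.grid(from = c("FLP", "DFL"), to = c("FS", "DFT", "FW", "FV"),
              stringsAsFactors = FALSE),
  expand.grid(from = c("FS", "DFT", "FW", "FV"), to = "YLD",
              stringsAsFactors = FALSE)
)

# Build a Graphviz DOT description as plain text
dot <- paste0(
  "digraph seqpath {\n",
  "  rankdir = LR;\n",
  "  node [shape = box, style = filled, fillcolor = lightblue];\n",
  paste0("  ", edges$from, " -> ", edges$to, ";", collapse = "\n"),
  "\n}"
)
cat(dot, "\n")

# Render it when DiagrammeR is installed and the session is interactive
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  g <- DiagrammeR::grViz(dot)
  if (interactive()) print(g)
}
```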

Fig. 16: The sequential multivariate path Diagram

Notice: Users can also see the ‘lavaan’ package in R, the simple ‘PATHSAS’ code written by Cramer et al. (Cramer, Wehner, and Donaghy 1999), and the ‘semPlot’ and ‘OpenMx’ packages as initial tools for conducting path analyses and SEM (Structural Equation Modeling).


Path Coefficient Analysis

Ali Arminian

2024-09-23

1-Introduction