Path coefficient analysis, introduced by Sewall Wright in 1921 in "Correlation and causation", is an extended form of multiple regression analysis that decomposes correlation coefficients into direct, indirect, spurious and unanalyzed effects. It is a vital tool for studying cause-effect relationships among normally distributed variables. It is of three types: simple, sequential and multivariate. In the simple form, there is a single dependent (endogenous) variable and one or more independent (exogenous) variables. Sewall Wright is the pioneer of path coefficient analysis and published numerous works on the subject from 1916 to 1980. The method was initially met with skepticism but was later accepted and widely used in the social sciences. Today, path coefficient analysis is used in almost all fields. For more information on path coefficient analysis, see (Bondari 1990; Wright 1923, 1934, 1960; Li 1975; Wolfle 2003). Readers who wish to become more familiar with statistical topics such as descriptive statistics may consult general statistical references, for example (Snedecor and Cochran 1980; Bhattacharyya and Johnson 1997; Draper and Smith 1981; Neter, Whitmore, and Wasserman 1992).
In a path coefficient analysis, descriptive statistics and Pearson correlation coefficients (double-headed arrows) between variables are estimated, which this package does. In addition, a simple or multiple linear regression of the dependent (endogenous) variable(s) on the independent variable(s) is fitted, which the package also does. In a sequential path coefficient analysis, intervening (endogenous) variables exist and the analyses are performed step by step with this package, whereas in a simple path coefficient analysis one step is enough and the path diagram is drawn automatically. Complicated or sequential paths require some extra work, which is discussed later in this manual. In a path model, the path coefficients or direct effects (the \(P_i\)) indicate the direct effect of one variable on another. They are standardized partial regression coefficients (in Wright's terminology) because they are estimated from correlations or from the transformed (standardized) data as: \(P_i = \beta_i\frac{\sigma_{X_i}}{\sigma_Y}\). The path equations are as follows:
\[\begin{aligned}
P_1 + P_2r_{12} + P_3r_{13} + \dots + P_nr_{1n} &= r_{Y1} \\
P_1r_{21} + P_2 + P_3r_{23} + \dots + P_nr_{2n} &= r_{Y2} \\
P_1r_{31} + P_2r_{32} + P_3 + \dots + P_nr_{3n} &= r_{Y3} \\
&\;\;\vdots \\
P_1r_{n1} + P_2r_{n2} + P_3r_{n3} + \dots + P_n &= r_{Yn}
\end{aligned}\]
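In matrix form the path equations read \(\mathbf{R}_{XX}\mathbf{P} = \mathbf{r}_{XY}\), so the path coefficients can be obtained by solving a linear system. The following is a minimal base-R sketch on simulated data; the variable names, sample size and coefficients are illustrative assumptions and are not taken from the package. It checks that the solution agrees with the standardized partial regression coefficients \(\beta_i\sigma_{X_i}/\sigma_Y\).

```r
# Minimal sketch on simulated data; all names and values are illustrative.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)          # x2 is correlated with x1
x3 <- rnorm(n)
y  <- 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# Solve the path equations R_xx %*% P = r_xy
R_xx <- cor(X)
r_xy <- cor(X, y)[, 1]
P_from_correlations <- solve(R_xx, r_xy)

# Same coefficients from the raw regression: P_i = beta_i * sd(X_i) / sd(Y)
beta <- coef(lm(y ~ x1 + x2 + x3))[-1]   # drop the intercept
P_from_regression <- beta * apply(X, 2, sd) / sd(y)

print(round(cbind(P_from_correlations, P_from_regression), 4))
```

The two columns should coincide up to floating-point error, which is why the analysis can start either from raw data or from a correlation matrix.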
Our package is capable of performing this straightforward task through detailed explanations. As stated by Bondari (1990), for two dependent variables \(Y_1\) and \(Y_2\):
\[\begin{aligned}
Y_1 &= p_1X_1+p_2X_2+p_3X_3+\dots+p_nX_n\\
Y_2 &= p'_1X_1+p'_2X_2+p'_3X_3+\dots+p'_nX_n\\
&\;\;\vdots\\
\text{where:}\quad r_{Y_1Y_2} &= p_1p'_1+p_2p'_2+p_3p'_3+\dots+p_np'_n+\sum_{i\neq j}p_ip'_jr_{ij}=\sum_{i,j}p_ip'_jr_{ij}
\end{aligned}\]
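The identity for \(r_{Y_1Y_2}\) can be checked numerically. The sketch below is only an illustration under a simplifying assumption: \(Y_1\) and \(Y_2\) are built as exact linear combinations of standardized predictors (no residual terms), and the weights are arbitrary made-up values.

```r
# Illustrative check of r_{Y1Y2} = sum_{i,j} p_i p'_j r_ij (no residuals assumed).
set.seed(2)
n <- 150
X <- matrix(rnorm(n * 3), n, 3)
X[, 2] <- X[, 2] + 0.6 * X[, 1]     # make the predictors correlated
Z <- scale(X)                       # standardized predictors

a1 <- c(0.8, -0.3, 0.5)             # arbitrary raw weights for Y1
a2 <- c(0.2, 0.7, -0.4)             # arbitrary raw weights for Y2
y1 <- drop(Z %*% a1)
y2 <- drop(Z %*% a2)

# Path coefficients are the weights expressed per standard deviation of Y
p1 <- a1 / sd(y1)
p2 <- a2 / sd(y2)

R <- cor(Z)                          # r_ij, with r_ii = 1
r_formula <- drop(t(p1) %*% R %*% p2)
r_direct  <- cor(y1, y2)
print(c(formula = r_formula, direct = r_direct))
```

When the responses also carry residual terms, the correlation between those residuals has to be accounted for separately.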
The relationships above are shown in Figures 1 and 2. The simple path diagram:
The opening part of this vignette (instruction manual) provides a brief introduction to the concepts underpinning path coefficient analysis. The subsequent part showcases two practical demonstrations. In a path coefficient analysis, the Pearson correlation coefficients between dependent variables and their related independent variables are decomposed, as previously mentioned.
The Path.Analysis package can be applied in two cases: simple and sequential path coefficient analysis. If the package is not installed yet, it is installed first through:
if (!require('Path.Analysis')) {install.packages('Path.Analysis')}
#> Loading required package: Path.Analysis
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
library('Path.Analysis')
The analyses require the following R packages:
The first example uses data stored in the data folder of the Path.Analysis package. This is the simplest dataset in the package, consisting of a dependent variable called Y and three independent variables called X1, X2 and X3. Type the following commands at the command prompt and run the analyses:
data(dtsimp)
head(dtsimp[1:3, ])
Correlation between variables:
corr(dtsimp, verbose = FALSE)
Simple linear regression between Y and X1-X3 vars:
reg(dtsimp, 1, verbose = FALSE)
Plot the path main diagram
matdiag(dtsimp, 1)
#> [[1]]
#>        y    x1    x2    x3
#> y   1.00  0.43 -0.12  0.03
#> x1  0.43  1.00 -0.14  0.08
#> x2 -0.12 -0.14  1.00 -0.08
#> x3  0.03  0.08 -0.08  1.00
#> 
#> n= 105 
#> 
#> 
#> P
#>    y      x1     x2     x3    
#> y         0.0000 0.2226 0.7772
#> x1 0.0000        0.1682 0.4333
#> x2 0.2226 0.1682        0.4329
#> x3 0.7772 0.4333 0.4329       
#> 
#> [[2]]
#> [[2]]$p
#>               y           x1        x2        x3
#> y  0.000000e+00 4.281686e-06 0.2225777 0.7772096
#> x1 4.281686e-06 0.000000e+00 0.1682316 0.4333210
#> x2 2.225777e-01 1.682316e-01 0.0000000 0.4328677
#> x3 7.772096e-01 4.333210e-01 0.4328677 0.0000000
#> 
#> [[2]]$lowCI
#>             y         x1         x2         x3
#> y   1.0000000  0.2616079 -0.3046920 -0.1646039
#> x1  0.2616079  1.0000000 -0.3188570 -0.1161105
#> x2 -0.3046920 -0.3188570  1.0000000 -0.2650856
#> x3 -0.1646039 -0.1161105 -0.2650856  1.0000000
#> 
#> [[2]]$uppCI
#>             y         x1         x2        x3
#> y  1.00000000 0.57567143 0.07331527 0.2184383
#> x1 0.57567143 1.00000000 0.05769229 0.2650146
#> x2 0.07331527 0.05769229 1.00000000 0.1160351
#> x3 0.21843826 0.26501461 0.11603511 1.0000000
#> Warning in summary.lm(mlreg): essentially perfect fit: summary may be
#> unreliable
#> [[1]]
#> 
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#> 
#> Coefficients:
#> (Intercept)            y           x1           x2           x3  
#>   1.109e-14    1.000e+00    3.295e-18   -1.762e-17    7.056e-17  
#> 
#> 
#> [[2]]
#> 
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -3.397e-15 -1.069e-15 -1.786e-16  9.219e-16  1.232e-14 
#> 
#> Coefficients:
#>               Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)  1.109e-14  1.644e-15  6.75e+00 9.81e-10 ***
#> y            1.000e+00  1.065e-17  9.39e+16  < 2e-16 ***
#> x1           3.295e-18  1.457e-17  2.26e-01    0.821    
#> x2          -1.762e-17  9.490e-17 -1.86e-01    0.853    
#> x3           7.056e-17  2.392e-16  2.95e-01    0.769    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.845e-15 on 100 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.722e+33 on 4 and 100 DF,  p-value: < 2.2e-16
Fig. 3: Diagram of the path coefficient analysis of the 'dtsimp' sample dataset.
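To make the decomposition concrete, the correlations printed by corr() above can be split into direct and indirect effects by hand. This is a minimal base-R sketch, not the package's own routine; because it starts from correlations rounded to two decimals, the numbers will differ slightly from anything computed on the full data.

```r
# Rounded correlations copied from the corr() output for dtsimp
vars <- c("x1", "x2", "x3")
R_xx <- matrix(c( 1.00, -0.14,  0.08,
                 -0.14,  1.00, -0.08,
                  0.08, -0.08,  1.00),
               nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
r_xy <- c(x1 = 0.43, x2 = -0.12, x3 = 0.03)

# Direct effects: solution of the path equations
P <- solve(R_xx, r_xy)

# effects[i, j] = P_j * r_ij: contribution to r_{y,xi} routed through x_j
effects  <- sweep(R_xx, 2, P, `*`)
direct   <- diag(effects)
indirect <- rowSums(effects) - direct

print(round(cbind(direct, indirect, total = direct + indirect, r_xy), 3))
```

By construction, the direct and indirect effects in each row add up to the correlation of that variable with y.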
> Note: when working with external data. Suppose the data are stored on a hard drive at the path Path/to/data in a file called mydata.xls. To load them in the RStudio console, follow these instructions:
library(readxl) # if the readxl package is installed
dtraw <- read_excel("Path/to/data/mydata.xls")
The next dataset, called dtraw, is used in this part. It is also a built-in dataset in Path.Analysis and contains nine variables: one dependent variable called Y and eight independent variables labeled X1 through X8. The dataset comes from a population of the oil crop Camelina in which seed oil (Y) and the C18, C18.1, C18.2, C18.3, C20.0, C20.1, C20.2 and C22.1 fatty acids (marked as X1-X8) were measured. Type the following commands in the RStudio console and run them:
data(dtraw)
rownames(dtraw) <- dtraw[, 1]
dtraw[, 1] <- NULL
head(dtraw[1:4, ])
The output is as follows:
data(dtraw)
dtraw <- as.data.frame(dtraw)
rownames(dtraw) <- dtraw[, 1]
dtraw[, 1] <- NULL
head(dtraw[1:4, ])
#>         Y   X1    X2    X3    X4   X5    X6   X7   X8
#> DH1 38.58 2.20 15.61 15.05 35.37 1.29 14.16 1.49 3.20
#> DH2 38.73 2.23 15.34 15.56 34.50 1.23 14.46 1.47 3.33
#> DH3 38.87 2.14 16.66 15.41 36.82 1.24 14.06 2.07 3.19
#> DH4 36.72 2.84 14.46 16.42 34.33 1.27 14.07 1.38 3.13
This dataset can be analyzed with the Path.Analysis package as follows, using the 'corr_plot' function of the 'metan' package (Olivoto and Dal'Col Lúcio 2020).
Running the 'corr_plot' function for 'dtsimp':
Fig. 4: Correlogram of dtsimp dataset, a built-in sample data.
Running the 'matdiag' function for the 'dtraw' dataset, ignoring the first column from the left, which holds the row names:
Fig. 5: Diagram of the path coefficient analysis of dtraw
The most significant part of the Path.Analysis package is fitting such a diagram, which is produced with the assistance of the DiagrammeR package.
It is important to exercise caution when encountering a short Plot Window in RStudio. To resolve this issue, position the cursor at the top of the plot window in RStudio until the four-way arrows appear, then drag the top of the plot region upwards towards the variable list. If the figure region problem originated from this, rerunning the code without any modifications will generate the expected graph. Additionally, ensure that the outer default margins are correctly sized and that the plot area labels are not truncated. See https://www.programmingr.com/r-error-messages/r-figure-margins-too-large/
When the response variable is among the columns but is not the first from the left:
data(heart)
desc(heart, 2)
#> $`Descriptive statistics:`
#>      Biking       Heart.disease        Smoking       
#>  Min.   : 1.119   Min.   : 0.5519   Min.   : 0.5259  
#>  1st Qu.:20.205   1st Qu.: 6.5137   1st Qu.: 8.2798  
#>  Median :35.824   Median :10.3853   Median :15.8146  
#>  Mean   :37.788   Mean   :10.1745   Mean   :15.4350  
#>  3rd Qu.:57.853   3rd Qu.:13.7240   3rd Qu.:22.5689  
#>  Max.   :74.907   Max.   :20.4535   Max.   :29.9467  
#>
#> $`Descriptive statistics:`
#>                    Biking Heart.disease      Smoking
#> nbr.val      4.980000e+02   498.0000000  498.0000000
#> nbr.null     0.000000e+00     0.0000000    0.0000000
#> nbr.na       0.000000e+00     0.0000000    0.0000000
#> min          1.119154e+00     0.5518982    0.5258500
#> max          7.490711e+01    20.4534962   29.9467431
#> range        7.378796e+01    19.9015981   29.4208931
#> sum          1.881863e+04  5066.9199578 7686.6471384
#> median       3.582446e+01    10.3852547   15.8146139
#> mean         3.778841e+01    10.1745381   15.4350344
#> SE.mean      9.626099e-01     0.2048706    0.3714820
#> CI.mean.0.95 1.891286e+00     0.4025192    0.7298687
#> var          4.614556e+02    20.9020349   68.7234260
#> std.dev      2.148152e+01     4.5718743    8.2899593
#> coef.var     5.684684e-01     0.4493447    0.5370872
#>
#> $`Correlation coefficients:`
#>                    Biking Heart.disease    Smoking
#> Biking         1.00000000    -0.9354555 0.01513618
#> Heart.disease -0.93545547     1.0000000 0.30913098
#> Smoking        0.01513618     0.3091310 1.00000000
# matdiag(heart, 2)
*Please be aware that the diagram is only produced automatically when there is a single dependent variable with its related independent (causative) variables. In the data set, the dependent variable (Y) should be the first variable from the left, and the other variables should be ordered from left to right, as in dtsimp or dtraw. In other words, when the target is a simple path coefficient analysis, you can call the package via matdiag(dtsimp, 1). The package extracts textual outputs (without graphs) under any conditions, even when there is missing data.*
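When the response is not in the first column, one option is to reorder the columns before drawing the diagram. The helper below is a hypothetical convenience function, not part of Path.Analysis, and the toy data frame only mimics the column names of the heart dataset with made-up values.

```r
# Hypothetical helper (not part of Path.Analysis): move the response column first
response_first <- function(data, resp) {
  stopifnot(is.data.frame(data), resp >= 1, resp <= ncol(data))
  data[, c(resp, setdiff(seq_len(ncol(data)), resp)), drop = FALSE]
}

# Toy stand-in using the heart column names; the values are made up
toy <- data.frame(Biking        = c(30, 65, 1, 44),
                  Heart.disease = c(11.8, 2.9, 17.2, 6.8),
                  Smoking       = c(10.9, 2.2, 17.6, 2.8))

toy2 <- response_first(toy, 2)
print(names(toy2))
```

After reordering, the response sits in column 1, matching the layout that dtsimp and dtraw use.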
As mentioned earlier, path diagrams or methodologies can be simple or multivariate. The multivariate form requires more steps and work, but the relationships between variables are the same and easy to understand. A sequential path diagram is more complex because it includes intervening variables that need to be accounted for. Let's consider a specific scenario with a dataset; for more information see (Arminian et al. 2008). Assume the data are stored on a hard drive at the path "~path_to_data/" in a file named 'dtseq.xls'. To load this dataset into the RStudio console, follow these steps:
library(readxl) # after installing the readxl package
dtseq <- read_excel("~path_to_data/dtseq.xls")
Methods like 'Pearson' or 'Spearman' can be used to analyze the correlation between variables. A correlogram is a tool that combines scatterplots and histograms, making it possible to examine the relationship between each pair of numeric variables in a matrix. The correlation is visually depicted in scatterplots, while the diagonal of the correlogram shows the distribution of each variable using a histogram or density plot (source: https://python-graph-gallery.com/correlogram/). This analysis can be presented in the form of tables or matrices, which can be generated using the 'PerformanceAnalytics' and 'metan' packages.
Step 1: YLD vs. FS, DFT, FW, FV:
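The difference between the two methods is easiest to see on a relationship that is monotone but not linear. The following base-R sketch uses simulated values (purely illustrative) and the `method` argument of `cor()`:

```r
# Pearson measures linear association; Spearman measures monotone association.
set.seed(3)
x <- runif(50, 0, 5)
y <- exp(x)                      # strictly increasing, strongly nonlinear

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

Here the Spearman coefficient reaches 1 because the ranks agree perfectly, while the Pearson coefficient stays below 1.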
library(metan)
data(dtseq)
dtseq1 <- dtseq[, c(2, 4, 3, 6, 5)]
head(dtseq1)
matdiag(dtseq1, 1)
#> # A tibble: 6 × 5
#>     YLD    FS   DFT    FW    FV
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 410.   31.6  51.7  8.69 10.0 
#> 2  84.7  38.5  52    8.16  8.33
#> 3 360.   25.3  54.7  7.48  5.65
#> 4 380.   33.9  49   10.0   9.33
#> 5 311.   24.7  50    9.19 10.0 
#> 6 404.   19.1  52.5  9.97  9.87
Fig. 6: Diagram of dtseq1, derived from the dtseq data.
Network diagrams, also known as graphs, visually depict the connections between a group of entities. Each entity is represented as a node or vertex, and the connections between nodes are shown as links or edges (source: https://www.data-to-viz.com/graph/network.html). In R, you can create network plots of the connections between objects using the 'corrr' package. This package draws colored links that are thin or thick depending on the strength of the correlation between objects. Take a look at the graph that illustrates the correlations for 'dtseq1'. It shows a larger number of variables, making it visually appealing and informative.
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
Fig. 7: Network plot of the dtraw2.
Fig. 8: Heatmap of the dtraw2 dataset.
For plotting heatmaps and clustering observations and variables simultaneously, we can use packages such as ComplexHeatmap (Gu Z (2022). "Complex Heatmap Visualization." iMeta. doi:10.1002/imt2.43) and pheatmap. Here we introduce the application of the ComplexHeatmap package to clustering the dtraw2 dataset, measured on 35 genotypes of a plant with 9 traits.
Fig. 9: Complex heatmap plot1 of the dtraw2.
Step 2: FS vs. FLP, DFL:
#> # A tibble: 6 × 3
#>      FS   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  31.6  55.2  16.7
#> 2  38.5  55.7  18.3
#> 3  25.3  49.8  17.6
#> 4  33.9  59.1  17.9
#> 5  24.7  49.5  15.5
#> 6  19.1  67.3  17.3
Fig. 10: Diagram of the path coefficient analysis of the dtseq2 (part of dtseq)
Step 3: DFT vs. FLP, DFL:
#> # A tibble: 6 × 3
#>     DFT   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  51.7  55.2  16.7
#> 2  52    55.7  18.3
#> 3  54.7  49.8  17.6
#> 4  49    59.1  17.9
#> 5  50    49.5  15.5
#> 6  52.5  67.3  17.3
Fig. 11: Diagram of the path coefficient analysis of dtseq3 (part of dtseq)
Step 4: FW vs. FLP, DFL:
#> # A tibble: 6 × 3
#>      FW   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  8.69  55.2  16.7
#> 2  8.16  55.7  18.3
#> 3  7.48  49.8  17.6
#> 4 10.0   59.1  17.9
#> 5  9.19  49.5  15.5
#> 6  9.97  67.3  17.3
Fig. 12: Diagram of the path coefficient analysis of dtseq4 (part of dtseq)
Step 5: FV vs. FLP, DFL:
#> # A tibble: 6 × 3
#>      FV   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1 10.0   55.2  16.7
#> 2  8.33  55.7  18.3
#> 3  5.65  49.8  17.6
#> 4  9.33  59.1  17.9
#> 5 10.0   49.5  15.5
#> 6  9.87  67.3  17.3
Fig. 13: Correlation plot of the dtseq5 (part of dtseq)
Step 6: DFL vs. FLP:
#> # A tibble: 6 × 2
#>     DFL   FLP
#>   <dbl> <dbl>
#> 1  16.7  55.2
#> 2  18.3  55.7
#> 3  17.6  49.8
#> 4  17.9  59.1
#> 5  15.5  49.5
#> 6  17.3  67.3
Fig. 14: Network plot of the dtseq6 (part of dtseq).
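The step-by-step logic can be summarized in a few lines of base R. This is a minimal sketch on simulated data with a hypothetical chain x -> m -> y (the names and coefficients are assumptions for illustration, not the dtseq variables): each step regresses one endogenous variable on its own predictors, and an indirect effect is the product of the path coefficients along the chain.

```r
# Illustrative two-step sequential path analysis on simulated data.
set.seed(4)
n <- 300
x <- rnorm(n)                          # exogenous variable
m <- 0.7 * x + rnorm(n)                # intervening (endogenous) variable
y <- 0.5 * m + 0.2 * x + rnorm(n)      # final response

# Standardize so that regression coefficients are path coefficients
d <- as.data.frame(scale(data.frame(x, m, y)))

# Step 1: final response on its direct predictors
step1 <- coef(lm(y ~ m + x - 1, data = d))
# Step 2: intervening variable on its predictor
step2 <- coef(lm(m ~ x - 1, data = d))

direct_x   <- step1[["x"]]                   # x -> y
indirect_x <- step2[["x"]] * step1[["m"]]    # x -> m -> y
total_x    <- direct_x + indirect_x

print(c(direct = direct_x, indirect = indirect_x,
        total = total_x, r_xy = cor(x, y)))
```

In this small model the total effect of x reproduces its correlation with y, which is the decomposition that each step of the sequential analysis relies on.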
Multivariate analysis of variance (MANOVA) is used to estimate the SSCP matrices and related quantities. This requires the car package, which provides the Anova() function used below:
data(dtseqr)
dtseqr <- as.data.frame(dtseqr)
dtseqr[, 1] <- as.factor(dtseqr[, 1]) # Rep
dtseqr[, 2] <- as.factor(dtseqr[, 2]) # Genotypes
f <- lm(cbind(YLD, DFT, FS, FV, FW, DFL, FLP) ~ Rep + Genotypes, dtseqr)
summary(Anova(f)) # all results for MANOVA
#>
#> Type II MANOVA Tests:
#>
#> Sum of squares and products for error:
#>            YLD        DFT          FS        FV        FW       DFL        FLP
#> YLD  30872.750 1305.82650 -1284.15150 -402.8620 -420.6260 -213.0040 1418.89700
#> DFT   1305.827  677.12718    86.45078   -1.4006   50.1155  118.7334   69.47155
#> FS   -1284.151   86.45078   254.45997  -23.1646   24.3031   43.1752   41.54635
#> FV    -402.862   -1.40060   -23.16460   28.2584    6.8566   18.5540  -39.72960
#> FW    -420.626   50.11550    24.30310    6.8566   24.0384   22.1664  -42.02800
#> DFL   -213.004  118.73340    43.17520   18.5540   22.1664   56.4220  -35.56200
#> FLP   1418.897   69.47155    41.54635  -39.7296  -42.0280  -35.5620  271.59110
#>
#> ------------------------------------------
#>
#> Term: Rep
#>
#> Sum of squares and products for the hypothesis:
#>           YLD        DFT         FS      FV       FW    DFL       FLP
#> YLD  327.6100  195.02750  124.79950  10.860 -53.2140  7.240 -336.8410
#> DFT  195.0275  116.10063   74.29363   6.465 -31.6785  4.310 -200.5228
#> FS   124.7995   74.29363   47.54103   4.137 -20.2713  2.758 -128.3159
#> FV    10.8600    6.46500    4.13700   0.360  -1.7640  0.240  -11.1660
#> FW   -53.2140  -31.67850  -20.27130  -1.764   8.6436 -1.176   54.7134
#> DFL    7.2400    4.31000    2.75800   0.240  -1.1760  0.160   -7.4440
#> FLP -336.8410 -200.52275 -128.31595 -11.166  54.7134 -7.444  346.3321
#>
#> Multivariate Tests: Rep
#>                  Df test stat  approx F num Df den Df     Pr(>F)    
#> Pillai            2   0.98891   1.25750     14     18    0.31903    
#> Wilks             2   0.01109   9.70834     14     16 2.5259e-05 ***
#> Hotelling-Lawley  2  89.15113  44.57557     14     14 3.7404e-09 ***
#> Roy               2  89.15113 114.62289      7      9 4.5158e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ------------------------------------------
#>
#> Term: Genotypes
#>
#> Sum of squares and products for the hypothesis:
#>            YLD        DFT           FS         FV         FW         DFL
#> YLD 697553.096 5033.04937 -11241.57825 4343.21250 2541.13650 -2906.41425
#> DFT   5033.049  116.01356    -91.18987    7.00395   -1.07955   -11.41207
#> FS  -11241.578  -91.18987    926.14245    3.20730  -24.60870   134.06535
#> FV    4343.213    7.00395      3.20730   83.14020   37.53540   -13.49070
#> FW    2541.136   -1.07955    -24.60870   37.53540   23.42280    -5.21070
#> DFL  -2906.414  -11.41207    134.06535  -13.49070   -5.21070    54.84165
#> FLP   5863.115   22.55888    -56.60295  126.96570   90.78030    72.84285
#>            FLP
#> YLD 5863.11525
#> DFT   22.55888
#> FS   -56.60295
#> FV   126.96570
#> FW    90.78030
#> DFL   72.84285
#> FLP  769.10145
#>
#> Multivariate Tests: Genotypes
#>                  Df test stat approx F num Df   den Df     Pr(>F)    
#> Pillai            7     3.896    2.510     49 98.00000 5.5418e-05 ***
#> Wilks             7     0.000   16.018     49 45.03719 < 2.22e-16 ***
#> Hotelling-Lawley  7  4971.966  637.803     49 44.00000 < 2.22e-16 ***
#> Roy               7  4949.541 9899.081      7 14.00000 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Anova(f)$SSPE # prints the full SSCP matrix of error
Anova(f)$SSPE[4:5, 4:5] # SSCP matrix of error for two dependent variables, i.e. FV and FW.
#>         FV      FW
#> FV 28.2584  6.8566
#> FW  6.8566 24.0384
After performing a multivariate path coefficient analysis, it is necessary to estimate the correlation coefficient between residuals (here FV and FW are the final dependent variables) as follows. To do this, the error matrix needs to be calculated in MANOVA.
ru1u2 <- Anova(f)$SSPE[4, 5] / (sqrt(Anova(f)$SSPE[4, 4]) * sqrt(Anova(f)$SSPE[5, 5]))
cat("\nCorrelation coefficient between residuals is:\n", ru1u2)
#>
#> Correlation coefficient between residuals is:
#>  0.2630766
After performing the sequential path analyses step by step, it is time to create a sequential path diagram, which includes a multivariate path diagram. To do this, one can use the program Graphviz together with DiagrammeR. If a specific section of the sequential model is considered as a multivariate path, one can draw a multivariate path diagram (Arminian et al. 2008) and estimate the correlation coefficient between residuals (as previously estimated) as follows:
Fig. 15: Sequential univariate path diagram. It is important to note that residuals, which are estimated throughout steps 1 to 6 above, can be added to each endogenous variable.
For the full list of color names, other DiagrammeR and Graphviz symbols, and the many node/edge attributes that can be used in the diagrams, see manuals and guides such as: https://rich-iannone.github.io/DiagrammeR/articles/graphviz-mermaid.html. Also: vignettes/graphviz-mermaid.Rmd
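The residual attached to an endogenous variable is commonly summarized by the residual path coefficient \(U = \sqrt{1 - R^2}\), where \(R^2 = \sum_i P_i r_{Yi}\). The sketch below illustrates this on simulated data with made-up names and coefficients; it is not the package's own computation.

```r
# Illustrative residual path coefficient for one endogenous variable.
set.seed(5)
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.6 * x1 - 0.4 * x2 + rnorm(n)
X  <- cbind(x1, x2)

r_xy <- cor(X, y)[, 1]
P    <- solve(cor(X), r_xy)      # direct effects

R2 <- sum(P * r_xy)              # explained share of the variance of y
U  <- sqrt(1 - R2)               # residual path coefficient

# Cross-check against the R-squared of the ordinary regression
R2_lm <- summary(lm(y ~ x1 + x2))$r.squared
print(c(R2 = R2, R2_lm = R2_lm, U = U))
```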
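As a rough idea of what a hand-written diagram specification looks like, the sketch below assembles a Graphviz DOT string in base R. The helper `path_dot()` and the coefficient values are hypothetical and only illustrate the syntax; rendering through `DiagrammeR::grViz()` is attempted only in an interactive session where DiagrammeR is installed.

```r
# Hypothetical helper: build a Graphviz DOT description of a simple path diagram
path_dot <- function(response, direct) {
  preds <- names(direct)
  # Single-headed arrows: direct effects, labelled with the path coefficient
  edges <- sprintf('  "%s" -> "%s" [label = "%.2f"];', preds, response, direct)
  # Double-headed dashed arrows: correlations between predictors
  pairs <- if (length(preds) > 1) utils::combn(preds, 2) else matrix(character(0), 2, 0)
  corrs <- sprintf('  "%s" -> "%s" [dir = both, style = dashed];',
                   pairs[1, ], pairs[2, ])
  paste(c("digraph path {", "  rankdir = LR;", "  node [shape = box];",
          edges, corrs, "}"), collapse = "\n")
}

# Made-up coefficients, for illustration only
dot <- path_dot("Y", c(X1 = 0.42, X2 = -0.06, X3 = -0.01))
cat(dot, "\n")

# Optional rendering, skipped when DiagrammeR is unavailable
if (interactive() && requireNamespace("DiagrammeR", quietly = TRUE)) {
  print(DiagrammeR::grViz(dot))
}
```

Keeping the DOT text as a plain string makes it easy to edit colors, shapes or extra residual nodes by hand before rendering.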
Fig. 16: The sequential multivariate path Diagram
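The residual correlation shown in the multivariate diagram can be reproduced directly from the FV/FW block of the error SSCP matrix printed in the MANOVA output above. A small base-R sketch using `cov2cor()`, with the four values copied from that output:

```r
# Error SSCP block for FV and FW, copied from the MANOVA output
sspe <- matrix(c(28.2584,  6.8566,
                  6.8566, 24.0384),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("FV", "FW"), c("FV", "FW")))

# Scaling an SSCP matrix to unit diagonal gives the residual correlation
r_u1u2 <- cov2cor(sspe)["FV", "FW"]
print(r_u1u2)
```

This is the same quantity as the ratio SSPE[4, 5] / sqrt(SSPE[4, 4] * SSPE[5, 5]) computed earlier.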
Notice: Users can also see the 'lavaan' package in R, the simple 'PATHSAS' code written by Cramer et al. (Cramer, Wehner, and Donaghy 1999), and the 'semPlot' and 'OpenMx' packages as initial tools for conducting path analyses and SEM (Structural Equation Modeling).