Mathematical Framework for latentcor

Mingze Huang, Christian L. Müller, Irina Gaynanova

2025-11-18

Latent Gaussian Copula Model for Mixed Data

latentcor utilizes the powerful semi-parametric latent Gaussian copula models to estimate latent correlations between mixed data types (continuous/binary/ternary/truncated or zero-inflated). Below we review the definitions for each type.

Definition of continuous model (Fan et al. 2017)

A random \(X\in\cal{R}^{p}\) satisfies the Gaussian copula (or nonparanormal) model if there exist monotonically increasing \(f=(f_{j})_{j=1}^{p}\) with \(Z_{j}=f_{j}(X_{j})\) satisfying \(Z\sim N_{p}(0, \Sigma)\), \(\sigma_{jj}=1\); we denote \(X\sim NPN(0, \Sigma, f)\).

X=gen_data(n =6,types ="con")$XX#>            [,1]#> [1,]  0.2949394#> [2,] -1.3428070#> [3,] -0.7989761#> [4,]  0.2237552#> [5,]  0.8093799#> [6,]  0.8812656

Definition of binary model (Fan et al. 2017)

A random \(X\in\cal{R}^{p}\) satisfies the binary latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})\), where \(I(\cdot)\) is the indicator function and \(c_{j}\) are constants.

X=gen_data(n =6,types ="bin")$XX#>      [,1]#> [1,]    1#> [2,]    0#> [3,]    1#> [4,]    1#> [5,]    0#> [6,]    0

Definition of ternary model (Quan, Booth, and Wells 2018)

A random \(X\in\cal{R}^{p}\) satisfies the ternary latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})+I(W_{j}>c'_{j})\), where \(I(\cdot)\) is the indicator function and \(c_{j}<c'_{j}\) are constants.

X=gen_data(n =6,types ="ter")$XX#>      [,1]#> [1,]    0#> [2,]    1#> [3,]    0#> [4,]    1#> [5,]    1#> [6,]    2

Definition of truncated or zero-inflated model (Yoon, Carroll, and Gaynanova 2020)

A random \(X\in\cal{R}^{p}\) satisfies the truncated latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})W_{j}\), where \(I(\cdot)\) is the indicator function and \(c_{j}\) are constants.

X=gen_data(n =6,types ="tru")$XX#>            [,1]#> [1,] 0.00000000#> [2,] 0.60456858#> [3,] 2.40780850#> [4,] 0.00000000#> [5,] 0.00000000#> [6,] 0.04778851

Mixed latent Gaussian copula model

The mixed latent Gaussian copula model jointly models \(W=(W_{1}, W_{2}, W_{3}, W_{4})\sim NPN(0, \Sigma, f)\) such that \(X_{1j}=W_{1j}\), \(X_{2j}=I(W_{2j}>c_{2j})\), \(X_{3j}=I(W_{3j}>c_{3j})+I(W_{3j}>c'_{3j})\) and \(X_{4j}=I(W_{4j}>c_{4j})W_{4j}\).

set.seed("234820")X=gen_data(n =100,types =c("con","bin","ter","tru"))$Xhead(X)#>            [,1] [,2] [,3]      [,4]#> [1,] -0.2780040    0    0 0.0000000#> [2,] -0.8939372    0    0 0.0000000#> [3,]  0.3838027    1    1 0.3857133#> [4,] -1.8276629    1    2 0.0000000#> [5,] -0.7248362    0    1 0.0000000#> [6,] -0.8752912    0    0 0.0000000

Moment-based estimation of \(\Sigma\) based on bridge functions

The estimation of the latent correlation matrix \(\Sigma\) is achieved via the bridge function \(F\), which is defined such that \(E(\hat{\tau}_{jk})=F(\sigma_{jk})\), where \(\sigma_{jk}\) is the latent correlation between variables \(j\) and \(k\), and \(\hat{\tau}_{jk}\) is the corresponding sample Kendall's \(\tau\).

Kendall's \(\tau\) (\(\tau_{a}\))

Given observed \(\mathbf{x}_{j}, \mathbf{x}_{k}\in\cal{R}^{n}\),

\[\hat{\tau}_{jk}=\hat{\tau}(\mathbf{x}_{j},\mathbf{x}_{k})=\frac{2}{n(n-1)}\sum_{1\le i<i'\le n}sign(x_{ij}-x_{i'j})\,sign(x_{ik}-x_{i'k}),\] where \(n\) is the sample size.
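For concreteness, here is a minimal sketch of \(\hat{\tau}_{a}\) computed directly from this definition (the helper name tau_a is ours; latentcor computes Kendall's \(\tau\) internally with a faster implementation, and note that cor(..., method = "kendall") returns \(\tau_{b}\), which differs from \(\tau_{a}\) in the presence of ties):

tau_a = function(x, y) {
  n = length(x)
  s = 0
  # sum the signs of concordance over all pairs i < i'
  for (i in 1:(n - 1)) {
    s = s + sum(sign(x[i] - x[(i + 1):n]) * sign(y[i] - y[(i + 1):n]))
  }
  2 * s / (n * (n - 1))
}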

latentcor calculates pairwise Kendall's \(\widehat\tau\) as part of the estimation process:

estimate = latentcor(X, types = c("con", "bin", "ter", "tru"))
K = estimate$K
K
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.1826263 0.2347475 0.2963636
#> [2,] 0.1826263 1.0000000 0.2121212 0.1866667
#> [3,] 0.2347475 0.2121212 1.0000000 0.2842424
#> [4,] 0.2963636 0.1866667 0.2842424 1.0000000

Using \(F\) and \(\widehat\tau_{jk}\), a moment-based estimator is \(\hat{\sigma}_{jk}=F^{-1}(\hat{\tau}_{jk})\) with the corresponding \(\hat{\Sigma}\) being consistent for \(\Sigma\) (Fan et al. 2017; Quan, Booth, and Wells 2018; Yoon, Carroll, and Gaynanova 2020).
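For a pair of continuous variables, the inversion is available in closed form: as given in the theorem below, \(F_{CC}(r)=2\sin^{-1}(r)/\pi\), so \(\hat{\sigma}_{jk}=\sin(\pi\hat{\tau}_{jk}/2)\). A minimal sketch with a hypothetical value of \(\hat{\tau}\):

tau_hat = 0.5                      # hypothetical sample Kendall's tau for a CC pair
sigma_hat = sin(pi * tau_hat / 2)  # closed-form F_CC^{-1}(tau_hat)
sigma_hat
#> [1] 0.7071068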

The explicit form of the bridge function \(F\) has been derived for all combinations of continuous(C)/binary(B)/ternary(N)/truncated(T) variable types, and we summarize the corresponding references below. Each of these combinations is implemented in latentcor.

Type | continuous | binary | ternary | zero-inflated (truncated)
---- | ---------- | ------ | ------- | -------------------------
continuous | Liu, Lafferty, and Wasserman (2009) | - | - | -
binary | Fan et al. (2017) | Fan et al. (2017) | - | -
ternary | Quan, Booth, and Wells (2018) | Quan, Booth, and Wells (2018) | Quan, Booth, and Wells (2018) | -
zero-inflated (truncated) | Yoon, Carroll, and Gaynanova (2020) | Yoon, Carroll, and Gaynanova (2020) | See Appendix | Yoon, Carroll, and Gaynanova (2020)

Below we provide an explicit form of \(F\) for each combination.

Theorem (explicit form of bridge function). Let \(W_{1}\in\cal{R}^{p_{1}}\), \(W_{2}\in\cal{R}^{p_{2}}\), \(W_{3}\in\cal{R}^{p_{3}}\), \(W_{4}\in\cal{R}^{p_{4}}\) be such that \(W=(W_{1}, W_{2}, W_{3}, W_{4})\sim NPN(0, \Sigma, f)\) with \(p=p_{1}+p_{2}+p_{3}+p_{4}\). Let \(X=(X_{1}, X_{2}, X_{3}, X_{4})\in\cal{R}^{p}\) satisfy \(X_{j}=W_{j}\) for \(j=1,...,p_{1}\); \(X_{j}=I(W_{j}>c_{j})\) for \(j=p_{1}+1, ..., p_{1}+p_{2}\); \(X_{j}=I(W_{j}>c_{j})+I(W_{j}>c'_{j})\) for \(j=p_{1}+p_{2}+1, ..., p_{1}+p_{2}+p_{3}\); and \(X_{j}=I(W_{j}>c_{j})W_{j}\) for \(j=p_{1}+p_{2}+p_{3}+1, ..., p\), with \(\Delta_{j}=f(c_{j})\). The rank-based estimator of \(\Sigma\) based on the observed \(n\) realizations of \(X\) is the matrix \(\mathbf{\hat{R}}\) with \(\hat{r}_{jj}=1\), \(\hat{r}_{jk}=\hat{r}_{kj}=F^{-1}(\hat{\tau}_{jk})\) with block structure

\[\mathbf{\hat{R}}=\begin{pmatrix}
F_{CC}^{-1}(\hat{\tau}) & F_{CB}^{-1}(\hat{\tau}) & F_{CN}^{-1}(\hat{\tau}) & F_{CT}^{-1}(\hat{\tau})\\
F_{BC}^{-1}(\hat{\tau}) & F_{BB}^{-1}(\hat{\tau}) & F_{BN}^{-1}(\hat{\tau}) & F_{BT}^{-1}(\hat{\tau})\\
F_{NC}^{-1}(\hat{\tau}) & F_{NB}^{-1}(\hat{\tau}) & F_{NN}^{-1}(\hat{\tau}) & F_{NT}^{-1}(\hat{\tau})\\
F_{TC}^{-1}(\hat{\tau}) & F_{TB}^{-1}(\hat{\tau}) & F_{TN}^{-1}(\hat{\tau}) & F_{TT}^{-1}(\hat{\tau})
\end{pmatrix}\]
\[F(\cdot)=\begin{cases}
CC:\ 2\sin^{-1}(r)/\pi \\
BC:\ 4\Phi_{2}(\Delta_{j},0;r/\sqrt{2})-2\Phi(\Delta_{j}) \\
BB:\ 2\{\Phi_{2}(\Delta_{j},\Delta_{k};r)-\Phi(\Delta_{j})\Phi(\Delta_{k})\} \\
NC:\ 4\Phi_{2}(\Delta_{j}^{2},0;r/\sqrt{2})-2\Phi(\Delta_{j}^{2})+4\Phi_{3}(\Delta_{j}^{1},\Delta_{j}^{2},0;\Sigma_{3a}(r))-2\Phi(\Delta_{j}^{1})\Phi(\Delta_{j}^{2}) \\
NB:\ 2\Phi_{2}(\Delta_{j}^{2},\Delta_{k};r)\{1-\Phi(\Delta_{j}^{1})\}-2\Phi(\Delta_{j}^{2})\{\Phi(\Delta_{k})-\Phi_{2}(\Delta_{j}^{1},\Delta_{k};r)\} \\
NN:\ 2\Phi_{2}(\Delta_{j}^{2},\Delta_{k}^{2};r)\Phi_{2}(-\Delta_{j}^{1},-\Delta_{k}^{1};r)-2\{\Phi(\Delta_{j}^{2})-\Phi_{2}(\Delta_{j}^{2},\Delta_{k}^{1};r)\}\{\Phi(\Delta_{k}^{2})-\Phi_{2}(\Delta_{j}^{1},\Delta_{k}^{2};r)\} \\
TC:\ -2\Phi_{2}(-\Delta_{j},0;1/\sqrt{2})+4\Phi_{3}(-\Delta_{j},0,0;\Sigma_{3b}(r)) \\
TB:\ 2\{1-\Phi(\Delta_{j})\}\Phi(\Delta_{k})-2\Phi_{3}(-\Delta_{j},\Delta_{k},0;\Sigma_{3c}(r))-2\Phi_{3}(-\Delta_{j},\Delta_{k},0;\Sigma_{3d}(r)) \\
TN:\ -2\Phi(-\Delta_{k}^{1})\Phi(\Delta_{k}^{2})+2\Phi_{3}(-\Delta_{k}^{1},\Delta_{k}^{2},\Delta_{j};\Sigma_{3e}(r))+2\Phi_{4}(-\Delta_{k}^{1},\Delta_{k}^{2},-\Delta_{j},0;\Sigma_{4a}(r))+2\Phi_{4}(-\Delta_{k}^{1},\Delta_{k}^{2},-\Delta_{j},0;\Sigma_{4b}(r)) \\
TT:\ -2\Phi_{4}(-\Delta_{j},-\Delta_{k},0,0;\Sigma_{4c}(r))+2\Phi_{4}(-\Delta_{j},-\Delta_{k},0,0;\Sigma_{4d}(r))
\end{cases}\]

where \(\Delta_{j}=\Phi^{-1}(\pi_{0j})\), \(\Delta_{k}=\Phi^{-1}(\pi_{0k})\), \(\Delta_{j}^{1}=\Phi^{-1}(\pi_{0j})\), \(\Delta_{j}^{2}=\Phi^{-1}(\pi_{0j}+\pi_{1j})\), \(\Delta_{k}^{1}=\Phi^{-1}(\pi_{0k})\), \(\Delta_{k}^{2}=\Phi^{-1}(\pi_{0k}+\pi_{1k})\),

\[\Sigma_{3a}(r)=\begin{pmatrix}
1 & 0 & \frac{r}{\sqrt{2}} \\
0 & 1 & -\frac{r}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & 1
\end{pmatrix}, \;\;\;
\Sigma_{3b}(r)=\begin{pmatrix}
1 & \frac{1}{\sqrt{2}} & \frac{r}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & 1 & r \\
\frac{r}{\sqrt{2}} & r & 1
\end{pmatrix}, \;\;\;
\Sigma_{3c}(r)=\begin{pmatrix}
1 & -r & \frac{1}{\sqrt{2}} \\
-r & 1 & -\frac{r}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & 1
\end{pmatrix},\]

\[\Sigma_{3d}(r)=\begin{pmatrix}
1 & 0 & -\frac{1}{\sqrt{2}} \\
0 & 1 & -\frac{r}{\sqrt{2}} \\
-\frac{1}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & 1
\end{pmatrix}, \;\;\;
\Sigma_{3e}(r)=\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & r \\
0 & r & 1
\end{pmatrix}, \;\;\;
\Sigma_{4a}(r)=\begin{pmatrix}
1 & 0 & 0 & \frac{r}{\sqrt{2}} \\
0 & 1 & -r & \frac{r}{\sqrt{2}} \\
0 & -r & 1 & -\frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1
\end{pmatrix},\]

\[\Sigma_{4b}(r)=\begin{pmatrix}
1 & 0 & r & \frac{r}{\sqrt{2}} \\
0 & 1 & 0 & \frac{r}{\sqrt{2}} \\
r & 0 & 1 & \frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 1
\end{pmatrix}, \;\;\;
\Sigma_{4c}(r)=\begin{pmatrix}
1 & 0 & \frac{1}{\sqrt{2}} & -\frac{r}{\sqrt{2}} \\
0 & 1 & -\frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & 1 & -r \\
-\frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & -r & 1
\end{pmatrix}
\;\;\text{and}\;\;
\Sigma_{4d}(r)=\begin{pmatrix}
1 & r & \frac{1}{\sqrt{2}} & \frac{r}{\sqrt{2}} \\
r & 1 & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\
\frac{1}{\sqrt{2}} & \frac{r}{\sqrt{2}} & 1 & r \\
\frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & r & 1
\end{pmatrix}.\]
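As an illustration of the theorem, the BC bridge function can be evaluated numerically with the mvtnorm package. This is a sketch: the helper name F_bc is ours, and latentcor's internal implementation may differ.

library(mvtnorm)
# F_BC(r; Delta_j) = 4 * Phi_2(Delta_j, 0; r / sqrt(2)) - 2 * Phi(Delta_j)
F_bc = function(r, delta_j) {
  corr = matrix(c(1, r / sqrt(2), r / sqrt(2), 1), 2, 2)
  4 * pmvnorm(upper = c(delta_j, 0), corr = corr) - 2 * pnorm(delta_j)
}
F_bc(r = 0.5, delta_j = 0)  # approximately 0.23 for a zero latent threshold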

Estimation methods

Given the form of the bridge function \(F\), obtaining the moment-based estimator \(\widehat\sigma_{jk}\) requires inversion of \(F\). latentcor implements two methods for calculating this inverse:

- method = "original": numerical inversion of \(F\), performed separately for each pair of variables (see the Original method subsection below);
- method = "approx": multi-linear interpolation of pre-computed values of \(F^{-1}\) on a fixed grid (see the Approximation method subsection below).

Both methods calculate the inverse bridge function applied to each element of the sample Kendall's \(\tau\) matrix. Because the calculation is performed point-wise (separately for each pair of variables), the resulting point-wise estimator of the correlation matrix may not be positive semi-definite. latentcor performs a projection of the point-wise estimator to the space of positive semi-definite matrices, and allows for shrinkage towards the identity matrix using the parameter nu (see the subsection describing the adjustment of the point-wise estimator and the relevant parameter nu).

Original method (method = "original")

The original estimation approach relies on numerical inversion of \(F\) by solving a one-dimensional optimization problem. Given the calculated \(\widehat\tau_{jk}\) (sample Kendall's \(\tau\) between variables \(j\) and \(k\)), the estimate of the latent correlation \(\widehat\sigma_{jk}\) is obtained by calling the optimize function to solve the following optimization problem: \[\widehat r_{jk} = \arg\min_{r} \{F(r) - \widehat\tau_{jk}\}^2.\] The parameter tol controls the desired accuracy of the minimizer and is passed to optimize, with a default precision of 1e-8:

estimate = latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original", tol = 1e-8)
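Internally, each point-wise inversion then reduces to a one-dimensional optimize call. A sketch using the F_bc helper from the previous section (the bracketing interval is our assumption for illustration):

tau_jk = 0.2  # one entry of the sample Kendall's tau matrix
fit = optimize(function(r) (F_bc(r, delta_j = 0) - tau_jk)^2,
               interval = c(-0.999, 0.999), tol = 1e-8)
fit$minimum   # point-wise estimate of the latent correlation, approximately 0.437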

Algorithm for Original method

Input: \(F(r)=F(r,\mathbf{\Delta})\) - bridge function based on the types of variables \(j\), \(k\).

Step 1: Calculate the sample Kendall's \(\hat{\tau}_{jk}\) (returned as K).
Step 2: Estimate \(\hat{\mathbf{\Delta}}\) from the observed proportions of levels of the binary/ternary/truncated variables (returned as zratios).
Step 3: Compute \(\hat{r}_{jk}=F^{-1}(\hat{\tau}_{jk})\) by solving the optimization problem above with optimize (returned as Rpointwise).

estimate$K
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.1826263 0.2347475 0.2963636
#> [2,] 0.1826263 1.0000000 0.2121212 0.1866667
#> [3,] 0.2347475 0.2121212 1.0000000 0.2842424
#> [4,] 0.2963636 0.1866667 0.2842424 1.0000000
estimate$zratios
#> [[1]]
#> [1] NA
#>
#> [[2]]
#> [1] 0.5
#>
#> [[3]]
#> [1] 0.3 0.8
#>
#> [[4]]
#> [1] 0.5
estimate$Rpointwise
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.4001521 0.4291095 0.5239691
#> [2,] 0.4001521 1.0000000 0.5454664 0.4722794
#> [3,] 0.4291095 0.5454664 1.0000000 0.5952768
#> [4,] 0.5239691 0.4722794 0.5952768 1.0000000
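The zratios output above holds the estimated cumulative proportions that determine \(\hat{\mathbf{\Delta}}\) (NA for the continuous variable, one value for binary/truncated, two values for ternary); for example, for the binary variable \(\hat{\Delta}_{j}=\Phi^{-1}(\hat{\pi}_{0j})\) with \(\hat{\pi}_{0j}\) the observed proportion of zeros. A sketch of this moment step:

pi0 = mean(X[, 2] == 0)  # zratio of the binary column: proportion of zeros (0.5 here)
delta_hat = qnorm(pi0)   # Delta-hat = Phi^{-1}(pi0); equal to 0 when pi0 = 0.5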

Approximation method (method = "approx")

A faster approximation method is based on multi-linear interpolation of the pre-computed inverse bridge function on a fixed grid of points (Yoon, Müller, and Gaynanova 2021). This is possible as the inverse bridge function is an analytic function of at most 5 parameters: the Kendall's \(\tau\) value and up to four \(\Delta\) parameters (one per binary or truncated variable, two per ternary variable), as listed in the algorithm below.

In short, d-dimensional multi-linear interpolation uses a weighted average of \(2^{d}\) neighbors to approximate the function values at the points within the d-dimensional cube of the neighbors; to perform interpolation, latentcor takes advantage of the R package chebpol (Gaure 2019). This approximation method was first described in Yoon, Müller, and Gaynanova (2021) for the continuous/binary/truncated cases. In latentcor, we additionally implement the ternary case, and optimize the choice of grid as well as the interpolation boundary for faster computations with a smaller memory footprint.
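For intuition, here is a minimal bilinear (\(d=2\)) sketch of the weighted average over the \(2^{d}\) neighboring grid points; it is independent of chebpol, which latentcor uses in practice, and assumes the query point lies strictly inside the grid:

# bilinear interpolation of tabulated values fg on grid (xg, yg) at point (x, y)
bilinear = function(xg, yg, fg, x, y) {
  i = findInterval(x, xg)                  # left neighbor index along x
  j = findInterval(y, yg)                  # left neighbor index along y
  wx = (x - xg[i]) / (xg[i + 1] - xg[i])   # relative position within the cell
  wy = (y - yg[j]) / (yg[j + 1] - yg[j])
  (1 - wx) * (1 - wy) * fg[i, j] + wx * (1 - wy) * fg[i + 1, j] +
    (1 - wx) * wy * fg[i, j + 1] + wx * wy * fg[i + 1, j + 1]
}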

# estimate = latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx")

Algorithm for Approximation method

Input: Let \(\check{g}=h(g)\) denote pre-computed values \(F^{-1}(h^{-1}(\check{g}))\) on a fixed grid \(\check{g}\in\check{\cal{G}}\), based on the types of variables \(j\) and \(k\):

- binary/continuous: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j})\);
- binary/binary: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k})\);
- truncated/continuous: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j})\);
- truncated/truncated: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k})\);
- ternary/continuous: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2})\);
- ternary/binary: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k})\);
- ternary/truncated: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k})\);
- ternary/ternary: \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k}^{1}, \check{\Delta}_{k}^{2})\).

To avoid interpolation in areas with high approximation errors close to the boundary, we use a hybrid scheme in Step 3: interpolation is only used when \(|\hat{\tau}_{jk}|\) is small relative to its approximate upper bound \(\bar{\tau}_{jk}\) (given below), and the numerical inversion of the original method is used otherwise. The parameter ratio controls the size of the region where the interpolation is performed (ratio = 0 means no interpolation, ratio = 1 means interpolation is always performed). For the derivation of the approximate bound for the BC, BB, TC, TB, TT cases, see Yoon, Müller, and Gaynanova (2021). The derivation of the approximate bound for the NC, NB, NN, NT cases is in the Appendix.

\[\bar{\tau}_{jk}(\cdot)=\begin{cases}
2\pi_{0j}(1-\pi_{0j}) & \text{for BC case}\\
2\min(\pi_{0j},\pi_{0k})\{1-\max(\pi_{0j}, \pi_{0k})\} & \text{for BB case}\\
2\{\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j})\} & \text{for NC case}\\
2\min(\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j}),\ \pi_{0k}(1-\pi_{0k})) & \text{for NB case}\\
2\min(\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j}),\ \pi_{0k}(1-\pi_{0k})+\pi_{1k}(1-\pi_{0k}-\pi_{1k})) & \text{for NN case}\\
1-(\pi_{0j})^{2} & \text{for TC case}\\
2\max(\pi_{0k},1-\pi_{0k})\{1-\max(\pi_{0k},1-\pi_{0k},\pi_{0j})\} & \text{for TB case}\\
1-\{\max(\pi_{0j},\pi_{0k},\pi_{1k},1-\pi_{0k}-\pi_{1k})\}^{2} & \text{for TN case}\\
1-\{\max(\pi_{0j},\pi_{0k})\}^{2} & \text{for TT case}
\end{cases}\]

By default, latentcor uses ratio = 0.9, as this value was recommended in Yoon, Müller, and Gaynanova (2021) as having a good balance of accuracy and computational speed. This value, however, can be modified by the user.

# latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.99)$R
# latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.4)$R
latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.3997519 0.4286803 0.5234451
#> [2,] 0.3997519 1.0000000 0.5449209 0.4718071
#> [3,] 0.4286803 0.5449209 1.0000000 0.5946815
#> [4,] 0.5234451 0.4718071 0.5946815 1.0000000

The lower the ratio, the closer the approximation method is to the original method (ratio = 0 is equivalent to method = "original"), but also the higher the computational cost.

library(microbenchmark)
# microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.99)$R)
# microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.4)$R)
microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R)
#> Unit: milliseconds
#>                                                                        expr
#>  latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R
#>       min       lq     mean   median       uq      max neval
#>  12.29274 12.38659 12.86126 12.59087 13.00323 16.91262   100

Rescaled Grid for Interpolation

Since \(|\hat{\tau}|\le\bar{\tau}\), the grid does not need to cover the whole domain \(\tau\in[-1, 1]\). To optimize the memory associated with storing the grid, we rescale \(\tau\) as follows: \(\check{\tau}_{jk}=\tau_{jk}/\bar{\tau}_{jk}\in[-1,1]\), where \(\bar{\tau}_{jk}\) is as defined above.

In addition, for a ternary variable \(j\), it always holds that \(\Delta_{j}^{2}>\Delta_{j}^{1}\) since \(\Delta_{j}^{1}=\Phi^{-1}(\pi_{0j})\) and \(\Delta_{j}^{2}=\Phi^{-1}(\pi_{0j}+\pi_{1j})\). Thus, the grid should not cover the area corresponding to \(\Delta_{j}^{2}\le\Delta_{j}^{1}\). We thus rescale as follows: \(\check{\Delta}_{j}^{1}=\Delta_{j}^{1}/\Delta_{j}^{2}\in[0,1]\); \(\check{\Delta}_{j}^{2}=\Delta_{j}^{2}\in[0,1]\).
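A sketch of both rescalings for a hypothetical ternary variable (the proportions and the raw \(\hat{\tau}\) value are chosen purely for illustration):

pi0 = 0.6; pi1 = 0.3                  # hypothetical ternary proportions
delta1 = qnorm(pi0)                   # Delta^1
delta2 = qnorm(pi0 + pi1)             # Delta^2 > Delta^1
bar_tau = 2 * (pi0 * (1 - pi0) + pi1 * (1 - pi0 - pi1))  # NC bound from above
check_tau = 0.25 / bar_tau            # rescaled tau for a raw tau-hat of 0.25
check_delta1 = delta1 / delta2        # rescaled first cutoff
check_delta2 = delta2                 # second cutoff kept as is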

Speed Comparison

To illustrate the speed improvement by method = "approx", we plot the run time scaling behavior of method = "approx" and method = "original" (setting types for gen_data by replicating c("con", "bin", "ter", "tru") multiple times) with increasing dimension \(p = [20, 40, 100, 200, 400]\) at sample size \(n = 100\) using simulated data. The figure below summarizes the observed scaling in a log-log plot. For both methods we observe the expected \(O(p^2)\) scaling with dimension \(p\), i.e., a linear trend in the log-log plot. However, method = "approx" is at least one order of magnitude faster than method = "original", independent of the dimension of the problem.

Adjustment of pointwise-estimator for positive-definiteness

Since the estimation is performed point-wise, the resulting matrix of estimated latent correlations is not guaranteed to be positive semi-definite. For example, this could be expected when the sample size is small (so that the estimation error for each pairwise correlation is larger):

set.seed("234820")X=gen_data(n =6,types =c("con","bin","ter","tru"))$XX#>             [,1] [,2] [,3]      [,4]#> [1,] -0.16632675    1    1 0.1478548#> [2,] -0.61463940    0    0 0.0000000#> [3,]  0.02495623    1    2 1.3444076#> [4,]  1.03744038    0    0 0.0000000#> [5,] -0.73599650    0    1 0.0000000#> [6,] -1.74626874    1    1 0.1778577out=latentcor(X,types =c("con","bin","ter","tru"))out$Rpointwise#>            [,1]       [,2]       [,3]      [,4]#> [1,]  1.0000000 -0.1477240 -0.1254816 0.0000000#> [2,] -0.1477240  1.0000000  0.9983941 0.9990000#> [3,] -0.1254816  0.9983941  1.0000000 0.9987533#> [4,]  0.0000000  0.9990000  0.9987533 1.0000000eigen(out$Rpointwise)$values#> [1]  3.009835688  1.000242414  0.001632933 -0.011711035

latentcor automatically corrects the point-wise estimator to be positive definite by making two adjustments. First, if Rpointwise has a smallest eigenvalue less than zero, latentcor projects this matrix to the nearest positive semi-definite matrix. The user is notified of this adjustment through a message (suppressed in the previous code chunk), e.g.

out = latentcor(X, types = c("con", "bin", "ter", "tru"))

Second, latentcor shrinks the adjusted matrix of correlations towards the identity matrix using the parameter \(\nu\) with default value 0.001 (nu = 0.001), so that the resulting R is strictly positive definite with minimal eigenvalue greater than or equal to \(\nu\). That is, \[R = (1 - \nu) \widetilde R + \nu I,\] where \(\widetilde R\) is the nearest positive semi-definite matrix to Rpointwise.
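A sketch of these two adjustments using Matrix::nearPD (latentcor's internal projection may differ in details such as the convergence tolerance):

library(Matrix)
nu = 0.001
# project the point-wise estimator to the nearest positive semi-definite correlation matrix
R_tilde = as.matrix(nearPD(out$Rpointwise, corr = TRUE)$mat)
R_adj = (1 - nu) * R_tilde + nu * diag(nrow(R_tilde))  # shrink towards the identity
min(eigen(R_adj)$values)  # now at least nu (up to numerical error)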

out = latentcor(X, types = c("con", "bin", "ter", "tru"), nu = 0.001)
out$Rpointwise
#>            [,1]       [,2]       [,3]      [,4]
#> [1,]  1.0000000 -0.1477240 -0.1254816 0.0000000
#> [2,] -0.1477240  1.0000000  0.9983941 0.9990000
#> [3,] -0.1254816  0.9983941  1.0000000 0.9987533
#> [4,]  0.0000000  0.9990000  0.9987533 1.0000000
out$R
#>              [,1]       [,2]       [,3]         [,4]
#> [1,]  1.000000000 -0.1462282 -0.1245497 -0.002133664
#> [2,] -0.146228204  1.0000000  0.9987604  0.988550037
#> [3,] -0.124549686  0.9987604  1.0000000  0.991469239
#> [4,] -0.002133664  0.9885500  0.9914692  1.000000000

As a result, R and Rpointwise can be quite different when the sample size \(n\) is small. When \(n\) is large and \(p\) is moderate, the difference is typically driven by the parameter nu.

set.seed("234820")X=gen_data(n =100,types =c("con","bin","ter","tru"))$Xout=latentcor(X,types =c("con","bin","ter","tru"),nu =0.001)out$Rpointwise#>           [,1]      [,2]      [,3]      [,4]#> [1,] 1.0000000 0.4001181 0.4288656 0.5232826#> [2,] 0.4001181 1.0000000 0.5471851 0.4710536#> [3,] 0.4288656 0.5471851 1.0000000 0.5830118#> [4,] 0.5232826 0.4710536 0.5830118 1.0000000out$R#>           [,1]      [,2]      [,3]      [,4]#> [1,] 1.0000000 0.3997180 0.4284367 0.5227594#> [2,] 0.3997180 1.0000000 0.5466379 0.4705825#> [3,] 0.4284367 0.5466379 1.0000000 0.5824288#> [4,] 0.5227594 0.4705825 0.5824288 1.0000000

Appendix

Derivation of bridge function \(F\) for ternary/truncated case

Without loss of generality, let \(j=1\) and \(k=2\). By the definition of Kendall's \(\tau\),
\[\tau_{12}=E(\hat{\tau}_{12})=E\left[\frac{2}{n(n-1)}\sum_{1\leq i<i'\leq n} sign\{(X_{i1}-X_{i'1})(X_{i2}-X_{i'2})\}\right].\]
Since \(X_{1}\) is ternary,
\[\begin{align}
&sign(X_{1}-X_{1}') \\
=&[I(U_{1}>C_{11},U_{1}'\leq C_{11})+I(U_{1}>C_{12},U_{1}'\leq C_{12})-I(U_{1}>C_{12},U_{1}'\leq C_{11})] \\
&-[I(U_{1}\leq C_{11}, U_{1}'>C_{11})+I(U_{1}\leq C_{12}, U_{1}'>C_{12})-I(U_{1}\leq C_{11}, U_{1}'>C_{12})] \\
=&[I(U_{1}>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{11})+I(U_{1}>C_{12})-I(U_{1}>C_{12},U_{1}'>C_{12}) \\
&-I(U_{1}>C_{12})+I(U_{1}>C_{12},U_{1}'>C_{11})] \\
&-[I(U_{1}'>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{11})+I(U_{1}'>C_{12})-I(U_{1}>C_{12},U_{1}'>C_{12}) \\
&-I(U_{1}'>C_{12})+I(U_{1}>C_{11},U_{1}'>C_{12})] \\
=&I(U_{1}>C_{11})+I(U_{1}>C_{12},U_{1}'>C_{11})-I(U_{1}'>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{12}) \\
=&I(U_{1}>C_{11},U_{1}'\leq C_{12})-I(U_{1}'>C_{11},U_{1}\leq C_{12}).
\end{align}\]
Since \(X_{2}\) is truncated, \(C_{2}>0\) and
\[\begin{align}
sign(X_{2}-X_{2}')=&-I(X_{2}=0,X_{2}'>0)+I(X_{2}>0,X_{2}'=0) \\
&+I(X_{2}>0,X_{2}'>0)sign(X_{2}-X_{2}') \\
=&-I(X_{2}=0)+I(X_{2}'=0)+I(X_{2}>0,X_{2}'>0)sign(X_{2}-X_{2}').
\end{align}\]
Since \(f\) is monotonically increasing, \(sign(X_{2}-X_{2}')=sign(Z_{2}-Z_{2}')\), so
\[\begin{align}
\tau_{12}=&E[I(U_{1}>C_{11},U_{1}'\leq C_{12})sign(X_{2}-X_{2}')] \\
&-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})sign(X_{2}-X_{2}')] \\
=&-E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}=0)] \\
&+E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}'=0)] \\
&+E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \\
&+E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}=0)] \\
&-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}'=0)] \\
&-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \\
=&-2E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}=0)] \\
&+2E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}'=0)] \\
&+E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \\
&-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')].
\end{align}\]
From the definition of \(U\), let \(Z_{j}=f_{j}(U_{j})\) and \(\Delta_{j}=f_{j}(C_{j})\) for \(j=1,2\). Using \(sign(x)=2I(x>0)-1\), we obtain
\[\begin{align}
\tau_{12}=&-2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}\leq\Delta_{2})]+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}'\leq\Delta_{2})] \\
&+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12})I(Z_{2}>\Delta_{2},Z_{2}'>\Delta_{2},Z_{2}-Z_{2}'>0)] \\
&-2E[I(Z_{1}'>\Delta_{11},Z_{1}\leq\Delta_{12})I(Z_{2}>\Delta_{2},Z_{2}'>\Delta_{2},Z_{2}-Z_{2}'>0)] \\
=&-2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}\leq\Delta_{2})]+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}'\leq\Delta_{2})] \\
&+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}'>\Delta_{2},Z_{2}>Z_{2}')] \\
&-2E[I(Z_{1}'>\Delta_{11},Z_{1}\leq\Delta_{12},Z_{2}'>\Delta_{2},Z_{2}>Z_{2}')].
\end{align}\]
Since \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}},-Z_{1}\}\), \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}},Z_{1}'\}\) and \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}},-Z_{2}'\}\) are standard bivariate normally distributed variables with correlations \(-\frac{1}{\sqrt{2}}\), \(\frac{r}{\sqrt{2}}\) and \(-\frac{r}{\sqrt{2}}\), respectively, by the definition of \(\Phi_{3}(\cdot,\cdot,\cdot;\cdot)\) and \(\Phi_{4}(\cdot,\cdot,\cdot,\cdot;\cdot)\) we have
\[\begin{align}
F_{NT}(r;\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k})= &-2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix}
1 & 0 & -r \\
0 & 1 & 0 \\
-r & 0 & 1
\end{pmatrix}\right\} \\
&+2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & r \\
0 & r & 1
\end{pmatrix}\right\} \\
&+2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & 0 & \frac{r}{\sqrt{2}} \\
0 & 1 & -r & \frac{r}{\sqrt{2}} \\
0 & -r & 1 & -\frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\} \\
&-2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & r & -\frac{r}{\sqrt{2}} \\
0 & 1 & 0 & -\frac{r}{\sqrt{2}} \\
r & 0 & 1 & -\frac{1}{\sqrt{2}} \\
-\frac{r}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\}.
\end{align}\]
Using the facts that
\[\begin{align}
&\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & r & -\frac{r}{\sqrt{2}} \\
0 & 1 & 0 & -\frac{r}{\sqrt{2}} \\
r & 0 & 1 & -\frac{1}{\sqrt{2}} \\
-\frac{r}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\} \\
&+\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & r & \frac{r}{\sqrt{2}} \\
0 & 1 & 0 & \frac{r}{\sqrt{2}} \\
r & 0 & 1 & \frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\} \\
=&\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k};\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & r \\
0 & r & 1
\end{pmatrix}\right\}
\end{align}\]
and
\[\begin{align}
&\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k};\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & r \\
0 & r & 1
\end{pmatrix}\right\}+\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix}
1 & 0 & -r \\
0 & 1 & 0 \\
-r & 0 & 1
\end{pmatrix}\right\} \\
=&\Phi_{2}(-\Delta_{j}^{1},\Delta_{j}^{2};0)=\Phi(-\Delta_{j}^{1})\Phi(\Delta_{j}^{2}),
\end{align}\]
we obtain
\[\begin{align}
F_{NT}(r;\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k})= &-2\Phi(-\Delta_{j}^{1})\Phi(\Delta_{j}^{2}) \\
&+2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & r \\
0 & r & 1
\end{pmatrix}\right\} \\
&+2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & 0 & \frac{r}{\sqrt{2}} \\
0 & 1 & -r & \frac{r}{\sqrt{2}} \\
0 & -r & 1 & -\frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\} \\
&+2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix}
1 & 0 & r & \frac{r}{\sqrt{2}} \\
0 & 1 & 0 & \frac{r}{\sqrt{2}} \\
r & 0 & 1 & \frac{1}{\sqrt{2}} \\
\frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 1
\end{pmatrix}\right\}.
\end{align}\]

The bridge function for the truncated/ternary case follows by switching \(j\) and \(k\).

Derivation of approximate bound for the ternary/continuous case

Let \(n_{0x}=\sum_{i=1}^{n}I(x_{i}=0)\), \(n_{2x}=\sum_{i=1}^{n}I(x_{i}=2)\), \(\pi_{0x}=\frac{n_{0x}}{n}\) and \(\pi_{2x}=\frac{n_{2x}}{n}\), then
\[\begin{align}
|\tau(\mathbf{x})|\leq &\frac{n_{0x}(n-n_{0x})+n_{2x}(n-n_{0x}-n_{2x})}{\binom{n}{2}} \\
= &2\left\{\frac{n_{0x}}{n-1}-\left(\frac{n_{0x}}{n}\right)\left(\frac{n_{0x}}{n-1}\right)+\frac{n_{2x}}{n-1}-\left(\frac{n_{2x}}{n}\right)\left(\frac{n_{0x}}{n-1}\right)-\left(\frac{n_{2x}}{n}\right)\left(\frac{n_{2x}}{n-1}\right)\right\} \\
\approx &2\left\{\frac{n_{0x}}{n}-\left(\frac{n_{0x}}{n}\right)^2+\frac{n_{2x}}{n}-\left(\frac{n_{2x}}{n}\right)\left(\frac{n_{0x}}{n}\right)-\left(\frac{n_{2x}}{n}\right)^2\right\} \\
= &2\{\pi_{0x}(1-\pi_{0x})+\pi_{2x}(1-\pi_{0x}-\pi_{2x})\}.
\end{align}\]

For the ternary/binary and ternary/ternary cases, we combine the two individual bounds.

Derivation of approximate bound for the ternary/truncated case

Let \(\mathbf{x}\in\mathcal{R}^{n}\) and \(\mathbf{y}\in\mathcal{R}^{n}\) be the observed \(n\) realizations of the ternary and truncated variables, respectively. Let \(n_{0x}=\sum_{i=1}^{n}I(x_{i}=0)\), \(\pi_{0x}=\frac{n_{0x}}{n}\), \(n_{1x}=\sum_{i=1}^{n}I(x_{i}=1)\), \(\pi_{1x}=\frac{n_{1x}}{n}\), \(n_{2x}=\sum_{i=1}^{n}I(x_{i}=2)\), \(\pi_{2x}=\frac{n_{2x}}{n}\), \(n_{0y}=\sum_{i=1}^{n}I(y_{i}=0)\), \(\pi_{0y}=\frac{n_{0y}}{n}\), \(n_{0x0y}=\sum_{i=1}^{n}I(x_{i}=0 \;\&\; y_{i}=0)\), \(n_{1x0y}=\sum_{i=1}^{n}I(x_{i}=1 \;\&\; y_{i}=0)\) and \(n_{2x0y}=\sum_{i=1}^{n}I(x_{i}=2 \;\&\; y_{i}=0)\), then
\[\begin{align}
|\tau(\mathbf{x}, \mathbf{y})|\leq &\frac{\binom{n}{2}-\binom{n_{0x}}{2}-\binom{n_{1x}}{2}-\binom{n_{2x}}{2}-\binom{n_{0y}}{2}+\binom{n_{0x0y}}{2}+\binom{n_{1x0y}}{2}+\binom{n_{2x0y}}{2}}{\binom{n}{2}}.
\end{align}\]
Since \(n_{0x0y}\leq\min(n_{0x},n_{0y})\), \(n_{1x0y}\leq\min(n_{1x},n_{0y})\) and \(n_{2x0y}\leq\min(n_{2x},n_{0y})\), we obtain
\[\begin{align}
|\tau(\mathbf{x}, \mathbf{y})|\leq &\frac{\binom{n}{2}-\binom{n_{0x}}{2}-\binom{n_{1x}}{2}-\binom{n_{2x}}{2}-\binom{n_{0y}}{2}}{\binom{n}{2}} \\
&+\frac{\binom{\min(n_{0x},n_{0y})}{2}+\binom{\min(n_{1x},n_{0y})}{2}+\binom{\min(n_{2x},n_{0y})}{2}}{\binom{n}{2}} \\
\leq &\frac{\binom{n}{2}-\binom{\max(n_{0x},n_{1x},n_{2x},n_{0y})}{2}}{\binom{n}{2}} \\
\leq &1-\frac{\max(n_{0x},n_{1x},n_{2x},n_{0y})(\max(n_{0x},n_{1x},n_{2x},n_{0y})-1)}{n(n-1)} \\
\approx &1-\left(\frac{\max(n_{0x},n_{1x},n_{2x},n_{0y})}{n}\right)^{2} \\
= &1-\{\max(\pi_{0x},\pi_{1x},\pi_{2x},\pi_{0y})\}^{2} \\
= &1-\{\max(\pi_{0x},(1-\pi_{0x}-\pi_{2x}),\pi_{2x},\pi_{0y})\}^{2}.
\end{align}\]

The approximate bound for the truncated/ternary case follows by switching \(\mathbf{x}\) and \(\mathbf{y}\).

References

Croux, Christophe, Peter Filzmoser, and Heinrich Fritz. 2013. "Robust Sparse Principal Component Analysis." Technometrics 55 (2): 202–14.
Fan, Jianqing, Han Liu, Yang Ning, and Hui Zou. 2017. "High Dimensional Semiparametric Latent Graphical Model for Mixed Data." Journal of the Royal Statistical Society, Series B: Statistical Methodology 79 (2): 405–21.
Filzmoser, Peter, Heinrich Fritz, and Klaudius Kalcher. 2021. pcaPP: Robust PCA by Projection Pursuit. https://CRAN.R-project.org/package=pcaPP.
Fox, John. 2019. polycor: Polychoric and Polyserial Correlations. https://CRAN.R-project.org/package=polycor.
Gaure, Simen. 2019. chebpol: Multivariate Interpolation. https://github.com/sgaure/chebpol.
Liu, Han, John Lafferty, and Larry Wasserman. 2009. "The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs." Journal of Machine Learning Research 10 (10).
Quan, Xiaoyun, James G Booth, and Martin T Wells. 2018. "Rank-Based Approach for Estimating Correlations in Mixed Ordinal Data." arXiv preprint arXiv:1809.06255.
Yoon, Grace, Raymond J Carroll, and Irina Gaynanova. 2020. "Sparse Semiparametric Canonical Correlation Analysis for Data of Mixed Types." Biometrika 107 (3): 609–25.
Yoon, Grace, Christian L Müller, and Irina Gaynanova. 2021. "Fast Computation of Latent Correlations." Journal of Computational and Graphical Statistics, 1–8.
