
The concordance filter: an adaptive model-free feature screening procedure

  • Original Paper
  • Published:
Computational Statistics

Abstract

A new model-free and data-adaptive feature screening procedure referred to as the concordance filter is developed for ultrahigh-dimensional data. The proposed method is based on the concordance filter which measures concordance between random vectors and can work adaptively with several types of predictors and response variables. We apply the concordance filter to deal with feature screening problems emerging from a wide range of real applications, such as nonparametric regression and survival analysis, among others. It is shown that the concordance filter enjoys the sure screening and rank consistency properties under weak regularity conditions. In particular, the concordance filter can still be powerful in the presence of censoring and heavy tails. We further demonstrate the superior performance of the concordance filter over existing screening methods by numerical examples and medical applications.


References

  • Barber RF, Candès EJ (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085

  • Barber RF, Candès EJ (2019) A knockoff filter for high-dimensional selective inference. Ann Stat 47(5):2504–2537

  • Bing X, Wegkamp MH (2019) Adaptive estimation of the rank of the coefficient matrix in high-dimensional multivariate response regression models. Ann Stat 47(6):3157–3184

  • Chen B, Qin J, Yuan A (2021) Using the accelerated failure time model to analyze current status data with misclassified covariates. Electron J Stat 15(1):1372–1394

  • Clayton D, Cuzick J (1985) Multivariate generalizations of the proportional hazards model. J R Stat Soc Ser A (General) 148(2):82–108

  • Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B (Methodol) 34(2):187–202

  • Desmedt C, Piette F, Loi S et al (2007) Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 13(11):3207–3214

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol) 70(5):849–911

  • Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38(6):3567–3604

  • Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:2013–2038

  • Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106(494):544–557

  • Fan J, Li R, Zhang CH et al (2020) Statistical foundations of data science. Chapman and Hall/CRC

  • Hall P, Miller H (2009) Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graph Stat 18(3):533–550

  • Hall P, Xue JH (2014) On selecting interacting features from high-dimensional data. Comput Stat Data Anal 71:694–708

  • Harrell FE, Califf RM, Pryor DB et al (1982) Evaluating the yield of medical tests. JAMA 247(18):2543–2546

  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58(301):13–30

  • Huang J, Horowitz JL, Ma S (2008) Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat 36(2):587–613

  • Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499

  • Ishwaran H, Kogalur UB, Blackstone EH et al (2008) Random survival forests. Ann Appl Stat 2(3):841–860

  • Kalbfleisch JD, Prentice RL (2011) The statistical analysis of failure time data, vol 360. John Wiley & Sons

  • Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481

  • Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93

  • Klein N, Kneib T, Lang S et al (2015) Bayesian structured additive distributional regression with an application to regional income inequality in Germany. Ann Appl Stat 9(2):1024–1052

  • Li G, Peng H, Zhang J et al (2012a) Robust rank correlation based screening. Ann Stat 40(3):1846–1877

  • Li R, Zhong W, Zhu L (2012b) Feature screening via distance correlation learning. J Am Stat Assoc 107(499):1129–1139

  • Liu W, Ke Y, Liu J et al (2022) Model-free feature screening and FDR control with knockoff features. J Am Stat Assoc 117(537):428–443

  • Lovell MC (1963) Seasonal adjustment of economic time series and multiple regression analysis. J Am Stat Assoc 58(304):993–1010

  • Lv J, Liu JS (2014) Model selection principles in misspecified models. J R Stat Soc Ser B (Stat Methodol) 76(1):141–167

  • Pan W, Wang X, Xiao W et al (2019) A generic sure independence screening procedure. J Am Stat Assoc 114(526):928–937

  • Pan W, Wang X, Zhang H et al (2020) Ball covariance: a generic measure of dependence in Banach space. J Am Stat Assoc 115(529):307–317

  • Pukelsheim F (1994) The three sigma rule. Am Stat 48(2):88–91

  • Ritchie MD, Van Steen K (2018) The search for gene-gene interactions in genome-wide association studies: challenges in abundance of methods, practical considerations, and biological interpretation. Ann Transl Med 6(8):157

  • Saldana DF, Feng Y (2018) SIS: an R package for sure independence screening in ultrahigh-dimensional statistical models. J Stat Softw 83(1):1–25

  • Sellke TM, Sellke SH (1997) Chebyshev inequalities for unimodal distributions. Am Stat 51(1):34–40

  • Sen PK (1968) Estimates of the regression coefficient based on Kendall’s tau. J Am Stat Assoc 63(324):1379–1389

  • Shen Y, Ning J, Qin J (2009) Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc 104(487):1192–1202

  • Song R, Lu W, Ma S et al (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101(4):799–814

  • Stroud JR, Müller P, Polson NG (2003) Nonlinear state-space models with state-dependent variances. J Am Stat Assoc 98(462):377–386

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288

  • Vogelsang TJ (2001) Nonlinear econometric modeling in time series analysis, in: Proceedings of the eleventh international symposium in economic theory. 96(453):354–354

  • Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105(1):397–411

  • Zhao Z (2008) Parametric and nonparametric models and methods in financial econometrics. Stat Surv 2:1–42

  • Zhu J, Pan W, Zheng W et al (2021) Ball: an R package for detecting distribution difference and association in metric spaces. J Stat Softw 97:1–31

  • Zhu LP, Li L, Li R et al (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc 106(496):1464–1475

Acknowledgements

We are grateful for resources from the High Performance Computing Center of Central South University. We have developed the MFSIS R package for the proposed CSIS method; the package is available on CRAN via https://cran.r-project.org/package=MFSIS. We acknowledge partial support from the National Statistical Scientific Research Project of China (No. 2022LZ28 & 2021LY042), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20200148), the Changsha Municipal Natural Science Foundation (No. kq2202080) and the Open Research Fund from the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen (No. B10120210117-OF04).

Author information

Author notes
  1. Xuewei Cheng: Conceptualization, Methodology, Software, Writing original draft. Gang Li: Writing–review & editing, Validation. Hong Wang: Supervision, Writing–review & editing.

Authors and Affiliations

  1. MOE-LCSM, School of Mathematics and Statistics, Hunan Normal University, Changsha, China

    Xuewei Cheng

  2. Key Laboratory of Applied Statistics and Data Science, College of Hunan Province, Hunan Normal University, Changsha, China

    Xuewei Cheng

  3. School of Mathematics and Statistics, Central South University, Changsha, China

    Xuewei Cheng & Hong Wang

  4. School of Public Health, University of California Los Angeles, Los Angeles, USA

    Gang Li

Authors
  1. Xuewei Cheng
  2. Gang Li
  3. Hong Wang

Corresponding author

Correspondence to Hong Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xuewei Cheng, Gang Li and Hong Wang have contributed equally to this work.

Appendices

Appendix A Proof of the sure screening property

In this appendix, we first prove the rank consistency property of the proposed method using concentration inequalities. We then prove the sure screening property using Hoeffding’s inequality for U-statistics. The proof techniques follow Li et al. (2012a) and Song et al. (2014).

Proof of Theorem 1

We show that \(E|\omega _{k}|\ge c_{0}n^{-\kappa }\) for some \(c_{0}>0\). If \(k\in {\mathcal {M}}_{*}\), we have

$$\begin{aligned} E(\omega _{k})&=\frac{E\{I(X_{1k}<X_{2k})I(Y_{1}<Y_{2}, \delta _{1}=1)\}}{E\{I(Y_{1}<Y_{2},\delta _{1}=1)\}}-\frac{1}{2}\\&=\frac{E\{I(X_{1k}<X_{2k})I(Y_{1}<Y_{2})\} P(\delta _{1}=1)}{E\{I(Y_{1}<Y_{2})\}P(\delta _{1}=1)}-\frac{1}{2}\\&=2E\{I(X_{1k}<X_{2k})I(Y_{1}<Y_{2})\}-\frac{1}{2}\\&=2E\{I(X_{1k}<X_{2k})I[H(Y_{1})-H(Y_{2})<0]\}-\frac{1}{2}\\&=2E\{I(X_{1k}<X_{2k})I[H(Y_{1})-H(Y_{2})-\rho _{k}^{*}\{m_{k}(X_{1k})-m_{k}(X_{2k})\}\\&\;\;\;\;<\rho _{k}^{*}\{m_{k}(X_{2k})-m_{k}(X_{1k})\}]\}-\frac{1}{2}\\&=2E[I(X_{1k}<X_{2k})I(\Delta \epsilon _{k}<-\rho _{k}^{*}\Delta m_{k})]-\frac{1}{2}\\&=2E[I(X_{1k}<X_{2k})F_{\Delta \epsilon _{k} \vert \Delta m_{k}}(-\rho _{k}^{*}\Delta m_{k})]-\frac{1}{2}, \end{aligned}$$

where \(F_{\Delta \epsilon _{k}\vert \Delta m_{k}}\) is the conditional cumulative distribution function of \(\Delta \epsilon _{k}=H(Y_{1})-H(Y_{2})-\rho _{k}^{*}\{m_{k}(X_{1k})-m_{k}(X_{2k})\}\) given \(\Delta m_{k}=m_{k}(X_{1k})-m_{k}(X_{2k})\). The second equality holds since the status \(\delta\) is independent of the response \(Y\). The third equality holds since \(E\{I(Y_{1}<Y_{2})\}=1/2\).

Because \(m_{k}(\cdot )\) is monotone, \(m_{k}(X_{1k})-m_{k}(X_{2k})\) keeps the same sign for all \(X_{1k}<X_{2k}\): it is either always greater than zero or always less than zero. By Condition (C3), this implies that \(1-F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(-\rho _{k}^{*}\Delta m_{k})\) is either greater or less than 1/2, so \(E(\omega _{k})\) is either greater or less than zero for \(k\in {\mathcal {M}}_{*}\). From the above analysis and taking into account the symmetry of \(F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(t)\), we have

$$\begin{aligned} 1-F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(-t)=F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(t), \end{aligned}$$
(A1)
$$\begin{aligned} F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(0)=\frac{1}{2}. \end{aligned}$$
(A2)

In the following, we further establish a lower bound for \(\vert E(\omega _{k})\vert\).

Without loss of generality, assume that \(m_{k}(\cdot )\) is monotone increasing and \(\rho _{k}^{*}>c^{*}n^{-\kappa }\). Note that \(E(\omega _{k})\) can be equivalently written as

$$\begin{aligned} \begin{aligned} E(\omega _{k})&=2E\{I(X_{1k}<X_{2k})[F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(-\rho _{k}^{*}\Delta m_{k})-F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(0)]\}\\&=2E\{I(X_{1k}<X_{2k})\int _{0}^{-\rho _{k}^{*}\Delta m_{k}}f_{\Delta \epsilon _{k}\vert \Delta m_{k}}(t)dt\}.\\ \end{aligned} \end{aligned}$$
(A3)

By the Gaussian inequality for symmetric unimodal distributions (Pukelsheim 1994; Sellke and Sellke 1997),

$$\begin{aligned} P(\mid X\mid \ge t)\le \sqrt{3} \sigma /(t+\sqrt{3}\sigma ), \end{aligned}$$
(A4)

where \(X\) is a unimodal random variable with mode at zero and variance \(\sigma ^{2}\). Using this inequality and Condition (C4), we have

$$\begin{aligned} \begin{aligned} \int _{0}^{-\rho _{k}^{*}\Delta m_{k}}f_{\Delta \epsilon _{k}\vert \Delta m_{k}}(t)dt&\ge \frac{1}{2}\left( 1-\frac{\sqrt{3}\sigma _{2}}{-\rho _{k}^{*}\Delta m_{k}+\sqrt{3}\sigma _{2}}\right) \\&=\frac{-\rho _{k}^{*}\Delta m_{k}}{2(-\rho _{k}^{*}\Delta m_{k}+\sqrt{3}\sigma _{2})}. \end{aligned} \end{aligned}$$
(A5)
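The first inequality in (A5) combines (A4) with the symmetry relations (A1) and (A2). As a sketch of that step, write \(t=-\rho _{k}^{*}\Delta m_{k}\), which is positive on the event \(X_{1k}<X_{2k}\) under the stated assumptions, and assume (Condition (C4) is not restated in this appendix) that \(\sigma _{2}^{2}\) bounds the conditional variance of \(\Delta \epsilon _{k}\) given \(\Delta m_{k}\):

$$\begin{aligned} \int _{0}^{t}f_{\Delta \epsilon _{k}\vert \Delta m_{k}}(s)ds&=\frac{1}{2}P(\vert \Delta \epsilon _{k}\vert <t\mid \Delta m_{k})\\&\ge \frac{1}{2}\left( 1-\frac{\sqrt{3}\sigma _{2}}{t+\sqrt{3}\sigma _{2}}\right) =\frac{t}{2(t+\sqrt{3}\sigma _{2})}. \end{aligned}$$

The equality uses the symmetry of the conditional density about zero, and the inequality is (A4) applied to the complement event, using that the bound in (A4) is increasing in \(\sigma\).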

By Condition (C2) and the inequality \(E|X-Y|\ge E|X|\), which holds for i.i.d. random variables \(X\) and \(Y\) with \(E(X)=E(Y)=0\), we have

$$\begin{aligned} E(\omega _{k})&\ge 2E\left[ I(X_{1k}<X_{2k})\frac{-\rho _{k}^{*}\Delta m_{k}}{2(-\rho _{k}^{*}\Delta m_{k}+\sqrt{3}\sigma _{2})}\right] \\&\ge \frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}E[-\Delta m_{k}I(-M \le \Delta m_{k}<0)]\\&\ge \frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}E[-\Delta m_{k}I(\Delta m_{k}<0)+\Delta m_{k}I(\Delta m_{k}<-M)]\\&= \frac{\rho _{k}^{*}}{2M+2\sqrt{3}\sigma _{2}}E[\vert m_{k}(X_{2k})-m_{k}(X_{1k})\vert ]\\&\quad +\frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}E[\Delta m_{k}I(\Delta m_{k}<-M)]\\&\ge \frac{\rho _{k}^{*}c_{{\mathcal {M}}_{*}}}{2M+2\sqrt{3}\sigma _{2}} +\frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}E[\Delta m_{k}I(\Delta m_{k}<-M)]\\&=I_{1}+I_{2}, \end{aligned}$$

for some constant \(M>0\). By Chebyshev’s inequality and Conditions (C3) and (C4), we have

$$\begin{aligned} P(\Delta m_{k}<-M)=P(\Delta m_{k}>M)\le \frac{E\{(\Delta m_{k})^{2}\}}{M^{2}} =\frac{\text {Var}(\Delta m_{k})}{M^{2}}\le \frac{2\sigma _{1}^2}{M^2}. \end{aligned}$$
(A6)

By the Cauchy–Schwarz inequality and Condition (C4), the second term \(I_{2}\) can be further bounded from below as

$$\begin{aligned} \begin{aligned} I_{2}&\ge -\frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}\sqrt{\{E[(-\Delta m_{k})^2]E[I^{2}(\Delta m_{k}<-M)]\}}\\&\ge -\frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}\sqrt{2\sigma _{1}^{2}\cdot (2\sigma _{1}^{2}/M^{2})}\\&= -\frac{2\rho _{k}^{*}\sigma _{1}^{2}}{M^{2}+\sqrt{3}M\sigma _{2}}, \end{aligned} \end{aligned}$$
(A7)

where the second inequality holds by Condition (C3) and inequality (A6). The bounds on \(I_{1}\) and \(I_{2}\) together yield

$$\begin{aligned} \begin{aligned} E(\omega _{k})&\ge I_{1}+I_{2} \ge \frac{\rho _{k}^{*}c_{{\mathcal {M}}_{*}}}{2M+2\sqrt{3}\sigma _{2}} -\frac{2\rho _{k}^{*}\sigma _{1}^{2}}{M^{2}+\sqrt{3}M\sigma _{2}}. \end{aligned} \end{aligned}$$
(A8)
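For any \(M>0\), the right-hand side of (A8) can be factored, which shows how large \(M\) must be for the bound to be useful. This is plain algebra on (A8) and introduces no new assumption:

$$\begin{aligned} \frac{\rho _{k}^{*}c_{{\mathcal {M}}_{*}}}{2M+2\sqrt{3}\sigma _{2}} -\frac{2\rho _{k}^{*}\sigma _{1}^{2}}{M^{2}+\sqrt{3}M\sigma _{2}} =\frac{\rho _{k}^{*}}{M+\sqrt{3}\sigma _{2}}\left( \frac{c_{{\mathcal {M}}_{*}}}{2}-\frac{2\sigma _{1}^{2}}{M}\right) . \end{aligned}$$

With \(\rho _{k}^{*}>0\), the bracket, and hence the whole bound, is positive exactly when \(M>4\sigma _{1}^{2}/c_{{\mathcal {M}}_{*}}\), so the lower bound is a positive multiple of \(\rho _{k}^{*}\) for any fixed \(M\) above that level.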

Let \(M=8\sigma _{1}^{2}/c_{{\mathcal {M}}_{*}}\). Then, \(M\ge \sqrt{3}\sigma _{2}\) when \(n\) is large. We have \(E(\omega _{k})\ge c_{0}n^{-\kappa }\), where \(c_{0}=\frac{c^{*} c^{2}_{{\mathcal {M}}_{*}}}{32\sigma _{1}^2}\). Similarly, if \(m_{k}(\cdot )\) is monotone decreasing and \(\rho _{k}^{*}<-c^{*}n^{-\kappa }\), we can show that \(E(\omega _{k})\le -c_{0}n^{-\kappa }\). Therefore, \(E(\vert \omega _{k}\vert )\ge c_{0}n^{-\kappa }\) when \(\vert \rho _{k}^{*}\vert >c^{*}n^{-\kappa }\) for any \(k\in {\mathcal {M}}_{*}\). Theorem 1 is then proved. \(\square\)

Lemma 1

Hoeffding’s inequality for U-statistics (Hoeffding 1963). Let \(h=h(x_{1},x_{2},\ldots ,x_{m})\) be a symmetric kernel of the U-statistic \(U_{n}\), with \(a\le h(x_{1},x_{2},\ldots ,x_{m})\le b\). Then, for \(t>0\) and \(m<n\), we have

$$\begin{aligned} P\left\{ \mid U_{n}-E(U_{n})\mid >t\right\} \le 2\text {exp}\left( \frac{-2\lfloor n/m \rfloor t^2}{(b-a)^2}\right) . \end{aligned}$$
(A9)
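To get a feel for how fast the bound in (A9) decays, the right-hand side can be evaluated numerically. The sketch below is illustrative only: the function name `hoeffding_u_bound` is hypothetical, and the value is capped at 1 because the uncapped expression can exceed 1 for small \(t\), where it carries no information.

```python
import math


def hoeffding_u_bound(n, m, t, a, b):
    """Evaluate the right-hand side of (A9), capped at 1.

    n: sample size, m: kernel order (m < n), t: deviation (t > 0),
    a, b: lower and upper bounds of the kernel (a < b).
    """
    if not (0 < m < n):
        raise ValueError("need 0 < m < n")
    if t <= 0 or b <= a:
        raise ValueError("need t > 0 and a < b")
    blocks = n // m  # floor(n / m) independent blocks of size m
    bound = 2.0 * math.exp(-2.0 * blocks * t ** 2 / (b - a) ** 2)
    return min(1.0, bound)
```

For a kernel of order \(m=2\) taking values in \([-1/2, 1/2]\), as in the proof of Theorem 2, `hoeffding_u_bound(n, 2, t, -0.5, 0.5)` evaluates \(2\exp (-2\lfloor n/2\rfloor t^{2})\).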

Proof of Theorem 2

By the definition of \(\omega _{k}\) in (3), we have, for each \(k=1,2,\ldots ,p\),

$$\begin{aligned} \begin{aligned} \omega _{k}&=\frac{\sum _{j\ne i}^{n} I(X_{ik}<X_{jk})I(Y_{i}<Y_{j},\delta _{i}=1)}{\sum _{j\ne i}^{n} I(Y_{i}<Y_{j},\delta _{i}=1)}-\frac{1}{2}\\&=\frac{1}{K}\sum _{1\le i<j\le n}h[(X_{ik},Y_{i}),(X_{jk},Y_{j})], \end{aligned} \end{aligned}$$
(A10)

where \(K=\sum _{j \ne i}^{n}I(Y_{i}<Y_{j},\delta _{i}=1)\) is the number of all comparable pairs (in particular, \(K=\binom{n}{2}\) when the data are fully observed) and \(h[(X_{ik},Y_{i}),(X_{jk},Y_{j})]=[\delta _{i}I(X_{ik}<X_{jk})I(Y_{i}<Y_{j}) +\delta _{j}I(X_{ik}>X_{jk})I(Y_{i}>Y_{j})]-\frac{1}{2}\).
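The first line of (A10) translates directly into array operations. The following is a minimal sketch, not the MFSIS implementation: the function name `concordance_filter` and its signature are hypothetical, and it uses an \(O(n^{2})\) pairwise comparison per feature, which is fine for moderate \(n\) but not tuned for speed.

```python
import numpy as np


def concordance_filter(X, y, delta=None):
    """Compute omega_k from the first line of (A10) for every column of X.

    X: (n, p) feature matrix; y: (n,) observed response;
    delta: (n,) status indicator (1 = observed, 0 = censored).
    If delta is None, all observations are treated as fully observed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    delta = np.ones(n) if delta is None else np.asarray(delta, dtype=float)

    # comparable[i, j] = I(Y_i < Y_j, delta_i = 1)
    comparable = (y[:, None] < y[None, :]) & (delta[:, None] == 1)
    K = comparable.sum()  # number of comparable pairs
    if K == 0:
        raise ValueError("no comparable pairs in the data")

    omega = np.empty(p)
    for k in range(p):
        x = X[:, k]
        # concordant[i, j] = I(X_ik < X_jk) * I(Y_i < Y_j, delta_i = 1)
        concordant = (x[:, None] < x[None, :]) & comparable
        omega[k] = concordant.sum() / K - 0.5
    return omega
```

A feature that is perfectly increasing in the response gives \(\omega _{k}=1/2\), a perfectly decreasing one gives \(\omega _{k}=-1/2\), and a feature unrelated to the response gives a value near zero.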

This means that \(\omega _{k}\) is a U-statistic with symmetric kernel \(h[(X_{ik},Y_{i}),(X_{jk},Y_{j})]\). Since each indicator function is bounded by 1, we have

$$\begin{aligned} -\frac{1}{2}\le h[(X_{ik},Y_{i}),(X_{jk},Y_{j})] \le \frac{1}{2}. \end{aligned}$$
(A11)

Also, \(-\frac{1}{2}\le E(\omega _{k})\le \frac{1}{2}\). Applying Hoeffding’s inequality (Lemma 1), for \(0<\kappa <\frac{1}{2}\) and any \(c_{1}>0\), there exists a positive constant \(c_{2}\) such that

$$\begin{aligned} P(\mid \omega _{k}-E(\omega _{k})\mid \ge c_{1}n^{-\kappa }) \le 2\{\text {exp}(-2c_{1}^2\lfloor n/2\rfloor n^{-2\kappa })\}\le 2\{\text {exp}(-c_{2}n^{1-2\kappa })\}, \end{aligned}$$
(A12)

where \(c_{2}=c_{1}^{2}\). Thus,

$$\begin{aligned} \begin{aligned} P\left( \mathop {\text {max}}\limits _{1\le k \le p}\mid \omega _{k}-E(\omega _{k})\mid \ge c_{1}n^{-\kappa }\right)&\le \sum _{k=1}^{p} P\left( \mid \omega _{k}-E(\omega _{k})\mid \ge c_{1}n^{-\kappa }\right) \\&\le 2p\left\{ \text {exp}(-c_{2}n^{1-2\kappa })\right\} . \end{aligned} \end{aligned}$$
(A13)

Denote the set \(S_{n}=\left\{ \mathop {\text {max}}\limits _{k \in {\mathcal {M}}_{*}}\mid \omega _{k}-E(\omega _{k})\mid \le c_{0}n^{-\kappa }/2\right\}\). On the set \(S_{n}\), by the result of Theorem 1, we have

$$\begin{aligned} \mid \omega _{k} \mid \ge \mid E(\omega _{k})\mid -\mid \omega _{k} -E(\omega _{k})\mid \ge c_{0}n^{-\kappa }/2, \text {for all}\; k \in {\mathcal {M}}_{*}. \end{aligned}$$
(A14)

From inequality (A13), it is easy to check that there exists a positive constant \(c_{2}\) such that

$$\begin{aligned} P(S_{n}^{c})\le 2\vert {\mathcal {M}}_{*}\vert \text {exp}\left\{ -c_{2}n^{1-2\kappa }\right\} . \end{aligned}$$
(A15)

Choosing \(\gamma _{n}=c_{3}n^{-\kappa }\) with \(c_{3}\le c_{0}/2\), we have

$$\begin{aligned} P\left( {\mathcal {M}}_{*}\subset \widehat {\mathcal{M}}_{\gamma _{n}}\right) \ge P(S_{n}) \ge 1-2\vert {\mathcal {M}}_{*}\vert \text {exp}\left\{ -c_{2}n^{1-2\kappa }\right\} . \end{aligned}$$
(A16)
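The screening step behind (A14) and (A16) can be sketched as a simple thresholding rule. The definition of \(\widehat{\mathcal {M}}_{\gamma _{n}}\) is not restated in this appendix, so the sketch assumes \(\widehat{\mathcal {M}}_{\gamma _{n}}=\{k:\vert \omega _{k}\vert \ge \gamma _{n}\}\), which is the form the argument above relies on. The function names and the constants passed to them are hypothetical.

```python
import numpy as np


def threshold(n, kappa, c3):
    """gamma_n = c3 * n^(-kappa), with 0 < kappa < 1/2 as in Theorem 2."""
    if not (0 < kappa < 0.5):
        raise ValueError("kappa must lie in (0, 1/2)")
    return c3 * n ** (-kappa)


def screen_by_threshold(omega, gamma):
    """Return the indices k with |omega_k| >= gamma (assumed form of M-hat)."""
    omega = np.asarray(omega, dtype=float)
    return np.flatnonzero(np.abs(omega) >= gamma)
```

In practice \(c_{0}\), and therefore an admissible \(c_{3}\), is unknown, so a fixed model size is often used instead of an explicit threshold; that choice is outside what this appendix specifies.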

Appendix B Example 4: other complicated nonlinear models

In this appendix, we illustrate that CSIS can be applied to more complicated scenarios. Following DC-SIS (Li et al. 2012b) and PC-Screen (Liu et al. 2022), we consider the following four sophisticated scenarios with \(\varepsilon \sim N(0,1)\). These models are frequently applied in economic time series analysis (Lovell 1963) and in studies of income inequality (Klein et al. 2015).

Model 4.a (Dummy variables): \(Y=3X_{1}+3\text {sgn}(X_{2})+3I(X_{3}>0.5) +3I(X_{4}<0.5)+3I(-0.5<X_{5}<0.5)+\varepsilon\)

Model 4.b (Additive structure): \(Y=5X_{1}+2\text {sin}(\pi X_{2}/2)+2X_{3}I(X_{3}>0) +2\text {exp}(5X_{4})+3(1/X_{5})+\varepsilon\)

Model 4.c (A sophisticated nonlinear model & strong signal): \(Y=5\text {arcsin}[X_{1}I(-1\le X_{1}\le 1)] +2\text {arccos}[X_{2}I(-1\le X_{2}\le 1)] +2\text {arctan}(X_{3}) +2\text {tan}[X_{4}I(-\frac{\pi }{2}\le X_{4}\le \frac{\pi }{2})] +2\text {sin}[X_{5}I(-\frac{\pi }{2}\le X_{5}\le \frac{\pi }{2})] +\varepsilon\)

Model 4.d (A sophisticated nonlinear model & weak signal): \(Y=\text {arcsin}[X_{1}I(-1\le X_{1}\le 1)] -0.8\text {arccos}[X_{2}I(-1\le X_{2}\le 1)]+0.6\text {arctan}(X_{3}) -0.4\text {tan}[X_{4}I(-\frac{\pi }{2}\le X_{4}\le \frac{\pi }{2})] +0.2\text {sin}[X_{5}I(-\frac{\pi }{2}\le X_{5}\le \frac{\pi }{2})] +\varepsilon\)
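The four response models can be simulated as follows. This is a sketch under assumptions that this appendix does not state: the features are drawn as independent standard normals, and the sample size, dimension and seed defaults are arbitrary. The function name `simulate_example4` is hypothetical; only the formulas for \(Y\) are taken from Models 4.a to 4.d.

```python
import numpy as np


def simulate_example4(model, n=200, p=1000, seed=0):
    """Draw (X, y) from Model 4.a, 4.b, 4.c or 4.d; the first five columns are active."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))  # assumption: i.i.d. N(0, 1) features
    eps = rng.standard_normal(n)     # epsilon ~ N(0, 1), as stated in the text
    x1, x2, x3, x4, x5 = (X[:, j] for j in range(5))
    half_pi = np.pi / 2

    def keep(x, lo, hi):
        # x * I(lo <= x <= hi): values outside [lo, hi] are set to zero
        return x * ((x >= lo) & (x <= hi))

    if model == "4.a":
        y = (3 * x1 + 3 * np.sign(x2) + 3 * (x3 > 0.5) + 3 * (x4 < 0.5)
             + 3 * ((x5 > -0.5) & (x5 < 0.5)) + eps)
    elif model == "4.b":
        y = (5 * x1 + 2 * np.sin(np.pi * x2 / 2) + 2 * x3 * (x3 > 0)
             + 2 * np.exp(5 * x4) + 3 / x5 + eps)
    elif model == "4.c":
        y = (5 * np.arcsin(keep(x1, -1, 1)) + 2 * np.arccos(keep(x2, -1, 1))
             + 2 * np.arctan(x3) + 2 * np.tan(keep(x4, -half_pi, half_pi))
             + 2 * np.sin(keep(x5, -half_pi, half_pi)) + eps)
    elif model == "4.d":
        y = (np.arcsin(keep(x1, -1, 1)) - 0.8 * np.arccos(keep(x2, -1, 1))
             + 0.6 * np.arctan(x3) - 0.4 * np.tan(keep(x4, -half_pi, half_pi))
             + 0.2 * np.sin(keep(x5, -half_pi, half_pi)) + eps)
    else:
        raise ValueError("model must be one of '4.a', '4.b', '4.c', '4.d'")
    return X, y
```

Models 4.b to 4.d produce heavy-tailed responses through the \(1/X_{5}\), \(\text {exp}(5X_{4})\) and \(\text {tan}(\cdot )\) terms, which is where rank-based statistics tend to be more stable than moment-based ones.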

Table 7 The mean of proportion \({\mathcal {S}}\) and minimum model size \({\mathcal {R}}\) in Example 4

The average values of the proportion \({\mathcal {S}}\) and of the minimum model size \({\mathcal {R}}\) that includes all five active features are presented in Table 7, based on 100 replications. In the nonlinear examples, all five competitors perform well. From Model 4.a to Model 4.d, DC-SIS performs comparably to CSIS in each scenario and outperforms the other three methods, especially in Model 4.c and Model 4.d. As a result, CSIS enjoys the sure screening property and requires a relatively small model size to recover the truly important variables in this example.
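The two summary measures can be computed from any vector of screening scores. The sketch below takes \({\mathcal {R}}\) to be the largest rank attained by an active feature when features are sorted by decreasing absolute score, which matches the description above. The exact definition of \({\mathcal {S}}\) is not given in this appendix, so the second function assumes it is the fraction of active features retained in a model of a given size `d`. Both function names are hypothetical.

```python
import numpy as np


def min_model_size(score, active):
    """Smallest d such that the top-d features by |score| contain all active ones."""
    score = np.asarray(score, dtype=float)
    order = np.argsort(-np.abs(score), kind="stable")  # best feature first
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(1, len(order) + 1)         # rank 1 = largest |score|
    return int(rank[list(active)].max())


def selected_proportion(score, active, d):
    """Fraction of active features among the top-d features (assumed definition of S)."""
    score = np.asarray(score, dtype=float)
    order = np.argsort(-np.abs(score), kind="stable")
    top = set(order[:d].tolist())
    return sum(k in top for k in active) / len(active)
```

Averaging these two quantities over independent replications gives summaries of the kind reported in Table 7.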

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cheng, X., Li, G. & Wang, H. The concordance filter: an adaptive model-free feature screening procedure. Comput Stat 39, 2413–2436 (2024). https://doi.org/10.1007/s00180-023-01399-5
