Abstract
A new model-free and data-adaptive feature screening procedure referred to as the concordance filter is developed for ultrahigh-dimensional data. The proposed method is based on the concordance filter which measures concordance between random vectors and can work adaptively with several types of predictors and response variables. We apply the concordance filter to deal with feature screening problems emerging from a wide range of real applications, such as nonparametric regression and survival analysis, among others. It is shown that the concordance filter enjoys the sure screening and rank consistency properties under weak regularity conditions. In particular, the concordance filter can still be powerful in the presence of censoring and heavy tails. We further demonstrate the superior performance of the concordance filter over existing screening methods by numerical examples and medical applications.
References
Barber RF, Candès EJ (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085
Barber RF, Candès EJ (2019) A knockoff filter for high-dimensional selective inference. Ann Stat 47(5):2504–2537
Bing X, Wegkamp MH (2019) Adaptive estimation of the rank of the coefficient matrix in high-dimensional multivariate response regression models. Ann Stat 47(6):3157–3184
Chen B, Qin J, Yuan A (2021) Using the accelerated failure time model to analyze current status data with misclassified covariates. Electron J Stat 15(1):1372–1394
Clayton D, Cuzick J (1985) Multivariate generalizations of the proportional hazards model. J R Stat Soc: Series A (General) 148(2):82–108
Cox DR (1972) Regression models and life-tables. J R Stat Soc: Series B (Methodological) 34(2):187–202
Desmedt C, Piette F, Loi S et al (2007) Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 13(11):3207–3214
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc: Series B (Stat Methodol) 70(5):849–911
Fan J, Song R (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann Stat 38(6):3567–3604
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:2013–2038
Fan J, Feng Y, Song R (2011) Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Am Stat Assoc 106(494):544–557
Fan J, Li R, Zhang CH et al (2020) Statistical foundations of data science. Chapman and Hall/CRC
Hall P, Miller H (2009) Using generalized correlation to effect variable selection in very high dimensional problems. J Comput Graphic Stat 18(3):533–550
Hall P, Xue JH (2014) On selecting interacting features from high-dimensional data. Comput Stat Data Anal 71:694–708
Harrell FE, Califf RM, Pryor DB et al (1982) Evaluating the yield of medical tests. JAMA 247(18):2543–2546
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58(301):13–30
Huang J, Horowitz JL, Ma S (2008) Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat 36(2):587–613
Huang J, Breheny P, Ma S (2012) A selective review of group selection in high-dimensional models. Stat Sci 27(4):481–499
Ishwaran H, Kogalur UB, Blackstone EH et al (2008) Random survival forests. Ann Appl Stat 2(3):841–860
Kalbfleisch JD, Prentice RL (2011) The statistical analysis of failure time data, vol 360. John Wiley & Sons
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481
Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93
Klein N, Kneib T, Lang S et al (2015) Bayesian structured additive distributional regression with an application to regional income inequality in Germany. Ann Appl Stat 9(2):1024–1052
Li G, Peng H, Zhang J et al (2012a) Robust rank correlation based screening. Ann Stat 40(3):1846–1877
Li R, Zhong W, Zhu L (2012b) Feature screening via distance correlation learning. J Am Stat Assoc 107(499):1129–1139
Liu W, Ke Y, Liu J et al (2022) Model-free feature screening and FDR control with knockoff features. J Am Stat Assoc 117(537):428–443
Lovell MC (1963) Seasonal adjustment of economic time series and multiple regression analysis. J Am Stat Assoc 58(304):993–1010
Lv J, Liu JS (2014) Model selection principles in misspecified models. J R Stat Soc Ser B Stat Methodol 76(1):141–167
Pan W, Wang X, Xiao W et al (2019) A generic sure independence screening procedure. J Am Stat Assoc 114(526):928–937
Pan W, Wang X, Zhang H et al (2020) Ball covariance: a generic measure of dependence in Banach space. J Am Stat Assoc 115(529):307–317
Pukelsheim F (1994) The three sigma rule. Am Stat 48(2):88–91
Ritchie MD, Van Steen K (2018) The search for gene-gene interactions in genome-wide association studies: challenges in abundance of methods, practical considerations, and biological interpretation. Ann Transl Med 6(8):157
Saldana DF, Feng Y (2018) SIS: an R package for sure independence screening in ultrahigh-dimensional statistical models. J Stat Softw 83(1):1–25
Sellke TM, Sellke SH (1997) Chebyshev inequalities for unimodal distributions. Am Stat 51(1):34–40
Sen PK (1968) Estimates of the regression coefficient based on Kendall’s tau. J Am Stat Assoc 63(324):1379–1389
Shen Y, Ning J, Qin J (2009) Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc 104(487):1192–1202
Song R, Lu W, Ma S et al (2014) Censored rank independence screening for high-dimensional survival data. Biometrika 101(4):799–814
Stroud JR, Müller P, Polson NG (2003) Nonlinear state-space models with state-dependent variances. J Am Stat Assoc 98(462):377–386
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Vogelsang TJ (2001) Nonlinear econometric modeling in time series analysis, in: Proceedings of the eleventh international symposium in economic theory. 96(453):354–354
Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105(1):397–411
Zhao Z (2008) Parametric and nonparametric models and methods in financial econometrics. Stat Surv 2:1–42
Zhu J, Pan W, Zheng W et al (2021) Ball: an R package for detecting distribution difference and association in metric spaces. J Stat Softw 97:1–31
Zhu LP, Li L, Li R et al (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Stat Assoc 106(496):1464–1475
Acknowledgements
We are grateful for resources from the High Performance Computing Center of Central South University. We have developed the MFSIS R package for the proposed CSIS method; the package is available on CRAN via https://cran.r-project.org/package=MFSIS. We acknowledge support in part from the National Statistical Scientific Research Project of China (No. 2022LZ28 & 2021LY042), the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20200148), the Changsha Municipal Natural Science Foundation (No. kq2202080) and the Open Research Fund from the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen (No. B10120210117-OF04).
Author information
Xuewei Cheng: Conceptualization, Methodology, Software, Writing original draft. Gang Li: Writing–review & editing, Validation. Hong Wang: Supervision, Writing–review & editing.
Authors and Affiliations
MOE-LCSM, School of Mathematics and Statistics, Hunan Normal University, Changsha, China
Xuewei Cheng
Key Laboratory of Applied Statistics and Data Science, College of Hunan Province, Hunan Normal University, Changsha, China
Xuewei Cheng
School of Mathematics and Statistics, Central South University, Changsha, China
Xuewei Cheng & Hong Wang
School of Public Health, University of California Los Angeles, Los Angeles, USA
Gang Li
Corresponding author
Correspondence to Hong Wang.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Xuewei Cheng, Gang Li and Hong Wang have contributed equally to this work.
Appendices
Appendix A Proof of the sure screening property
In this appendix, we first prove the rank consistency property of the proposed method using concentration inequalities. We then prove the sure screening property using Hoeffding’s inequality for U-statistics. The proof techniques follow Li et al. (2012a) and Song et al. (2014).
Proof of Theorem 1
We show that \(E|\omega _{k}|\ge c_{0}n^{-\kappa }\) for some \(c_{0}>0\) whenever \(k\in {\mathcal {M}}_{*}\). For such \(k\), we have
where \(F_{\Delta \epsilon _{k}\vert \Delta m_{k}}\) is the conditional cumulative distribution function of \(\Delta \epsilon _{k}=H(Y_{1})-H(Y_{2})-\rho _{k}^{*}\{m_{k}(X_{1k})-m_{k}(X_{2k})\}\) given \(\Delta m_{k}=m_{k}(X_{1k})-m_{k}(X_{2k})\). The second equality holds since the status \(\delta\) is independent of the response \(Y\). The third equality holds since \(E\{I(Y_{1}<Y_{2})\}=1/2\).
Because \(m_{k}(\cdot )\) is monotone, \(m_{k}(X_{1k})-m_{k}(X_{2k})\) is either greater than zero or less than zero for all \(X_{1k}<X_{2k}\). By Condition (C3), this implies that \(1-F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(-\rho _{k}^{*}\Delta m_{k})\) is either greater than or less than 1/2, so \(E(\omega _{k})\) is either greater than or less than zero for \(k\in {\mathcal {M}}_{*}\). From the above analysis and the symmetry of \(F_{\Delta \epsilon _{k}\vert \Delta m_{k}}(t)\), we have
In the following, we further establish a lower bound for \(\vert E(\omega _{k})\vert\).
Without loss of generality, assume that \(m_{k}(\cdot )\) is monotone increasing and \(\rho _{k}^{*}>c^{*}n^{-\kappa }\). Note that \(E(\omega _{k})\) can be equivalently written as
By the Gaussian inequality for symmetric unimodal distributions (Pukelsheim 1994; Sellke and Sellke 1997),
where \(X\) is a unimodal random variable with mode at the origin and variance \(\sigma ^{2}\). Using this inequality and Condition (C4), we have
By Condition (C2) and the inequality \(E|X-Y|>E|X |\), where \(X\) and \(Y\) are i.i.d. random variables with \(E(X)=E(Y)=0\), we have
for some constant \(M>0\). By Conditions (C3) and (C4), we have
By Chebyshev’s inequality and Condition (C4), the second term \(I_{2}\) can be further bounded below as
where the second inequality holds due to Condition (C3) and Inequality (A6). The above two inequalities for \(I_{1}\) and \(I_{2}\) yield that
Let \(M=8\sigma _{1}^{2}/c_{{\mathcal {M}}_{*}}\). Then \(M\ge \sqrt{3}\sigma _{2}\) when \(n\) is large. We have \(E(\omega _{k})\ge c_{0}n^{-\kappa }\), where \(c_{0}=\frac{c^{*} c^{2}_{{\mathcal {M}}_{*}}}{32\sigma _{1}^2}\). Similarly, if \(m_{k}(\cdot )\) is monotone decreasing and \(\rho _{k}^{*}<-c^{*}n^{-\kappa }\), we can show that \(E(\omega _{k})\le -c_{0}n^{-\kappa }\). Therefore, \(E(\vert \omega _{k}\vert )\ge c_{0}n^{-\kappa }\) when \(\vert \rho _{k}^{*}\vert >c^{*}n^{-\kappa }\) for any \(k\in {\mathcal {M}}_{*}\). Theorem 1 is then proved. \(\square\)
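For reference, the Gauss inequality invoked in the proof is usually stated as follows. This is the textbook form, given here as an aid to the reader; the notation \(\nu\), \(\tau\) and \(r\) is introduced only for this statement, and the authors' displayed version may be specialized or arranged differently. For a symmetric unimodal variable with mode and mean at zero, \(\tau ^{2}\) coincides with the variance \(\sigma ^{2}\).

```latex
% Gauss inequality (standard form): X unimodal with mode \nu, \tau^{2} = E(X-\nu)^{2}, r > 0
\[
P\left( |X-\nu | > r\right) \le
\begin{cases}
\left( \dfrac{2\tau }{3r}\right) ^{2}, & r\ge \dfrac{2\tau }{\sqrt{3}},\\[2ex]
1-\dfrac{r}{\sqrt{3}\,\tau }, & 0< r\le \dfrac{2\tau }{\sqrt{3}}.
\end{cases}
\]
```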
Lemma 1
(Hoeffding’s inequality for U-statistics; Hoeffding 1963) Let \(h(x_{1},x_{2},\ldots ,x_{m})\) be a symmetric kernel of the U-statistic \(U_{n}\), with \(a\le h(x_{1},x_{2},\ldots ,x_{m})\le b\). Then, for \(t>0\) and \(m<n\), we have
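The bound in question is a classical result. In its usual form it reads as below; this is the standard textbook statement rather than a transcription of the authors' display, so constants or the one-sided versus two-sided form may differ from theirs. Here \(\lfloor n/m\rfloor\) denotes the integer part of \(n/m\).

```latex
% Hoeffding's inequality for a U-statistic U_n of order m with kernel bounded in [a, b]
\[
P\left\{ U_{n}-E(U_{n})\ge t\right\} \le
\exp \left\{ -\frac{2\lfloor n/m\rfloor t^{2}}{(b-a)^{2}}\right\} ,
\qquad
P\left\{ \vert U_{n}-E(U_{n})\vert \ge t\right\} \le
2\exp \left\{ -\frac{2\lfloor n/m\rfloor t^{2}}{(b-a)^{2}}\right\} .
\]
```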
Proof of Theorem 2
By the definition of \(\omega _{k}\) in (3), we have, for each \(k=1,2,\ldots ,p\),
where \(K=\sum _{j \ne i}^{n}I(Y_{i}<Y_{j},\delta _{i}=1)\) is the number of all comparable pairs (in particular, \(K=\binom{n}{2}\) when the data are fully observed) and \(h[(X_{ik},Y_{i}),(X_{jk},Y_{j})]=[\delta _{i}I(X_{ik}<X_{jk})I(Y_{i}<Y_{j}) +\delta _{j}I(X_{ik}>X_{jk})I(Y_{i}>Y_{j})]-\frac{1}{2}\).
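The count \(K\) and the kernel \(h\) defined above translate directly into a pairwise computation. The following Python sketch is illustrative only: the function name `concordance_omega` is hypothetical, and the normalization (share of concordant pairs among the \(K\) comparable pairs, centred at 1/2) is an assumption about the form of \(\omega _{k}\), not a transcription of definition (3) or of the MFSIS implementation.

```python
import numpy as np

def concordance_omega(x, y, delta):
    """Censored concordance statistic for one feature (illustrative sketch).

    x     : feature values X_{1k}, ..., X_{nk}
    y     : observed responses Y_1, ..., Y_n
    delta : status indicators (1 = observed, 0 = censored)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)

    # comparable[i, j] = delta_i * I(Y_i < Y_j); the diagonal is zero automatically
    comparable = delta[:, None] * (y[:, None] < y[None, :])
    # concordant[i, j] additionally requires I(X_ik < X_jk)
    concordant = comparable * (x[:, None] < x[None, :])

    K = comparable.sum()  # number of comparable ordered pairs
    if K == 0:
        raise ValueError("no comparable pairs")
    # Assumed normalization: proportion of concordant comparable pairs minus 1/2
    return concordant.sum() / K - 0.5
```

Summing over ordered pairs \((i,j)\) with \(\delta _{i}I(Y_{i}<Y_{j})\) is equivalent to summing the symmetric kernel over unordered pairs, which is why a single masked matrix suffices. The O(n²) memory cost is acceptable for a sketch but would need chunking for large \(n\).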
This means that \(\omega _{k}\) is a U-statistic with symmetric kernel \(h[(X_{ik},Y_{i}),(X_{jk},Y_{j})]\). As the indicator function is bounded by 1, we have that
Also, \(-\frac{1}{2}\le E(\omega _{k})\le \frac{1}{2}\). Applying Hoeffding’s inequality (Lemma 1), for \(0<\kappa <\frac{1}{2}\) and any \(c_{1}>0\), there exists a positive constant \(c_{2}\) such that
where \(c_{2}=c_{1}^{2}\). Thus,
Denote the set \(S_{n}=\left\{ \max \limits _{k \in {\mathcal {M}}_{*}}\vert \omega _{k}-E(\omega _{k})\vert \ge c_{0}n^{-\kappa }/2\right\}\). On the set \(S_{n}\), by the result of Theorem 1, we have
From inequality (A13), it is easy to check that there exists a positive constant \(c_{2}\) such that
By choosing \(\gamma _{n}=c_{3}n^{-\kappa }\) with \(c_{3}\le c_{0}/2\), we have
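In practice, a threshold of the form \(\gamma _{n}=c_{3}n^{-\kappa }\) is applied by keeping every feature whose absolute statistic reaches it. The sketch below assumes the selected set has the usual form \(\{k:\vert \omega _{k}\vert \ge \gamma _{n}\}\); the function name `screen_by_threshold` and the example values of \(c_{3}\) and \(\kappa\) are hypothetical and chosen only for illustration.

```python
import numpy as np

def screen_by_threshold(omega, n, c3, kappa):
    """Return indices k with |omega_k| >= gamma_n, where gamma_n = c3 * n**(-kappa)."""
    omega = np.asarray(omega, dtype=float)
    gamma_n = c3 * n ** (-kappa)  # threshold shrinks as the sample size grows
    return np.flatnonzero(np.abs(omega) >= gamma_n)

# Illustrative values only: four features, n = 100, c3 = 1, kappa = 0.4
selected = screen_by_threshold([0.30, -0.02, -0.25, 0.01], n=100, c3=1.0, kappa=0.4)
```

Since \(c_{3}\) and \(\kappa\) are unknown in applications, a common alternative is to keep a fixed number of top-ranked features instead of thresholding.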
Appendix B Example 4: other complicated nonlinear models
In this appendix, we illustrate that CSIS can be applied to more complicated scenarios. Following DC-SIS (Li et al. 2012b) and PC-Screen (Liu et al. 2022), we consider the following four scenarios with \(\varepsilon \sim N(0,1)\). Models of this kind are frequently applied in economic time series analysis (Lovell 1963) and income inequality studies (Klein et al. 2015).
Model 4.a (Dummy variables): \(Y=3X_{1}+3\,\text {sgn}(X_{2})+3I(X_{3}>0.5) +3I(X_{4}<0.5)+3I(-0.5<X_{5}<0.5)+\varepsilon\)
Model 4.b (Additive structure): \(Y=5X_{1}+2\sin (\pi X_{2}/2)+2X_{3}I(X_{3}>0) +2\exp (5X_{4})+3(1/X_{5})+\varepsilon\)
Model 4.c (A sophisticated nonlinear model & strong signal): \(Y=5\arcsin [X_{1}I(-1\le X_{1}\le 1)] +2\arccos [X_{2}I(-1\le X_{2}\le 1)] +2\arctan (X_{3}) +2\tan [X_{4}I(-\frac{\pi }{2}\le X_{4}\le \frac{\pi }{2})] +2\sin [X_{5}I(-\frac{\pi }{2}\le X_{5}\le \frac{\pi }{2})] +\varepsilon\)
Model 4.d (A sophisticated nonlinear model & weak signal): \(Y=\arcsin [X_{1}I(-1\le X_{1}\le 1)] -0.8\arccos [X_{2}I(-1\le X_{2}\le 1)]+0.6\arctan (X_{3}) -0.4\tan [X_{4}I(-\frac{\pi }{2}\le X_{4}\le \frac{\pi }{2})] +0.2\sin [X_{5}I(-\frac{\pi }{2}\le X_{5}\le \frac{\pi }{2})] +\varepsilon\)
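To make the model definitions concrete, the following sketch generates data from Models 4.a and 4.b. The distribution of the predictors is not stated in this appendix, so independent standard normal features are assumed here purely for illustration; the function names and the choice of \(n\) and \(p\) are likewise hypothetical.

```python
import numpy as np

def simulate_model_4a(n, p, rng):
    """Model 4.a (dummy variables); X ~ N(0, I_p) is an assumption of this sketch."""
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)  # epsilon ~ N(0, 1), as stated in the text
    y = (3 * X[:, 0]
         + 3 * np.sign(X[:, 1])
         + 3 * (X[:, 2] > 0.5)
         + 3 * (X[:, 3] < 0.5)
         + 3 * ((X[:, 4] > -0.5) & (X[:, 4] < 0.5))
         + eps)
    return X, y

def simulate_model_4b(n, p, rng):
    """Model 4.b (additive structure); X ~ N(0, I_p) is an assumption of this sketch."""
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)
    y = (5 * X[:, 0]
         + 2 * np.sin(np.pi * X[:, 1] / 2)
         + 2 * X[:, 2] * (X[:, 2] > 0)
         + 2 * np.exp(5 * X[:, 3])
         + 3 / X[:, 4]
         + eps)
    return X, y

rng = np.random.default_rng(0)
Xa, ya = simulate_model_4a(n=200, p=1000, rng=rng)
Xb, yb = simulate_model_4b(n=200, p=1000, rng=rng)
```

In both models only the first five columns of `X` enter the response, so the remaining columns play the role of inactive features. The terms \(\exp (5X_{4})\) and \(1/X_{5}\) in Model 4.b produce heavy-tailed responses, which is the setting where rank-based statistics are expected to help.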
The average values, over 100 replications, of the proportion \({\mathcal {S}}\) and of the minimum model size \({\mathcal {R}}\) that includes all five active features are presented in Table 7. In these nonlinear examples, all five competitors perform well. From Model 4.a to Model 4.d, DC-SIS performs comparably to CSIS in each scenario and outperforms the other three methods, especially in Model 4.c and Model 4.d. Overall, CSIS retains the sure screening property while requiring a relatively small model size to recover the truly important variables in this example.
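The two summary measures can be computed from the screening statistics of each replication. The sketch below takes \({\mathcal {R}}\) to be the worst rank attained by any active feature when features are ordered by decreasing absolute statistic, and takes \({\mathcal {S}}\) to be the share of replications in which all active features fall within a given model size \(d\). The latter reading, the function names, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def minimum_model_size(scores, active):
    """Smallest number of top-ranked features needed to contain every active feature."""
    scores = np.abs(np.asarray(scores, dtype=float))
    order = np.argsort(-scores, kind="stable")    # feature indices, strongest first
    rank = np.empty_like(order)
    rank[order] = np.arange(1, scores.size + 1)   # rank[k] = position of feature k
    return int(rank[list(active)].max())

def selection_proportion(score_matrix, active, d):
    """Share of replications whose top-d features contain all active features."""
    sizes = np.array([minimum_model_size(s, active) for s in score_matrix])
    return float(np.mean(sizes <= d))

# Toy example: two replications, five features, active set {0, 2, 4}
reps = [[0.40, -0.10, 0.30, 0.05, -0.35],
        [0.10, 0.40, 0.30, 0.05, 0.35]]
R_first = minimum_model_size(reps[0], active=[0, 2, 4])
S_top3 = selection_proportion(reps, active=[0, 2, 4], d=3)
```

A small \({\mathcal {R}}\) and an \({\mathcal {S}}\) close to one both indicate that the active features are ranked near the top.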
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheng, X., Li, G. & Wang, H. The concordance filter: an adaptive model-free feature screening procedure. Comput Stat 39, 2413–2436 (2024). https://doi.org/10.1007/s00180-023-01399-5