5.2. Procedure 1 Results: Filtering Method (Correlation-Based)
Table 1 compares the number of genes associated with the cumulative hazard function at the different alpha levels used in the correlation-based filtering technique.
As the alpha value increases, the number of selected genes also increases. Specifically, at an alpha level of 0.001, 27 genes were selected. When the alpha level was raised to 0.01, the selection increased to 89 genes. At an alpha of 0.02, 158 genes were identified, and, at 0.03, the number rose to 208 genes. This upward trend continues with further increases in alpha, resulting in the selection of 247 genes at 0.04, 522 genes at 0.1, 907 genes at 0.2, 1287 genes at 0.3, 1647 genes at 0.4, and 2032 genes at 0.5.
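The relationship between the alpha level and the number of retained genes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the response vector stands in for the cumulative hazard, and the p-value uses the Fisher z normal approximation rather than whatever test produced Table 1. The helper names `pearson_r` and `fisher_p_value` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Two-sided p-value for H0: rho = 0 (Fisher z approximation, an assumption)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Synthetic stand-in data: 181 samples, 300 genes, first 10 genes carry signal.
random.seed(0)
n_samples, n_genes = 181, 300
response = [random.gauss(0, 1) for _ in range(n_samples)]
expr = []
for j in range(n_genes):
    signal = 0.4 if j < 10 else 0.0
    expr.append([signal * y + random.gauss(0, 1) for y in response])

# One p-value per gene, computed once and reused for every alpha level.
p_values = [fisher_p_value(pearson_r(gene, response), n_samples) for gene in expr]

alphas = [0.001, 0.01, 0.02, 0.03, 0.04, 0.1, 0.2, 0.3, 0.4, 0.5]
counts = [sum(p < a for p in p_values) for a in alphas]
for a, c in zip(alphas, counts):
    print(f"alpha={a}: {c} genes selected")
```

Because a gene passing the filter at a small alpha also passes at every larger alpha, the counts can only stay the same or grow as alpha increases, which is the pattern reported in Table 1.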
The line plot in Figure 2 shows the performance of the model at the various alpha levels. As the alpha value increases, the number of selected genes increases, and as it decreases, fewer genes are retained. The model is therefore most selective when the alpha value is small. In this case, it performs best, i.e., selects the minimum number of genes (27), at α = 0.001.
The most common way to present gene expression data is a heatmap, which may also be paired with cluster dendrograms. The heatmap’s goal is to reveal genes that are over- or underexpressed, as well as biological signatures that are connected with a certain state (i.e., a disease condition). The heatmap in Figure 3 covers the 3835 genes in the dataset, where each row represents a sample and each column represents a gene. The color and intensity of the boxes represent changes in gene expression: shades of blue and green indicate elevated expression (i.e., highly significant to the sample), shades of purple indicate decreased expression (i.e., zero significance to the sample), and blue indicates unchanged expression (i.e., genes with scale measures of zero).
Another approach is to present the genomic data in a correlation heatmap. The correlation heatmap in Figure 4 was created to illustrate the connections between the chosen variables (genes). The color and intensity of the boxes represent the degree of correlation in the gene expression data: shades of blue and green indicate elevated correlation (i.e., highly significant to the sample), shades of purple indicate slightly weak correlation (i.e., little or no significance to the sample), and blue indicates unchanged correlation (i.e., genes with high correlation).
The summary statistics in Table 2 provide a detailed overview of the results for the 27 genes selected by the filtering method.
5.3. Procedure 2 Results: Wrapper Method (IBMA)
Table 3 shows the results of applying the Iterative Bayesian Model Averaging (IBMA) technique, a wrapper method, to reduce dimensionality by selecting key genes. Across various numbers of models (ranging from five to forty), the method consistently identified 15 genes. This stability in gene selection indicates that increasing the number of models did not lead to the selection of more genes as IBMA continually settled on the same 15 variables.
Table 4 summarizes the number of features (genes) selected from a high-dimensional survival dataset of diffuse large B-cell lymphoma (DLBCL). The dataset contains 3835 covariates, and the analysis was performed on a sample of 181 individuals.
Using the filtering method, 27 genes were selected from the original set of 3835 covariates. This method reduces the large number of variables by applying specific criteria to identify a subset of relevant features for further analysis.
After applying the filtering process, the Iterative Bayesian Model Averaging (Iterative BMA) method identified 15 genes. This technique employs Bayesian principles to systematically select those features that are most influential in predicting survival outcomes, emphasizing simplicity and significance in the selection process.
Table 5 presents the 15 genes selected through Iterative Bayesian Model Averaging (IBMA) from the high-dimensional survival dataset, summarizing the key metrics of each gene’s effect on the outcome. For each gene, the posterior coefficient and posterior probability percentage are provided.
The posterior coefficient represents the estimated influence of each gene on the survival outcome. Positive coefficients, such as for (0.2455597), indicate a positive relationship with survival, whereas negative coefficients, such as for (−0.1786791), reflect a negative relationship. The magnitude of the coefficient signifies the strength of this influence.
The posterior probability percentage indicates the likelihood of a gene being included in the predictive model. Genes like,, and have posterior probabilities of 100%, showing strong evidence of their relevance in predicting survival outcomes. In contrast, genes such as (1.4%) and (3.9%) have low probabilities, suggesting limited importance in the model.
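To make the two quantities concrete, the following is a minimal sketch of how Bayesian model averaging typically combines a set of candidate models. It is not the implementation behind Table 5: the three models, their BIC values, the coefficients, and the gene labels `g1` to `g3` are hypothetical, and the sketch assumes equal prior model probabilities with the common BIC approximation to the posterior model weights.

```python
import math

# Hypothetical candidate models: (genes included, BIC, fitted coefficients).
models = [
    ({"g1", "g2"}, 120.0, {"g1": 0.25, "g2": -0.18}),
    ({"g1"}, 121.5, {"g1": 0.22}),
    ({"g1", "g3"}, 124.0, {"g1": 0.24, "g3": 0.05}),
]


def posterior_model_weights(bics):
    """Approximate P(model | data) as proportional to exp(-BIC / 2)."""
    best = min(bics)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


weights = posterior_model_weights([bic for _, bic, _ in models])


def inclusion_probability(gene):
    """Sum of the weights of all models that contain the gene."""
    return sum(w for w, (genes, _, _) in zip(weights, models) if gene in genes)


def posterior_coefficient(gene):
    """Weight-averaged coefficient; a model without the gene contributes 0."""
    return sum(w * coefs.get(gene, 0.0)
               for w, (_, _, coefs) in zip(weights, models))


for gene in ("g1", "g2", "g3"):
    print(gene,
          f"P(included) = {100 * inclusion_probability(gene):.1f}%",
          f"coefficient = {posterior_coefficient(gene):.4f}")
```

In this toy setting, a gene that appears in every retained model receives a posterior probability of 100%, while a gene that appears only in a poorly supported model receives a small percentage, mirroring the contrast described above.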
The variable importance of the 27 selected genes, presented in Figure 5, underscores their contribution to predicting the survival outcomes of patients with diffuse large B-cell lymphoma (DLBCL).
5.5. Comparison with the Parametric Methods
In assessing the influence of the selected genes on patient survival in DLBCL, we begin by considering parametric survival models [48]. While these models are valuable in scenarios where distributional assumptions hold, they may not be suitable for many real-world datasets. Therefore, we propose an alternative approach using a semi-parametric model, specifically the Cox Proportional Hazards (Cox-PH) Model [
These chosen models are suitable for DLBCL survival data as they reflect the influence of gene biomarkers on patient survival. The results from these models will guide clinical decision-making by identifying biomarkers that influence DLBCL cancer, enabling the provision of appropriate treatment for patients.
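For reference, the two model families contrasted above can be written side by side. The notation is an assumption made for illustration, since this section does not fix symbols: $h_0(t)$ is a baseline hazard, $x_i$ is the vector of selected gene expressions for patient $i$, $\beta$ is the coefficient vector, and $\lambda$ and $\gamma$ are the Weibull scale and shape in one common proportional-hazards parameterization (other parameterizations, such as the accelerated failure time form, are equally possible).

```latex
% Cox-PH (semi-parametric): the baseline hazard h_0(t) is left unspecified
h(t \mid x_i) = h_0(t)\,\exp\!\left(x_i^{\top}\beta\right)

% Weibull (parametric, proportional-hazards form): the baseline is fully specified
h(t \mid x_i) = \lambda\,\gamma\,t^{\gamma-1}\,\exp\!\left(x_i^{\top}\beta\right)
```

Setting $\gamma = 1$ in the Weibull form gives the constant-hazard exponential model, which is why the exponential model has one parameter fewer than the Weibull model.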
Table 7 summarizes the comparison of different parametric models applied to the dataset using AIC and BIC as performance measures. Lower values of both AIC and BIC indicate better model fit.
Based on the comparison in Table 7, the Weibull model has the smallest AIC (70.1692) and BIC (124.5437), indicating that it provides the best fit to the data. In contrast, the exponential model has the highest AIC (159.9177) and BIC (211.0937) values, making it the least suitable model.
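The two criteria can be related through their standard definitions, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of estimated parameters and n the sample size. The sketch below uses the reported values to back out the implied k. It assumes that BIC was computed with n = 181 (the sample size stated for the dataset); the function names are illustrative.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_lik


def implied_num_params(aic_value, bic_value, n):
    """Solve BIC - AIC = k * (ln n - 2) for k."""
    return (bic_value - aic_value) / (math.log(n) - 2.0)


n = 181  # assumption: BIC penalty based on the full sample size
k_weibull = implied_num_params(70.1692, 124.5437, n)
k_exponential = implied_num_params(159.9177, 211.0937, n)
print(f"Weibull: k is about {k_weibull:.2f}")
print(f"Exponential: k is about {k_exponential:.2f}")
```

Under this assumption the reported values imply about 17 parameters for the Weibull model and about 16 for the exponential model, which would be consistent with 15 gene coefficients plus an intercept, and one additional shape parameter for the Weibull model.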
Table 8 below presents the comparison between a parametric model (Weibull) and a semi-parametric model (Cox-PH) using AIC and BIC to assess model performance. Lower values indicate a better model fit.
The parametric Weibull model shows far lower AIC (70.1692) and BIC (124.5437) values compared to the semi-parametric Cox-PH Model, indicating that it provides a better fit to the data.
Table 2 summarizes the correlation coefficients, p-values, test statistics, and confidence intervals for the 27 genes. The correlation values range from moderately positive (e.g., ) to negative (e.g., ), reflecting varying strengths and directions of association between the gene expressions. The majority of the genes display p-values below the threshold of 0.001, highlighting statistically significant associations. For instance, demonstrates strong significance ( ), with a confidence interval of , underscoring the robustness of this result. The confidence intervals further affirm the reliability of these findings. Negative correlations, such as ), have confidence intervals that exclude zero ( ), strengthening the evidence of inverse associations. These results emphasize the effectiveness of Iterative BMA in identifying significant genes and characterizing their correlations, offering valuable insights with potential biological and clinical implications.
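The link between a correlation coefficient, its test statistic, and a confidence interval that excludes zero can be sketched as follows. This mirrors what a standard Pearson correlation test reports (a t statistic and a Fisher z interval), but the exact procedure behind Table 2 is not stated here, so treat it as an assumption. The value r = −0.30 is hypothetical and n = 181 is the stated sample size.

```python
import math
from statistics import NormalDist


def correlation_summary(r, n, level=0.95):
    """Return the t statistic and Fisher z confidence interval for a correlation r."""
    t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n - 3)
    ci = (math.tanh(z - half_width), math.tanh(z + half_width))
    return t_stat, ci


# Hypothetical negative correlation at the stated sample size.
t_stat, (lo, hi) = correlation_summary(r=-0.30, n=181)
excludes_zero = not (lo <= 0.0 <= hi)
print(f"t = {t_stat:.3f}, 95% CI = ({lo:.3f}, {hi:.3f}), excludes zero: {excludes_zero}")
```

An interval that lies entirely below zero, as in this hypothetical case, is what supports the reading of a gene as inversely associated rather than uncorrelated.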
The tables presenting the results from the Iterative Bayesian Model Averaging (BMA), Cox Proportional Hazards Model, and Weibull model provide insights into the significance of covariates (genes) across different statistical approaches, which can be interpreted through their p-values.
In the Iterative BMA table, genes are selected based on their posterior probabilities. A high posterior probability, such as 100% for genes like,, and, suggests that these genes are highly likely to be relevant for survival prediction. However, while posterior probabilities indicate the likelihood of relevance, they do not directly reflect statistical significance. Therefore, a gene with a high posterior probability might not necessarily exhibit a low p-value in the survival models (Cox and Weibull).
The p-values in the Cox model are crucial for assessing the statistical significance of each gene’s effect on survival. For example, genes such as and have p-values below 0.05, indicating that these genes have a significant impact on survival. In contrast, genes like and show higher p-values (e.g., 0.171501 and 0.796420), suggesting that they do not significantly influence survival. When comparing the posterior probabilities from the Iterative BMA table with the p-values from the Cox model, it is evident that, although some genes are selected by BMA with high posterior probabilities, they may not all have significant p-values in the Cox model (e.g.,).
The Weibull model also uses p-values to determine the significance of each gene’s association with survival. Low p-values, such as for,, and, suggest that these genes have a strong and statistically significant relationship with survival. Genes with higher p-values (e.g., and) indicate that their effect on survival is weaker or insignificant. Comparing the p-values from both the Weibull and Cox models, we observe that certain genes, like and, are significant in both models, highlighting their importance. However, some genes that were significant in the Weibull model (e.g.,) may show weaker significance in the Cox model, as demonstrated with having a p-value of 0.000324 in the Cox model.
Genes such as,, and appear in the Iterative BMA selection and exhibit low p-values in both the Cox and Weibull models. This consistency suggests that these genes are strongly associated with survival and are robust across different statistical approaches.
On the other hand, genes like and, despite being selected in the Iterative BMA process, show higher p-values in the survival models, indicating that, while they may have been considered relevant during feature selection, their impact on survival is not statistically significant.
The analysis of Table 5 revealed that the genes with 100% posterior probabilities, which were consistently selected by the Correlation-Based IBMA algorithm, were also statistically significant in the extended results of both the Cox-PH and Weibull models presented in Table 9 and Table 10. These genes thus demonstrated strong predictive potential and maintained their importance across different statistical approaches, reinforcing their relevance for predicting the survival of patients with DLBCL.
Some genes, such as and, were selected by the Correlation-Based IBMA algorithm but did not exhibit statistically significant p-values in the survival models. They still contribute to the power of the Correlation-Based IBMA algorithm by narrowing down the pool of candidate genes. Even if their direct effects are not immediately evident, these genes may play a role in a broader biological context, further highlighting the value of the Correlation-Based IBMA algorithm in identifying potentially important genes.