
VoxelHop: Successive Subspace Learning for ALS Disease Classification Using Structural MRI
Xiaofeng Liu
Fangxu Xing
C-C Jay Kuo
Suma Babu
Georges El Fakhri
Thomas Jenkins
Jonghye Woo
Correspondence to Jonghye Woo, PhD (jwoo@mgh.harvard.edu) and Thomas Jenkins, MD (t.m.jenkins@sheffield.ac.uk)
Issue date 2022 Mar.
Abstract
Deep learning has great potential for accurate detection and classification of diseases with medical imaging data, but the performance is often limited by the number of available training datasets and by memory requirements. In addition, many deep learning models are considered a “black-box,” thereby often limiting their adoption in clinical applications. To address this, we present a successive subspace learning model, termed VoxelHop, for accurate classification of Amyotrophic Lateral Sclerosis (ALS) using T2-weighted structural MRI data. Compared with popular convolutional neural network (CNN) architectures, VoxelHop has modular and transparent structures with fewer parameters without any backpropagation, so it is well-suited to small dataset sizes and 3D imaging data. Our VoxelHop has four key components, including (1) sequential expansion of near-to-far neighborhood for multi-channel 3D data; (2) subspace approximation for unsupervised dimension reduction; (3) label-assisted regression for supervised dimension reduction; and (4) concatenation of features and classification between controls and patients. Our experimental results demonstrate that our framework using a total of 20 controls and 26 patients achieves an accuracy of 93.48% and an AUC score of 0.9394 in differentiating patients from controls, even with a relatively small number of datasets, showing its robustness and effectiveness. Our thorough evaluations also show its validity and superiority to the state-of-the-art 3D CNN classification approaches. Our framework can easily be generalized to other classification tasks using different imaging modalities.
Index Terms—: Successive Subspace Learning, Clinical Decision-Making System, MRI, Amyotrophic Lateral Sclerosis
I. Introduction
Over the last few years, deep learning has shown state-of-the-art performance in a variety of tasks, including prediction and classification, surpassing previous machine learning techniques [1]. In addition, recent developments of deep learning with medical imaging data have outperformed human performance in some cases, thus showing the potential to aid clinicians in the diagnosis or decision-making process [2]. While deep learning has shown great potential for accurate detection and prediction of neurologic disorders with medical imaging data, there are still several challenges in developing robust and accurate models [3]. For example, successful deep learning models require massive training datasets (e.g., hundreds to thousands of 3D imaging datasets) for accurate model fitting [1]. In contrast to the over one million natural 2D images already available (e.g., ImageNet), however, it is challenging to collect such massive 3D medical imaging data in many clinical applications, which limits the ability to learn a suitable image representation for downstream tasks, including classification. In addition, 3D medical imaging data are larger than 2D images, thereby demanding complex deep learning models [4]; without sufficient data and memory, it is not easy to apply deep learning models successfully used on 2D natural image datasets, such as the winning models of the ImageNet Large Scale Visual Recognition Challenge (e.g., AlexNet [5], VGG [6], and ResNet [7]). Several works alleviate this problem by using patches or 2D slices instead of 3D volumes, or by downsampling volumes [8]. These approaches, however, can miss important details inherent in the datasets, leading to performance loss. Furthermore, many deep learning models are considered a “black-box” model [9], [1]. As a result, while the performance is promising, the adoption of deep learning models in clinical practice is still in its infancy. Therefore, it is of great importance to develop a lightweight and transparent model that can reliably and efficiently deal with a small number of 3D datasets for clinical applications, while maintaining an accuracy level comparable to or better than existing models.
In this work, we are interested in developing a clinical decision-making system for Amyotrophic Lateral Sclerosis (ALS) using a successive subspace learning (SSL) model. ALS is a neurodegenerative disorder characterized by loss of cortical and spinal motor neurons in the brain, leading to progressive muscle weakness across multiple body regions, including bulbar regions [10], [11], [12]. A variable mix of cortical and spinal motor neuron signs contributes to the clinical heterogeneity of ALS. Clinical diagnosis of ALS is mostly based on subjective assessments, such as inclusion and exclusion clinical criteria—i.e., the El Escorial criteria—and there are no clinically adopted confirmatory tests or objective biomarkers available yet. Therefore, a fundamental challenge in ALS research and clinical practice is to detect the disease early and track its progression accurately and objectively to reduce the duration and expense of clinical trials and to ensure patients have access to therapeutic trials in a timely manner. To this end, a clinical decision-making system that can differentiate ALS patients from healthy controls is of great need and importance.
Magnetic resonance imaging (MRI) has been an effective tool to identify structural abnormalities in ALS, especially those in the brain and tongue [13]. Structural MRI allows measuring the volume and shape of different parts of the brain and tongue. Patients with ALS have shown widespread gray matter atrophy in frontotemporal regions [14] and atrophy in almost all internal muscles of the tongue [15], compared with healthy controls. Voxel-based morphometry (VBM) [16], which assesses volume differences between an atlas constructed from healthy subjects and a cohort of patients, has been widely used in the brain for a variety of neurological disorders, such as Alzheimer’s disease [17], stroke [18], traumatic brain injury [19], depression [20], and ALS [14]. Independent of the brain, there are also a few works characterizing volume or motion differences of the tongue between ALS patients and healthy controls [21], [15], [22], [23]. Since hypoglossal neurons gradually degenerate due to ALS, previous works focused on how ALS causes muscle atrophy and weakness by measuring anatomical characteristics, such as volume and muscle fibers, in the tongue using diffusion and structural MRI [15], [22]. To our knowledge, however, there is no previous work that simultaneously assesses differences in both the brain and tongue between controls and ALS patients.
In this work, a lightweight machine learning framework (see Fig. 1), termed VoxelHop, is presented for classifying between ALS patients and healthy controls using T2-weighted MRI. Specifically, we first construct a head and neck atlas from the T2-weighted MRI of the 20 healthy controls only, and carry out diffeomorphic registration of each subject with the atlas. The deformation fields, which encode voxel-wise expansion and contraction, are then input into our VoxelHop framework. Our VoxelHop framework has four key components, including (1) sequential expansion of near-to-far neighborhoods for multi-channel 3D deformation fields; (2) subspace approximation for unsupervised dimension reduction; (3) label-assisted regression for supervised dimension reduction; and (4) concatenation of features and classification between controls and patients. The dimension reduction is achieved by Principal Component Analysis (PCA), thereby removing less important features. Inspired by the recent stacked design of deep neural networks, the SSL principle has been applied to classifying 2D images (e.g., PixelHop [24], [25]) and point clouds (e.g., PointHop [25]). However, SSL-based PixelHop for multi-channel 3D data has not been explored previously. In addition, the subspace approximation with adjusted bias (Saab) transform [26], a variant of PCA, is applied as an alternative to nonlinear activation, which helps avoid the sign confusion problem [9]. Furthermore, the Saab transform can be more mathematically interpretable than the nonlinear activation functions used in CNNs [26], [27], since the model parameters are determined stage-by-stage in a feedforward manner, without any backpropagation. Therefore, the training of VoxelHop can be more efficient and interpretable than that of 3D CNNs [24].
Fig. 1.
Illustration of the proposed multi-channel VoxelHop framework, comprising three modules. Our framework utilizes cascaded multi-stage units for local-to-global expansion, akin to the processing in CNNs, which have a larger receptive field in the deeper layers.
Here, we propose an SSL-based VoxelHop classification framework for successive channel-wise local-to-global neighborhood information analysis using 3D volume data, which can be easily generalized to other classification tasks using different imaging modalities [4]. The main contributions of this work can be summarized as follows:
To the best of our knowledge, this is the first attempt at exploring 3D deformation fields with an SSL framework for ALS disease classification.
Our framework is lightweight and mathematically interpretable by adapting SSL for multi-channel 3D volume data, which works reliably with a small number of datasets.
Our framework achieves a superior classification performance with 10× fewer parameters and much less training time, compared with state-of-the-art 3D CNN-based classification approaches.
The systematic and thorough comparisons with 3D CNNs provide further insights into the potential benefits of our framework.
II. Methodology
A. MRI Data Acquisition and Preprocessing
We collected T2-weighted MRI of the head and neck from a total of 20 controls and 26 patients via a fast spin-echo sequence (Philips Ingenia, Best, Netherlands) with the following parameters: TR/TE = 1,107/80 ms and interpolated voxel size = 0.78×0.78×5 mm³. The statistics of the collected data are shown in Table I. There were no significant differences in age, gender, or weight between the patient and healthy control groups. Patients were assessed at a relatively early stage of disease, as evidenced by well-preserved ALS functional rating scale-revised (ALSFRS-R) and Medical Research Council (MRC) composite muscle scores, and a median disease duration of just over a year from symptom onset (the median delay from symptom onset to diagnosis in ALS is approximately a year). We note that in one patient, weight was not available (due to advanced disability). Additional acquisition details are described in [10].
TABLE I.
Statistics of the collected data.
Characteristic | Patients | Controls | P value |
---|---|---|---|
Mean age, years (SD) | 56.6 (13.5) | 51.8 (14.2) | 0.251* |
Gender (females) | 5/26 (19%) | 7/20 (35%) | 0.948** |
Weight, kg (SD) | 78.6 (15.2) | 77.3 (13.8) | 0.758* |
Median weeks from symptom onset (range) | 62 (25–261) | - | - |
Mean MRC score (SD) | 102/110 (6.7) | - | - |
Mean ALSFRS-R (SD) | 40/48 (4.1) | - | - |
* Unpaired t-test, unequal variance;
** Chi-squared test;
MRC: Medical Research Council; SD: standard deviation; kg: kilograms; ALSFRS-R: amyotrophic lateral sclerosis functional rating scale-revised.
A head and neck image atlas was constructed with group-wise diffeomorphic registration [28], as shown in Fig. 2. Then, registration between each subject and the atlas was carried out to generate the 3D deformation fields, each with the size of 704×704×50×3. Based on the manually identified brain and tongue masks, as shown in Fig. 2, we cropped a region of interest (ROI) with the size of 330×220×30, which was slightly larger than the exact ROI.
Fig. 2.
(A) A head and neck atlas and its segmentation and (B) illustration of the ROI cropping and downsampling based on the brain and tongue mask. Note that all the subjects are registered to the atlas, so all the deformation fields are in the same spatial coordinate system.
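For readers who want to reproduce this preprocessing step, the short sketch below shows one way to obtain a per-voxel deformation (displacement) field with SimpleITK's diffeomorphic demons filter, i.e., the DD alternative evaluated in Section III-C; the SyN registration used for the main results would instead be run with the ANTs toolkit. File names and parameter values are placeholders rather than the settings used in this study.

```python
import SimpleITK as sitk

# Placeholder file names; both volumes are assumed to be resampled to the
# same grid and intensity-normalized before registration.
atlas = sitk.ReadImage("head_neck_atlas.nii.gz", sitk.sitkFloat32)
subject = sitk.ReadImage("subject_T2.nii.gz", sitk.sitkFloat32)

# Diffeomorphic demons registration (the "DD" variant, cf. [37]);
# the iteration count and smoothing value are illustrative only.
demons = sitk.DiffeomorphicDemonsRegistrationFilter()
demons.SetNumberOfIterations(100)
demons.SetStandardDeviations(1.5)   # Gaussian smoothing of the displacement field

# The output is a vector image holding the 3D displacement at every voxel.
displacement = demons.Execute(atlas, subject)

# Convert to a NumPy array of shape (slices, rows, cols, 3): the 3-channel
# 3D deformation field that is cropped to the ROI and fed into VoxelHop.
field = sitk.GetArrayFromImage(displacement)
print(field.shape)
```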
B. Our VoxelHop Framework
The input to our VoxelHop is the multi-channel 3D deformation field x with the size of S0 × S0 × K0 × C0, where S0 and K0 denote the horizontal and vertical dimensions, respectively, and C0 denotes the number of channels of the input data (we set C0 = 3 for the deformation field). In this work, we use 3D deformation fields as our input, since volume differences between the atlas and individual subjects, as embedded in the deformation fields, play a crucial role in the classification task. It is worth noting that the atlas serves as a reference volume, for which the obtained deformation field can be an objective and salient descriptor of the volume difference. Additionally, the vertical dimension is the same for all channels, but this can easily be generalized to a different vertical dimension for each channel.
The input x is fed into I cascaded multi-channel VoxelHop units (M-VoxelHop) and I − 1 max-pooling operations to extract the features (i.e., attributes in [24], [29]) at different spatial scales in the unsupervised Module 1. Then, the features extracted at the i-th M-VoxelHop unit, i ∈ {1, 2, ⋯, I}, are aggregated, followed by the supervised label-assisted regression (LAG) unit for further dimension reduction to generate an M′-dimensional feature vector in Module 2. In Module 3, the M′-dimensional feature vectors of all M-VoxelHop units are concatenated to form an (M′ × I)-dimensional feature vector for the final classification task. A block diagram of our framework is illustrated in Fig. 1.
Our modules are trained based on the previous one rather than independently of each other. For example, the supervised dimension reduction in Module 2 is trained on the features extracted by Module 1. Then, the classifier in Module 3 is trained with the features extracted in Module 2. We note that both Modules 2 and 3 are supervised by the ground-truth label, while the PCA-like Module 1 is used to efficiently extract general and discriminative information from an information-theoretic perspective. We detail each step below.
1). Cascaded Saab transforms for multi-channel 3D MRI:
The cascaded M-VoxelHop units (i.e., Module 1) are used to extract features from neighboring spatial content in an unsupervised manner. With multiple cascaded M-VoxelHop units, the neighborhood union is correlated with more voxels of x to extract global information. This is akin to the processing in CNNs, which have a larger receptive field in the deeper layers.
The previous SSL-based PixelHop [24] operates on 2D grayscale or color images, thus exploring the neighboring area on a 2D plane. For 2D images with the size of S0 × S0 × C0, there is no vertical dimension (i.e., C0 = 1 for grayscale images and C0 = 3 for RGB images). As a result, we can take the union of a pixel and its s1 × s1 × C0 nearest neighbors as a neighborhood union at the first PixelHop unit, where s1 is the dimension of the kernel in the horizontal S0 × S0 plane. In addition, we set the stride size to 1 to form the input in the first PixelHop unit.
Considering a boundary effect, there are (S0 − s1 + 1) × (S0 − s1 + 1) unions. In what follows, each neighborhood union is flattened to a vector. We then generate a feature matrix with the size of (S0 − s1 + 1) × (S0 − s1 + 1) × (s1 × s1 × C0). The channel dimension, however, can become too large as successive PixelHop units are stacked. Therefore, it is important to control this dimension explosion. The unsupervised dimension reduction can be achieved by the Saab transform [26], which compacts each (s1 × s1 × C0)-dimensional vector to F1 dimensions, where F1 is a hyperparameter to control the output dimension of the first PixelHop unit.
Specifically, we use the terms direct current (DC) and alternating current (AC) by analogy with circuit theory. In the first Saab transform, we configure one DC and F1 − 1 AC anchor vectors, each with the size of s1 × s1 × C0. With one DC and F1 − 1 AC anchor vectors, each flattened neighborhood union x is transformed into an F1-dimensional output vector y [24]. More formally, the f-th dimension of y is an affine transform of x, i.e.,
yf = af^T x + bf,   f = 0, 1, ⋯, F1 − 1,   (1)
and the Saab transform has a special design of the anchor vectors af and the bias terms bf [26]. Following [26], we set all bias terms to the same constant b, chosen to be no smaller than the maximum magnitude ∥x∥ over all inputs so that every response is non-negative, and divide the anchor vectors into two categories:
a0 = (1/√N)(1, 1, ⋯, 1)^T (DC) and af, f = 1, ⋯, F1 − 1 (AC), where N = s1 × s1 × C0.   (2)
At each VoxelHop unit, we can project the vector onto a0 to calculate its DC component. The subspace of AC is an orthogonal complement to the subspace of DC. Then, the AC component of x is expressed as xAC = x − xDC. In what follows, PCA is applied to xAC, and we choose the top F1 − 1 principal components as our AC anchor vectors af, f = 1, ⋯, F1 − 1. Therefore, an image with the size of S0 × S0 × C0 is reshaped to the size of S1 × S1 × C1, where S1 = S0 − s1 + 1, and C1 = 1 + (F1 − 1) = F1 is the sum of the numbers of DC and AC anchor vectors. The overall transformation is shown in Fig. 3 (top). In the SSL-based PixelHop [24], there are several Saab units to extract relevant features at different scales.
Fig. 3.
Illustration of the conventional Saab transform (top) and channel-wise Saab transform for the multi-channel 3D data (bottom).
An anchor vector operates on a s1 × s1 × C0 region of the input image and generates a scalar, which is similar to the convolution operation in CNNs. We can also regard the anchor vector as a filter with the kernel size of s1 × s1 × C0 [26]. Moreover, the use of multiple anchor vectors is analogous to the multiple filters in modern CNNs. Namely, the filters in CNNs are learned iteratively, during which the loss is computed and gradients are backpropagated, while the anchor vectors in our VoxelHop are defined with PCA in an unsupervised manner.
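To make the anchor-vector construction concrete, the following NumPy sketch fits and applies one Saab unit on a matrix of flattened neighborhood unions; it follows the DC/AC decomposition and bias choice described above, but it is an illustrative reimplementation under our own naming, not the authors' released code.

```python
import numpy as np

def fit_saab(unions, num_ac):
    """Fit one Saab unit.

    unions : (num_unions, D) matrix of flattened neighborhood unions,
             with D = s1 * s1 * C0 for the first unit.
    num_ac : number of AC anchor vectors (F1 - 1 in the text).
    Returns the (F1, D) anchor matrix and a bias that keeps all responses positive.
    """
    D = unions.shape[1]
    dc_anchor = np.ones(D) / np.sqrt(D)            # DC anchor vector a_0, Eq. (2)
    dc = unions @ dc_anchor                        # DC component of each union
    ac_part = unions - np.outer(dc, dc_anchor)     # AC component x_AC = x - x_DC
    # PCA on the AC component: top eigenvectors of its covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(ac_part, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    ac_anchors = eigvecs[:, order[:num_ac]].T      # (num_ac, D) AC anchor vectors
    anchors = np.vstack([dc_anchor, ac_anchors])   # one DC + (F1 - 1) AC vectors
    # A constant bias no smaller than max ||x|| shifts every response to be positive.
    bias = np.max(np.linalg.norm(unions, axis=1))
    return anchors, bias

def apply_saab(unions, anchors, bias):
    """y_f = a_f^T x + b for every neighborhood union x, cf. Eq. (1)."""
    return unions @ anchors.T + bias

# Toy usage: 5,000 flattened 3 x 3 x 3 unions compacted to F1 = 8 dimensions.
unions = np.random.randn(5000, 27).astype(np.float32)
anchors, bias = fit_saab(unions, num_ac=7)
print(apply_saab(unions, anchors, bias).shape)     # (5000, 8)
```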
Our input 3D deformation fields, however, have a multi-channel 3D structure; thus, we cannot directly apply our data to the vanilla 2D PixelHop model. Two issues need to be addressed to solve this problem: (1) we need to handle the three channels of each voxel in the deformation fields; and (2) the local-to-global spatial expansion should involve three dimensions. A common practice in conventional 3D CNN models dealing with multi-direction optical-flow sequences is to process the multiple directions independently and to concatenate the features of all channels in the first fully connected (FC) layer [30], [31]. More recently, in addition, a new SSL model [29] was proposed to process each spectral channel independently in such a way that the Saab coefficients can be weakly correlated in the channel direction. Similarly, in this work, we propose to apply the Saab transforms to the three channels of the 3D deformation fields separately, as shown in Fig. 4, followed by fusing them in the subsequent modules, similar to 3D CNNs [30], [31].
Fig. 4.
Illustration of the neighborhood union construction in 3D space and Saab Transform for one channel of 3D data. The same operation is applied to all channels in parallel.
To achieve the local-to-global neighborhood feature extraction of 3D data, we first construct the neighborhood union with the size of si × si × ki, where si and ki indicate the horizontal and vertical dimensions at the i-th VoxelHop unit, respectively. Considering a boundary effect, there are (Si−1 − si + 1)² × (Ki−1 − ki + 1) neighborhood unions for an input with the size of Si−1 × Si−1 × Ki−1. We then flatten each neighborhood union to a vector.
To achieve the unsupervised dimension reduction with the Saab transform, we apply one DC and Fi − 1 AC anchor vectors at the i-th VoxelHop unit. Each neighborhood union generates an Fi-dimensional processed vector. Therefore, the output of a single-channel VoxelHop has the size of Si × Si × Ki, where Si = Si−1 − si + 1 and Ki = (Ki−1 − ki + 1) × Fi. To involve more data in the vertical dimension at the subsequent VoxelHop units, we set ki+1 = vi × Fi and F0 = 1, i.e., the neighborhood union at the next VoxelHop unit covers vi output vectors in the vertical dimension.
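A minimal sketch of the single-channel neighborhood-union construction described above is given below, using NumPy's sliding-window view with a stride of 1 and no padding; the kernel sizes follow Table II (si = ki = 3), and the resulting matrix would then be passed to a Saab unit such as the one sketched earlier.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_unions(channel, s=3, k=3):
    """channel: one channel of the deformation field, shape (S, S, K).

    Returns the flattened neighborhood unions as a matrix of shape
    ((S - s + 1)**2 * (K - k + 1), s * s * k), matching the union count
    stated in the text (boundary effect, stride 1).
    """
    windows = sliding_window_view(channel, (s, s, k))  # (S-s+1, S-s+1, K-k+1, s, s, k)
    return windows.reshape(-1, s * s * k)

# Toy input mimicking one channel of the downsampled field (110 x 110 x 30).
channel = np.random.randn(110, 110, 30).astype(np.float32)
unions = extract_unions(channel)
print(unions.shape)    # (108 * 108 * 28, 27); the same is done for each channel.
```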
Since the horizontal and vertical stride size is usually set to be smaller than si or ki, there is spatial redundancy in the horizontal plane. Following the previous SSL works [24], [29], we configure a maxpooling operation to compact the size of the extracted features of Si × Si × Ki. Considering that the vertical dimension (e.g., K0 = 30) is smaller than the horizontal dimension (e.g., S0 = 110) in our application, we only apply the maxpooling in the horizontal plane for the first two VoxelHop units, i.e., we use (2 × 2 × 1)-to-(1 × 1 × 1) maximum pooling, which halves only the horizontal size. For the later maxpooling units, we use the standard (2 × 2 × 2Fi)-to-(1 × 1 × Fi) pooling, which halves both the horizontal and vertical spatial sizes. The detailed structure of the five-stage VoxelHop with si = 3 and vi = 3 is shown in Table II.
TABLE II.
The detailed structure of our five consecutive 3-channel VoxelHop
Input Size | Type | Filter Shape |
---|---|---|
[110 × 110 × (30 × 1)] × 3 | M-VoxelHop | [F1 kernels of 3 × 3 × 3]×3 |
[108 × 108 × (28 ×F1)] × 3 | MaxPool | (2×2×1)-(1×1×1) |
[54 × 54 × (28 ×F1)] × 3 | M-VoxelHop | [F2 kernels of 3 × 3 × 3]×3 |
[52 × 52 × (26 ×F2)] × 3 | MaxPool | (2×2×1)-(1×1×1) |
[26 × 26 × (26 ×F2)] × 3 | M-VoxelHop | [F3 kernels of 3 × 3 × 3]×3 |
[24 × 24 × (24 ×F3)] × 3 | MaxPool | (2×2×2F3)-(1×1×F3) |
[12 × 12 × (12 ×F3)] × 3 | M-VoxelHop | [F4 kernels of 3 × 3 × 3]×3 |
[10 × 10 × (10 ×F4)] × 3 | MaxPool | (2×2×2F4)-(1×1×F4) |
[5 × 5 × (5 ×F4)] × 3 | M-VoxelHop | [F5 kernels of 3 × 3 × 3]×3 |
[3 × 3 × (3 ×F5)] × 3 | MaxPool | (2×2×2F5)-(1×1×F5) |
There are three types of losses in multi-layer PCA-like operations: the approximation loss (due to dimension reduction), the sign confusion loss, and the rectification loss (due to nonlinear activation). If we do not use nonlinear activation, then there will be a sign confusion loss [26].
The approximation loss is unavoidable. Otherwise, the output feature dimension at each hop will be the same as the input feature dimension. This is too expensive for storage and computation. Feature dimension reduction via PCA is essential in our system.
It is important to address the remaining two types of losses. A straightforward implementation of the multi-layer PCA-like operation will suffer from the sign confusion loss [26]. The Saab transform (or channel-wise Saab transform) can avoid both the sign confusion loss and the rectification loss. For the former, we add a constant bias term to shift all responses to the positive region. For the latter, no rectification is needed, since the bias-adjusted output is always positive.
2). Aggregation & cross-entropy guided feature selection:
The output of the i-th VoxelHop unit has the size of Si × Si × Ki. In order to extract a diverse set of features at the i-th stage, a maxpooling scheme is used to summarize the responses in small non-overlapping regions. The spatial size of the features after this unsupervised aggregation is denoted by Pi × Pi × Qi, where Pi and Qi are hyperparameters that define the compactness, and the ratio of Si to Pi is typically set to 2 or 4.
After the unsupervised aggregation, supervised feature dimension reduction is applied. For each feature with the size of Pi × Pi × 1, we flatten it to a vector. Following the cross-entropy guided feature selection scheme [29], the cross-entropy of each feature is given by
H = − Σ_{j=1}^{J} Σ_{m=1}^{M} lj,m log(pj,m),   (3)
where M is the number of classes (in this work, we set M = 2), lj,m is a binary scalar indicating whether sample j ∈ {1, 2, ⋯, J} is classified correctly, and pj,m is the prediction probability of sample j for class m. A lower cross-entropy indicates better discriminability of the feature. The features are ordered based on their corresponding cross-entropy. Then, the top Ni features of each channel with the least cross-entropy are selected for the subsequent LAG unit. The extracted features of each channel at each stage, which have the size of Pi × Pi × Qi, can thus be compacted to the size of Pi × Pi × Ni, where Ni can be much smaller than Qi while achieving a similar performance. The cross-entropy guided feature selection is helpful to simplify the model complexity of the subsequent LAG unit [29].
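The sketch below shows one way to realize this selection step: each feature dimension is scored with the cross-entropy of Eq. (3), using a simple one-dimensional logistic model to obtain the class probabilities, and only the lowest-scoring (most discriminative) dimensions are kept. The probability model and the keep ratio are our own illustrative choices rather than a verbatim reimplementation of [29].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def select_features(X, y, keep_ratio=0.6):
    """Rank feature dimensions by the cross-entropy of Eq. (3) and keep the best.

    X : (J, Q) flattened features of one channel at one stage; y : (J,) labels.
    Lower cross-entropy means better discriminability.
    """
    scores = []
    for q in range(X.shape[1]):
        clf = LogisticRegression().fit(X[:, [q]], y)      # 1-D probability model
        p = clf.predict_proba(X[:, [q]])                   # p_{j,m} in Eq. (3)
        scores.append(log_loss(y, p))                      # per-feature cross-entropy
    order = np.argsort(scores)                             # ascending cross-entropy
    keep = order[: int(keep_ratio * X.shape[1])]
    return X[:, keep], keep

# Toy example: 46 samples with 100-dimensional stage features and binary labels.
X = np.random.randn(46, 100)
y = np.random.randint(0, 2, size=46)
X_selected, kept_idx = select_features(X, y)
print(X_selected.shape)    # (46, 60): the least discriminative 40% are dropped
```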
3). LAG for feature selection:
The supervised label-assisted regression (LAG) unit is motivated by two objectives: (1) unifying the size of each stage's features; and (2) utilizing the labels for supervised dimension reduction. First, each VoxelHop unit outputs features of the neighborhood unions via successive neighborhood expansion and subspace approximation, and these features have different sizes across units. We then concatenate all of the features to integrate the local-to-global information across multiple VoxelHop units; however, the dimension of the final feature vector can be too high. Second, CNNs learn their projections with the help of labels via backpropagation; we would similarly like to utilize the data labels in SSL for supervised dimension reduction, such that features extracted from the same class are distributed in a smaller subspace of the high-dimensional feature space.
After the cross-entropy guided feature selection, the i-th VoxelHop stage yields features of each channel with the size of Pi × Pi × Ni, which are further flattened to a vector of length Pi × Pi × Ni, where Ni denotes the number of selected features. Then, we explore the distribution of these flattened feature vectors, according to their class labels, following three steps:
Constructing the class-oriented subspaces by clustering the samples from the same class, and computing the center of each subspace.
Defining the soft association of each sample and its corresponding center to convert the one-hot output into a probability vector.
Solving the linear least-squared regression (LSR) with the probability vectors.
In the first step, k-means is simply adopted for unsupervised clustering; it is applied within each class independently to group each class into L clusters, i.e., the cluster number of the k-means is set to L. Suppose that there are M classes, denoted by m = 1, 2, ⋯, M; the concatenated feature vector of the three channels then has the dimension n = Pi × Pi × Ni × 3. In the second step, the flattened feature vector of the j-th sample is denoted as xj, and the centers of the L clusters of class m are denoted as cm,l, l = 1, 2, ⋯, L. The label-assisted regressor then utilizes the regression matrix calculated in the third step to relate the features to these class-oriented subspaces. The probability of a sample xj belonging to the center cm,l can be formulated as
Prob(xj, cm,l) = exp(−α d(xj, cm,l)) / Σ_{l′=1}^{L} exp(−α d(xj, cm,l′)),   (4)
where d(xj, cm,l) is the distance measure between xj and cm,l; we simply adopt the Euclidean distance for d(·). The parameter α balances the Euclidean distance and the likelihood of a sample belonging to a cluster: with a larger α, the probability decays faster as the distance increases, and the smaller d(xj, cm,l) is, the larger the likelihood. Then, the probability of a sample xj belonging to the subspace spanned by the L centers of each class m is given by
p(xj) = [0^T, ⋯, 0^T, pm(xj)^T, 0^T, ⋯, 0^T]^T,   (5)
where 0 is the zero vector of dimension L, and
pm(xj) = [Prob(xj, cm,1), Prob(xj, cm,2), ⋯, Prob(xj, cm,L)]^T.   (6)
Finally, a set of linear LSR equations can be formulated to relate the input feature vector and the output probability vector as
W xj + β ≈ p(xj),   j = 1, 2, ⋯, J.   (7)
There are M′ = M × L centers for all of the classes. Here, W is the M′ × n regression matrix, and β = [β1, β2, ⋯, βM′]^T collects the bias terms. pm(xj) is the L-dimensional probability vector in Eq. (6), which is the likelihood of xj belonging to the subspace spanned by the L centers of class m. Since xj belongs to only one class, it has zero probability w.r.t. the other M − 1 classes.
We concatenate the M′-dimensional features from all of the VoxelHop units to construct the final representation for the classifier in Module 3. In our implementation, we adopt the linear LSR. The detailed structure of the five cascaded three-channel VoxelHop units is provided in Table II.
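Putting Eqs. (4)-(7) together, the LAG unit can be sketched as follows with NumPy and scikit-learn: per-class k-means yields the M × L centers, each training sample is converted into a soft probability vector over those centers, and a linear least-squares regression maps the input features to that vector. This is an illustrative rendering under our own variable names, with α and L set to the values reported in Section III-B.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_lag(X, y, num_classes=2, L=3, alpha=10.0):
    """Fit a label-assisted regression (LAG) unit.

    X : (J, n) flattened stage features; y : (J,) integer class labels.
    Returns the (n + 1, M * L) regression matrix (last row holds the biases).
    """
    centers = []
    for m in range(num_classes):                               # step 1: per-class k-means
        km = KMeans(n_clusters=L, n_init=10).fit(X[y == m])
        centers.append(km.cluster_centers_)
    centers = np.concatenate(centers, axis=0)                  # (M * L, n)

    # Step 2: target probability vectors, Eqs. (4)-(6); zeros for the other classes.
    P = np.zeros((X.shape[0], num_classes * L))
    for j, (x, m) in enumerate(zip(X, y)):
        d = np.linalg.norm(centers[m * L:(m + 1) * L] - x, axis=1)
        w = np.exp(-alpha * (d - d.min()))                     # stable softmax of -alpha * d
        P[j, m * L:(m + 1) * L] = w / w.sum()

    # Step 3: linear least-squares regression with bias terms, Eq. (7).
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X1, P, rcond=None)
    return W, centers

def apply_lag(X, W):
    """Map features to the M'-dimensional LAG output."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

# Toy usage with 46 samples and 300-dimensional aggregated stage features.
X = np.random.randn(46, 300)
y = np.array([0] * 20 + [1] * 26)
W, centers = fit_lag(X, y)
print(apply_lag(X, W).shape)    # (46, 6): M' = 2 classes x 3 clusters per class
```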
III. Experiments
In this section, we compare the classification performance of our VoxelHop against 3D CNN-based classification approaches. We also provide a systematic ablation study and sensitivity analysis to demonstrate the effectiveness of the design choices of our framework.
A. Implementation Details
All the experiments were implemented using Python on a server with a Xeon E5 v4 CPU and an Nvidia Tesla V100 GPU with 128GB memory. We used the widely adopted deep learning library PyTorch to implement the 3D CNN approaches. For a fair comparison, we downsampled the deformation fields to the size of 110×110×30×3, which was consistent with the input size used for the 3D CNN approaches.
B. Framework Details
The architecture of our five-stage multi-channel VoxelHop for this input is detailed in Table II. With five blocks of VoxelHop and maxpooling, the horizontal dimension was reduced to 3 × 3. By pooling the remaining 3 × 3 × (3 × F5) output to 1 × 1 × (1 × F5) in the aggregation at the fifth stage, the output had the size of 1×1×(1×F5)×3. As a result, we were able to configure at most six VoxelHop stages for this input, and dropped the aggregation at the sixth stage to maintain the horizontal size of 1 × 1.
To investigate the proper choice of Fi at each stage, we examined the relationship between the number of Saab AC filters and the energy preservation ratio, as shown in Fig. 6. We can see that the leading AC filters account for a large amount of energy, while the energy drops as the index increases. In addition, we plot five energy thresholds, where the orange, green, blue, red, and purple dots represent cumulative energy ratios of 95%, 96%, 97%, 98%, and 99%, respectively. This indicates that different energy ratios can be selected to balance classification performance and complexity. In this work, we chose the number of Saab AC filters in the unsupervised dimension reduction procedure so as to preserve 98% of the total energy.
Fig. 6.
The log-energy plot as a function of the number of AC filters. We plot five energy thresholds using dots of different colors: 95% (orange), 96% (green), 97% (blue), 98% (red), and 99% (purple).
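In code, the per-stage choice of Fi can be read off the cumulative PCA energy, as in the short sketch below; the 98% threshold matches the setting used in this work, while the input here is a random placeholder for the AC components of one VoxelHop unit.

```python
import numpy as np

def num_ac_filters(ac_part, energy_ratio=0.98):
    """Smallest number of AC filters whose cumulative eigenvalue energy
    reaches the desired ratio, mirroring the thresholds plotted in Fig. 6."""
    eigvals = np.linalg.eigvalsh(np.cov(ac_part, rowvar=False))[::-1]  # descending energies
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, energy_ratio) + 1)

# Toy AC components (rows: flattened neighborhood unions of one unit).
ac_part = np.random.randn(5000, 27)
for ratio in (0.95, 0.96, 0.97, 0.98, 0.99):
    print(ratio, num_ac_filters(ac_part, ratio))
```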
We compared our VoxelHop with a series of 3D CNN approaches. VGG [6] and ResNet [7] are popular networks in 2D computer vision that have been adapted to single-channel 3D medical data [32] and serve as strong backbones for many applications [4]. In order to adapt them to multi-channel 3D deformation fields, we followed [31] to configure independent convolutional layers for each channel, and then concatenated the extracted features in the first fully connected layer. The detailed structure of the 3D VGG [32], [31] is shown in Fig. 5 and Table III. We used a threshold of 0.5 for the final prediction of the 3D CNN approaches.
Fig. 5.
Illustration of the three-channel 3D VGG network based on the 3D VGG backbone [32] with the separate convolution for multi-channel processing [30], [31].
TABLE III.
The detailed structure of 3D VGG
Input Size | Type | Filter Shape |
---|---|---|
[110 × 110 × (30 × 1)] × 3 | M-Conv | [8 kernels of 3 × 3 × 3]×3 |
[108 × 108 × (28 × 8)] × 3 | M-Conv | [8 kernels of 3 × 3 × 3]×3 |
[106 × 106 × (26 × 8)] × 3 | MaxPool | (2×2×1)-(1×1×1) |
[53 × 53 × (26 × 8)] × 3 | M-Conv | [16 kernels of 3 × 3 × 3]×3 |
[51 × 51 × (24 × 16)] × 3 | M-Conv | [16 kernels of 3 × 3 × 3]×3 |
[49 × 49 × (22 × 16)] × 3 | MaxPool | (2×2×1)-(1×1×1) |
[24 × 24 × (22 × 16)] × 3 | M-Conv | [32 kernels of 3 × 3 × 3]×3 |
[22 × 22 × (20 × 32)] × 3 | M-Conv | [32 kernels of 3 × 3 × 3]×3 |
[20 × 20 × (18 × 32)] × 3 | M-Conv | [32 kernels of 3 × 3 × 3]×3 |
[18 × 18 × (16 × 32)] × 3 | MaxPool | (2×2×2)-(1×1×1) |
[9 × 9 × (8 × 32)] × 3 | M-Conv | [64 kernels of 3 × 3 × 3]×3 |
[7 × 7 × (6 × 64)] × 3 | M-Conv | [64 kernels of 3 × 3 × 3]×3 |
[5 × 5 × (4 × 64)] × 3 | M-Conv | [64 kernels of 3 × 3 × 3]×3 |
[3 × 3 × (2 × 64)] × 3 | MaxPool | (2×2×2)-(1×1×1) |
[1 × 1 × (1 × 64)] × 3 | Flatten | N/A |
192 | FC | 128-dim |
128 | FC | 64-dim |
64 | sigmoid | 1-dim |
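To make the multi-channel baseline concrete, below is a condensed PyTorch sketch of the three-channel 3D VGG: each deformation-field channel passes through its own stack of valid 3D convolutions, and the per-channel features are concatenated before the fully connected layers. Layer widths follow Table III, but the pooling schedule is simplified (in-plane pooling plus a final adaptive pooling), so the intermediate shapes differ slightly from the exact values listed above.

```python
import torch
import torch.nn as nn

class SingleChannelStream(nn.Module):
    """One 3D VGG-style stream for a single deformation-field channel.
    Widths (8-8, 16-16, 32-32-32, 64-64-64) follow Table III; pooling is
    simplified relative to the table, as noted in the lead-in."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, n):
            layers = []
            for i in range(n):
                layers += [nn.Conv3d(cin if i == 0 else cout, cout, 3),
                           nn.ReLU(inplace=True)]
            return layers + [nn.MaxPool3d((1, 2, 2))]     # pool the horizontal plane only
        self.features = nn.Sequential(
            *block(1, 8, 2), *block(8, 16, 2),
            *block(16, 32, 3), *block(32, 64, 3),
            nn.AdaptiveAvgPool3d(1),                      # -> (64, 1, 1, 1) per stream
        )

    def forward(self, x):
        return self.features(x).flatten(1)                # (batch, 64)

class MultiChannel3DVGG(nn.Module):
    """Three independent streams fused in the first FC layer (cf. Fig. 5)."""
    def __init__(self):
        super().__init__()
        self.streams = nn.ModuleList([SingleChannelStream() for _ in range(3)])
        self.classifier = nn.Sequential(
            nn.Linear(192, 128), nn.ReLU(inplace=True),   # 3 x 64 = 192, as in Table III
            nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):                                 # x: (batch, 3, D, H, W)
        feats = [s(x[:, c:c + 1]) for c, s in enumerate(self.streams)]
        return self.classifier(torch.cat(feats, dim=1))

model = MultiChannel3DVGG()
print(model(torch.randn(2, 3, 30, 110, 110)).shape)      # (2, 1) disease probability
```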
To aggregate features spatially in Module 2, we applied the pooling of (4 × 4 × 4Fi)-to-(1 × 1 × Fi) or (2 × 2 × 2Fi)-to-(1 × 1 × Fi) at different VoxelHop units to reduce the spatial dimension of the feature vectors.
Moreover, we empirically set α=10 and L=3 in the LAG unit of Module 2. The performance was stable over a relatively large range of α ∈ [5, 20] and L ∈ [2, 5].
C. Experimental Results
For quantitative analysis, we carried out leave-one-out cross-validation, where we used the same hyperparameters for all folds. We simply used the standard Randomized Search Cross-Validation for hyperparameter selection within our cross-validation. Briefly, the accuracy was calculated by running each learning method 46 times, each time leaving out one of the 46 subjects for testing and training on the remaining subjects. The final results were computed by averaging all of the 46 folds. We tested both our framework and the comparison methods five times, and the standard deviation was reported.
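A compact sketch of this leave-one-out protocol with scikit-learn is shown below; the feature matrix is a random placeholder standing in for the concatenated VoxelHop features, and the SVM is merely a stand-in classifier used to illustrate how the 46 folds, accuracy, and AUC are computed.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Placeholder data: 46 subjects (20 controls, 26 patients) with final features.
X = np.random.randn(46, 60)
y = np.array([0] * 20 + [1] * 26)

scores, truths = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Train on the 45 retained subjects and score the single held-out subject.
    clf = SVC(probability=True).fit(X[train_idx], y[train_idx])
    scores.append(clf.predict_proba(X[test_idx])[0, 1])
    truths.append(y[test_idx][0])

accuracy = np.mean((np.array(scores) > 0.5) == np.array(truths))
auc = roc_auc_score(truths, scores)
print(f"LOOCV accuracy {accuracy:.4f}, AUC {auc:.4f}")
```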
Without backpropagation, training one fold of the multi-channel VoxelHop was completed within 20 minutes on a single CPU, while training the 3D CNNs (batch size of 2) took about two to three hours to converge on a V100 GPU. The PCA-like processing in our VoxelHop can be computationally costly, but our framework requires only a single forward pass and no iterative training. In contrast, each iteration of the CNNs was fast, but training usually took 200 epochs (i.e., about 9,000 forward passes and 4,500 backpropagation updates for a batch size of 2).
The accuracy and the area under the curve (AUC) of our proposed M-VoxelHop are given in Table IV. We use the prefix “M-” to denote the multi-channel version of VoxelHop or the 3D CNNs. The receiver operating characteristic (ROC) curves of VoxelHop, 3D VGG, and 3D ResNet are given in Fig. 7. The proposed five-stage M-VoxelHop with a 98% energy ratio achieved superior accuracy and AUC over the compared 3D CNNs in our ALS classification task.
TABLE IV.
Comparison of the classification performance
Fig. 7.
Comparison of the receiver operating characteristic curve between VoxelHop and the multi-channel 3D CNNs, including 3D VGG and 3D ResNet [32].
We also analyzed the performance with different input sizes and numbers of stages, as shown in Table VII. By applying the maxpooling operation to input data with the size of 110×110×30×3 at each stage, we were able to set up one to six stages. Fig. 8 shows an ablation study of configuring different numbers of stages. We can see that cascading multiple SSL operations effectively improves classification accuracy. Since the spatial size of the extracted features was small at the late stages, e.g., 3×3×(3×F5)×3 at the fifth stage, the additional sixth stage did not substantially contribute to the performance. The corresponding AUC scores are also given in Table VII. Of note, the use of five or six stages for the 110×110×30×3 input yielded similar results, while the sixth stage added an additional cost.
Fig. 8.
Sensitivity study with respect to the number of VoxelHop units.
We also used the downsampled x for a fair comparison, even though our VoxelHop is flexible with respect to a larger input size, and without the downsampling, more information is contained in the input sample. With the six-stage VoxelHop framework, we achieved a state-of-the-art AUC score of 0.9427 and a classification accuracy of 93.48%. We note that we also defined Fi by keeping 98% of the energy. To demonstrate the effectiveness of the cropping operation, we also inputted the original, uncropped sample to the five-, six-, and seven-stage VoxelHop. Of note, with cropping we only considered the bulbar region, including the brain and tongue; by doing so, we were able to suppress signals from irrelevant regions in the input sample.
In Table V, we compared the classification performance with different inputs, including the original 3D MRI volumes. We found that deformation fields (DF) with SyN outperformed other input data, partly because DFs, as in VBM, provide volume expansions and compressions at each voxel in an objective manner. In other words, the size or volume difference of the tongue and brain relative to an atlas was the most important cue for our classification task, as shown in Table V. In addition, the combination of 3D MRI volumes with DFs was likely to introduce redundant information, thus leading to worse performance. In addition to the registration with SyN [28], we also evaluated our framework with the registration using diffeomorphic demons (DD) [37]. The performance of our VoxelHop with DD was on par with that with SyN; therefore, our proposed framework was robust against different registration approaches. With the segmentation masks, we were able to remove the unnecessary parts and alleviate the difficulty of information extraction. For both our VoxelHop and 3D ResNet, there was a 1–2% improvement in accuracy from using the segmentation masks. Of note, accurate segmentation of the brain and tongue was not necessary; instead, we only cropped an ROI that included the brain and tongue region.
TABLE V.
Classification performance with different inputs
Methods | Accuracy | AUC |
---|---|---|
M-VoxelHop (DF with SyN) | 93.48±0.7% | 0.9394±0.012 |
M-3D ResNet (DF with SyN) | 91.30±0.6% | 0.9048±0.010 |
M-VoxelHop (volume) | 76.85±0.9% | 0.7537±0.008 |
M-3D ResNet (volume) | 75.11±0.7% | 0.7239±0.011 |
M-VoxelHop (volume+DF) | 86.18±0.8% | 0.8425±0.010 |
M-3D ResNet (volume+DF) | 84.54±1.2% | 0.8357±0.011 |
M-VoxelHop (DF with DD) | 93.06±1.0% | 0.9273±0.009 |
M-3D ResNet (DF with DD) | 91.14±0.9% | 0.8985±0.011 |
M-VoxelHop (w/o seg) | 91.25±1.1% | 0.9081±0.012 |
M-3D ResNet (w/o seg) | 90.40±1.0% | 0.8857±0.009 |
In Table VI, we compared our VoxelHop with classic subspace learning methods, e.g., Eigenface [35] and PCANet [36], which use one or multiple stages of PCA for classification. In principle, we could use the one-stage PCA coefficients as the feature and keep the rest of the pipeline the same as our VoxelHop. However, it is extremely difficult to apply one-stage PCA directly, because the input dimension is 110×110×30 = 363,000, which makes the covariance matrix of size 363,000 × 363,000 and thus impractical to compute. An alternative is to replace the Saab transform at each stage with the standard PCA, while keeping the rest the same. VoxelHop with the Saab transform outperformed its multi-stage PCA counterpart, since the latter can suffer from the sign confusion problem.
TABLE VI.
Classification performance with conventional PCA-based approaches.
The number of AC filters was defined by the energy ratio at each stage, thus affecting both the performance and the complexity. Table VIII shows the classification performance w.r.t. the energy ratio set in our VoxelHop units. The threshold of 99% achieved the best performance, but it increased the number of AC filters, as shown in Fig. 6. Therefore, the threshold of 98% was a good trade-off for our ALS disease classification.
TABLE VII.
Sensitivity study of input size
Input size | Stages | AUC |
---|---|---|
(110 × 110 × 30 × 3) | 5 | 0.9394±0.012 |
(110 × 110 × 30 × 3) | 6 | 0.9402±0.013 |
(330 × 220 × 30 × 3) | 5 | 0.9387±0.011 |
(330 × 220 × 30 × 3) | 6 | 0.9427±0.014 |
(704 × 704 × 50 × 3) | 5 | 0.9021±0.013 |
(704 × 704 × 50 × 3) | 6 | 0.9208±0.010 |
(704 × 704 × 50 × 3) | 7 | 0.9332±0.011 |
The cross-entropy-guided feature selection was developed to simplify the LAG module. We defined the number of selected features Ni as a proportion of the total. Fig. 9 shows the AUC score when keeping different proportions of features. We can see that the top 30% of features contributed most of the performance, and the AUC score was stable when the top 50% of features were kept. We used the AUC metric, since the accuracy was not sensitive given the relatively small number of datasets used in this work. The accuracy saturated at 40% feature selection. Therefore, we simply dropped the last 40% of features in all of our experiments, which largely reduced the number of features to be processed in the subsequent LAG unit while maintaining the performance.
Fig. 9.
Sensitivity analysis of the cross-entropy-guided feature selection.
In order to demonstrate the robustness of our VoxelHop with fewer training samples, we further randomly removed 5, 10, 15, and 20 training samples in each leave-one-out evaluation fold. Fig. 10 shows the AUC of our VoxelHop and 3D ResNet when using fewer training data. We note that we removed control and patient subjects alternately to keep the two categories balanced. As demonstrated in Fig. 10, the performance drop of 3D ResNet was more pronounced than that of our VoxelHop framework when more training data were removed.
Fig. 10.
Sensitivity analysis of using fewer training data. The vanilla training set involves 45 subjects in our leave-one-out evaluation.
IV. Discussion
A. Summary of Results
In this work, we presented a lightweight and transparent SSL framework, and applied it to a small number of 3D medical imaging datasets for classifying between ALS patients and healthy controls. To the best of our knowledge, this is the first attempt at analyzing both the brain and tongue to differentiate controls from patients using T2-weighted structural MRI [38]. ALS is a relentlessly progressive neurodegenerative disease [39], and MRI has been widely used to study ALS [40] to date. In full-blown ALS, the diagnosis is generally clear-cut and can usually be made clinically. Early diagnosis and disease progression monitoring, however, remain challenging due to the lack of a fully validated biomarker, and clinical trials are therefore reliant on tools such as the ALSFRS-R questionnaire, which have limitations. Combined with the considerable heterogeneity in disease progression rate, this means that clinical trials are currently prolonged and expensive. Thus, developing an objective biomarker and decision-making system to rapidly make go/no-go decisions is crucial. Prior research showed that ALS patients exhibit atrophy of gray matter in frontotemporal regions [14] and atrophy of the internal muscles of the tongue [15]. In this work, therefore, we used the brain and tongue regions simultaneously for our analysis, through deformation fields obtained via registration between a head and neck atlas and all of the subjects. Our framework achieved an accuracy of 93.48% and an AUC score of 0.9394, which was better than the state-of-the-art 3D CNN classification approaches, including 3D VGG, ResNet, and DenseNet.
B. Comparison of VoxelHop and 3D CNNs
In this subsection, we provide a thorough comparison between VoxelHop and 3D CNNs. CNNs are, in general, well-suited to analyzing 3D input data [4] by extending their 2D counterparts, although many CNNs were first developed for 2D input data. For example, the performance of 3D versions of VGG [6] and ResNet [7] has been demonstrated in many applications [32]. Targeting multi-channel 3D input data, parallel convolutional layers were applied to each channel independently [30], [31]. The three-channel version of 3D VGG is detailed in Table III, and the corresponding architecture is provided in Fig. 5. Both VoxelHop and 3D CNNs construct successively growing neighborhoods and use spatial pooling to reduce the redundancy of overlapping neighborhoods.
Although VoxelHop and 3D CNNs share a similar high-level concept, they differ in their model construction, training procedures, and training complexity. We list the differences between VoxelHop (SSL) and CNNs in Table X and elaborate on the details below.
TABLE X.
Comparison of VoxelHop and 3D CNNs
 | VoxelHop (SSL) | 3D CNNs |
---|---|---|
Mathematical interpretability | Easy | Difficult |
Weak supervision | Easy | Difficult |
Training/testing complexity | Low | High |
Model parameter search | Feedforward design | Backpropagation |
Model expandability | Non-parametric model | Parametric model |
• Mathematical interpretability
Although the effectiveness of CNNs has been demonstrated through numerous applications, there are several properties that are not well understood [42]. Many CNNs are considered a “black-box,” partly because parameters are determined with backpropagation in an iterative manner [27]. By contrast, the parameters of our VoxelHop are computed, following a feedforward fashion without any backpropagation.
In this work, instead of “interpreting” or “explaining” the “black-box” CNNs with visualization techniques [42], we constructed a system with mathematically interpretable modules. The multi-stage SSL was proposed to interpret the benefits of the multi-layer architecture in CNNs from a forward design perspective [9], [26]. Specifically, Kuo et al. [26] proposed to use multiple Saab transforms and linear LSR to mimic the convolutional and fully connected layers, respectively. Since the parameters from our framework were determined by a kernel-based Saab transform, i.e., a variant of PCA, the parameters are deemed mathematically transparent and interpretable, compared with the parameters in CNNs, which are determined via backpropagation. In addition, VoxelHop made use of Saab transforms to find filter coefficients based on PCA and a bias term to resolve the sign confusion problem of the nonlinear activation unit used in CNNs [9], [26]. The mathematical interpretability in VoxelHop can be an attractive property for clinical applications, since VoxelHop offers a better understanding of how the parameters are determined, and how the obtained parameters are used for the final decision-making process.
• Weak supervision with limited training data
Recent deep representation learning approaches are typically data starved, relying on large amounts of labeled training data for supervised learning via backpropagation [1]. 3D CNNs usually need massive labeled datasets for their training, and data augmentation is usually also required to generate additional samples. This constraint can be largely alleviated in VoxelHop, due to its unsupervised dimension reduction process. The class label is only utilized in the cross-entropy guided feature selection and the LAG units, and the classifier is based on a straightforward linear LSR model. This property is particularly beneficial for clinical applications, because collecting a large number of 3D medical imaging datasets is challenging [3]. We note that PixelHop demonstrated the effectiveness of SSL on the 2D MNIST database, which contains 60,000 training examples and 10,000 testing examples.
• Training and testing complexity
Deep learning usually requires extensive computing resources for model fitting at the training stage, due to backpropagation. The computing cost for 3D data can be even more prohibitive than for 2D data [4], since the input sample itself and the corresponding network parameters can be much larger. The training of our SSL-based VoxelHop is considerably simpler than that of 3D CNNs, as VoxelHop is based on a one-pass feedforward structure. In this work, the number of parameters used in our multi-channel VoxelHop was approximately 10× smaller than that of the compared 3D CNNs. Our SSL-based VoxelHop, trained in 20 minutes on a CPU, outperformed the 3D CNNs trained for about three hours on a GPU. The training and testing complexity of VoxelHop can also be balanced through the number of stages, the energy ratio, and the cross-entropy guided feature selection.
• Model expandability
3D CNNs are based on a parametric learning framework, which is usually data starved. For 3D CNNs, many more model parameters than training samples are typically required, resulting in an over-parameterized network [1]. Moreover, it is challenging to adjust the network structure to fit different datasets. In contrast, our SSL-based VoxelHop is based on a non-parametric framework, which allows us to flexibly adjust the number of AC filters at each unit, by considering the scale of the datasets, the task complexity, and hardware constraints, with a performance trade-off [9], [26]. Specifically, in this work, we simply set the energy threshold between 95% and 99%, and used the cross-entropy-guided feature selection to balance performance and efficiency.
V. Conclusion and Future Direction
In this work, we presented a lightweight and mathematically interpretable SSL framework using multi-channel 3D data for the task of ALS disease classification from T2-weighted MRI. Extensive experiments carried out with a total of 20 controls and 26 patients demonstrated that our framework achieved superior accuracy and AUC with 10× fewer parameters and much less training time, compared with the state-of-the-art 3D CNNs. Our framework thus opens new vistas to develop a clinical decision-making system, which is mathematically transparent and lightweight, with a small number of subjects.
There are several aspects that are not fully explored in the present work. First, we will extend our framework to deal with longitudinal 3D data with multiple time points, and develop a predictive model that can be used for fine-grained characterization and classification. Second, to date, segmentation of anatomical structures, such as the brain and tongue [43], [44], [45], has played an important role in characterizing anatomical structures and their variations. In this work, we manually segmented the brain and tongue regions to localize the deformation fields. In our future work, we will investigate a fully automated framework to jointly carry out SSL-based 3D segmentation in conjunction with the classification task. Finally, although we tackled the challenging ALS disease classification task in this work, we will apply our framework to a host of other neurological disorders with different imaging modalities.
TABLE VIII.
Analysis of the energy ratio used in our M-VoxelHop
Energy Ratio | Accuracy | AUC |
---|---|---|
95% | 86.95±0.7% | 0.8746±0.010 |
96% | 89.13±0.8% | 0.9023±0.014 |
97% | 91.30±0.6% | 0.9155±0.012 |
98%* | 93.48±0.7% | 0.9394±0.012 |
99% | 93.48±0.6% | 0.9405±0.013 |
TABLE IX.
Comparison of the model complexity w.r.t. the number of parameters
Acknowledgment
This work was partially supported by NIH R01DC018511 and P41EB022544.
References
- [1] Goodfellow I, Bengio Y, Courville A, and Bengio Y, Deep Learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.
- [2] Ravì D, Wong C, Deligianni F, Berthelot M, Andreu-Perez J, Lo B, and Yang G-Z, "Deep learning for health informatics," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 1, pp. 4–21, 2016.
- [3] Shen D, Wu G, and Suk H-I, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
- [4] Singh SP, Wang L, Gupta S, Goli H, Padmanabhan P, and Gulyás B, "3D deep learning on medical images: A review," arXiv preprint arXiv:2004.00218, 2020.
- [5] Krizhevsky A, Sutskever I, and Hinton GE, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
- [6] Simonyan K and Zisserman A, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
- [7] He K, Zhang X, Ren S, and Sun J, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
- [8] Peng H, Gong W, Beckmann CF, Vedaldi A, and Smith SM, "Accurate brain age prediction with lightweight deep neural networks," bioRxiv, 2019.
- [9] Kuo C-CJ, "Understanding convolutional neural networks with a mathematical model," Journal of Visual Communication and Image Representation, vol. 41, pp. 406–413, 2016.
- [10] Jenkins TM, Alix JJ, Fingret J, Esmail T, Hoggard N, Baster K, McDermott CJ, Wilkinson ID, and Shaw PJ, "Longitudinal multi-modal muscle-based biomarker assessment in motor neuron disease," Journal of Neurology, vol. 267, no. 1, pp. 244–256, 2020.
- [11] Turner MR, Grosskreutz J, Kassubek J, Abrahams S, Agosta F, Benatar M, Filippi M, Goldstein LH, van den Heuvel M, Kalra S, et al., "Towards a neuroimaging biomarker for amyotrophic lateral sclerosis," The Lancet Neurology, vol. 10, no. 5, pp. 400–403, 2011.
- [12] Babu S, "Upper motor neuron burden measurement in motor neuron diseases: Does one scale fit all?" Muscle & Nerve, vol. 61, no. 4, pp. 431–432, 2020.
- [13] Abrahams S, Goldstein L, Simmons A, Brammer M, Williams S, Giampietro V, and Leigh P, "Word retrieval in amyotrophic lateral sclerosis: a functional magnetic resonance imaging study," Brain, vol. 127, no. 7, pp. 1507–1517, 2004.
- [14] Chang J, Lomen-Hoerth C, Murphy J, Henry R, Kramer J, Miller B, and Gorno-Tempini M, "A voxel-based morphometry study of patterns of brain atrophy in ALS and ALS/FTLD," Neurology, vol. 65, no. 1, pp. 75–80, 2005.
- [15] Lee E, Xing F, Ahn S, Reese TG, Wang R, Green JR, Atassi N, Wedeen VJ, El Fakhri G, and Woo J, "Magnetic resonance imaging based anatomical assessment of tongue impairment due to amyotrophic lateral sclerosis: a preliminary study," The Journal of the Acoustical Society of America, vol. 143, no. 4, pp. EL248–EL254, 2018.
- [16] Ashburner J and Friston KJ, "Voxel-based morphometry—the methods," NeuroImage, vol. 11, no. 6, pp. 805–821, 2000.
- [17] Baron J, Chetelat G, Desgranges B, Perchey G, Landeau B, De La Sayette V, and Eustache F, "In vivo mapping of gray matter loss with voxel-based morphometry in mild Alzheimer's disease," NeuroImage, vol. 14, no. 2, pp. 298–309, 2001.
- [18] Särkämö T, Ripollés P, Vepsäläinen H, Autti T, Silvennoinen HM, Salli E, Laitinen S, Forsblom A, Soinila S, and Rodríguez-Fornells A, "Structural changes induced by daily music listening in the recovering brain after middle cerebral artery stroke: a voxel-based morphometry study," Frontiers in Human Neuroscience, vol. 8, p. 245, 2014.
- [19] Gale SD, Baxter L, Roundy N, and Johnson S, "Traumatic brain injury and grey matter concentration: a preliminary voxel based morphometry study," Journal of Neurology, Neurosurgery & Psychiatry, vol. 76, no. 7, pp. 984–988, 2005.
- [20] Bergouignan L, Chupin M, Czechowska Y, Kinkingnéhun S, Lemogne C, Le Bastard G, Lepage M, Garnero L, Colliot O, and Fossati P, "Can voxel based morphometry, manual segmentation and automated segmentation equally detect hippocampal volume differences in acute depression?" NeuroImage, vol. 45, no. 1, pp. 29–37, 2009.
- [21] Cha CH and Patten BM, "Amyotrophic lateral sclerosis: abnormalities of the tongue on magnetic resonance imaging," Annals of Neurology, vol. 25, no. 5, pp. 468–472, 1989.
- [22] Xing F, Prince JL, Stone M, Reese TG, Atassi N, Wedeen VJ, El Fakhri G, and Woo J, "Strain map of the tongue in normal and ALS speech patterns from tagged and diffusion MRI," in Medical Imaging 2018: Image Processing, vol. 10574. International Society for Optics and Photonics, 2018, p. 1057411.
- [23] Woo J, Xing F, Prince JL, Stone M, Green JR, Goldsmith T, Reese TG, Wedeen VJ, and El Fakhri G, "Differentiating post-cancer from healthy tongue muscle coordination patterns during speech using deep learning," The Journal of the Acoustical Society of America, vol. 145, no. 5, pp. EL423–EL429, 2019.
- [24] Chen Y and Kuo C-CJ, "PixelHop: A successive subspace learning (SSL) method for object recognition," Journal of Visual Communication and Image Representation, p. 102749, 2020.
- [25] Zhang M, You H, Kadam P, Liu S, and Kuo C-CJ, "PointHop: An explainable machine learning method for point cloud classification," IEEE Transactions on Multimedia, 2020.
- [26] Kuo C-CJ, Zhang M, Li S, Duan J, and Chen Y, "Interpretable convolutional neural networks via feedforward design," Journal of Visual Communication and Image Representation, vol. 60, pp. 346–359, 2019.
- [27] Fan F, Xiong J, and Wang G, "On interpretability of artificial neural networks," arXiv preprint arXiv:2001.02522, 2020.
- [28] Avants B, Epstein C, Grossman M, and Gee J, "Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain," Medical Image Analysis, vol. 12, no. 1, pp. 26–41, 2008.
- [29] Chen Y, Rouhsedaghat M, You S, Rao R, and Kuo C-CJ, "PixelHop++: A small successive-subspace-learning-based (SSL-based) model for image classification," arXiv preprint arXiv:2002.03141, 2020.
- [30] Ji S, Xu W, Yang M, and Yu K, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
- [31] Nie D, Lu J, Zhang H, Adeli E, Wang J, Yu Z, Liu L, Wang Q, Wu J, and Shen D, "Multi-channel 3D deep feature learning for survival time prediction of brain tumor patients using multi-modal neuroimages," Scientific Reports, vol. 9, no. 1, pp. 1–14, 2019.
- [32] Korolev S, Safiullin A, Belyaev M, and Dodonova Y, "Residual and plain convolutional neural networks for 3D brain MRI classification," in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). IEEE, 2017, pp. 835–838.
- [33] Ruiz J, Mahmud M, Modasshir M, Kaiser MS, for the Alzheimer's Disease Neuroimaging Initiative, et al., "3D DenseNet ensemble in 4-way classification of Alzheimer's disease," in International Conference on Brain Informatics. Springer, 2020, pp. 85–96.
- [34] Polat H and Danaei Mehr H, "Classification of pulmonary CT images by using hybrid 3D-deep convolutional neural network architecture," Applied Sciences, vol. 9, no. 5, p. 940, 2019.
- [35] Turk M and Pentland A, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
- [36] Chan T-H, Jia K, Gao S, Lu J, Zeng Z, and Ma Y, "PCANet: A simple deep learning baseline for image classification?" IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, 2015.
- [37] Vercauteren T, Pennec X, Perchant A, and Ayache N, "Diffeomorphic demons: Efficient non-parametric image registration," NeuroImage, vol. 45, no. 1, pp. S61–S72, 2009.
- [38] Grollemund V, Pradat P-F, Querin G, Delbot F, Le Chat G, Pradat-Peyre J-F, and Bede P, "Machine learning in amyotrophic lateral sclerosis: achievements, pitfalls, and future directions," Frontiers in Neuroscience, vol. 13, p. 135, 2019.
- [39] Foerster BR, Welsh RC, and Feldman EL, "25 years of neuroimaging in amyotrophic lateral sclerosis," Nature Reviews Neurology, vol. 9, no. 9, pp. 513–524, 2013.
- [40] Kassubek J and Pagani M, "Imaging in amyotrophic lateral sclerosis: MRI and PET," Current Opinion in Neurology, vol. 32, no. 5, pp. 740–746, 2019.
- [41] Tan M and Le Q, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
- [42] Zhang Q, Wu YN, and Zhu S-C, "Interpretable convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8827–8836.
- [43] Ibragimov B, Prince JL, Murano EZ, Woo J, Stone M, Likar B, Pernuš F, and Vrtovec T, "Segmentation of tongue muscles from super-resolution magnetic resonance images," Medical Image Analysis, vol. 20, no. 1, pp. 198–207, 2015.
- [44] Lee J, Woo J, Xing F, Murano EZ, Stone M, and Prince JL, "Semi-automatic segmentation of the tongue for 3D motion analysis with dynamic MRI," in 2013 IEEE 10th International Symposium on Biomedical Imaging. IEEE, 2013, pp. 1465–1468.
- [45] Woo J, Lee J, Murano EZ, Xing F, Al-Talib M, Stone M, and Prince JL, "A high-resolution atlas and statistical model of the vocal tract from structural MRI," Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 3, no. 1, pp. 47–60, 2015.