Multiple sclerosis detection via 6-layer stochastic pooling convolutional neural network and multiple-way data augmentation

Background: Multiple sclerosis is one of the most widespread autoimmune neuroinflammatory diseases; it mainly damages body functions such as movement, sensation, and vision. In addition to the conventional clinical presentation, brain magnetic resonance imaging of white matter lesions is often applied to diagnose multiple sclerosis at an early stage. Methods: In this article, we propose a 6-layer stochastic pooling convolutional neural network (CNN) with multiple-way data augmentation for multiple sclerosis detection in brain magnetic resonance imaging. Unlike traditional machine learning methods, our approach does not demand hand-crafted features. Through stochastic pooling and multiple-way data augmentation, our 6-layer CNN achieved performance equivalent to deep learning methods whose many layers and parameters ordinarily make training difficult. We also conducted ablation experiments to examine the contributions of stochastic pooling and multiple-way data augmentation to the original CNN model. Results: The 6-layer CNN obtained a sensitivity of 95.98 ± 0.46%, a specificity of 95.67 ± 0.92%, and an accuracy of 95.82 ± 0.58%. In comparison experiments, these results surpass state-of-the-art approaches. Conclusion: Stochastic pooling and multiple-way data augmentation enhanced the original 6-layer CNN model compared with variants using maximum pooling, average pooling, or inadequate data augmentation.


Introduction
Multiple sclerosis (MS) is an autoimmune disease characterized by demyelinating inflammatory white matter lesions of the central nervous system. MS harms patients' health by impeding nerve-signal transmission between the brain and other parts of the body. It often involves the periventricular white matter, spinal cord, brainstem, cerebellum, and optic nerve. Multiple sclerosis may cause loss of muscle coordination, impaired vision, and loss of other body functions. Its etiology and pathogenesis remain unclear and require further study by medical researchers. Multiple sclerosis can be divided into four categories: (i) relapsing-remitting MS, (ii) secondary-progressive MS, (iii) primary-progressive MS, and (iv) progressive-relapsing MS. Relapsing-remitting MS (R-R) is the most commonly observed in clinic, accounting for around 85% of the total. R-R patients usually endure several relapses, while the condition remains stable during remission periods. Secondary-progressive MS (S-P) develops gradually from R-R. About 80% of R-R patients progress to S-P within twenty-five years, and the condition is not relieved as it is during the R-R period. Unlike R-R and S-P MS, primary-progressive MS (P-P) skips the initial stage: the patient's condition keeps worsening from the first onset of multiple sclerosis. This category of MS accounts for almost 10% of the total. Progressive-relapsing (P-R) MS is rarely seen in clinic. As we can see, R-R and S-P account for the major proportion of multiple sclerosis. If patients receive effective and suitable treatment in the early stage of MS, the chance of converting from R-R to S-P decreases, which means patients suffer fewer relapses and less pain. Therefore, detecting multiple sclerosis as early as possible is of great importance to doctors fighting the disease.
Though researchers have realized the significance of diagnosing multiple sclerosis at an early stage, it is not easy to distinguish MS patients from healthy people accurately. In terms of clinical manifestations, multiple sclerosis is similar to other white matter diseases, including acute disseminated encephalomyelitis (ADEM), acute cerebral infarction (ACI), and neuromyelitis optica (NMO). Under this circumstance, researchers have had to look for other techniques to improve the success rate of MS diagnosis. Magnetic resonance imaging (MRI) is often utilized for the diagnosis of MS because it causes little ionizing radiation damage to the human body, images soft tissue clearly, and can obtain original three-dimensional cross-sectional images without reconstruction. Meanwhile, scientists also realized that computer-aided diagnosis was playing an increasingly important role in the field of medical image analysis. Methods applying computer vision and digital image processing to brain MRI have surpassed humans in diagnosing diseases such as Alzheimer's [1], epilepsy [2], Creutzfeldt-Jakob disease [3], and cerebral glioma [4]. Therefore, applying computer vision and digital image processing to craniocerebral MRI to improve the diagnosis rate of multiple sclerosis has become a focus for researchers. For example, Wang, et al. [5] proposed a method for multiple sclerosis detection based on biorthogonal wavelet transform, RBF kernel principal component analysis, and logistic regression. Nayak, et al. [6] presented an approach using discrete wavelet transform and AdaBoost with random forests. Recently, Zhang, et al. [7] applied dropout and parametric ReLU in building a convolutional neural network (CNN) for MS identification. Eitel, et al. [8] proposed a CNN-based method for MS detection with layer-wise relevance propagation. Alijamaat, et al. [9] put forward a wavelet CNN for MS detection in brain MRI images. Han, et al. [10] used an adaptive genetic algorithm (AGA) for MS recognition. Han, et al. [11] employed particle swarm optimization (PSO) for MS recognition. Tang [12] used a five-layer CNN (5l-CNN) for MS detection.
These previous works can be divided into two categories. The first category of methods [5,6] is based on traditional hand-crafted features. These methods require specific features to be designed manually, which is usually tedious and time-consuming. The second category of methods [7][8][9] is based on deep learning. Such methods commonly adopt deep neural networks, which may contain over fifty or even two hundred layers, to conduct the classification. These huge neural networks, nevertheless, are hard to train and consume considerable computational resources (mainly GPUs) that are expensive for some researchers to afford. In this study, we propose an approach based on a 6-layer convolutional neural network to identify brain MRI images for the diagnosis of multiple sclerosis. Compared with traditional methods based on manual feature extraction, our approach applies a CNN, so it has a stronger capability of feature extraction and object classification and also avoids the complicated process of manual feature selection. Compared with methods based on deep neural networks, our model has only six layers instead of dozens or even hundreds of layers. Large networks tend to be time-consuming, laborious, and difficult to train to convergence, and they also overfit easily, while our 6-layer neural network does not suffer from these disadvantages. The second strength of the proposed 6-layer CNN is that it adopts stochastic pooling, which brings better generalization performance compared with deep neural networks using max pooling. Crucially, we conducted up to eighteen ways of data augmentation. To the best of our knowledge, no previous work on diagnosing MS has used so many ways of data augmentation; our approach has the most diverse and comprehensive data augmentation at present. In general, the proposed approach has a simple network architecture, fast training speed, and easy convergence. At the same time, owing to stochastic pooling and multiple-way data augmentation, the approach achieved competitive results in the detection of multiple sclerosis on brain MRI images.
In the next section of the paper, we first introduce the experimental data and the preprocessing applied to the dataset. Data and preprocessing are vital for building a successful neural network for a vision task. In the third section, we present the CNN architecture, stochastic pooling, and multiple-way data augmentation step by step, and then describe the design of our experiments, including validation and evaluation. In the fourth section, we discuss the experimental results. Notably, the experiments include ablation studies demonstrating how much the stochastic pooling and multiple-way data augmentation we applied boost the performance of a simple CNN in MS diagnosis. Finally, in the fifth section, we summarize this study and suggest possible directions for future improvement.

Dataset Sources
We acquired the same dataset as [7]. This dataset consists of 1,357 MRI images in total, of which 676 slices [13] are multiple sclerosis images and 681 slices [7] are healthy controls. We randomly selected two samples from the dataset, as shown in Figure 1. Figure 1(a) presents an original MS slice and Figure 1(b) shows the plaques delineated on Figure 1(a). Table 1 summarizes the demographic characteristics of the dataset.

Data preprocessing
As mentioned above, our dataset was combined from two sources of images. This can lead to differences in image characteristics between the two sources owing to factors such as scanning equipment and the reconstruction process. To suppress these differences, we applied a contrast normalization technique to bring the two sources of images into the same range of gray-level intensity. In this study, we adopted histogram stretching [14] as our method of contrast normalization because of its effectiveness and simplicity.
Through this operation, the distributions of gray-level intensity in the two sources of images are stretched to the same range. As a result, the two sources of images are merged as far as possible into one consistent dataset, avoiding negative influence on subsequent processing.
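As an illustration of this step, the following minimal sketch (Python with NumPy; the function name and the output range are our own illustrative choices, not necessarily the exact settings of our pipeline) stretches the gray levels of a slice to a common range:

```python
import numpy as np

def histogram_stretch(img, out_min=0.0, out_max=255.0):
    """Linearly stretch the gray-level range of one slice to [out_min, out_max]."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                      # constant image: nothing to stretch
        return np.full_like(img, out_min)
    return (img - lo) / (hi - lo) * (out_max - out_min) + out_min

# Applying the same normalization to slices from both sources maps them onto
# one gray-level range, e.g.:
#   slice_ms = histogram_stretch(slice_ms)
#   slice_hc = histogram_stretch(slice_hc)
```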

Pooling Layer
The pooling layer, also called subsampling or downsampling, is often placed after a convolutional layer in a classic CNN architecture [15]. Its main purposes include reducing the feature dimension of the convolutional layer output [16], suppressing noise, reducing the number of parameters and the computational cost, and dampening overfitting [17].
Unlike most other networks using a CNN [18] as the backbone, our 6-layer CNN applies stochastic pooling rather than max pooling or average pooling. Suppose a pooling window covers a region of the feature map containing k elements. Each element of the feature map is denoted v_i, where i is the index of the element [19]. After the pooling window slides over this region, the output of the pooling operation is written as u. Max pooling can then be described as:

u = \max_{1 \le i \le k} v_i  (4)

which means max pooling always selects the largest element within the region of the feature map [20]. Average pooling can be described as:

u = \frac{1}{k} \sum_{i=1}^{k} v_i  (5)

which means average pooling adopts the mean value of the k elements. In stochastic pooling, we first calculate the probability map of the chosen region:

p_i = \frac{v_i}{\sum_{j=1}^{k} v_j}  (6)

Stochastic pooling then chooses the value of one element as the sampled output according to this probability distribution [21]. The larger p_i is, the more likely v_i is chosen as the sampled value, but the choice is not deterministic. The mechanism can be described as:

u = v_i, \quad i \sim P(p_1, \ldots, p_k)  (7)

Figure 2 compares max pooling, average pooling, and stochastic pooling.
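To make the three pooling rules concrete, here is a minimal NumPy sketch (not our production implementation) that applies non-overlapping k × k pooling to a single 2D feature map, assuming non-negative activations such as ReLU outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_window(v, mode="stochastic"):
    """Pool one flattened window v of non-negative activations."""
    if mode == "max":
        return v.max()                                        # Eq. (4)
    if mode == "average":
        return v.mean()                                       # Eq. (5)
    s = v.sum()
    p = v / s if s > 0 else np.full(v.size, 1.0 / v.size)     # Eq. (6): probability map
    return rng.choice(v, p=p)                                 # Eq. (7): sample one activation

def pool2d(fmap, k=2, mode="stochastic"):
    """Apply non-overlapping k x k pooling to a 2D feature map."""
    h, w = fmap.shape[0] // k, fmap.shape[1] // k
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = pool_window(fmap[i*k:(i+1)*k, j*k:(j+1)*k].ravel(), mode)
    return out
```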

Structure of our six-layer CNN
Our proposed CNN structure consists of three convolutional layers and three fully-connected layers [22]. Generally speaking, the convolutional layers extract features while the fully-connected layers perform classification. Each convolutional layer is followed by an activation function and a stochastic pooling layer [23,24]. The activation function applies a nonlinear transformation after the convolution calculation. The activation function and pooling layer contain no learnable weights [25], so we do not count them when naming the network depth. As shown in Table 3, three convolutional layers and three stochastic pooling layers form the 3-layer convolutional part of the structure.
Our 6-layer CNN also contains three fully-connected layers and three dropout layers. A dropout layer is inserted ahead of each fully-connected layer to make the CNN more robust during training. The retention probabilities of the three dropout layers were set to 0.5, 0.5, and 0.5, respectively, by trial and error. Table 4 presents the structure of the fully-connected layers in our proposed model. Finally, Figure 3 portrays the whole structure of the proposed 6-layer stochastic pooling CNN.
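As an illustration of this layer arrangement, the PyTorch sketch below stacks three convolution/ReLU/stochastic-pooling blocks followed by three dropout/fully-connected layers. The channel widths, kernel sizes, fully-connected sizes, and 128 × 128 input resolution are placeholders rather than the exact hyperparameters of Tables 3 and 4, and the hand-written stochastic pooling module (sampling during training, probability-weighted averaging at test time) is only a prototype:

```python
import torch
import torch.nn as nn

class StochasticPool2d(nn.Module):
    """Non-overlapping k x k stochastic pooling (train: sample; test: weighted mean)."""
    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def forward(self, x):
        n, c, h, w = x.shape
        wins = x.unfold(2, self.k, self.k).unfold(3, self.k, self.k)
        wins = wins.contiguous().view(n, c, h // self.k, w // self.k, -1)
        wins = wins.clamp(min=0) + 1e-12           # assume non-negative (post-ReLU) values
        p = wins / wins.sum(dim=-1, keepdim=True)  # per-window probability map
        if not self.training:
            return (wins * p).sum(dim=-1)          # probability-weighted average at test time
        idx = torch.multinomial(p.reshape(-1, p.shape[-1]), 1)
        idx = idx.view(*p.shape[:-1], 1)
        return wins.gather(-1, idx).squeeze(-1)    # one sampled activation per window

class SixLayerCNN(nn.Module):
    """Three conv + stochastic-pooling blocks, then three dropout + fully-connected layers."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), StochasticPool2d(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), StochasticPool2d(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), StochasticPool2d(),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, 64), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(64, n_classes),
        )

    def forward(self, x):                          # x: (N, 1, 128, 128) placeholder slices
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```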

Multiple-way data augmentation
It is well known that the learning process of a neural network relies on a large number of data samples. On most occasions, the more training data fed to a neural network, the better the resulting model [26,27]. In reality, however, data samples are often insufficient. A lack of samples not only prevents the model from reaching its best performance, but also leads to difficult training and frequent overfitting. Data augmentation (DA) [28,29] aims to expand the original small dataset into a larger one by means of digital signal processing, so as to alleviate the problem of insufficient samples. In previous work, data augmentation was applied with only five ways (rotation, scaling, Gaussian noise, random translation, and Gamma correction) [30]. Our contribution is that we exploited up to 18-way data augmentation. As far as we are aware, this study applies the most ways of data augmentation among existing CNN-based MS detection approaches [31]. Our data augmentation methods fall into three categories: geometric-based methods, noise-based methods, and photometric-based methods. First, there are nine ways of data augmentation [32]. Then, by reflecting the augmented samples horizontally, we double the set, giving eighteen ways in total (see the sketch below). Table 5 lists the data augmentation methods we used.
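The nine-to-eighteen doubling can be expressed in a few lines. In this sketch, `augmenters` stands for the nine single-way augmentation functions of Table 5, and `make_18_way` is a hypothetical helper, not code from our implementation:

```python
import numpy as np

def make_18_way(img, augmenters):
    """augmenters: list of 9 functions, each mapping a 2D slice to an augmented slice."""
    augmented = [aug(img) for aug in augmenters]   # 9 single-way variants
    mirrored = [np.fliplr(a) for a in augmented]   # horizontal reflection of each variant
    return augmented + mirrored                    # 18 augmented slices in total
```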

Geometric-based methods
In geometry, geometric-based methods are also known as affine transformations, which map one vector space to another [33]. An affine transformation is a linear transformation plus a shift [34]. Assume the original vector is denoted \mathbf{r}, the linear transformation is described by a matrix A, and the shift is denoted \mathbf{b}. The new transformed vector is then calculated as:

\mathbf{r}' = A\mathbf{r} + \mathbf{b}  (8)

In digital image processing, an affine transformation moves points in the image from their previous coordinates to new ones. Assume the coordinates of the raw image are (x, y) and the transformed coordinates are (x', y'). The transformation process can be described in homogeneous coordinates as:

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = M \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}  (9)

in which M is called the affine transformation matrix. Every affine transformation can be represented by a particular affine transformation matrix. Here we introduce the six ways of affine transformation employed in this study.
Horizontal flipping. This is a geometric transformation performed on the raw image to generate a mirror image that is symmetrical about the y-axis [35]. Compared with vertical flipping, horizontal flipping is more often adopted and has been shown to be effective on popular datasets such as ImageNet and CIFAR-10 [36]. The affine transformation matrix of horizontal flipping can be written as

M_{hf} = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Thus, we obtain the transformed coordinates (x', y') = (-x, y).
Horizontal shear. This changes the location of each point in the image horizontally, along the x-axis, where the amount of displacement along the x-axis is determined by the point's y-coordinate [37]. With shear factor s, the affine transformation matrix of horizontal shear can be written as

M_{hs} = \begin{bmatrix} 1 & s & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

so the transformed coordinates are (x', y') = (x + s y, y).
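Under the homogeneous-coordinate form of Eq. (9), these transforms reduce to 3 × 3 matrices. The sketch below (NumPy; the shear factor 0.2 and the 256 × 256 slice size are arbitrary illustrative values) builds the two matrices above and applies them to image coordinates:

```python
import numpy as np

def flip_h():
    """Horizontal flip about the y-axis: x' = -x, y' = y."""
    return np.array([[-1, 0, 0],
                     [ 0, 1, 0],
                     [ 0, 0, 1]], dtype=float)

def shear_h(s):
    """Horizontal shear: x' = x + s*y, y' = y."""
    return np.array([[1, s, 0],
                     [0, 1, 0],
                     [0, 0, 1]], dtype=float)

def transform_points(M, pts):
    """Apply affine matrix M to an (N, 2) array of (x, y) coordinates."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])   # to homogeneous coordinates
    return (homo @ M.T)[:, :2]

# Example: shear the corner points of a 256 x 256 slice with factor 0.2
corners = np.array([[0, 0], [255, 0], [0, 255], [255, 255]], dtype=float)
print(transform_points(shear_h(0.2), corners))
```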

Noise-based methods
In data augmentation, noise-based methods inject noise into image samples. By adding noise to the images, the training dataset gains sampling variance, which helps overcome the lack of data.
Gaussian noise. As one of the most commonly used noise types, Gaussian noise is often added to raw images in data augmentation. Denoting the gray level as z, z obeys the following probability density function:

p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(z-\mu)^2}{2\sigma^2}\right)  (16)

where μ represents the mean gray value and σ is the standard deviation of z.
Salt-and-pepper noise. This is a widely used noise in data augmentation as well. In salt-and-pepper noise augmentation, z obeys the probability density function:

p(z) = \begin{cases} P_a, & z = a \\ P_b, & z = b \\ 0, & \text{otherwise} \end{cases}  (17)

where a and b are the gray values for salt noise and pepper noise. Figure 4(g) shows examples of salt-and-pepper noise.
Speckle noise. As a granular interference, speckle noise naturally occurs in radar or ultrasound images. Suppose F is the observed image, f is the noise-free image, N_m is multiplicative noise, and N_a is additive noise. Then speckle noise can be defined as:

F = f \cdot N_m + N_a  (18)

Figure 4(h) shows examples of speckle noise.
Photometric-based method
Gamma correction. Gamma correction was originally devised for luminance adjustment in imaging and display systems [38], because human perception of luminance is not linear in light power but follows a power-law relation, whose exponent is denoted γ. Gamma correction is usually written as:

v_{out} = A \, v_{in}^{\gamma}  (19)

In this equation, v_out represents the output gray value and v_in the input gray value. When γ < 1, the operation is often called gamma compression; when γ > 1, it is called gamma expansion [39]. In this study, we applied gamma correction to the raw images as one of the data augmentation methods to enlarge our training set. Figure 4(i) shows examples of Gamma correction.
Finally, Figure 4 illustrates the effects of the multiple-way data augmentation.
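As a rough sketch of the noise-based and photometric methods (assuming slices scaled to [0, 1]; the noise levels and γ below are illustrative placeholders rather than our actual augmentation settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(img, mu=0.0, sigma=0.05):
    """Additive Gaussian noise, cf. Eq. (16)."""
    return np.clip(img + rng.normal(mu, sigma, img.shape), 0, 1)

def add_salt_pepper(img, p_salt=0.01, p_pepper=0.01):
    """Salt-and-pepper noise, cf. Eq. (17): random pixels forced to 1 (salt) or 0 (pepper)."""
    out = img.copy()
    r = rng.random(img.shape)
    out[r < p_pepper] = 0.0
    out[r > 1 - p_salt] = 1.0
    return out

def add_speckle(img, sigma=0.1):
    """Multiplicative speckle noise, cf. Eq. (18): F = f * N_m with N_m = 1 + n (N_a omitted)."""
    return np.clip(img * (1 + rng.normal(0, sigma, img.shape)), 0, 1)

def gamma_correct(img, gamma=0.7, A=1.0):
    """Gamma correction, Eq. (19): v_out = A * v_in ** gamma."""
    return np.clip(A * img ** gamma, 0, 1)
```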

10-fold Cross Validation
In this study, we utilized 10-fold cross validation to divide the dataset. We split the whole dataset into ten folds [40]; each fold contains 67 multiple sclerosis slices and 68 healthy controls. In every iteration, we adopted nine folds as the training set and the remaining fold as the test set, and repeated this procedure ten times. Figure 5 depicts this process of splitting the dataset into ten folds and repeating training and testing for ten iterations.
In the end, we obtained ten estimates of model performance and calculated the final estimate by averaging the results of the ten iterations [41]:

E = \frac{1}{10} \sum_{i=1}^{10} E_i  (20)
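A schematic of this protocol using scikit-learn's StratifiedKFold is given below; `train_and_eval` is a hypothetical callback that trains the 6-layer CNN on the nine training folds and returns its score on the held-out fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_cv(images, labels, train_and_eval, seed=0):
    """Run 10-fold cross validation and average the per-fold scores, cf. Eq. (20)."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(images, labels):
        score = train_and_eval(images[train_idx], labels[train_idx],
                               images[test_idx], labels[test_idx])
        scores.append(score)
    return np.mean(scores), np.std(scores)
```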

Measure
In this study, we applied the confusion matrix (shown in Table 6) to measure performance. From the confusion matrix we counted true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and used these values to calculate sensitivity (SEN), specificity (SPC), precision (PRC), accuracy (ACC), F_1 score, Matthews correlation coefficient (MCC), and Fowlkes-Mallows index (FMI):

SEN = \frac{TP}{TP + FN}
SPC = \frac{TN}{TN + FP}
PRC = \frac{TP}{TP + FP}
ACC = \frac{TP + TN}{TP + TN + FP + FN}
F_1 = \frac{2 \times PRC \times SEN}{PRC + SEN}
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
FMI = \sqrt{PRC \times SEN}

In addition to these seven measures, we report the average and standard deviation of each measure over the ten runs for the performance comparison experiments. Table 7 shows the results of the 10 runs. Our approach based on the 6-layer stochastic pooling CNN and multiple-way data augmentation secured a sensitivity of 95.98 ± 0.46%, a specificity of 95.67 ± 0.92%, a precision of 95.66 ± 0.89%, an accuracy of 95.82 ± 0.58%, an F1 score of 95.81 ± 0.57%, an MCC of 91.65 ± 1.16%, and an FMI of 95.82 ± 0.57%. Figure 6 presents the error bars of the 10-run results.
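These definitions translate directly into code; the helper below is a minimal sketch that returns all seven measures from the four confusion-matrix counts:

```python
import math

def measures(tp, tn, fp, fn):
    sen = tp / (tp + fn)                       # sensitivity (recall)
    spc = tn / (tn + fp)                       # specificity
    prc = tp / (tp + fp)                       # precision
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    f1 = 2 * prc * sen / (prc + sen)           # F1 score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Matthews correlation coefficient
    fmi = math.sqrt(prc * sen)                 # Fowlkes-Mallows index
    return dict(SEN=sen, SPC=spc, PRC=prc, ACC=acc, F1=f1, MCC=mcc, FMI=fmi)
```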

Pooling Methods Comparison
To inspect the contribution of stochastic pooling to the model's performance, we conducted comparison experiments in which stochastic pooling was replaced with average pooling and with max pooling in the proposed CNN. As shown in Table 8, the proposed CNN with average pooling fell behind the stochastic pooling variant; Figure 7 depicts the error bars of the 10-run results using average pooling.

As presented in Table 9, with max pooling, the proposed CNN obtained a sensitivity of 94.06 ± 1.54%, a specificity of 94.56 ± 1.44%, a precision of 94.54 ± 1.41%, an accuracy of 94.31 ± 1.27%, an F1 score of 94.30 ± 1.29%, an MCC of 88.64 ± 2.54%, and an FMI of 94.30 ± 1.28%. Figure 8 depicts the error bars of the results using max pooling.
From these comparison experiments, we observe that stochastic pooling achieved the best performance on almost every measure, including sensitivity, specificity, precision, accuracy, F1 score, MCC, and FMI. To present the advantage of stochastic pooling conveniently, we drew Figure 9. It reveals that our model gained nearly 2% in each measure when using stochastic pooling compared with average pooling or max pooling.

Comparison with State-of-the-art Algorithms
We compared our 6-layer stochastic pooling CNN with state-of-the-art algorithms for multiple sclerosis detection, namely AGA [10], PSO [11], and 5l-CNN [12]. These three state-of-the-art algorithms were tested on the same dataset as ours. The comparison results are given in Table 10. We also offer Figure 10 to show our method's strength against the state-of-the-art algorithms. Our proposed method achieved the best sensitivity, specificity, accuracy, F1 score, MCC, and FMI, surpassing the second-best algorithm by nearly 1% in every measure. Besides, our method had the smallest standard deviations compared with the state-of-the-art algorithms. These results demonstrate the effectiveness of our proposed approach. There are some shortcomings of our approach: (i) the dataset used in this study is not large, so we will seek larger datasets or collect more sample images; (ii) we will try new deep learning technologies, such as attention mechanisms, for multiple sclerosis detection.

Conclusions
In this study, we proposed a novel framework for multiple sclerosis detection using a 6-layer stochastic pooling CNN combined with multiple-way data augmentation. We adopted stochastic pooling in our framework and verified its superiority over other pooling methods through comparison experiments. We also proposed 18-way data augmentation, comprising geometric-based, noise-based, and photometric-based methods. Our approach beat several state-of-the-art algorithms, attaining a sensitivity of 95.98 ± 0.46%, a specificity of 95.67 ± 0.92%, a precision of 95.66 ± 0.89%, an accuracy of 95.82 ± 0.58%, and an F1 score of 95.81 ± 0.57%. The experimental results show that our approach achieved the highest performance in multiple sclerosis detection compared with several state-of-the-art algorithms.