Alcoholism via 6-layer customized deep convolution neural network

Background: Alcoholism is caused by excessive intake of alcohol. Alcohol primarily damages the central nervous system, causing dysfunction and inhibition of the nervous system. Severe addiction can inhibit the respiratory and circulatory centers and lead to paralysis and even death. So far, alcoholism has been diagnosed through manual CT examination by radiologists. However, this diagnostic process is time-consuming, subjective and tedious for doctors, and it is easily affected by external factors such as extreme fatigue, lack of sleep and poor concentration. Methods: To solve this problem, this paper proposes a new computer-vision method that uses a deep convolution neural network to diagnose alcoholism automatically. A total of 216 brain images were collected. The 6-layer customized deep convolution neural network has four convolution layers and two fully connected layers, and each convolution layer is followed by a pooling layer. Results: The accuracy, sensitivity, specificity, precision, F1, MCC and FMI were 95.96% ± 1.44%, 95.96% ± 1.66%, 95.95% ± 1.67%, 95.73% ± 1.72%, 95.84% ± 1.48%, 91.92% ± 2.87% and 95.84% ± 1.48%, respectively. Conclusion: The comparison results show that the proposed neural network structure is more effective than four state-of-the-art approaches. The proposed method has high accuracy and can be used as a diagnostic method for alcoholism.


Introduction
Alcohol has an inhibitory effect on the central nervous system, so moderate drinking makes the body feel relaxed and warm, which can relieve fatigue, calm the nerves, relieve pain and aid sleep. Heavy drinking, however, harms the body and can even lead to alcoholism, which refers to the mental and physical disorders caused by excessive drinking. Alcoholism can be divided into acute alcoholism and chronic alcoholism. Acute alcoholism is caused by drinking a large amount of alcohol. A large amount of alcohol entering the body first excites the central nervous system and then inhibits it. During this process, people may behave irrationally, for example giggling, crying, attacking others or becoming forgetful. Severe alcoholism may result in liver glucose failure, hypoglycemia, respiratory failure, circulatory failure, and even death. Long-term heavy drinking can cause chronic alcoholism, which can cause mental and physical disorders. Patients with chronic alcoholism usually have a drinking history of more than 10 years. Lesions occur in the cerebral cortex, cerebellum, pons and other regions of these patients. In addition, the liver, heart and endocrine glands are damaged, leading to deficiency of various enzymes and vitamins in the body and severe malnutrition. Patients with this disease can develop alcohol dependence syndrome and have withdrawal reactions. Alcoholism, whether acute or chronic, can lead to many serious complications, and the prognosis is not optimistic.
With the constant development of computer technology, computer-based methods have been applied to many fields, including medicine. Kumar, S. et al. used a support vector machine and the fuzzy c-means clustering algorithm to reduce the feature dimensions of EEG, so as to detect the effect of alcohol on the cerebral cortex more accurately (1). Rodrigues, J. d. C. et al. presented the classification of alcoholic electroencephalographic (EEG) signals using Wavelet Packet Decomposition (WPD) and machine learning techniques (2). Anuragi, A. et al. proposed a novel empirical wavelet transform (EWT) based machine learning framework for the classification of alcoholic and normal subjects using EEG signals (3). In this framework, adaptive filtering is used to extract time-frequency-domain features from the Hilbert-Huang Transform (HHT). Hou, X.-X. proposed to use Hu moment invariants (HMIs) (4). Han, L. employed the three-segment encoded Jaya (3SEJ) algorithm for alcoholism recognition (5). Qian, P. presented a novel method based on cat swarm optimization (CSO) (6). Chen, X. used a linear regression classifier for alcoholism detection (7).
The diagnosis of alcoholism relies on doctors' manual observation of brain images. However, this diagnostic process is time-consuming, subjective and tedious for doctors, and it is easily affected by external factors such as extreme fatigue, lack of sleep and poor concentration. To solve this problem, this paper proposes a 6-layer customized deep convolution neural network structure for the automatic diagnosis of alcoholism. The main contributions of this paper are: (i) we propose an automatic diagnosis method for alcoholism based on a deep convolution neural network; (ii) compared with four state-of-the-art approaches, the proposed neural network structure is more effective. Based on its experimental and comparison results, the proposed neural network structure can be used as one of the methods for diagnosing alcoholism.
The rest of this paper is organized as follows. Section 2 introduces the data sources and preprocessing. Section 3 describes the 6-layer customized deep convolution neural network structure. Section 4 presents the experimental results and the comparison with four state-of-the-art approaches. Section 5 gives the conclusion, the shortcomings of this paper and future research.

Materials Database
In this study, only samples that met the inclusion standards were added to the database for further experiments. The applicants joined this research through advertisement or were recruited at Nanjing Brain Hospital of Jiangsu Province, Provincial Hospital and Nanjing Children's Hospital. To ensure the precision of the experiment, the applicants were carefully examined and those with major mental illness were excluded. Applicants who were not proficient in Putonghua were excluded, and data were also rejected if applicants had diseases or symptoms such as stroke, epilepsy, liver cirrhosis, liver failure or HIV. Applicants who had experienced a loss of consciousness for more than 15 minutes due to seizures were also excluded.
With the full knowledge and consent of the participants, it took us three years to complete the data collection. A total of 235 participants (117 males, 118 females) were tested, consisting of 114 long-term abstinent participants (58 males, 56 females) and 121 non-alcoholic control participants (59 males, 62 females). All participants were tested with the "Alcohol Use Disorder Identification Test (AUDIT)" (8). The test results are shown in Table 1, in grams (9).

MRI Scan
In this study, a Siemens Verio Tim 3.0T MRI scanner was used, and a total of 216 sagittal slices covering the whole brain were obtained. During the scan, all applicants remained awake, lay quietly and kept their eyes closed. The MP-RAGE sequence was used to acquire the 216 images covering the whole brain. In the experiment, our final images used 8-bit gray depth instead of 16-bit gray depth, because alcoholism can change brain structure, but not the gray

Slice
We extracted the brain from all the 3D images and removed the skull by using FSL (FMRIB Software Library) (10,11). All images were registered to the MNI standard template and resampled to 2 mm isotropic voxels, as shown in Figure 1. We selected the 80th slice (Z = 80.8 mm) in MNI 152 coordinates. Compared with other brain slices, this slice shows two characteristics of alcoholism patients: (1) reduced gray matter, (2) enlarged ventricles. After cropping the background, a 176 × 176 matrix was left for subsequent training.

Methodology
The main method used in this paper is the deep convolution neural network (DCNN). DCNN has advanced dramatically over the past decade in numerous fields related to pattern recognition, from image processing to voice recognition (12). DCNN can reduce the number of parameters in a neural network, and this advantage makes it widely used in image recognition, speech recognition and many other fields.
Although there are many different DCNN frameworks, their basic components are the same or similar, as shown in Figure 2. The input layer, convolution layers, pooling layers, activation layers, fully connected layers and the output layer constitute a DCNN framework.

Convolution
The convolution layer is composed of an input, a convolution kernel and an output. The convolution kernel is used to learn and extract input features (13,14). A neural network usually contains many convolution layers to increase efficiency (15). Convolution is a fairly simple operation: we start with a small weight matrix (16), the convolution kernel, and let it gradually "scan" the input data. As the convolution kernel "slides", it computes the element-wise product of the weight matrix and the scanned data matrix (17), and then aggregates the results into one output pixel. The output size is given by the following formulas: W_{i+1} = (W_i - F_w + 2P)/S + 1 [1] H_{i+1} = (H_i - F_h + 2P)/S + 1 [2] D_{i+1} = K [3] In the above formulas, the size of the input matrix is W_i × H_i × D_i (W is width, H is height and D is depth), the size of the output matrix is W_{i+1} × H_{i+1} × D_{i+1}, F_w represents the width of the convolution kernel (18), F_h is the height of the convolution kernel, P represents the padding, S represents the stride, and K represents the number of filters.
As shown in Figure 3, the input is a 4 × 4 matrix, the filter is a 3 × 3 matrix and the output is a 2 × 2 matrix. At each position, the filter elements are multiplied by the corresponding input elements and the products are summed (sometimes called a multiply-accumulate operation). The result is then saved to the corresponding location in the output. The output of the convolution operation is obtained by performing this procedure at all locations. Each step is calculated as follows:
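The sliding multiply-and-sum procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: the function name `conv2d_valid` and the input and kernel values are hypothetical (they are not the values of Figure 3), and the sketch assumes no padding and a stride of 1, which is what maps a 4 × 4 input and a 3 × 3 filter to a 2 × 2 output.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the output matrix."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - k_h) // stride + 1   # equation [2] with P = 0
    out_w = (in_w - k_w) // stride + 1   # equation [1] with P = 0
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each kernel weight by the input value under it, then sum.
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4 x 4 input and 3 x 3 filter (illustrative values only).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [2, 0, 1],
    [0, 1, 2],
    [1, 0, 2],
]

result = conv2d_valid(image, kernel)
print(result)  # [[15, 16], [6, 15]]
```

For the top-left output pixel, for instance, the three window rows contribute 5 + 5 + 5 = 15. A real network would add a bias term and process many channels and filters at once, which this sketch leaves out.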

Pooling
Pooling is another important concept in deep convolution neural networks, and it is in fact a form of downsampling (19). There are many different nonlinear pooling functions, of which max pooling is the most common, as shown in Figure 4. It divides the input image into several rectangular regions and outputs the maximum value of each region. The max pooling formula is as follows: MP_N = max(R_N) [4] In the above formula, MP is the max pooling output, N represents the pooling region, and R_N is the set of activations within the pooling region (20,21).
Intuitively, max pooling works because, once a feature has been found, its precise position is far less important than its position relative to other features (22). The pooling layer continuously reduces the spatial size of the data, so the number of parameters and the amount of calculation also decrease, which controls overfitting to a certain extent (23,24). Another method is average pooling, which calculates the average value of a region instead of the maximum value, as shown in Figure 4. The average pooling formula is as follows: AP_N = (1/|R_N|) Σ_{a ∈ R_N} a [5] After the pooling operation, the output size is given by: W_output = (W_input - G_pooling)/S + 1 [6] H_output = (H_input - G_pooling)/S + 1 [7] D_output = D_input [8] In the above formulas, the size of the input matrix is W_input × H_input × D_input (W is width, H is height and D is depth), the size of the output matrix is W_output × H_output × D_output, G_pooling represents the size of the pooling kernel and S represents the stride (25). Generally speaking, pooling layers are periodically inserted between the convolution layers of a DCNN. In short, the pooling layer removes redundant information and retains key information.
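As an illustration of the two pooling variants, the following sketch applies 2 × 2 pooling with a stride of 2 to a small feature map. The helper name `pool2d` and the feature-map values are hypothetical and chosen only for this example; the output size follows equations [6] and [7].

```python
def pool2d(matrix, size=2, stride=2, mode="max"):
    """Apply max or average pooling over square regions of a 2D matrix."""
    in_h, in_w = len(matrix), len(matrix[0])
    out_h = (in_h - size) // stride + 1   # equation [7]
    out_w = (in_w - size) // stride + 1   # equation [6]
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Collect the activations inside the current pooling region.
            region = [
                matrix[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            if mode == "max":
                row.append(max(region))
            else:
                row.append(sum(region) / len(region))
        output.append(row)
    return output


# Hypothetical 4 x 4 feature map (illustrative values only).
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 3, 1],
    [8, 6, 7, 5],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[7, 8], [8, 7]]
print(avg_pooled)  # [[4.0, 5.0], [5.0, 4.0]]
```

Both variants halve the width and height here, but max pooling keeps the strongest response in each region while average pooling smooths it.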

Batch Normalization
Batch normalization (BN) is a way to standardize scattered data and make the neural network easier to optimize. During training, as the network depth increases (26), the distribution of the input value of each layer (i.e. x = Wu + B, where u is the input) gradually shifts (27). Training converges slowly because the whole distribution drifts toward the upper and lower limits of the value range of the nonlinear function (28). This causes the gradients of the lower layers to vanish during back propagation, which is why training a deep convolution neural network becomes slow. BN forcibly pulls the distribution of the input values of any neuron in each layer back to a standard normal distribution with a mean of 0 and a variance of 1 (29).
In a deep convolution neural network, the values of each hidden layer of neurons can be batch normalized, which can be thought of as adding a BN operation layer to each hidden layer (30). The BN operation is applied after the value x = Wu + B is obtained and before the nonlinear function transformation. The specific formula of batch normalization is as follows: x̂ = (x - E[x])/√Var[x] [9], where E[x] and Var[x] are the mean and variance of x over the current batch. In a word, BN takes an input distribution that has gradually drifted toward the saturated limits of the nonlinear function's value range and forces it back to a normal distribution with a mean of 0 and a variance of 1 (32). In this way, the input value of the nonlinear transformation function falls into the region where the function is sensitive to its input, which avoids the vanishing gradient problem. Because the gradients are then larger, the efficiency of parameter adjustment improves and convergence is accelerated (33).
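The normalization step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `batch_normalize` and the sample values are hypothetical, a small constant `eps` is added to the variance for numerical stability, and the learnable scale and shift parameters that standard batch normalization applies after this step are omitted because the text does not describe them.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Standardize one neuron's pre-activation values over a batch."""
    n = len(values)
    mean = sum(values) / n
    # Population variance over the batch.
    variance = sum((v - mean) ** 2 for v in values) / n
    # eps avoids division by zero when the variance is tiny (an assumption here).
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


# Hypothetical pre-activation values x = Wu + B for one neuron across a batch of four.
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(pre_activations)

batch_mean = sum(normalized) / len(normalized)
batch_var = sum((v - batch_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print(batch_mean, batch_var)  # mean close to 0, variance close to 1
```

After normalization the values are centered on zero, so they fall in the range where the nonlinear function responds most strongly to its input.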

Rectified Linear Unit
In a neural network, the activation function transforms the summed weighted input of a node into the activation, or output, of that node. Training a deep convolution neural network with stochastic gradient descent and error back propagation requires an activation function. The activation function is a nonlinear function that allows the network to learn complex relationships in the data. At the same time, it should remain sensitive to its input and avoid saturating easily. In this paper, we used the rectified linear unit (ReLU) activation function. ReLU is simple to calculate, and the formula is as follows: f(x) = max(0, x) [10] The function is linear for values greater than zero, so ReLU has many desirable properties of a linear activation function. However, it is a nonlinear function because negative values always output zero: if the input is 0 or less, it returns 0, as shown in Figure 5.
Compared with other activation functions, such as sigmoid and tanh, ReLU is simple to calculate because it is just a max function. ReLU can output a true zero value, whereas sigmoid can only approach zero for negative inputs and tanh reaches zero only when the input is exactly zero.
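A short sketch makes the comparison concrete. The input values are hypothetical; the point is only that ReLU returns an exact zero for every non-positive input, while sigmoid returns small but strictly positive values.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes everything else."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, shown only for comparison."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical inputs spanning negative, zero and positive values.
inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
relu_out = [relu(x) for x in inputs]
sigmoid_out = [sigmoid(x) for x in inputs]

print(relu_out)     # [0.0, 0.0, 0.0, 0.5, 3.0]
print(sigmoid_out)  # every value lies strictly between 0 and 1
```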

Structure of customized DCNN
In this paper, DCNN was the main method. The DCNN structure consisted of four convolution layers, four pooling layers and two fully connected layers. The first convolution layer had 32 filters, the second and third convolution layers had 64 filters each, and the fourth convolution layer had 128 filters. All filters in the four convolution layers were 3 × 3 matrices.
A pooling layer followed each convolution layer, and the filter size of all four pooling layers was 2 × 2. After the convolution and pooling operations, two fully connected layers were attached. The structure and detailed parameters of the DCNN are shown in Table 2.
The specific flow chart is shown in Figure 6.
After four convolution layers and four pooling layers, the output size was 11 × 11 × 128. For the first fully connected layer, the number of parameters was obtained by multiplying the flattened input size (11 × 11 × 128 = 15488) by the number of units in the layer, i.e. 15488 × 1000 = 15488000. The second fully connected layer was calculated in the same way and had 2000 parameters (2 × 1000 = 2000).
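The sizes quoted above can be checked by tracing the 176 × 176 input through the four convolution and pooling stages. The sketch below rests on assumptions that the text does not state explicitly: each 3 × 3 convolution uses a stride of 1 with a padding of 1 (so it preserves width and height), each 2 × 2 pooling uses a stride of 2, and bias terms are ignored in the parameter counts, as in the arithmetic above. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, padding, stride):
    """Width or height after a convolution layer (equations [1] and [2])."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_output_size(size, pool, stride):
    """Width or height after a pooling layer (equations [6] and [7])."""
    return (size - pool) // stride + 1


size = 176                       # cropped slice is 176 x 176
filters = [32, 64, 64, 128]      # filters per convolution layer
shapes = []
for num_filters in filters:
    # Assumed: padding 1 and stride 1 keep the spatial size unchanged.
    size = conv_output_size(size, kernel=3, padding=1, stride=1)
    # Assumed: 2 x 2 pooling with stride 2 halves the spatial size.
    size = pool_output_size(size, pool=2, stride=2)
    shapes.append((size, size, num_filters))

flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
fc1_weights = flattened * 1000   # first fully connected layer, biases ignored
fc2_weights = 1000 * 2           # second fully connected layer, biases ignored

print(shapes)       # [(88, 88, 32), (44, 44, 64), (22, 22, 64), (11, 11, 128)]
print(flattened)    # 15488
print(fc1_weights)  # 15488000
print(fc2_weights)  # 2000
```

Under these assumptions the trace reproduces the 11 × 11 × 128 output and both parameter counts given in the text.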

Measures
There will always be some deviation between the machine prediction and the actual class, so we introduce the following measures, such as sensitivity and specificity, to evaluate the performance of the classifier. Before introducing these measures, we first introduce the confusion matrix. We use a two-class model, so we cross-tabulate the predicted classes against the actual classes, as shown in Table 3.

As a result, four situations can appear. True positive (TP) is the number of positive samples that are correctly predicted as positive. False positive (FP) is the number of negative samples that are incorrectly predicted as positive. False negative (FN) is the number of positive samples that are incorrectly predicted as negative. True negative (TN) is the number of negative samples that are correctly predicted as negative.
Sensitivity is the proportion of actual positive cases that are correctly predicted as positive, and measures the ability of the classifier to recognize positive cases: Sensitivity = TP/(TP + FN) [11]
Specificity is the proportion of actual negative cases that are correctly predicted as negative, and measures the ability of the classifier to recognize negative cases: Specificity = TN/(FP + TN) [12]
Precision concerns the prediction results. It is the proportion of actual positive samples among all samples predicted as positive: Precision = TP/(TP + FP) [13]
Accuracy is the percentage of all samples that are predicted correctly: Accuracy = (TP + TN)/(TP + FP + TN + FN) [14]
Precision and recall (i.e. sensitivity) sometimes conflict with each other, so they need to be considered together. The most common way to do so is the F-measure (also known as the F-score): F1 = 2TP/(2TP + FP + FN) [15]
MCC (Matthews correlation coefficient) is a balanced index used mainly for binary classification. Its value ranges from -1 to 1. A value of -1 means that the predicted result is completely opposite to the actual result, a value of 0 means that the prediction is no better than a random prediction, and a value of 1 means that the predicted result is fully consistent with the actual result: MCC = (TP × TN - FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) [16]
A measure of clustering performance is also called a clustering validity index. For clustering results, we need some performance measure to evaluate their quality. Moreover, once the final performance measure is defined, it can be used directly as the optimization objective of the clustering process, so as to better obtain the required clustering results.
In this paper, we use the Fowlkes-Mallows index (FMI) as the evaluation criterion: FMI = √(TP/(TP + FP) × TP/(TP + FN)) [17] The mean value is the sum of all the values divided by their number: μ = (1/n) Σ_{i=1}^{n} y_i [18] where n is the number of runs and y_i is the result of the i-th run.
The standard deviation (SD) reflects the degree of data dispersion, the formula is as follows: [19]
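The measures defined in this section can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name `classification_measures` is hypothetical, and the MCC and FMI lines use the standard definitions of those indices. The counts TP = 112, FN = 2, FP = 3 and TN = 118 are not reported in the paper; they are an assumption, back-computed from the 114 alcoholic and 121 control participants so that the resulting percentages agree with the values reported for the fourth run.

```python
import math


def classification_measures(tp, fp, fn, tn):
    """Return the seven measures used in this section as fractions in [-1, 1]."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    fmi = math.sqrt(precision * sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        "fmi": fmi,
    }


# Assumed counts (not given in the paper), chosen to match the fourth run.
measures = classification_measures(tp=112, fp=3, fn=2, tn=118)
for name, value in measures.items():
    print(f"{name}: {value * 100:.2f}%")
```

With these counts the printed values round to 98.25, 97.52, 97.39, 97.87, 97.82, 95.75 and 97.82, which matches the fourth-run figures quoted in the statistical analysis.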

Statistical Analysis
We performed ten runs and obtained ten sets of results, as shown in Table 4. The highest sensitivity was 98.25 in the fourth group and the lowest was 92.98 in the tenth group. The highest specificity was 97.52, from the fourth group to the seventh group, and the lowest was 93.39 in the third and tenth groups. Among the ten groups, the highest precision was 97.39 in the fourth group, and the lowest was 92.98 in the tenth group. The highest accuracy was 97.87 in the fourth group, and the lowest was 93.19 in the tenth group. The maximum F1 was 97.82 in the fourth group and the minimum was 92.98 in the tenth group. The maximum MCC was 95.75 in the fourth group and the minimum was 86.37 in the tenth group. The maximum FMI was 97.82 in the fourth group and the minimum was 92.98 in the tenth group.
As shown in Figure 7, the fourth group gave the best results, with the highest value of every measure among the ten groups. The tenth group gave the worst results, with the lowest value of every measure among the ten groups.

Comparison to State-of-the-art
To verify the effectiveness of the proposed neural network structure, we compared it with four state-of-the-art approaches: HMI (4), 3SEJ (5), CSO (6) and LRC (7). The comparison results are shown in Table 5. Our method obtained the highest sensitivity (95.96), the highest specificity (95.95), the highest precision (95.73) and the highest accuracy (95.96). It also obtained the highest F1 (95.84) and the largest FMI (95.84), and its MCC (91.92) was much larger than those of the four state-of-the-art approaches. As shown in Figure 8, every classification measure obtained by our method is better than those of the four state-of-the-art approaches.

Conclusion
With the continuous development of computing, computer technology is being applied more and more widely. In recent years, it has been increasingly applied in medicine and has led to many innovations. In this paper, a method for the automatic diagnosis of alcoholism based on a 6-layer customized deep convolution neural network was proposed. Compared with four state-of-the-art approaches, the results obtained by the proposed neural network structure are more accurate. Therefore, according to its experimental results, the proposed neural network structure can be used as one of the methods for diagnosing alcoholism. Although this paper obtained good results, some shortcomings remain to be solved in the future. (1) There are only 216 images in this paper. For deep convolution learning, the training set and test set are too small. (2) We did not compare the performance of networks with different numbers of convolution layers and fully connected layers, so we did not determine the optimal number of convolution and fully connected layers.
In future work, we will first collect more data. Second, we will run more experiments to find the optimal number of convolution and fully connected layers. Finally, we will test more new network technologies.

Conflict of interest
The authors declare that they have no conflicts of interest to disclose.