Cough Audio Recognition for Early Detection of Respiratory Diseases: Algorithm Development and Validation Study


Introduction

Coughing is one of the common clinical symptoms, often serving as a prominent symptom in various diseases such as chronic obstructive pulmonary disease (COPD), lung cancer, COVID-19, and pneumonia [,]. Accurately discerning the association between cough and these related diseases is of paramount importance for clinical diagnosis and treatment. However, conventional methods for cough diagnosis frequently rely on the expertise and observation of medical professionals, posing issues related to subjectivity and uncertainty [,].

With the continuous advancement of machine learning technology, the ability to use computer-driven analysis and determination of cough audio data has become feasible, offering a novel avenue for the diagnosis of cough-related diseases [-].

Most of the current cough detection and classification methods use standard speech recognition algorithms to extract feature parameters from cough audio []. The most common methods in this regard include short-time energy, amplitude, average zero-crossing rate in the time domain, and linear prediction cepstral coefficients (LPCCs) [], perceptual linear prediction (PLP) [], and Mel-frequency cepstral coefficients (MFCCs) in the frequency domain [-]. These traditional methods provide effective means for extracting features from cough sounds and have played a significant role in early cough classification research. With the advancement of machine learning technology, significant progress has been made in cough detection and classification. People can now use artificial neural networks to recognize and classify cough sounds, achieving a higher level of accuracy. Deep learning models, such as a convolutional neural network, residual neural network, long short-term memory network, and so on, have demonstrated excellent performance in cough sound classification tasks. Many studies provide novel and effective methods that demonstrate high performance and practical value in clinical practice.

The focus of Liyue et al’s [] research is on classifying pneumonia and asthma based on children’s cough sounds. Initially, cough sound segments from patients with asthma and pneumonia undergo preprocessing steps, such as pre-emphasizing and framing. Subsequently, a feature extraction process is used, which includes the extraction of 24-dimensional MFCCs and short-term energy mixed feature parameters. Classification is carried out using support vector machines [,], achieving a sensitivity of 96.3% and a specificity of 93.6%, demonstrating high feasibility and effectiveness. However, the limited sample size used in their experiments may restrict the generalizability of their model to broader datasets and the reliability of practical applications.

Huang et al [] conducted a critically important study in the field of cough detection and classification. Their research primarily focused on using spectrogram-based methods in conjunction with a parallel 1D deep convolutional neural network (CNN) to extract new features for the classification of dry coughs and wet coughs. They performed feature analysis using linear predictive coding coefficients, Mel spectrograms, MFCCs, and other methods. They extracted the first and second derivatives of the original spectrogram and created a single feature vector. This feature vector was combined with a parallel 1D deep convolutional network. The validation on the dataset indicated that the performance of the parallel flow network significantly surpassed that of the single flow network, achieving an F1-score of 98.61%. This result underscores the effectiveness of their approach. However, this method used various manually designed feature extraction techniques, which require manual selection and design, and may not fully capture all complex variations in cough sounds.

In their research on cough detection and classification systems, Bales et al [] primarily focused on using CNNs for feature extraction from cough sounds. They used a low-complexity network structure and had a limited dataset. Nevertheless, their developed cough detection and classification system successfully identified three respiratory diseases: bronchitis, bronchiolitis, and whooping cough. The success of this system serves as evidence of the significant potential of artificial neural networks in the task of cough sound detection and classification.

In their study on detecting patients with COVID-19, Bansal et al [] proposed an audio classifier based on a CNN. They used an open dataset that had been manually labeled into COVID and non-COVID classes. Extracted MFCC features were used as input and fed into the CNN for training and classification. Through testing and validation of the data, the MFCC-based method achieved an accuracy of 70.6% and a sensitivity of 81%, demonstrating its effectiveness in identifying patients with COVID-19. However, in clinical practice, the complexity and computational costs of their method may limit its application in mobile health settings.

This study aimed to explore a machine learning–based method for cough audio classification and recognition, enabling more precise automated assessment of the relationship between coughing and associated diseases [,]. In the context of early disease diagnosis and treatment, this approach has the potential to provide medical professionals with a more accurate and efficient diagnostic tool, facilitating early disease detection and intervention, thereby significantly improving patient treatment outcomes and quality of life [].

The main contributions of this study are as follows:

Proposing the CAM-ResNet18 neural network model, which enhances the model’s ability to focus on key features by introducing the CAM mechanism in the last convolutional layer of each residual block.

Converting cough audio into spectrograms, visualizing the spectral features of the audio signals to allow the deep learning model to understand the audio data more intuitively.

Constructing a unique large-scale audio dataset by collecting cough audio data from professional hospitals, covering patients with COPD, lung cancer, pneumonia, COVID-19, and healthy participants, providing a solid foundation for model training and evaluation.

Demonstrating that the proposed model outperforms traditional models in terms of accuracy and average F1-score on the collected dataset, validating the effectiveness of the model.

In this research, we selected the ResNet18 residual neural network as the foundational CNN [,]. This network has demonstrated outstanding performance in image classification tasks and has found widespread applications in the field of computer vision []. To enhance the model’s ability to focus on key features within cough audio, we introduced a channel attention mechanism (CAM). Attention mechanisms have been extensively studied in visual and speech processing tasks and have shown significant performance improvements. They enable the model to concentrate its attention on critical features, leading to more accurate classification [,]. To meet the input requirements of the machine learning model, we converted the collected cough audio recordings into spectrograms for representation. Spectrograms offer notable advantages in extracting spectral information from audio signals, aiding machine learning models in comprehending audio data more comprehensively []. These generated spectrograms served as input data for the model, leveraging the ResNet18 architecture and incorporating the attention mechanism to enhance the classification and recognition performance of cough audio []. This series of processing steps contributes to improving the model’s accuracy and robustness, enabling it to effectively handle the analysis and classification tasks associated with cough audio data. During the research experimental phase, we actively collected a large number of cough audio samples. These samples were obtained from various groups, including patients with different respiratory diseases, such as COPD, lung cancer, pneumonia, and COVID-19, as well as samples from healthy participants, totaling 5 categories. Subsequently, rigorous data preprocessing and feature extraction were performed to ensure data quality and usability. 
On the constructed dataset, a series of experiments was conducted to assess the effectiveness of the proposed method, with comprehensive comparisons made against traditional classification approaches. The experimental results unequivocally demonstrate a significant advantage of our model in analyzing the association between cough audio and related diseases. This pivotal discovery holds potential clinical application value, presenting a valuable breakthrough in the field of disease diagnosis [,].


Methods

Ethical Considerations

The investigation received approval from the ethics committee of Zhejiang Provincial Centre for Disease Control and Prevention (2020‐024). Informed consent was obtained from all participants and their legal guardians. Research involving human participants was conducted in accordance with the Declaration of Helsinki. All cough audio data involved in this study have been anonymized and de-identified, with all personal identifiers such as name, gender, age, and medical record number removed. No information that can directly or indirectly identify participants is included, and the study strictly complies with relevant privacy protection regulations and data security requirements for medical research. The research protocol and data collection procedures are in accordance with ethical review requirements, and the privacy and data confidentiality rights of participants are fully protected. As this study is basic algorithmic research without clinical intervention, no financial or material compensation was provided to participants.

Cough Audio Sample Collection

The collected cough audio data primarily come from several top-tier hospitals in Zhejiang and Anhui provinces. With the explicit consent of the patients, professional physicians were responsible for collecting cough audio data from 5 different categories, including healthy participants, and participants with COPD, lung cancer, pneumonia, and COVID-19. The data collection process took place in a quiet environment, with physicians using standardized recording equipment directed at the source of the patient’s sound to ensure recording quality was not affected by environmental noise. Patients aged between 10 and 90 years were selected from different occupational groups, socioeconomic statuses, and genders. The selection criteria included patients with clear diagnoses and significant cough symptoms, whereas the exclusion criteria included patients unwilling to provide audio data or unable to produce clear cough recordings. The participant population included 724 women and 736 men. Each collected cough audio file was clipped into 3-second audio segments, and each file was labeled with corresponding information, such as the patient’s disease diagnosis, age, and gender. Ultimately, we constructed an audio dataset consisting of 2610 audio samples. To support effective model training and evaluation, this study used 5-fold cross-validation for all experiments. This method reduces inconsistencies that may arise from data splitting biases, further enhancing the reliability of the experimental results. Table 1 illustrates the distribution of these 5 types of cough audio in the dataset.

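Table 1. The distribution of 5 types of cough audio in the training and testing datasets.

| Characteristics | COPD (training, n=420) | Lung cancer (training, n=435) | Pneumonia (training, n=413) | COVID-19 (training, n=398) | Healthy (training, n=425) |
| --- | --- | --- | --- | --- | --- |
| Gender | | | | | |
| Male | 317 | 236 | 251 | 252 | 254 |
| Female | 208 | 309 | 264 | 244 | 275 |
| Age (y) | | | | | |
| 10-14 | 0 | 0 | 23 | 43 | 19 |
| 15-24 | 22 | 4 | 23 | 18 | 31 |
| 25-34 | 38 | 41 | 60 | 36 | 83 |
| 35-44 | 65 | 66 | 57 | 65 | 41 |
| 45-54 | 100 | 110 | 81 | 83 | 121 |
| 55-64 | 119 | 124 | 95 | 91 | 102 |
| 65-74 | 106 | 99 | 86 | 79 | 88 |
| 75-90 | 75 | 101 | 91 | 83 | 41 |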

aCOPD: chronic obstructive pulmonary disease.

Cough Features

The general time-domain waveform of cough audio is depicted in Figure 1. When conducting a detailed analysis of cough audio, we typically perform analyses in both the time and frequency domains [,]. However, time-domain analysis does not reveal how the sound is distributed across frequencies, and frequency-domain analysis does not reveal how the sound changes over time. Both analyses are also highly sensitive to noise and signal changes, which can strongly affect the results. To provide a more intuitive display of the acoustic characteristics of cough audio, we have developed a spectrogram-based representation that combines data points from the spectrogram and time-domain waveform. This approach is used to present rich information related to speech characteristics and facilitates a more comprehensive understanding of the acoustic features of cough audio. The spectrogram representation of cough audio is illustrated in Figure 2.

Figure 1. The time-domain waveform of cough audio.

Figure 2. The spectrogram of cough audio.

Feature Extraction From Cough Audio

Before inputting raw cough audio into the neural network model for the classification of cough-related diseases, a crucial preprocessing step is necessary. This preprocessing procedure consists of 2 fundamental steps: data preprocessing and the generation of spectrograms for the cough audio.

Preprocessing

First, preprocessing operations are executed on the raw cough audio, including end point detection and normalization. All cough audio is converted into the WAV format. Background noise segments within the cough audio are manually eliminated through human intervention, and subsequently, all cough audio is segmented into multiple audio segments, each with an approximate duration of 3 seconds. As most of the collected cough audio samples are between 3 and 6 seconds long, selecting a length of 3 seconds helps maintain the relative stability of the audio signal within a short time frame. For audio segments longer than 3 seconds, they are clipped to 3 seconds. For segments shorter than 3 seconds, zero-padding is used to extend them to 3 seconds.
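The clipping and zero-padding rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the use of NumPy, and peak normalization as the normalization step are all assumptions, since the text does not specify the sampling rate or the normalization method.

```python
import numpy as np

def fix_length(signal, sample_rate, duration_s=3.0):
    """Clip or zero-pad a 1-D audio signal to exactly duration_s seconds."""
    target = int(round(sample_rate * duration_s))
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) >= target:
        return signal[:target]                          # clip longer segments
    return np.pad(signal, (0, target - len(signal)))    # zero-pad shorter ones

def peak_normalize(signal, eps=1e-12):
    """Scale so the largest absolute sample is about 1 (assumed normalization)."""
    signal = np.asarray(signal, dtype=np.float64)
    return signal / (np.max(np.abs(signal)) + eps)
```

For example, at a hypothetical 16 kHz sampling rate, `fix_length(x, 16000)` always returns 48,000 samples regardless of the input length.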

Spectrogram

Overview

A spectrogram is a graphical representation used to depict the variation of cough spectra over time. Its horizontal axis represents time, whereas the vertical axis represents frequency. It combines the characteristics of a frequency spectrum and a time-domain waveform, enabling a clear depiction of how the cough spectrum changes over time. Through spectrograms, phonemic attributes of cough can be better observed, and the identification of cough sounds is improved by observing resonance peaks and transitions. The process of obtaining a spectrogram is illustrated in Figure 3.

Figure 3. Process flowchart for spectrogram generation.

Each section’s specific workflow is as follows.

Section 1

Pre-emphasis in audio signals is performed to enhance high-frequency components in cough, reducing the influence of lip radiation and improving the high-frequency resolution of the cough signal. Typically, pre-emphasis is achieved using a first-order finite impulse response high-pass digital filter, with a pre-emphasis coefficient typically set to 0.97. Its transfer function is as follows:

$$H(z) = 1 - \mu z^{-1}, \quad 0.9 \le \mu \le 1 \tag{1}$$

Section 2

Speech signals exhibit short-term stability. After pre-emphasis, these signals are segmented and windowed to divide the original signal into several small blocks, with each block referred to as a frame. Each frame has a length of 25 milliseconds with a frame shift of 10 milliseconds. This is because speech signals can be considered stationary over short periods, and a frame length of 25 milliseconds balances time-domain and frequency-domain characteristics effectively. Once frame segmentation is completed, a window function is applied to each frame to achieve better sidelobe attenuation. Subsequently, a discrete Fourier transform is applied to each frame, computed as a 512-point fast Fourier transform. Choosing 512 points ensures spectral resolution while also balancing computational efficiency, transforming the time-domain signal into the frequency-domain signal to obtain the necessary spectral information. The discrete Fourier transform formula used is as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-\frac{2\pi i}{N} k n}, \quad k = 0, \dots, N-1 \tag{2}$$

where x(n) represents the speech signal, N is the window size, w(n) is the Hamming window function, and n denotes the time-domain sampling point.

Section 3

The process involves stacking the spectral graphs obtained after the discrete Fourier transform for each frame. Subsequently, the amplitudes are mapped to a grayscale representation, where darker colors correspond to higher amplitudes. Finally, by concatenating multiple frames of spectra, a spectrogram is generated.
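The three sections above can be sketched as a short NumPy routine. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the sampling rate is left to the caller because the text does not give one, and the decibel (log) scaling before the grayscale mapping is an assumption.

```python
import numpy as np

def spectrogram(signal, sample_rate, mu=0.97, frame_ms=25, shift_ms=10, n_fft=512):
    """Return a log-magnitude spectrogram with shape (n_fft // 2 + 1, n_frames)."""
    signal = np.asarray(signal, dtype=np.float64)

    # Section 1: pre-emphasis, y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu z^-1
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Section 2: framing (25 ms frames, 10 ms shift) and Hamming windowing
    frame_len = int(round(sample_rate * frame_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    if frame_len > n_fft:
        raise ValueError("frame length exceeds n_fft; lower the sample rate or raise n_fft")
    if len(emphasized) < frame_len:
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    n_frames = 1 + (len(emphasized) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # 512-point FFT per frame; rfft zero-pads each frame up to n_fft samples
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

    # Section 3: stack frames over time; log scaling here is an assumption
    return (20.0 * np.log10(magnitude + 1e-10)).T

def to_grayscale(spec_db):
    """Map amplitudes to 0-255 so that higher amplitudes are darker."""
    norm = (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-12)
    return np.uint8(np.round(255 * (1.0 - norm)))
```

With a hypothetical 16 kHz recording, each frame holds 400 samples, so the 512-point transform zero-pads every frame and yields 257 frequency bins.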

CAM Component

CAM is an attention mechanism used in image processing. It enhances the expressive power of images by automatically learning the importance of each channel. This helps networks focus on crucial features, suppress unimportant ones, and improve image classification performance and robustness. Figure 4 illustrates the conceptual flowchart of the CAM [].

The input to CAM is a feature map with dimensions H×W×C, where H represents the height of the feature map, W denotes the width of the feature map, and C is the number of channels. Its fundamental concept is as follows.

Figure 4. The conceptual flowchart of the channel attention mechanism. MLP: multilayer perceptron.

First, the input feature map is subjected to both global max-pooling and global average-pooling operations along the spatial dimensions. These operations are performed to compress the spatial dimensions, making it easier to extract the most significant regions in the feature map. The input feature map is processed for global max-pooling and average-pooling based on the following formulas:

$$\mathrm{GlobalMaxPooling}(x) = \max_{i=1}^{h} \max_{j=1}^{w} x_{i,j} \tag{3}$$

$$\mathrm{GlobalAvgPooling}(x) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_{i,j} \tag{4}$$

where x is the input feature map, and h and w represent the height and width of the input feature map, respectively.

Second, the results obtained from global max-pooling and average-pooling are fed into a multilayer perceptron (MLP) for feature learning, which involves learning features in the channel dimension and the importance of individual channels. Subsequently, the output from the MLP undergoes addition, followed by applying the Sigmoid activation function to obtain attention weight values for each channel in the input feature map. The final channel attention weights are obtained based on the following formula:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big) \tag{5}$$

In this context, Mc(F) represents the channel attention weights, where F denotes the input feature map, and c signifies the number of channels. W0 and W1 represent the parameters learned in the fully connected layers. MLP consists of 2 fully connected layers [], and σ represents the Sigmoid activation function.

In the CAM module, the ‘shared MLP’ consists of 2 fully connected layers. The first layer has 256 neurons, the second layer has 128 neurons, and the final output is the size of the number of channels, which is used to generate the attention weights for each channel.

Finally, the obtained attention weights are multiplied by the initially input feature map to obtain the feature map weighted by channel attention. This is achieved using the following formula:

$$Z = M_c \times [F(x) + x] \tag{6}$$

where Z represents the feature map after attention weighting, Mc represents the channel attention weights, x denotes the input, and F(x) signifies the output after passing through convolutional layers and activation functions.
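Equations 3 to 6 can be traced with a forward-pass-only NumPy sketch. It is an illustration under assumptions, not the trained module: the function names are hypothetical, biases are omitted, the ReLU between the two shared layers is an assumption (Equation 5 does not show an activation), and the hidden width is a free parameter here. The `feature_map` argument stands for the residual output F(x) + x in Equation 6.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feature_map, w0, w1):
    """Apply channel attention to an (H, W, C) feature map.

    w0 has shape (C, hidden) and w1 has shape (hidden, C); both are the
    weights of the shared MLP. Returns (channel weights, weighted map).
    """
    max_pooled = feature_map.max(axis=(0, 1))    # Eq. 3: one value per channel
    avg_pooled = feature_map.mean(axis=(0, 1))   # Eq. 4: one value per channel

    def shared_mlp(v):
        # Same weights for both pooled vectors; ReLU is an assumed activation
        return np.maximum(v @ w0, 0.0) @ w1

    weights = sigmoid(shared_mlp(avg_pooled) + shared_mlp(max_pooled))  # Eq. 5
    return weights, feature_map * weights        # Eq. 6, broadcast over H and W
```

Because the weights have one entry per channel, the multiplication rescales whole channels and leaves the spatial layout of each channel unchanged.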

The Network Architecture of the CAM-ResNet18 Model

In this study, the backbone network chosen for the model is the ResNet18 residual neural network. We enhance the neural network model, referred to as CAM-ResNet18, by introducing the CAM. The CAM is added to the last convolutional block in each residual block of the ResNet18. This choice is made because the last convolutional block is the final convolutional operation within each residual block, and as the structure of each residual block in ResNet18 is identical, incorporating the CAM here allows for targeted improvement in the feature learning capability of each residual block. Importantly, it does not affect the feature extraction in other residual blocks, thereby enhancing the model’s ability to distinguish between different types of cough sounds. Figure 5 depicts the neural network architecture of CAM-ResNet18.

Figure 5. The neural network architecture diagram of CAM-ResNet18. CAM: channel attention mechanism; FC: fully connected.

The CAM-ResNet18 model network structure consists of the following components:

Input layer: It receives transformed spectrograms as input, which capture the spectral information of cough sounds.

Initial convolutional layer and max pooling layer: Initially, the input undergoes a convolutional layer with a kernel size of 7×7, a stride of 2, and padding of 3, resulting in an output of 64 channels. This is followed by a max-pooling operation with a 3×3 kernel size, a stride of 2, and padding of 1, aimed at reducing the feature map’s dimensions.

Residual blocks: Each residual block consists of two 3×3 convolutional layers and a skip connection, designed to address the issues of gradient vanishing and exploding gradients in deep CNNs.

CAM: CAM is inserted after the last convolutional layer in each residual block, enhancing the learning capabilities of each residual block selectively and improving the model’s focus on critical features. This is accomplished through the following steps:

Global max-pooling and average-pooling: these operations are applied to the feature maps from the final convolutional block. They compress spatial dimensions and extract the most salient regions in the feature maps.

Fully connected layer: the results obtained after global max-pooling and global average-pooling are fed into an MLP consisting of 2 fully connected layers for feature learning. Subsequently, the Sigmoid activation function is applied to obtain weight values for each channel.

Channel weighting: the generated channel weights are applied to the original feature map by multiplying each channel’s features by the corresponding weight, thus achieving channel weighting.

Global average pooling and fully connected layer: feature maps undergo average pooling to reduce dimensionality and map the features from the final layer to output categories.

Output layer: feature maps undergo average pooling to reduce dimensionality and map the features from the final layer to 5 output categories, corresponding to the 5 classes.

Model Training Process

Our model training was conducted with the following environment configuration: Intel Core i5-13600K CPU, 32GB of RAM, CUDA 11.3, the deep learning framework PyTorch 1.12.1, and dated March 3, 2022. We used spectrograms of cough audio from the training dataset as input, which were then fed into the CAM-ResNet18 neural network model for training. The hyperparameter settings used during training are presented in Table 2.

Table 2. Hyperparameter settings.

| Hyperparameter | Value |
| --- | --- |
| Loss function | Cross-entropy loss function |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Training rounds | 50 |
| Batch size | 8 |
| Weight initialization | He initialization |
| Learning rate scheduler | StepLR |
| Step size | 7 |
| Attenuation coefficient | 0.1 |
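To make the scheduler settings concrete, the sketch below computes the learning rate per epoch under step decay, where the rate is multiplied by the attenuation coefficient once every step-size epochs. It is a plain-Python illustration with a hypothetical function name, assuming 0-based epochs and that the schedule is stepped once per epoch; it does not call the training framework itself.

```python
def step_lr(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate at a 0-based epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Table 2 values: initial rate 0.001, 50 training rounds, step size 7, factor 0.1
schedule = [step_lr(0.001, epoch) for epoch in range(50)]
```

Under this reading, epochs 0 to 6 use 0.001, epochs 7 to 13 use 0.0001, and each further block of 7 epochs is 10 times smaller again.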

The cross-entropy loss function is used to quantify the disparity between the model’s predicted probability distribution and the actual probability distribution [,]. In this experiment, we focus on 5 categories, namely, healthy participants and patients with COPD, lung cancer, pneumonia, and COVID-19. We use one-hot encoding for mathematical representation, where each category is represented as a 5D vector with only 1 dimension set to 1, indicating the presence of that category, while all other dimensions are set to 0.

In the experiment, suppose we input a spectrogram of a lung cancer cough audio whose true label is represented by a one-hot encoding vector with 5 dimensions, [0, 1, 0, 0, 0]. This signifies that only the dimension corresponding to lung cancer has a value of 1, whereas the values on the dimensions corresponding to the other 4 categories are all zero. The model’s output is also a vector with the same dimensions as the true label, where each dimension’s value represents the model’s predicted probability for healthy participants and patients with COPD, lung cancer, pneumonia, and COVID-19. These probability values sum to 1.

To calculate the loss value, we compare the model’s predicted probabilities with the true labels using the following cross-entropy formula:

$$H(p, q) = -\sum_{i=1}^{n} p(x_i) \log\big(q(x_i)\big) \tag{7}$$

where H represents the loss value, p(x) represents the true label value, and q(x) represents the predicted probability.

During the training process, under the constraint of the cross-entropy loss function, the neural network continuously updates its parameters to increase the probability of the model predicting correctly. For instance, if the input spectrogram corresponds to lung cancer, its true label is [0, 1, 0, 0, 0]; the model outputs a predicted probability vector, and the cross-entropy loss is computed from the two. To make the model’s prediction as close as possible to the true result, it is only necessary to minimize the value of the cross-entropy loss function. Therefore, during training, the model will continuously update its parameters to make the output prediction values on the second dimension, corresponding to the true pathological lung cancer, close to 1, whereas the other 4 dimensions approach 0. This enhances the model’s recognition accuracy, achieving the goal of using deep learning algorithms for cough sound recognition.
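A small worked example may help show how Equation 7 behaves for the label [0, 1, 0, 0, 0]. The two prediction vectors below are hypothetical values chosen for illustration, not results from the study, and the natural logarithm is assumed.

```python
import math

def cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """Equation 7 with natural log; eps guards against log(0)."""
    return -sum(p * math.log(q + eps) for p, q in zip(true_one_hot, predicted_probs))

label = [0, 1, 0, 0, 0]                       # lung cancer, as in the text
confident = [0.05, 0.80, 0.05, 0.05, 0.05]    # hypothetical prediction
uncertain = [0.30, 0.30, 0.20, 0.10, 0.10]    # hypothetical prediction

loss_confident = cross_entropy(label, confident)   # about 0.223, i.e. -ln(0.8)
loss_uncertain = cross_entropy(label, uncertain)   # about 1.204, i.e. -ln(0.3)
```

With a one-hot label, only the predicted probability of the true class contributes, so the loss falls toward 0 as that probability approaches 1.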

After the model training is complete, we save the best training weights as the pretrained neural network model. Then, we input the spectrograms from the test set into the pretrained neural network model. In the final layer of the model, the fully connected layer outputs a feature vector, which is then mapped to a set of probability values between 0 and 1 using the Softmax function. These probability values sum up to 1, representing a probability distribution for each possible class. Finally, we select the class with the highest probability as the ultimate prediction.
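The final prediction step described above can be sketched as a Softmax followed by picking the most probable class. The class order in `CLASSES` is an assumption made for illustration (it places lung cancer in the second position, matching the one-hot example), and the input scores are hypothetical.

```python
import math

# Assumed class order; the source does not state the index of each class
CLASSES = ["COPD", "lung cancer", "pneumonia", "COVID-19", "healthy"]

def softmax(scores):
    """Map raw output scores to probabilities that sum to 1."""
    largest = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, classes=CLASSES):
    """Return the most probable class name and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs

name, probs = predict([0.1, 2.0, -1.0, 0.3, 0.0])   # hypothetical output scores
```

Here the second score is the largest, so `name` is "lung cancer" under the assumed class order.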


Results

Evaluation Metrics

In this experiment, we used a set of evaluation metrics to assess the performance of the model in the task of cough-related disease classification. These metrics include accuracy, training loss, recall, precision, SD, and F1-score. We verify the performance of the proposed model by demonstrating a 95% CI for the proposed method [].

In these evaluations, we used the following terms: true negative (TN), false negative (FN), false positive (FP), and true positive (TP), which represent the following meanings.

For instance, in the case of the COPD category:

TP: the number of instances correctly classified as COPD, meaning the prediction is COPD, and the ground truth is also COPD.

FP: the number of instances incorrectly classified as COPD, meaning the prediction is COPD, but the ground truth is non-COPD.

TN: the number of instances correctly classified as non-COPD, meaning the prediction is non-COPD, and the ground truth is also non-COPD.

FN: the number of instances incorrectly classified as non-COPD, meaning the prediction is non-COPD, but the ground truth is COPD.

Accuracy: the proportion of correctly classified samples among all cough samples.

Precision: the proportion of true positive predictions for the COPD category among all predictions made for the COPD category.

Recall: the proportion of true positive predictions for the COPD category among all actual COPD samples.

SD: the SD can be used to measure the range of variation in prediction results. A high SD may indicate that the model is highly sensitive to changes in the input data, leading to greater uncertainty in predictions.

CI: the CI typically refers to estimating the uncertainty range of a model’s performance metric by using the evaluation results from each fold in the cross-validation process.

F1-score: the harmonic mean of precision and recall, resulting in the F1-score.

Below are the formulas for calculating these evaluation metrics:

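$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{11}$$

Model Comparison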

To further evaluate the effectiveness of the model, a performance comparison between the proposed CAM-ResNet18 model and the ResNet18 model was conducted on the test dataset. The objective was to observe their precision, recall, F1-score, SD, positive predictive value, and accuracy. Cross-validation was used to calculate the precision, recall, F1-score, and their 95% CIs for these 2 models. Specifically, the dataset was divided into 5 folds, and each fold was used for training and validation. The performance metrics for each fold were calculated, and by aggregating the results of all folds, the precision, recall, F1-score, and their 95% CIs were obtained.
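Formulas 9 to 11 and the fold aggregation described above can be sketched in plain Python. The function names are hypothetical, and the interval uses a normal approximation (mean ± 1.96 × SD / √n) as an assumption, because the text does not state how the 95% CIs were derived from the 5 folds.

```python
import math

def per_class_metrics(confusion, k):
    """Precision, recall, and F1 for class k.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                           # true k, predicted other
    fp = sum(confusion[i][k] for i in range(n)) - tp      # predicted k, true other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def mean_sd_ci95(fold_scores):
    """Mean, sample SD, and an assumed normal-approximation 95% CI over folds."""
    n = len(fold_scores)
    mean = sum(fold_scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in fold_scores) / (n - 1))
    half_width = 1.96 * sd / math.sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)
```

A typical use would be to compute `per_class_metrics` on each fold's confusion matrix and pass the 5 resulting scores for one class to `mean_sd_ci95`.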

Additionally, we compared the performance of different acoustic features, including spectrogram, MFCC, LPCC, and PLP, when used as input to the CAM-ResNet18 model. This comparison was conducted to identify which feature set yielded the best classification results. We then compared the performance of the CAM-ResNet18 model with common traditional models, such as VGG16, LeNet, and ResNet34. Furthermore, we evaluated the robustness of the model under noise interference by introducing different levels of noise during testing. The results demonstrated how the model’s performance in terms of precision, recall, and F1-score changed under various noise conditions.
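One way to introduce noise at a chosen signal-to-noise ratio during testing is sketched below. The text does not specify the noise type or how it was mixed in, so additive white Gaussian noise and the function name are assumptions made for illustration.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(signal.shape)
    noise *= np.sqrt(noise_power / np.mean(noise ** 2))   # rescale to the exact target power
    return signal + noise
```

Calling this with 30, 20, and 10 dB on each test recording before spectrogram generation would produce the three noisy conditions, with lower values meaning stronger noise.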

Finally, we displayed the confusion matrix of the classification results using the CAM-ResNet18 model. These performance metrics are presented in Tables 3-8, as well as Figures 6-8.

Table 3. Comparison of different acoustic characteristics of the CAM-ResNet18 model.

| Acoustic feature | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| Spectrogram | 84.9 | 87.4 | 82.52 |
| MFCC | 82.1 | 85.0 | 80.5 |
| LPCC | 80.7 | 84.3 | 79.2 |
| PLP | 81.3 | 83.8 | 79.8 |

aMFCC: Mel-frequency cepstral coefficient.

bLPCC: LPC cepstral coefficient.

cPLP: perceptual linear prediction.

Table 4. Classification performance of the ResNet18 network model.

| Performance metric and category | Results (%) | PPV |
| --- | --- | --- |
| Precision | | |
| COPD | 80.95 | 73.77-88.13 |
| Lung cancer | 71.42 | 63.36-79.48 |
| Pneumonia | 75 | 67.37-82.63 |
| COVID-19 | 78.94 | 71.69-86.19 |
| Healthy | 82.35 | 75.70-88.99 |
| Recall | | |
| COPD | 85 | 78.64-91.36 |
| Lung cancer | 83.33 | 76.60-90.06 |
| Pneumonia | 70.58 | 62.68-78.48 |
| COVID-19 | 78.94 | 71.69-86.19 |
| Healthy | 77.77 | 70.37-85.17 |
| F1-score | | |
| COPD | 82.91 | 75.97-89.85 |
| Lung cancer | 76.91 | 75.95-89.85 |
| Pneumonia | 72.72 | 69.01-84.81 |
| COVID-19 | 78.93 | 71.68-86.18 |
| Healthy | 79.99 | 72.65-87.33 |
| Standard deviation | | |
| COPD | 1.846 | N/A |
| Lung cancer | 2.511 | N/A |
| Pneumonia | 2.308 | N/A |
| COVID-19 | 1.829 | N/A |
| Healthy | 2.03 | N/A |

aPPV: positive predictive value

bCOPD: chronic obstructive pulmonary disease.

cN/A: not applicable.

Table 5. Classification performance of the CAM-ResNet18 network model.

| Performance metric and category | Results (%) | PPV |
| --- | --- | --- |
| Precision | | |
| COPD | 90.00 | 78.28-93.14 |
| Lung cancer | 80.87 | 69.86-87.28 |
| Pneumonia | 82.22 | 73.49-89.01 |
| COVID-19 | 86.14 | 76.69-91.73 |
| Healthy | 85.71 | 81.86-94.60 |
| Recall | | |
| COPD | 94.29 | 90.52-98.93 |
| Lung cancer | 84.55 | 78.58-90.64 |
| Pneumonia | 92.36 | 89.59-94.85 |
| COVID-19 | 87.00 | 82.22-95.54 |
| Healthy | 79.25 | 71.62-86.25 |
| F1-score | | |
| COPD | 89.99 | 84.73-95.25 |
| Lung cancer | 74.59 | 74.51-88.43 |
| Pneumonia | 76.74 | 68.78-84.14 |
| COVID-19 | 86.54 | 80.72-92.24 |
| Healthy | 82.38 | 76.71-89.93 |
| Standard deviation | | |
| COPD | 1.633 | N/A |
| Lung cancer | 2.233 | N/A |
| Pneumonia | 2.066 | N/A |
| COVID-19 | 1.6 | N/A |
| Healthy | 1.76 | N/A |

aPPV: positive predictive value.

bCOPD: chronic obstructive pulmonary disease.

cN/A: not applicable.

Table 6. Comparison of performance across different models.

| Model | Accuracy | Specificity | Sensitivity |
| --- | --- | --- | --- |
| VGG16 | 78.6 | 82.3 | 80.5 |
| LeNet | 79.6 | 82.9 | 79.1 |
| ResNet34 | 82.9 | 84.3 | 81.4 |
| CAM-ResNet18 | 83.9 | 86.4 | 87.4 |

Table 7. Classification overall accuracy and average F1-score of different models.

| Network structure | Overall accuracy | Average F1-score |
| --- | --- | --- |
| Resnet-18 | 78.16 | 78.29 |
| CAM-Resnet18 | 83.90 | 82.52 |

Table 8. Performance table of CAM-ResNet18 model under different noises.

| SNR | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| Noiseless | 84.9 | 87.4 | 82.52 |
| 30 dB | 83.2 | 86.1 | 82.1 |
| 20 dB | 81.5 | 84.6 | 81.0 |
| 10 dB | 80.6 | 83.5 | 79.7 |

aSNR: signal-to-noise ratio.

Figure 6. CAM-ResNet18 confusion matrix. COPD: chronic obstructive pulmonary disease; LC: lung cancer.

Figure 7. Test the accuracy of CAM-ResNet18 and ResNet models.

Figure 8. Training loss of CAM-ResNet18.
