The Development and External Testing of Deep Learning Model for Automated Classification of Intervertebral Disc Degeneration on CT Images

This study was approved by the Institutional Review Board (No. K2023-053–01) and compliant with the Health Insurance Portability and Accountability Act. A waiver of consent was granted because of the retrospective nature of the study. The design and reporting of the study followed the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [22].

Patient Selection

The study retrospectively screened patients from Center A between January 2018 and July 2024 as internal dataset, and from Center B between January 2018 and July 2024 as external testing dataset. The inclusion criteria included: (1) patients aged ≥ 18 years; (2) diagnosed with intervertebral disc herniation or bulge on CT. The exclusion criteria included: (1) patients with internal fixation or prior intervertebral disc surgery, (2) severe scoliosis and spondylolisthesis, (3) poor image quality (severe artifacts or incomplete spinal coverage), (4) non-degenerative diseases (e.g., tumors, trauma, infections, congenital malformations).

Image Acquisition and Reconstruction

The patients underwent CT scanning using various scanners in center A (LightSpeed VCT, GE Healthcare; Brilliance iCT, Philips; uCT760, United Imaging; Aquilion ONE, Canon) and center B (IQon Spectral CT, Philips). The scans were performed using standard protocols for the cervical and lumbar spine with tube voltage of 120‒140kVp, tube current of 20‒500 mAs, conventional slice thickness of 3‒5 mm, thin slice thickness of 0.5‒1.5 mm, and slice spacing of 1.0 mm. The field of view was set at 15‒25 cm, with a matrix resolution of 512 × 512. The coronal and sagittal images were reconstructed with a slice thickness of 3.0 mm. Detailed parameters were provided in Table S1.

Reference StandardAnnotation of Intervertebral Disc

The region of interest for each intervertebral disc was delineated at the pixel level using thin-slice 1 mm multiplanar reconstruction imaging. Relevant information, including segment, type, zone, and grade of IDD, was systematically annotated. The annotation process included three stages: (1) Initial manual annotation; (2) Optimized annotation: based on established criteria for classifying IDD, initial segmentation was automatically performed utilizing pre-trained models, followed by manual refinement; (3) Quality control: A radiologist with 3 years of musculoskeletal imaging experience performed independent annotations, and two radiologists with 5 years of experience reviewed randomly assigned cases for blinded assessment and corrections. When discrepancies were found in the annotation results, the three doctors reached a consensus after thorough discussion. These manual annotations were regarded as the ground truth.

Vertebrae Key Points

Intervertebral disc degeneration type, zone, and grading depended on accurate localization and identification of key anatomical landmarks. The key anatomical landmarks were selected based on vertebral structure, the classification system for intervertebral disc pathology established by the North American Spine Society (NASS), and the spinal stenosis grade method [23,24,25,26]. Figure 1 illustrated the location and description of all identified landmarks.

Fig. 1Fig. 1

Key anatomical points of the vertebra. (a) Top view, (b) Rotated side view, (c) Side view. Landmark descriptions: 1 superior margin of the spinous process, 2 inferior margin of the spinous process, 3 midpoint of the posterior vertebral foramen wall, 4‒6 superior anterior, left, and right endpoints, 7‒9 inferior anterior, left, and right endpoints (partially obscured in the image), 10‒11 midpoints of the superior margins of the left and right pedicles, 12‒13 lateral and medial endpoints of the superior margin of the left pedicle, 14‒15 lateral and medial endpoints of the superior margin of the right pedicle, 16‒18 trisection points along the vertical midline of the posterior vertebral body edge, 19‒20 outermost endpoints of the left and right transverse processes, 21‒22 endpoints of the medial edge of the inferior articular process

(Mark landmarks based on the original image from Jānis Šavlovskis & Kristaps Raits's work “Anatomy Standard—Typical Lumbar Vertebra” available via https://www.anatomystandard.com/ossa-et-juncturae/columna-vertebralis/vertebra-lumbalis-typica.html. accessed on 4 Dec 2024. The original material is licensed under a Creative Commons Attribution 4.0 International License. For more details, see https://creativecommons.org/licenses/by-nc/4.0/.)

Classification of Intervertebral Disc Degeneration

Herniation and bulging were the primary forms of IDD, suggesting that the disc had undergone structural changes exceeding its normal boundaries. According to the NASS classification system [23], a normal intervertebral disc was characterized as a complete oval structure with a circumference of 360°. If the degeneration affected more than 25% of the disc’s circumference, corresponding to an angular extent greater than 90°, it was classified as a bulge. Otherwise, it was a herniation. By utilizing the key points data on the lower edge of the vertebral body adjacent to the degenerative disc lesion, the range and angle of the degenerative changes were calculated to determine the type of disc degeneration.

According to NASS and Michigan State University (MSU) classification system [23, 24], intervertebral disc herniation zones were classified into seven distinct regions relative to the posterior vertebral body pedicle position. Central canal zone (CCZ): the area between the left and right central quadrants, formed by dividing the region medial to the bilateral pedicle margins into four equal parts. Left and right subarticular zone (SAZ): the left and right lateral quadrants within this same quadrisected area. Left and right foraminal zone (FZ): the area between the medial and lateral endpoints of the pedicle. Left and right extraforaminal zone (EFZ): the area beyond the lateral endpoints of the pedicle.

Grade was utilized to quantify the degree of intervertebral disc herniation into the spinal canal. The lumbar spine was classified using the MSU grading system, while the cervical spine was assessed based on the anteroposterior diameter of the spinal canal [24,25,26]. Grade 0 indicated a normal condition. As spinal stenosis severity increased, it was categorized into grade 1, 2, and 3. Figure 2 showed the calculation procedure for the type, zone and grading of IDD.

Fig. 2Fig. 2

Calculation procedure for intervertebral disc degeneration type, zone, and grade. (a, e) Cervical and lumbar herniation. (b, f) Cervical and lumbar bulge. Vertebrae key points 7‒9 defined center point c. The herniated disc endpoints were connected through c to calculate angle θ. If θ > 90°, the disc was classified as bulge type. (c, g) Cervical and lumbar disc degeneration zones. Points 12‒15 defined herniation zones relative to the posterior vertebral pedicle. (d) For cervical disc herniation, distance from point 3 to point 18 was divided into three equal parts, defining grades 1‒3. (h) For thoracic and lumbar herniation, the midpoint between points 21 and 22 defined a horizontal line parallel to the coronal plane, serving as the level 2‒3 boundary. The vertical distance from point 18 to this line was bisected to define grades 1 and 2

DL Model Development

As shown in Fig. 3, image data was normalized and the herniated or bulged intervertebral disc region identified using a segmentation model (see Fig. 4 for the network architecture). Combined with key points from the adjacent vertebrae, clinical diagnostic indicators for IDD, including type, zone, and grade, were calculated to provide precise and comprehensive information.

Fig. 3Fig. 3

Roadmap for intervertebral disc degeneration segmentation and classification

Fig. 4Fig. 4

3D U-Net architecture for intervertebral disc degeneration segmentation

(Adapted Image in “Vertebrae landmarks” marked based on the original image from Jānis Šavlovskis & Kristaps Raits's work “Anatomy Standard—Typical Lumbar Vertebra”).

Vertebrae Key Points Detection Model

A case of whole spine CT imaging was selected, covering all cervical, thoracic, lumbar, and sacrococcygeal vertebrae. 22 key points of each vertebra were manually annotated by a radiologist observing through 3D multi-angle visualization. Leveraging the key point data annotated by the radiologist, an EfficientNet-B4 network architecture [27] was employed to construct a regression model for predicting the coordinate values of the 22 key points for each complete vertebra. All key points utilized in this study were derived from the vertebral key points detection model. Clinical indicators including intervertebral disc type, zone, and grade were subsequently quantified.

Intervertebral Disc Degeneration Segmentation Model

All CT data were preprocessed with intensity normalization, region cropping, and resampling. Segmentation performance was evaluated using the Dice score. The 3D U-Net model [28], built on the CNN architecture, employed deep learning techniques to automatically extract features and integrated skip connections to combine high-level semantic information with low-level spatial details. It comprised two main components: an Encoder that progressively reduced input dimensionality while increasing feature channels for abstract representations, and a Decoder that restored spatial resolution via upsampling, generating a segmentation map matching the input image dimensions. Skip connections enabled direct transfer of low-level features from the Encoder to the Decoder, preserving local structural information and accurately delineating boundaries of organs or lesions in medical images [7, 29,30,31]. To address small dataset challenges, the model incorporated Batch Normalization layers to accelerate training and prevent overfitting [32, 33]. The input to the model was a 1 × 512 × 512 × 32 voxel patch, and the output was a probability mask of the same dimensions. The model comprised 16,467,297 parameters.

In addition, two models, nnU-Net [34] and SwinUNETR [35], were trained and compared with 3D U-Net for segmentation performance. nnU-Net used the same architecture but with optimized training parameters. Five models were trained using fivefold cross-validation, and their predictions were combined via ensemble inference for final intervertebral disc segmentation. SwinUNETR followed a U-Net encoder-decoder structure. Its encoder used the Swin Transformer [36] to divide the input into non-overlapping 3D patches and alternated between regular and shifted windows across layers, enabling efficient feature exchange and global modeling. Patch Merging reduced spatial resolution and increased channels at each stage, extracting multi-scale features from local to global levels. The decoder mirrored the encoder and used a CNN-based upsampling module to restore feature maps.

Model PerformanceTraining and Validation

The dataset was split into 140 training and 36 test cases at a 4:1 ratio. fivefold cross-validation was used to find the best hyperparameters, including model input size, known as the patch size, learning rate, and data augmentation techniques. The dataset was divided into 5 subsets of 28 cases each. In each iteration, one subset was the validation set and the other four for training. After training, model performance was evaluated on the validation set. After five iterations, the average performance metrics determined the best hyperparameter configuration. This optimal configuration was applied with holdout cross-validation, splitting 140 training cases into 123 training and 17 validation cases at a 7:1 ratio.

During the training process, the original image data were normalized to a range of 0 to 1 [37]. Subsequently, multiple data augmentation techniques were employed to enhance the dataset. Specifically, 1:1 sampling was utilized to balance the ratio of negative and positive samples [38, 39], effectively mitigating challenges such as reduced segmentation performance, training instability, and convergence difficulties caused by foreground–background class imbalance. Additionally, random scaling, rotation, and flipping were applied, followed by cropping fixed-size patches centered at arbitrary points [37]. Gaussian smoothing [40], gamma-based contrast adjustment and gray-level shifting were performed to adjust the gray values [37], expanding the dataset size and increasing the diversity of training samples. Furthermore, Gaussian noise [37] was introduced to improve the model's robustness against interference. The AdamW optimizer was used for training, with a learning rate of 0.001 and weight decay of 0.00001 [41]. The DiceCE loss served as the loss function [42, 43]. A polynomial decay schedule with warmup adjusted the learning rate [44,45,46]. The Dice score was the evaluation metric for the validation set [47, 48]. The batch size was set to 1 during training.

Training iterations were set to 1200, with validation performed every three iterations. A sliding window of size 512 × 512 × 32 with 20% overlap traversed the CT dataset to extract voxel patches. A binary mask was generated where values exceeded 0.5 probability; otherwise, values were set to 0. The Dice score was computed based on the predicted binary mask and the ground truth provided by radiologists. If the average Dice score of the validation set surpassed previous highs, the current model parameters were saved as the best.

External Testing

Utilizing the optimal model derived from the training process, the Dice coefficient was computed on an internal test set comprising 36 cases, with subsequent analysis of its distribution and mean value. Furthermore, to ensure robust validation, an independent external evaluation was performed using a dataset of 97 cases obtained from a separate medical institution.

Statistical Analyses

Pandas was used for descriptive statistics, Scikit-learn for model evaluation, and Matplotlib and Seaborn for visualization. Scikit-learn computed classification metrics. The Dice coefficient (range: 0 to 1, 1 indicating perfect overlap) was calculated per patient and examination for intervertebral disc segmentation. Classification results (type, zone, grade) were determined using predefined anatomical landmarks. To diagnose disc herniation or bulge, lesion analysis was compared to the reference standard, and binary classification metrics were computed. Multilabel outputs were generated for disc zones and three-class outputs for grades. Accuracy, recall, and precision were calculated. Scipy was used for statistical analysis. The Student’s t-test evaluated differences in Dice scores between internal and external test sets. A paired t-test compared segmentation performance across models. For classification, the Wilson score interval estimated confidence intervals, and the Z-test for independent proportions assessed performance differences. A p-value less than 0.05 was considered a significant difference.

Comments (0)

No login
gif