Machine Learning–Based Multidimensional Oximetry for Obstructive Sleep Apnea Screening: Development and External Validation

Introduction

Obstructive sleep apnea (OSA) affects nearly one billion individuals globally [], and untreated OSA significantly increases the comorbidity burden and the risk of motor vehicle crashes []. Although polysomnography (PSG) remains the diagnostic gold standard, its high cost and operational complexity limit widespread accessibility [,]. Current screening tools, such as the STOP-BANG questionnaire or single physiological parameters, demonstrate limited diagnostic accuracy, with reported area under the receiver operating characteristic curve (AUC) values ranging from 0.55 to 0.83 [,]. Therefore, developing robust OSA screening tools using readily available physiological parameters remains imperative.

The pathophysiology of OSA is characterized by recurrent upper airway collapse, resulting in intermittent nocturnal hypoxia. Pulse oximetry-derived metrics, including the oxygen desaturation index (ODI), percentage of sleep time with SpO2 < 90% (ST90), and minimum oxygen saturation (MinSpO2), offer accessible alternatives to PSG [], yet each reflects only a single dimension of nocturnal desaturation, thereby limiting its clinical utility []. ODI quantifies the frequency of desaturation events and correlates with the PSG-derived apnea-hypopnea index (AHI), but does not capture hypoxic duration or desaturation depth []. ST90 reflects cumulative hypoxic exposure but cannot distinguish between distinct hypoxic patterns, such as single prolonged versus multiple brief desaturations []. MinSpO2 identifies the instantaneous nadir but does not characterize cumulative hypoxic exposure []. The novel integrated hypoxic burden (HB) metric, which combines desaturation depth, duration, and frequency, demonstrates superior predictive performance for OSA-related comorbidities compared with AHI and ODI [], though direct comparisons with conventional metrics within the same datasets remain scarce []. Entropy and frequency-domain analyses of SpO2 complexity can capture dynamic nocturnal fluctuations overlooked by traditional metrics [,]. However, most existing studies evaluate parameters in isolation or focus on linear associations, leaving multidimensional feature integration and nonlinear relationships among parameters largely unexplored [,,]. Therefore, multi-parameter models leveraging complementary oximetric indices may yield improved robustness for OSA screening [,,].

Machine learning (ML) holds great potential for OSA diagnosis. Although deep learning models (eg, OxiNet) enable high-precision AHI estimation [], their “black box” nature compromises clinical interpretability and raises clinical skepticism, thus limiting real-world utility [,]. Traditional ML algorithms show inconsistent performance across cohorts, with support vector machines (SVM) and random forests (RF) demonstrating variable performance [,]. Extreme gradient boosting (XGBoost) models for moderate-to-severe OSA exhibit limited accuracy (sensitivity 72.5%, specificity 62.8%) []. Least squares boosting for AHI estimation illustrates the benefits of ensemble approaches but does not resolve generalization issues between community and clinical cohorts []. Recent evidence suggests that categorical boosting (CatBoost) is superior for OSA classification, outperforming XGBoost, light gradient boosting machine (LightGBM), and RF in several studies [,], yet its application to oximetry-based OSA screening has not been evaluated.

Our study has three main aims: (1) to develop a parsimonious and robust OSA screening tool by evaluating multidimensional oximetric parameters using ML; (2) to validate model generalizability across community and clinical populations in an independent external cohort undergoing home sleep apnea test (HSAT); and (3) to assess performance heterogeneity across sex and age subgroups to inform personalized screening strategies.

Methods

Study Design and Population

We consecutively enrolled adults with suspected OSA who underwent in-laboratory PSG at the Sleep Center of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, between June 2022 and July 2024. During the same period, we additionally recruited community-based participants who underwent HSAT. Inclusion criteria were age ≥18 years, prominent snoring, and provision of informed consent. Exclusion criteria were: (1) chronic diseases that may contribute to hypoxemia, such as heart failure, chronic obstructive pulmonary disease, or chronic kidney disease; (2) chronic use of medications affecting sleep, including sedative-hypnotics, anxiolytics, antidepressants, and antipsychotics; (3) other concurrent sleep disorders, such as upper airway resistance syndrome, restless legs syndrome, or hypersomnia; (4) prior treatment for OSA; and (5) incomplete data. The participant flowchart is shown in Figure 1.


Figure 1. Flowchart of the overall study. CatBoost: categorical boosting; LightGBM: light gradient boosting machine; LR: logistic regression; OSA: obstructive sleep apnea; PSG: polysomnography; RF: random forest; SMOTE: synthetic minority over-sampling technique; SVM: support vector machine; XGBoost: extreme gradient boosting.

Ethical Considerations

The study protocol complied with the Declaration of Helsinki and was approved by the Ruijin Hospital Ethics Committee (approval number: 2018-107). Because the study was retrospective and all data were fully deidentified, the ethics committee waived the requirement for informed consent. All data were securely stored in accordance with institutional research data management standards, and no compensation was provided to participants.

PSG and HSAT

Participants abstained from sedatives, alcohol, and caffeinated beverages for at least 24 hours prior to the study. In-laboratory PSG was performed using the Alice 6 system (Philips Respironics, Murrysville, PA, USA) with standard monitoring including electroencephalography, submental electromyography, bilateral electrooculography, electrocardiography, pulse oximetry, oronasal airflow via thermistor and pressure transducer, thoracoabdominal effort, snoring, and body position. HSAT used the Alice NightOne device (Philips Respironics, Murrysville, PA, USA) to record nasal airflow, respiratory effort, and fingertip SpO2. Recordings with more than 4 hours of analyzable data following manual review were considered valid. Two certified sleep specialists independently scored PSG and HSAT data according to the AASM scoring manual []: apnea was defined as ≥ 90% airflow reduction for ≥ 10 seconds, and hypopnea as ≥ 30% airflow reduction for ≥ 10 seconds accompanied by ≥ 4% SpO2 desaturation. AHI was calculated as the total number of apneas and hypopneas per hour of sleep, and OSA was defined as AHI ≥ 5 events/hour.

Definition and Calculation of Pulse Oximetry Parameters

Summary of Signal Processing

During PSG and HSAT, SpO2 signals were collected at a sampling rate of 500 Hz and down-sampled to 1 Hz for computational efficiency. Eight parameters were extracted to quantify different aspects of nocturnal hypoxemia:

Mean SpO2 (MeanSpO2) and Minimum SpO2 (MinSpO2)

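The average and the lowest SpO2 values during sleep:

$$ \text{MeanSpO2} = \frac{1}{N}\sum_{i=1}^{N} \text{SpO2}_i, \qquad \text{MinSpO2} = \min_{1 \le i \le N} \text{SpO2}_i $$

where SpO2_i is the i-th sample and N is the number of samples recorded during sleep.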

These reflect overall oxygenation status and the most severe desaturation [].

Oxygen Desaturation Index (ODI)

Number of desaturation events (≥ 4% drop from baseline) per hour of sleep:

$$ \text{ODI} = \frac{N_{\text{desat}}}{\text{TST}} $$

where N_desat is the number of desaturation events and TST is the total sleep time (hours). ODI is a key indicator of the frequency of respiratory disturbances and serves as a surrogate for the AHI [].

T90 and ST90

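Total time and percentage of sleep spent with SpO2 below 90%:

$$ \text{T90} = \sum_{i=1}^{N} \mathbf{1}\left[\text{SpO2}_i < 90\%\right] \Delta t, \qquad \text{ST90} = \frac{\text{T90}}{\text{TST}} \times 100\% $$

where Δt is the sampling interval and TST is the total sleep time.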

These quantify cumulative exposure to clinically significant hypoxemia [].
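As a minimal sketch of the definitions above, the following Python function computes MeanSpO2, MinSpO2, T90, and ST90 from a down-sampled SpO2 series. It assumes that every sample belongs to scored sleep time (so the series length stands in for TST) and that artifacts have already been removed. The function name `basic_oximetry_metrics` and its parameters are illustrative and are not taken from the study's code.

```python
def basic_oximetry_metrics(spo2, fs_hz=1.0, threshold=90.0):
    """Return MeanSpO2, MinSpO2, T90 (minutes), and ST90 (%) for one recording.

    spo2      : list of SpO2 values in percent, assumed to cover sleep time only
    fs_hz     : sampling rate after down-sampling (1 Hz in the text)
    threshold : desaturation threshold in percent (90% for T90/ST90)
    """
    if not spo2:
        raise ValueError("empty SpO2 series")
    n = len(spo2)
    samples_below = sum(1 for v in spo2 if v < threshold)
    return {
        "MeanSpO2": sum(spo2) / n,
        "MinSpO2": min(spo2),
        # Each sample lasts 1/fs_hz seconds; convert seconds to minutes.
        "T90_min": samples_below / fs_hz / 60.0,
        # Share of the (assumed) sleep period spent below the threshold.
        "ST90_pct": 100.0 * samples_below / n,
    }


# Example: 9 minutes at 96% followed by 1 minute at 88%.
example = basic_oximetry_metrics([96.0] * 540 + [88.0] * 60)
```

In this toy series, one of ten minutes lies below 90%, so T90 is 1 minute and ST90 is 10%.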

Hypoxic Burden (HB)

The normalized total area under the SpO2 desaturation curve associated with respiratory events:

$$ \text{HB} = \frac{\sum_{i=1}^{n} \text{AUC}_i}{\text{TRT}} $$

where AUC_i is the area of the i-th desaturation event identified by the “Trapping Rain Water” algorithm (Figure 2), n is the number of events, and TRT is the total recording time. HB integrates frequency, depth, and duration of desaturations, representing total oxygen debt [].


Figure 2. Hypoxia burden calculation using the rainwater collection algorithm applied to pulse oximetry signals.

Attention Entropy (AttnEn)

AttnEn is a complexity measure of the SpO2 signal waveform variability []:

$$ \text{AttnEn} = -\sum_{i} P_i \log P_i $$

where P_i is the probability of the i-th interval length in the distribution of intervals between adjacent local extrema. Higher entropy reflects fragmented, unstable desaturation patterns typical of severe OSA.
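The idea of an entropy over intervals between local extrema can be sketched in a few lines of Python. This is a simplified illustration, not the study's implementation: it pools all extrema into a single interval distribution, uses strict comparisons (so flat plateaus, which are common in integer-valued SpO2, are not counted as extrema), and uses a base-2 logarithm. All three choices are assumptions, and the published attention entropy formulation may combine several interval types. The names `local_extrema` and `interval_entropy` are hypothetical.

```python
import math
from collections import Counter


def local_extrema(signal):
    """Indices of strict local maxima and minima (interior samples only)."""
    idx = []
    for i in range(1, len(signal) - 1):
        left, mid, right = signal[i - 1], signal[i], signal[i + 1]
        if (mid > left and mid > right) or (mid < left and mid < right):
            idx.append(i)
    return idx


def interval_entropy(signal):
    """Shannon entropy (bits) of the intervals between adjacent extrema."""
    ext = local_extrema(signal)
    intervals = [b - a for a, b in zip(ext, ext[1:])]
    if not intervals:
        return 0.0
    total = len(intervals)
    # P_i is the relative frequency of each distinct interval length.
    probs = [count / total for count in Counter(intervals).values()]
    return -sum(p * math.log2(p) for p in probs)


regular = interval_entropy([0, 1, 0, 1, 0, 1, 0])          # one interval length
irregular = interval_entropy([0, 1, 0, 1, 2, 1, 0, 1, 0])  # two interval lengths
```

A perfectly regular oscillation has a single interval length and therefore zero entropy, whereas irregular spacing between peaks and troughs raises the value.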

Total Spectral Power (TotalPower)

Integrated Lomb–Scargle periodogram power within the ultradian band (0.014-0.035 Hz), corresponding to respiratory cycles of 30-70 seconds.

Elevated power in this band indicates the repetitive oscillatory desaturation dynamics characteristic of OSA [,].
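To illustrate how band-limited spectral power can be obtained, the sketch below evaluates the classical Lomb-Scargle periodogram on a frequency grid spanning 0.014-0.035 Hz and integrates it with the trapezoidal rule. It is a pure-Python illustration under stated assumptions: the grid size, the mean-centering step, and the linear (non-dB) output are choices made here, whereas the study reports TotalPower in dB and does not specify its normalization. The function names are hypothetical.

```python
import math


def lomb_scargle_power(t, y, freq_hz):
    """Classical unnormalized Lomb-Scargle power at a single frequency."""
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]  # remove the mean before projecting
    w = 2.0 * math.pi * freq_hz
    # The time offset tau makes the sine and cosine terms orthogonal.
    s2 = sum(math.sin(2.0 * w * ti) for ti in t)
    c2 = sum(math.cos(2.0 * w * ti) for ti in t)
    tau = math.atan2(s2, c2) / (2.0 * w)
    cos_t = [math.cos(w * (ti - tau)) for ti in t]
    sin_t = [math.sin(w * (ti - tau)) for ti in t]
    y_cos = sum(a * b for a, b in zip(yc, cos_t))
    y_sin = sum(a * b for a, b in zip(yc, sin_t))
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)
    return 0.5 * (y_cos ** 2 / cc + y_sin ** 2 / ss)


def band_power(t, y, f_lo=0.014, f_hi=0.035, n_freqs=22):
    """Trapezoidal integral of the periodogram over [f_lo, f_hi] (linear units)."""
    step = (f_hi - f_lo) / (n_freqs - 1)
    freqs = [f_lo + k * step for k in range(n_freqs)]
    p = [lomb_scargle_power(t, y, f) for f in freqs]
    return sum(0.5 * (p[k] + p[k + 1]) * step for k in range(n_freqs - 1))


t = list(range(600))  # 10 minutes sampled at 1 Hz
slow = [95 + 2 * math.sin(2 * math.pi * 0.025 * ti) for ti in t]  # 40 s cycle
fast = [95 + 2 * math.sin(2 * math.pi * 0.2 * ti) for ti in t]    # 5 s cycle
```

A 40-second oscillation falls inside the band and yields far more band power than a 5-second oscillation of the same amplitude, which is the contrast the metric is meant to capture.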

The rainwater collection algorithm used to calculate HB proceeds as follows:
(1) Event Identification: The pulse oximetry (SpO2) signal is analyzed to detect all local minima (valleys), thereby identifying the nadir (lowest saturation) of each desaturation event.
(2) Window Initialization: From each nadir, a bidirectional search is performed to delineate the event window (Winstart and Winfinish). Boundaries are established at the nearest peaks that recover to ≥75% of the preceding peak-to-nadir amplitude.
(3) Boundary Refinement: The search window is further adjusted based on the mean event duration to ensure temporal consistency.
(4) Baseline Determination: The baseline for each event is defined as the maximum SpO2 value within the 100-second window preceding the event onset.
(5) Area Integration: The area under the curve (AUC) for each event is computed by integrating the deficit between the baseline and the SpO2 signal within the defined window.
(6) HB Calculation: All individual AUCs are summed to obtain the total desaturation area, which is then divided by the total recording time to derive the HB.
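Steps 4 to 6 of the procedure above can be sketched as follows. This is a simplified illustration rather than the study's implementation: it takes the event windows from steps 1 to 3 as given input, integrates with a simple rectangle rule, and clips negative deficits to zero. The function name `hypoxic_burden` and the `(start, end)` window format are assumptions introduced here.

```python
def hypoxic_burden(spo2, event_windows, fs_hz=1.0, baseline_window_s=100):
    """Hypoxic burden in %·min/h from pre-identified desaturation windows.

    spo2          : list of SpO2 values (%) over the whole recording
    event_windows : list of (start, end) sample indices, end exclusive
                    (hypothetical output of steps 1-3)
    """
    dt_min = 1.0 / fs_hz / 60.0                 # duration of one sample in minutes
    lookback = int(baseline_window_s * fs_hz)   # samples in the baseline window
    total_area = 0.0
    for start, end in event_windows:
        # Step 4: baseline is the maximum SpO2 in the 100 s before event onset.
        pre_event = spo2[max(0, start - lookback):start] or spo2[start:start + 1]
        baseline = max(pre_event)
        # Step 5: integrate the deficit below baseline within the window.
        deficit = sum(max(baseline - v, 0.0) for v in spo2[start:end])
        total_area += deficit * dt_min
    # Step 6: normalize the summed area by total recording time in hours.
    recording_hours = len(spo2) / fs_hz / 3600.0
    return total_area / recording_hours


# One-hour recording with a single 60 s event that sits 4% below baseline.
signal = [96.0] * 1800 + [92.0] * 60 + [96.0] * 1740
hb = hypoxic_burden(signal, [(1800, 1860)])
```

In this toy case the event contributes 4% × 1 min of area over a 1-hour recording, so HB is approximately 4 %·min/h.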

Establishment and Validation of ML Models

Data Preprocessing

To mitigate bias from varying feature magnitudes, the data were first standardized via Z-score normalization, transforming each feature to a mean of 0 and SD of 1 using:

$$ z = \frac{x - \mu}{\sigma} $$

where μ and σ represent the mean and SD of the feature. This step ensures stable distance‑based computations and gradient optimization. Subsequently, class imbalance was addressed using the synthetic minority over‑sampling technique (SMOTE) [,]. SMOTE synthesizes minority‑class samples by interpolating between an instance xi and a randomly chosen neighbor x̂i from its k-nearest neighbors.
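The two preprocessing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the study's code: the standardization uses the population SD (the text does not say which estimator was used), the neighbor search uses Euclidean distance, and the interpolation follows the usual SMOTE form x_new = x_i + λ(x̂_i − x_i) with λ drawn uniformly from [0, 1). The function names are hypothetical.

```python
import math
import random
import statistics


def zscore(values):
    """Standardize one feature to mean 0 and SD 1 (population SD assumed)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def smote_sample(minority, k, rng):
    """Create one synthetic minority sample by interpolating toward a neighbor."""
    i = rng.randrange(len(minority))
    x_i = minority[i]
    # Rank the remaining minority points by Euclidean distance to x_i.
    others = [p for j, p in enumerate(minority) if j != i]
    others.sort(key=lambda p: math.dist(x_i, p))
    neighbor = rng.choice(others[:k])   # random pick among the k nearest
    lam = rng.random()                  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, neighbor)]


rng = random.Random(42)  # fixed seed for reproducibility
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
synthetic = smote_sample(minority, k=2, rng=rng)
```

Because every synthetic point lies on a segment between two real minority samples, it stays within the range of the observed minority data.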

Algorithm Introduction

This study evaluates multiple ML models, grouped into three categories: (1) linear and kernel-based models, (2) ensemble learning methods, and (3) gradient boosting decision trees, to balance interpretability with predictive performance. For linear and kernel-based models, logistic regression (LR) is a foundational model for clinical binary classification. It extends linear regression by applying the sigmoid function to map linear outputs to a probability between 0 and 1:

$$ P = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)}} $$

where P is the predicted probability, β0 is the bias, βi are the coefficients, and xi are the input features. Its transparency and low computational cost make it a standard benchmark in medical research [,].

SVM constructs an optimal separating hyperplane by maximizing the margin between classes. Its decision boundary is:

$$ w \cdot x + b = 0 $$

where w is the normal vector, x is the input feature vector, and b is the bias. The model is trained by minimizing $\frac{1}{2}\lVert w \rVert^2$ subject to the constraint $y_i (w \cdot x_i + b) \ge 1$, ensuring correct classification with a margin of at least one. SVMs excel in high-dimensional spaces and can capture nonlinear patterns through kernel functions, making them a widely adopted method [].

For ensemble learning methods, RF is a bagging ensemble method that reduces overfitting by aggregating predictions from multiple decision trees. Each tree is trained on a bootstrap sample of the data and a random subset of features. The final prediction is obtained through majority voting:

$$ \hat{y} = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_T(x)\} $$

where ŷ is the final predicted result, ht(x) denotes the prediction of the t-th tree, and T is the total number of trees. By aggregating across trees, RF improves stability and accuracy, making it effective for high-dimensional data and widely used in practice [].

Gradient boosting methods iteratively combine weak learners, typically decision trees, to minimize a regularized objective function:

$$ \mathcal{L}(\theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{j} \Omega(f_j) $$

where l(yi, ŷi) is the loss function, Ω(fj) penalizes the complexity of the j-th tree, and θ denotes the model parameters.

Three prominent variants, including XGBoost, LightGBM, and CatBoost, share this framework but differ in optimization and implementation: XGBoost uses second-order gradient approximation and explicit regularization, offering high precision and efficiency, especially with structured or sparse data [,]. LightGBM uses a leaf-wise growth strategy with gradient-based sampling and feature bundling, enabling faster training on large-scale datasets []. CatBoost is optimized for categorical features, using ordered target statistics and symmetric trees to prevent prediction shift and effectively handle high-dimensional categorical variables [].

Modeling Process

The modeling pipeline followed a 2-stage design: internal development with cross-validation followed by independent external validation (Figure 1). In the internal phase, a cohort of 2195 subjects was preprocessed and evaluated using 5-fold cross-validation. To prevent data leakage, SMOTE was applied exclusively to the training folds, with validation sets retaining the original class distribution. Six ML algorithms were trained under fixed random seeds to ensure reproducibility; their hyperparameters are detailed in Table 1. Model selection was based on the average performance across validation folds. The best-performing model was subsequently retrained on the full internal dataset (n=2195) without SMOTE to preserve the original data distribution. The selected model’s generalization ability was then assessed on an independent external cohort (n=446). Performance on this external set reflects the model’s robustness for real-world OSA screening.
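The leakage-avoidance rule described above (oversample the training folds only) can be sketched as a small cross-validation loop. This is a schematic illustration, not the study's pipeline: `fit`, `score`, and `oversample` are placeholder callables supplied by the caller, and the fold assignment is a simple round-robin stratification chosen here for brevity.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal class members out round-robin
    return folds


def cross_validate(X, y, k, fit, score, oversample, seed=0):
    """Mean validation score with oversampling restricted to training folds."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        # Oversampling sees training rows only, never the held-out fold.
        X_tr, y_tr = oversample([X[i] for i in train], [y[i] for i in train])
        model = fit(X_tr, y_tr)
        # The validation fold keeps its original class distribution.
        X_va = [X[i] for i in held_out]
        y_va = [y[i] for i in held_out]
        scores.append(score(model, X_va, y_va))
    return sum(scores) / len(scores)
```

Passing a SMOTE routine as `oversample` and an F1 function as `score` would mirror the design described in the text; applying the oversampler before the split instead would let synthetic points derived from validation samples leak into training.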

Table 1. Hyperparameters of the 6 machine learning models for obstructive sleep apnea screening.

| Model | Hyperparameters |
| --- | --- |
| SVM^a | 'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf' |
| RF^b | 'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 100 |
| LR^c | 'C': 1.0, 'penalty': 'l2', 'tol': 1e-4 |
| XGBoost^d | 'learning_rate': 0.3, 'reg_lambda': 1, 'n_estimators': 100, 'booster': 'gbtree' |
| LightGBM^e | 'learning_rate': 0.1, 'n_estimators': 100, 'boosting_type': 'gbdt' |
| CatBoost^f | 'learning_rate': 0.03, 'n_estimators': 100, 'loss_function': 'Logloss', 'l2_leaf_reg': 3 |

aSVM: support vector machine.

bRF: random forest.

cLR: logistic regression.

dXGBoost: extreme gradient boosting.

eLightGBM: light gradient boosting machine.

fCatBoost: categorical boosting.

Model Evaluation Metrics

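Predictive performance was quantified using accuracy, sensitivity, specificity, F1-score, AUC, positive predictive value (PPV), and negative predictive value (NPV). The specific formulas are as follows:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP} $$

$$ \text{PPV} = \frac{TP}{TP + FP}, \qquad \text{NPV} = \frac{TN}{TN + FN} $$

$$ \text{F1-score} = \frac{2 \times \text{PPV} \times \text{Sensitivity}}{\text{PPV} + \text{Sensitivity}} $$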

Where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. Given the class imbalance in the clinical cohort, the F1-score was selected as the primary evaluation metric because it balances PPV and sensitivity (recall). In imbalanced clinical settings, AUC may overestimate performance by reflecting overall discriminability while masking poor sensitivity to the minority class. Unlike the threshold-independent AUC, the F1-score directly captures misclassification costs for minority samples, thereby ensuring robust diagnostic accuracy across classes. AUC is reported as a complementary measure of overall discriminative ability [].
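The threshold-dependent metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. The example counts (TP=348, FN=22, FP=2, TN=74) are not reported in the article; they are an assumption, back-calculated here from the external cohort sizes (370 OSA, 76 non-OSA) and the sensitivity and specificity reported for the single-parameter ODI model in Table 5. The function name `screening_metrics` is illustrative.

```python
def screening_metrics(tp, fn, fp, tn):
    """Threshold-dependent screening metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall for the OSA class
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Counts inferred (assumed) from the ODI row of the external validation table.
metrics = screening_metrics(tp=348, fn=22, fp=2, tn=74)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

With these inferred counts, the rounded values agree with the ODI row of Table 5 (accuracy 0.9462, sensitivity 0.9405, specificity 0.9737, PPV 0.9943, NPV 0.7708, F1-score 0.9667). The example also shows how a high F1-score can coexist with a modest NPV when non-OSA participants are the minority.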

Statistical Analysis

All analyses were conducted using Python (version 3.11; Python Software Foundation). Continuous variables are presented as median and IQR, and categorical variables as frequency and percentage. The Anderson–Darling test was used to assess normality. Group differences were evaluated with the Kruskal–Wallis H test, followed by Dunn’s post-hoc test (significance threshold P<.05). To examine linear and nonlinear associations between continuous predictors and the binary outcome, restricted cubic spline (RCS) regression was fitted within an LR framework. Likelihood-ratio tests compared RCS models against linear specifications, and spline curves were used to visualize dose-response relationships. To further interpret model predictions, Shapley additive explanations (SHAP) were used to quantify the contribution of each feature. Finally, stratified analyses by sex and age were conducted for these oximetry parameters.

Results

Characteristics of Study Participants

Among 4156 screened participants, 2641 were included in the final analysis: 2195 underwent PSG and comprised the internal development cohort, and 446 underwent HSAT and formed the external validation cohort (Figure 1). The internal cohort consisted of 953 non-OSA and 1242 OSA participants. Compared with the non-OSA group, the OSA group was significantly older, had a higher proportion of males, experienced more frequent hypoxic episodes, and had longer hypoxic durations. The external cohort comprised 76 non-OSA and 370 OSA participants. These OSA patients displayed higher AHI, ODI, ST90, T90, and HB values alongside lower oxygen saturation, yet they were younger than non-OSA participants, with no significant between-group difference in sex distribution. Demographic and clinical characteristics are summarized in Table 2. Violin plots (Figure 3) revealed a higher median age (60.0 vs 45.0 years) and more severe nocturnal hypoxemia in the external validation cohort, underscoring distinct disease severity and physiological profiles between the 2 cohorts. These differences provide a robust foundation for validating the generalizability of the multi-parameter oximetry model across diverse clinical scenarios.

Table 2. Baseline characteristics of non-obstructive sleep apnea and obstructive sleep apnea patients in the internal development and external validation cohorts.

| Characteristics | Internal: All (n=2195) | Internal: Non-OSA^a (n=953) | Internal: OSA (n=1242) | Internal: P value | External: All (n=446) | External: Non-OSA (n=76) | External: OSA (n=370) | External: P value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age (years), median (IQR) | 45.00 (36.00-57.00) | 45.00 (35.00-57.00) | 46.00 (37.00-57.00) | .001 | 60.0 (45.00-69.00) | 63.00 (49.75-71.25) | 58.00 (44.00-68.00) | <.001 |
| Male, n (%) | 1651 (75.22) | 684 (71.77) | 967 (77.86) | <.001 | 351 (78.70) | 55 (72.37) | 296 (80.00) | .14 |
| AHI^b (events/h), median (IQR) | 8.20 (2.10-34.10) | 1.80 (0.90-3.00) | 29.15 (13.83-54.08) | <.001 | 19.40 (9.03-37.48) | 2.25 (0.90-3.58) | 24.80 (13.65-42.45) | <.001 |
| ODI^c (events/h), median (IQR) | 12.80 (3.80-37.90) | 3.30 (1.70-5.40) | 33.45 (16.70-58.55) | <.001 | 23.85 (10.40-45.28) | 3.40 (1.18-5.13) | 29.00 (17.35-49.53) | <.001 |
| MinSpO2^d (%), median (IQR) | 86.00 (78.00-90.00) | 90.00 (88.00-92.00) | 79.00 (71.00-85.00) | <.001 | 82.00 (75.00-86.00) | 89.00 (88.00-91.00) | 80.00 (72.00-85.00) | <.001 |
| MeanSpO2 (%), median (IQR) | 95.00 (93.00-96.00) | 96.00 (95.00-96.00) | 94.00 (92.00-95.00) | <.001 | 94.00 (92.00-95.00) | 95.00 (94.00-97.00) | 93.00 (92.00-95.00) | <.001 |
| ST90^e (%), median (IQR) | 0.44 (0.02-5.12) | 0.02 (0-0.11) | 3.38 (0.72-14.42) | <.001 | 2.91 (0.32-13.80) | 0.0 (0-0.20) | 4.51 (1.12-16.48) | <.001 |
| T90^f (minute), median (IQR) | 2.00 (0.10-23.70) | 0.10 (0.00-0.50) | 16.20 (3.5-67.10) | <.001 | 14.50 (1.53-65.52) | 0.00 (0.00-0.10) | 22.20 (5.55-78.9) | <.001 |
| HB^g (%·min/h), median (IQR) | 3.90 (0.90-16.40) | 0.70 (0.20-1.70) | 13.20 (5.50-36.80) | <.001 | 51.83 (20.22-112.84) | 6.29 (2.32-9.70) | 65.97 (35.07-138.08) | <.001 |
| AttnEn^h, median (IQR) | 2.19 (1.78-2.84) | 1.74 (1.54-1.97) | 2.70 (2.25-3.37) | <.001 | 5.87 (5.62-6.10) | 6.10 (5.99-6.25) | 5.82 (5.57-6.01) | <.001 |
| TotalPower^i (dB), median (IQR) | 38.17 (35.83-40.81) | 37.61 (35.62-40.64) | 38.52 (36.10-41.10) | <.001 | 46.95 (45.56-49.16) | 45.06 (44.81-45.40) | 47.33 (46.11-50.28) | <.001 |

aOSA: obstructive sleep apnea.

bAHI: apnea-hypopnea index.

cODI: oxygen desaturation index.

dMinSpO2: minimal SpO2.

eST90: percentage of sleep time with SpO2 < 90%.

fT90: total sleep time spent with SpO2 < 90%.

gHB: hypoxia burden.

hAttnEn: attention entropy.

iTotalPower: integrated power from power spectral density estimates in the 14-35 mHz frequency band.


Figure 3. Comparison of baseline characteristics between the internal development and external validation cohorts. Violin plots comparing baseline characteristics between the internal development cohort (blue) and external validation cohort (orange). Each plot depicts the kernel density estimate, with bold horizontal lines representing medians and thin lines indicating IQRs. AttnEn: attention entropy; HB: hypoxia burden; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; ST90: percentage of sleep time with SpO2 < 90%; T90: total sleep time spent with SpO2 < 90%; TotalPower: integrated power from power spectral density estimates in the 14-35 mHz frequency band.

Performance of Single-Parameter Oximetry Models

We evaluated the predictive performance of 8 OSA-related oximetry parameters using 6 ML algorithms. Given the class imbalance between OSA and non-OSA groups in the internal cohort, the F1-score was selected as the primary metric to balance precision and recall, with AUC used to assess overall discriminative performance []. Substantial heterogeneity was observed in model performance, with F1-scores ranging from 0.5332 to 0.9269 and AUC values from 0.5660 to 0.9808. Notably, ODI and HB exhibited the strongest discriminative ability. Figure 4A summarizes the top 4 single-parameter oximetry models ranked by F1-score. The SVM model achieved optimal performance for ODI (F1-score = 0.9269, AUC = 0.9712), and the LightGBM model performed best for HB (F1-score = 0.9043, AUC = 0.9590). By contrast, MeanSpO2 and TotalPower showed comparatively weaker discriminative capacity, with F1-scores of 0.7073 (LR model) and 0.6713 (CatBoost model), respectively.

Except for MinSpO2, which showed a linear association, all oximetry parameters exhibited nonlinear relationships with OSA risk (P<.001), which helps explain the heterogeneous predictive performance across indicators. Strong predictors, including ODI, HB, T90, and ST90, exhibited steep dose-response curves with pronounced threshold effects (Figure 5). For instance, ODI showed a rapid risk escalation at lower values followed by a plateau, thereby providing distinct decision boundaries that enhanced the model’s discriminative ability and optimized F1-scores. Conversely, weaker predictors exhibited contrasting profiles: MeanSpO2 showed shallow gradients within the clinically critical 88%-92% range, resulting in classification ambiguity, whereas TotalPower displayed marked variability with widened 95% CIs at higher values, indicating substantial noise that limited predictive utility (Figure 5). Given the complex nonlinear patterns of most key predictors, traditional linear models cannot capture these critical features.


Figure 4. Heatmap of F1-scores for multi-parameter oximetry models across 6 machine learning algorithms. (A) Single-parameter models; (B-D) combinations of 2, 3, and 4 parameters, respectively. The top 4 F1-scores are shown for each model configuration, with darker colors representing higher classification performance. CatBoost: categorical boosting; HB: hypoxia burden; LightGBM: light gradient boosting machine; LR: logistic regression; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; OSA: obstructive sleep apnea; RF: random forest; ST90: percentage of sleep time with SpO2 < 90%; SVM: support vector machine; T90: total sleep time spent with SpO2 < 90%; XGBoost: extreme gradient boosting.

Figure 5. RCS curves showing associations between oximetry parameters and OSA risk. The analysis was performed on the internal development cohort (n=2195). The solid red lines indicate RCS fits with 5 degrees of freedom, and the red shaded areas indicate the 95% CIs. The blue dashed lines represent the linear fit for comparison. The y-axis represents the predicted probability of OSA. The P values were derived from likelihood-ratio tests to evaluate nonlinearity (nonlinearity: P<.05, red boxes; linear: P≥.05, green boxes). The gray dots (top and bottom) represent individual data distributions for OSA-positive and OSA-negative participants, respectively. AttnEn: attention entropy; HB: hypoxia burden; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; OSA: obstructive sleep apnea; RCS: restricted cubic spline; ST90: percentage of sleep time with SpO2 < 90%; T90: total sleep time spent with SpO2 < 90%; TotalPower: integrated power from power spectral density estimates in the 14-35 mHz frequency band.

Predictive Performance of Multi-Parameter Oximetry Models

We constructed and evaluated multi-parameter oximetry models, including 28 dual-, 56 triple-, and 70 quadruple-parameter combinations, with top-performing models illustrated in Figure 4B-D. Among the dual-parameter models, the CatBoost-trained ODI-HB model achieved optimal performance (F1-score = 0.9472, AUC = 0.9865; Table 3, Figure 4B). The ODI-HB-MinSpO2 model performed best in the triple-parameter category (F1-score = 0.9496, AUC = 0.9869; Table 3, Figure 4C), whereas the quadruple-parameter ODI-HB-MinSpO2-ST90 model attained the highest overall discriminative ability (F1-score = 0.9516, AUC = 0.9879), significantly outperforming single-parameter oximetry models (Table 3, Figure 6). CatBoost demonstrated consistent superiority across all evaluation metrics (Table 3). Notably, adding 5 or more oximetry parameters yielded only marginal gains, underscoring the importance of selecting informative and complementary features rather than indiscriminately increasing input dimensionality.

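Table 3. Comparison of machine learning algorithms for obstructive sleep apnea screening using multi-parameter oximetry.

| Feature set | Model | AUC^a | F1-score | Accuracy | Sensitivity | Specificity | PPV^b | NPV^c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ODI-HB^d,e | CatBoost^f | 0.9865 | 0.9472 | 0.9408 | 0.9412 | 0.9402 | 0.9537 | 0.9253 |
| ODI-HB | LightGBM^g | 0.9280 | 0.9361 | 0.9280 | 0.9332 | 0.9213 | 0.9396 | 0.9143 |
| ODI-HB | XGBoost^h | 0.9812 | 0.9344 | 0.9262 | 0.9300 | 0.9213 | 0.9392 | 0.9104 |
| ODI-HB | RF^i | 0.9794 | 0.9360 | 0.9280 | 0.9316 | 0.9234 | 0.9409 | 0.9129 |
| ODI-HB | LR^j | 0.9809 | 0.9217 | 0.9134 | 0.8881 | 0.9496 | 0.9586 | 0.8655 |
| ODI-HB | SVM^k | 0.9774 | 0.9297 | 0.9226 | 0.9066 | 0.9434 | 0.9545 | 0.8863 |
| ODI-HB-MinSpO2^l | CatBoost | 0.9869 | 0.9496 | 0.9435 | 0.9420 | 0.9454 | 0.9575 | 0.9265 |
| ODI-HB-MinSpO2 | LightGBM | 0.9848 | 0.9432 | 0.9367 | 0.9332 | 0.9413 | 0.9540 | 0.9165 |
| ODI-HB-MinSpO2 | XGBoost | 0.9831 | 0.9427 | 0.9358 | 0.9372 | 0.9339 | 0.9487 | 0.9205 |
| ODI-HB-MinSpO2 | RF | 0.9803 | 0.9446 | 0.9380 | 0.9348 | 0.9423 | 0.9550 | 0.9179 |
| ODI-HB-MinSpO2 | LR | 0.9816 | 0.9248 | 0.9180 | 0.8921 | 0.9517 | 0.9604 | 0.8720 |
| ODI-HB-MinSpO2 | SVM | 0.9796 | 0.9267 | 0.9194 | 0.9018 | 0.9423 | 0.9536 | 0.8813 |
| ODI-HB-MinSpO2-ST90^m | CatBoost | 0.9879 | 0.9516 | 0.9458 | 0.9444 | 0.9475 | 0.9592 | 0.9296 |
| ODI-HB-MinSpO2-ST90 | LightGBM | 0.9862 | 0.9451 | 0.9385 | 0.9388 | 0.9381 | 0.9520 | 0.9227 |
| ODI-HB-MinSpO2-ST90 | XGBoost | 0.9842 | 0.9436 | 0.9367 | 0.9380 | 0.9349 | 0.9496 | 0.9212 |
| ODI-HB-MinSpO2-ST90 | RF | 0.9856 | 0.9512 | 0.9453 | 0.9444 | 0.9465 | 0.9585 | 0.9299 |
| ODI-HB-MinSpO2-ST90 | LR | 0.9811 | 0.9236 | 0.9162 | 0.8970 | 0.9412 | 0.9525 | 0.8760 |
| ODI-HB-MinSpO2-ST90 | SVM | 0.9815 | 0.9285 | 0.9212 | 0.9050 | 0.9423 | 0.9537 | 0.8844 |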

aAUC: area under the receiver operating characteristic curve.

bPPV: positive predictive value.

cNPV: negative predictive value.

dODI: oxygen desaturation index.

eHB: hypoxia burden.

fCatBoost: categorical boosting.

gLightGBM: light gradient boosting machine.

hXGBoost: extreme gradient boosting.

iRF: random forest.

jLR: logistic regression.

kSVM: support vector machine.

lMinSpO2: minimal SpO2.

mST90: percentage of sleep time with SpO2 < 90%.


Figure 6. ROC curves of the optimal 4-parameter oximetry model versus single-parameter oximetry models for OSA screening. AUC was used to quantify model discrimination, with values closer to 1 indicating better predictive ability. AttnEn: attention entropy; AUC: area under the receiver operating characteristic curve; HB: hypoxia burden; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; OSA: obstructive sleep apnea; ROC: receiver operating characteristic; ST90: percentage of sleep time with SpO2 < 90%; T90: total sleep time spent with SpO2 < 90%; TotalPower: integrated power from power spectral density estimates in the 14-35 mHz frequency band.

Stratified Analysis by Sex and Age

Subgroup analyses revealed significant performance variations across demographics. In the male subgroup, the optimal model (ODI-HB-MinSpO2-ST90) achieved an F1-score of 0.9460 and an AUC of 0.9853, with CatBoost outperforming other algorithms (Table 4, Figure 7A). In the female subgroup, the best-performing combination was ODI-HB-MinSpO2-MeanSpO2 (F1-score = 0.9543, AUC = 0.9919; Table 4, Figure 7B), suggesting sex-specific differences in OSA-related hypoxic patterns. In the age-stratified analysis, the older subgroup demonstrated superior overall performance (F1-score = 0.9398-0.9701, AUC = 0.9913-0.9933) with ODI-HB-MinSpO2-ST90 as the optimal model, whereas the younger subgroup exhibited stable but slightly lower performance (F1-score = 0.9163-0.9467, AUC = 0.9774-0.9863), favoring ODI-HB-MinSpO2-MeanSpO2 (Figure 7C-D). Across all subgroups, CatBoost maintained consistently superior classification performance (Table 4).

Table 4. Performance of optimal predictive models for obstructive sleep apnea screening across sex and age subgroups in the internal development cohort.

| Feature set | Subgroup^a | AUC^b | F1-score | Accuracy | Sensitivity | Specificity | PPV^c | NPV^d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ODI-HB-MinSpO2-ST90^e,f,g,h | Male | 0.9853 | 0.9460 | 0.9376 | 0.9338 | 0.9438 | 0.9587 | 0.9102 |
| ODI-HB-MinSpO2-MeanSpO2 | Female | 0.9919 | 0.9543 | 0.9541 | 0.9527 | 0.9554 | 0.9572 | 0.9532 |
| ODI-HB-MinSpO2-ST90 | Older (≥ 60 years) | 0.9942 | 0.9701 | 0.9657 | 0.9664 | 0.9647 | 0.9741 | 0.9552 |
| ODI-HB-MinSpO2-MeanSpO2 | Younger (< 60 years) | 0.9863 | 0.9467 | 0.9404 | 0.9384 | 0.9429 | 0.9552 | 0.9224 |

aAll subgroup models used CatBoost as the optimal classifier.

bAUC: area under the receiver operating characteristic curve.

cPPV: positive predictive value.

dNPV: negative predictive value.

eODI: oxygen desaturation index.

fHB: hypoxia burden.

gMinSpO2: minimal SpO2.

hST90: percentage of sleep time with SpO2 < 90%.


Figure 7. Heatmap of F1-scores for 4-parameter oximetry models across age and sex subgroups. (A) Male subgroup; (B) female subgroup; (C) older subgroup (≥ 60 years); (D) younger subgroup (< 60 years). The top 4 F1-scores are displayed for each subgroup, with darker colors indicating superior classification performance. AttnEn: attention entropy; CatBoost: categorical boosting; HB: hypoxia burden; LightGBM: light gradient boosting machine; LR: logistic regression; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; OSA: obstructive sleep apnea; RF: random forest; ST90: percentage of sleep time with SpO2 < 90%; SVM: support vector machine; T90: total sleep time spent with SpO2 < 90%; XGBoost: extreme gradient boosting.

Model Interpretability

To elucidate the predictive mechanisms of the optimal 4-parameter model, we integrated SHAP analysis with normalized feature importance scores. SHAP values quantified each feature’s marginal contribution and revealed nonlinear relationships between oximetry parameters and OSA risk (Figure 8A), while normalized scores reflected relative contribution weights (Figure 8B). In the internal cohort, ODI, HB, and MinSpO2 emerged as the top 3 predictors, with importance scores of 0.437, 0.320, and 0.137, respectively (Figure 8B). Subgroup analyses revealed heterogeneous contribution patterns across sex and age strata (Figure 8C-J). Notably, male and older subgroups showed consistent dominance of ODI, HB, MinSpO2, and ST90 (Figure 8C, D, G, H), with ODI exhibiting the highest contribution in the older subgroup (importance score: 0.511). Conversely, younger and female subgroups were characterized by ODI-HB-MinSpO2-MeanSpO2, where MeanSpO2 replaced ST90 as a stronger predictor (Figure 8E, F, I, J). Particularly in females, MeanSpO2 surpassed MinSpO2 in contribution strength (Figure 8F).


Figure 8. Interpretability analysis of oximetry parameters across sex and age subgroups for OSA screening. (A, C, E, G, I) SHAP summary plots illustrating feature contributions; dot color denotes feature magnitude (red: high, blue: low) and horizontal position indicates the SHAP value. (B, D, F, H, J) Normalized feature importance scores. Results are presented for all participants (A,B), male subgroup (C,D), female subgroup (E,F), older subgroup (≥ 60 years) (G,H), and younger subgroup (< 60 years) (I,J). HB: hypoxia burden; MinSpO2: minimal SpO2; ODI: oxygen desaturation index; OSA: obstructive sleep apnea; SHAP: Shapley additive explanations; ST90: percentage of sleep time with SpO2 < 90%.

External Validation

To assess generalizability, we tested model performance on an independent external cohort. The CatBoost algorithm demonstrated robust generalizability, achieving an F1-score of 0.9667 with the single-parameter ODI configuration and maintaining high performance as the number of oximetry parameters increased (Table 5). Specifically, the optimal 4-parameter oximetry model (ODI-HB-MinSpO2-ST90) achieved an F1-score of 0.9838 and an AUC of 0.9881 (Table 5, panel D of the accompanying figure), suggesting that the model captures shared OSA pathophysiological features rather than overfitting the internal cohort. Subgroup analysis further confirmed the robustness of sex- and age-stratified models in the external cohort (Table 5). Sex-optimized models achieved F1-scores of 0.9848 (male subgroup: ODI-HB-MinSpO2-ST90) and 0.9799 (female subgroup: ODI-HB-MinSpO2-MeanSpO2), with AUCs exceeding 0.98 (panels E and F). Age-stratified models similarly achieved F1-scores exceeding 0.98 across subgroups (panels G and H). These results support the generalizability of the CatBoost-based oximetry model across diverse OSA screening applications.

Table 5. Performance of multi-parameter oximetry models in external validation. All subgroup models used categorical boosting (CatBoost) as the optimal classifier.

| Feature sets | AUC^a | F1-score | Accuracy | Sensitivity | Specificity | PPV^b | NPV^c |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ODI^d | 0.9877 | 0.9667 | 0.9462 | 0.9405 | 0.9737 | 0.9943 | 0.7708 |
| ODI-HB^e | 0.9861 | 0.9727 | 0.9552 | 0.9622 | 0.9211 | 0.9834 | |


JMIR MEDICAL INFORMATICS
