Weakly supervised pre-training for surgical step recognition using unannotated and heterogeneously labeled videos

Taken together, our results indicate that weak supervision narrows the gap to full supervision under scarce labels: (i) phase-based pre-training yields consistent, cross-procedure gains; (ii) step- and time-based labels provide procedure-specific benefits; (iii) performance improves with larger pre-training sets; and (iv) the observed F1 gains translate into meaningful label-efficiency savings (NNL), i.e., fewer videos that require step annotation.

Weakly supervised pre-training substantially improves surgical step recognition when annotated data are limited. Phase labels—especially from the same target procedure or pooled across related procedures—produce the most reliable gains across all targets, supporting the use of weak labels to reduce manual annotation burden in surgical video pipelines and to accelerate model deployment for education and performance review.

Importantly, we achieved these improvements using a standard pre-training and fine-tuning paradigm on a relatively simple 2D CNN architecture. This suggests that further performance gains may be attainable through more advanced training regimes or model architectures, including temporal models or multi-modal inputs [12, 28].

The greatest benefit was observed with phase-based weak supervision, which yielded an average improvement of approximately 5 percentage points in F1-score (Table 4) across target procedures. This aligns with and extends the findings of Ramesh et al. [12], although their gains diminish in the absence of same-procedure phase labels. Our results highlight that even simple weak labels can unlock substantial improvements, comparable to several years of progress seen in benchmark datasets such as Cholec80 [26, 28].

Beyond average performance, weakly supervised models demonstrated improved robustness to cross-site or intra-procedural variability. For example, in HYS (Fig. 2E), performance of the baseline ImageNet model actually dropped as more annotated training data was added–likely due to increased variance from different surgical sites or styles [29]. In contrast, models pre-trained with weak labels remained stable or improved, suggesting that pre-training helps mitigate overfitting to noisy or heterogeneous training distributions. This robustness may help address known issues of model generalization across institutions [30].

The effectiveness of weak labels was not uniform. Phase annotations consistently generalized across procedures, likely due to their coarse-grained alignment with surgical workflow structure. In contrast, time-based labels provided moderate but procedure-specific benefit—most notably for CHO, a dataset with relatively linear procedures where elapsed time correlates with task progression. These results suggest that time-based labels may be more informative for datasets with more linear workflows. However, curated datasets can under-represent irregularities and real-world workflows may be less linear; thus, we frame time-label benefits as procedure- and dataset-dependent. Surprisingly, step pre-training from a different procedure (Step-RPY) significantly improved HYS performance, suggesting that some step-level visual or temporal patterns may transfer across seemingly unrelated procedures.

Arguably, this could reflect shared anatomical features, tools, or workflow components between radical prostatectomy and hysterectomy. We refrain from asserting causal mechanisms but note that pelvic procedures such as radical prostatectomy and hysterectomy share operating domain, instruments, and recurring workflow primitives (fine dissection, hemostasis, suturing). Such commonalities could provide transferable low-level features (e.g., energy-use signatures, instrument tip dynamics, camera motion patterns) that are subsequently specialized during fine-tuning on hysterectomy steps. By contrast, sleeve gastrectomy’s (SLG) upper-abdominal field and stapler-dominant steps likely reduce alignment with features learned from RPY, consistent with the weaker transfer we observe.

Table 5 summarizes the Number Needed to Label (NNL) at the low-label operating point \(\alpha _00.25\). Across procedures, weak pre-training with phase information delivers sizeable annotation savings: for SLG, Phase-Within yields an additional label fraction of \(\varDelta \alpha 0.53\) (95% CI [0.39, 0.67]). Under the same train-set size, the Baseline would require annotating 36 more SLG training videos (95% CI [27, 45]) to match that performance. HYS shows a similar pattern with Phase-All (NNL 33 [20, 45]; \(\varDelta \alpha 0.31\) [0.19, 0.42]) and CHO with Phase-All (NNL 66 [34, 98]; \(\varDelta \alpha 0.25\) [0.13, 0.37]).

Two observations follow. First, the label-fraction gains (\(\varDelta \alpha \in [0.25,0.53]\)) are comparable across procedures, indicating that phase-based weak pre-training consistently shifts the F1–\(\alpha \) curve leftward at \(\alpha _00.25\). Second, NNL scales with the size of the training pool \(N_}\): CHO exhibits a larger NNL because the same \(\varDelta \alpha \) multiplies a larger \(N_}\) (263 vs. 68/106), making NNL especially useful for budgeting annotation effort at deployment sites with different data volumes. For example, the effort of labeling 66 additional cases (NNL for CHO in Table 5) can be roughly estimated to be \(\sim \)44 person-hours using the annotation time of \(\sim \)40 min per case as reported by Lecuyer et al for cholecystectomy videos [27].

These results suggest that while the change may only be “a few F1 points”, the observed \(\varDelta \)F1 translates into dozens of additional videos a baseline system would otherwise need to have labeled. We emphasize that NNL is an interpretability and planning tool rather than a replacement for standard effect sizes or significance tests; it depends on a local linear approximation of F1 vs. \(\alpha \) and inherits uncertainty from both the fitted slope and the split-to-split variability (reflected in the reported CIs). Even with these caveats, NNL provides an actionable unit—videos—for comparing weak-label strategies and prioritizing annotation where it has the greatest marginal return.

As expected, the impact of weak supervision diminished with increased availability of annotated step data. This is likely due to the reduced ratio of pre-training data to task-specific training data. Our analysis (Fig 3) shows that maintaining a high ratio of pre-training data relative to fine-tuning data helps preserve the benefits of weak supervision. Interestingly, Step-RPY continued to yield gains even at full supervision (\( \alpha = 1 \)), indicating potential for complementary learning beyond what is captured in the target dataset.

The weighted-F1 scores reported here are lower than state-of-the-art (SOTA) precision/recall/accuracy ranges reported on open benchmarks such as Cholec80, AutoLaparo, and MultiBypass140 (typically \(\sim \)65–95% depending on the publication and metric) [31,32,33]. This gap is expected given (i) task granularity (fine-grained step recognition vs. commonly reported phase recognition), (ii) metric differences (weighted-F1 vs. accuracy), (iii) data characteristics (our multi-site/private data and less curated distributions), and (iv) our explicit focus on low-label regimes (e.g., \(\alpha \le 0.25\)). Importantly, our goal was not to optimize absolute SOTA performance, but to conduct a systematic evaluation of which weak-label pre-training strategies improve step recognition across procedure types and by how much, emphasizing label-efficiency (NNL) under realistic annotation constraints. We view this contribution as complementary to SOTA model development; future work can combine the most effective weak-label strategies identified here with stronger temporal/multi-modal architectures to pursue both label efficiency and peak accuracy.

Table 6 Dataset provenance and availability (complements Table 1)Limitations and future work

One limitation of our approach is the computational overhead of evaluating multiple weak pre-training strategies per target task. Identifying the optimal dataset-label pair requires an exhaustive search, which may be prohibitive in practical deployments. Future work could explore automated selection mechanisms–e.g., using label similarity metrics, ontology alignment, or unsupervised correlation analyses–to guide efficient pre-training strategy selection. Additionally, joint training paradigms or meta-learning approaches may offer end-to-end optimization paths that reduce selection bias while retaining flexibility. Despite these costs, the modularity of our framework allows it to be applied to existing recognition pipelines without re-architecting the target model.

Moreover, our dataset includes videos from multiple clinical sites for HYS, CHO, and RPY, whereas SLG is single-site. We did not design experiments to isolate cross-site generalization (e.g., leave-one-site-out training/testing), so we refrain from drawing conclusions about robustness to inter-site distribution shifts. Nevertheless, the consistent gains from phase-based weak pre-training on procedures aggregated across sites suggest that coarser workflow supervision may help learn representations that are somewhat less sensitive to site-specific factors (e.g., optics, instrumentation, video encoding, and local protocols). Future work should (i) report per-site metrics, (ii) conduct stratified and leave-one-site-out evaluations with site as a grouping variable, and (iii) investigate multi-source/domain-generalization strategies (e.g., adversarial site-invariance, site-aware sampling, or multi-task objectives using site indicators) to quantify and strengthen cross-site transfer. Existing literature in this domain suggests that when deploying to a new site, collecting a small amount of site-specific step labels and combining them with phase-level weak pre-training offers a pragmatic path [34,35,36,37].

Additionally, since our study uses a 2D CNN at 1 fps without explicit temporal modeling, we plan to do future work to test whether the observed label-efficiency gains hold for temporal architectures along with temporal sampling rates higher than 1 fps. Here, we also use ImageNet initializations which are domain-general and may not be a strong foundation for analyzing surgical video. We also plan to test other initializations and architectures for video frame embeddings. Finally, there is a clear need to perform multi-source, multi-task pre-training across different procedure types and weak labels in order to investigate synergies in weak labels that may not contribute linearly to performance against the baseline. Specifically, there may be more data- or event-driven approaches to improving the representation of case progression (e.g. instrument installations, energy applications, or anatomy-in-view) instead of the monotonically increasing Time-Within and Time-All labels used here.

Comments (0)

No login
gif