Enhancing generalization in zero-shot multi-label endoscopic instrument classification

All experiments in this study were based on an adaptation of the codebase from [15]. The vision backbone employed was DenseNet-121, initialized with random weights, and all input images were resized from \(1920 \times 1080\) to \(224 \times 224\). Training was optimized using the Adam optimizer [27], running for 50 epochs with a batch size of 16. A dynamic learning rate adjustment was implemented via the OneCycleLR scheduler [28] with a maximum learning rate of 1e-4. Most hyperparameters were adopted from [15], except for the margin value in the ranking loss, which was adjusted to \(\delta = 0.3\) based on preliminary experiments.

The experiments were conducted on NVIDIA GeForce RTX 3090 and A100 80GB PCIe GPUs. To evaluate the robustness of our model’s performance, we performed 25 independent runs per experiment, each with a different random seed. This resulted in a unique resampling of the training dataset with replacement, while maintaining the same total number of samples. Each trained model was then evaluated on the same test dataset. This approach enables us to quantify the variability in model performance and assess its generalization stability more effectively.

Model performance was evaluated using the macro-F1 score, area under the receiver operating characteristic curve (AUROC)—both common multi-label performance metrics [29]—and a dynamic multi-label accuracy metric.Footnote 4 The model selection criterion was the highest macro-F1 score achieved on the validation data, which includes both seen and unseen (S1-UC) classes. By incorporating unseen classes into the validation set, the model’s generalization ability was directly assessed during training. To ensure robust evaluation on unseen classes, new classes were reserved exclusively for the test split (S2-UC). For threshold-independent comparisons, we report both mean AUROC and multi-label accuracy values.

Unlike standard multi-label accuracy, which relies on exact matches (subset accuracy), we apply a dynamic top-k strategy, where k is equal to the number of ground truth labels for each instance. The model’s top-k predicted classes (based on the highest prediction scores) are then compared with the ground truth set. The accuracy is computed as the proportion of correctly predicted labels. In Equation 2, the multi-label accuracy is calculated as follows:

$$\begin \text = \frac \sum _^ \frac, \end$$

(2)

where \( |T_i| \) indicates the total number of relevant (ground truth) classes for test instance \(i\). \(T_i\) denotes the set of relevant (ground truth) classes for test instance \(i\), while \(N\) represents the number of test instances with at least one relevant class (\( |T_i| > 0 \)). \(P_i\) is the set of the top-\( |T_i| \) predicted classes with the highest scores for that instance. Finally, the term \( |P_i \cap T_i| \) refers to the number of correctly predicted classes, for instance \(i\), which is the intersection of predicted and relevant classes.

For example, if a test image has ground truth labels , (i.e., \( |T_i| = 3\)), and the model’s top-3 predictions are , the intersection is , resulting in a multi-label accuracy of \(2/3\) for that instance. The overall multi-label accuracy is then obtained by averaging this accuracy over all instances, which provides a flexible and context-sensitive evaluation for multi-label classification tasks.

For both AUROC and multi-label accuracy, results were reported separately for seen, stage one, and stage two unseen classes. To provide a balanced assessment of the model’s overall performance, we also compute the harmonic mean across all three groups. This approach ensures that the model’s performance on unseen classes, which is critical in GZSL tasks, is given equal consideration alongside seen classes. During evaluation, predictions are made over the complete label set, but metric calculations for each group are performed by filtering the predicted and ground truth labels to include only the relevant class subset (e.g., only S2-UC labels when evaluating zero-shot performance). While we report metrics for all subsets and their harmonic mean, we place particular emphasis on the results for stage two unseen classes (S2-UC), as these represent the truly novel categories that were not seen during training or validation. All results are presented as the mean of 25 repeated runs.

Table 1 Comparison of AUROC and multi-label accuracy values for word and sentence embeddings for the baseline methodWord vs sentence embeddings

In our first experiment, we evaluated the performance of word embeddings (Labels) and sentence embeddings (Phrases) within the baseline model setup. The results, as summarized in Table 1, highlight significant improvements when using sentence embeddings, especially for the S2-UC.

For the AUROC metric, sentence embeddings improved performance by 9.1 percentage points for the unseen test classes and by 5.1 percentage points for the harmonic mean. For the multi-label accuracy, we observed a substantial improvement of 45.3 percentage points for unseen test classes and an increase of 25.6 percentage points for the harmonic mean, while the results for the seen classes remained largely unchanged for both metrics. These findings establish that sentence embeddings provide a significant advantage over word embeddings in the context of multi-label classification tasks. The use of descriptive phrases instead of single-label names offers clear benefits, as reflected in the improved performance for novel classes.

Raw vs Z-score normalization on embeddings

In this experiment, we evaluated the impact of z-score normalization on both word (Labels) and sentence (Phrases) embeddings. Table 2 summarizes the results, comparing the unnormalized (Raw) embeddings and their z-score normalized (Z-Score) counterparts. The experimental pipeline, as illustrated in Fig. 2, follows path 1a) and 1b) for raw embeddings and path 1a) + 2) and 1b) + 2) for normalized embeddings. Applying z-score normalization improved model performance particularly for the AUROC metric, across both embedding types.

Fig. 5figure 5

Box plot comparison for replicate experiments with the label embedding (blue) vs. the z-score normalized phrase embedding (red). Results are compared for seen classes, S1-UCs, S2-UCs and the harmonic mean

Results for Word Embeddings. AUROC values increased by 4.8 and 5.5 percentage points for the unseen test classes and the harmonic mean, respectively. Multi-label accuracy values deteriorated by 4.7 and 2.7 percentage points for the unseen test classes and the harmonic mean, respectively.

Results for Sentence Embeddings. There was a 11.9 percentage point increase in AUROC for the unseen test classes and a 10.5 percentage point rise for the harmonic mean. Additionally, multi-label accuracy values improved by 8.1 and 2.2 percentage points for the same categories.

Fig. 6figure 6

Comparison of model predictions using two different embedding types: raw label embeddings and z-score normalized phrase embeddings. Correct predictions are highlighted in green, while incorrect ones are shown in purple. The results reveal a balanced distribution across the four examples: one instance shows correct predictions from both methods, another shows failures from both, while the remaining two display divergent performances between the methods

These results demonstrate that z-score normalization has a substantial positive impact on both embedding types, with particularly pronounced improvements for unseen test classes, especially in terms of the AUROC metric. Meanwhile, the results for seen classes remain largely unchanged, indicating that the normalization primarily benefits generalization to unseen data.

To further assess the model’s robustness concerning these performance improvements, Fig. 5 presents box plots comparing the raw label embeddings with the z-score normalized phrase embeddings. The boundaries of the boxes represent the \(\text ^}\) and \(\text ^}\) percentiles of the replicate experiments, while the line within the box indicates the median. As the previous results showed, there is little difference for the seen classes, whereas the unseen classes exhibit a substantial improvement. In addition, it can be observed that the variance of the results is rather low for the seen classes, but significantly higher for the unseen classes. However, the use of z-score normalized phrase embeddings was able to mitigate this significantly in some cases.

Table 2 Comparison of AUROC and multi-label accuracy values for word and sentence embeddings, highlighting the impact of the z-score normalization

Figure 6 illustrates the model predictions for four sample images using two different embedding types: raw label embeddings and z-score phrase embeddings. The results demonstrate a balanced distribution across the examples: in one case, both methods yield correct predictions; in another, both fail; and in the remaining two cases, performance varies between the methods.

Ablation study

To further investigate the generalization ability of our approach, we conducted an ablation study examining the impact of using a shared versus separate optimization strategy for the loss functions.

In the baseline method, all loss functions—ranking, alignment, and inter-class consistency—are combined into a single-gradient descent update. However, since these loss functions have distinct objectives, we hypothesized that separating their optimization might yield better results.

In the multiple optimizer approach, the model’s parameters for the visual encoder and the visual mapping module are updated using a single-gradient descent step, driven by the combined ranking and alignment losses. Meanwhile, the semantic mapping module’s parameters are refined in a separate gradient descent step, guided by the inter-class consistency loss. Table 3 compares the performance of the baseline method (shared optimization) with the multiple optimizer approach for both word and sentence embeddings, with z-score normalization applied in all cases.

Contrary to our initial assumption, the results reveal minimal differences between the optimization types, indicating that our hypothesis is not supported by the observed outcomes. These findings suggest that regardless of the embedding type, the choice of optimization strategy has negligible impact on the model’s generalizability.

Table 3 Comparison of AUROC and multi-label accuracy values between the baseline approach and the multiple optimizer setting for both word and sentence embeddings with z-score normalization

Comments (0)

No login
gif