A Dual Encoder Architecture for Robust, Adversarial Aware Sarcasm Detection across Heterogeneous Text Corpora

Performance Evaluation on multi-domain and Twitter Datasets

The Double BERT (D-BERT) model was trained for 10 epochs. The per-epoch curves in Fig. 4 show improvements in both training loss and validation accuracy. Training loss was 0.0149 after the first epoch and fell to 0.0003 by the tenth. This rapid decline indicates that the model fit the training data effectively and learned the complex sarcasm patterns in the text. Validation accuracy remained steady, reaching about 90.6% by the end of training, which suggests that the model generalized well across epochs.

Although training loss fell sharply, validation loss rose consistently, though only slightly, from its first-epoch value and eventually converged to a stable value of 0.7344 in the later epochs. A falling training loss combined with a rising validation loss usually indicates a certain amount of overfitting, even when the model generalizes well within its capacity. Nevertheless, validation accuracy remained stable at around 90.6% despite the rise in validation loss.

Fig. 4 Training loss of proposed model on multi-domain data

As Table 3 shows, the D-BERT model achieved an accuracy of 91% (95% CI: [0.902, 0.918], SE: ±0.004) on the validation set of 4,710 samples. For the sarcasm class (0), the model recorded a precision of 0.89 (95% CI: [0.876, 0.904]), a recall of 0.90 (95% CI: [0.886, 0.914]), and an F1-score of 0.895 (95% CI: [0.882, 0.908]), with SE: ±0.007. For the non-sarcasm class (1), it performed slightly better, with a precision of 0.92 (95% CI: [0.908, 0.932]), a recall of 0.91 (95% CI: [0.897, 0.923]), and an F1-score of 0.915 (95% CI: [0.903, 0.927]), with SE: ±0.006. The macro and weighted averages of all metrics converged at 0.91. These figures indicate balanced performance across both classes. McNemar’s test indicated that D-BERT performed significantly better than Single BERT (\(\chi ^2 = 178.4\), \(p < 0.0001\)). The effect size, measured by Cohen’s h, was reported as large at 0.82, and the confidence intervals were narrow because of the large validation sample (Fig. 5).
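The accuracy interval quoted above can be checked with a short calculation. The sketch below assumes a normal-approximation (Wald) interval; the text does not say which interval method was used, so this is an assumption that happens to reproduce the reported numbers. The function name `wald_interval` is illustrative.

```python
import math

def wald_interval(p_hat: float, n: int, z: float = 1.96):
    """Return (standard error, lower, upper) for a proportion.

    Uses the normal approximation; this is an assumed method,
    not one confirmed by the paper.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return se, p_hat - z * se, p_hat + z * se

# Values quoted in the text: 91% accuracy on 4,710 validation samples.
se, lower, upper = wald_interval(0.91, 4710)
print(f"SE = {se:.4f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Under this assumption the standard error rounds to 0.004 and the interval to [0.902, 0.918], in line with the values reported for accuracy.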

Table 3 Classification report of D-BERT model on multi-domain dataset (P: precision, R: recall, F1: F1-score)

Fig. 5 D-BERT model performance on 2 classes

Table 4 reports the D-BERT model’s performance on the Twitter dataset. The model performs evenly across both categories, with slightly better results for non-sarcastic examples than for sarcastic ones. The per-class values are close to the overall values, and overall accuracy is 0.864. This suggests that the model performs reliably on this larger dataset of 31,391 instances. Tables 3 and 4 together show that the model performs consistently on the multi-domain and Twitter datasets. Figs. 5 and 6 display the performance of each class and the number of samples used for testing.

Table 4 Classification report of D-BERT model on the Twitter dataset

Fig. 6 shows the validation confusion matrices for both datasets, summarizing the D-BERT model’s correct and incorrect predictions for each class on the multi-domain and Twitter data. The reported counts are 2378 true positives, 205 false positives, 229 false negatives, and 1898 true negatives; they sum to 4,710, the size of the multi-domain validation set. These values show that the model classifies both sarcastic and non-sarcastic samples accurately, with few misclassifications. The matrix also shows that the model detects non-sarcastic instances slightly better than sarcastic ones.
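The per-class metrics can be recomputed from the four counts above. The following is a minimal sketch; it assumes the reported "true positive" count belongs to the class treated as positive in the matrix, and the text does not state which label that is. The helper name `binary_metrics` is illustrative.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the class treated as positive."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts quoted in the text for the validation confusion matrix.
positive = binary_metrics(tp=2378, fp=205, fn=229, tn=1898)

# Swapping the roles of the two classes gives the metrics for the other class:
# its true positives are the original true negatives, and so on.
negative = binary_metrics(tp=1898, fp=229, fn=205, tn=2378)

for name, metrics in (("positive", positive), ("negative", negative)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these counts the accuracy comes out at about 0.908, and the two classes give precision and recall of roughly 0.92/0.91 and 0.89/0.90, which agrees with the values quoted from Table 3.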

Fig. 6 Confusion matrix of proposed model on multi-domain data (left) and Twitter data (right)

Fig. 7 shows the Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve for the multi-domain dataset, which together give a comprehensive view of the model’s classification performance. The ROC curve plots the true positive rate against the false positive rate at various decision thresholds. The area under the curve (AUC) of 0.95 indicates that the model distinguishes effectively between the two classes.

The Precision-Recall curve relates recall, the fraction of relevant instances retrieved, to precision, the fraction of positive predictions that are correct. Precision and recall remain high across most thresholds, with only a small decline at extreme thresholds, which reflects robust identification of positive instances.
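To make the two curves concrete, the sketch below computes an ROC AUC and a single precision-recall point in plain Python. The labels and scores are toy values invented for illustration, not the paper's predictions, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (hypothetical): two negatives and two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_pr = precision_recall_at(toy_labels, toy_scores, 0.5)
print(toy_auc, toy_pr)
```

Sweeping the threshold in `precision_recall_at` over all distinct scores traces out the Precision-Recall curve, while the pairwise definition in `roc_auc` equals the area under the ROC curve.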

Fig. 7 ROC and Precision-Recall curves of proposed model on multi-domain data

The calibration curve in Fig. 8 shows how reliably the model’s predicted probabilities match actual outcomes. For a well-calibrated model, the curve follows the diagonal, where predicted probabilities equal observed frequencies. Our model follows this ideal trend closely, which supports the reliability of the Double BERT model’s predicted probabilities for sarcasm recognition. In addition, Figs. 9 and 10 show feature importance plots for true and predicted labels. In Fig. 9, a correct prediction, the true and predicted labels both indicate sarcasm (True: 1, Predicted: 1), and the plot shows which features contributed most. Fig. 10 gives a broader overview of feature contributions across multiple instances, including both correct and incorrect predictions. Its subplots show how features influence individual predictions, with contributions that are stronger or weaker depending on whether sarcasm was correctly identified. The diversity of feature importance across test cases illustrates the model’s ability to pick up different features from text and highlights the robustness of the Double BERT model in capturing subtle nuances of sarcastic language.
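A calibration curve of the kind described for Fig. 8 is usually built by binning predicted probabilities and comparing each bin's mean prediction with its observed positive rate. The sketch below shows one way to do this, with equal-width bins and an expected calibration error summary; the bin count, function names, and toy inputs are assumptions, not details from the paper.

```python
def reliability_bins(labels, probs, n_bins=5):
    """Return (mean predicted probability, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

def expected_calibration_error(labels, probs, n_bins=5):
    """Count-weighted average gap between predicted probability and observed rate."""
    n = len(labels)
    return sum(count / n * abs(mean_p - rate)
               for mean_p, rate, count in reliability_bins(labels, probs, n_bins))

# Toy inputs (hypothetical): perfectly calibrated vs. overconfident predictions.
calibrated_ece = expected_calibration_error([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0])
overconfident_ece = expected_calibration_error([0, 0], [0.9, 0.9])
print(calibrated_ece, overconfident_ece)
```

Plotting the first two elements of each row from `reliability_bins` against each other gives the reliability diagram; points on the diagonal correspond to a zero gap in that bin.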

Fig. 8 Calibration curve of D-BERT model on multi-domain dataset

Fig. 9 Feature importance of D-BERT model for true positive on multi-domain dataset

Fig. 10 Feature importance of D-BERT model for true negative and false values on multi-domain dataset

Comparison with State-of-the-art Models

Table 5 compares the proposed Double BERT model with SOTA models for sarcasm detection on the Twitter dataset. Traditional deep learning architectures such as Bi-LSTM, CNN-LSTM-DNN, and MIARN, and graph-based models like ADGCN and DCNET, show moderate performance, with DCNET achieving the highest F1-score among them at 76.3%. Pre-trained language models like RoBERTa and its graph-enhanced variants (e.g., ADGCN-RoBERT, DC-Net-RoBERT) give slightly better results but remain below an F1-score of 77%. SarcPrompt-Clash-RoBERT from [24] achieves an F1-score of 76.6%, whereas other approaches, such as the ensemble model by [21] and the hierarchical BERT by [33], offer competitive results with varying training and testing performance. The proposed Double BERT model outperforms all baseline models, with an F1-score of 86.4% along with 86.4% accuracy, 86.4% precision, and 86.4% recall. This result illustrates the effectiveness of the dual BERT encoding approach in capturing both semantic and contextual features across different domains, and it represents a considerable improvement over existing methods on this benchmark dataset.

Table 5 Comparison of proposed model with state of the art models (SOTA) on Twitter data set (* the results are taken from [24])

Table 6 compares the D-BERT model with several benchmark models on the Reddit dataset. The hierarchical BERT by [33] achieved 63.7% accuracy on test data, while ensemble methods [21] and BERT-based models [5] achieved F1-scores around 65%. Among existing approaches, RCNN-RoBERT by [28] performs relatively better, with an F1-score of 78%. The proposed model performed better still, achieving an accuracy of 83.6% alongside values of 80.8%, 88.1%, and 84.4% on the remaining reported metrics, which reflects its enhanced ability to capture contextual nuances in sarcastic expressions within Reddit posts (Tables 6 and 7).

Table 6 Comparison of proposed model with prescribed models on Reddit dataset

Table 7 Comparison of proposed model performance on prescribed data sets

As Table 8 shows, the multi-domain-trained D-BERT achieves 91.0% accuracy on the combined validation data. When evaluated on held-out samples from individual platforms that were present in training, it reaches 87.2% (Twitter) and 89.1% (Reddit). These results demonstrate within-distribution generalization to unseen samples from training platforms, not cross-domain transfer. Compared with the single-platform models (Twitter-only: 86.4%, Reddit-only: 83.6%), multi-platform training yields gains of 0.8 and 5.5 percentage points, respectively.

Table 8 Multi-platform training performance evaluation

Ablation Study

Table 9 reports the ablation study of D-BERT’s architecture, whose scope was limited by GPU computational constraints. Removing the second encoder reduces the model to Single BERT and degrades every metric: accuracy drops from 91% to 78% (-13 points), precision from 90.5% to 77.5% (-13 points), recall from 90.5% to 76.8% (-13.7 points), and F1-score from 91% to 77.15% (-13.85 points). These results confirm that the dual-encoder architecture performs better in sarcasm detection. Combining the first BERT’s contextual embeddings with the second BERT’s specialized classification enables better capture of sentiment reversals, contextual incongruities, and domain-specific sarcasm patterns. The 13-point improvement is statistically robust because of our large validation sample (\(n=4710\)). With this sample size, the standard error for accuracy is ±0.4%, so the 13-point difference (91% vs 78%) corresponds to approximately 32.5 standard errors, far exceeding the conventional significance threshold. McNemar’s test confirms that this difference is statistically significant (\(\chi ^2 = 178.4\), \(p < 0.0001\)).
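For readers unfamiliar with McNemar's test, the sketch below shows the uncorrected form of the statistic and its p-value from a chi-square distribution with one degree of freedom. The discordant counts `b` and `c` are hypothetical, since the text reports only the final statistic, and it is not stated whether a continuity correction was applied.

```python
import math

def mcnemar(b: int, c: int):
    """McNemar's test without continuity correction.

    b: samples only the first model classified correctly
    c: samples only the second model classified correctly
    """
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(b=30, c=10)

# p-value implied by the statistic reported in the text.
reported_p = math.erfc(math.sqrt(178.4 / 2.0))
print(stat, p_value, reported_p < 0.0001)
```

The last line only checks that a statistic of 178.4 is consistent with the reported \(p < 0.0001\); reproducing the statistic itself would require the paired predictions of both models.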

Table 9 Ablation study on the proposed model

Adversarial Robustness and Testing

The D-BERT model’s robustness was evaluated on a set of adversarial test cases designed to challenge its understanding of sarcasm across various contextual and linguistic nuances. Table 10 lists the results for different categories of sarcasm: Contextual Understanding, Nuanced Interpretation, Ambiguous Statements, Linguistic and Cultural Variability, Hyperbole and Exaggeration, Slang and Informal Language, Cross-Cultural Idioms, and Conversational Context. Each test case pairs a statement with the model’s predicted output and the actual output.

In the Contextual Understanding category, the model correctly identified sarcasm in straightforward statements like “What a beautiful day!” but failed on more subtly sarcastic examples such as “Oh great! Another rainy day. Just what I needed!”, which it misclassified as non-sarcastic. The surface wording of this sample is entirely positive, and in this case the model did not predict the correct output. In the Nuanced Interpretation category, the model performed exceptionally well, correctly detecting sarcasm in complex, layered statements such as “I just love waiting in long lines at the DMV. It’s the highlight of my week!” and “Thanks for the unsolicited advice, I’ll be sure to use it wisely.” This suggests that the model readily detects ironic statements.

However, the model struggled with Hyperbole and Exaggeration, as seen in the test case “I’ve told you a million times to clean your room!”, where it incorrectly classified the exaggerated statement as non-sarcastic. Similarly, in the Cross-Cultural Idioms category, the model failed to recognize the sarcasm in the idiomatic expression “It’s raining cats and dogs!”, misclassifying it as non-sarcastic. These cases show a limitation of the model: it has difficulty detecting exaggerated expressions and idioms.

The model performed well on Slang and Informal Language with phrases like “That’s just lit, fam!”, correctly identifying the sarcastic intent despite the informal, modern slang. The model also detected sarcasm in Conversational Context, correctly interpreting the dialogue “Did you finish the report? Oh yeah, I did it while I was sleeping.”, where sarcasm is conveyed through conversational dynamics. As Table 10 shows, only 3 of the 15 test cases are misclassified. The Single BERT model is faster than D-BERT, but it fails on the “Nuanced Interpretation” and “Ambiguous Statements” cases in Table 10. According to Table 11, the D-BERT model achieved 80% accuracy and the Single BERT model 60%.

Table 10 Adversarial testing on proposed model

Table 11 Category-wise performance summary on adversarial test cases

Limitations and Deployment Considerations

While D-BERT demonstrates strong performance, several critical limitations warrant consideration. Our central contribution remains valid: the dual-encoder BERT architecture improves sarcasm detection over single-encoder baselines (13% improvement, Table 9). The following limitations, however, define the boundaries within which this result generalizes.

Architectural Design Limitations

D-BERT contains 220M parameters versus 110M for Single BERT. Our ablation study shows a 13% improvement (\(\chi ^2 = 178.4\), \(p < 0.0001\)), but we cannot definitively isolate whether the gains stem from the dual-encoder architecture or from the increased parameter count. Future work should compare against BERT-large (340M parameters), extended training of BERT-base, alternative fusion strategies (attention-based, gating mechanisms), and parameter-efficient alternatives (adapters, prompt tuning, LoRA).

Adversarial Robustness Evaluation

Our adversarial robustness testing relies on 15 handcrafted test cases and lacks systematic rigor. We did not implement standardized benchmarks such as TextFooler, BERT-Attack, and CheckList, or the controlled perturbation studies advocated by [38], such as token shuffling at 10–30%, random replacement, and character-level perturbations. While D-BERT attains 80% accuracy against Single BERT’s 60% on these qualitative cases, worst-case robustness against targeted attacks cannot be quantified. Implementing these evaluations was beyond what could be achieved within the revision timeline given computational constraints. Any deployment in an adversarial environment should therefore include human oversight.
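As an illustration of what a controlled perturbation study might involve, the sketch below implements two simple perturbations named above: token shuffling at a given fraction and a character-level swap. It is not the procedure from [38] and not an implementation of TextFooler, BERT-Attack, or CheckList; the function names and the rounding rule for the fraction are assumptions.

```python
import random

def shuffle_tokens(text: str, fraction: float, rng: random.Random) -> str:
    """Permute roughly `fraction` of the whitespace-separated tokens among themselves."""
    tokens = text.split()
    k = int(round(fraction * len(tokens)))
    if k < 2:  # fewer than two tokens selected: nothing to shuffle
        return text
    positions = rng.sample(range(len(tokens)), k)
    targets = positions[:]
    rng.shuffle(targets)  # note: may occasionally leave the order unchanged
    result = tokens[:]
    for src, dst in zip(positions, targets):
        result[dst] = tokens[src]
    return " ".join(result)

def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent characters (a simple typo-style perturbation)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(0)  # fixed seed so the perturbations are repeatable
sample = "Oh great! Another rainy day. Just what I needed!"
perturbed = [shuffle_tokens(sample, f, rng) for f in (0.1, 0.2, 0.3)]
typo = swap_adjacent_chars(sample, rng)
for variant in perturbed + [typo]:
    print(variant)
```

In a study of this kind, each perturbed variant would be passed to the classifier and accuracy tracked as a function of perturbation strength; that evaluation loop is omitted here because it depends on the trained model.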

Large Language Model Comparison

We did not compare against modern LLMs (GPT-4, Claude 3.5, Gemini, Llama 3). As demonstrated by [39], LLMs exhibit strong zero-shot reasoning capabilities. Without this comparison, we cannot claim “state-of-the-art” status, only that D-BERT is the best fine-tuned BERT-based approach. The comparison was not conducted because of API costs (~$500–1000 for 50,000+ samples), reproducibility concerns, and a research scope focused on fine-tuning rather than prompting.

Limited Cross-domain Evaluation

Our evaluation lacks true cross-domain transfer testing. All platforms (Twitter, Reddit, SemEval, The Onion) appear in training data. Table 8 results (91.0% multi-domain, 87.2% Twitter, 89.1% Reddit) represent within-distribution generalization, not cross-platform transfer. We did not evaluate unseen platform transfer (Instagram, TikTok, and LinkedIn) or domain-specific transfer (professional communication, technical text). Performance variation (Twitter: 86.4% vs Reddit: 83.6%) suggests platform sensitivity, but we cannot predict performance on entirely new domains.
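One common way to measure cross-platform transfer is a leave-one-platform-out split, in which each platform is held out entirely from training and used only for testing. The sketch below shows the splitting step only; the record layout (`platform`, `text`, `label`) and the placeholder texts are hypothetical, and no such experiment is reported in the paper.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Dict[str, object]

def leave_one_platform_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_platform, train, test) with the platform absent from train."""
    platforms = sorted({str(s["platform"]) for s in samples})
    for held_out in platforms:
        train = [s for s in samples if s["platform"] != held_out]
        test = [s for s in samples if s["platform"] == held_out]
        yield held_out, train, test

# Placeholder records (hypothetical); platform names follow the text.
toy = [
    {"platform": "Twitter", "text": "placeholder a", "label": 1},
    {"platform": "Twitter", "text": "placeholder b", "label": 0},
    {"platform": "Reddit", "text": "placeholder c", "label": 1},
    {"platform": "SemEval", "text": "placeholder d", "label": 0},
    {"platform": "The Onion", "text": "placeholder e", "label": 1},
]
splits = list(leave_one_platform_out(toy))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Training on `train` and evaluating on `test` for each split would give a per-platform transfer score that can be compared with the within-distribution numbers in Table 8.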

Computational Efficiency

The dual-encoder architecture roughly doubles inference latency (24.7 ms vs. 12.3 ms) and memory usage (~220M parameters). Future work on this model includes knowledge distillation and early-exit techniques, as suggested in [39].

Fairness Limitations

Performance was not tested across demographic groups such as age, gender, and ethnicity. The training data carries existing biases arising from the composition of English-speaking annotators and from variability in annotation across platforms. No inter-annotator agreement analysis was conducted. A fairness evaluation is a crucial step before any high-stakes deployment, to ensure non-discriminatory usage and avoid stereotypes. Despite these limitations, our work contributes the following:

(1) The first dual encoder approach to detect sarcasm with a 13% improvement over the baseline (Cohen’s \(h = 0.82\)).

(2) A multi-domain validation with a total sample size of 149,480 instances.

(3) A transparent reporting format with confidence intervals.

Deployment and Ethical Considerations

Computational Efficiency

D-BERT’s prediction latency of 24.7 ms is double Single BERT’s 12.3 ms, and its memory demand is nearly double at 220 million parameters. Both could be limiting in high-throughput real-time content moderation and especially on edge devices. A 13% improvement at twice the cost appears to be a fair trade-off in this particular case, but it merits close examination.
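The latency figures above translate directly into throughput. The sketch below shows a simple way to measure median per-request latency and the sequential throughput it implies; the timing harness is hypothetical and is not the measurement setup used in the paper, and it ignores batching, which would change the throughput picture.

```python
import statistics
import time

def median_latency_ms(fn, inputs, warmup: int = 3, repeats: int = 20) -> float:
    """Median wall-clock time of fn(x) in milliseconds, after a short warm-up."""
    for x in inputs[:warmup]:
        fn(x)
    timings = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            fn(x)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

def throughput_per_second(latency_ms: float) -> float:
    """Sequential single-request throughput implied by a per-request latency."""
    return 1000.0 / latency_ms

# Latencies quoted in the text.
d_bert_ms, single_bert_ms = 24.7, 12.3
latency_ratio = d_bert_ms / single_bert_ms
print(f"ratio = {latency_ratio:.2f}")
print(f"D-BERT: {throughput_per_second(d_bert_ms):.1f} requests/s")
print(f"Single BERT: {throughput_per_second(single_bert_ms):.1f} requests/s")
```

At the quoted latencies, a single sequential worker would handle about 40.5 requests per second with D-BERT versus about 81.3 with Single BERT, which is the cost side of the trade-off discussed above.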

Ethical Concerns

One ethical concern with our data is that the training annotations come from multiple sources (SemEval, Reddit, Twitter, and The Onion) that use different annotation schemes, each with its own biases. In addition, the annotations were made by English-speaking annotators, which introduces cultural bias. The interpretation of sarcasm also differs across age groups, cultures, and languages, yet no inter-annotator agreement analysis was performed. Finally, communities differ in style: Reddit most often follows a more formal tone, whereas Twitter uses short sentences, so the two may have been annotated under different schemes.
