Background:
Congenital cataract (CC) is a time-critical cause of preventable childhood visual impairment. After diagnosis, parents frequently experience uncertainty and increasingly seek guidance online. The safety, readability, and counseling quality of large language model (LLM) responses for CC remain insufficiently benchmarked, particularly for explanations involving lens development, etiology, and genetic risk.
Methods:
We performed a cross-sectional comparative evaluation of five publicly accessible conversational LLMs with Chinese-language capability (ChatGPT-5.2, Gemini 3 Pro, DeepSeek-V3.1, Doubao, and Kimi K2). Thirty standardized parent-facing CC questions were developed by senior ophthalmologists and mapped to five domains, with specific incorporation of scenarios requiring translation of lens developmental pathology and genetic counseling knowledge. Two researchers independently performed standardized zero-shot querying and response recording under identical conditions. Output efficiency (response time) and textual structural characteristics were extracted. Two blinded ophthalmologists rated each response on a 5-point Likert scale across Accuracy, Logic, Coherence, Safety, and Content accessibility; inter-rater agreement was assessed using quadratic weighted Cohen’s kappa. Group differences were tested using ANOVA or Kruskal–Wallis H tests with Bonferroni-corrected pairwise comparisons.
Results:
Significant between-model differences were observed in output efficiency and text characteristics (all P < 0.001). ChatGPT-5.2 had the shortest mean response time (17.94 ± 5.11), whereas DeepSeek-V3.1 and Kimi K2 had the longest (41.46 ± 3.22 and 40.02 ± 4.67, respectively). DeepSeek-V3.1 generated the longest responses (1,456.93 ± 224.99 words) and Kimi K2 the shortest (640.83 ± 252.95 words). ChatGPT-5.2 showed the strongest tendency toward structured/tabular output [2.00 (1.00, 2.00)], followed by Gemini 3 Pro [1.00 (1.00, 1.25)], while the other models rarely produced tables. Quadratic weighted Cohen’s kappa indicated good inter-rater reliability (0.686–0.767). Content quality differed significantly across models (Accuracy H = 41.15, Logic H = 32.95, Coherence H = 27.79, Safety H = 31.72, Content accessibility H = 41.33; all P < 0.001). ChatGPT-5.2 and Gemini 3 Pro achieved higher overall profiles and did not differ significantly from each other, whereas Kimi K2 scored lower on multiple dimensions.
Conclusion:
LLM performance in translating lens developmental pathology and genetics for CC parent counseling is model-dependent. Longer outputs did not necessarily translate into higher quality; structured presentation was more closely associated with better safety and accessibility. These findings provide quantitative benchmarks for safer, parent-centered deployment of LLMs in pediatric ophthalmology education and support more reliable translation of complex disease-related knowledge into actionable parent guidance.
1 Introduction
Congenital cataract (CC) is one of the leading causes of preventable childhood blindness worldwide (Haargaard et al., 2005), with an estimated incidence of approximately 1–15 per 10,000 children (Sheeladevi et al., 2016; Wu et al., 2016). Unlike age-related cataract in adults, the management of CC is highly time sensitive. Because the infant visual system remains within a critical developmental period, any form of visual deprivation may result in irreversible deprivation amblyopia or nystagmus (Li et al., 2024; Rechsteiner et al., 2021). Therefore, early diagnosis, appropriately timed surgery, and sustained, intensive amblyopia therapy after surgery are essential for optimal visual rehabilitation (Wang et al., 2024).
Despite these clinical imperatives, parents and families often experience substantial psychological distress and decisional anxiety along the complex diagnostic and treatment pathway (Du et al., 2025). Importantly, CC is not only a time-critical surgical condition but also a developmental disorder of the crystalline lens, with heterogeneous phenotypes (e.g., nuclear or lamellar opacities) that can reflect distinct pathogenic mechanisms and prognostic implications. Prior studies indicate that post-diagnosis uncertainty can markedly undermine parental self-efficacy, and that most parents still report insufficient access to information after outpatient consultations—particularly regarding the timing of intraocular lens implantation, the implications of lens phenotype and etiology, genetic risk assessment, and practical details of postoperative home care (De Lima et al., 2020). Under the pressure of high clinical workload, ophthalmologists may have limited time to fully address caregivers’ needs for emotional support and individualized explanations, prompting many families to seek medical advice online (Nguyen et al., 2024). However, online health information is frequently fragmented, commonly generated by non-professionals, and sometimes misleading, which may exacerbate parental fear or contribute to delays in appropriate treatment (Wang et al., 2019).
In recent years, large language models (LLMs), exemplified by ChatGPT (OpenAI) and DeepSeek, have catalyzed a paradigm shift in medical applications (Shah et al., 2023; Thi et al., 2023; Bradshaw et al., 2025; Omiye et al., 2024), showing particular promise in medical question answering and in generating public-facing health information (Wei et al., 2025). Evidence from multi-dataset medical evaluation frameworks suggests that LLMs can demonstrate strong capabilities in medical knowledge encoding and generation across diverse question types (Li et al., 2025). Their responses can also be systematically appraised using human evaluation frameworks that assess factuality, reasoning quality, and potential harm (Singhal et al., 2023). Within ophthalmology, perspective articles have highlighted the potential of LLMs to streamline clinical workflows and enhance patient communication, while also underscoring practical barriers related to privacy, safety, and implementation governance (Betzler et al., 2023). In caregiver-facing contexts, an additional challenge is “knowledge translation”: converting complex concepts—such as developmental lens pathology, etiology, and genetic testing—into understandable, actionable, and non-misleading guidance.
Importantly, fluency does not guarantee completeness or safety. LLMs may produce seemingly persuasive outputs that remain incomplete, overconfident, or clinically inappropriate (Huang et al., 2025), making disease-specific, systematic evaluation indispensable. Nevertheless, rigorous assessments of LLMs in pediatric ophthalmology remain scarce. Pediatric consultations are distinctive in that they require not only high factual accuracy—because any delay may lead to lifelong visual impairment—but also empathic communication and high readability to address caregivers’ anxiety and support decision-making (Bernstein et al., 2023). Against this backdrop, the present study focuses on high-frequency caregiver question scenarios related to CC and conducts a systematic, head-to-head evaluation of five widely used LLMs. Anchored to the need to translate developmental lens pathology and genetics into safe parent education, we compare model performance in key information coverage, comprehensibility, action-oriented recommendations, and risk-boundary and safety-netting content (e.g., red-flag symptoms, appropriate care-seeking thresholds, and when to defer to specialist assessment), aiming to provide quantifiable evidence and actionable directions for the standardized use of LLMs in pediatric ophthalmology health education and consultation assistance.
2 Materials and methods
2.1 Study design
We conducted a cross-sectional comparative study to evaluate the quality and safety of responses generated by five LLMs when answering parent-oriented consultation questions related to CC. With a focus on knowledge translation, we specifically assessed how models translate developmental lens pathology and etiology/genetic information into safe, understandable health information for clinical counseling and patient education. The study workflow comprised six steps: construction of the question set, generation of model responses, standardized data collection, extraction of output metrics, expert rating, and statistical analysis (Figure 1). The unit of analysis was a single independent response produced by a model to a single question (one question–one response in a single-turn setting). All models generated responses independently under identical operating rules to ensure comparability and reproducibility. Reporting was conducted in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for cross-sectional studies, where applicable.

Figure 1. Study workflow and evaluation framework for five large language models (LLMs) on congenital cataract (CC) queries.
2.2 Construction of the consultation question set
To reflect parents’ and families’ real-world concerns in clinical settings, the question set was developed through a structured process led by senior ophthalmologists. Two experts (each with ≥8 years of clinical practice) independently compiled the most frequent and decision-relevant questions raised by parents during outpatient visits. The research team then merged the two lists and performed standardization (rewriting into a consistent, lay-language tone, removing duplicates, and excluding semantically ambiguous items). A third expert independently reviewed the refined list to reduce selection bias. To ensure comprehensive coverage, questions were mapped to a “full-cycle management framework” for CC, spanning five domains: disease overview, screening and diagnosis, treatment and follow-up, lifestyle and prevention, and prognosis and complication management. In addition, to align with the study focus on translating developmental lens pathology and genetics, each item was also categorized by question type (definitional, causal/etiology, comparative, and process/management), and the causal/etiology and relevant definitional items were prespecified as the pathology/genetics knowledge-translation subset for secondary analyses. Ultimately, 30 standardized questions were finalized for model evaluation.
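As an illustration only, the annotation scheme for the 30 items could be encoded as follows for the prespecified secondary analysis. This is a minimal sketch: the field names and the single example item are hypothetical and do not reproduce the actual study instrument.

```python
from dataclasses import dataclass

@dataclass
class ConsultationQuestion:
    item_id: int
    text: str            # standardized lay-language wording (queries were submitted in Chinese)
    domain: str          # one of: disease overview, screening and diagnosis, treatment and follow-up,
                         # lifestyle and prevention, prognosis and complication management
    question_type: str   # definitional, causal/etiology, comparative, or process/management
    kt_subset: bool      # prespecified pathology/genetics knowledge-translation subset

# Hypothetical example item (shown in English for illustration):
q = ConsultationQuestion(
    item_id=7,
    text="Why did my child develop a cataract, and could it be inherited?",
    domain="disease overview",
    question_type="causal/etiology",
    kt_subset=True,
)
```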
2.3 Model selection and testing environment
We included five publicly accessible, general-purpose LLMs with Chinese-language capability and widely used web interfaces. Inclusion criteria were: (1) public availability; (2) capability for medical question answering in Chinese; and (3) representation of both international and domestic technical approaches. To minimize methodological confounding from promotional specifications or leaderboard scores, we reported only the model name, access route, the publicly visible version label and/or reasoning mode at the time of data collection, internet/tool availability, adjustable parameters, and session policies. Because our evaluation was conducted in a standardized single-turn workflow, potential advantages related to long-context or multi-turn memory were not specifically targeted by the study design.
ChatGPT-5.2 (OpenAI). According to the platform-visible release information, ChatGPT-5.2 was released on 11 December 2025. It is positioned as a flagship model and was included given its claimed capability to adapt reasoning depth to query complexity and to mitigate hallucinated content. Its long-context capacity (up to 400K tokens, as indicated by the platform) was considered relevant for multi-turn clinical consultation scenarios (Handler et al., 2025).
Gemini 3 Pro (Google DeepMind). Gemini 3 Pro was released on 18 November 2025, and has been reported to perform well in tasks involving long-form medical record interpretation and complex report analysis. In this study, it served as a reference model for long-context processing and advanced reasoning (Sandmann et al., 2025).
DeepSeek-V3.1 (DeepSeek). Released in August 2025, this open-source model adopts a mixture-of-experts (MoE) architecture. As described in the public documentation, it activates an efficient subset of parameters per request and supports long-context Chinese reasoning (up to 128K tokens) (Sandmann et al., 2025).
Doubao (ByteDance). Following a mid-2025 architecture upgrade and extensive real-world usage reported by the provider, Doubao has been described as strong in conversational fluency and stability in multi-turn Chinese dialogue. It was included to reflect performance under mainstream commercial deployment conditions (Liu, 2025).
Kimi K2 (Moonshot AI). Released in July 2025, Kimi K2 emphasizes long-context understanding and agentic collaboration. As an open-weight MoE model (1T parameters as reported), it has been described as performing well on instruction following and code generation tasks (e.g., SWE-bench Verified, as reported by the provider) and was included as a reference for long-document analysis and code-assisted workflows (Gibney, 2025).
To approximate typical parent-facing usage without a medical background, we used a zero-shot setting, providing no additional case details and assigning no expert role. We did not add any customized system prompts or fixed role instructions; if a platform imposed a default system prompt, it was left unchanged. Each question was entered in a new conversation window to prevent carryover effects from conversational memory and context. All queries were submitted in Chinese using identical wording across platforms.
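In the study itself, queries were entered manually through each platform’s web interface. Purely as an illustration of the single-turn, zero-shot protocol (one fresh conversation per question, no system prompt, identical wording, timed responses), an equivalent API-based workflow might look like the sketch below; the client library choice and the placeholder model identifier are assumptions, not how the data were actually collected.

```python
import time
from openai import OpenAI  # illustrative chat-completion client; any API with the same shape would do

client = OpenAI()  # assumes an API key is configured in the environment; placeholder only

def ask_single_turn(question: str, model: str = "placeholder-model-id") -> tuple[str, float]:
    """Submit one question in a fresh, single-turn context and record the response time."""
    start = time.perf_counter()
    # No system prompt or role instruction is added, mirroring the zero-shot setting;
    # each call carries only the current question, so there is no conversational carryover.
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    elapsed = time.perf_counter() - start
    return completion.choices[0].message.content, elapsed

# responses = [ask_single_turn(q) for q in questions]  # one question, one response
```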
2.4 Data collection and data management
Based on the predefined list of 30 standardized consultation questions on CC, the research team collected responses from five mainstream LLMs through their most up-to-date publicly accessible web interfaces between 13 and 15 December 2025. To minimize investigator interference and approximate naturalistic interaction by non-medical users, no additional system prompt or fixed role instruction was applied throughout the querying process. To prevent potential carryover effects from prior conversations, the session was reset before each new question (i.e., by clearing the chat history or initiating a new conversation), thereby ensuring that each response was generated under an independent conversational context.
Data collection was performed independently by two investigators using the identical question list on each platform, and the original outputs were archived separately. The two records were subsequently consolidated to create the complete response set for each model. To ensure accuracy and consistency, two additional researchers independently verified all responses line by line; any discrepancies were corrected by reference to the original platform output. Finally, all model responses were transcribed and entered into Microsoft Excel for centralized management and subsequent analyses. For metric extraction, an online text-analysis tool was used to automatically calculate Characters, Words, and Sentences for each response. The number of Tables contained in each response was identified and tallied manually by investigators according to a prespecified rule set and recorded in the spreadsheet for comparative analyses.
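For readers who wish to reproduce comparable counts locally, the sketch below approximates the metric extraction step in Python. The exact segmentation rules of the online text-analysis tool and the prespecified manual table-counting rule set are not reported, so the heuristics here (token and sentence regular expressions, pipe-row table detection) are assumptions for illustration only.

```python
import re

def text_metrics(response: str) -> dict:
    """Approximate counts of characters, words, sentences, paragraphs, and Markdown-style tables.

    Segmentation rules are illustrative assumptions; the study used an online
    text-analysis tool plus a prespecified manual rule set for table counting.
    """
    lines = response.split("\n")
    characters = len(response.replace("\n", ""))
    # Treat each CJK character and each Latin/numeric token as one "word" (Chinese responses).
    words = len(re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", response))
    # Sentence boundaries: runs of Chinese or Latin terminal punctuation.
    sentences = len(re.findall(r"[。！？.!?]+", response))
    paragraphs = sum(1 for ln in lines if ln.strip())  # non-empty lines as a proxy for paragraphs
    # Crude table heuristic: a run of >= 2 consecutive pipe-delimited rows counts as one table.
    tables, run = 0, 0
    for ln in lines + [""]:
        if re.match(r"\s*\|.*\|\s*$", ln):
            run += 1
        else:
            tables, run = tables + (run >= 2), 0
    return {"characters": characters, "words": words, "sentences": sentences,
            "paragraphs": paragraphs, "tables": tables}

# Example usage: metrics = [text_metrics(r) for r in model_responses]
```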
2.5 Evaluation protocol and metrics
To ensure objectivity, we assembled an expert panel comprising two senior ophthalmologists. The reviewers were blinded to model identity and independently scored only de-identified response texts. A modified 5-point Likert scale (1–5) was used to evaluate each response across five prespecified domains:
Accuracy: the correctness and scientific validity of the medical information provided. Given the time-sensitive nature of CC management, any erroneous guidance regarding surgical timing or indications for intraocular lens implantation could lead to serious consequences.
Logic: the quality of causal reasoning and structural organization. For caregivers, understanding the reasoning chain (e.g., why early surgery is required within the critical period of visual development) is essential for informed decision-making.
Coherence: the fluency and naturalness of expression, including whether the response contains redundancy, contradictions, or fragmented statements.
Safety: the potential for harm and whether appropriate risk warnings are provided. In the context of CC, responses recommending vague “watchful waiting” that could result in missing the window for amblyopia prevention or treatment were considered highly unsafe.
Content accessibility: the extent to which the response is understandable to caregivers without a medical background (i.e., readability). This domain is particularly relevant to alleviating parental anxiety and improving adherence.
For each domain, scoring anchors (1–5) and illustrative examples were predefined. When the two raters differed by ≥ 2 points on any domain, they first discussed to reach consensus; if disagreement persisted, adjudication was performed by a third expert. After scoring, inter-rater agreement for each domain was assessed using the quadratic weighted Cohen’s kappa. All evaluations and interactions were conducted in a controlled online environment following standardized operating procedures, including predefined scoring criteria and a prespecified dispute-resolution process, to minimize assessment bias.
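Assuming the two raters’ scores for one domain are stored as parallel integer arrays, the quadratic weighted Cohen’s kappa and the ≥2-point adjudication trigger could be computed as in the following sketch (scikit-learn implementation; the rating values shown are illustrative, not study data).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Illustrative ratings: one entry per response, scores 1-5 from each blinded rater for one domain.
rater1 = np.array([4, 5, 3, 4, 4, 2, 5, 4])
rater2 = np.array([4, 4, 3, 5, 4, 4, 5, 4])

# Quadratic weighting penalizes large disagreements more heavily than adjacent-category ones.
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.3f}")

# Flag responses where the raters differ by >= 2 points; per protocol, these go to consensus
# discussion and, if unresolved, to third-expert adjudication.
needs_adjudication = np.where(np.abs(rater1 - rater2) >= 2)[0]
print("items requiring consensus/adjudication:", needs_adjudication.tolist())
```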
2.6 Statistical analysis
All statistical analyses were performed using IBM SPSS Statistics version 27.0 (IBM Corp., Armonk, NY, United States of America). Inter-rater reliability of the ratings was assessed using the quadratic weighted Cohen’s kappa. Normality was evaluated with the Shapiro–Wilk test. Normally distributed continuous variables are presented as mean ± standard deviation, whereas non-normally distributed variables are reported as median (interquartile range). Homogeneity of variance was assessed using Levene’s test. Comparisons across the five models were conducted using one-way analysis of variance (ANOVA) for parametric variables and the Kruskal–Wallis H test for non-parametric variables. When an overall significant difference was detected, post hoc pairwise comparisons with Bonferroni correction were performed. A two-sided P value <0.05 was considered statistically significant.
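A condensed sketch of this comparison workflow (normality and variance checks, overall five-group test, and Bonferroni-corrected pairwise follow-up) is shown below in Python rather than SPSS. Note that SciPy does not expose SPSS’s Dunn-type post hoc procedure directly, so Bonferroni-adjusted Mann–Whitney U tests stand in here as an assumption; the overall logic, not the exact SPSS syntax, is what is illustrated.

```python
from itertools import combinations
from scipy import stats

def compare_models(groups: dict[str, list[float]], alpha: float = 0.05) -> None:
    """Overall five-group comparison with Bonferroni-corrected pairwise follow-up (sketch)."""
    samples = list(groups.values())
    normal = all(stats.shapiro(g).pvalue > alpha for g in samples)       # Shapiro-Wilk per group
    equal_var = stats.levene(*samples).pvalue > alpha                    # Levene's test

    if normal and equal_var:
        overall = stats.f_oneway(*samples)    # one-way ANOVA for parametric variables
    else:
        overall = stats.kruskal(*samples)     # Kruskal-Wallis H test otherwise
    print(f"overall: statistic = {overall.statistic:.2f}, P = {overall.pvalue:.4f}")

    if overall.pvalue < alpha:
        pairs = list(combinations(groups, 2))
        m = len(pairs)                        # Bonferroni correction factor (10 pairs for 5 models)
        for a, b in pairs:
            if normal and equal_var:
                p = stats.ttest_ind(groups[a], groups[b]).pvalue
            else:
                p = stats.mannwhitneyu(groups[a], groups[b]).pvalue
            print(f"{a} vs {b}: adjusted P = {min(p * m, 1.0):.4f}")
```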
3 Results
3.1 Output efficiency and structural characteristics of responses
A total of 150 model outputs were obtained in this study. Significant between-model differences were observed in Response times and key structural characteristics of the generated texts (Table 1; all P < 0.001). With respect to Response times, ChatGPT-5.2 had the shortest mean response time (17.94 ± 5.11), which was significantly faster than Gemini 3 Pro, DeepSeek-V3.1, Doubao, and Kimi K2 (all P < 0.001). Gemini 3 Pro (24.27 ± 3.75) was also significantly faster than DeepSeek-V3.1 (41.46 ± 3.22), Doubao (30.44 ± 5.64), and Kimi K2 (40.02 ± 4.67) (Tables 2–4; all P < 0.001). No significant difference in response times was found between DeepSeek-V3.1 and Kimi K2 (P = 0.223).
Table 1. Response times and textual output characteristics across five LLMs.

| Models | Response times | Words | Characters | Paragraphs | Sentences | Tables |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | 17.94 ± 5.11 | 892.93 ± 206.10 | 1,053.53 ± 220.05 | 51.77 ± 5.16 | 64.60 ± 11.90 | 2.00 (1.00, 2.00) |
| Gemini 3 Pro | 24.27 ± 3.75 | 1,064.47 ± 243.05 | 1,119.80 ± 212.97 | 38.03 ± 7.76 | 49.73 ± 9.37 | 1.00 (1.00, 1.25) |
| DeepSeek-V3.1 | 41.46 ± 3.22 | 1,456.93 ± 224.99 | 1,672.67 ± 214.98 | 43.37 ± 9.80 | 77.60 ± 7.08 | 0.00 (0.00, 0.00) |
| Doubao | 30.44 ± 5.64 | 787.00 ± 269.01 | 830.63 ± 249.08 | 33.87 ± 4.11 | 41.20 ± 6.71 | 0.00 (0.00, 1.00) |
| Kimi K2 | 40.02 ± 4.67 | 640.83 ± 252.95 | 790.87 ± 232.99 | 21.83 ± 7.58 | 32.73 ± 8.19 | 0.00 (0.00, 1.00) |
| P-values | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
Data are presented as mean ± SD for normally distributed variables and as median (IQR) for non-normally distributed variables. Normality was assessed using the Shapiro–Wilk test, and homogeneity of variance was assessed using Levene’s test. Comparisons across groups were performed using one-way ANOVA for parametric variables and the Kruskal–Wallis H test for non-parametric variables. When an overall significant difference was detected, post hoc pairwise comparisons with Bonferroni correction were performed. Values in bold indicate statistical significance (P < 0.05).
Table 2. Comparative analysis of textual length: Words and Characters.

| Characters \ Words | ChatGPT-5.2 | Gemini 3 Pro | DeepSeek-V3.1 | Doubao | Kimi K2 |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | — | 0.040 | <0.001 | 0.430 | <0.001 |
| Gemini 3 Pro | 0.259 | — | <0.001 | <0.001 | <0.001 |
| DeepSeek-V3.1 | <0.001 | <0.001 | — | <0.001 | <0.001 |
| Doubao | 0.002 | <0.001 | <0.001 | — | 0.130 |
| Kimi K2 | <0.001 | <0.001 | <0.001 | 0.497 | — |
Pairwise comparisons were performed with Bonferroni correction after a significant overall test. The upper triangle shows adjusted P values for Words, and the lower triangle shows adjusted P values for Characters. Values in bold indicate statistical significance (P < 0.05).
Table 3. Comparative analysis of text segmentation: Paragraphs and Sentences.

| Sentences \ Paragraphs | ChatGPT-5.2 | Gemini 3 Pro | DeepSeek-V3.1 | Doubao | Kimi K2 |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | — | <0.001 | <0.001 | 0.001 | <0.001 |
| Gemini 3 Pro | <0.001 | — | 0.040 | 0.088 | <0.001 |
| DeepSeek-V3.1 | <0.001 | <0.001 | — | <0.001 | <0.001 |
| Doubao | <0.001 | 0.001 | <0.001 | — | <0.001 |
| Kimi K2 | <0.001 | <0.001 | <0.001 | <0.001 | — |
Pairwise comparisons were performed with Bonferroni correction after a significant overall test. The upper triangle shows adjusted P values for Paragraphs, and the lower triangle shows adjusted P values for Sentences. Values in bold indicate statistical significance (P < 0.05).
Table 4. Comparative analysis of speed and formatting: Response times and Tables.

| Tables \ Response times | ChatGPT-5.2 | Gemini 3 Pro | DeepSeek-V3.1 | Doubao | Kimi K2 |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | — | <0.001 | <0.001 | <0.001 | <0.001 |
| Gemini 3 Pro | 0.087 | — | <0.001 | <0.001 | <0.001 |
| DeepSeek-V3.1 | <0.001 | <0.001 | — | <0.001 | 0.223 |
| Doubao | <0.001 | 0.001 | 1.000 | — | <0.001 |
| Kimi K2 | <0.001 | <0.001 | 1.000 | 1.000 | — |
Pairwise comparisons were performed with Bonferroni correction after a significant overall test. The upper triangle shows adjusted P values for Response times, and the lower triangle shows adjusted P values for Tables. Values in bold indicate statistical significance (P < 0.05).
In terms of response length and structure, DeepSeek-V3.1 generated the largest outputs, with the highest Words (1,456.93 ± 224.99), Characters (1,672.67 ± 214.98), and Sentences (77.60 ± 7.08), which were significantly greater than those of the other models overall (Tables 2–4; most pairwise comparisons P < 0.001). In contrast, Kimi K2 produced the smallest outputs (Words 640.83 ± 252.95; Characters 790.87 ± 232.99; Sentences 32.73 ± 8.19) (Table 1). Notably, ChatGPT-5.2 and Doubao did not differ significantly in Words (P = 0.430), but they differed significantly in Characters, Paragraphs, and Sentences (P = 0.002, P = 0.001, and P < 0.001, respectively) (Tables 2–4). In addition, ChatGPT-5.2 yielded the highest number of Paragraphs (51.77 ± 5.16), which was significantly higher than that of most other models (Tables 2–4; most P < 0.001).
Regarding the propensity to generate Tables, ChatGPT-5.2 had the highest median number of tables [2.00 (1.00, 2.00)], followed by Gemini 3 Pro [1.00 (1.00, 1.25)], whereas DeepSeek-V3.1 produced no tables [0.00 (0.00, 0.00)]; Doubao and Kimi K2 also had a median of 0.00 tables (Table 1). Pairwise comparisons showed that ChatGPT-5.2 generated significantly more tables than DeepSeek-V3.1, Doubao, and Kimi K2 (all P < 0.001), whereas no significant differences were observed among DeepSeek-V3.1, Doubao, and Kimi K2 (all P = 1.000) (Tables 2–4). The difference in the number of tables between ChatGPT-5.2 and Gemini 3 Pro did not reach statistical significance (P = 0.087) (Figure 2).

Figure 2. Response times and textual output characteristics across five large language models (LLMs). (a) Response times; (b) Words; (c) Characters; (d) Sentences; (e) Paragraphs; (f) Tables.
3.2 Comparison of content quality scores across five domains
Before comparing content quality across models, we first assessed inter-rater agreement between the two ophthalmologist reviewers. The quadratic weighted Cohen’s kappa values for Accuracy, Logic, Coherence, Safety, and Content accessibility were 0.767, 0.712, 0.686, 0.759, and 0.716, respectively (Figure 3), indicating good inter-rater reliability and supporting subsequent between-model comparisons based on this rating framework. Significant between-model differences were observed in score distributions across all five domains (Accuracy: H = 41.15, P < 0.001; Logic: H = 32.95, P < 0.001; Coherence: H = 27.79, P < 0.001; Safety: H = 31.72, P < 0.001; Content accessibility: H = 41.33, P < 0.001) (Table 5).

Figure 3. Quadratic weighted Cohen’s kappa across five evaluation dimensions.
Table 5. Comparison of five LLMs across five content quality dimensions.

| Models | Accuracy | Logic | Coherence | Safety | Content accessibility |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | 4.00 (4.00, 5.00) | 4.00 (4.00, 5.00) | 4.00 (4.00, 5.00) | 4.00 (4.00, 4.25) | 4.50 (4.00, 5.00) |
| Gemini 3 Pro | 4.00 (4.00, 5.00) | 4.00 (4.00, 5.00) | 4.00 (4.00, 5.00) | 4.00 (4.00, 5.00) | 5.00 (4.00, 5.00) |
| DeepSeek-V3.1 | 4.00 (3.00, 4.00) | 4.00 (3.00, 4.00) | 4.00 (3.00, 4.25) | 4.00 (3.00, 4.00) | 4.00 (3.75, 4.00) |
| Doubao | 4.00 (4.00, 4.00) | 4.00 (3.00, 4.00) | 4.00 (4.00, 4.25) | 4.00 (3.75, 4.00) | 4.00 (3.00, 4.00) |
| Kimi K2 | 3.50 (3.00, 4.00) | 4.00 (2.75, 4.00) | 3.50 (2.75, 4.00) | 4.00 (3.00, 4.00) | 4.00 (3.00, 4.00) |
| H-values | 41.15 | 32.95 | 27.79 | 31.72 | 41.33 |
| P-values | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
Data are presented as median (IQR). Overall comparisons were performed using the Kruskal–Wallis H test. H-values and corresponding two-sided P values are shown. When an overall significant difference was detected, post hoc pairwise comparisons with Bonferroni correction were performed. Values in bold indicate statistical significance (P < 0.05).
Overall, ChatGPT-5.2 and Gemini 3 Pro achieved consistently high median scores across all domains. The median (interquartile range) scores were 4.00 (4.00, 5.00) for Accuracy, 4.00 (4.00, 5.00) for Logic, and 4.00 (4.00, 5.00) for Coherence in both models. For Safety, the median score was 4.00 (4.00, 4.25) for ChatGPT-5.2 and 4.00 (4.00, 5.00) for Gemini 3 Pro. For Content accessibility, Gemini 3 Pro achieved the highest median score of 5.00 (4.00, 5.00), whereas ChatGPT-5.2 scored 4.50 (4.00, 5.00) (Table 5). Pairwise comparisons indicated no statistically significant differences between ChatGPT-5.2 and Gemini 3 Pro across any domain (all P = 1.000) (Tables 6–10).
Table 6. Pairwise comparisons of Accuracy across five LLMs.

| Z-values \ P-values | ChatGPT-5.2 | Gemini 3 Pro | DeepSeek-V3.1 | Doubao | Kimi K2 |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | — | 1.000 | 0.006 | 0.203 | 0.000 |
| Gemini 3 Pro | 0.000 | — | 0.006 | 0.203 | 0.000 |
| DeepSeek-V3.1 | 3.421 | 3.421 | — | 1.000 | 0.652 |
| Doubao | 2.320 | 2.320 | 1.101 | — | 0.032 |
| Kimi K2 | 5.265 | 5.265 | 1.844 | 2.945 | — |
Pairwise comparisons for Accuracy were performed with Bonferroni correction after a significant overall Kruskal–Wallis H test. The upper triangle presents adjusted P values, and the lower triangle presents the standardized test statistics (Z-values). Values in bold indicate statistical significance (P < 0.05).
Table 7. Pairwise comparisons of Logic across five LLMs.

| Z-values \ P-values | ChatGPT-5.2 | Gemini 3 Pro | DeepSeek-V3.1 | Doubao | Kimi K2 |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-5.2 | — | 1.000 | 0.003 | 0.003 | 0.000 |
| Gemini 3 Pro | 0.360 | — | 0.012 | 0.011 | 0.001 |
| DeepSeek-V3.1 | 3.590 | 3.230 | — | 1.000 | 1.000 |
| Doubao | 3.613 | 3.253 | 0.023 | — | 1.000 |
| Kimi K2 | 4.315 | 3.955 | 0.726 | 0.703 | — |