Clinical Context Variables Collectively Rival Model Choice in Embedding-Based Retrieval: Multi-Corpus Benchmark Study


Introduction

Background

Retrieval-augmented generation (RAG) has emerged as a leading strategy for grounding large language model (LLM) outputs in verifiable clinical knowledge, addressing persistent concerns about hallucinations and outdated training data [,]. In a typical clinical RAG pipeline, a user query is encoded by an embedding model, matched against a vector index of clinical documents, and the top-ranked passages are injected into an LLM prompt to guide answer generation. The retrieval component is foundational: if the correct document is not retrieved, no amount of generative sophistication can recover it.

Despite the centrality of the retrieval step, embedding model selection for clinical RAG remains largely guided by general-domain leaderboards such as the Massive Text Embedding Benchmark (MTEB) []. The implicit assumption is that models ranked highly on news articles, Wikipedia passages, and web queries will transfer effectively to clinical documentation. However, decades of health services research have demonstrated that clinical practice is profoundly heterogeneous. The Dartmouth Atlas project documented 4- to 10-fold variation in surgical procedure rates across hospital referral regions in the United States, with similar magnitudes observed internationally [,]. These variations are idiosyncratic and condition-specific rather than reflecting a general tendency toward aggressive or conservative care [].

This practice heterogeneity directly affects clinical documentation. The structure and semantics of clinical notes vary widely across electronic health record (EHR) systems, sites, and institutions [], as shown in a national analysis of over 215,000 ambulatory physicians []. Functional status documentation is context-specific, with variations driven by source instruments, information providers, practice settings, and institutions []. This heterogeneity poses a direct challenge to natural language processing model portability [] and, by extension, to embedding model generalization in clinical retrieval.

Recent work on clinical RAG has made important progress. The Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE) benchmark evaluated 41 combinations of corpora, retrievers, and backbone LLMs on 7663 medical questions and answers (QA), finding that corpus selection significantly affected performance and that no single configuration dominated []. A 2025 systematic review of 70 RAG-in-health care studies identified persistent challenges, including retrieval noise, domain shift, and limited evaluation frameworks []. Other studies have examined chunking strategies [] and hallucination mitigation [] for clinical RAG. However, existing benchmarks primarily evaluate retrieval via downstream QA accuracy on medical examination questions rather than through direct retrieval metrics on clinical documentation. To the author’s knowledge, no study has systematically benchmarked embedding models head-to-head on the heterogeneous clinical text that real-world RAG systems must process.

This paper addresses that gap with a controlled, multicorpus benchmark that isolates the retrieval component of clinical RAG. The study evaluates 10 embedding models (plus 2 ablation variants and Best Match 25 [BM25]) across 3 clinical corpora under 294 experimental conditions. The central hypothesis is that clinical context variables—primarily corpus type (encompassing differences in document length, specialty mix, and structural characteristics) and query format—produce effect sizes on retrieval performance comparable to or larger than those of embedding model choice. If confirmed, this would imply that local validation against institution-specific documentation is not merely a best practice but a methodological requirement for responsible clinical RAG deployment.

Related Work

Clinical Practice and Documentation Heterogeneity

The observation that medical practice varies substantially across geographies and institutions is among the most robust findings in health services research. Wennberg and Gittelsohn’s [] foundational small-area analysis in Vermont revealed large variations in hospitalizations, surgical procedures, and expenditures across hospital service areas that could not be explained by differences in population health needs. The Dartmouth Atlas Project subsequently documented that these patterns persist nationally, with surgical procedure rates varying 4- to 8-fold across 306 hospital referral regions [,]. International comparisons showed that although absolute rates differ across countries, the relative degree of within-country variation is remarkably consistent, suggesting that clinical decision-making paradigms rather than environmental factors drive the phenomenon [].

This variation in practice leads to corresponding variation in documentation. A study of EHR migration at 2 Mayo Clinic sites found that clinical note structure and semantics varied considerably across EHR implementations, with significant effects on natural language processing model portability []. Six distinct note composition strategies were identified nationally, with differential prevalence across specialties []. Functional status documentation was shown to be context-specific, with variations driven by source instruments, care settings, and institutional culture []. Together, these findings establish that clinical text available for RAG indexing reflects local practices, templates, and documentation cultures that differ markedly across institutions.

Embedding Models for Biomedical Text

Biomedical text embeddings have evolved through several generations. Domain-specific encoders, such as Biomedical Bidirectional Encoder Representations from Transformers (BioBERT) [] and ClinicalBERT [], applied continued pretraining of Bidirectional Encoder Representations from Transformers (BERT) on PubMed and Medical Information Mart for Intensive Care III (MIMIC-III) clinical notes. Purpose-built biomedical retrievers, including Biomedical Learning of Ontological Representations from Definitions (BioLORD) [] and Medical Contrastive Pre-trained Transformer (MedCPT) [], were trained with contrastive objectives on biomedical literature. General-purpose embedding models, such as BAAI (Beijing Academy of Artificial Intelligence) General Embedding (BGE) [] (part of the FlagEmbedding family described in []), General Text Embeddings (GTE) [], and Nomic Embed [], have achieved strong MTEB performance. Recent LLM encoders (E5-Mistral-7B []) and commercial application programming interfaces (APIs; OpenAI text-embedding-3-small [], hereafter OpenAI-emb3-small) have further expanded the landscape. MTEB [] provides standardized comparisons, but its clinical coverage is limited to general biomedical text rather than institutional documentation.

RAG Evaluation in Health Care

MIRAGE [] introduced the first systematic benchmark for medical RAG, evaluating RAG across 5 medical QA datasets. Its key finding—that corpus choice significantly affects downstream accuracy—is consistent with the present hypothesis. However, MIRAGE evaluates end-to-end QA accuracy on examination-style questions, conflating retrieval and generation quality. A 2025 evaluation of RAG variants for clinical decision support tested 12 pipeline configurations on 250 patient vignettes []. Studies on chunking for clinical RAG have demonstrated that adaptive strategies can improve precision []. This work complements these studies by isolating the retrieval component and systematically varying the clinical context rather than the pipeline architecture.


Methods

Corpora

Three corpora were selected to represent distinct clinical documentation contexts.

MTSamples (n=500) comprises deidentified medical transcription samples spanning 40 clinical specialties []. The full MTSamples dataset contains approximately 5000 documents; 500 were randomly sampled and stratified by specialty to preserve the original distribution. Documents shorter than 50 tokens or exact duplicates were excluded before sampling. The corpus includes operative reports, consultation notes, discharge summaries, and history-and-physical examinations, representing real-world dictated clinical documentation with varied formatting and narrative structures. The median document length was 391 (IQR 297-509) tokens.

PubMed Central (PMC)-Patients (n=500) comprises patient case descriptions from the PMC-Patients dataset [], which aggregates structured case reports from PubMed Central. The source dataset contains approximately 167,000 patient summaries. A random sample of 500 English-language patient summaries was drawn, excluding entries shorter than 50 tokens, duplicates, and entries without a primary diagnosis label. These documents follow standardized academic reporting formats with explicit section headers. The median document length was 397 (IQR 305-495) tokens.

Synthetic clinical notes (n=500) were generated using Mistral-7B-Instruct-v0.2 (mistralai/Mistral-7B-Instruct-v0.2) with structured prompts specifying the clinical specialty (see the “Use of Language Models for Synthetic Data Generation” section and for prompt templates). Each note was generated independently with temperature 0.8 and top_p 0.9 to encourage diversity. All outputs were manually screened to confirm the absence of real patient identifiers. The median document length was 421 (IQR 312-518) tokens.

To characterize the synthetic corpus relative to the other 2, I computed corpus-level lexical statistics on a 100-document subset (5 per specialty). The synthetic corpus exhibited high structural uniformity: document length SD of 26 tokens (vs SD 143 for MTSamples and SD 163 for PMC-Patients, as estimated from IQR), and a mean document-level type-token ratio of 0.597 (SD 0.048). Cross-specialty vocabulary overlap was high (mean pairwise Jaccard 0.266; range 0.179-0.434), with 71%-94% of each specialty’s vocabulary appearing in at least one other specialty (per-specialty counts: 350-502 shared of 420-545 total unique terms). Term-frequency entropy was 9.14 bits (of 11.44 maximum), indicating moderate lexical diversity but substantially less than would be expected from real institutional EHR data. The structural regularity of the synthetic corpus, reflected in identical section-based and 512-token chunk counts (Table 2), likely contributes to its uniformly low retrieval performance and indicates that it functions more as a stress test of cross-specialty disambiguation than as a representative third clinical documentation context.

A potential circularity concern, namely that Mistral-generated text might preferentially advantage architecturally related models, was tested by comparing model rank positions across corpora: E5-Mistral-7B-ablation maintained rank 8 across all 3 corpora, and Phi-3-mini dropped from rank 9 (MTSamples) to rank 11 (synthetic). No Mistral-related model showed a relative rank improvement on the synthetic corpus, indicating no evidence of circularity.
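The document-level type-token ratio and term-frequency entropy reported above can be illustrated with a minimal sketch. It assumes tokens are already lowercased and split, and that the entropy maximum is log2 of the vocabulary size; the paper does not state its exact tokenization or normalization, and the function names are hypothetical.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0.0 for an empty document."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def term_frequency_entropy(tokens):
    """Shannon entropy, in bits, of the term-frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def max_entropy_bits(tokens):
    """Assumed upper bound: log2 of the vocabulary size (all terms equally frequent)."""
    vocabulary = set(tokens)
    return math.log2(len(vocabulary)) if vocabulary else 0.0


if __name__ == "__main__":
    tokens = "patient denies chest pain patient reports dyspnea".split()
    print(type_token_ratio(tokens))
    print(term_frequency_entropy(tokens), max_entropy_bits(tokens))
```

Comparing the observed entropy with its maximum, as the text does (9.14 of 11.44 bits), gives a scale-free sense of how evenly the vocabulary is used.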

Query Generation

Deterministic Generation of Keyword and Natural-Language Query Formats

For each document, 2 query types were generated using deterministic heuristics (no language model was used for query generation in the main benchmark). Keyword queries comprised 3-6 clinical terms extracted heuristically from each document by selecting capitalized medical terms and available metadata fields, such as specialty labels (eg, “Cardiology Dyspnea Hypertension HbA1c”), simulating search-box behavior in clinical information systems. Natural language queries consisted of the first 1-2 sentences of each document’s clinical narrative (eg, the opening of the History of Present Illness), representing a high-overlap retrieval scenario. For MTSamples, the description field was used when available. Both query types were derived directly from the target document text.
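A rough sketch of the 2 deterministic query heuristics is given here for illustration only. The regular expressions, the 6-term cap, and the function names are assumptions; the paper does not publish its exact extraction rules.

```python
import re

# Assumed pattern: a capitalized token of at least 3 characters (eg, "Dyspnea", "HbA1c").
CAPITALIZED_TERM = re.compile(r"\b[A-Z][A-Za-z0-9]{2,}\b")
# Assumed sentence boundary: whitespace that follows ., !, or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def keyword_query(text, specialty=None, max_terms=6):
    """Join the specialty label (if any) with the first distinct capitalized terms."""
    terms = [specialty] if specialty else []
    for match in CAPITALIZED_TERM.finditer(text):
        term = match.group(0)
        if term not in terms:
            terms.append(term)
        if len(terms) == max_terms:
            break
    return " ".join(terms)


def natural_language_query(text, max_sentences=2):
    """Return the first 1-2 sentences of the narrative verbatim."""
    sentences = SENTENCE_BOUNDARY.split(text.strip())
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    note = ("Patient presents with Dyspnea and Hypertension. "
            "HbA1c was elevated. Follow up in two weeks.")
    print(keyword_query(note, specialty="Cardiology"))
    print(natural_language_query(note))
```

On the example note, the keyword heuristic also picks up sentence-initial words such as "Patient" and "Follow", which illustrates how capitalization-based extraction can admit nondiagnostic terms.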

Known-Item Retrieval Design

As both query types were derived from target documents, the retrieval task constitutes known-item retrieval: queries are designed to match a specific known document. This design was chosen because it provides an unambiguous ground truth (each query has exactly 1 correct document) without requiring expert relevance judgments, which would be prohibitively expensive at the scale of 294 conditions. The trade-off is that absolute mean reciprocal rank (MRR) values are not directly comparable to production retrieval settings, where queries are formulated independently of target documents and performance would differ for all models; the primary analytical object is relative model rankings, which are interpretable under this design.

This approach increases lexical overlap between queries and targets, which may benefit lexical methods such as BM25 and embedding models with strong lexical dependence. Using verbatim opening sentences as natural language queries and heuristically extracted terms as keyword queries produces particularly high overlap, establishing a controlled high-overlap retrieval scenario. However, heuristic term extraction also introduces lexical noise (eg, nondiagnostic capitalized terms), which can depress dense retrieval performance; the net effect on absolute scores is therefore not necessarily upward for all models (see the “Validation With Reduced-Lexical-Dependence Queries” section). The near-perfect BM25 performance on PMC-Patients (MRR@10=0.999 for natural language queries) reflects this design: the opening sentences of structured case reports contain rare medical terms directly present in target documents, giving BM25 an extreme advantage through inverse document frequency weighting.

Relative comparisons across models and corpora remain valid within this design, as all conditions share the same query derivation method. A validation experiment using LLM-generated reduced-lexical-dependence queries supported the stability of model rankings across query derivations (see the “Validation With Reduced-Lexical-Dependence Queries” section).

Models

Ten embedding models, plus BM25, were evaluated across 6 architectural categories (Table 1). All models were evaluated using vendor-recommended pooling strategies and task-specific prefixes, as detailed in . Two ablation variants were also included: Nomic-embed-text without its search_query prefix, and E5-Mistral-7B using mean pooling instead of the vendor-specified last-token (end of sequence [EOS]) pooling, yielding 12 embedding configurations and 13 total retrieval configurations, including BM25.

Table 1. Model configurations and architectural categories.

| Model | Category | Parameters | Dimensionality | Source/notes |
| --- | --- | --- | --- | --- |
| BM25^a | Lexical baseline | N/A^b | N/A | Okapi BM25 (k1=1.5, b=0.75) |
| BioBERT^c | Domain encoder | 110 million | 768 | dmis-lab/biobert-v1.1 |
| ClinicalBERT | Domain encoder | 110 million | 768 | medicalai/ClinicalBERT |
| BioLORD^d-2023 | Biomedical retriever | 110 million | 768 | FremyCompany/BioLORD-2023 |
| MedCPT^e | Biomedical retriever | 110 million | 768 | ncbi/MedCPT-Query/Article-Encoder^f |
| BGE^g-base-en-v1.5 | General embedding | 110 million | 768 | BAAI/bge-base-en-v1.5 |
| GTE^h-base | General embedding | 137 million | 768 | thenlper/gte-base |
| Nomic-embed-v1.5 | General embedding | 137 million | 768 | nomic-ai/nomic-embed-text-v1.5 |
| OpenAI-emb3-small | General application programming interface | N/A | 1536 | Application programming interface (cl100k_base tokenizer) |
| E5-Mistral-7B | General large language model | 7 billion | 4096 | intfloat/e5-mistral-7b-instruct |
| Phi-3-mini-128k | General large language model | 3.8 billion | 3072 | microsoft/Phi-3-mini-128k-instruct |

^a BM25: Best Match 25.

^b N/A: not applicable.

^c BioBERT: Biomedical Bidirectional Encoder Representations from Transformers.

^d BioLORD: Biomedical Learning of Ontological Representations from Definitions.

^e MedCPT: Medical Contrastive Pre-trained Transformer.

^f MedCPT uses a dual-encoder architecture with separate query (ncbi/MedCPT-Query-Encoder) and article (ncbi/MedCPT-Article-Encoder) encoders. Performance reflects this asymmetric design rather than a single shared representation space.

^g BGE: BAAI General Embedding.

^h GTE: General Text Embeddings.

Chunking Strategies

Comparison of Chunking Strategies and Tokenization Standardization

Each corpus was indexed using 4 chunking strategies: (1) the full document as a single vector; (2) section-based splitting at detected clinical section headers (eg, “History of Present Illness,” “Assessment/Plan”); (3) fixed 512-token nonoverlapping chunks that respect sentence boundaries; and (4) fixed 256-token nonoverlapping chunks that respect sentence boundaries. Token counts were computed with the cl100k_base tokenizer (tiktoken library, version 0.5.1). This tokenizer was chosen for its stability and widespread adoption, providing consistent approximate token budgeting across conditions; word count–based sizing would not account for subword segmentation differences, while model-specific tokenizers would confound chunking with model-dependent tokenization. The embedding models’ own tokenizers handle input encoding independently. Table 2 shows the mean number of index items per chunking strategy and corpus.
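The fixed-size, sentence-respecting strategies (3) and (4) could be implemented along the following lines. This is a sketch under stated assumptions: sentences are split with a simple regular expression, a sentence longer than the budget becomes its own oversized chunk, and the token counter defaults to whitespace splitting so the example runs without dependencies. To approximate the paper's budgeting, a cl100k_base counter from tiktoken can be passed instead.

```python
import re

SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentences(text, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into nonoverlapping chunks of at most max_tokens.

    count_tokens is pluggable. With tiktoken installed, one option is:
        enc = tiktoken.get_encoding("cl100k_base")
        count_tokens = lambda s: len(enc.encode(s))
    Per-sentence counts are summed, which only approximates the token count
    of the joined chunk.
    """
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    text = "One two three. Four five. Six seven eight nine. Ten."
    for chunk in chunk_by_sentences(text, max_tokens=5):
        print(repr(chunk))
```

Keeping the counter independent of any embedding model mirrors the rationale in the text: chunk boundaries stay identical across models, so chunking is not confounded with model-specific tokenization.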

Table 2. Mean number of index items per chunking strategy and corpus. The identical section-based and 512-token counts for the synthetic corpus (987) reflect the similar document structure: Mistral-7B–generated notes average 421 tokens with consistent section boundaries, so section-based and 512-token splitting produce nearly identical break points.

| Corpus | Full, n | Section, n | 512-token splitting, n | 256-token splitting, n |
| --- | --- | --- | --- | --- |
| MTSamples | 500 | 500 | 1067 | 2158 |
| PMC^a-Patients | 500 | 501 | 987 | 1960 |
| Synthetic | 500 | 987 | 987 | 1499 |

^a PMC: PubMed Central.

Chunking Ground Truth

For chunked conditions, a query’s target document was considered successfully retrieved if any chunk from that document appeared in the top-k results (document-level evaluation). As chunking increases the total number of index items (Table 2), chance-level retrieval performance varies across chunking conditions, although the effect is small at k=10 relative to minimum index sizes of 500.
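The document-level rule for chunked indices can be sketched as follows. The assumption here, not confirmed by the paper, is that the rank credited to a query is the position of the first chunk belonging to the target document in the chunk-level ranking, with no deduplication of chunks from other documents. The identifiers are hypothetical.

```python
def first_hit_rank(ranked_chunk_ids, chunk_to_doc, target_doc):
    """1-based position of the first chunk from target_doc, or None if absent."""
    for position, chunk_id in enumerate(ranked_chunk_ids, start=1):
        if chunk_to_doc[chunk_id] == target_doc:
            return position
    return None


def hit_at_k(ranked_chunk_ids, chunk_to_doc, target_doc, k=10):
    """True if any chunk of target_doc appears among the top-k ranked chunks."""
    return first_hit_rank(ranked_chunk_ids[:k], chunk_to_doc, target_doc) is not None


if __name__ == "__main__":
    chunk_to_doc = {"d1_c0": "d1", "d1_c1": "d1", "d2_c0": "d2", "d3_c0": "d3"}
    ranking = ["d2_c0", "d1_c1", "d3_c0", "d1_c0"]
    print(first_hit_rank(ranking, chunk_to_doc, "d1"))
    print(hit_at_k(ranking, chunk_to_doc, "d3", k=2))
```

An alternative design would collapse the ranking to unique documents before computing ranks, which could yield slightly better ranks when several chunks of one nontarget document precede the target.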

Evaluation Metrics

Primary Retrieval Metrics and Bootstrap CIs

Retrieval metrics were mean reciprocal rank at cutoff 10 (MRR@10; the primary performance measure), precision at 1 (P@1), recall at cutoffs 10, 20, 50, and 100, and normalized discounted cumulative gain at 10 (NDCG@10). Bootstrap 95% CIs (1000 resamples, percentile method) were computed for MRR@10.
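With exactly 1 relevant document per query, each metric reduces to a function of that document's rank. The sketch below illustrates this together with a percentile bootstrap over queries. The function names, the fixed seed, and the simple index-based percentile rule are assumptions rather than the paper's implementation.

```python
import math
import random


def reciprocal_rank(rank, k=10):
    """1/rank if the relevant document is within the top k, else 0."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0


def recall_at(rank, k):
    """With a single relevant document, recall@k is a hit indicator (P@1 is recall_at(rank, 1))."""
    return 1.0 if rank is not None and rank <= k else 0.0


def ndcg_single(rank, k=10):
    """Single relevant document: ideal DCG is 1, so NDCG@k = 1 / log2(1 + rank)."""
    return 1.0 / math.log2(1 + rank) if rank is not None and rank <= k else 0.0


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        mean([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = means[round(alpha / 2 * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    ranks = [1, 2, None, 4, 11]  # None means the target was not retrieved
    rr = [reciprocal_rank(r) for r in ranks]
    print("MRR@10:", mean(rr), "95% CI:", bootstrap_ci(rr))
```

Because reciprocal rank and single-document NDCG are both decreasing functions of the same rank, they order individual queries identically, although their means over queries need not order models identically.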

Supplementary analyses included (1) document-length sensitivity, with documents binned into terciles; (2) lexical overlap analysis measuring Spearman correlation between query-document Jaccard similarity (computed on lowercased, punctuation-stripped tokens with English stop words removed) and retrieval rank; (3) embedding geometry diagnostics; and (4) factorial ANOVA with η² effect sizes.
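The query-document Jaccard similarity used in the lexical overlap analysis can be sketched as below. The stop word list is a small illustrative subset chosen for this example; the paper does not specify which English stop word list it used.

```python
import string

# Illustrative subset only; the actual stop word list is not given in the text.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was", "with"}
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    stripped = text.lower().translate(PUNCTUATION_TABLE)
    return {token for token in stripped.split() if token not in STOP_WORDS}


def jaccard(query, document):
    """Size of the token-set intersection divided by the size of the union."""
    q, d = tokenize(query), tokenize(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


if __name__ == "__main__":
    print(jaccard("Chest pain, dyspnea", "The patient reports chest pain and dyspnea."))
```

Note that stripping punctuation joins hyphenated terms (eg, "follow-up" becomes "followup"), which is one of several reasonable choices.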

Embedding Geometry Metrics

Anisotropy was computed as the mean pairwise cosine similarity across 1000 randomly sampled embedding pairs; values approaching 1.0 indicate that all embeddings point in approximately the same direction, making cosine similarity nondiscriminative []. Average self-similarity is the mean cosine similarity of each document embedding to all others; high values (>0.95) indicate near-complete loss of retrieval capacity. Effective rank was computed as the exponential of the Shannon entropy of the normalized singular value spectrum of the embedding matrix, measuring the effective dimensionality used by the model []. First principal component variance ratio was computed as the proportion of total variance explained by the first principal component of the embedding matrix; higher values indicate more concentrated, less isotropic embedding distributions.
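For reference, the 4 geometry diagnostics can be written out as follows. The notation is introduced here for illustration: $\mathbf{e}_i$ is the embedding of document $i$, $N$ is the number of documents, $P$ is the set of sampled pairs, $\sigma_k$ are the singular values of the embedding matrix $E$, and $\lambda_k$ are the eigenvalues of its covariance matrix in decreasing order. Normalizing the singular values by their sum is the usual convention for effective rank and is assumed here; the text does not say whether $E$ was centered first.

```latex
% Anisotropy: mean cosine similarity over sampled pairs (|P| = 1000 in the text)
\[
\mathrm{anisotropy} = \frac{1}{|P|} \sum_{(i,j) \in P} \cos(\mathbf{e}_i, \mathbf{e}_j),
\qquad
\cos(\mathbf{e}_i, \mathbf{e}_j) =
\frac{\mathbf{e}_i^{\top} \mathbf{e}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_j \rVert}
\]

% Average self-similarity: each document against all others, then averaged
\[
\mathrm{self\text{-}sim} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N-1} \sum_{j \neq i} \cos(\mathbf{e}_i, \mathbf{e}_j)
\]

% Effective rank: exponential of the Shannon entropy of the normalized spectrum
\[
p_k = \frac{\sigma_k}{\sum_{l} \sigma_l},
\qquad
\mathrm{erank}(E) = \exp\Bigl( -\sum_{k} p_k \ln p_k \Bigr)
\]

% First principal component variance ratio
\[
\mathrm{PC1\ ratio} = \frac{\lambda_1}{\sum_{k} \lambda_k}
\]
```

Under these definitions, the effective rank equals the number of dimensions when all singular values are equal and approaches 1 when a single direction dominates.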

Experimental Procedure

All experiments were run on a single NVIDIA H100 80 GB graphics processing unit (GPU). The 12 embedding configurations were evaluated across 3 corpora × 2 query formats × 4 chunking strategies = 288 conditions. BM25 was evaluated across 3 corpora × 2 query formats × 1 (full-document only) = 6 conditions; it was restricted to full-document indexing because chunk-level BM25 retrieval introduces a passage-to-document aggregation step absent from the embedding pipeline, and because full-document BM25 represents the standard baseline in information retrieval benchmarks. Passage-level BM25 with score aggregation (eg, MaxP, SumP) is a common alternative that may interact with chunking differently than dense retrieval; this is left for future work. The total of 294 conditions reflects the complete factorial for embedding models plus the BM25 subset. For each condition, cosine similarity (or BM25 scores) was computed between query and document/chunk representations, items were ranked, and results were evaluated against the known single relevant document per query. Models were loaded sequentially, with explicit GPU memory clearing between evaluations.

BM25 was implemented using rank_bm25 (version 0.2.2) with default parameters (k1=1.5, b=0.75). It used the same lowercased, punctuation-stripped tokenization as the lexical overlap analysis. A post hoc sensitivity analysis over a grid of k1 and b values (16 parameter combinations) confirmed that BM25 performance was robust to parameter choice: the maximum MRR@10 spread across all parameter combinations was 0.038 (PMC-Patients, keyword), 0.032 (MTSamples, keyword), and 0.011 (synthetic, keyword), far smaller than cross-model differences. Default parameters fell within 0.014 of the best-performing combination in all conditions.
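The condition counts above can be checked by enumerating the design directly. The string labels below are shorthand chosen for this sketch; only the factor levels themselves come from the text.

```python
from itertools import product

EMBEDDING_CONFIGS = [
    "BioBERT", "ClinicalBERT", "BioLORD-2023", "MedCPT", "BGE-base-en-v1.5",
    "GTE-base", "Nomic-embed-text", "OpenAI-emb3-small", "E5-Mistral-7B (EOS)",
    "Phi-3-mini",
    # The 2 ablation variants
    "Nomic-embed-text (no prefix)", "E5-Mistral-7B (mean pooling)",
]
CORPORA = ["MTSamples", "PMC-Patients", "Synthetic"]
QUERY_FORMATS = ["keyword", "natural_language"]
CHUNKING = ["full", "section", "fixed_512", "fixed_256"]

# Full factorial for the embedding configurations.
embedding_conditions = list(product(EMBEDDING_CONFIGS, CORPORA, QUERY_FORMATS, CHUNKING))
# BM25 is restricted to full-document indexing.
bm25_conditions = list(product(["BM25"], CORPORA, QUERY_FORMATS, ["full"]))
all_conditions = embedding_conditions + bm25_conditions

if __name__ == "__main__":
    print(len(embedding_conditions), len(bm25_conditions), len(all_conditions))
```

Enumerating conditions this way also makes the imbalance explicit: BM25 contributes only full-document cells, which is why the primary ANOVA uses the 288 balanced embedding conditions.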

Two design constraints are relevant to interpretation. First, the single-relevant-document assumption, while standard in known-item retrieval benchmarking, differs from production settings where multiple documents may be relevant to a given query; this may differentially affect model rankings. Second, the corpora of 500 documents each are smaller than production indices; scaling effects on retrieval difficulty are not captured.

Use of Language Models for Synthetic Data Generation

Mistral-7B-Instruct-v0.2 (mistralai/Mistral-7B-Instruct-v0.2) was used to generate 500 synthetic clinical notes. The model was loaded in half-precision (float16) on a single H100 GPU. Each note was generated in an independent inference call with temperature 0.8 and top_p 0.9, using a structured prompt that specified the clinical specialty (drawn from 20 specialties; see for the prompt template). No conversation history was retained between generations to prevent cross-contamination. Notes were screened for inadvertent real patient identifiers (none were found).

For the main benchmark, query generation relied on deterministic heuristics rather than language models. Keyword queries were constructed by extracting capitalized medical terms and available metadata fields (eg, specialty labels) from each document. Natural language queries consisted of the first 1-2 sentences of each document’s clinical narrative, or the description field for MTSamples when available. This approach produces high lexical overlap between queries and target documents, creating a high-overlap retrieval scenario (see the “Query Generation” section). No language model was used for query generation in the primary evaluation.

GPT-4o (gpt-4o-2024-05-13, OpenAI) was used exclusively in the validation experiment (see the “Validation With Reduced-Lexical-Dependence Queries” section) for metadata extraction (temperature 0.0) and for metadata-only query generation (temperature 0.3). No language model was used for manuscript drafting, analysis, or interpretation of results.

Statistical Analysis

To quantify the relative contribution of each experimental factor to retrieval performance, a type II factorial ANOVA was performed on MRR@10 with fixed effects for model, corpus, query format, and chunking strategy, plus all 2-way interactions: model × corpus, model × query format, corpus × query format, corpus × chunking, model × chunking, and query format × chunking. Effect sizes were reported as η² (sum of squares for each factor divided by the total sum of squares). The primary analysis used the 288 balanced embedding model conditions (excluding BM25, which lacked chunking conditions). A sensitivity analysis including BM25 (full-document conditions only; N=78) was conducted to assess robustness. A secondary analysis replaced individual model identity (12 levels) with architectural category (6 levels: domain encoder, biomedical retriever, general embedding, general API, general LLM, and ablation) to distinguish between model choice and architectural category as explanatory factors.

This ANOVA was performed on condition-level aggregated MRR@10 values (N=288 observations) rather than on per-query reciprocal ranks. This design treats the ANOVA as a descriptive variance decomposition across experimental settings rather than a formal inferential test. As the same documents and queries contribute to all model conditions, the observations are not independent, and P values should be interpreted as indicators of relative factor importance rather than as classical hypothesis tests. P values are reported alongside η² for completeness, but effect sizes are emphasized as the primary basis for interpretation. A bootstrap sensitivity analysis (resampling conditions with replacement, 1000 iterations) confirmed that the η² decomposition is stable, with percentile 95% CIs of 0.392-0.522 for embedding model, 0.183-0.289 for corpus, 0.129-0.216 for query format, and 0.000-0.005 for chunking. Model ranking stability was assessed using Kendall τ with bootstrap 95% CIs (10,000 resamples) and Spearman ρ. All analyses were conducted in Python 3.11 (Python Software Foundation) using statsmodels 0.14, scipy 1.12, and numpy 1.26.

Ethical Considerations

This study used only publicly available or synthetically generated datasets and did not involve human research. MTSamples consists of publicly posted, deidentified medical transcriptions accessed in accordance with the site’s terms of service []. PMC-Patients includes previously published case reports from PubMed Central. Synthetic clinical notes were generated using Mistral-7B-Instruct-v0.2 and contain no real patient data. As no individually identifiable patient information was processed, institutional review board approval and informed consent were not required, and no participant compensation was applicable. All data sources are publicly accessible; no privacy-restricted data were generated, accessed, or stored.


Results

Variance Decomposition: Context Variables Collectively Match Model Choice Effects

Table 3 presents the factorial ANOVA decomposition of MRR@10 across all 288 balanced embedding conditions, including all 2-way interactions. Embedding model choice was the largest single factor (η²=0.408, 40.8% of variance), followed by corpus (η²=0.246, 24.6%) and query format (η²=0.192, 19.2%). Model × query format was notable (η²=0.029, P<.001), indicating that models differ in their sensitivity to query type. Chunking strategy contributed little to the variance decomposition, both as a main effect (η²=0.002, P=.009) and in most interactions: corpus × chunking (η²=0.001, P=.41) and model × chunking (η²=0.002, P=.99, reflecting near-zero between-cell variance across 33 degrees of freedom). Query format × chunking was small but detectable (η²=0.002, P=.003). Interactions were also observed for model × corpus (η²=0.040, P<.001) and corpus × query format (η²=0.052, P<.001). The combined model explained 97.4% of the variance (R²=0.974).

Table 3. Factorial ANOVA of MRR@10^a across 288 embedding conditions (all 2-way interactions).

| Factor | η²^b | F test^c (df) | P value^c |
| --- | --- | --- | --- |
| Embedding model | 0.408 | 269.72 (11, 193) | <.001 |
| Corpus | 0.246 | 895.22 (2, 193) | <.001 |
| Query format | 0.192 | 1399.11 (1, 193) | <.001 |
| Chunking strategy | 0.002 | 3.99 (3, 193) | .009 |
| Model × corpus | 0.040 | 13.17 (22, 193) | <.001 |
| Model × query format | 0.029 | 18.84 (11, 193) | <.001 |
| Corpus × query format | 0.052 | 187.62 (2, 193) | <.001 |
| Corpus × chunking | 0.001 | 1.03 (6, 193) | .41 |
| Model × chunking | 0.002 | 0.52 (33, 193) | .99 |
| Query × chunking | 0.002 | 4.88 (3, 193) | .003 |
| Residual | 0.027 | N/A^d | N/A |

^a MRR@10: mean reciprocal rank at 10.

^b η² is calculated as the sum of squares (factor)/sum of squares (total). Here, N=288 (12 embedding models × 3 corpora × 2 query formats × 4 chunking strategies).

^c F and P values are not defined for the “Residual” row. The value 193 (df_numerator) for that row is the residual degrees of freedom used as df_denominator by every F test in the table.

^d N/A: not applicable.

Context variables collectively explain as much variance as model choice: corpus + query format + corpus × query format = 49.0% versus model + model × corpus + model × query format = 47.6%. While model choice is the single largest factor, optimizing model selection while ignoring corpus characteristics and query design leaves approximately half of the available performance variation unaddressed. The model × corpus interaction (F(22, 193)=13.17, P<.001) indicates that model rankings shift meaningfully across corpora, not merely as additive offsets but as rank reordering. The model × query format interaction (F(11, 193)=18.84, P<.001) further shows that models differ in their sensitivity to query type, meaning that query reformulation affects models unequally.
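The grouped shares above are simple sums of η² terms, and each F statistic can be approximately recovered from η² and the degrees of freedom, since F = (η²_factor/df_factor)/(η²_residual/df_residual). The sketch below uses the rounded values from Table 3, so results can differ from the reported figures in the last digits; the dictionary keys and function names are hypothetical.

```python
# Rounded η² values and degrees of freedom transcribed from Table 3.
ETA_SQUARED = {
    "model": 0.408, "corpus": 0.246, "query_format": 0.192, "chunking": 0.002,
    "model:corpus": 0.040, "model:query_format": 0.029,
    "corpus:query_format": 0.052, "corpus:chunking": 0.001,
    "model:chunking": 0.002, "query_format:chunking": 0.002, "residual": 0.027,
}
DF = {
    "model": 11, "corpus": 2, "query_format": 1, "chunking": 3,
    "model:corpus": 22, "model:query_format": 11, "corpus:query_format": 2,
    "corpus:chunking": 6, "model:chunking": 33, "query_format:chunking": 3,
    "residual": 193,
}


def share(terms):
    """Sum of η² over a group of terms."""
    return sum(ETA_SQUARED[term] for term in terms)


def f_from_eta(term):
    """Approximate F: ratio of per-df η² for the term to per-df residual η²."""
    return (ETA_SQUARED[term] / DF[term]) / (ETA_SQUARED["residual"] / DF["residual"])


context_share = share(["corpus", "query_format", "corpus:query_format"])
model_share = share(["model", "model:corpus", "model:query_format"])

if __name__ == "__main__":
    print(f"context terms: {context_share:.3f}")
    print(f"model terms:   {model_share:.3f}")
    print(f"approximate F for model: {f_from_eta('model'):.1f}")
```

With the rounded inputs, the model-related terms sum to 0.477 rather than the reported 47.6%, and the recovered F for embedding model is about 265 rather than 269.72. Both gaps are presumably due to rounding in the tabulated η² values.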

A secondary analysis replacing individual model identities with an architectural category (6 levels) showed that the category explained 37.2% of the variance, lower than the 40.8% explained by individual models; the 3.6 percentage point difference reflects within-category model variation. Context variables explained an even larger relative share than the architectural category. A sensitivity analysis restricted to full-document conditions, including BM25 (N=78), yielded a consistent pattern: model 46.8%, corpus 24.8%, query format 16.4% (R²=0.880).

These results were robust to metric choice. Kendall τ between model rankings under MRR@10 and P@1 averaged 0.985 across the 6 dataset × query format conditions (12 models each), indicating near-identical rankings. Concordance with recall@10 was also high (τ=0.894) and with recall@50 somewhat lower (τ=0.822), as expected given the broader retrieval window. NDCG@10 correlated almost perfectly with MRR@10 (τ=0.985), which is expected when each query has exactly 1 relevant document: in this setting, the per-query values of the 2 metrics are monotone transforms of each other. NDCG@10 is therefore omitted from the primary results and retained only in .

To address the nonindependence limitation of the condition-level ANOVA, I fitted a linear mixed-effects model to 39,000 per-query reciprocal ranks (13 models × 3 corpora × 2 query formats × ≈500 queries) with query_id as a random intercept. All 3 fixed effects were significant by likelihood ratio tests (model: χ²(12)=13,540.6, P<.001; dataset: χ²(2)=801.6, P<.001; query format: χ²(1)=4456.9, P<.001). The intraclass correlation coefficient was 0.210, indicating that 21.0% of residual variance was attributable to between-query difficulty, a source that the condition-level ANOVA absorbs into the residual. Despite this reattribution, the relative importance ordering was preserved: model remained the largest contributor, followed by dataset and query format. Fixed effects explained 37.3% of total variance (pseudo-R²=0.373). These results confirm that the ANOVA findings are robust to the nonindependence concern. The condition-level R²=0.974 is partly inflated by aggregation over heterogeneous queries; the per-query pseudo-R²=0.373 provides a more conservative estimate of variance attributable to experimental factors.

Model Rankings Are Corpus-Dependent

Table 4 presents MRR@10 for all retrieval configurations under keyword queries with full-document indexing. Model rankings were moderately unstable across corpora. Nomic-embed-text achieved the highest MRR@10 on MTSamples (0.768, 95% CI 0.734-0.800) but fell to 0.460 on PMC-Patients, where BM25 dominated (0.881, 95% CI 0.853-0.906), a margin of +0.312 over the best embedding model (MedCPT; 0.569).

Table 4. MRR@10[a] by model and corpus (keyword queries, full-document indexing)[b,c].

| Model | Category | MTSamples | PMC[d]-Patients | Synthetic |
|---|---|---|---|---|
| Nomic-embed-text | General embedding | 0.768 | 0.460 | 0.288 |
| BGE[e]-base-en-v1.5 | General embedding | 0.759 | 0.459 | 0.253 |
| GTE[f]-base | General embedding | 0.730 | 0.413 | 0.219 |
| OpenAI-emb3-small | General application programming interface | 0.711 | 0.410 | 0.273 |
| BM25[g] | Lexical | 0.694 | 0.881 | 0.266 |
| MedCPT[h] | Biomedical retriever | 0.624 | 0.569 | 0.212 |
| BioLORD[i]-2023 | Biomedical retriever | 0.581 | 0.225 | 0.162 |
| E5-Mistral-7B (mean-pooling, ablation) | General large language model | 0.344 | 0.195 | 0.152 |
| Phi-3-mini | General large language model | 0.175 | 0.140 | 0.031 |
| BioBERT[j] | Domain encoder | 0.165 | 0.148 | 0.040 |
| ClinicalBERT | Domain encoder | 0.129 | 0.046 | 0.014 |
| E5-Mistral-7B (end of sequence, vendor) | General v | 0.062 | 0.169 | 0.042 |
| Nomic-embed-text (no prefix, ablation) | Ablation | 0.746 | 0.399 | 0.314 |

[a] MRR@10: mean reciprocal rank at 10.

[b] Ablation variants are labeled with their pooling or prefix configuration in parentheses.

[c] Italics indicates best in column.

[d] PMC: PubMed Central.

[e] BGE: BAAI General Embedding.

[f] GTE: General Text Embeddings.

[g] BM25: Best Match 25.

[h] MedCPT: Medical Contrastive Pre-trained Transformer.

[i] BioLORD: Biomedical Learning of Ontological Representations from Definitions.

[j] BioBERT: Biomedical Bidirectional Encoder Representations from Transformers.

Rank stability varied by query type (Table 5). For keyword queries, Kendall τ between MTSamples and PMC-Patients was 0.590 (95% CI 0.211-0.889), indicating a moderate positive association with substantial reordering among individual models. For natural language queries, rankings were much more stable (τ=0.821-0.872 across all corpus pairs). This suggests that keyword-based retrieval is more sensitive to corpus-specific vocabulary, whereas natural language queries allow models to exploit semantic similarity more consistently.

Table 5. Rank stability across corpora (Kendall τ with bootstrap 95% CI).

| Corpus pair | Spearman ρ | Kendall τ (95% CI) | Query type |
|---|---|---|---|
| MTSamples vs PMC[a]-Patients | 0.753 | 0.590 (0.211-0.889) | Keyword |
| MTSamples vs synthetic | 0.890 | 0.744 (0.472-0.944) | Keyword |
| PMC-Patients vs synthetic | 0.775 | 0.641 (0.127-0.971) | Keyword |
| MTSamples vs PMC-Patients | 0.934 | 0.821 (0.531-1.000) | Natural language |
| MTSamples vs synthetic | 0.940 | 0.846 (0.559-1.000) | Natural language |
| PMC-Patients vs synthetic | 0.962 | 0.872 (0.671-1.000) | Natural language |

[a] PMC: PubMed Central.
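The keyword-query rank correlation between MTSamples and PMC-Patients can be recomputed from the MRR@10 values in Table 4. The sketch below uses Kendall tau-a over the 13 configurations and a percentile bootstrap that resamples models. The resampling unit, the number of replicates, and the seed are assumptions, so the interval it prints is not expected to match the published one.

```python
import random
from itertools import combinations

# MRR@10 under keyword queries with full-document indexing, copied from Table 4
# in the same row order (13 configurations).
mtsamples = [0.768, 0.759, 0.730, 0.711, 0.694, 0.624, 0.581,
             0.344, 0.175, 0.165, 0.129, 0.062, 0.746]
pmc_patients = [0.460, 0.459, 0.413, 0.410, 0.881, 0.569, 0.225,
                0.195, 0.140, 0.148, 0.046, 0.169, 0.399]

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / all pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over models (an assumed resampling unit)."""
    rng = random.Random(seed)
    n = len(x)
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        taus.append(kendall_tau_a([x[i] for i in idx], [y[i] for i in idx]))
    taus.sort()
    low = taus[int((alpha / 2) * n_boot)]
    high = taus[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

tau = kendall_tau_a(mtsamples, pmc_patients)
ci_low, ci_high = bootstrap_ci(mtsamples, pmc_patients)
print(f"tau = {tau:.3f}")  # 0.590 for the Table 4 values
print(f"bootstrap 95% CI (assumed procedure): ({ci_low:.3f}, {ci_high:.3f})")
```

Of the 78 model pairs, 62 are ordered the same way on both corpora and 16 are reversed, which gives τ=46/78≈0.590. Most of the reversals involve BM25 and MedCPT, which rank higher on PMC-Patients than on MTSamples.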

Query Format Effect

Table 6 presents the MRR@10 difference when switching from keyword to natural language queries. The effect was positive for all but 2 model-corpus combinations (MedCPT on synthetic: Δ=−0.011 and E5-Mistral-7B on synthetic: Δ=−0.001) and often exceeded the gap between the best and worst models. On PMC-Patients, BioLORD-2023 improved from 0.225 to 0.884 (Δ=+0.659), a nearly 4-fold improvement. This single-variable change exceeded the entire range of model performance under keyword queries (0.835 range). On MTSamples, the average query format effect was +0.171 MRR@10 points; on PMC-Patients, it was +0.399. The MedCPT exception is consistent with its contrastive training on structured PubMed query-article pairs that resemble keyword queries; natural language reformulation may disrupt the learned query-document alignment for this model.

Table 6. Query format effect: MRR@10[a] difference (natural language minus keyword), full-document indexing. Seven models were selected to represent each architectural category and the widest effect range; full results are presented in .

| Model | Δ MTSamples | Δ PMC[b]-Patients | Δ synthetic |
|---|---|---|---|
| BioLORD[c]-2023 | +0.188 | +0.659 | +0.257 |
| GTE[d]-base | +0.156 | +0.489 | +0.250 |
| Nomic-embed-text | +0.121 | +0.490 | +0.221 |
| BGE[e]-base-en-v1.5 | +0.123 | +0.424 | +0.241 |
| BM25[f] | +0.189 | +0.118 | +0.364 |
| OpenAI-emb3-small | +0.151 | +0.419 | +0.181 |
| MedCPT[g] | +0.101 | +0.151 | −0.011 |

[a] MRR@10: mean reciprocal rank at 10.

[b] PMC: PubMed Central.

[c] BioLORD: Biomedical Learning of Ontological Representations from Definitions.

[d] GTE: General Text Embeddings.

[e] BGE: BAAI General Embedding.

[f] BM25: Best Match 25.

[g] MedCPT: Medical Contrastive Pre-trained Transformer.

Document Length Introduces Systematic Bias

Retrieval performance degraded with document length across most models. On PMC-Patients, OpenAI-emb3-small showed the largest bias: MRR@10 of 0.515 for short documents versus 0.282 for long (Δ=+0.233). MedCPT was the sole exception, showing stable or slightly improved performance on longer documents (short 0.548 vs long 0.599, Δ=−0.051 on PMC-Patients), likely attributable to contrastive training on longer PubMed articles. The length bias was corpus-dependent: on the synthetic corpus, several models, including Nomic-embed-text and BGE-base-en-v1.5, showed reversed or negligible length effects.

The length-tercile analysis serves as a partial proxy for specialty-level variation, because clinical specialties differ systematically in document length and terminology. However, the condition-level ANOVA uses corpus type as a single factor rather than modeling specialty individually. Within MTSamples (40 specialties) and the synthetic corpus (20 specialties, 5 documents each), specialty-specific retrieval difficulty likely varies: surgical operative reports use distinctive procedural vocabulary that may be easier to retrieve than general medicine notes with overlapping symptom terms. The length-tercile spread (up to Δ=0.233 for OpenAI-emb3-small on PMC-Patients) suggests that within-corpus document characteristics meaningfully affect retrieval, and specialty is a plausible driver of this variation. A per-query stratified analysis by specialty was not conducted because the condition-level design aggregates across documents; future work with per-query retrieval logs could decompose the corpus effect into specialty-level components.

Lexical Overlap Correlates With Retrieval Success

Spearman correlations between query-document Jaccard similarity and retrieval rank were negative across nearly all models (higher overlap=better rank). BM25 showed the strongest correlation on MTSamples (ρ=−0.556). Among embedding models, E5-Mistral-7B-ablation showed strong lexical dependence (ρ=−0.498), while MedCPT showed near-zero correlation on PMC-Patients (ρ=−0.008), indicating genuinely semantic retrieval. On synthetic, BioBERT and ClinicalBERT showed positive (reversed) correlations (ρ=+0.105 and +0.082), indicating that retrieval was effectively random with respect to lexical content—consistent with the degenerate embedding geometry observed for these models (see the “Domain-Specific Pretraining Does Not Guarantee Retrieval Quality” section).
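The overlap analysis can be sketched as follows. The tokenizer (lowercased alphanumeric word sets) and the toy values are assumptions for illustration; the study's actual preprocessing is not specified in this section.

```python
import re

def tokenize(text):
    """Lowercased alphanumeric tokens as a set (an assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(query, document):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    q, d = tokenize(query), tokenize(document)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-query overlap and the rank at which the target document was retrieved.
overlaps = [0.30, 0.12, 0.05, 0.20, 0.02]
retrieved_ranks = [1, 3, 8, 2, 15]

example = jaccard("chest pain radiating to left arm",
                  "Patient reports chest pain and arm numbness")
rho = spearman_rho(overlaps, retrieved_ranks)
print(f"Jaccard = {example:.2f}, Spearman rho = {rho:.2f}")
```

In the toy data, higher overlap always coincides with a better (numerically lower) rank, so ρ is −1. A value near 0 would indicate that retrieval rank is unrelated to surface-form overlap.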

Domain-Specific Pretraining Does Not Guarantee Retrieval Quality

BioBERT and ClinicalBERT ranked 11th and 12th among 13 configurations across all corpora despite biomedical pretraining. Embedding geometry analysis revealed the mechanism: both exhibited anisotropy exceeding 0.90 (mean pairwise cosine similarity over 1000 random pairs; see the “Evaluation Metrics” section) across all corpora (Table 7), indicating that embeddings are dominated by a narrow cone in which cosine similarity between arbitrary document pairs is uniformly high. Self-similarity scores of 0.97-0.99 confirmed a near-complete loss of discriminative capacity. By contrast, BioLORD-2023 (contrastive training, same domain) achieved anisotropy of only 0.25-0.40 and 3-5× better retrieval performance, demonstrating that the training objective matters more than the pretraining domain.

Table 7. Embedding geometry and retrieval performance (MTSamples, keyword, and full-document).

| Model | Anisotropy | Self-similarity | Effective rank | MRR@10[a] | Category |
|---|---|---|---|---|---|
| ClinicalBERT | 0.905 | 0.969 | 222 | 0.129 | Domain encoder |
| BioBERT[b] | 0.950 | 0.982 | 257 | 0.165 | Domain encoder |
| Phi-3-mini | 0.974 | 0.993 | 286 | 0.175 | General large language model |
| BioLORD[c]-2023 | 0.248 | 0.681 | 267 | 0.581 | Biomedical retriever |
| BGE[d]-base-en-v1.5 | 0.671 | 0.842 | 314 | 0.759 | General embedding |
| Nomic-embed-text | 0.652 | 0.847 | 302 | 0.768 | General embedding |
| OpenAI-emb3-small | 0.462 | 0.789 | 319 | 0.711 | General application programming interface |

[a] MRR@10: mean reciprocal rank at 10.

[b] BioBERT: Biomedical Bidirectional Encoder Representations from Transformers.

[c] BioLORD: Biomedical Learning of Ontological Representations from Definitions.

[d] BGE: BAAI General Embedding.
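Anisotropy as defined in this section (mean pairwise cosine similarity over 1000 random pairs) can be sketched in a few lines. The synthetic vectors, dimensionality, and seeds below are assumptions chosen only to contrast a narrow cone with an isotropic cloud; they are not model embeddings.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between 2 vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anisotropy(embeddings, n_pairs=1000, seed=0):
    """Mean cosine similarity over randomly sampled pairs of distinct embeddings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(embeddings)), 2)
        total += cosine(embeddings[i], embeddings[j])
    return total / n_pairs

rng = random.Random(42)
dim, n_docs = 64, 200

# Synthetic narrow cone: a shared direction plus small per-document noise.
cone = [[1.0 + rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_docs)]
# Synthetic isotropic cloud: independent Gaussian coordinates.
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_docs)]

cone_score = anisotropy(cone)
iso_score = anisotropy(isotropic)
print(f"cone: {cone_score:.3f}, isotropic: {iso_score:.3f}")
```

When every vector shares a dominant direction, arbitrary pairs have cosine similarity close to 1, so a query embedding is nearly as similar to unrelated documents as to the target. That is the failure mode the anisotropy values above 0.90 point to.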

LLM-Based Encoders: Pooling Strategy Is Critical

E5-Mistral-7B with the vendor-specified last-token (EOS) pooling ranked last across all 13 configurations on every corpus, achieving its lowest single-condition MRR@10 of 0.062 (MTSamples, full-document, and keyword). The E5-Mistral-7B model card recommends last-token pooling with query instructions for optimal performance; the finding that this configuration fails on clinical text—even when following vendor guidance—is consistent with a task-distribution misalignment between the model’s training data and clinical documentation. A mean-pooling ablation improved performance 5.5× to 0.344, suggesting that clinical text does not produce the token-position patterns EOS pooling was optimized for. Phi-3-mini (3.8 billion parameters, mean pooling) scored only 0.175, failing to outperform 110-million-parameter general embedding models. Larger parameter counts do not compensate for misaligned training objectives.
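The difference between the 2 pooling strategies can be shown on toy data. This is a plain-Python sketch with hypothetical hidden states and an attention mask. Production code would operate on model output tensors, and the correct last-token index depends on the tokenizer's padding side, which is assumed here to be the right.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def last_token_pool(token_vectors, attention_mask):
    """Return the vector of the last non-padding token (the EOS position under right padding)."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_vectors[last_index]

# Hypothetical 2-dimensional hidden states for 4 positions; the last position is padding.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden, mask)
last_vec = last_token_pool(hidden, mask)
print(mean_vec)  # [1.0, 1.0]
print(last_vec)  # [2.0, 2.0]
```

Mean pooling spreads the representation over every real token, whereas last-token pooling relies entirely on what the model has encoded at a single position. The second strategy therefore depends more heavily on the model having been trained to summarize the sequence at that position.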

Chunking Strategy Effects Are Modest

Fixed 256-token chunking achieved the highest mean MRR@10 across embedding conditions on 2 of 3 corpora: 0.580 (MTSamples) and 0.543 (PMC-Patients), compared with 0.569 and 0.502 for full-document indexing. On the synthetic corpus, full-document and fixed-256 were tied at 0.238. The maximum chunking effect across corpora was Δ=0.066 (between fixed-256 and fixed-512 on PMC-Patients), which is small in variance-decomposition terms (η²=0.002) but potentially meaningful for applied retrieval, where even modest MRR gains affect user experience. To contextualize the impact of chunking, under keyword queries, the largest MRR@10 difference between chunking strategies within a single model was 0.106 (PMC-Patients), whereas the cross-model difference within a single corpus reached 0.671 (MTSamples). Across all 3 corpora, the maximum chunking effect ranged from 11% to 20% of the corresponding corpus-specific model effect (defined as within-model chunk spread divided by cross-model MRR@10 spread, restricted to keyword queries: MTSamples 0.077/0.671=11.5%; PMC-Patients 0.106/0.527=20.1%; and synthetic 0.047/0.288=16.3%). Practitioners should therefore prioritize model selection over chunking strategy, selecting the latter based on application constraints (latency and context window size) rather than retrieval performance alone. Chunking interactions were nonsignificant for corpus × chunking (P=.41) and model × chunking (P=.99). Query × chunking was small but detectable (P=.003), suggesting that chunking effects may differ slightly between keyword and natural language queries. These chunking results apply to the dense retrieval pipelines evaluated here; lexical retrieval may behave differently under chunking. Chunking may also become more important at larger scales, where documents routinely exceed model context windows.
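The relative chunking effect quoted above is a simple ratio, reproduced here alongside a minimal fixed-size chunker. The chunker assumes pre-tokenized input and no overlap between chunks; neither detail is stated in this section, so both are illustrative assumptions.

```python
def fixed_chunks(tokens, size=256):
    """Split a token list into consecutive chunks of at most `size` tokens (no overlap assumed)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def relative_chunk_effect(chunk_spread, model_spread):
    """Within-model chunking spread as a percentage of the cross-model MRR@10 spread."""
    return 100.0 * chunk_spread / model_spread

# Spreads quoted in the text (keyword queries): (chunk spread, cross-model spread).
spreads = {
    "MTSamples": (0.077, 0.671),
    "PMC-Patients": (0.106, 0.527),
    "Synthetic": (0.047, 0.288),
}
effects = {name: relative_chunk_effect(c, m) for name, (c, m) in spreads.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:.1f}%")

tokens = [f"tok{i}" for i in range(600)]  # hypothetical 600-token note
chunk_lengths = [len(c) for c in fixed_chunks(tokens)]
print(chunk_lengths)  # [256, 256, 88]
```

The ratios come out at 11.5%, 20.1%, and 16.3%, matching the values reported for the 3 corpora.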

Validation With Reduced-Lexical-Dependence Queries

To test whether relative model rankings are artifacts of the known-item retrieval design, the evaluation was repeated using metadata-only queries for a random subset of 100 documents per corpus (300 total). For each document, GPT-4o (gpt-4o-2024-05-13, temperature 0.0) extracted structured metadata—specialty, note type, primary diagnosis, secondary diagnoses, and patient demographics—from the document text. Queries were then generated by GPT-4o (temperature 0.3) from these metadata fields alone, without access to the document text. Although the metadata extraction step reads the document, the resulting metadata representation substantially reduces lexical dependence: mean Jaccard overlap between metadata-only queries and target documents was 0.010 (keyword) and 0.022 (natural language), compared with 0.027 and 0.049 for known-item queries—a 55%-63% reduction in token overlap on the synthetic corpus. This does not eliminate information leakage through the extraction step, but it substantially reduces the surface-form overlap that BM25 and lexically dependent models exploit.

Model rankings were highly stable between known-item and metadata-only queries. Kendall τ ranged from 0.59 to 0.90 across all corpus-query format combinations (mean τ=0.76), and all correlations were significant (P≤.004; P<.001 for 5 of 6 conditions). Spearman ρ ranged from 0.80 to 0.96 (Table 8). Rankings were most stable on MTSamples (τ=0.87-0.90) and least stable on the synthetic corpus (τ=0.59-0.77), consistent with the main study’s finding that keyword queries are more sensitive to corpus-specific vocabulary. A post hoc audit revealed that the validation script inadvertently used different HuggingFace checkpoints for 3 models (BioBERT: biobert-base-cased-v1.2 instead of biobert-v1.1; ClinicalBERT: Bio_ClinicalBERT instead of medicalai/ClinicalBERT; and GTE-base: gte-base-en-v1.5 instead of gte-base). For BioBERT and ClinicalBERT, rank stability across different checkpoints is expected: both models consistently rank in the bottom 2 positions due to embedding-geometry collapse (mean pairwise cosine similarity >0.90). For GTE-base, which ranked fourth/fifth in both experiments, the stability is more meaningful but is based on a single-model observation without formal testing.

Table 8. Rank stability between known-item and metadata-only queries (Kendall τ and Spearman ρ).

| Corpus | Query format | τ
