Evolving computational paradigms for noncoding variant pathogenicity prediction

Abstract

The rapid expansion of whole-genome sequencing (WGS) has highlighted the important contribution of noncoding variants to human disease, yet their pathogenic mechanisms remain difficult to resolve. Traditional statistical and experimental approaches often struggle to capture complex regulatory interactions or establish causal links, leaving many noncoding variants classified as variants of uncertain significance in clinical databases. Recent advances in computational modeling have substantially improved pathogenicity prediction by integrating genomic, epigenetic, and structural information. In parallel, genome language model (gLM)-inspired methods have enabled more context-aware interpretation of noncoding sequences and improved model generalization. This review summarizes current computational approaches, data modalities, and evaluation strategies for noncoding variant pathogenicity prediction, discusses key challenges in interpretability and data heterogeneity, and highlights emerging opportunities for clinical translation.

Introduction

The vast noncoding regions of the human genome remain largely unexplored as genetic studies have historically focused on protein-coding variants (Alexander et al., 2010). Emerging evidence shows that most disease-associated variants are located in noncoding regions (Wu et al., 2024); these variants usually exert their effects by disrupting transcriptional regulation, RNA processing, and chromatin organization (Zhang and Lupski, 2015), thereby contributing to the onset and progression of complex disorders such as cancer (Fredriksson et al., 2014; Khurana et al., 2016), cardiovascular disease, and neurodegenerative disease (Spielmann and Mundlos, 2016; Nemeth et al., 2024). Accordingly, understanding how noncoding variants affect disease is essential for improving our knowledge of disease mechanisms and for advancing precision medicine (Ellingford et al., 2022; Ruffo et al., 2025).

Genome-wide association studies (GWASs) (Visscher et al., 2012; Jin et al., 2022; Schipper and Posthuma, 2022) identify statistical associations between genetic variants and phenotypic traits across large populations. GWASs have successfully identified many disease-associated genetic loci; however, their direct clinical interpretation is limited. Because of linkage disequilibrium (LD), an association signal often represents a block of correlated variants rather than a single causal site, so fine mapping is required (Uffelmann et al., 2021). To address this issue, a growing number of studies integrate GWAS signals with functional genomic annotations and apply probabilistic fine-mapping methods to prioritize noncoding variants with putative regulatory function (Li and Zhou, 2025). Meanwhile, the widespread adoption of whole-genome sequencing (WGS) in both research and clinical practice has further expanded the scale of the interpretation problem (Bagger et al., 2024; Brlek et al., 2024; Qiao et al., 2024); each genome contains millions of mostly noncoding variants, making systematic interpretation challenging (Austin-Tse et al., 2022). Both GWAS fine mapping and clinical diagnostics increasingly rely on computational frameworks that can rank and score noncoding variants to prioritize mechanistically informative and clinically relevant candidates. However, the goals of noncoding variant interpretation differ between clinical diagnosis and GWAS fine mapping. In clinical settings, the main goal is to identify pathogenic variants, particularly those whose effects are large enough to be demonstrated and clearly linked to the phenotype in the relevant tissue or developmental period. These variants are typically assessed in the context of family segregation, functional studies, and curated clinical databases (Ellingford et al., 2022). 
By contrast, GWAS is mainly concerned with identifying loci that influence molecular or organismal traits, where the underlying variants often have modest effects and do not necessarily correspond directly to pathogenic alleles. Consequently, predictive models developed for clinical pathogenicity assessment and those designed for GWAS fine mapping should not be treated as interchangeable because they are built to address different biological questions and must be evaluated using different criteria.
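
To make the idea of probabilistic fine mapping concrete, the following minimal Python sketch converts GWAS summary statistics for the variants in one LD block into posterior inclusion probabilities (PIPs). It is an illustration only and not the method of any study cited above: it assumes exactly one causal variant per locus, an equal prior probability for every variant, and Wakefield-style approximate Bayes factors with a hypothetical prior effect-size standard deviation (prior_sd = 0.2). All function and parameter names are our own.

```python
import math
from typing import List


def log_abf(beta: float, se: float, prior_sd: float = 0.2) -> float:
    """Log approximate Bayes factor in favor of association for one variant.

    beta and se are the GWAS effect estimate and its standard error;
    prior_sd is an assumed (hypothetical) prior SD of true effect sizes.
    """
    v = se ** 2          # sampling variance of the estimate
    w = prior_sd ** 2    # prior variance of the true effect
    z = beta / se
    r = w / (v + w)      # shrinkage factor
    return 0.5 * math.log(1.0 - r) + 0.5 * z * z * r


def posterior_inclusion_probabilities(
    betas: List[float], ses: List[float], prior_sd: float = 0.2
) -> List[float]:
    """PIPs under a single-causal-variant assumption with equal priors."""
    if len(betas) != len(ses):
        raise ValueError("betas and ses must have the same length")
    logs = [log_abf(b, s, prior_sd) for b, s in zip(betas, ses)]
    m = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logs]
    total = sum(weights)
    return [wt / total for wt in weights]


if __name__ == "__main__":
    # Three correlated variants in one locus (made-up summary statistics).
    pips = posterior_inclusion_probabilities([0.10, 0.12, 0.02], [0.02, 0.02, 0.02])
    for i, p in enumerate(pips):
        print(f"variant {i}: PIP = {p:.3f}")
```

Under these assumptions the PIPs sum to one across the locus, so they rank candidate variants relative to one another but say nothing about clinical pathogenicity, which is consistent with the distinction drawn above.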

Functional and pathogenic effects are closely related and are often conflated. Both are typically measured using individual variant assays, but their biological meanings differ. Functional effects refer to the direct perturbation of molecular phenotypes, such as altered transcription factor (TF) binding, changes in gene expression, or splicing disruption, and therefore represent proximal mechanistic signals (Wang X. et al., 2024). By contrast, pathogenic effects depend on whether these molecular alterations occur in disease-relevant tissues or developmental contexts and translate into disease phenotypes or risks (Guo et al., 2024; Huang et al., 2024; Huang et al., 2025). Importantly, the relationship between the two is not strictly dichotomous. Many sequence-based models predict regulatory-layer molecular effects rather than pathogenicity, yet their outputs still provide informative clues about disease relevance. In particular, recent unified frameworks such as AlphaGenome (Avsec et al., 2026), which jointly model functional consequences across multiple regulatory layers, have further strengthened the ability of functional predictions to indicate disease relevance, thereby forming an important bridge between functional effect prediction and direct pathogenicity inference. At the same time, it should be emphasized that many variants can produce measurable functional effects without necessarily causing disease. This is particularly relevant for complex traits, where clinical significance arises from the cumulative effects of multiple small-effect variants (Lappalainen et al., 2024; Xu et al., 2025).

This conceptual distinction helps explain why early studies relied on indirect markers and chains of evidence to identify candidate noncoding variants (Ritchie et al., 2014); this reliance was driven by two major constraints: the scarcity of confidently validated pathogenic labels and the strong context dependence of noncoding regulatory mechanisms. In practice, candidate variants were first filtered according to the population frequency, evolutionary conservation, and heuristic rules based on functional genomic regions (Ward and Kellis, 2012) and were then further annotated using resources such as ENCODE (ENCODE Project Consortium, 2012) and Roadmap Epigenomics (Fisher et al., 2015) to obtain mechanistic clues. Pathogenicity was subsequently inferred by integrating these annotations with functional validation results and other lines of supporting evidence. However, the evidence-based method was limited in important ways, including wide variability between signals with respect to scale, resolution, and biological relevance; many annotations were made at the region level rather than at the single-variant level; and accurate mapping to disease-relevant tissues or cell types was often lacking, leading to potential false positives and false negatives. Even as large-scale projects expanded functional annotation catalogs, relevant signals could still be missed.

Against this background, the rapid development of artificial intelligence (AI), machine learning (ML), and especially deep learning (DL) provided a new framework for noncoding variant prediction. By automatically extracting informative features and modeling complex nonlinear relationships, these methods substantially improved the ability to prioritize potentially deleterious variants. More recently, genome language model (gLM)-inspired approaches (Shu et al., 2026) that treat genomic sequences as biological language have further expanded model representational capacity and improved predictive performance in some contexts. Current computational methods for evaluating the effects of noncoding variants fall broadly into two classes. The first strategy infers disease relevance indirectly from predicted functional effects (Wang X. et al., 2024); models predict impacts on molecular phenotypes such as splicing, gene expression, chromatin states, or TF binding and use these predictions to prioritize potentially pathogenic variants, with the direct output being functional consequences rather than pathogenicity labels. Enformer (Avsec et al., 2021) and the more recent AlphaGenome (Avsec et al., 2026) both fall into this category. The second strategy predicts pathogenicity more directly under pathogenicity-oriented supervision. Its training endpoints are typically derived from manually curated resources such as ClinVar (Landrum et al., 2025) or from constructed pathogenic/benign label sets. These models learn a mapping from multisource input features to an integrated pathogenicity score, making them more closely aligned with the practical needs of clinical interpretation. Combined Annotation-Dependent Depletion (CADD) (Schubach et al., 2024) is a representative example of this strategy. In the following sections, we primarily focus on the development of the second category of methods.

Noncoding variants as drivers of human disease

Mapping the functional hierarchy of noncoding regions in the genome

Noncoding regions refer to segments of the genome that are not translated into proteins. Although once termed “junk DNA,” they are now widely recognized as having diverse and important biological functions (Walter, 2024). With the advancement of whole-genome sequencing (WGS) technologies, increasing evidence has shown that most disease-associated genetic variants reside in noncoding regions, where they play critical roles in transcriptional regulation, RNA processing, and chromatin architecture. Based on their location and function, noncoding regions can be classified into the following categories (Figure 1).

Illustration depicting genomic organization and gene expression, showing a chromosome with centromere and telomere, nucleosome structure, regulatory elements like enhancers and silencers, RNA polymerase transcribing DNA, production of pre-mRNA, splicing into mature mRNA, and translation by a ribosome.

Schematic representation of noncoding regions in the human genome, from chromosomes to DNA sequences, including key functional elements such as promoters, enhancers, silencers, 5′UTRs, 3′UTRs, introns, telomeres, centromeres, and repetitive DNA elements. During transcription, introns are spliced out to generate mature mRNA. Noncoding RNAs are transcribed throughout the genome by RNA polymerase and primarily include long noncoding RNAs (lncRNAs), microRNAs (miRNAs), and circular RNAs (circRNAs). Telomeres and centromeres are annotated on the chromosomes. Intergenic regions, located between genes, contain regulatory elements (such as enhancers and silencers), repetitive DNA elements (such as transposons and pseudogenes), and noncoding RNA genes. Noncoding RNAs are transcribed under the action of these cis-regulatory elements.

Cis-regulatory elements

These include promoters, enhancers, silencers, and insulators (ENCODE Project Consortium, 2012). Promoters lie upstream of the transcription start site and serve mainly as binding sites for RNA polymerase and TFs (Whitfield et al., 2012; Haberle and Stark, 2018). Enhancers and silencers can lie at greater distances from their target genes and enhance or repress transcription, respectively, via protein interactions (Panigrahi and O’Malley, 2021; Pang et al., 2023). Insulators form chromatin boundaries, blocking enhancers from acting on non-target promoters and preventing heterochromatin spread (Brasset and Vaury, 2005). Intergenic regions also contain these cis-regulatory elements, which play a crucial role in the transcription of noncoding RNAs.

Post-transcriptional regulatory regions

These mainly include the 5′ untranslated region (5′UTR) and the 3′ untranslated region (3′UTR) of messenger RNA (mRNA) (Chu et al., 2024). They play vital roles in regulating translation efficiency, mRNA stability, and microRNA (miRNA) binding (Bohn et al., 2023). Models such as CADD (Schubach et al., 2024) and FINSURF (Moyon et al., 2022) predict the pathogenicity of variants found in the 5′UTR and 3′UTR by integrating measures of sequence conservation, epigenetic information, and functional annotation into assessments that can help determine their effect on gene expression and disease relevance.

Introns and splicing regulatory elements

Introns are the sequences separating exons that are removed during splicing. Deep intronic mutations can cause aberrant splicing, thereby affecting gene expression (Vaz-Drago et al., 2017; Barbosa et al., 2023). Some introns also have regulatory functions because they contain promoters, enhancers, or TF-binding sites (TFBSs). Through their influence on splicing, introns also help determine which mRNA isoforms are expressed. The DL model DYNA (Zhan et al., 2025) uses a Siamese network to compare the splicing activities of wild-type and variant sequences and thereby predict pathogenic intronic variants that act through splicing. TraP (Gelfman et al., 2017), by contrast, uses transcript information and sequence-based features, such as GERP++ (Davydov et al., 2010), in an ML framework to predict the impact of intronic variants on transcripts, helping identify disease-associated intronic variants.

Noncoding RNAs

Noncoding RNA (ncRNA) genes are widely distributed across the genome, and their transcripts play crucial roles in gene expression regulation, cellular functions, and genome stability (Tóth and Hannon, 2011). These include miRNAs, long noncoding RNAs (lncRNAs), circular RNAs (circRNAs), and transfer RNA (tRNA)-derived fragments. The miRNAs bind complementarily to mRNAs to regulate degradation or translation; lncRNAs (>200 nucleotides) participate in chromatin remodeling, transcriptional regulation, and RNA processing (Statello et al., 2021); circRNAs form covalently closed loops that resist RNase degradation and act as miRNA “sponges” or protein-binding platforms; tRNA-derived fragments may play roles in RNA interference or translation inhibition. The JARVIS model (Vitsios et al., 2021), which integrates genome-wide residual variation intolerance scores (gwRVIS) and genomic sequences, can be used to evaluate the impact of variants in ncRNA loci on ncRNA function.

Structural and repetitive elements

These include tandem repeats, pseudogenes, transposons, telomeres, and centromeres. These regions are closely associated with genome stability, replication, and chromosomal architecture and represent important sources of genetic variation. DYNA not only focuses on splicing-related variants but is also capable of modeling variants that alter the 3D genome architecture or chromatin structure, such as enhancer rearrangements and structural variants in intergenic regions, effectively predicting whether such variants are disease-associated.

Deciphering the functional impact induced by noncoding variants

Functional models for noncoding variants generally do not directly output clinical pathogenicity conclusions. Instead, they primarily predict the impact of variants on molecular phenotypes and are therefore better suited as tools for mechanistic interpretation and as sources of functional evidence for candidate variant prioritization. According to their central focus and the breadth of functional outputs they cover, functional models for noncoding variants can be broadly divided into three levels.

The first level consists of specialized models centered on a single dominant modality, with each model focusing on the consequences of variation at one particular functional layer. Such models can capture only a subset of features and may overlook others. Representative splice-centric models include SpliceAI (Jaganathan et al., 2019), MMSplice (Cheng et al., 2019), and Pangolin (Zeng and Li, 2022). SpliceAI directly analyzes raw pre-mRNA sequences to predict splice sites and identify cryptic splice-altering variants; variants with high scores can often be validated at relatively high rates in RNA-seq data. MMSplice adopts a modular neural network architecture, separately modeling the effects of donor sites, acceptor sites, exonic sequences, and intronic sequences on splicing. It mainly learns quantitative changes in exon skipping and splicing efficiency and then combines these modules into an overall splicing impact score. Pangolin further emphasizes multi-tissue splicing prediction and incorporates cross-species training to better capture tissue-specific effects and variant impacts; its outputs represent predicted functional effects on splice-site usage. Beyond splicing, PromoterAI (Jaganathan et al., 2025) focuses on promoter variants that cause expression abnormalities. Using approximately 20 kb of promoter-centered sequence context as input, it predicts the effect of promoter variants on gene expression. The model is first pretrained on multi-omic readouts and then fine-tuned using rare promoter variants from the Genotype-Tissue Expression (GTEx) project that are associated with abnormal expression, thereby linking promoter variation to aberrant gene expression. 
Basset (Kelley et al., 2016) primarily targets chromatin accessibility, using DNase-seq accessibility as the supervisory signal to train a convolutional neural network across 164 cell types; variant effects are then estimated through allelic differences, making it, in essence, a model that infers variant impact through the “accessibility channel.” ExPecto (Zhou et al., 2018) is mainly designed to predict tissue-specific expression changes. It first predicts a large number of epigenetic features from sequence and then uses spatial feature transformation together with a linear model to infer tissue-specific gene expression, thereby estimating variant-induced expression changes.
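
The allelic-difference strategy used by models such as Basset can be sketched in a few lines of Python: the same predictor is applied to the reference and alternate sequences, and the per-track difference is taken as the variant effect. The sketch below is a minimal illustration under stated assumptions. The function toy_accessibility_model is a hypothetical stand-in that merely counts made-up motifs; it is not Basset or any other trained model, and the track names are invented.

```python
from typing import Callable, Dict


def apply_snv(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return ref_seq with the base at 0-based position pos changed from ref to alt."""
    if ref_seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {ref_seq[pos]!r}")
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def toy_accessibility_model(seq: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained sequence-to-function model.

    Scores each invented "cell type" by counting an arbitrary motif.
    It exists only so the example runs; it is NOT a real predictor.
    """
    return {
        "cell_type_A": float(seq.count("GATA")),
        "cell_type_B": float(seq.count("CACGTG")),
    }


def variant_effect(
    model: Callable[[str], Dict[str, float]],
    ref_seq: str,
    pos: int,
    ref: str,
    alt: str,
) -> Dict[str, float]:
    """Per-track allelic difference: prediction(alt) minus prediction(ref)."""
    alt_seq = apply_snv(ref_seq, pos, ref, alt)
    ref_pred = model(ref_seq)
    alt_pred = model(alt_seq)
    return {track: alt_pred[track] - ref_pred[track] for track in ref_pred}


if __name__ == "__main__":
    # An A>C change at position 3 destroys the toy "GATA" motif.
    print(variant_effect(toy_accessibility_model, "TTGATATTCACGTGAA", 3, "A", "C"))
```

Because the model is passed in as a callable, the same wrapper would apply to any predictor that maps a sequence to named tracks; the resulting differences are functional effect estimates, not pathogenicity scores.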

The second level includes models that allow a single framework to handle multiple modalities rather than relying on modality-specific specialized models. Multimodal models such as DeepSEA (Zhou and Troyanskaya, 2015), Basenji (Kelley et al., 2018), Enformer (Avsec et al., 2021), Sei (Chen et al., 2022), and Borzoi (Linder et al., 2025) have all demonstrated practical utility and broad applicability. One of the earliest representatives is DeepSEA, which directly predicts, from the sequence alone, large-scale chromatin features including TF binding, DNase I hypersensitive sites, and histone modifications and estimates variant effects through allelic differences. This represented an early attempt to jointly model multiple regulatory phenotypes within a single model. Basenji subsequently extended the input to much longer sequence contexts and jointly predicted DNase-seq, ChIP-seq, and CAGE coverage profiles, thereby beginning to connect local regulatory signals to broader expression-related effects. Enformer further advanced this line of work by jointly predicting gene expression and multiple epigenetic marks under a longer-context, multitask-learning framework, while leveraging attention mechanisms to improve long-range information integration and thereby enhance the prediction of variant-induced expression effects. Sei expanded this paradigm to much larger-scale chromatin prediction, covering tens of thousands of chromatin profiles and summarizing them into interpretable “sequence activity classes” to quantify how variants increase or decrease different regulatory activities, although its mechanistic interpretation remains largely confined to the chromatin layer. More recently, Borzoi has moved further toward a unified sequence-to-function framework by directly predicting RNA-seq coverage from DNA sequence and enabling the extraction of variant effects across multiple layers, including transcription, splicing, and polyadenylation. 
The general representations learned by these models make them amenable to rapid fine-tuning for new tasks; however, their broader generality may come at the cost of reduced performance on some specific tasks, and they often lack the depth of analysis achievable with modality-specific specialized models.

The third level consists of unified multimodal frameworks spanning multiple major functional categories. AlphaGenome (Avsec et al., 2026) is a representative example. It integrates multimodal prediction, long-range context modeling, and base-pair-resolution inference within a single framework to predict a broad range of genomic tracks across multiple cell types. Using a 1-Mb DNA sequence as input, AlphaGenome jointly predicts diverse functional readouts in one unified architecture, including gene expression, transcription initiation, splice-site usage and splice junctions, chromatin accessibility, histone modifications, transcription factor binding, and chromatin contact maps. In doing so, it can simultaneously characterize the molecular consequences of variants across multiple regulatory layers. Although AlphaGenome approaches disease relevance assessment, it remains fundamentally a functional effect predictor rather than a pathogenicity classifier. Its outputs can support disease-related interpretation, serving as a bridge between functional effect prediction and pathogenicity inference.

In summary, functional effect models focus on molecular phenotypes and are primarily suited for mechanistic interpretation rather than direct disease risk prediction. Even unified models such as AlphaGenome and Borzoi require integration with additional evidence—such as disease-relevant tissues, developmental stages, candidate target genes, and effect magnitude—to inform pathogenicity inference.

Decoding the regulatory links between noncoding variants and disease

Noncoding variants can contribute to disease development through diverse molecular pathways, with varying types and functions validated in multiple disorders (Table 1). From a regulatory architecture perspective, many noncoding variants first manifest as local cis-regulatory perturbations, directly affecting regulatory elements within their genomic neighborhood, such as promoters, enhancers, silencers, or splicing regulatory sequences. Because these effects act locally, such variants are comparatively easy to associate with nearby target genes. For example, enhancer variants can alter TF binding sites or disrupt enhancer–promoter spatial interactions, thereby shifting gene expression away from its normal pattern. Mutations in the enhancer upstream of the TERT promoter have been linked to malignancies such as melanoma and glioblastoma (Heidenreich and Kumar, 2017), whereas campomelic dysplasia has been linked to mutations in the SOX9 silencer that reduce its silencing activity. Additionally, deep intronic mutations in the SMN2 gene lead to splicing defects in patients diagnosed with spinal muscular atrophy (SMA) (Csukasi et al., 2019). By contrast, trans-regulation primarily depends on diffusible regulatory factors, such as miRNAs, which are capable of modulating the expression of multiple target genes over a broader range (Signor and Nuzhdin, 2018). Therefore, when noncoding variants affect the expression, processing, or activity of these trans-acting regulators, their consequences are often not confined to the local genomic environment but may instead propagate through broader regulatory networks, giving rise to more distal and systemic downstream effects. For example, mutations in the seed region of miR-96 are associated with familial progressive hearing loss (Avraham et al., 2022). 
Beyond affecting gene expression through cis- or trans-regulatory mechanisms, noncoding variants can contribute to disease development through other mechanisms. For instance, CGG repeat unit expansions in the 5′UTR are associated with fragile X syndrome (Broniarek et al., 2024), and HOTAIR variants within the promoter region are associated with breast cancer susceptibility (Milevskiy et al., 2016). Epigenetic modification-related variants may alter DNA methylation or RNA modification levels, exemplified by MLH1 promoter methylation control variants linked to Lynch syndrome (Ward et al., 2013) and noncoding variants affecting METTL3-mediated m6A regulation in acute myeloid leukemia (Wen et al., 2023). In addition, copy number variations (CNVs) and structural rearrangements in noncoding regions have clinical importance; for instance, CNVs at 17p11.2 are associated with Smith–Magenis and Potocki–Lupski syndromes (Juyal et al., 1996; Potocki et al., 2007), and enhancer rearrangements located upstream of the TAL1 gene have been associated with T-cell leukemia (Mansour et al., 2014). Collectively, these examples highlight how different classes of noncoding variants can affect gene expression, RNA processing, epigenetic regulation, and chromatin architecture, underscoring the critical role of the noncoding genome in the study of genetic and complex diseases. In addition to these molecular mechanisms, the pleiotropic nature of regulatory elements and their positions within gene regulatory networks can influence the strength of selective constraint and their disease relevance. Variants affecting multiple target genes or located at hub positions in regulatory networks tend to experience stronger purifying selection and are, therefore, more likely to be associated with disease.

| Variant class | Molecular mechanism | Disease example |
| --- | --- | --- |
| Enhancer variants | Alter TF binding sites; disrupt enhancer–promoter spatial interactions | TERT upstream enhancer mutations → melanoma and glioblastoma |
| Silencer variants | Weaken transcriptional repression | SOX9 silencer mutations → campomelic dysplasia |
| RNA processing variants | (i) 5′UTR CGG repeat expansion → abnormal translation initiation; (ii) deep intronic mutations → aberrant splicing | (i) Fragile X syndrome; (ii) SMN2 intronic mutations → spinal muscular atrophy (SMA) |
| Noncoding RNA variants | Affect miRNA seed sequence or lncRNA promoter activity | (i) miR-96 seed region mutation → familial progressive hearing loss; (ii) HOTAIR promoter variants → breast cancer |
| Epigenetic modification-related variants | Alter DNA methylation or RNA modifications (e.g., m6A) | (i) MLH1 promoter variants → Lynch syndrome; (ii) noncoding METTL3 variants → acute myeloid leukemia |
| Copy number variations (CNVs) and structural rearrangements | Change the dosage or regulatory architecture | (i) CNVs at 17p11.2 → Smith–Magenis and Potocki–Lupski syndromes; (ii) Enhancer rearrangements upstream of TAL1 → T-cell leukemia |

Linking noncoding variant classes to molecular mechanisms and disease phenotypes.

This table provides an overview of the main types of noncoding variants, their molecular mechanisms, and examples of human diseases. For each variant class, we describe how alterations in regulatory or structural elements, including promoters, enhancers, silencers, UTRs, introns, telomeres, centromeres, and repetitive sequences, can disrupt gene expression or genomic stability. The affected molecular mechanisms include transcriptional regulation, RNA processing, chromatin structure, and higher-order genomic organization. Associated examples of diseases illustrate the clinical significance associated with each type of noncoding variant and further demonstrate the broad and widespread role of noncoding genomic variation in human pathophysiology.

Key attributes underlying the pathogenic potential of noncoding variants

An explicit attribute landscape of pathogenic noncoding variants

The prediction of pathogenicity for noncoding variants relies on multidimensional biological attributes, which not only reveal the molecular functional changes that a variant may induce but also provide key input features for constructing prediction models (Figure 2). First, sequence context information (such as GC content, CpG islands, and nucleotide context) can reveal how the local sequence environment influences the sensitivity of variants. From an evolutionary perspective, signals relevant to noncoding variant pathogenicity should not be reduced to a single conservation score but rather viewed as multiple observable manifestations of selective constraint operating across different timescales. Cross-species conservation reflects long-term evolutionary constraint; highly conserved noncoding sequences often carry key regulatory functions, and variants occurring in these regions are, therefore, more likely to have biological consequences. Commonly used metrics include PhyloP (Pollard et al., 2010), PhastCons (Siepel et al., 2005), and GERP++ (Davydov et al., 2010). However, evolutionary information extends beyond conservation scores alone. Variant intolerance features, such as gwRVIS (Vitsios et al., 2021), can help identify depletion of variation and intolerance in functionally important regions along the human lineage and have been incorporated into models such as JARVIS (Vitsios et al., 2021) and NCBoost (Caron et al., 2019). Population genetic features, such as allele frequency (AF) derived from gnomAD and the 1000 Genomes Project (1kGP) (Auton et al., 2015), reflect recent or ongoing purifying selection: variants with stronger potentially deleterious effects are generally less likely to reach high frequency, resulting in an overall inverse relationship between allele frequency and effect size at the population level. 
Importantly, this relationship is statistical rather than absolute; rare variants are not necessarily pathogenic, and common variants are not necessarily functionally neutral. Therefore, conservation, allele-frequency spectra, and regional intolerance should be viewed as complementary forms of evolutionary evidence that together inform the prediction of noncoding variant pathogenicity. Next, epigenetic and functional genomic features, including experimentally derived genomic functional states and three-dimensional structural information, such as chromatin accessibility (by DNase-seq/ATAC-seq), histone modifications (by ChIP-seq), TF binding profiles (by ChIP-seq), and three-dimensional genomic structures (e.g., topologically associating domain (TAD) boundaries and specific enhancer–promoter interactions identified through Hi-C), can directly identify active regulatory elements and their interaction networks, helping determine whether a variant could disrupt chromatin structure or gene regulation. Functional annotation features are also critical; variants in regions known to significantly affect gene expression are at a higher risk for pathogenicity. For example, variants in promoters, enhancers, 5′UTRs, 3′UTRs, and introns are more likely to have functional consequences than variants in other noncoding regions. Clinical and association signals from GWAS and eQTL resources, along with curated databases such as HGMD (Stenson et al., 2020) and ClinVar (Landrum et al., 2025), provide direct evidence of disease relevance for variant interpretation, as exemplified by CADD (Schubach et al., 2024), which integrates eQTL (Wong et al., 2025) and GWAS (Uffelmann et al., 2021) data to improve pathogenicity prediction. 
Finally, integrated functional prediction scores, which are comprehensive prediction scores based on sequence and evolutionary information, assess the functional impact of variants [e.g., CADD and EIGEN (Ionita-Laza et al., 2016)] and can serve as input features for more advanced models. For example, RegBase-PAT (Zhang et al., 2019) combines the results from 23 prediction tools [e.g., CADD, GWAVA (Ritchie et al., 2014), and DANN (Quang et al., 2015)] to train a composite model. Typically, these features (Table 2) are input into ML or DL frameworks in multimodal form, where feature interactions and pattern recognition are used to comprehensively assess the pathogenicity of noncoding variants. In the future, with further integration of multisource data, predictive models are expected to shift from “correlation” to “causality” judgments, providing more reliable variant interpretation for precision medicine.
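
As a rough illustration of how such heterogeneous attributes can be combined, the Python sketch below encodes a handful of features for one variant and passes them through a logistic function to produce a single score. It is a toy under stated assumptions rather than the implementation of any model discussed here: the feature set is a small subset chosen for brevity, the class and function names are our own, and the weights and bias are hypothetical values picked by hand for the example, whereas a real model would learn them from labeled pathogenic and benign variants.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VariantFeatures:
    phylop: float            # cross-species conservation score
    allele_frequency: float  # population allele frequency in [0, 1]
    in_open_chromatin: bool  # overlaps an accessible region in the tissue of interest
    gc_content: float        # local GC fraction in [0, 1]


def to_vector(f: VariantFeatures, af_floor: float = 1e-6) -> List[float]:
    """Encode features numerically; rarity is -log10(AF), floored to avoid log(0)."""
    rarity = -math.log10(max(f.allele_frequency, af_floor))
    return [f.phylop, rarity, 1.0 if f.in_open_chromatin else 0.0, f.gc_content]


# Hypothetical, hand-picked parameters for illustration only (not learned from data).
ILLUSTRATIVE_WEIGHTS = (0.5, 0.4, 1.0, 0.0)
ILLUSTRATIVE_BIAS = -3.0


def pathogenicity_score(
    f: VariantFeatures,
    weights: Sequence[float] = ILLUSTRATIVE_WEIGHTS,
    bias: float = ILLUSTRATIVE_BIAS,
) -> float:
    """Logistic combination of the encoded features; returns a value in (0, 1)."""
    x = to_vector(f)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    rare_conserved = VariantFeatures(4.0, 1e-5, True, 0.5)
    common_unconserved = VariantFeatures(0.1, 0.2, False, 0.5)
    print(round(pathogenicity_score(rare_conserved), 3))
    print(round(pathogenicity_score(common_unconserved), 3))
```

The log transform of allele frequency reflects the statistical, not absolute, relationship between rarity and deleteriousness described above: rarity raises the score but cannot by itself make a variant pathogenic.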

Circular infographic illustrating seven categories of genomic features: sequence context, evolutionary conservation, variant intolerance, population genetics, clinical associations, integrated prediction scores, and functional annotation, each with icons and relevant example methods or terms such as AF, GWAS, PhyloP, CADD, and chromatin accessibility.

Potential attributes for predicting the pathogenicity of noncoding variants. This figure illustrates the overall framework for predicting the pathogenicity of noncoding variants. The central circle represents the pathogenicity of noncoding variants, while the inner circle displays the graphical representations of eight representative predictive attributes. The outer circle lists the specific characteristics or methods associated with these predictive attributes. The blank sections in the boxes represent the specific features of each attribute or method, further demonstrating how each attribute contributes to the assessment of noncoding variant pathogenicity. AF, allele frequency (the frequency of a specific allele in the entire population); MAF, minor allele frequency (the frequency of the less common allele in the population).

| Predictive attribute | Relevance description | Indicators and sources | Representative models |
| --- | --- | --- | --- |
| Sequence context | Analyzes the physicochemical properties and compositional features of the local DNA sequence surrounding the variant | GC content, CpG islands, and nucleotide context | CADD, GWAVA, and JARVIS |
| Evolutionary conservation | Variants in highly conserved noncoding regions are more likely to be pathogenic | PhyloP, PhastCons, and GERP++ | CADD, NCBoost, and Eigen |
| Epigenetic and functional genomic features | Experimental measurements of genomic functional states and 3D structural information, helping identify active regulatory elements | Chromatin accessibility, histone modifications, TF binding, and 3D genome structure | GWAVA, FINSURF, and JARVIS |
| Functional annotations | Determines whether a variant is located in known functional genomic regions, such as regulatory elements or noncoding RNA regions | Promoters, enhancers, 5′UTR, 3′UTR, and introns | CADD, TraP, and DYNA |
| Variant intolerance features | Identifies intolerant sites in core functional regions | gwRVIS, RVIS, and pLI | JARVIS and NCBoost |
| Population genetics | Uses population frequency data to infer natural selection pressure on variants; rare variants are more likely to be pathogenic | Allele frequency (AF), gnomAD, and 1000 Genomes | NCBoost and JARVIS |
| Clinical and association signals | Direct genetic evidence from disease and molecular phenotype association studies providing direct support for predictions | GWAS, eQTL, HGMD, and ClinVar | DYNA and DVAR |
| Integrated functional prediction scores | Comprehensive scores calculated based on multiple features; can themselves serve as powerful input for advanced models | CADD and EIGEN | regBase-PAT and CADD v1.7 |

Predictive attributes of noncoding variant pathogenicity, associated methods, and models.

This table summarizes eight key attributes commonly used to predict the pathogenicity of noncoding variants, alongside the core computational methods and representative models that leverage each attribute. Each attribute is annotated with the primary methodological approaches—traditional ML, DL, and gLMs—and representative models that integrate these features are highlighted. Detailed descriptions of the representative models are provided in Table 3. This overview serves as a comprehensive reference for understanding how diverse genomic and epigenomic features inform computational predictions of noncoding variant pathogenicity.
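To make the sequence-context row of the table concrete, the following sketch computes GC content and a CpG observed/expected ratio in a window centered on a variant. The function name `sequence_context_features`, the window size, and the toy sequence are assumptions for illustration; published models derive such features with their own pipelines and window definitions.

```python
def sequence_context_features(sequence, variant_index, flank=10):
    """Compute GC content and CpG observed/expected ratio in a window around a variant."""
    seq = sequence.upper()
    # Clip the window to the sequence boundaries.
    start = max(0, variant_index - flank)
    end = min(len(seq), variant_index + flank + 1)
    window = seq[start:end]
    length = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # CpG dinucleotides (CG cannot overlap itself)
    gc_content = (c + g) / length if length else 0.0
    # Observed/expected CpG ratio: (CpG count * window length) / (C count * G count).
    cpg_obs_exp = (cpg * length) / (c * g) if c and g else 0.0
    return {
        "window": window,
        "gc_content": gc_content,
        "cpg_count": cpg,
        "cpg_obs_exp": cpg_obs_exp,
    }

# Toy sequence (assumed); the variant sits at 0-based index 5.
features = sequence_context_features("ATCGCGCGATATATCGCGTA", variant_index=5, flank=4)
```

The returned dictionary can be appended to other annotation-derived columns to form one row of a feature matrix. Real applications typically use much wider windows than this toy example.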

Latent features as determinants of noncoding function

In contrast to explicit attributes, which rely on predefined annotations such as evolutionary conservation, epigenomic marks, chromatin states, and known regulatory elements, latent features capture higher-order sequence patterns that are not readily summarized by a limited set of handcrafted variables but are, nonetheless, critical for regulatory function. These latent features reflect not only the strength and direction of individual transcription factor binding sites but also the combinatorial syntax among multiple motifs and the context provided by flanking sequences at different genomic scales. Together, they shape how a given allele alters the regulatory landscape when considered in combination with the surrounding background sequence. Sequence-based learning models are designed to extract latent features from raw DNA sequence data. Under supervised training signals, they learn hierarchical representations that typically progress from local motif patterns to intermediate-range combinatorial syntax and ultimately to long-range dependencies. Consequently, sequence learning models can complement traditional annotation-based integration methods when explicit annotations are lacking in content, context, or allele-specific resolution. gLMs further extend this paradigm. Models such as DNABERT-2 (Zhou et al., 2023), Nucleotide Transformer (NT) (Dalla-Torre et al., 2025), and long-context architectures including HyenaDNA (Nguyen et al., 2023) learn generalizable sequence representations through self-supervised pretraining on large-scale genomic corpora. In downstream tasks, these pretrained representations can be transferred through fine-tuning or by comparing reference and alternative allele embeddings. As a result, gLMs help bridge the gap between explicit annotation-based modeling and sequence-derived latent representation learning, particularly in settings where detailed allele-specific resolution is required or where functional annotations remain sparse.
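The idea of comparing reference and alternative allele embeddings can be sketched as follows. This is a minimal illustration under a strong assumption: the `kmer_embedding` function is a toy stand-in (a normalized k-mer frequency vector), not the encoder of DNABERT-2, NT, or HyenaDNA. In an actual workflow, the `embed` argument would be replaced by a function that returns a pretrained model's sequence representation. All function names and the example sequence are hypothetical.

```python
import math
from itertools import product

def kmer_embedding(sequence, k=3):
    """Toy stand-in for a gLM encoder: a normalized k-mer frequency vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            vec[index[kmer]] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)

def variant_effect_score(reference, position, alt_base, embed=kmer_embedding):
    """Distance between embeddings of the reference and alternative-allele sequences."""
    alternative = reference[:position] + alt_base + reference[position + 1:]
    return cosine_distance(embed(reference), embed(alternative))

# Toy reference window (assumed); position is 0-based.
ref = "GGGCGTATAAAAGGCGCGTTACG"
score_snv = variant_effect_score(ref, position=8, alt_base="C")
score_same = variant_effect_score(ref, position=8, alt_base=ref[8])
```

Substituting the reference base for itself gives a distance of approximately zero, whereas a true substitution shifts the embedding. With a pretrained gLM, the same reference-versus-alternative comparison would be made on the model's learned representations, and the resulting distance could be used directly as a score or as an input feature for a downstream classifier.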

The evolution of noncoding variant pathogenicity prediction

Guided by the literature search and screening strategy (Supplementary Figure S1) described in the Supplementary Material, we systematically screened the published literature and selected representative computational models for noncoding variant pathogenicity prediction. These methods can be broadly grouped into three methodological categories: integrative annotation-based models, context-dependent sequence feature models, and foundation models for noncoding variant interpretation. As shown in Table 3, the three categories differ substantially in input representation, feature-learning strategy, model architecture, and output formulation. Together, they illustrate the field’s progression from feature engineering-driven approaches to hybrid sequence-learning frameworks and, more recently, self-supervised foundation models.

| Model | Model type | Input features | Core architecture | Predicted output | Training sets | Publication year |
| --- | --- | --- | --- | --- | --- | --- |
| Integrative annotation-based models | | | | | | |
| CADD | ML | Functional annotations (conservation, epigenomics, TFBS, etc.) | SVM | Pathogenicity score (continuous) | Simulated DNMs and variants arisen and fixed in human populations | 2014–2024 |
| GWAVA | ML | Regulatory annotations and sequence context | RF | Functional importance score (continuous) | HGMD regulatory variants vs. 1000 Genomes common variants | 2014 |
| DANN | DL | Same features as CADD | Multilayer DNN | Pathogenicity score (continuous) | Same as CADD (observed vs. simulated) | 2014 |
| FATHMM-MKL | ML | Functional annotations, sequence conservation, and protein features | Multiple kernel learning SVM | Functional effect score (continuous) and prediction confidence score | HGMD pathogenic SNVs vs. 1000 Genomes | 2015 |
| ReMM | ML | Curated Mendelian noncoding mutations, conservation, and epigenomic annotations | RF + resampling | Pathogenicity score (continuous) | Hand-curated set of regulatory Mendelian mutations and derived alleles of human evolution | 2016 |
| Eigen | ML | Functional annotations, sequence conservation, and genomic context | Unsupervised learning + spectral meta-learn
