Controlling gene expression using AI-designed cis-regulatory elements

Synthetic biology aims to reprogram organisms by designing genetic circuits, in which transcriptional regulatory elements play a pivotal role in controlling gene expression (Gao et al., 2023; Wang et al., 2013). These DNA sequences function as genetic switches and dials, determining the timing, location, and magnitude of gene transcription (Ali and Brewster, 2022). Cis-regulatory elements (CREs), such as promoters, enhancers, silencers, and insulators, are non-coding DNA sequences that modulate gene expression by influencing transcription factor binding and chromatin state (Biłas et al., 2016; Li et al., 2015). CREs are indispensable for constructing genetic circuits and optimizing metabolic pathways (Brophy and Voigt, 2014; Chubukov et al., 2012; Hoynes-O'Connor and Moon, 2015; Sprinzak and Elowitz, 2005). However, the structure and mechanisms of regulatory elements vary significantly among different types of organisms (Wittkopp and Kalay, 2012). Despite remarkable progress in sequencing and high-throughput assays, the design of functional regulatory DNA remains a major bottleneck, largely due to the complexity and context dependence of gene regulation. Recent advances in artificial intelligence (AI), particularly deep generative and foundation models, offer a way to address this challenge by opening new opportunities for understanding and engineering regulatory sequences (Moldwin and Shehu, 2025; Yu and Zhang, 2024).

CREs exhibit remarkable diversity across different organisms, reflecting distinct modes of transcriptional control (Wittkopp and Kalay, 2012). In prokaryotes, regulatory systems are relatively simple; typical transcriptional regulatory elements include promoters and operators, generally located adjacent to genes, with transcription and translation often coupled (McAdams et al., 2004; Zhou and Yang, 2006). Conversely, eukaryotic transcriptional regulation is more complex, involving not only core promoters but also enhancers, silencers, and upstream regulatory elements (Harbison et al., 2004; Wray et al., 2003). Furthermore, regulation in eukaryotes typically involves numerous transcription factors and alterations in chromatin structure (Barral and Zaret, 2024; Zaret and Mango, 2016). Promoters, typically 50–300 bp in length, govern transcription initiation and thus constitute critical determinants of gene expression (Tayara et al., 2020; Zhang et al., 2022a, 2022b). Enhancers integrate multiple transcription factor inputs to drive cell type-specific or condition-specific gene activation, whereas silencers and insulators function to repress or delimit gene expression domains (Ali et al., 2016; Field and Adelman, 2020; Kang et al., 2020; Riethoven, 2010). Therefore, precise design of these elements is essential for applications ranging from gene therapy to bioproduction, as accurate control of gene expression can enable more effective therapeutic outcomes and robust microbial cell factories (Delvigne et al., 2018; Ding and Liu, 2024; Jung et al., 2021; Xu and Liu, 2024; Yang et al., 2025).

However, designing regulatory DNA sequences remains highly challenging due to the immense sequence space (Levo and Segal, 2014; Zhang et al., 2024). Even a DNA sequence only 60 bp long has 4^60 (about 1.3 × 10^36) possible variants, and natural evolution has sampled only a small fraction of this diversity (Zhang et al., 2024). Traditional engineering approaches, such as random mutagenesis or assembling known motifs, probe only a narrow subset of the possible sequence space and therefore require substantial experimental screening. Their exploration of sequence space largely depends on the quality of the constructed library and the efficiency of transformation (Alper et al., 2005; Guiziou et al., 2016; Guo et al., 2012; Johns et al., 2018; Kosuri et al., 2013; Liu et al., 2018; Qin et al., 2011; Redden and Alper, 2015; Yim et al., 2013). The disparity between vast potential design spaces and limited empirical exploration underscores the need for more intelligent computational methods to guide the design of regulatory sequences. Advances in sequencing technologies and high-throughput experimental systems have ushered biology into a data-driven era. Traditional analysis methods often struggle to handle and analyze such data, particularly when volumes are large and data types are diverse (Kashyap et al., 2016). Conventional bioinformatics techniques, such as sequence alignment and gene annotation, typically rely on explicit rules and manually engineered feature extraction (Ejigu and Jung, 2020). In recent years, deep learning has emerged as a powerful tool for decoding and leveraging genomic regulatory codes (Peleke et al., 2024). Trained on extensive datasets of DNA sequences and their corresponding activities, deep neural networks can infer complex mappings from sequence to function without explicit prior structural or biological rules (Li et al., 2024a).
These models have demonstrated strong performance in predicting gene expression levels from DNA sequences and identifying key regulatory features, occasionally surpassing traditional machine learning models (Chen et al., 2024; DaSilva et al., 2024; Ding et al., 2023; Peleke et al., 2024). Importantly, deep learning not only enhances our understanding of gene regulation but is now being harnessed to create novel regulatory elements (Goshisht, 2024; Yang et al., 2024). By learning the “intrinsic regulatory code” of cells, deep learning models can generate synthetic DNA sequences that drive desired expression patterns. This represents a paradigm shift from trial-and-error mutagenesis toward more rational, data-driven regulatory DNA design. In particular, generative deep learning models, which are capable of creating new sequences with defined characteristics, have opened novel pathways for engineering promoters, enhancers, and other regulatory components.
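To give a sense of scale for the 4^60 figure quoted above, the following minimal Python sketch computes the size of the sequence space for a 60 bp element and the fraction that a screening library could cover. The function name `sequence_space_size` and the library size of 10^9 variants are illustrative assumptions, not values taken from the cited studies.

```python
import math


def sequence_space_size(length: int, alphabet_size: int = 4) -> int:
    """Return the number of distinct sequences of a given length (A, C, G, T by default)."""
    return alphabet_size ** length


# Exact count of 60 bp DNA sequences; Python integers do not overflow.
n_variants = sequence_space_size(60)

# Order of magnitude of the sequence space (log10 of 4^60).
log10_variants = 60 * math.log10(4)

# Hypothetical library size, chosen only for illustration (assumption).
library_size = 10**9
fraction_sampled = library_size / n_variants

print(f"4^60 = {n_variants:.3e} possible sequences")
print(f"log10(4^60) = {log10_variants:.1f}")
print(f"fraction covered by a 1e9-variant library = {fraction_sampled:.1e}")
```

Under these assumptions, even a library of a billion variants covers fewer than one in 10^27 of the possible 60 bp sequences, which is the gap that model-guided design is meant to narrow.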

Despite rapid advances, a conceptual and methodological gap persists between biological and computational disciplines in modeling regulatory elements. Existing studies have explored either generative AI for sequence design or large pretrained DNA language models, but a unified framework connecting these paradigms is still lacking (Hu et al., 2023). To address this, the present review outlines two complementary paradigms for regulatory element design: (1) de novo generative modeling, which directly trains models on regulatory sequences to create functional elements, and (2) foundation model–based fine-tuning, where large-scale DNA language models are adapted for regulatory design tasks. We further discuss how these approaches intersect to enable controllable and interpretable gene expression design.

Building upon this framework, we systematically summarize recent advances in generative models, including generative adversarial networks (GANs), variational autoencoders (VAEs), Transformers, and diffusion models, applied to the design of CREs. The review covers strategies spanning data acquisition, feature encoding, and model architectures, as well as their practical applications in generating diverse regulatory sequences. We also examine the growing potential of DNA foundation models (DNA-FMs) for predicting and generating regulatory elements, and discuss emerging challenges and future directions that highlight the transition of synthetic biology from empirical exploration toward truly data-driven, generative design.
