Completing spatial transcriptomics data for gene expression prediction benchmarking

Spatial Transcriptomics (ST) is an emerging technology that precisely localizes gene expression profiles within histological images (Jiang et al., 2023). While histology analysis is the gold standard for the diagnosis of many diseases (Xie et al., 2023), transcriptomics unlocks molecular insights that unveil the causal pathways behind pathologies (Zeng et al., 2022, Jiang et al., 2023). Beyond disease research, ST has broad applications in developmental biology, enabling the study of tissue formation, cellular differentiation, and organogenesis with spatial resolution (Choe et al., 2023). Additionally, ST is valuable in regenerative medicine and tissue engineering, guiding the design of biomaterials and cell-based therapies through a deeper understanding of gene expression patterns in healthy and regenerating tissue (Lammi and Qu, 2024). By integrating histology with transcriptomics, ST opens new possibilities for understanding tissue structure and gaining mechanistic insight into various biological processes (Wang et al., 2023).

As with any emerging technology, multiple variations of ST are currently available and under continuous development (Stickels et al., 2021, Chen et al., 2015, Patrik et al., 2016). Notably, as demonstrated by the number of entries in the comprehensive ST repository (Wang et al., 2023), Visium (Patrik et al., 2016) has emerged as the most widely used ST technology. The workflow of this technology is depicted in Fig. 1 and begins with the preparation of the tissue, where the sample is embedded, sectioned, and placed on a slide with designated capture areas. Next, staining and imaging are performed using standard histological techniques to visualize tissue structures. Once imaged, the tissue is permeabilized, allowing mRNA to be released. Then, this mRNA is captured using barcoded oligonucleotides, enabling spatial mapping of gene expression. A reverse transcription reaction is then used to synthesize cDNA from the captured mRNA, which is subsequently processed into sequencing libraries. Finally, specialized analysis software processes the sequencing data, generating spatially resolved gene expression maps for visualization and interpretation (Patrik et al., 2016).

Despite its advantages, this approach presents key challenges: high costs, the need for domain expertise, and slow adoption in clinical settings, all of which limit its accessibility in routine diagnostics (Pang et al., 2021). On the technical side, it also inherits a data-capture issue from bulk and single-cell transcriptomics (Pham et al., 2023, Avşar and Pir, 2023). This issue, known as dropout, is the failure to detect transcripts even though they are present in the source tissue. In practice, dropout appears as pepper noise in gene expression maps and often requires single-cell reference datasets to compensate for the missing data (Avşar and Pir, 2023).
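To make the dropout phenomenon concrete, the following is a minimal sketch on synthetic data. It is not the completion model proposed in this work: the toy expression matrix, the 30% dropout rate, the function names, and the per-gene median fill are all assumptions chosen only for illustration. It zeroes out random entries of a spots-by-genes matrix to mimic undetected transcripts and then measures how far a naive fill lands from the true values.

```python
import numpy as np

def simulate_dropout(expr, rate, rng):
    """Zero out a random fraction of entries to mimic undetected transcripts."""
    mask = rng.random(expr.shape) < rate  # True where a value is dropped
    corrupted = expr.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

def median_fill(corrupted, mask):
    """Replace dropped entries with the per-gene median of the observed spots."""
    filled = corrupted.copy()
    for g in range(corrupted.shape[1]):
        observed = corrupted[~mask[:, g], g]
        if observed.size:  # leave the gene untouched if no spot was observed
            filled[mask[:, g], g] = np.median(observed)
    return filled

def masked_mse(truth, estimate, mask):
    """Mean squared error computed only on the dropped entries."""
    return float(np.mean((truth[mask] - estimate[mask]) ** 2))

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 spots x 5 genes of positive log-expression values.
truth = rng.normal(loc=5.0, scale=1.0, size=(200, 5)).clip(min=0.1)
corrupted, mask = simulate_dropout(truth, rate=0.3, rng=rng)
filled = median_fill(corrupted, mask)

mse_zeros = masked_mse(truth, corrupted, mask)
mse_median = masked_mse(truth, filled, mask)
print(f"MSE with dropouts left as zeros: {mse_zeros:.3f}")
print(f"MSE after per-gene median fill:  {mse_median:.3f}")
```

In this toy setting the dropped entries are known, so the error can be measured directly on them. In real ST data the true values are unavailable, which is why completion methods are typically evaluated by masking observed entries and predicting them back.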

Acknowledging these challenges, the deep learning community has sought to democratize ST by studying gene expression prediction from histology images (Jiang et al., 2023). By bypassing the need for specialized sequencing, these approaches offer a more accessible and scalable alternative, making it possible to obtain molecular insights into a tissue from a standard biopsy image. Leveraging the abundance of public Visium data, multiple deep learning models have emerged to tackle this task (He et al., 2020, Pang et al., 2021, Yang et al., 2023, Yang et al., 2024, Xie et al., 2023, Zeng et al., 2022, Mejia et al., 2023). Although these methods consistently report favorable results against the latest state of the art, differences in datasets, preprocessing strategies, and training hyperparameters hinder fair comparisons and compromise the validity of new findings.

In our previous MICCAI paper titled “Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion” (Mejia et al., 2024), we introduced initial efforts to address the limitations discussed above. In this work, we substantially build upon and refine those initial contributions. First, we enhance the methodology by introducing comprehensive ablation studies to support our design choices for SpaCKLE, including the contribution of data pre-completion, the integration of visual features, the effect of incorporating context-gene information, and the impact of neighborhood size. Second, we broaden the SpaRED benchmark by adding the state-of-the-art model HGGEP (Li et al., 2024) and systematically evaluating its performance across all 26 datasets. Third, we provide a more comprehensive analysis with additional qualitative and statistical results for both our completion model and the SpaRED benchmark, offering deeper insights into SpaCKLE’s performance and a more detailed comparative evaluation of existing gene expression prediction models.

Our key contributions can be summarized as follows.

1. We systematically compile, curate, and standardize 26 public ST datasets into the Spatially Resolved Expression Database (SpaRED), an extensive Visium resource encompassing human and mouse samples from nine tissue types.

2. To address the dropout problem, we introduce Spatial transcriptomics Completion with Knowledge from the Local Environment (SpaCKLE), a transformer-based model inspired by the unrivaled power of self-attention mechanisms for next token prediction in natural language processing (Dosovitskiy et al., 2020). Notably, SpaCKLE surpasses existing gene completion approaches, achieving a relative 82.5% MSE reduction compared to the median method.

3. We establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data. This benchmark exposes the proximity in performance across all the models we study and the need for exploring new approaches in this task. Moreover, our benchmark also demonstrates that SpaCKLE significantly enhances gene expression prediction performance across all tested models.

To ensure the reproducibility of our experiments and facilitate the implementation of SpaCKLE, we provide the SpaRED library, available on PyPI. Additionally, we present a web platform to explore SpaRED data, access key statistics, and download both raw and processed datasets.
