The initial 3D conformation of PFOA in SDF format was downloaded from the PubChem repository (https://pubchem.ncbi.nlm.nih.gov/) (Fig. 2A), along with its toxicity assessment profile (Fig. 2C). As illustrated, PFOA exhibited a high carcinogenicity score of approximately 80%, obtained from the ProTox-3.0 database, which applies machine learning models trained on curated experimental datasets to predict chemical carcinogenic potential. Target prediction was performed using an integrative approach across five databases: bSDTNBI (network-based drug-target inference), ChEMBL (bioactivity data repository), CTD (Chemical-Gene-Disease interactions), STITCH (chemical-protein functional associations), and PharmMapper (pharmacophore mapping server). The combined dataset resulted in a comprehensive target landscape comprising 9,591 candidate proteins (Fig. 2B).
Fig. 2
Computational characterization of PFOA structure, carcinogenicity, and putative target profiles. (A) Three-dimensional structure of PFOA retrieved from the PubChem database. (B) Venn diagram summarizing the predicted molecular targets of PFOA identified by five independent databases, including bSDTNBI, ChEMBL, CTD, STITCH, and PharmMapper. (C) Toxicological evaluation of PFOA indicating a high carcinogenicity score approaching 80%, as predicted by computational toxicity assessment tools
Transcriptomic Integration, Differential Expression Analysis, and Co-Expression Network ConstructionPublicly accessible transcriptomic profiles from five independent bladder cancer cohorts were retrieved from the GEO and integrated into the analysis. Among these, two datasets (GSE52519 and GSE13507) were randomly selected and integrated to construct the training cohort. To minimize technical variation attributable to differences in sequencing platforms and experimental conditions, batch effects were rigorously corrected, and data normalization was performed following a standardized preprocessing pipeline. To assess the efficacy of batch correction, the resultant shifts in gene expression distributions are displayed in Figs. 3A and 3B (Figure S1), respectively. Differential gene expression analysis was carried out to detect genes with statistically meaningful expression variations between cancerous and normal bladder tissues. The identified differentially expressed genes (DEGs) are visualized as a heatmap (Fig. 3C), while the corresponding volcano plot (Fig. 3D) depicts the distribution of log2 fold changes in relation to adjusted statistical significance. The resulting DEGs were subjected to WGCNA to delineate gene expression modules exhibiting coordinated regulation. The hierarchical clustering dendrogram alongside module color assignments is presented in Fig. 3E, where distinct colors denote discrete gene co-expression modules, and grey indicates unassigned genes. This analysis revealed the modular architecture underlying the transcriptomic landscape. The heatmap in Fig. 3F depicts the correlation strength and significance between gene modules and clinical traits, identifying the black module as exhibiting the strongest correlation with tumor status. Moreover, Fig. 3G and Table S2 indicate that the module membership (MM) of genes is closely and significantly related to their gene significance (GS) values across all modules, revealing that genes within each module exhibit highly consistent biological functional relevance to the trait of interest.
Fig. 3
Identification of differentially expressed genes and construction of co-expression networks via WGCNA. (A) Principal component analysis (PCA) plot displaying the distribution of transcriptomic profiles from GSE52519 and GSE13507 datasets prior to batch effect correction. (B) PCA plot after batch effect removal and normalization, showing improved dataset integration. (C) Heatmap illustrating the expression patterns of differentially expressed genes (DEGs) identified in the training cohort. (D) Volcano plot displaying the distribution of DEGs based on log2 fold change and statistical significance (E) Hierarchical clustering dendrogram and module identification results from weighted gene co-expression network analysis (WGCNA); distinct colors represent different gene modules, while grey indicates unassigned genes. (F) Heatmap summarizing module-trait relationships, revealing that the black module shows the most significant correlation with tumor status. (G) Scatter plot demonstrating a highly significant correlation between gene significance (GS) and module membership (MM) within the black module (P < 0.001), indicating strong biological coherence of genes associated with the trait of interest
Identification and Functional Characterization of PFOA-Associated Core Genes in Bladder CarcinogenesisTo further delineate the molecular targets of PFOA in bladder cancer, an integrative analysis was performed by intersecting differentially expressed genes, the WGCNA-derived black module, and predicted PFOA target genes, yielding 69 core candidate genes (Fig. 4A). Protein–protein interaction network analysis (Fig. 4B) revealed extensive interconnectivity among these genes, suggesting coordinated functional roles.
Fig. 4
Identification and functional analysis of PFOA -associated core genes in bladder cancer. (A) Venn diagram showing the intersection of differentially expressed genes (DEGs), WGCNA black module genes, and predicted PFOA target genes, resulting in 69 core candidate genes. (B) Protein–protein interaction (PPI) network of the 69 core genes constructed using STRING, illustrating the extensive interconnectivity and potential functional cooperation among these targets. (C) Gene Ontology (GO) enrichment analysis. (D) KEGG pathway enrichment analysis of core genes
Gene Ontology enrichment analysis of the 69 overlapping genes between PFOA targets and bladder cancer-associated genes revealed a predominant involvement in cell cycle regulation. Key biological processes included mitotic cell cycle phase transition, regulation of mitotic nuclear division, and G2/M transition, underscoring potential perturbations in mitotic fidelity. Cellular component annotations were enriched in chromosomal domains, including the centromeric region, telomeric region, and replication fork, suggesting genomic instability as a plausible mechanistic axis. Kinase-related molecular functions were prominent, encompassing regulation of cyclin-dependent protein serine/threonine kinases, enzymatic activity toward histones, and affinity for damaged DNA, implicating dysregulation of phosphorylation cascades and DNA repair pathways. Collectively, these results suggest that PFOA may promote bladder carcinogenesis through interference with genome maintenance and cell cycle progression (Fig. 4C). Concurrently, Pathway enrichment analysis (KEGG) of the 69 genes derived from dual-omics integration confirmed their significant convergence on central axes of tumorigenesis and cell cycle governance. Key enriched pathways included cell cycle, apoptosis, cellular senescence, and the p53 and FoxO signaling pathways. Multiple cancer types were implicated, including bladder cancer, glioma, melanoma, and leukemia, alongside pathways related to drug resistance and microRNA regulation. Notably, several virus-associated carcinogenic pathways-such as HPV, EBV, HIV-1, and hepatitis B/C-were also enriched, suggesting that PFOA may engage host oncogenic programs via viral mimicry or immune evasion. These findings indicate that PFOA-associated genes converge on conserved oncogenic and cell regulatory networks implicated in bladder cancer progression (Fig. 4D).
Machine Learning-Based Core Gene Selection and SHAP Interpretation of Predictive ContributionsFollowing comprehensive model construction and performance evaluation using 113 distinct machine learning algorithms. Among the 113 machine learning models evaluated, glmBoost exhibited the highest mean AUC (0.980) with a low CV_AUC (0.0251)(Table S3, Fig S2), indicating excellent predictive accuracy and stability across validation cohorts. Other top-performing models, such as Stepglm[both] + GBM (Mean_AUC = 0.966, CV_AUC = 0.0246) and Lasso + glmBoost (Mean_AUC = 0.975, CV_AUC = 0.0275), also demonstrated robust performance with minimal variation. In contrast, models like Lasso + Stepglm[both] showed lower predictive accuracy (Mean_AUC = 0.611) and higher variability (CV_AUC = 0.4261), reflecting poorer generalizability. Based on these findings, the Gradient Boosting with Component-wise Linear Models (glmBoost) approach was selected as the final predictive model. This choice was driven by glmBoost’s ability to balance high predictive accuracy, feature interpretability, and robustness across both internal and external validation cohorts. Its component-wise linear modeling framework enhances interpretability by quantifying the contribution of individual features, while the iterative boosting process effectively captures complex, non-linear relationships inherent in biological data. To ensure model generalizability and minimize overfitting, a repeated five-fold cross-validation strategy was embedded within the iterative training routine of all candidate algorithms. Across all external validation cohorts, the AUC served as the core metric for assessing the predictive performance of each model. Ultimately, the glmBoost model was prioritized for downstream analysis due to its optimal combination of gene feature count, predictive accuracy, and performance stability. This final model identified nine core predictive genes: ADRM1, BCL2L1, CDC25C, IGFBP2, MAD1L1, MCM7, SKP2, TOP2A and UPK3A. Performance comparisons of multiple machine learning models across datasets are presented (Fig. 5A), while the predictive accuracy and validity of the selected model genes are demonstrated (Fig. 5B). The predictive performance of the constructed nine-gene signature was further assessed using an internal–external validation strategy. In the training set, the model achieved near-perfect classification performance with an AUC of 0.986 (Figure S3A). Robust generalizability was demonstrated across independent validation cohorts, yielding AUC values of 0.991 in GSE19915 (Figure S3B), 1.000 in GSE3167 (Figure S3C), and 0.944 in GSE166716 (Figure S3D), thereby confirming the stability and high discriminatory power of the proposed biomarker panel across diverse clinical cohorts. Analysis of expression distribution showed that seven out of nine signature genes resided within the regions of elevated expression on the volcano plot, while two genes were found in the low-expression regions (Fig. 5C), further supporting their functional relevance in tumor biology. The impact of individual genes on prediction performance was assessed through SHAP analysis. The SHAP summary plot (Fig. 5E) highlighted MCM7 as the most influential gene, underscoring its pivotal role in DNA replication and cell cycle regulation, both of which are critical processes in bladder carcinogenesis. The SHAP bar plot (Fig. 5D) ranked all nine genes by their mean absolute SHAP values, providing a quantitative assessment of their relative impact on the model’s output. Analysis of the SHAP dependence plot (Fig. 5F) indicated that malignancy risk varied non-linearly with MCM7 expression levels, with a sharp increase in risk when expression exceeded 9 FPKM, suggesting a threshold effect with potential diagnostic utility. ROC curve analysis was used to evaluate the discriminative ability of ten machine learning models constructed using the selected gene signature. Among them, the random forest (RF) model showed the best performance, achieving an AUC of 0.990, and notably, even the least performing model, decision tree (DTS), yielded an AUC of 0.867 (Fig. 5G). These results collectively underscore the robust and consistent discriminative capacity of the identified biomarkers across diverse machine learning algorithms. The SHAP force plot (Fig. 5H) provides an individualized visualization of gene-specific contributions to prediction outcomes, further enhancing model interpretability for potential clinical application.
Fig. 5
Identification of core predictive genes using machine learning and interpretation through SHAP analysis. (A) Comparative evaluation of the predictive accuracies across 113 machine learning algorithms in both training and validation datasets. (B) Performance evaluation of the final model genes (ADRM1, BCL2L1, CDC25C, IGFBP2, MAD1L1, MCM7, SKP2, TOP2A, and UPK3A) demonstrating their robust predictive efficacy. (C) Distribution of model genes within gene co-expression modules. (D) SHAP bar plot ranking the mean absolute SHAP values across model genes. (E) SHAP summary plot depicting the relative contribution of each gene to model prediction; (F) SHAP dependence plot; (G) ROC analysis demonstrated excellent classification performance of the selected gene signature across ten machine learning models; (H) SHAP force plot;
Subgroup Analysis of Signature PerformanceTo evaluate the clinical applicability of the nine-gene signature, subgroup analyses were performed across clinically relevant variables, including age, pathological stage, tumor grade, muscle invasion, and gender, in the training cohort and three independent validation datasets (GSE166716, GSE19915, and GSE3167) (Figure S4). In the training cohort (Figure S4A-E), the nine-gene signature achieved near-perfect predictive accuracy in age-based subgroups (AUC = 1.00), while demonstrating moderate discrimination for tumor grade, muscle invasion, and pathological stage (AUC range: 0.594–0.714).Across external validation cohorts, the model consistently showed strong performance in age and tumor grade subgroups (AUCs up to 1.00 and 0.944, respectively), whereas predictive accuracy in gender-, stage-, and muscle invasion-based subgroups remained modest and variable across datasets (Figure S4F-O). Overall, these results indicate that the nine-gene signature is robust across age and tumor grade strata, while reduced performance in other subgroups likely reflects biological heterogeneity and limited subgroup sample sizes. Accordingly, these subgroup analyses were intended as sensitivity analyses to assess model robustness rather than to derive subgroup-specific clinical conclusions.
Integrative Analysis of Core PFOA-Associated Genes and Their Functional ImplicationsSpatial genomic mapping revealed the chromosomal distribution of the nine PFOA-associated core genes (Fig. 6A), providing a framework for subsequent functional interrogation. Integration of normalized gene expression profiles with CIBERSORT-estimated immune cell fractions demonstrated significant associations between core gene expression and tumor immune microenvironment composition (Fig. 6B). Correlation analysis revealed that several core genes exhibited distinct associations with specific immune cell subsets. ADRM1 expression was positively correlated with resting NK cells, while CDC25C showed positive correlations with plasma cells, regulatory T cells (Tregs), and M1 macrophages. IGFBP2 demonstrated a notable positive correlation with Tregs, suggesting a potential role in modulating immune suppression. MAD1L1 expression showed significant positive correlations with neutrophils, activated mast cells, and activated dendritic cells, indicating involvement in innate immune responses. Furthermore, MCM7 and TOP2A were both positively correlated with M1 macrophages, with TOP2A also showing a significant association with gamma delta T cells. UPK3A expression exhibited a positive correlation with Tregs, implicating it in immune evasion mechanisms. These findings highlight the immune-modulatory roles of the identified core genes and suggest potential crosstalk between PFOA-associated molecular pathways and the tumor immune microenvironment in bladder cancer. Protein–protein interaction network analysis delineated direct and indirect interactions among these core genes (Fig. 6C). Differential expression analysis revealed distinct expression patterns among the nine model genes. ADRM1, BCL2L1, MAD1L1, MCM7, SKP2, CDC25C, and TOP2A expression were substantially elevated in bladder tumors when contrasted with normal bladder tissue, suggesting their potential oncogenic involvement. In contrast, IGFBP2 and UPK3A exhibited higher expression levels in normal bladder tissues, implying possible tumor-suppressive roles. These findings underscore the heterogeneous transcriptional behavior of the model genes and highlight the functional relevance of the majority in bladder carcinogenesis (Fig. 6D).
Fig. 6
Multi-layered characterization of PFOA-associated core targets in bladder cancer. (A) Genomic mapping of nine core genes across chromosomal loci. (B) Correlation matrix showing associations between core gene expression and immune cell infiltration (CIBERSORT). (C) Protein–protein interaction network depicting molecular connectivity among the core targets. (D) Differential expression of core genes in tumor versus normal bladder tissues. ADRM1, BCL2L1, MAD1L1, MCM7, SKP2, CDC25C, and TOP2A are upregulated in tumors, whereas IGFBP2 and UPK3A show higher expression in normal tissues, highlighting heterogeneous transcriptional patterns
Binding Affinity Profiling of PFOA with Core Bladder Cancer TargetsMolecular docking analysis was conducted to explore the potential binding interactions between PFOA and each of the nine model proteins (Fig. 7A–I). The predicted binding energies ranged from − 6.4 to − 13.0 kcal/mol, indicating variable affinity across targets (Table 1). Among them, IGFBP2 exhibited the most favorable predicted binding energy (− 13.0 kcal/mol), whereas UPK3A showed comparatively weaker interaction (− 6.4 kcal/mol). Intermediate binding affinities were observed for TOP2A (− 9.0 kcal/mol), MCM7 (− 8.0 kcal/mol), MAD1L1 (− 7.9 kcal/mol), and SKP2 (− 7.7 kcal/mol). These results suggest that PFOA may have the physicochemical potential to associate with multiple proteins implicated in cell cycle regulation, DNA replication, and chromatin-related processes. However, the docking analysis represents an in silico prediction of binding feasibility rather than experimental evidence of physical interaction or functional modulation. Accordingly, these findings should be interpreted as hypothesis-generating and provide a rationale for prioritizing selected targets-such as IGFBP2 and MCM7-for further experimental validation.
Fig. 7
Molecular docking analysis of PFOA interactions with Core Bladder Cancer Targets (A) Molecular docking model of PFOA bound to ADRM1. (B) Molecular docking model of PFOA bound to BCL2L1. (C) Molecular docking model of PFOA bound to CDC25C. (D) Molecular docking model of PFOA bound to IGFBP2. (E) Molecular docking model of PFOA bound to MAD1L1. (F) Molecular docking model of PFOA bound to MCM7. (G) Molecular docking model of PFOA bound to SKP2. (H) Molecular docking model of PFOA bound to TOP2A. (I) Molecular docking model of PFOA bound to UPK3A
Comments (0)