A graph clustering algorithm with hypergraph learning and a core-attachment strategy for protein complex identification

Abstract

Protein complexes play a crucial role in cellular biological processes. Identifying these complexes is essential for understanding cellular functions and biological mechanisms. Graph clustering approaches to identify protein complexes in protein-protein interaction (PPI) networks have become a significant research hotspot in data mining and bioinformatics. Many graph clustering methods have been developed for protein complex identification. However, most existing methods only utilize original networks to discover dense subgraphs and ignore higher-order topological characteristics. Considering the prevalent multi-relational and complex interactions in biological networks, a graph clustering algorithm based on hypergraph learning and a core-attachment strategy is proposed for protein complex identification, called HLCA. Hypergraph networks are employed to directly model multi-relational interactions. Based on this method, a multi-level hypergraph is used as higher-order topology and a core-attachment strategy are adopted to identify protein complexes. Firstly, the original PPI network is transformed into a hypergraph network. Secondly, a hierarchical compression strategy is applied to recursively compress the hypergraph into smaller hypergraphs at various levels, forming a multi-level analytical framework. Thirdly, hypergraph convolution is performed across different hierarchical levels to obtain node representations at each level. These node representations are then combined to produce complete node embeddings. Based on these node embeddings, a weighted PPI network is constructed by cosine similarity from the original PPI network. Core clusters are obtained in this weighted network by cluster density. Finally, remaining protein nodes are added to the core clusters using a core-attachment strategy combining hyperedge density and overlap. The effectiveness of HLCA is evaluated by comparing it with other protein complex identification methods on multiple datasets. Experimental results show that the proposed method outperforms comparison methods regarding F-measure and Accuracy.

1 Introduction

Proteins are large organic molecules formed by amino acids linked through peptide bonds. They serve as core executors of cellular structure and function. Proteins participate in nearly all biological processes, including enzyme catalysis, signal transduction, and immune defense. However, most proteins do not function individually. Instead, they form complexes with other biomolecules such as proteins and nucleic acids to perform complex biological functions (Zhang et al., 2015). As critical functional units in cellular processes, protein complexes participate in numerous biological activities, including cell cycle regulation, signal transduction, and gene expression regulation. Extensive evidence indicates that protein complexes are closely associated with various diseases (Wu et al., 2013), making them significant for biomedical research.

Experimental methods for identifying protein complexes, such as nuclear magnetic resonance (NMR) (Göbl et al., 2014) and mass spectrometry (Walzthoeni et al., 2013), are costly and time-consuming. Thus, they are unable to comprehensively identify all protein complexes. Computational methods identify protein complexes by mining closely connected protein node sets in protein-protein interaction (PPI) networks (Alberts, 1998). PPIs can be obtained using sequence-based prediction methods (Dunham and Ganapathiraju, 2021), deep learning-based prediction methods (Hua et al., 2022; Cao and Chen, 2023), and evolutionary relationship-based prediction methods (Li et al., 2008b). These methods help construct comprehensive PPI networks. Developing computational methods based on network topology to identify protein complexes from PPI networks has become a major research direction. Such methods contribute to understanding complex intracellular biological processes, promote disease research, and facilitate drug discovery, highlighting their significant research value and application potential (Pan et al., 2022). More details on the related work are introduced in the related work section.

Currently, numerous computational methods have been proposed for protein complex identification. Among them, the methods that utilize network topology information mainly include graph partitioning algorithms, heuristic-based algorithms, and graph embedding algorithms, etc.

The core idea of graph partitioning algorithms is to optimize a specific objective function. These methods partition a graph into multiple disjoint subgraphs that contain a certain number of nodes or edges. Connections within subgraphs are dense, while connections between subgraphs are sparse (Shang et al., 2025). Algorithms based on this idea usually consider each subgraph as a protein complex. For example, the Markov clustering algorithm (MCL) (Vlasblom and Wodak, 2009) identifies subgraphs by simulating random walks. The algorithm adds loops to the input graph and transforms it into a Markov matrix. This matrix is repeatedly updated through expansion and inflation operations until the graph is segmented into multiple subgraphs. Each subgraph represents a predicted protein complex. However, this approach produces only non-overlapping clusters. Thus, although MCL can handle noisy nodes, it cannot detect overlapping clusters. The Soft Regularized MCL (SR-MCL) algorithm (Shih and Parthasarathy, 2012) was proposed to address this limitation. Algorithms developed on this concept, such as BOPS (Lyu et al., 2022) and DPFO (Wang et al., 2025a), further improved clustering performance. Protein complex identification algorithms developed under this approach perform complex identification by partitioning PPI networks. Such methods neglect the authentic multi-node cooperative interactions among protein populations and therefore fail to capture the many-to-many interaction characteristics intrinsic to biological networks.

Heuristic-based algorithms identify protein complexes by selecting seed proteins as initial clusters and then greedily expanding these clusters. Classic examples include ClusterONE (Nepusz et al., 2012) and DPClus (Altaf-Ul-Amin et al., 2006). ClusterONE addresses overlapping clusters in PPI networks. It starts from individual seed nodes and applies a greedy strategy to identify high-density clusters. ClusterONE merges overlapping clusters by quantifying the overlap between each pair and discarding clusters that do not meet certain criteria. It treats high-density regions as protein complexes. DPClus introduces the concept of “cluster periphery”. It assigns edge weights based on common neighbors and determines node weights from the sum of adjacent edge weights. DPClus starts clustering from the node with the highest weight and expands the cluster iteratively until the density and cluster-periphery conditions are satisfied. DPClus accurately identifies high-density regions. Another heuristic approach based on seed expansion is the core-attachment method. Gavin et al. (Gavin et al., 2006) initially proposed that protein complexes consist of core and attachment components. The main idea is to divide protein complexes into two functional units: core proteins and attachment proteins. Core proteins are stable and frequently recurring protein groups, while attachment proteins are loosely connected proteins dynamically associating with or dissociating from the core. Based on this concept, the CORE algorithm (Leung et al., 2009) introduces the core-attachment model. It predicts core components and identifies attachment proteins interacting with these cores. The COACH method (Wu et al., 2009) selects proteins whose node degrees are above average as preliminary core proteins within their neighborhood. A recursive core-elimination method extracts high-density subgraphs as the final cores. Attachment proteins are identified from neighboring nodes that interact with more than half of the core proteins. Attachment proteins are then combined with core proteins to form predicted protein complexes. Based on these approaches, further algorithms such as WPNCA (Peng et al., 2014), WCOACH (Kouhsar et al., 2015), SEGC (Wang et al., 2017), GCAPL (Wang et al., 2023), and Multiobjective (Mukhopadhyay et al., 2024) have been developed. Heuristic protein complex identification algorithms operate on PPI networks by modeling one-to-one node interactions. Such designs similarly neglect the complex many-to-many interaction characteristics intrinsic to biological networks.

Graph embedding is a technique that maps nodes (or other graph structures such as edges and subgraphs) into low-dimensional vector spaces. This mapping preserves structural information from the original topological graph. Similarities between nodes in the embedding space approximate their relationships in the original network (Xu, 2021). For example, DPCMNE (Meng et al., 2021), captures local and global topological information by embedding nodes into low-dimensional vector spaces at different granularities. Initially, the Louvain algorithm divides the original PPI network into modules by grouping closely connected protein nodes. For these modules, the DeepWalk algorithm obtains low-dimensional embeddings of proteins. These embeddings are then used to construct weighted networks, and protein complexes are subsequently identified using a core-attachment strategy. Another algorithm, ELF-DPC (Wang et al., 2022) proposes an ensemble learning framework. This framework integrates various types of information and employs voting regression models to enhance protein complex identification. Furthermore, hypergraph models grounded in graph embedding concepts have been introduced and combined with diverse bioinformatics data and protein dynamic features for protein complex identification, exemplified by HyperGraphComplex (Xia et al., 2024) and HGST (Wang et al., 2025b). Most graph embedding-based protein complex identification algorithms project PPI networks into low-dimensional vector spaces. This projection neglects higher-order non-pairwise interactions among protein nodes.

Most existing protein complex identification methods rely on pairwise interactions derived from PPI networks. Such low-order modeling approaches cannot accurately capture higher-order cooperative interactions among protein groups. To overcome this limitation, hypergraph-based modeling has been introduced to represent complex structural interactions in biological networks. Unlike traditional graphs that restrict relationships to pairs of nodes, hypergraphs allow a single hyperedge to connect multiple nodes simultaneously, thereby enabling the modeling of higher-order and non-pairwise interactions among proteins (Gao et al., 2022). This one-to-many relationship representation is particularly suitable for describing protein complexes composed of multiple interacting proteins.

In summary, this paper propose a graph clustering algorithm for protein complex identification based on hypergraph learning and a core-attachment strategy, named HLCA. The HLCA algorithm transforms the PPI network into a hypergraph and extracts topological features at multiple scales through a hierarchical compression strategy, resulting in informative protein node embeddings. Protein complexes are then identified using these embeddings in combination with a core-attachment strategy. This approach overcomes the limitations of traditional graph-based pairwise interaction modeling and effectively integrates both local and global topological information from PPI networks, providing a novel and powerful framework for protein complex identification.

2 Methods

The HLCA algorithm consists of four major components: hypergraph construction, hierarchical compression, node embedding, and cluster generation. First, the original PPI network is converted into a hypergraph. Second, during the hierarchical compression phase, a hypergraph modularity-based strategy is employed to recursively compress the hypergraph into multiple smaller hypergraphs at different levels. Third, in the node embedding phase, hypergraph convolution operations are applied to these compressed hypergraphs across all hierarchical levels, generating multi-scale embedding representations for each node. These representations are subsequently integrated to obtain a final unified node embedding vector. Finally, in the cluster generation phase, a weighted PPI network is constructed by computing the cosine similarity between node embeddings. Core clusters are then identified within this weighted network based on cluster density. Additional protein nodes are subsequently assigned to the core clusters using a core-attachment strategy that incorporates both hyperedge density and overlap. The complete workflow of the HLCA algorithm is illustrated in Figure 1.

Flowchart illustrating a protein complex detection framework using hierarchical hypergraph compression. The pipeline starts from PPI network construction to hypergraph formation, follows with hierarchical compression across multiple levels using GCN and HGCN modules for feature embedding, and concludes with cosine similarity-based clustering resulting in weighted PPI network clusters and protein complex identification.

The workflow of the proposed HLCA algorithm.

2.1 Hypergraph construction

The PPI network is typically modeled as an undirected graph . In this work, we transform the PPI network into a hypergraph denoted by , where is the set of nodes, is the set of hyperedges, is a diagonal matrix of hyperedge weights, and represents the node feature matrix. Here, is the number of nodes and is the dimensionality of node features.

To construct the hypergraph, we define the neighborhood of each protein node in the PPI network as Equation 1:where indicates an interaction between proteins and in the original PPI network. Based on this definition, the hyperedge associated with node is constructed as Equation 2:which represents the local interaction group centered at protein . When two nodes mutually interact, the hyperedges they induce are equivalent; thus, and denote the same hyperedge.

The incidence matrix characterizes the relationship between nodes and hyperedges. Each element indicates whether node belongs to hyperedge , and is defined in Equation 3:

The hyperedge degree matrix is defined as , where represents the number of nodes contained in hyperedge . The node degree matrix is defined as , where reflects the weighted degree of node in the hypergraph. The hyperedge degree and node degree are computed in Equation 4 and Equation 5, respectively:where denotes the weight of hyperedge , and is the number of nodes it contains. This method mitigates the bias introduced by large hyperedges and ensures a balanced representation of local structural information in the hypergraph.

2.2 Hierarchical compression

In this section, a hypergraph modularity method is employed to partition the hypergraph into smaller hypergraphs. Through hierarchical compression, both local and global topological information of the hypergraph can be effectively extracted (Kumar et al., 2020).

Traditional hypergraph modularity introduces a null model and defines modularity directly based on this model. However, in biological networks, hyperedges derived from PPI networks often vary substantially in size. If the traditional method is applied directly, large hyperedges may dominate the model. Direct use of the traditional criterion may overemphasize large hyperedges and reduce the validity of hierarchical compression. To mitigate this limitation, a node-degree–preserving hypergraph modularity (Kumar et al., 2020) is adopted. The hypergraph is converted into a weighted graph, and node degrees are preserved during the conversion. This step reduces degree bias caused by large hyperedges. A null model is then defined on the weighted graph to obtain a revised modularity measure. Hierarchical compression is performed through this method, which also enables the extraction of both local and global topology. The multi-level embedding strategy allows HLCA to effectively maintain both fine-grained local structural details and higher-order connectivity patterns by integrating representations from both low-order and high-order embeddings.

Under this formulation, the hypergraph is transformed into a weighted graph. Each hyperedge is replaced by a clique formed by all of its protein nodes. The resulting weighted adjacency matrix is defined in Equation 6:where is the incidence matrix of the hypergraph and is the diagonal matrix of hyperedge weights. After removing self-loops, the degree of protein node in the weighted graph is computed in Equation 7:

To ensure that the node degrees of the weighted graph align with those of the original hypergraph, a normalized weighted adjacency matrix is defined in Equation 8:where normalizes the contribution of each hyperedge.

A degree-preserving null model is then introduced, and the expected interaction weight between nodes is computed in Equation 9:where represents the total weighted degree of all nodes in the original hypergraph.

The hypergraph modularity matrix is obtained as shown in Equation 10.

Using this matrix, the hypergraph modularity value is computed using Equation 11:where if and belong to the same community, and 0 otherwise. denotes the total weight of all edges in the weighted graph.

For a given hypergraph , each node is initially assigned to its own community. During optimization, the modularity gain is evaluated for each possible community reassignment. A node is moved only if the reassignment increases ; otherwise, it remains in its current community. When no further improvement is possible, the layer is considered converged.

The hierarchical compression process consists of modularity optimization and module aggregation. The hypergraph modularity divides the hypergraph into several modules. Each module is aggregated as a “supernode” at this stage to form a new and smaller hypergraph . Next, modularity optimization is reapplied to the new hypergraph . This iterative process achieves hierarchical compression and produces a series of multi-level compressed hypergraphs . From the perspective of the higher-order hypergraph, local and global topological information can be analyzed. The embedding representations of protein nodes at different scales are learned.

2.3 Node embedding

Hypergraph convolution is applied to all hypergraphs obtained through hierarchical compression to learn protein node embeddings. The original topological graph considers only first-order proximity, which limits its ability to capture complex relationships among nodes. As a result, the derived adjacency matrix is sparse, hindering effective node embedding in the hypergraph. To address this limitation, an attribute similarity graph is constructed to uncover latent associations between protein nodes. This enriched connectivity facilitates more effective hypergraph convolution and improves representation learning. The adjacency matrix of the attribute similarity graph is denoted by , where each entry is defined in Equation 12:where . The adjacency matrix of the original PPI network is defined in Equation 13:

Although can measure potential similarity between nodes, it may introduce false-positive links between unrelated nodes (Xiang et al., 2024). Therefore, an improved adjacency matrix is constructed as shown in Equation 14:where denotes the neighborhood of node , and represents the minimum similarity between node and all its neighbors. If is greater than or equal to this threshold, the similarity is preserved; otherwise, it is removed.

Using this similarity graph, an enhanced adjacency matrix is constructed as expressed in Equation 15:where is the PPI adjacency value between nodes and , and is a balancing hyperparameter.

We perform graph convolution to obtain the first-layer node features , which serve as the input to the first layer of hypergraph convolution. The computation of is shown in Equation 16:where is the degree matrix of , is the node weight matrix of the PPI network, and is the initial node adjacency vector in the PPI network.

A two-layer hypergraph convolutional network is constructed, using ReLU as activation. The first layer aggregates information from immediate hyperedge neighbors, while the second layer propagates information from more distant nodes, resulting in richer node embeddings. The two-layer hypergraph convolutions are defined in Equation 17 and Equation 18:where is the learnable hypergraph convolution parameter matrix. The second hypergraph convolution is the corresponding node embedding representation.

For each hypergraph in , we obtain a node embedding set If node in the -th hypergraph corresponds to a “supernode”, its embedding is redistributed back to its constituent nodes in the original PPI network. Suppose “supernode” in the -th hypergraph corresponds to node set . Then the embedding of node is computed as in Equation 19:where is the normalized weight of node within its module.

Finally, the hypergraph node embedding for node is obtained by concatenating its embeddings from all hypergraph levels, as shown in Equation 20:where is the final embedding of node , and denotes the hypergraph layer to which node belongs.

2.4 Cluster generation

In the clustering phase, a weighted PPI network is constructed by computing the cosine similarity between the protein embeddings derived from Equation 20. Let the embedding vector of protein node be , and the embedding vector of node be . Their cosine similarity is computed as shown in Equation 21:

A core–attachment strategy is then employed to identify protein complexes from the weighted PPI network . This strategy consists of two steps: (1) detecting core proteins and (2) attaching additional proteins to expand the cores into full complexes. A depth–first search is applied to enumerate all maximal cliques containing at least three protein nodes, forming the candidate core set . These cliques are sorted in descending order according to their density defined in Equation 22:where denotes the -th candidate core in the set .

Suitable seed cores are then selected from . First, the clique with the highest density, , is added to . For each remaining clique , overlapping proteins with are removed. If the remaining part of satisfies , then is updated as ; otherwise, it is removed from , since protein complexes with fewer than three proteins are usually considered biologically unreliable. This process is repeated until no cliques remain. The resulting seed core set is denoted as .

Next, core members are added to each seed core by combining cluster density, neighborhood relationships, and the presence of triangles and squares in the PPI network. A core node is selected from , and its first-order neighbors are evaluated. Let denote the first neighbor added to the core. If adding increases the cluster density, then becomes a core cluster member; otherwise, the remaining neighbors are evaluated. For each seed core , if core cluster , the seed core is regarded as a valid core cluster and included in , where dens is a parameter. The core cluster set is defined as . The cluster density is computed in Equation 23:where and denote the numbers of edges and nodes in cluster , respectively.

To ensure that the expanded core cluster maintains strong internal connectivity, we select appropriate accessory protein nodes to add to the core clusters. Considering the core clusters as hyperedges (seeds), new protein nodes are gradually adsorbed to expand these initial core clusters, thus obtaining the complete protein complex. In order to ensure that the hyperedge still has high internal connectivity after the candidate protein nodes are added, this paper introduces the HyperedgeDensity in Equation 24:where if protein pair has an edge in the original PPI network; otherwise, .

In addition, relying on the Hyperedge Density alone for adsorption may lead to excessive node sharing between different cores, resulting in functional confusion and blurred boundaries between different protein complexes. To avoid this situation, this paper introduces Hyperedge Overlap to clarify nodes’ difference between core clusters, defined in Equation 25, to distinguish differences between core clusters:

A candidate protein node that has not yet been assigned to any core cluster is added to core cluster when HyperedgeDensity and HyperedgeOverlap are both satisfied.

2.5 Time complexity analysis

The overall computational complexity comprises four main components: hypergraph construction, hierarchical compression, node embedding, and cluster generation. Let and denote the number of nodes and edges in the PPI network. denotes the number of hierarchical compression levels, while denotes the embedding dimension.

Specifically, in the hypergraph construction phase, the PPI network is transformed into a hypergraph by generating hyperedges and building a node-hyperedge incidence matrix. This step has a time complexity of . In the hierarchical compression stage, hypergraph modules are optimized and aggregated. Each layer of the hypergraph is partitioned into communities, and each community is merged into a “supernode”. If the number of edges in layer is , the time complexity of this process is . In the node embedding stage, two hypergraph convolutions are applied to each hypergraph layer. The time complexity of each convolution is . Therefore, the total time complexity of this stage is . In the node clustering stage, the algorithm performs maximal clique enumeration within the local neighborhood of each node. It is well known that the number of maximal cliques in an arbitrary graph can reach in the worst case (Moon-Moser theorem), and therefore maximal clique enumeration is NP-hard in general. However, our algorithm does not enumerate maximal cliques over the entire PPI network. Instead, the enumeration is restricted to the local neighborhood of each node, whose size is approximately equal to the average degree . Biological networks, including PPI networks, are typically sparse and exhibit heavy-tailed degree distributions, which significantly limits the size of local neighborhoods and prevents worst-case exponential behavior from occurring in practice. Let denote the actual number of maximal cliques in a neighborhood of size . Although is theoretically upper-bounded by , empirical observations on biological networks show that grows slowly and remains close to a low-degree polynomial in . Therefore, the practical complexity of one clique enumeration step can be expressed as , and the total cost across all nodes becomes . Since in real PPI networks and is small due to network sparsity, the effective runtime of this stage is well approximated by . In practice, HLCA has been successfully applied to larger datasets such as BIOGRID and DIP, demonstrating its scalability and confirming that the algorithm can handle larger networks effectively. Consequently, the overall computational complexity of the HLCA framework remains within the scale.

2.6 Evaluation metrics

To evaluate the performance of the proposed HLCA algorithm, we use several metrics, including F-measure, precision, recall, and Accuracy (ACC). The F-measure is calculated based on the precision and recall of the test. Precision represents the ratio of true positives to all positive results, including false positives, while recall represents the ratio of true positives to all actual positive samples. ACC is the geometric mean of Sensitivity (Sn) and Positive Predictive Value (PPV).

To compare the detected protein complexes with known protein complexes, the neighborhood affinity score between them is defined as Equation 26:where denotes a identified protein complex and denotes a known (ground truth) complex. The neighbourhood affinity score measures similarity between two complexes, with higher values indicating greater similarity. This metric also facilitates the overlap assessment between the detected complex and the standard complex . If is greater than or equal to a predefined threshold , and are considered a match.

The precision, recall, and F-measure metrics are calculated using Equations 27–29:

In addition, Accuracy (ACC), sensitivity (Sn), and positive predictive value (PPV) are defined as shown in Equations 30–32:

Here, denotes the number of proteins shared between the known protein complex and the detected protein complex , and denotes the number of proteins in the known complex .

F-measure and Accuracy can each reflect the algorithm’s performance in different dimensions, but the single use of one of the metrics may ignore the comprehensiveness of the algorithm’s performance. According to previous studies (Vlasblom and Wodak, 2009; Altaf-Ul-Amin et al., 2006; Omranian et al., 2021), F-measure reflects the algorithm’s comprehensive balance between the precision and recall of the prediction results. In contrast, Accuracy can intuitively reflect the overall agreement between the algorithm’s prediction results and the real complexes. Combining these two metrics helps to assess the algorithm’s performance more objectively and comprehensively. In this paper, we use F-measure + Accuracy as a comprehensive evaluation index, denoted as F1 + ACC.

3 Result3.1 Experimental datasets

The PPI networks used in this paper are Gavin1 (Gavin et al., 2002; Wang et al., 2025a), Gavin2 (Gavin et al., 2006; Abbas et al., 2025), K_extend (Krogan et al., 2006; Wang et al., 2025b), DIP (Xenarios et al., 2002; Wang et al., 2025b), and BioGRID (Oughtred et al., 2019; Papik et al., 2025). As a complete species-specific set of PPI networks, these yeast datasets are most widely used in protein complex identification studies. Therefore, yeast is selected as the experimental object. CYC2008 (Pu et al., 2009) and MIPS (Brohee and Van Helden, 2006) are adopted as the ground truth for yeast proteins to conduct parameter analysis and evaluate the clustering results. Table 1 provides detailed information about these PPI networks. The datasets are then preprocessed to remove all self-interactions and duplicate interactions.

DatasetGavin1Gavin2K_extendDIPBioGRIDProteins1,3521,4303,6724,9304,187Interactions3,2106,53114,31717,20120,4543.2 Parameter analysis

To study the influence of key parameters on model performance, a comprehensive sensitivity analysis is conducted for the three main parameters-cluster density threshold , hyperedge density threshold , and hyperedge overlap threshold , used in the proposed algorithm. These parameters play critical roles in determining the structure and quality of the predicted protein complexes.

A total of 1,000 parameter configurations are evaluated by performing a grid search across the three parameters , , and . For each configuration, experiments are carried out on the Gavin1, Gavin2, K_extend, DIP, and BIOGRID datasets, using CYC2008 and MIPS as the ground truths. The sensitivity analysis results are presented in Figures 2, 3.

Five separate 3D scatter plots compare F1-ACC values across parameters theta, epsilon, and density for datasets labeled Gavin1, Gavin2, K_extend, DIP, and BIOGRID. Each plot uses color gradients from purple to yellow to indicate F1-ACC variation, with a vertical color bar for reference. Plots display grid arrangements, axes labels, and similar data distributions for visual comparison of parameter effects among datasets.

CYC2008 as benchmarks, parametric analysis on different datasets.

Five separate 3D scatterplots each represent the relationship between three variables, labeled as theta, epsilon, and dens, against the F1-ACC metric, displayed by color gradients from purple to yellow. Each plot corresponds to a different dataset: Gavin1, Gavin2, K_extend, DIP, and BIOGRID. Color bars indicate the F1-ACC value range for each dataset, with similar patterns but slight differences in color distribution and value intervals across plots.

MIPS as benchmarks, parametric analysis on different datasets.

Experimental results show that increasing the cluster density threshold leads to a rapid improvement in performance, followed by stabilization across all datasets. The overall F1 + ACC metric peaks when is set around 0.8, indicating that this value allows the model to effectively capture compact and robust core protein complexes. Under this setting, the model can effectively capture more reasonable and robust core clusters. The hyperedge density threshold also significantly impacts performance across datasets. While variation is observed among datasets, performance tends to be optimal when reaches 0.7, where F1 + ACC consistently remains high. This suggests that higher values facilitate the inclusion of informative attachment nodes, thereby enhancing protein complex identification accuracy. The hyperedge overlap threshold exhibits notable sensitivity as well. Performance degrades when is set too low or too high. This situation indicates that extreme values may either underrepresent or overrepresent functional modules. The optimal performance and stability are observed when is set to 0.5, which balances module distinctiveness and inter-complex sharing.

Although different parameter settings may yield slightly better results on specific datasets, the parameter combination of , , and consistently outperforms baseline algorithms across all evaluation metrics. While tuning parameters individually per dataset could further enhance performance, it may compromise the model’s generalizability. Overall, when , , and , the algorithm maintains high F1 + ACC performance across different datasets. This indicates that parameter values within these ranges deliver robust and reliable performance. Considering overall performance, stability, and cross-dataset generalization, the parameter combination is selected as the final configuration. This selection ensures strong performance across diverse datasets and robust structural consistency, highlighting the effectiveness and rationale behind the parameter design. This parameter combination ensures that HLCA performs reliably across different datasets without the need for dataset-specific parameter tuning, addressing concerns related to the sensitivity and reproducibility of results.

3.3 Effectiveness analysis of hypergraph learning and improved core-attachment strategy

In this study, the HLCA algorithm integrates hypergraph learning with an improved core-attachment strategy for protein complex identification. In the hypergraph learning component, higher-order topological information is incorporated into the PPI network. This information enables a more effective modeling of multi-protein cooperative patterns and improves the representation of complex interaction relationships. In the core-attachment component, tightly associated protein nodes are grouped into cores based on cluster density. Attachment proteins are then assigned according to node metrics and higher-order topological structure. To evaluate the effect of higher-order topology and the improved strategy on recognition performance, a series of comparative experiments is designed and conducted. The results of the experiments are shown in Figure 4 (with CYC2008 as the ground truth and the BIOGRID dataset as an example). As shown in Figure 4, LL denotes the use of only low-order topological information, and HL denotes the use of only high-order topological information. LL + CA represents the use of low-order topological information combined with the improved core-attachment strategy, while HL + CA represents the use of high-order topological information combined with the improved core-attachment strategy.

Bar chart comparing F-measure, Accuracy, and combined F1 plus Accuracy across four groups labeled LL, HL, LL plus CA, and HL plus CA; combined F1 plus Accuracy scores are highest in each group.

CYC2008 as ground truth, effectiveness analysis of hypergraph learning and improved core-attachment strategy on BIOGRID dataset (LL, low-order topology; HL, high-order topology; LL + CA, low-order topology with improved core-attachment strategy; HL + CA, high-order topology with improved core-attachment strategy).

This paper first compares the effects of low-order and high-order topological information on protein complex identification. In this setting, the traditional core-attachment strategy is used to identify protein complexes. The experimental results show that higher-order topology information is superior to lower-order topology information, with F-measure and Accuracy improved by about 47.89% and 20.10%, respectively. This advantage arises from the ability of higher-order topological information to model many-to-many synergistic relationships among proteins, beyond simple binary interactions in PPI networks. As a result, it can effectively capture complex interactions and modular structures that are often difficult to detect in a conventional PPI network.

Further, this paper analyzes the impact of the proposed improved core-attachment strategy on protein complex recognition performance. Based on low-order and higher-order topological information, this paper introduces node metrics in the core search and attachment addition process. The experimental results demonstrate that the proposed improved core-attachment strategy consistently outperforms the original version, regardless of the topological order. When combined with low-order topology, the improved strategy increases F-measure and Accuracy by 13.11% and 16.72%, respectively. Further improvements are observed under higher-order topology, where Accuracy increases by an additional 12.1% over the original strategy. Moreover, compared to low-order topology, the incorporation of higher-order topology further enhances F-measure and Accuracy by 8.16% and 4.75%, respectively, when applying the improved strategy. In summary, the proposed algorithm integrates high-order topological information with node metrics to achieve superior performance in both F-measure and Accuracy. These results suggest that incorporating nodal metrics enhances the ability to capture protein complexes’ structural characteristics, thereby improving recognition effectiveness and predictive accuracy.

3.4 Comparative analysis of hypergraph network embedding experiments

Building upon the incorporation of high-order hypergraph topology, the HLCA algorithm further incorporates a hypergraph network embedding framework and integrates a hierarchical compression strategy into the embedding process. To evaluate the effectiveness of this integration for protein complex identification, a series of comparative experiments were conducted against representative hypergraph neural network methods, including HJRL (Yan et al., 2024), HyperGT (Liu et al., 2024), UniG-Encoder (Zou et al., 2024), and T-HyperGNNs (Wang et al., 2024). The four hypergraph neural network models were used solely to generate their respective node embeddings. These embeddings were then employed for subsequent protein complex identification. The experimental results are shown in Figures 5, 6. In these figures, the horizontal axis denotes the compared algorithms, and the vertical axis denotes the corresponding evaluation metrics.

Five grouped bar charts comparing F-measure, Accuracy, and F1+ACC across five methods (HIBL, HyperGT, T-HyperGCNNS, UniG-Encoder, HLCA) for datasets Gavin1, Gavin2, K_extend, DIP, and BIOGRID. In all panels, HLCA consistently achieves the highest F1+ACC values.

CYC2008 as ground truth, evaluation results by different hypergraph neural network algorithms.

Five grouped bar charts compare F-measure, Accuracy, and F1 + ACC for HHDL, HyperGT, T-HyperGCNNs, UniG-Encoder, and HLCA across datasets Gavin1, Gavin2, K_extend, DIP, and BIOGRID. Yellow bars representing F1 + ACC consistently achieve the highest values in all datasets, with HLCA generally reaching the top performance in each group. Green (F-measure) and red (Accuracy) bars show lower and varied values across methods and datasets.

MIPS as ground truth, evaluation results by different hypergraph neural network algorithms.

HJRL adopts a cross-expansion strategy. In this strategy, h

View original article

FRONTIERS IN GENETICS

Share Bookmark

0 0 0 0 0 0 0

More from this channel

A graph clustering algorithm with hypergraph learning and a core-attachment strategy for protein complex identification

Comments (0)