Abstract
Protein–protein interactions (PPIs) underpin many intracellular biological processes, and predicting PPIs directly from amino acid sequences remains a core problem in computational biology. The advent of next-generation Protein Language Models (PLMs), such as ESM-2, enables the generation of sequence representations rich in evolutionary information and latent structural signals. However, these representations are very high-dimensional, noisy, and strongly internally correlated, which makes them difficult for traditional machine learning models to exploit and increases the risk of overfitting. Addressing this requires an approach that removes redundancy while preserving the core biological signal. In this work, we propose E–StackPPI (Embedding-Stacking Protein-Protein Interaction prediction framework), a fully embedding-based PPI prediction framework centered on a three-stage layer-wise feature selection mechanism applied directly to embeddings aggregated from the last hidden layers of the ESM-2 650M model. Specifically, the process sequentially (1) removes dimensions with low variance; (2) retains highly discriminative features based on LightGBM feature importance; and (3) eliminates dimensions with high Pearson correlation to reduce redundancy. The refined feature set is fed into a stacking architecture in which two parallel LightGBM branches are combined at the decision layer by Logistic Regression (LR). Experiments on two benchmark datasets from the Database of Interacting Proteins (DIP) [1], DIP–Yeast and DIP–Human, show that E–StackPPI achieves favorable and stable results across key metrics, including accuracy, Matthews correlation coefficient (MCC), ROC-AUC, and PR-AUC. When benchmarked against the twelve advanced methods summarized by Li et al. [2], our model demonstrates competitive performance on both datasets.
These findings highlight the essential role of layer-wise feature selection in mitigating noise and effectively leveraging high-dimensional PLM embeddings, thereby opening a feasible and promising sequence-only approach to PPI prediction without the need for supplementary structural data.

Keywords: Protein-protein interaction; Multi-stage feature selection; Protein Language Models; Stacking model.
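The three-stage selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no thresholds, so `var_threshold`, `top_k`, and `corr_threshold` are hypothetical values, and the function name `select_features` is invented for illustration. To keep the example dependent on NumPy only, stage 2 falls back to the absolute correlation of each feature with the label as a plainly labeled stand-in for LightGBM feature importance; an importance vector from a fitted model can be passed in through `importances` instead.

```python
import numpy as np


def select_features(X, y, var_threshold=1e-4, top_k=64,
                    corr_threshold=0.9, importances=None):
    """Return column indices of X kept after three filtering stages.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(X.shape[1])

    # Stage 1: drop dimensions whose variance is at or below the threshold.
    idx = idx[X.var(axis=0) > var_threshold]
    if idx.size == 0:
        return idx

    # Stage 2: keep the top_k most important dimensions.
    if importances is None:
        # Stand-in score (assumption): |Pearson correlation with the label|.
        # The paper uses LightGBM feature importance here instead.
        Xc = X[:, idx] - X[:, idx].mean(axis=0)
        yc = y - y.mean()
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
        score = np.abs(Xc.T @ yc) / denom
    else:
        score = np.asarray(importances, dtype=float)[idx]
    order = np.argsort(score)[::-1][:top_k]
    idx = idx[order]  # now sorted from most to least important

    # Stage 3: greedy pruning of highly correlated dimensions. Walking in
    # importance order means the more important member of a pair survives.
    corr = np.atleast_2d(np.abs(np.corrcoef(X[:, idx], rowvar=False)))
    kept = []
    for j in range(idx.size):
        if all(corr[j, k] <= corr_threshold for k in kept):
            kept.append(j)
    return idx[kept]
```

In a full pipeline, the returned indices would select the columns passed to the stacking stage. How the two LightGBM branches differ is not stated in the abstract, so that stage is left out of the sketch.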

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2025
