TỐI ƯU HÓA DỰ ĐOÁN TƯƠNG TÁC PROTEIN-PROTEIN TỪ BIỂU DIỄN NGÔN NGỮ THÔNG QUA CƠ CHẾ CHỌN LỌC ĐẶC TRƯNG ĐA GIAI ĐOẠN VÀ HỌC MÁY XẾP CHỒNG

Xuân Văn Mai; Khánh Duy Trương; Thị Hạnh Trương; Tiến Đạt Trần; Ngoc Nhớ Nguyễn; Tuong Tri Nguyen

doi:10.26459/hueunijtt.v134i2A.8152

Vol. 134 No. 2A (2025), Research Articles

Optimizing Protein-Protein Interaction Prediction from Language Representations via Multi-stage Feature Selection and Stacking Ensemble Learning

Published 2025-12-31

Tuong Tri Nguyen

Institute of Open Education and Information Technology (Viện Đào tạo mở và Công nghệ thông tin)

https://orcid.org/0000-0002-1379-0131

8152 (Vietnamese)

Abstract

Protein–protein interactions (PPIs) underlie many intracellular biological processes, and predicting PPIs directly from amino acid sequences remains a core direction in computational biology. Next-generation Protein Language Models (PLMs), such as ESM-2, generate sequence representations rich in evolutionary information and latent structural signals. However, these representations are often extremely high-dimensional, noisy, and strongly intercorrelated, which makes them hard for traditional machine learning models to exploit and increases the risk of overfitting. Addressing this requires an approach that removes redundancy while preserving the core biological signals. In this work, we propose E–StackPPI (Embedding-Stacking Protein-Protein Interaction prediction framework), a fully embedding-based PPI prediction framework centered on a three-stage layer-wise feature selection mechanism applied directly to embeddings aggregated from the last hidden layers of the ESM-2 650M model. Specifically, the process sequentially (1) removes dimensions with low variance; (2) retains highly discriminative features based on LightGBM feature importance; and (3) eliminates dimensions with high Pearson correlation to reduce information redundancy. The refined feature set is fed into a stacking architecture in which two parallel LightGBM branches are integrated at the decision layer via Logistic Regression (LR). Experiments on two benchmark datasets from the Database of Interacting Proteins (DIP) [1], DIP–Yeast and DIP–Human, show that E–StackPPI achieves favorable and stable results across key metrics, including accuracy, Matthews correlation coefficient (MCC), ROC-AUC, and PR-AUC. When benchmarked against twelve advanced methods summarized in the study by Li et al. [2], our model demonstrates competitive performance on both datasets.
These findings highlight the essential role of layer-wise feature selection in mitigating noise and effectively leveraging high-dimensional PLM embeddings, thereby opening a feasible and promising sequence-only approach to PPI prediction without the need for supplementary structural data.

Keywords

Protein-protein interaction; Multi-stage feature selection; Protein Language Models; Stacking model.
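
The three-stage selection described in the abstract can be sketched roughly as follows. This is a minimal, standard-library Python illustration and not the authors' implementation: the function names, the thresholds (`var_min`, `top_k`, `corr_max`), and the toy data are all assumptions, and a simple class-separation score stands in for the LightGBM feature importance that the paper uses in stage 2. The stacking classifier is not shown.

```python
import math
import random


def variance(col):
    """Population variance of one feature column."""
    mean = sum(col) / len(col)
    return sum((v - mean) ** 2 for v in col) / len(col)


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either column is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0


def class_separation(col, labels):
    """Stand-in importance score: |mean(pos) - mean(neg)| / std.
    ASSUMPTION: the paper ranks features by LightGBM importance instead."""
    pos = [v for v, y in zip(col, labels) if y == 1]
    neg = [v for v, y in zip(col, labels) if y == 0]
    std = math.sqrt(variance(col))
    if not pos or not neg or std == 0:
        return 0.0
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg)) / std


def select_features(X, y, var_min=1e-3, top_k=3, corr_max=0.9):
    """Return the sorted column indices kept after the three stages."""
    cols = {j: [row[j] for row in X] for j in range(len(X[0]))}
    # Stage 1: drop low-variance dimensions.
    kept = [j for j in cols if variance(cols[j]) > var_min]
    # Stage 2: keep the top_k dimensions by importance score.
    ranked = sorted(kept, key=lambda j: class_separation(cols[j], y),
                    reverse=True)[:top_k]
    # Stage 3: greedily drop any dimension that is highly correlated
    # with a better-ranked dimension already kept.
    final = []
    for j in ranked:
        if all(abs(pearson(cols[j], cols[k])) <= corr_max for k in final):
            final.append(j)
    return sorted(final)


# Toy data (hypothetical): 200 samples, 5 "embedding" dimensions.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = []
for label in y:
    f0 = label + rng.gauss(0, 0.1)   # informative
    f1 = 2 * f0 + 0.5                # exact linear copy of f0 (redundant)
    f2 = 1.0                         # constant (zero variance)
    f3 = rng.gauss(0, 1)             # pure noise
    f4 = label + rng.gauss(0, 0.5)   # informative, noisier
    X.append([f0, f1, f2, f3, f4])

selected = select_features(X, y)
print("kept columns:", selected)
```

On this toy input, the constant column is removed in stage 1, the noise column falls outside the top-ranked set in stage 2, and only one of the two linearly dependent columns survives stage 3. In practice the thresholds would need tuning on validation data.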

This work is licensed under a Creative Commons Attribution 4.0 International License.