Boosting Prediction of Protein-Protein Interactions using Word Embedding Techniques

Abstract. Understanding protein-protein interactions (PPIs) helps to identify protein functions and to develop other important applications such as drug preparation and the identification of protein-disease relationships. Machine learning methods have been developed for the PPI prediction task in order to reduce the cost and time of earlier experimental methods. In this paper, we study a method for determining PPIs using deep learning and protein sequence representation learning. In our method, a word embedding technique is utilized for protein sequence representation learning. This technique captures the semantic relationships between amino acids in protein sequences. These semantic relationships are then used as the input information fed into a neural network that recognizes the interaction signature of the input protein pair. Unlike previous studies, we integrate the protein sequence embedding mechanism into the neural network model itself, so the protein sequence embedding is better controlled for PPI prediction. We evaluate our method on benchmark datasets including Yeast, Human, and eight different independent sets. In addition, we conduct an extensive comparison with other existing methods. Our results show that the proposed method is superior to existing methods and achieves high efficiency in predicting cross-species PPIs. The dataset and our source code are available at https://github.com/thnhub/BoostPPIP.git.


Introduction
Determining protein-protein interactions (PPIs) is one of the important problems in the field of Bioinformatics. According to Li et al. [1,3], understanding PPIs helps to identify protein functions and to develop other important applications such as drug preparation and the identification of protein-disease relationships. In recent years, computational methods based on machine learning (ML) for PPI prediction have been widely proposed and studied [4,19]. ML methods are developed from different sources of biological information, such as protein sequences, structural information of proteins, gene ontology annotations, and the semantic similarity of proteins [2,7]. In addition, protein sequence data is growing rapidly, which gives it an advantage over other sources of biological information [5]. To date, works using machine learning have produced various high-performance models that predict protein interactions from protein sequences alone. This success rests mainly on the representation of protein sequences and the selection of suitable learning models.
Among them, Shen et al. [8] used Conjoint Triad (CT) descriptors to encode the physicochemical properties of amino acids in a protein sequence and chose an SVM to learn a classifier from the encoded properties. Guo et al. [6] used Auto Covariance (AC) descriptors to extract features from the amino acid sequence of a protein and then fed these features into a Support Vector Machine (SVM) to predict protein interactions. Because proteins bind to each other at certain regions, You et al. [9,20], Zhou et al. [11], Yang et al. [10], and Zhou et al. [12] suggested using Multi-scale Continuous and Discontinuous local descriptors to encode protein sequences. These authors then experimented with their ideas using SVM and Gradient Boosting Decision Tree (GBDT) classifiers in PPI prediction tasks. Since physicochemical and sequence-order information can be used to describe amino acid sequences, Chen et al. [13] proposed a combination of multiple descriptors, including Pseudo-Amino Acid Composition (PseAAC), Autocorrelation (AC), and CT, to capture that information when encoding protein sequences. The authors then utilized the LightGBM algorithm to learn protein interactions from the extracted information. Besides, evolutionary information can also be mined for encoding protein sequences. In the GTB-PPI model, Yu et al. [14] utilized Pseudo Position-Specific Scoring Matrix (PsePSSM) descriptors to extract the evolutionary information stored in the Position-Specific Scoring Matrix (PSSM). To further enhance prediction performance, they combined it with sequence-order and physicochemical information using PseAAC, Reduced Sequence Index-Vectors (RSIV), and AC descriptors.
The works mentioned above show that protein sequence descriptors have been widely applied to the PPI prediction problem. Features extracted in those ways can be used to feed not only traditional machine learning models but also deep learning models, as in the work of Du et al. [5]. In addition, various feature engineering techniques have been proposed to build higher-quality features for PPI prediction, such as those of Chen et al. [13], Yu et al.
[14], and Yu et al. [18]. However, those feature extraction methods require great human effort in feature engineering. To overcome this disadvantage, various works have attempted to design deep learning models capable of automatically learning protein sequence representations for the PPI prediction problem. For example, Hashemifar et al. [15] proposed the DPPI model, a Convolutional Neural Network (CNN) that takes evolutionary information as raw features to infer PPIs. However, Hashemifar's method runs extremely slowly because it requires running BLAST [40] against a huge non-redundant protein database [41] to generate a PSSM matrix as its feature. Gonzalez-Lopez et al. [16] proposed the DeepSequencePPI model, which is based on Recurrent Neural Networks (RNNs) and learns embedding features directly from protein sequences without using any other feature extraction techniques. This paper consists of 4 sections: Section 3 introduces the problem of PPI prediction and previous works, Section 4 presents our proposed method, Section 5 reports the experimental results and comparisons with other existing works on benchmark datasets, and the conclusion is given in Section 6.

Methods
Determining protein-protein interactions (PPIs) can be regarded as a binary classification problem. The objective is to classify a given protein pair as belonging to the interacting class (denoted 1) or to the non-interacting class (denoted 0). In this study, the input of the PPI prediction problem is a pair of protein sequences and its output is the probability of interaction.
Based on the interaction probability, the given protein sequence pairs can be assigned to class 0 or 1. In this section, we detail our proposed method. First, we describe the technique used to represent protein sequences as embedding features. Then, we introduce our deep neural network (DNN) architecture for determining PPIs from the obtained embedding features.

Protein sequence representation learning
Nowadays, many word embedding techniques have been proposed, for example, Word2vec [23], GloVe [24], and BERT [25]. In this work, we utilize the Word2vec technique with the Continuous Bag of Words (CBOW) model [23], because of its simple architecture and its ability to learn from large amounts of data. To apply the Word2vec technique, we first consider a protein sequence as a sentence in which each word is an amino acid. Inspired by the idea of the CBOW model, we then build an algorithm (named Amino Acid Encoding) to learn an embedding matrix, where each row of this matrix is a vector representing one of the 20 naturally occurring amino acids. Figure 1 illustrates the translation of a protein sequence into the corresponding sentence. Figure 2 describes and illustrates the architecture of the neural network used to learn the embedding matrix. After running the Amino Acid Encoding algorithm, we obtain the amino acid (word) embedding matrix W1. This matrix is then used to produce embedding vectors for an amino acid a and a protein sequence s according to the formulas

emb(a) = onehot(a) · W1, (1)
emb(s) = onehot(s) · W1, (2)

where onehot(a) ∈ ℝ^|V| is a one-hot vector representing a, and onehot(s) ∈ ℝ^(n,|V|) is a one-hot matrix representing s, with n the sequence length.
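As a concrete illustration of this lookup (one-hot vectors multiplied by the embedding matrix W1), here is a minimal numpy sketch; the random W1, the toy embedding size of 4, and the helper names are illustrative assumptions, not the trained matrix or the authors' code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # the 20 standard amino acids
IDX = {a: i for i, a in enumerate(AMINO_ACIDS)}
V, H = len(AMINO_ACIDS), 4                  # vocabulary size, toy embedding size

rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, H))                # embedding matrix (random stand-in here)

def onehot(a):
    """One-hot vector for a single amino acid."""
    v = np.zeros(V)
    v[IDX[a]] = 1.0
    return v

def embed_sequence(seq):
    """Stack one-hot rows and multiply by W1: emb(s) = onehot(s) @ W1."""
    O = np.stack([onehot(a) for a in seq])  # shape (n, V)
    return O @ W1                           # shape (n, H)

E = embed_sequence("MKV")                   # toy 3-residue sequence -> shape (3, 4)
```

Multiplying a one-hot vector by W1 simply selects the corresponding row, which is why the lookup is usually implemented as an indexing operation rather than a matrix product.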
On the other hand, W1 is also used in the Embedding layer of the neural network for the PPI prediction task, which we present in the next section. The softmax(·) activation function is used in the Amino Acid Encoding algorithm to compute the probability distribution over the vocabulary V for the center word w given its context; it is given by the formula

P(w | context) = exp(h · W2(w)) / Σ_{w′ ∈ V} exp(h · W2(w′)), (3)

where h is the average of the context words' embedding vectors, onehot(w) ∈ ℝ^|V| maps w into a one-hot vector, and W2(w) is column w of the matrix W2.
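The CBOW prediction step described above can be sketched in a few lines of numpy: context embeddings are averaged into a hidden vector and projected through the output matrix W2 before the softmax. The random matrices, toy sizes, and function names are assumptions for illustration, not the trained network.

```python
import numpy as np

V, H = 20, 4                                # toy vocabulary and embedding sizes
rng = np.random.default_rng(1)
W1 = rng.normal(size=(V, H))                # input (embedding) matrix
W2 = rng.normal(size=(H, V))                # output matrix

def softmax(z):
    z = z - z.max()                         # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_predict(context_ids):
    """P(center word | context): average context embeddings, project with W2."""
    h = W1[context_ids].mean(axis=0)        # hidden vector, shape (H,)
    return softmax(h @ W2)                  # probability distribution over V words

p = cbow_predict([0, 2, 3, 5])              # 4 toy context word ids
```

Training adjusts W1 and W2 so that the probability of the observed center word is maximized; after training, W1 is kept as the amino acid embedding matrix.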
For the network N to capture the semantic relationships between amino acids, it is important to feed it a large set of sequences. In this study, we utilized the UniProtKB database [27]. Besides, the hyperparameters of N, including the window size w, the learning rate lr, and the number of training steps t, need to be chosen carefully. To find these hyperparameters, we applied the grid search method and finally selected w = 5, lr = 0.025, and t = 10. Our model processes the two input protein sequences in two branches whose architectures are similar. Each branch is constructed from the following layers: Embedding [28], Dense [28], Batch normalization [29], Dropout [30], and Flatten. The Embedding layer takes a protein sequence as its input, and its output is a matrix of size (n, h), where n indicates the fixed sequence length and h the dimension of the amino acid embeddings. The embedding matrix W1 is integrated into the Embedding layers to generate the embedding vectors used for PPI prediction. A Flatten layer is tied after each Embedding layer to transform the Embedding layer's output matrix into a vector of size h · n. Consequently, protein sequences become embedding vectors after passing through the Embedding and Flatten layers. In this way, BoostPPIP can continue to adjust the weights of the matrix W1 during training on PPI datasets. By this integration, the word embedding mechanism is built into the deep neural network model, and the embedding matrix is fine-tuned to be more optimal for the task of PPI classification.
Mathematically, the embedding vector of a sequence s at the output of the Flatten layer is expressed by the formula

flatten(emb(s)) ∈ ℝ^(h·n), (4)

where emb(s) is defined in formula (2). In the case where the model receives a set S of b protein sequences, formula (4) can be expressed as

flatten(emb(S)) ∈ ℝ^(b, h·n). (5)

The Dense layer is a layer in which each neuron receives input from all the neurons of the previous layer. We use Dense layers to learn the non-linear relationships of their inputs, transform a high-dimensional space into a low-dimensional one, and extract abstract features. To learn the non-linear relationships between the inputs, the ReLU activation [31] is added after each Dense layer except the last one. To speed up training and avoid overfitting, Batch normalization [29] and Dropout [30] layers are added after each Dense layer. So, if the input of a Dense layer is x = flatten(emb(s)), its output vector is calculated by the formula

y = Dropout_r(BN(ReLU(W · x + b))), (6)

where Dropout_r(·) randomly sets a fraction r of the columns of its input to 0, and W and b are learnable parameters of the neural network BoostPPIP.
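A minimal numpy sketch of one branch's forward pass (Embedding → Flatten → Dense + ReLU → Batch normalization; dropout is omitted here since it is disabled at inference). All weights, dimensions, and names are toy assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
V, H, N = 20, 4, 6                          # vocab size, embedding size, fixed length n
W1 = rng.normal(size=(V, H))                # embedding matrix shared with Word2vec
Wd = rng.normal(size=(H * N, 8))            # toy Dense weights (8 hidden units)
bd = np.zeros(8)

def relu(x):
    return np.maximum(0.0, x)

def batch_norm(X, eps=1e-3):
    """Normalize each column of the batch, as in the BN formula (eps = 0.001)."""
    return (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

def branch_forward(batch_ids):
    """Embedding -> Flatten -> Dense + ReLU -> BatchNorm for a batch of sequences."""
    E = W1[batch_ids]                       # embedding lookup, shape (b, N, H)
    X = E.reshape(E.shape[0], -1)           # Flatten, shape (b, N*H)
    return batch_norm(relu(X @ Wd + bd))    # shape (b, 8)

batch = rng.integers(0, V, size=(3, N))     # 3 toy integer-encoded sequences
out = branch_forward(batch)
```

Because W1 sits inside the forward pass, gradients from the PPI loss would flow into it during training, which is exactly the fine-tuning behavior the paragraph above describes.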
To connect the two branches and to produce a feature vector representing the input protein sequence pair, we use an Add layer, through which the two output feature vectors of the two branches are added to form a single vector. Specifically, assuming that the inputs of the Add layer are f1(s1) and f2(s2), its output is given by formula (7):

v(s1, s2) = f1(s1) + f2(s2). (7)
The vector v(s1, s2) is then passed to the Classification cascade. Here, the representation of the input protein sequence pair is transformed into a two-dimensional vector by the final layer, which is used for interaction determination. Finally, to assign an interaction probability to the input protein sequence pair, we apply the 2-class Softmax function to the two-dimensional vector z = (z1, z2) at the output of the final layer:

softmax(z)_i = exp(z_i) / (exp(z1) + exp(z2)), i = 1, 2. (8)

The interaction probability of the input protein sequence pair is determined by BoostPPIP by applying formulas (6) to (8). To train our model, we choose the value of n as the average length of the protein sequences in the training set. In the experiments, we utilized the Adam algorithm [32] to train BoostPPIP with the cross-entropy loss function

L(X, Y) = −(1/m) Σ_{i=1}^{m} [y_i · log p_i + (1 − y_i) · log(1 − p_i)],

where X and Y are the sets of m samples (protein pairs and labels, respectively), and p_i is the interaction probability predicted for pair x_i.
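The combination and classification steps can be sketched as follows; the 8-dimensional branch outputs, the random final-layer weights, and the function names are toy assumptions, not the authors' implementation.

```python
import numpy as np

def softmax2(z):
    """2-class softmax over the final layer's two logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def interaction_probability(f1, f2, W, b):
    """Add the two branch vectors, project to two logits, softmax -> P(interact)."""
    v = f1 + f2                             # Add layer: element-wise sum
    z = v @ W + b                           # final 2-unit layer
    return softmax2(z)[1]                   # probability of class 1 (interacting)

def cross_entropy(p, y):
    """Binary cross-entropy for one sample with predicted P(interact) = p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(3)
f1, f2 = rng.normal(size=8), rng.normal(size=8)
W, b = rng.normal(size=(8, 2)), np.zeros(2)
p = interaction_probability(f1, f2, W, b)
```

Note that because the Add layer is a plain sum, the predicted probability is unchanged if the two protein sequences are swapped, which is a natural property for a PPI classifier.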
In our experiments, the learning rate used to optimize the neural network is set to 0.001, with a penalty of 0.001 added to the error function. The implementation is done with support from the Python libraries Tensorflow [33] and Scikit-learn [34].

Evaluation criteria
To evaluate the performance of the models, we use various evaluation metrics, including accuracy (Acc), sensitivity (Sen), precision (Pre), F1-score (F1), and Matthew's correlation coefficient (MCC). These metrics are defined by the following formulas:

Acc = (TP + TN) / (TP + TN + FP + FN)
Sen = TP / (TP + FN)
Pre = TP / (TP + FP)
F1 = 2 · Pre · Sen / (Pre + Sen)
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FN, FP are, respectively, the number of positive samples (interacting protein pairs) predicted to be positive, the number of negative samples (non-interacting protein pairs) predicted to be negative, the number of positive samples predicted to be negative, and the number of negative samples predicted to be positive.
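These definitions translate directly into code; a small pure-Python sketch (the helper name `ppi_metrics` is ours, not from the paper):

```python
def ppi_metrics(tp, tn, fp, fn):
    """Acc, Sen, Pre, F1 and MCC from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                    # sensitivity (recall)
    pre = tp / (tp + fp)                    # precision
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    return {"Acc": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}

# Toy confusion matrix: 90 TP, 80 TN, 20 FP, 10 FN
m = ppi_metrics(tp=90, tn=80, fp=20, fn=10)
```

On balanced datasets Acc is informative on its own, but MCC remains meaningful even when the two classes are imbalanced, which is why both are reported.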
In addition, we also use the area under the Receiver Operating Characteristic curve (AUROC) and the area under the Precision-Recall curve (AUPRC) to evaluate the performance of the models. A larger area indicates higher model performance.

Datasets
We use ten PPI datasets for experiments on the proposed model and for comparison with existing methods. We divide them into two groups: the first group is used for cross-validation and the second group for the independent tests. The first group consists of two datasets, Yeast and Human. The Yeast dataset was built and introduced in the paper [8].
The Yeast dataset contains interacting protein pairs selected from the DIP database [35], including 5,594 interacting pairs and 5,594 non-interacting pairs, obtained after removing protein pairs with a sequence length of less than 50 or a sequence identity of 40% or more using the CD-HIT tool [36]. The Human dataset was introduced by Huang et al. [37] and was collected from the HPRD database (https://www.hprd.org/); it consists of 3,899 interacting protein pairs and 4,262 non-interacting protein pairs. The second group consists of 5 cross-species PPI datasets and 3 PPI network datasets. These datasets were downloaded from the DIP database [35].

Optimal hyperparameters
To choose the optimal configuration for the BoostPPIP model, we need to observe the influence on PPI prediction of different combinations of two hyperparameters: the number of Dense layers in the Feature Extraction cascade and the size h of the embedding vector. The observation is performed on the Yeast dataset by dividing it into three parts: 70% for the training set, 10% for the validation set, and 20% for the test set. The optimal configuration is selected by comparing the model's scores on the test set. Our results (see Figure 4) show that the BoostPPIP model achieves the best performance on most measurements when the embedding vector size is 20 and the network depth is 4, with an Acc of 94.56%, Rec of 94.04%, F1 of 94.58%, and MCC of 87.58%. Therefore, an amino acid embedding size of 20 and 4 Dense layers (in the Feature Extraction cascade) were selected as the optimal configuration of BoostPPIP. In addition, the same experimental method is used to find the optimal number of training epochs for BoostPPIP.

Comparison with traditional machine learning models
The BoostPPIP model is designed based on DNNs and can be regarded as a classifier that takes protein sequences as input. To determine whether using a neural network yields better performance than classifiers based on traditional ML, we compare the prediction results obtained by BoostPPIP and 6 well-known traditional ML models through 5-fold cross-validation on the Yeast dataset. The compared traditional ML classifiers are Naive Bayes (NB), AdaBoost (Ada), SVM, Decision Tree (DT), K-Nearest Neighbors (KNN), and Random Forest (RF). In this experiment, the training of the traditional ML models proceeds in steps: first, the length of the protein sequences in the Yeast dataset is fixed in the same way as mentioned in 4.2; the fixed sequences are then converted to feature vectors according to formula (2); and finally these feature vectors are used to train the traditional ML models. Figure 6 shows the comparison of the models on the AUROC and AUPRC measurements. We can see that BoostPPIP predicted better than NB, DT, Ada, KNN, SVM, and RF. The AUROC and AUPRC values of BoostPPIP are 0.9%-40.4% and 0.7%-44.4% higher than those of the compared classifiers. These results demonstrate that using neural networks to build classifiers is appropriate when combined with embedding features. We continue with other experiments to compare the performance of the proposed method with existing methods.
Subsequent comparison experiments are performed on the same benchmark datasets, with the same sampling method, and with the same predictive performance measurements.

Performance of methods on the Yeast dataset
Most of the methods proposed for PPI prediction use the Yeast dataset to experiment with and measure prediction performance. In this experiment, we use 5-fold cross-validation on the Yeast dataset, where the mean and standard deviation of the metrics are used to measure the robustness of the methods. Table 1 lists the methods' prediction results.

Performance of the methods on the Human dataset
We further compare our method with other existing methods on the Human dataset. In this test, we also evaluate the methods through 5-fold cross-validation. The prediction results are listed in Table 2. As shown in Table 2, the highest performance on the Acc, Sen, and MCC metrics was achieved by BoostPPIP, with 99.33%, 99.74%, and 98.65%, respectively. Our method helped to increase the prediction accuracy by 0.63% to 3.73% compared to the other methods. The sensitivity was also increased by 0.17% to 5.64%, and the MCC by 1.25% to 7.45%. The precision achieved by BoostPPIP was 98.86%, ranking second after DeepPPI [5]. However, the MCC achieved by BoostPPIP was 2.35% higher than that of DeepPPI, which shows that our method has better results in predicting both the interacting and non-interacting classes. Moreover, the accuracy achieved by BoostPPIP was higher than DeepPPI's in independent testing.

Independent testing
Testing cross-species PPI prediction is very important: a classifier learned on the PPI dataset of one species (e.g., Saccharomyces cerevisiae) should be applicable to identifying PPIs in another (e.g., Homo sapiens). Meanwhile, the PPI network datasets provide reference information for identifying PPIs in networks that have not yet been fully identified [13,14]. In this test, we use all 11,188 samples of the Yeast dataset as the training set and the 8 independent datasets as test sets.
The prediction accuracy on the 5 cross-species PPI datasets and the 3 PPI network datasets was used to test the generality of the methods. Figure 7 illustrates one of the PPI networks, the Cancer-specific network. The prediction results of the methods are summarized in Table 3.
Experimenting on the different cross-species datasets, the BoostPPIP model achieved 100% accuracy on four datasets: Celeg, Hsapi, Hpylo, and Mmusc. On the Ecoli dataset, BoostPPIP achieved an accuracy of 99.88% (correctly predicting 6,946 samples out of a total of 6,954), 0.12% lower than the accuracy achieved by DeepFE-PPI [17]. However, the DeepFE-PPI model was not tested on the PPI network datasets. Compared with the other existing methods, our method and DeepFE-PPI obtained the highest accuracies. The experimental results on the PPI networks, listed in Table 3, indicate that our proposed model has high generalizability.

Conclusion
In this study, we proposed a novel method for predicting PPIs directly from protein sequence data. In our method, protein sequences are converted into embeddings by a model that learns the semantic relationships between amino acids. Our results have shown that the embedding features are effective in predicting protein interactions. In particular, this type of feature enhances the generality of our model in the task of determining cross-species PPIs. Using the softmax function to calculate the probability distribution is not really beneficial in the case of a large vocabulary. However, since our proposed method considers a single amino acid as a word, the size of the generated vocabulary is small, with only 25 elements (the 20 amino acids that have been identified in nature and 5 symbols for amino acids that have not been identified). Therefore, the softmax function is still efficient for our proposed method. In future work, we intend to combine our proposed model with other representation learning methods such as Doc2vec, GloVe, and BERT.

Fig. 1. An illustration of the translation of a protein sequence into a sentence in which an amino acid corresponds to a word.
The Amino Acid Encoding algorithm has two stages: the training-dataset generation stage (steps 1-3) and the neural network training stage (steps 4-11). Let maxlen be the maximum length of the protein sequences in the training corpus D; the computational complexity of the algorithm is O(maxlen × |D| + t × |V|).

Fig. 2. The architecture of the CBOW (Continuous Bag of Words) model used in the Amino Acid Encoding algorithm. In this figure, a window size of w = 5 is illustrated.

Fig. 3. The architecture of our proposed model, BoostPPIP (Boost PPI Prediction). d(x) = ReLU(W · x + b); ReLU(x) = max(0, x); BN(x) = (x − μ(x)) / √(σ(x)² + ε) with ε = 0.001, where μ(x) and σ(x) are the mean and the standard deviation of each column of x.

Fig. 4. Predictive performance of the BoostPPIP model on the Yeast core dataset over different combinations of network depth and embedding size. The red vertical line at each metric marks the best combination.

Figure 5 shows that the model begins to show signs of overfitting at the 30th epoch; the prediction error becomes difficult to decrease up to the 50th epoch, after which the model falls into a marked overfitting state. Therefore, we trained the BoostPPIP model for 50 epochs.

Fig. 5. The correlation between the prediction error and the number of training epochs of the BoostPPIP model. The experiment was performed on the Yeast dataset. The blue line indicates the training error/training accuracy (left/right side), and the orange line indicates the test error/test accuracy (left/right side).

Fig. 7. Illustration of the Cancer-specific network. A node represents a protein; an edge represents a predicted interaction.

Table 1. Performance comparison on the Yeast set using 5-fold cross-validation.

Table 2. Performance comparison on the Human set using 5-fold cross-validation. Note: The results are taken from the authors' reports. N/A means not reported. Bold font represents the highest value.

Table 3. Independent test results of the methods. Note: The results (accuracy, %) are taken from the authors' reports. N/A means not reported. Bold font represents the highest value.