Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14248)
Abstract
In recent years, machine learning methods have shown remarkable results in various protein analysis tasks, including protein classification, folding prediction, and protein-protein interaction prediction. However, most studies use only the 3D structures or only the sequences for the downstream classification task, so the combination of 3D structures and sequences remains comparatively unexplored. This study investigates how incorporating both protein sequence and 3D structure information influences protein classification performance. To this end, we propose an embedding method called PDB2Vec that encodes both the 3D structure and the protein sequence data to improve the predictive performance of the downstream classification task, and we evaluate it on two well-known datasets, STCRDab and PDBbind. We perform protein classification under three experimental settings: (1) 3D structural embeddings only (called PDB2Vec); (2) sequence embeddings from alignment-free methods in the biology domain, including k-mers, position weight matrices, minimizers, and spaced k-mers; and (3) the combination of structural and sequence-based embeddings. Our experiments show that the combination of structural and sequence information leads to the best performance, indicating that the two types of information are complementary and that both are essential for classification tasks.
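To make the sequence-side terminology concrete, here is a minimal sketch of two of the alignment-free representations named in the abstract (a k-mer spectrum and minimizers), followed by a plain concatenation with a structural vector. It is an illustration under stated assumptions, not the PDB2Vec implementation: the function names `kmer_spectrum`, `minimizers`, and `combine` are hypothetical, the minimizer variant is a simplified forward-only one, and concatenation is only one plausible way to merge the two embeddings, since the abstract does not say how they are combined.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_spectrum(seq, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of counts, one slot per possible k-mer."""
    # Map every possible k-mer over the alphabet to a vector position.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing non-standard residues
            vec[j] += 1
    return vec


def minimizers(seq, k=5, m=3):
    """For each k-mer, keep its lexicographically smallest m-mer.

    Assumption: forward direction only; published variants may also
    consider the reversed k-mer.
    """
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out.append(min(kmer[j:j + m] for j in range(k - m + 1)))
    return out


def combine(structure_vec, sequence_vec):
    """Concatenate a structural embedding with a sequence embedding (assumed merge)."""
    return list(structure_vec) + list(sequence_vec)


if __name__ == "__main__":
    spectrum = kmer_spectrum("ACDAC", k=2)
    print(len(spectrum), sum(spectrum))
    print(minimizers("ACDEF", k=4, m=2))
    # The structural vector below is a made-up placeholder, not real PDB2Vec output.
    print(len(combine([0.5, 1.0], spectrum)))
```

The spectrum has length 20^k regardless of sequence length, which is what allows sequences of different lengths to be fed to a standard classifier alongside a fixed-length structural vector.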
S. Ali and P. Chourasia: Equal Contribution.
Author information
Authors and Affiliations
Georgia State University, Atlanta, GA, USA
Sarwan Ali, Prakash Chourasia & Murray Patterson
Contributions
Sarwan Ali and Prakash Chourasia: Equal Contribution
Corresponding author
Correspondence to Sarwan Ali.
Editor information
Editors and Affiliations
University of North Texas, Denton, TX, USA
Xuan Guo
University of Southern California, Los Angeles, CA, USA
Serghei Mangul
Georgia State University, Atlanta, GA, USA
Murray Patterson
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ali, S., Chourasia, P., Patterson, M. (2023). PDB2Vec: Using 3D Structural Information for Improved Protein Analysis. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7073-5
Online ISBN: 978-981-99-7074-2
eBook Packages: Computer Science, Computer Science (R0)