NEXT MUTATION PREDICTION OF SARS-COV-2 SPIKE PROTEIN SEQUENCE USING ENCODER-DECODER BASED LONG SHORT TERM MEMORY (LSTM) METHOD
DOI:
https://doi.org/10.53808/KUS.2022.ICSTEM4IR.0142-se
Keywords:
SARS-CoV-2, Machine Learning, LSTM, S Protein, Neural Network, Covid-19.
Abstract
The world is facing a new pandemic caused by the coronavirus SARS-CoV-2. The virus's capacity for rapid mutation has worsened the situation in every country, and controlling it remains a challenge because there is still no permanent remedy. Doctors, engineers, and scientists are working together to fight the virus. The sequencing of its genome and the determination of its overall structure have paved the way for further research, and many researchers are working on mutation analysis. Because the spike (S) protein is one of the most important components of SARS-CoV-2 in human infection, vaccine and drug discovery efforts target it. Many machine learning, artificial intelligence, and deep learning methods have been applied to genome datasets to detect mutation positions and derive further insights. The goal of this work is to predict the most probable next-generation spike protein sequence of SARS-CoV-2. We propose an encoder-decoder based LSTM model trained on S protein sequence data ordered by date. The model predicted the next-generation S protein sequence effectively. We compared our model with other deep learning models, namely CNN-LSTM and attention-based LSTM. We also tested our model on both large and small datasets, and it performed effectively and efficiently in both cases.
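As a rough illustration of what "date-wise ordered protein sequence data" might look like as input to an encoder-decoder model, the sketch below sorts sequences by collection date, one-hot encodes each residue, and pairs every sequence with its successor as (encoder input, decoder target). The encoding scheme, the pairing strategy, the function names, and the toy records are all assumptions made for illustration; the abstract does not specify the authors' preprocessing.

```python
from datetime import date

# The 20 standard amino acids; real data may also contain ambiguity codes such as X.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein string as a list of 20-dimensional one-hot vectors."""
    vectors = []
    for residue in sequence:
        vec = [0] * len(AMINO_ACIDS)
        vec[INDEX[residue]] = 1  # raises KeyError on non-standard residues
        vectors.append(vec)
    return vectors


def make_training_pairs(records):
    """Sort (collection_date, sequence) records by date and pair each
    sequence with its successor as (encoder input, decoder target)."""
    ordered = [seq for _, seq in sorted(records, key=lambda r: r[0])]
    return [(one_hot(a), one_hot(b)) for a, b in zip(ordered, ordered[1:])]


# Hypothetical toy records (assumed for illustration, not real spike sequences).
records = [
    (date(2020, 3, 1), "MFVFL"),
    (date(2020, 1, 15), "MFVFA"),
    (date(2020, 2, 10), "MFVFA"),
]
pairs = make_training_pairs(records)
```

Each pair could then be fed to a sequence-to-sequence model, with the earlier sequence as encoder input and the later one as the decoder target.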
License
Copyright (c) 2022 Khulna University Studies

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.