The chapters are based on progressive collaborative research work on a broad range of topics and implementations, and will be of interest to both researchers and students from computer science and biological domains.
Table of Contents
Preface
Acknowledgement
Part 1 The Commencement of Machine Learning Solicitation to Bioinformatics
1 Introduction to Supervised Learning
Rajat Verma, Vishal Nagar and Satyasundara Mahapatra
1.1 Introduction
1.2 Learning Process & its Methodologies
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.2.3 Reinforcement Learning
1.3 Classification and its Types
1.4 Regression
1.4.1 Logistic Regression
1.4.2 Difference between Linear & Logistic Regression
1.5 Random Forest
1.6 K-Nearest Neighbor
1.7 Decision Trees
1.8 Support Vector Machines
1.9 Neural Networks
1.10 Comparison of Numerical Interpretation
1.11 Conclusion & Future Scope
References
2 Introduction to Unsupervised Learning in Bioinformatics
Nancy Anurag Parasa, Jaya Vinay Namgiri, Sachi Nandan Mohanty and Jatindra Kumar Dash
2.1 Introduction
2.2 Clustering in Unsupervised Learning 37
2.3 Clustering in Bioinformatics—Genetic Data 38
2.3.1 Microarray Analysis 38
2.3.2 Clustering Algorithms 40
2.3.3 Partition Algorithms 41
2.3.3.1 k-Means Clustering 41
2.3.3.2 Cluster Center Initialization Algorithm (CCIA) 41
2.3.3.3 Intelligent Kernel k-Means (IKKM) 41
2.3.3.4 Clustering Large Applications (CLARA) 42
2.3.4 Hierarchical Clustering Algorithms 42
2.3.4.1 AGNES (Agglomerative Nesting) 43
2.3.4.2 DIANA (Divisive Analysis) 43
2.3.4.3 CURE (Clustering Using Representatives) 43
2.3.4.4 CHAMELEON 43
2.3.4.5 BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) 44
2.3.5 Density-Based Approach 44
2.3.5.1 DBSCAN 44
2.3.6 Model-Based Approach 45
2.3.6.1 SOM (Self-Organizing Maps) 45
2.3.7 Grid-Based Clustering 45
2.3.7.1 STING (Statistical Information Grid-Based Algorithm) 46
2.3.8 Soft Clustering 46
2.3.8.1 FCM (Fuzzy Class Membership) 46
2.4 Conclusion 46
References 47
3 A Critical Review on the Application of Artificial Neural Network in Bioinformatics 51
Vrs Jhalia and Tripti Swarnkar
3.1 Introduction 52
3.1.1 Different Areas of Application of Bioinformatics 52
3.1.2 Bioinformatics in Real World 53
3.1.3 Issues with Bioinformatics 56
3.1.3.1 Issues Related to Structure 56
3.1.3.2 Sequence Analysis 56
3.2 Biological Datasets 57
3.3 Building Computational Model 58
3.3.1 Data Pre-Processing and its Necessity 58
3.3.2 Biological Data Classification 59
3.3.3 ML in Bioinformatics 60
3.3.4 Introduction to ANN 61
3.3.5 Application of ANN in Bioinformatics 63
3.3.6 Broadly Used Supervised Machine Learning Techniques 64
3.4 Literature Review 64
3.4.1 Comparative Analysis of ANN With Broadly Used Traditional ML Algorithms 67
3.5 Critical Analysis 72
3.6 Conclusion 73
References 73
Part 2 Machine Learning and Genomic Technology, Feature Selection and Dimensionality Reduction 77
4 Dimensionality Reduction Techniques: Principles, Benefits, Limitations 79
Hemanta Kumar Palo, Santanu Sahoo and Asit Kumar Subudhi
4.1 Introduction 80
4.2 The Benefits and Limitations of Dimension Reduction Methods 81
4.3 Components of Dimension Reduction 83
4.3.1 Feature Selection 84
4.3.2 Feature Reduction 86
4.4 Methods of Dimensionality Reduction 86
4.4.1 Principal Component Analysis (PCA) 88
4.4.2 Missing Values Ratio (MVR) 89
4.4.3 Linear Discriminant Analysis (LDA) 90
4.4.4 Backward Feature Elimination (BFE) 92
4.4.5 Forward Feature Construction (FFC) 93
4.4.6 Independent Component Analysis (ICA) 94
4.4.7 Low Variance Filter (LVF) 95
4.4.8 High Correlation Filter 97
4.4.9 Random Forests (RF)/Ensemble Trees 97
4.4.10 t-Distributed Stochastic Neighbor Embedding (t-SNE) 99
4.4.11 Autoencoder 100
4.4.12 Factor Analysis (FA) 100
4.4.13 Uniform Manifold Approximation and Projection (UMAP) 101
4.4.14 Information Gain (IG) 101
4.4.15 Vector Quantization (VQ) 102
4.5 Conclusion 104
References 105
5 Plant Disease Detection Using Machine Learning Tools With an Overview on Dimensionality Reduction 109
Saurav Roy, Ratula Ray, Satya Ranjan Dash and Mrunmay Kumar Giri
5.1 Introduction 110
5.2 Flowchart 112
5.3 Machine Learning (ML) in Rapid Stress Phenotyping 113
5.4 Dimensionality Reduction 114
5.4.1 Feature Extraction 114
5.4.1.1 PCA (Principal Component Analysis) 115
5.4.1.2 LDA (Linear Discriminant Analysis) 115
5.4.1.3 SIFT (Scale Invariant Feature Transform) 115
5.4.1.4 SURF (Speeded Up Robust Features) 116
5.4.1.5 ORB (Oriented FAST and Rotated BRIEF) 116
5.5 Literature Survey 116
5.6 Types of Plant Stress 128
5.6.1 Biotic Stress 128
5.6.1.1 Fungal Pathogen 129
5.6.1.2 Bacterial Pathogen 129
5.7 Implementation I: Numerical Dataset 130
5.7.1 Dataset Description 130
5.7.2 Results 131
5.7.3 Discussion 133
5.8 Implementation II: Image Dataset 134
5.8.1 Dataset Description 134
5.8.2 Method Used 134
5.8.3 Results 134
5.8.3.1 Results of ORB Feature Extraction and Brute Force Matching 134
5.8.3.2 Color Histogram Comparison: Using Correlation Method 138
5.8.4 Discussions 138
5.9 Conclusion 140
References 141
6 Gene Selection Using Integrative Analysis of Multi-Level Omics Data: A Systematic Review 145
S. Mahapatra and T. Swarnkar
6.1 Introduction 146
6.4 Machine Learning Approaches for Multi-Level Data Integration 153
6.4.1 Unsupervised Integration of Omics Data 159
6.4.2 Supervised Integration of Omics Data 163
6.5 Critical Observation 165
6.6 Conclusion 166
References 166
7 Random Forest Algorithm in Imbalance Genomics Classification 173
Sudhansu Shekhar Patra, Om Prakash Jena, Gaurav Kumar, Sreyashi Pramanik, Chinmaya Misra and Kamakhya Narain Singh
7.1 Introduction 173
7.2 Methodological Issues 175
7.2.1 Decision Tree (DT) Classifier 175
7.2.2 Ensemble Techniques 177
7.2.3 Mathematical Formulation of Ensemble Technique 177
7.2.4 Bagging 178
7.2.5 Bagging Pseudocode 179
7.2.6 Random Forest 180
7.3 Biological Terminologies 181
7.3.1 DNA 181
7.3.2 Genomics 181
7.3.3 Proteins 183
7.4 Proposed Model 183
7.4.1 Balancing the Data 184
7.4.2 Ensembling of Trees 185
7.5 Experimental Analysis 186
7.6 Current and Future Scope of ML in Genomics 188
7.6.1 Gene Sequencing 188
7.6.2 Services to Consumer 188
7.6.3 Gene Editing 188
7.6.4 Pharmacy Genomics 188
7.6.5 Newborn Genetic Screening 188
7.7 Conclusion 189
References 189
8 Feature Selection and Random Forest Classification for Breast Cancer Disease 191
Shubham Raj, Swati Singh, Avinash Kumar, Sobhangi Sarkar and Chittaranjan Pradhan
8.1 Introduction 192
8.2 Literature Survey 192
8.3 Machine Learning 196
8.4 Feature Engineering 202
8.5 Methodology 204
8.5.1 Dataset Collection 204
8.5.2 Proposed Work 204
8.5.2.1 Selection of Feature by the Means of Correlation and Accuracy Calculation Using Random Forest Classification 206
8.5.2.2 Feature Selection Using one Variety and Accuracy Calculation Using Random Forest Classification 207
8.5.2.3 Feature Elimination Using RFE and Classification Using Random Forest 209
8.6 Result Analysis 209
8.7 Conclusion 210
References 210
9 A Comprehensive Study on the Application of Grey Wolf Optimization for Microarray Data 211
Swati Sucharita, Barnali Sahu and Tripti Swarnkar
9.1 Introduction 212
9.2 Microarray Data 213
9.3 Grey Wolf Optimization (GWO) Algorithm 214
9.3.1 Principle of GWO 216
9.3.2 Mathematical Model of GWO 217
9.3.2.1 The Encircling 217
9.3.2.2 Hunting 218
9.3.2.3 Attacking Prey: (Exploitation) 219
9.3.2.4 Search for Prey: (Exploration) 219
9.3.3 Algorithm and Flow Chart of GWO 219
9.4 Studies on GWO Variants 220
9.4.1 Hybridization 221
9.4.2 Extensions 221
9.4.3 Modification 232
9.5 Application of GWO in Medical Domain 232
9.6 Application of GWO in Microarray Data 232
9.7 Conclusion and Future Work 232
References 243
10 The Cluster Analysis and Feature Selection: Perspective of Machine Learning and Image Processing 249
Aradhana Behura
10.1 Introduction 251
10.2 Various Image Segmentation Techniques 254
10.2.1 Clustering 254
10.2.2 Thresholding 254
10.2.3 Edge-Based Segmentation 254
10.2.4 Region-Based Image Segmentation 255
10.2.5 Watershed 255
10.3 How to Deal With Image Dataset 256
10.3.1 Introduction 256
10.3.2 Image Acquisition 256
10.3.3 Image Pre-Processing 256
10.3.4 Image Enhancement 257
10.3.5 Image Segmentation 258
10.3.6 K-Means Clustering 259
10.3.6.1 Euclidean Distance 259
10.3.6.2 Clustering 260
10.3.7 Density-Based Spatial Clustering of Applications With Noise (DBSCAN) 261
10.3.8 SVM Classifier 262
10.4 Class Imbalance Problem 264
10.4.1 Resampling Approaches 264
10.5 Optimization of Hyperparameter 267
10.6 Case Study 270
10.6.1 Pancreatic and Lung Tumor Prediction in the Machine Learning Era: Unique Supervised and Unsupervised Methodologies 270
10.6.2 Pancreatic Cysts (IPMN) 272
10.7 Using AI to Detect Coronavirus 273
10.7.1 BlueDot AI Technology 273
10.8 Using Artificial Intelligence (AI), CT Scan and X-Ray 274
10.9 Conclusion 276
References 276
Part 3 Machine Learning and Healthcare Applications 281
11 Artificial Intelligence and Machine Learning for Healthcare Solutions 283
Ashok Sharma, Parveen Singh and Gowhar Dar
11.1 Introduction 284
11.2 Using Machine Learning Approaches for Different Purposes 284
11.3 Various Resources of Medical Data Set for Research 286
11.4 Deep Learning in Healthcare 287
11.5 Various Projects in Medical Imaging and Diagnostics 288
11.6 Conclusion 289
References 290
12 Forecasting of Novel Corona Virus Disease (Covid-19) Using LSTM and XG Boosting Algorithms 293
V. Aakash, S. Sridevi, G. Ananthi and S. Rajaram
12.1 Introduction 294
12.2 Machine Learning Algorithms for Forecasting 296
12.3 Proposed Method 300
12.3.1 LSTM (Long Short-Term Memory) 301
12.3.2 XG Boost (eXtreme Gradient Boosting) Algorithm 303
12.3.3 Polynomial Regression 303
12.3.4 Performance Metrics 304
12.4 Implementation 304
12.4.1 The Main Python Code for LSTM 304
12.4.2 The Main Python Code for Polynomial Regression 305
12.4.3 The Main Python Code for XG Boosting Algorithm 306
12.4.4 Libraries or Methods Used in the Proposed Work 306
12.5 Results and Discussion 307
12.6 Conclusion and Future Work 310
References 310
13 An Innovative Machine Learning Approach to Diagnose Cancer at Early Stage 313
Poongodi, P., Udayakumar, E., Srihari, K. and Sachi Nandan Mohanty
13.1 Introduction 314
13.1.1 Multiscale Cancer Detection 315
13.2 Related Work 317
13.3 Materials and Methods 320
13.4 System Design 322
13.4.1 Artificial Neural Network 326
13.4.2 Back Propagation Network (BPN) 326
13.4.3 Support Vector Machine (SVM) 327
13.4.4 Pre-Processing 329
13.4.5 Feature Extraction 329
13.4.6 Database Updation 330
13.4.7 Classification 330
13.4.8 Clustering 330
13.4.9 Segmentation Using FCM Clustering 330
13.5 Results and Discussion 331
13.6 Conclusion 335
References 335
14 A Study of Human Sleep Staging Behavior Based on Polysomnography Using Machine Learning Techniques 339
Santosh Kumar Satapathy and D. Loganathan
14.1 Introduction 340
14.2 Polysomnography Signal Analysis 341
14.3 Case Study on Automated Sleep Stage Scoring 349
14.3.1 Experimental Data 349
14.3.2 The Methodology 351
14.3.3 Experimental Results and Discussion 351
14.4 Summary and Conclusion 356
References 357
15 Detection of Schizophrenia Using EEG Signals 359
Shalini Mahato, Laxmi Kumari Pathak and Kajal Kumari
15.1 Introduction 360
15.1.1 The Human Brain 360
15.1.2 Schizophrenia 361
15.1.2.1 DSM-V Definition and Diagnosis Criteria of Schizophrenia 361
15.1.2.2 Types of Schizophrenia 362
15.1.2.3 Causes of Schizophrenia 363
15.1.2.4 Symptoms of Schizophrenia 363
15.1.3 Electroencephalograph (EEG) 363
15.1.3.1 Characterizations of EEG Signals 364
15.2 Methodology 367
15.2.1 EEG Signal Processing 367
15.2.2 Removing the Artifacts 367
15.2.3 Feature Extraction 368
15.2.4 Normalization 369
15.2.5 Feature Selection/Reduction 369
15.2.6 Feature Classification 369
15.3 Literature Review 372
15.4 Discussion 372
15.5 Conclusion 388
References 388
16 Performance Analysis of Signal Processing Techniques in Bioinformatics for Medical Applications Using Machine Learning Concepts 391
G. Aparna, G. Anitha Mary and G. Sumana
16.1 Introduction 392
16.1.1 Role of Machine Learning in Bioinformatics 393
16.1.2 Machine Learning Applications for Bioinformatics 393
16.1.3 Recent Trends in Bioinformatics 394
16.1.4 Data Analytics in Bioinformatics 394
16.1.5 Machine Learning Algorithms 396
16.2 Basic Definition of Anatomy and Cell at Micro Level 397
16.2.1 Biological Cells 397
16.2.2 DNA, RNA and Proteins 398
16.3 Signal Processing—Genome Signal Processing 403
16.3.1 Identification of Hotspots 404
16.3.2 Advantages of Computational Hotspot Identification Techniques Over Alanine-Scanning Mutagenesis 405
16.3.3 Overview of Protein Sequences 405
16.3.4 EIIP & CPNR-Based Mapping 407
16.3.5 Feature Extraction Technique 411
16.3.5.1 RRM—Resonant Recognition Model 412
16.3.5.2 Discrete Wavelet Transforms 413
16.4 Hotspots Identification Algorithm 414
16.5 Results—Experimental Investigations 416
16.6 Analysis Using Machine Learning Metrics 418
16.6.1 Theoretical Details of Performance Metrics 418
16.6.2 Comparative Analysis of the Protein Sequence Representation 420
16.6.3 Visual Analysis 423
16.7 Conclusion 424
Appendix 424
A.1 Hotspot Identification Code 424
A.2 Performance Metrics Code 425
References 427
17 Survey of Various Statistical Numerical and Machine Learning Ontological Models on Infectious Disease Ontology 431
Yuvaraj Natarajan, Srihari Kannan and Sachi Nandan Mohanty
17.1 Introduction 432
17.2 Disease Ontology 432
17.3 Infectious Disease Ontology 433
17.4 Biomedical Ontologies on IDO 434
17.5 Various Methods on IDO 435
17.6 Machine Learning-Based Ontology for IDO 436
17.7 Recommendation or Suggestions for Future Study 437
17.8 Conclusions 438
References 438
18 An Efficient Model for Predicting Liver Disease Using Machine Learning 443
Ritesh Choudhary, T. Gopalakrishnan, D. Ruby, A. Gayathri, Vishnu Srinivasa Murthy and Rishabh Shekhar
18.1 Introduction 444
18.2 Related Works 445
18.3 Proposed Model 446
18.3.1 Elements of Experimental Methodology 446
18.3.1.1 Experimental Dataset 447
18.3.1.2 Overview of Data & Analysis 447
18.3.1.3 Data Preprocessing 448
18.3.1.4 Standardization 449
18.3.1.5 Label Encoding 450
18.3.2 Model Building 450
18.3.2.1 Support Vector Machines (SVM) 451
18.3.2.2 Logistic Regression 451
18.3.2.3 Naïve Bayes 452
18.3.2.4 Random Forests 452
18.3.2.5 Gradient Boosting 452
18.3.3 Performance Evaluation 453
18.3.4 Performance Optimization 454
18.3.4.1 N-Fold Cross Validation 454
18.4 Results and Analysis 454
18.5 Conclusion 456
References 456
Part 4 Bioinformatics and Market Analysis 459
19 A Novel Approach for Prediction of Stock Market Behavior Using Bioinformatics Techniques 461
Prakash Kumar Sarangi, Birendra Kumar Nayak and Sachidananda Dehuri
19.1 Introduction 462
19.2 Literature Review 463
19.3 Proposed Work 466
19.3.1 Encoding of Stock Market Price Behavior to Binary String 466
19.3.2 Mapping Binary String Into DNA Sequence 467
19.3.3 Sequence Alignment Using BLAST 467
19.3.4 Prediction Method 468
19.3.5 Decoding Predicted DNA Sequence Into Binary String 469
19.3.6 Mismatching Analysis of Stock Market Behavior 470
19.4 Experimental Study 470
19.4.1 Data Analysis 470
19.4.2 Results 471
19.5 Conclusion and Future Work 482
References 484
20 Stock Market Price Behavior Prediction Using Markov Models: A Bioinformatics Approach 485
Prakash Kumar Sarangi, Birendra Kumar Nayak and Sachidananda Dehuri
20.1 Introduction 486
20.2 Literature Survey 487
20.3 Proposed Work 488
20.3.1 Encoding of Stock Market Price Behavior to Binary Sequence 489
20.3.2 Conversion Between Binary Sequences to Nucleotide Sequence 489
20.3.3 Zero-Order Markov Model 490
20.3.4 First-Order Markov Model 490
20.3.5 Second-Order Markov Model 491
20.3.6 Hidden Markov Model 492
20.3.6.1 HMM for Stock Market Behavior Prediction 494
20.3.7 Decoding Predicted DNA Sequence into a Binary String 496
20.3.8 Mismatching Analysis of Stock Market Behavior 497
20.4 Experimental Work 497
20.4.1 Dataset Preparation 497
20.4.2 Results and Analysis 498
20.4.3 Performance Comparison Between Different Orders Markov Models 502
20.5 Conclusions and Future Work 504
References 505
Index