Applications of Deep Learning and Reinforcement Learning to Biological Data

Rapid advances of hardware-based technologies during the past decades have opened up new possibilities for life scientists to gather multimodal data in various application domains (e.g., Omics, Bioimaging, Medical Imaging, and [Brain/Body]-Machine Interfaces), thus generating novel opportunities for the development of dedicated data-intensive machine learning techniques. Overall, recent research in Deep learning (DL), Reinforcement learning (RL), and their combination (Deep RL) promises to revolutionize Artificial Intelligence. The growth in computational power, accompanied by faster and larger data storage and declining computing costs, has already allowed scientists in various fields to apply these techniques to datasets that were previously intractable owing to their size and complexity. This review article provides a comprehensive survey on the application of DL, RL, and Deep RL techniques in mining Biological data. In addition, we compare the performances of DL techniques when applied to different datasets across various application domains. Finally, we outline open issues in this challenging research area and discuss future development perspectives.


Introduction
The need for novel healthcare solutions and continuous efforts in understanding the biological bases of pathologies have pushed extensive research in the Biological Sciences over the last two centuries [1]. Recent technological advancements in Life Sciences have opened up possibilities not only to study Biological systems from a holistic perspective but also to gain unprecedented access to the molecular details of living organisms [2,3]. Novel tools for DNA sequencing [4], gene expression [5], bioimaging [6], neuroimaging [7], and brain-machine interfaces [8] are now available to the scientific community. However, considering the inherent complexity of biological systems together with the high dimensionality, diversity, and noise contamination of the data, inferring meaningful conclusions from these data is a huge challenge [9]. Therefore, novel instruments that are robust, reliable, reusable, and accurate are required to process and analyze biological big data [10]. This has encouraged numerous scientists from the life and computing sciences to embark on a multidisciplinary approach to demystify the functions and dynamics of living organisms, with remarkable progress in biological and biomedical research [11]. Thus, many techniques of Artificial Intelligence (AI), in particular machine learning (ML), have been proposed over time to facilitate recognition, classification, and prediction of patterns in biological data [12].
The conventional ML techniques can be broadly categorized in two large sets: supervised and unsupervised. The methods pertaining to the supervised learning paradigm classify objects in a pool using a set of known annotations/ attributes/ features. Instead, the unsupervised learning techniques form groups/ clusters among the objects in a pool by identifying their similarity and then use them for classifying the unknowns. A further category, reinforcement learning (RL), allows a system to learn from the experiences it gains through interacting with its environment (see section 1.2 for details).
Popular supervised methods include: Artificial Neural Network (ANN) [13] and its variants, Support Vector Machines [14] and linear classifiers [15], Bayesian Statistics [16], k-Nearest Neighbors [17], Hidden Markov Model [18], and Decision Trees [19]. Also, popular unsupervised methods include: Autoencoders [20], Expectation Maximization [21], Self-Organizing Maps [22], k-Means [23], and Fuzzy [24] and Density-based [25] clustering.

Figure 1. A possible representation of the DL, RL, and deep RL frameworks for biological applications. A-F: The popular DL architectures. G: Schematic diagram of the learning framework as a part of Artificial Intelligence (AI). Broadly, AI can be thought to have evolved in parallel in two main directions: Expert Systems (ES) and ML. ES takes expert decisions from given factual data using rule based inferences. ML extracts features from data mainly through statistical modeling and provides predictive output when applied to unknown data. DL, being a sub-division of ML, extracts more abstract features from a larger set of training data, mostly in a hierarchical fashion resembling the working principle of our brain. The other sub-division, RL, provides a software agent which gathers experience based on interactions with the environment through some actions and aims to maximize the cumulative performance. H: Possible applications of AI to biological data.

A large body of evidence shows that the above-mentioned methods and their respective variants can be successfully applied to Biological data coming from various sources, e.g., Omics (covers data from genetics and [gen/ transcript/ epigen/ prote/ metabol]omics [26]), Bioimaging (covers data from [sub-]cellular images acquired by diverse imaging techniques [27]), Medical Imaging (covers data from [medical/ clinical/ health] imaging mainly through diagnostic imaging techniques [28]), and [Brain/Body]-Machine Interfaces or BMI (covers electrical signals generated by the Brain and the Muscles and acquired using appropriate sensors [29,30]).
Broadly, AI can be thought to have evolved in parallel in two main directions: Expert Systems and ML (see the schematic diagram of Fig. 1 G). Focusing on the latter, ML extracts features from training dataset(s) and builds models with minimal or no human intervention. These models provide predicted outputs based on test data.

Conceptual Overview

Deep Learning
The core concept of DL is to learn data representations through increasing abstraction levels. At almost every level, more abstract representations at a higher level are learned by defining them in terms of less abstract representations at lower levels. This type of hierarchical learning process is very powerful as it allows a system to comprehend and learn complex representations directly from the raw data [40], making it useful in many disciplines [41]. Several DL architectures have been proposed; for the sake of brevity, only the ones widely used with Biological data are briefly summarized below. Interested readers are referred to the references mentioned in each subsection for the concrete mathematical details behind each architecture.

Deep Neural Network
DNN (Fig. 1 A) [42] is inspired by the brain's visual input processing mechanism which takes place at multiple levels (i.e., starting with cortical area 'V1' and then passing to area 'V2', and so on) [32]. The standard neural network (NN) is extended to have multiple hidden layers, with nonlinear modules embodied in each hidden layer, allowing it to learn part-whole relationships of the representations. Though this formulation has been successfully used in many applications, the training process is slow and cumbersome.

Recurrent Neural Network
RNN (Fig. 1 B) [43] is a NN model designed to detect structures in streams of data [44]. Unlike a feedforward NN, which performs computations unidirectionally from input to output, an RNN computes the current state's output depending on the outputs of the previous states. Due to this 'memory'-like property, and despite learning problems related to vanishing and exploding gradients, RNN has gained popularity in many fields involving streaming data (e.g., text mining, time series, genomes, etc.). In recent years, two main variants, bidirectional RNN (BRNN) [45] and long short-term memory (LSTM) [46], have also been applied [47,48].

Convolutional Neural Network
CNN (Fig. 1 C) [49] is a multilayer NN model [50], inspired by the neurobiology of the visual cortex, that consists of convolutional layer(s) followed by fully connected layer(s). In between these two types of layers there may exist subsampling steps. CNNs outperform DNNs, which have difficulty in scaling well with multidimensional, locally correlated input data. Therefore, the main application of CNN has been in datasets where the number of nodes and parameters required to be trained is relatively large (e.g., image analysis). Exploiting the 'stationary' property of an image, convolution filters (CF) can learn data-driven kernels. Applying such CF along with a suitable pooling function reduces the number of features that are supplied to the fully connected network for classification.
However, for large datasets even this can be daunting, and the problem can be addressed using sparsely connected networks. Some of the popular CNN configurations include: AlexNet [51], VGGNet [52], and GoogLeNet [53].
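To make the convolution-plus-pooling step concrete, the following is a minimal sketch in plain Python, not an implementation taken from any of the cited architectures. The function names (`conv2d_valid`, `max_pool2d`), the toy image, and the hand-set kernel are assumptions introduced for illustration; a trained CNN would learn its kernels from data. As in most DL libraries, the "convolution" is computed as a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5x5 "image" with a vertical edge between columns 1 and 2 (illustrative data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-set 2x2 vertical-edge kernel (an assumption; normally learned during training).
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # 4x4 map, responds only at the edge
pooled = max_pool2d(feature_map, size=2)    # 2x2 map handed on to the dense layers
```

Here the 25 input pixels are reduced to 4 pooled features, which illustrates how filtering and pooling shrink the input to the fully connected layers.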

Deep Autoencoder
DA architecture (Fig. 1 D) [54] is obtained by stacking a number of Autoencoders, which are data-driven NN models (i.e., unsupervised) designed to reduce data dimension by automatically projecting incoming representations to a lower-dimensional space than that of the input. In an Autoencoder, equal numbers of units are used in the input/output layers and fewer units in the hidden layers. (Non)linear transformations are embodied in the hidden layer units to encode the given input into smaller dimensions [55]. Although it requires a pre-training stage and suffers from vanishing error, this architecture is popular for its data compression capability and has many variants, e.g., Denoising Autoencoder [54], Sparse Autoencoder [56], Variational Autoencoder [57], and Contractive Autoencoder [58].

[Restricted] Boltzmann Machine
[R]BM is an undirected probabilistic generative model representing specific probability distributions [59]. It is also considered a nonlinear feature detector. The learning process of [R]BM is based on optimizing its parameters for a set of given observations to obtain the best possible fit of the probability distribution through Gibbs sampling (a Markov Chain Monte Carlo method [60]) [61]. BM has symmetrical connections among its units and has one visible layer with (multiple) hidden layers. Usually, the learning process of a BM is slow and computationally expensive, and thus requires a long time to reach equilibrium statistics [40]. By prohibiting connections among the units within the same layer of a BM, a bipartite graph is formed (i.e., an RBM has a visible and a hidden layer) in which the learning inefficiency is solved [59]. Stacking multiple RBMs as learning elements yields the following two DL architectures.

Deep Boltzmann Machine
DBM (Fig. 1 E) [62] is a stack of undirected RBMs. Being undirected, there is a feedback process among the layers where feature inference from higher level units affects the inference of lower level units. Despite this powerful inference mechanism, which allows alternative interpretations of an input through concurrent competition at all levels of the model, estimating model parameters from data remains difficult. Gradient based methods (e.g., persistent contrastive divergence [63]) fail to explore the model parameters sufficiently [62]. Though this learning problem is overcome by pretraining each RBM in a layerwise greedy fashion, with outputs of the hidden variables from lower layers as input to upper layers [59], the time complexity remains high and the approach may not be suitable for large training datasets [64].

Deep Belief Network
DBN (Fig. 1 F) [65] is formed by ordering several RBMs in a way that one RBM's latent layer is linked to the subsequent RBM's visible layer. The connections of DBN are directed downward to the immediate lower layer, except that the upper two layers are undirected [65]. Thus, DBN is a hybrid model with the top two layers forming an undirected graphical model and the rest forming a directed generative model. The different layers are learned in a layerwise greedy fashion and fine-tuned based on the required output [33]; however, the training procedure is computationally demanding.
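The RBM building block and the greedy layerwise stacking described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the training procedure of the cited works: it uses one-step contrastive divergence (CD-1) rather than persistent contrastive divergence, binary units, and toy data. The names `RBM`, `cd1_update`, and `train_stack` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Binary RBM: one visible and one hidden layer, no intra-layer connections."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        """p(h_j = 1 | v) for every hidden unit j."""
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        """p(v_i = 1 | h) for every visible unit i."""
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step: a single Gibbs sweep v0 -> h0 -> v1 -> h1, then a parameter update."""
        p_h0 = self.hidden_probs(v0)
        h0 = self.sample(p_h0)
        v1 = self.sample(self.visible_probs(h0))
        p_h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(p_h0)):
                self.W[i][j] += lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
        for i in range(len(v0)):
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(p_h0)):
            self.b_hid[j] += lr * (p_h0[j] - p_h1[j])


def train_stack(data, layer_sizes, epochs=20, seed=0):
    """Greedy layerwise pretraining: each RBM is trained on the hidden
    activation probabilities produced by the RBM below it."""
    rng = random.Random(seed)
    stack, inputs = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v)
        stack.append(rbm)
        inputs = [rbm.hidden_probs(v) for v in inputs]  # features fed to the next layer
    return stack


# Toy binary patterns (illustrative only).
data = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
stack = train_stack(data, layer_sizes=[3, 2])
top_features = stack[1].hidden_probs(stack[0].hidden_probs(data[0]))
```

The sketch covers only the unsupervised pretraining stage; the supervised fine-tuning of a DBN and the joint training of a DBM are not shown.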

Reinforcement Learning
Rooted in behavioral psychology, RL is a distinctive member of the ML family. An RL problem is solved by learning from new experiences through trial-and-error. An RL agent is trained such that its actions on the environment maximize the cumulative reward resulting from the interactions. Generally, RL problems are modeled and solved using Markov Decision Processes (MDP) theory through Monte Carlo (MC) and dynamic programming (DP) [66].
The learning of an agent is a continuous process where the interactions with the environment occur at discrete time steps. In a typical RL cycle (at time $t$), the agent receives the environment's state ($s_t$) and selects an action ($a_t$) to interact. The environment responds to the action and progresses to a new state ($s_{t+1}$). The reward ($r_{t+1}$) associated with the transition ($s_t$, $a_t$, $s_{t+1}$), which the agent may or may not receive for the selected action, is also determined [66]. Accordingly, after each cycle, the agent updates the value function $V(s)$ or the action-value function $Q(s, a)$ based on a certain policy, where the policy ($\pi$) is a function that maps states $s \in S$ to actions $a \in A$, i.e., $\pi : S \to A \Rightarrow a = \pi(s)$ [36].
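The interaction cycle can be illustrated with a small sketch. The corridor environment, the fixed "always move right" policy, and the function names (`step`, `policy`, `run_episode`) are assumptions made purely for illustration; the loop simply records each transition and accumulates the discounted reward along the way.

```python
def step(state, action):
    """Hypothetical 5-cell corridor: reaching cell 4 ends the episode with reward 1."""
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done


def policy(state):
    """Deterministic policy pi: S -> A; here it always moves right (+1)."""
    return 1


def run_episode(gamma=0.9, max_steps=100):
    state, transitions, discounted_return = 0, [], 0.0
    for t in range(max_steps):
        action = policy(state)                           # a_t = pi(s_t)
        next_state, reward, done = step(state, action)   # environment yields s_{t+1}, r_{t+1}
        transitions.append((state, action, reward, next_state))
        discounted_return += (gamma ** t) * reward       # accumulate gamma^t * r_{t+1}
        state = next_state
        if done:
            break
    return transitions, discounted_return


transitions, G = run_episode()
```

With this policy the agent reaches the goal on the fourth step, so the only non-zero reward is discounted by gamma cubed.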
A possible way to solve the RL problem is to describe the environment as an MDP with a set of state-value function pairs, a set of actions, a policy, and a reward function. The value function can be separated into the state-value function ($V$) and the action-value function ($Q$). In the state-value function, the expected outcome of being in state $s$ and following policy $\pi$ is determined by the sum of the rewards at future time steps with a given discount factor ($\gamma \in [0, 1]$), i.e., $V^{\pi}(s) = \mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s\right)$. In the action-value function, the expected outcome of being in state $s$, taking action $a$, and following policy $\pi$ is determined by the sum of the rewards for each state-action pair, i.e., $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left(\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a\right)$. The MDP can be solved and the optimum policy can be achieved through DP by either starting with an initial policy and improving it iteratively (policy iteration), or starting with an arbitrary value function and recursively refining an estimate of an improved state-value or action-value function to compute an optimal policy and its value (value iteration) [67]. In the simplest case, the state-value function for a given policy can be estimated using the Bellman expectation equation as: $V^{\pi}(s) = \mathbb{E}_{\pi}\left(r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\right)$. Considering this as a policy evaluation process, an improved and eventually optimal policy ($\pi^{*}$) can be achieved by greedily taking the actions that maximize the state-action value. In scenarios with unknown environments, however, model-free methods are to be used without the MDP. In such cases, instead of the state-value function, the action-value function can be maximized to find the optimal policy ($\pi^{*}$) using a similar policy evaluation and improvement process, i.e., $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left(r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a\right)$.
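As an illustration of policy iteration with a known model, the sketch below alternates Bellman-expectation policy evaluation with greedy policy improvement on a tiny MDP. The three-state transition table `P`, the action names, and the helper names are all hypothetical and chosen only to keep the example readable.

```python
# Hypothetical 3-state MDP: P[s][a] is a list of (probability, next_state, reward).
# State 2 is absorbing and yields no further reward.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},
}
GAMMA = 0.9


def q_value(V, s, a):
    """One-step look-ahead: E[r_{t+1} + gamma * V(s_{t+1}) | s_t = s, a_t = a]."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(policy, tol=1e-10):
    """Apply the Bellman expectation equation repeatedly until V stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def improve_policy(V):
    """Act greedily with respect to the current value estimate."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration():
    policy = {s: "stay" for s in P}          # arbitrary initial policy
    while True:
        V = evaluate_policy(policy)          # policy evaluation
        new_policy = improve_policy(V)       # policy improvement
        if new_policy == policy:
            return policy, V
        policy = new_policy


optimal_policy, optimal_V = policy_iteration()
```

In this toy problem the procedure settles on "go" in states 0 and 1, with the value of state 0 equal to the discounted value of state 1.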
There are several learning techniques, e.g., Monte Carlo, Temporal Difference (TD), and State-Action-Reward-State-Action (SARSA), which describe various aspects of the model-free policy evaluation and improvement process [68].
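The difference between two common model-free TD control rules can be sketched in a few lines. In the example below, SARSA bootstraps from the action actually selected next (on-policy), whereas Q-learning, included here only as a widely used off-policy counterpart, bootstraps from the greedy action. The corridor environment, hyperparameters, and function names are assumptions made for illustration.

```python
import random

ACTIONS = (-1, +1)          # move left / move right
N_STATES, GOAL = 5, 4       # hypothetical corridor; reaching cell 4 gives reward 1


def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0), next_state == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])  # random tie-breaking


def td_control(method, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1,
               max_steps=1000, seed=0):
    """Tabular TD control; `method` is "sarsa" or "q_learning"."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)
        for _ in range(max_steps):
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            if done:
                bootstrap = 0.0                                   # no value beyond a terminal state
            elif method == "sarsa":
                bootstrap = Q[(next_state, next_action)]          # on-policy target
            else:
                bootstrap = max(Q[(next_state, a)] for a in ACTIONS)  # greedy (off-policy) target
            td_error = reward + gamma * bootstrap - Q[(state, action)]
            Q[(state, action)] += alpha * td_error
            if done:
                break
            state, action = next_state, next_action
    return Q


Q_qlearning = td_control("q_learning")
Q_sarsa = td_control("sarsa")
greedy_policy = {s: max(ACTIONS, key=lambda a: Q_qlearning[(s, a)]) for s in range(GOAL)}
```

Both variants share the same update form; only the bootstrap term of the TD target differs.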
However, in real-world RL problems, the state-action space is very large and storing a separate value for every possible state is cumbersome. In such situations, generalization of the value function through function approximation is required. For example, Q value function approximation is able to generalize to unknown states by calculating a function ($\hat{Q}$) for a given state-action pair ($s$, $a$), i.e., $\hat{Q}(s, a, w) = x(s, a)^{\top} w \approx Q^{\pi}(s, a)$. In other words, a rough approximation of the Q function is obtained from the feature vector ($x$) representing the ($s$, $a$) pair and the provided parameter vector ($w$, which is updated using MC or TD learning) [69]. This approximation allows the Q function to be improved by minimizing the loss between the true and approximated values (e.g., using gradient descent), i.e., $J(w) = \mathbb{E}_{\pi}\left(\left(Q^{\pi}(s, a) - \hat{Q}(s, a, w)\right)^{2}\right)$. Examples of function approximators include: neural network, linear combinations of features, decision tree, nearest neighbor, Fourier bases, etc. [70].
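A minimal sketch of linear value-function approximation trained by stochastic gradient descent is given below. The feature vectors, the weights used to generate toy targets, and the function names are illustrative assumptions; in practice the target standing in for the true action value would be an MC return or a TD target.

```python
def q_hat(x, w):
    """Linear approximation of the action value: the dot product of features and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))


def sgd_step(w, x, target, alpha):
    """Descend the gradient of 0.5 * (target - q_hat)^2, i.e. w <- w + alpha * error * x."""
    error = target - q_hat(x, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


# Hypothetical feature vectors x(s, a), one per sampled state-action pair.
samples = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
true_w = [0.5, -1.0, 2.0]                       # used only to generate consistent toy targets
targets = [q_hat(x, true_w) for x in samples]   # stand-ins for the true action values

w = [0.0, 0.0, 0.0]
for _ in range(500):                            # repeated sweeps over the samples
    for x, target in zip(samples, targets):
        w = sgd_step(w, x, target, alpha=0.1)

mse = sum((t - q_hat(x, w)) ** 2 for x, t in zip(samples, targets)) / len(samples)
```

Because the toy targets are exactly linear in the features, the weights converge to the generating weights; with real returns the fit would only be approximate.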

Deep Reinforcement Learning
The autonomous capability to learn without any feature crafting makes RL a powerful tool applicable to many disciplines, but it falls short in cases when the data dimensionality is large and the environment is non-stationary [71]. Also, DL's capability to learn complex patterns is sometimes prone to misclassification [72]. To mitigate these limitations, in recent years, RL algorithms have been successfully combined with deep NN [39], giving rise to novel learning strategies. This integration has been used either in approximating RL functions using deep NN architectures or in training deep NN using RL.
The first notable example of such an integration is the Deep Q-network (DQN) [31] which combines Q-learning with deep NN. The DQN agent, when presented with high-dimensional inputs, can successfully learn policies using RL. The action-value function is approximated for optimality using deep CNN. The deep CNN, using experience replay and target network, overcomes the instability and divergence sometimes experienced while approximating Q-function with shallow NN.
Another deep RL algorithm is the Double DQN, which is an extension of the DQN algorithm [73]. In certain situations the DQN suffers from substantial overestimations inherited from the implemented Q-learning; these are overcome by replacing the Q-learning of the DQN with a double Q-learning algorithm [74]. Double Q-learning learns two value functions, by assigning each experience randomly to update one of them, resulting in two sets of weights. During every update one set determines the greedy policy while the other determines its value. Other deep RL algorithms include: Deep Deterministic Policy Gradient, Continuous DQN, Asynchronous N-step Q-learning, Dueling network DQN, Prioritized Experience Replay, Deep SARSA, Asynchronous Advantage Actor-Critic, and Actor-Critic with Experience Replay [39].
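The decoupling of action selection from action evaluation can be shown with a tabular sketch of a single double Q-learning update. This is a simplified illustration with lookup tables instead of deep networks; the function name `double_q_update`, the table contents, and the transition are assumptions chosen so that the effect is easy to follow by hand.

```python
ACTIONS = (0, 1)


def double_q_update(QA, QB, s, a, r, s_next, done, alpha, gamma, update_a):
    """One tabular double Q-learning update.

    The table being updated selects the greedy next action, while the other
    table evaluates it. During training, `update_a` would be drawn at random
    for every experience.
    """
    select, evaluate = (QA, QB) if update_a else (QB, QA)
    if done:
        target = r
    else:
        best = max(ACTIONS, key=lambda b: select[(s_next, b)])
        target = r + gamma * evaluate[(s_next, best)]
    select[(s, a)] += alpha * (target - select[(s, a)])


QA = {(s, a): 0.0 for s in range(2) for a in ACTIONS}
QB = dict(QA)
QA[(1, 0)], QA[(1, 1)] = 1.0, 3.0     # table A regards action 1 as best in state 1 ...
QB[(1, 0)], QB[(1, 1)] = 2.0, 0.5     # ... but table B values that action at only 0.5

# Transition (s=0, a=0, r=1, s'=1), here assigned to table A.
double_q_update(QA, QB, s=0, a=0, r=1.0, s_next=1, done=False,
                alpha=0.5, gamma=0.9, update_a=True)

double_q_target = 1.0 + 0.9 * QB[(1, 1)]                       # 1.45, evaluated by table B
single_q_target = 1.0 + 0.9 * max(QA[(1, b)] for b in ACTIONS)  # 3.7, plain Q-learning target
```

The double Q-learning target is much smaller than the single-table target in this example, which illustrates how evaluating the selected action with an independent estimate damps overestimation.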

Applications to Biological Data
The techniques outlined above, also available as open-source tools (e.g., see [75] for a mini review on tools based on DL), have been used in mining Biological data. The applications, as reported in the literature, are provided below for data coming from each of the application domains. Table 1 summarizes the state-of-the-art applications of DL and RL to biological data (see Fig. 1 H). It also reports on individual applications in each of these domains and the data type on which the methods have been applied.

Omics
Some DL and RL methods have been extensively used in Omics (such as genomics, proteomics or metabolomics) research to extract features, functions, structure, and molecular dynamics from the raw biological sequence data (e.g., DNA, RNA, and amino-acids). Specifically, mining sequence data is a challenging task. Different analyses (e.g., gene expression profiling, splicing junction prediction, sequence specificity prediction, transcription factor determination, protein-protein interaction evaluation, etc.) dealing with different types of sequence data have been reported in the literature.
To identify splicing junctions at the DNA level, a tedious job to do manually, Lee et al. proposed a DBN based unsupervised method to perform the auto-prediction [79]. Profiling gene expression (GE) is a demanding job. Chen et al. exploited a DNN based method for GE profiling on RNA-seq and microarray-based GE Omnibus dataset [83]. The ChIP-seq data were preprocessed, using CNN, into a 2D matrix where each row denoted a gene's transcription factor activity profile [92]. Also, somatic point mutation based cancer classification was performed using DNN [90]. In addition, DA based methods have been used for feature extraction in cancer diagnosis and classification (Fakoor et al. [...]). Identifying the best discriminative genes/microRNAs (miRNAs) is a challenging task. Ibrahim et al. proposed a group feature selection method from genes/miRNAs based on expression profile using DBN and active learning [80]. CNN was used to interpret the noncoding genome by annotating it [94]. Also, Zeng et al. employed CNN to predict the binding between DNA and protein [95]. Zhou et al. proposed a CNN based approach to identify noncoding genetic variants (GV) [96], which was also used by Huang [...] miRNA precursor [99]. Also, Lee et al. presented a deep RNN framework for automatic miRNA target prediction [100]. DNA methylation (DM) causes DNA segment activity alteration without affecting the sequence; thus, detecting its state in a sequence is important. Angermueller et al. used a DNN based method to estimate DM state by predicting the changes in single nucleotides and uncovering sequence motifs [85].
Proteomics poses many complex computational problems. Estimating complete protein structures from biological sequences, in 3D space, is a complex and NP hard problem. Alternatively, the protein structure problem can be divided into independent sub-problems (e.g., torsion angle, accessible surface area, dihedral angles, etc.) and solved in parallel to estimate the secondary protein structures (2-PS). Predicting compound-protein interaction (CPI) is very interesting from the drug discovery point of view and tough to solve. Heffernan [...]

Bioimaging
In biology, DL architectures have targeted the pixel level of biological images to train the NN. Ning et al. used CNN for pixel-wise image segmentation of nucleus, cytoplasm, cell, and nuclear membranes using Electron Microscope Image (EMI) [108]. Reduced pixel noise and better abstract features of biological images can be obtained by adding multiple layers. Ciresan et al. employed deep convolutional neural networks to identify mitosis in histology images of the breast [109], and a similar architecture was also used to find neuronal membranes and automatically segment neuronal structures in EMI [110]. Xu et al. used a Stacked Sparse DA architecture to identify nuclei in histopathology images of breast cancer [105]. Xu et al. classified colon cancer images using Multiple Instance Learning (MIL) from DNN learnt features [106].
Besides pixel level analysis, DL has also been applied to cell and tissue level analysis. Chen et al. employed DNN in label-free cell classification [107]. Pärnamaa and Leopold used CNN to automatically detect fluorescent protein in various subcellular localization patterns using microscopy images of yeast [111]. Ferrari et al. used CNNs to count bacterial colonies in agar plates [112]. Kraus et al. integrated both segmentation and classification in a model which can be utilized to classify microscopy images of yeast [113]. Flow cytometry is used in cellular biology through cycle analysis to monitor different stages of a cell-cycle. Eulenberg et al. proposed a deep flow model, combining non-linear dimension reduction with CNN, to analyze single cell flow cytometry images [114]. Furthermore, a CNN architecture was employed to segment and recognize neural stem cells in images taken by bright field microscope [115], and DBN was employed for analyzing Gold immunochromatographic strip [197].

Medical Imaging
Segmentation is a process of partitioning an image based on some specific patterns. Sirinukunwattana et al. reported the results of the Gland Segmentation competition from colon histology images [156]. Kamnitsas et al. proposed a 3D dual pathway CNN to simultaneously process multi-channel MRI and segment lesions related to tumors, traumatic injuries, and ischemic stroke [130]. Stollenga et al. segmented neuronal structures from 3D EMI and brain MRI using multi-dimensional RNN [131]. Fritscher et al. used deep CNN for volume segmentation from CT scans of the head-neck region [134]. Havaei et al. segmented brain tumors from MRI using CNN [125] and DNN [123]. Brosch and Tam proposed a DBN based manifold learning method of 3D brain MRI [119]. Cardiac MRIs were segmented for the heart's left ventricle using DBN [145], and for blood pool and myocardium using CNN [157]. Mansoor et al. automatically segmented the anterior visual pathway from MRI sequences using a stacked DA model [116]. Lerouge et al. proposed a DNN based method to label CT scans [133].
The success of many medical image analysis methods depends on image denoising. Gondara proposed a denoising technique utilizing convolutional denoising DA, and validated it with mammograms and dental radiography [144]. Agostinelli et al. presented an adaptive multi-column stacked sparse denoising autoencoder (DA) method for image denoising, which was validated using CT scan images of the head [132].
Detecting anomalies in medical images is widely used for disease diagnosis. Several models were applied to detect Alzheimer's Disease (AD) and Mild Cognitive Impairment (MCI) from MRI and PET scans, including DA [117,118], DBM [120], RBM [121], and multimodal stacked deep polynomial network (MM-SDPN) [124].
In addition, DBN was successfully applied to identify Attention Deficit Hyperactivity Disorder [142], and Schizophrenia (SZ) and Huntington Disease from (f/s)MRI [122]. A DNN based method was also proposed to successfully identify the fetal abdominal standard plane in ultrasound (UlS) images [146].
RL was used in segmenting transrectal UlS images to estimate location and volume of the prostate [158].

[Brain/Body]-Machine Interfaces
Various DL architectures have been used in classifying EEG signals to decode Motor Imagery (MoI). CNN was applied in the classification pipeline using: augmented common spatial pattern (CSP) features which covered various frequency ranges [172]; features based on combined selective location, time, and frequency attributes, which were then classified using DA [173]; and the signal's dynamic energy representation [174]. DBN was also employed: in combination with softmax regression to classify signal frequency information as features [161]; and in conjunction with the Ada-boost algorithm to classify single channels [162]. DNN was used: with variance based CSP features to classify MoI EEG [171]; and to find neural patterns occurring at each time point in single trials, where the input heatmaps were created with the layer-wise relevance propagation technique [170]. In addition, MoI EEG signals were classified by denoising DA using multifractal attribute features [159].
DBN was used by Li et al. to extract low dimensional latent features as well as to select critical channels, which led to an early framework for affective state classification using EEG signals [163]. In a similar work, Jia et al. used a semi-supervised approach with active learning to train DBN and generative RBMs for the classification [164]. Later, using differential entropy as features to train DBN, Zheng et al. examined dominant frequency bands and channels of EEG in an emotion recognition system [165]. Jirayucharoensak et al. used PCA extracted power spectral densities from each EEG channel, which were corrected by covariate shift adaptation to reduce non-stationarity, as features for a stacked DA to detect emotion [160]. Tripathi et al. explored DNN (with Softmax activator and Dropout) and CNN [198] (with hyperbolic tangent, Max Pooling, Dropout, and Softplus) for emotion classification from the DEAP dataset using EEG signals and response face video [175]. Using similar data from the MAHNOB-HCI dataset, Soleymani et al. detected continuous emotion using RNN-LSTM [178]. Channel-wise CNN and its variant with RBM [176], and AR-model based features with sparse-DBN [169], were used to estimate a driver's cognitive states using EEG data.
In another approach to model cognitive events, EEG signals were transformed to time-lagged multi-spectral images and fed to CNN for learning the spectral and spatial representations of each image, followed by an adapted RNN (LSTM) to find the temporal patterns in the image sequence [179].
DBN has been employed in classifying EEG signals for anomaly detection in diverse scenarios including: online waveform classification [166]; AD diagnosis [167]; and, integrated with HMM, understanding sleep phases [168]. To detect and predict seizures, CNN was used through classification of synchronization patterns [177], and RNN predicted specific signal features related to seizure after being trained with data preprocessed by wavelet decomposition [180]. Also, a lapse of responsiveness warning system was proposed using RNN (LSTM) [181].
ECG Arrhythmias were successfully detected using DBN [188] and DNN [189]. DBN was also used to classify ECG signals acquired with two-leads [187], and in combination with nonlinear SVM and Gaussian kernel [186].
RL has also been applied in BMI research. Concentrating mainly on controlling (prosthetic/robotic) devices, several studies have been reported, including: mapping neural activity to intended behavior through coadaptive BMI (using TD(λ)) [190] and symbiotic BMI (using actor-critic) [191]; a testbed targeting a center-out reaching task in primates for creating more realistic BMI control models [192]; Hebbian RL for adaptive control by mapping neural states to prosthetic actions [193]; BMI for unsupervised decoding of cortical spikes in a multistep goal-directed tracking task (using Q(λ)) [194]; adaptive BMI capable of adjusting to dramatically reorganizing neural activities with minimal training and stable performance over long durations (using actor-critic) [195]; and BMI for efficient nonlinear mapping of neural states to actions through sparsification of the state-action mapping space using quantized attention-gated kernel RL as an approximator [196]. Also, Lampe et al. proposed a BMI capable of transmitting EEG signals evoked by imagined movements over the Internet to remotely control a robotic device [182], and Bauer and Gharabaghi combined RL with a Bayesian model to select dynamic thresholds for improved performance of restorative BMI [183].

Performance Analysis and Comparison
Comparative test results, in the form of performances/ accuracies of each DL technique when applied to the data coming from Omics (Fig. 2), Bioimaging (Fig. 3), Medical Imaging (Fig. 4), and [Brain/Body]-Machine Interfaces (Fig. 5), are summarized below to help readers select the appropriate method for their research. The reported performances can be regarded as a metric to evaluate the strengths/ weaknesses of a particular technique with a given set of parameters on a specific dataset. It should be noted that several factors (e.g., data pre-processing, network architecture, feature selection and learning, parameters' optimization, etc.) collectively determine the accuracy of a method.
In Figs. 2-5, each group of bars indicates accuracies/performances of comparable DL or non-DL techniques when applied to the same data and reported in an individual study. Each bar in a group shows the (mean) performance of different runs of a technique on multiple subjects or datasets (for means, the error bar is ± standard deviation). Performance comparison of CPI is shown in Fig. 2B. Tested over two CPI datasets, a DNN based method (DNN*) achieved superior prediction accuracy (93.2% in dataset1 and 93.8% in dataset2) compared to other methods based on RF (83.9% and 86.6%), LR (88.3% and 89.9%), and SVM (88.7% and 90.3%) [88]. In another study, a similar DNN* was applied on the DUD-E dataset, where it achieved higher accuracy (99.6%) than RF (99.58%) and CNN (89.5%) based methods [89]. As per the accuracies reported in [88], the RF based method had lower values in comparison to the LR and SVM based methods, which had similar values. In contrast, when applied on the DUD-E dataset (reported in [89]), the RF based method outperforms the CNN based method. This may be attributed to the fact that classification problems are data dependent: despite being one of the best classifiers [200], RF does not perform equally well across datasets.

Omics
In predicting 2-PS, DL based methods outperforms other methods (see Fig. 2C). When applied on two datasets (CASP11 and TS1199), the stacked sparse autoencoder (StSAE) based method achieved superior prediction accuracy ( To annotate GV in identifying pathogenic variants from two datasets (TS and CVESP in Fig. 2D), a DNN based method performed better (72.2% and 94.6%) than 13/33 LR (63.5% and 95.2%) and SVM (63.1% and 93.0%) based methods. Another CNN based approach to predict DNA sequence accessibility was tested on data from ENCODE & REC databases and was reported to outperform gapped k-mer SVM method (mean AUC of 0.89 vs. 0.78) [94]. In classifying cancer based on somatic point mutation using raw TCGA data containing 12 cancer types, a DNN based method outperformed non-DL methods (60.1% vs. [SVM: 52.7%, kNN: 40.4%, NB: 9.8%]) [90]. To detect breast cancer using GE data from TCGA database, a Stacked Denoising Autoencoder (StDAE) was employed to extract features. According to the reported accuracies of non-DL classifiers (ANN, SVM, and SVM-RBF), StDAE outperformed other feature extraction methods such as PCA and KPCA (SVM-RBF classification accuracies for StDAE, PCA, and KPCA were 98.26%, 89.13%, and 97.32%, respectively) [77]. Also, deeply connected genes were better classified with StDAE extracted features (accuracies-ANN: 91.74%, SVM: 91.74%, and SVM-RBF: 94.78%) [77]. Another study on classifying cancer, using 13 different GE datasets taken from the literature, reported that the use of PCA in data dimensionality reduction, before applying SAE, StAE, and StAE-FT for feature extraction, facilitates more accurate extraction of features (except AC and OV in Fig. 2D) for classification using SVM with Gaussian kernel [76].
Sequence specificities of [D/R]BPs were predicted more accurately by a deep CNN based method than by the non-DL methods that participated in the DREAM5 challenge [93]. As seen in Fig. 2E [201,202].
Moreover, in predicting RBP, DL based methods outperformed non-DL methods as seen in Fig

[Brain/Body]-Machine Interfaces
Performance comparisons of DL and non-DL methods applied to EEG data to detect MoI, emotion and affective state, and anomalies are shown in Fig. 5A.

Open Issues and Future Perspectives
Overall, it is believed that the brain solves problems through reinforcement learning and neuronal networks organized as hierarchical processing systems. Although the field of AI has been trying to adopt and implement this strategy in computers since the 1950s, notable progress has been seen only recently, owing to a better understanding of learning systems, the increase in computational power, the decline in computing costs, and, last but not least, the seamless integration of different technological and technical breakthroughs. However, there are still situations where these methods fail, underperforming traditional methods, and they must therefore be improved. Below we outline what are, in our opinion, the shortcomings of current techniques and the existing open research challenges, and we speculate about some future perspectives that will facilitate further development and advancement of the field.
The combined computational capability and flexibility provided by the two prominent ML methods (i.e., DL and RL) also has limitations [33]. Both methods require heavy computing power and memory and are therefore not worth applying to moderately sized datasets. Additionally, the theory of DL is not completely understood, which makes the high level outcomes obscure and difficult to interpret. As a result, the models are often regarded as 'black boxes' [206]. Like other ML techniques, DL is also susceptible to misclassification [72] and overclassification [207]. Furthermore, in representing action-value pairs in RL, not all nonlinear approximators can be used, as some may cause instability or even divergence [31]. Also, bootstrapping makes many RL algorithms NP-hard and inapplicable to real-time applications, as they are too slow to converge and in some cases too dangerous (e.g., autonomous driving). Moreover, very few of the existing techniques can harness the potential power of distributed and parallel computation through cloud computing. Arguably, in cloud, distributed, and parallel computing, data privacy and security concerns still prevail [208], and the capability to process the gigantic amount of experimentally acquired data in real time is still underdeveloped [209,210].
To mitigate these shortcomings and address the open issues, it is first of all crucial to improve the existing theoretical foundations of DL on the basis of experimental data, so that the performance of individual NN models can be quantified [211]. These improvements should address issues such as the specific assessment of an individual model's computational complexity and learning efficiency in relation to well defined parameter tuning strategies, and the ability to generalize and topologically self-organize based on data-driven properties. Also, novel data visualization techniques should be incorporated so that the interpretation of data becomes intuitive and less cumbersome. In terms of learning strategies, updated hybrid on- and off-policy methods that incorporate new advances in optimization techniques are required. The problems pertaining to observability in RL are yet to be completely solved, and optimal action selection is still a huge challenge.
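To make the on-policy versus off-policy distinction concrete, the sketch below contrasts the two textbook temporal-difference targets: SARSA (on-policy) and Q-learning (off-policy). These two algorithms are chosen here purely as illustrations and are not singled out in the text; all names and numbers are hypothetical.

```python
def sarsa_target(reward, next_q, next_action, gamma):
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + gamma * next_q[next_action]


def q_learning_target(reward, next_q, gamma):
    """Off-policy target: bootstrap from the greedy next action,
    regardless of which action the behaviour policy takes."""
    return reward + gamma * max(next_q.values())


def td_update(q_value, target, alpha):
    """Move the current estimate a fraction alpha toward the target."""
    return q_value + alpha * (target - q_value)


# Hypothetical action-value estimates for the next state (illustration only).
next_q = {"left": 1.0, "right": 2.0}
reward, gamma, alpha = 0.5, 0.9, 0.1

# Suppose the behaviour policy explores and picks the non-greedy action "left".
on_policy = sarsa_target(reward, next_q, "left", gamma)
off_policy = q_learning_target(reward, next_q, gamma)

print("SARSA target:     ", on_policy)
print("Q-learning target:", off_policy)
print("updated estimate: ", td_update(0.0, on_policy, alpha))
```

The two targets differ only when the next action is exploratory. A hybrid scheme would have to decide how to weight or switch between them, which this sketch does not attempt.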
As seen in Table 1, there are great opportunities to employ deep RL in Biological data mining. For example, it could be used to derive dynamic information from Biological data coming from multiple levels, in order to reduce data redundancy and discover novel biomarkers for disease detection and prevention. Also, new unsupervised learning methods for deep RL are required to reduce the need for large sets of labeled data in the training phase.
Multitasking and multiagent learning paradigms should advance in order to cope with dynamically changing problems.
In addition, to keep up with the rapid pace of data growth in the biological application domains, computational infrastructures in terms of distributed and parallel computing tailored to those applications are needed.

Conclusion
The recent boom of technological advancement in Life Sciences came with the huge challenge of mining multimodal, multidimensional, and complex Biological data. Triggered by that call, interdisciplinary approaches have resulted in the development of cutting edge machine learning based analytical tools. The success stories of artificial neural networks, deep architectures, and reinforcement learning in making machines intelligent are well known. Furthermore, computational costs have dropped, computing power has surged, and quasi unlimited solid-state storage is available at a reasonable price. These factors have made it possible to combine these learning techniques to reshape machines' capabilities to understand and decipher complex patterns from Biological data. To facilitate wider deployment of such techniques and to serve as a reference point for the community, this article provides: a comprehensive survey of the literature on the usability of these techniques with different Biological data; a comparative study of the performances of various DL techniques, as reported in the literature, when applied to data from different application domains; and highlights of some open issues and future perspectives.