Biologically-inspired neural coding of sound onset for a musical sound classification task

A biologically-inspired neural coding scheme for the early auditory system is outlined. The cochlear response is simulated with a passive gammatone filterbank. The output of each bandpass filter is spike-encoded using a zero-crossing based method over a range of sensitivity levels. The scheme is inspired by the highly parallelised nature of the auditory nerve innervation within the cochlea. A key aspect of early auditory processing, namely onset detection, is simulated using leaky integrate-and-fire neuron models. Finally, a time-domain neural network (the echo state network) is used to tackle the 'what' task of auditory perception using the output of the onset detection neurons alone. A set of interim results is presented.


I. INTRODUCTION
The mammalian auditory system performs a diverse range of signal processing tasks in near real time. Presented with a raw sound field, analysis is carried out to extract meaningful features, which may or may not be buried along with contributions from other sound sources. Such useful features include the direction from which a particular sound arrived (the 'where' task) [1], [2], the nature of an individual sound (the 'what' task) [3], interpreting the meaning of the sound (as in speech perception) [4] and decomposing a many-source sound field into separable audio streams [5], [6]. In many cases several of these tasks must be performed at the same time.
The processing of sound within the auditory system is highly integrated, involving neural processes at all levels, from the cochlea to the cortex. The system is two-way, with information passed both upwards to the cortex [7], and back downwards towards the sensory units through the efferent system [8], [9]. A key feature is that certain kinds of processing occur early on, even in advance of the brain stem [10].
In this work a biologically-inspired scheme for sound onset representation within the auditory system is investigated. There is strong evidence to suggest that mammalian auditory systems are particularly attuned to the detection of sound onsets, even from the earliest stages of the auditory processing chain [11], [12]. The auditory nerve itself is known to respond more strongly to the start of a stimulus, and there are neurons within the cochlear nucleus which spike strongly at stimulus onset [4], [13], [14]. Sound onsets may be important for sound source location [1] and sound identification [15], and are thought to play a role in the segregation of auditory streams [5], [6], [17].
From an ecological perspective the sound onset is potentially useful because its location at the start of a sound may aid in priming a response. The initial onset also tends to be relatively untainted by reverberation, as it usually arrives at the listener via a direct path from the source. For most tasks later reflections are ignored in favour of the initial onset [18].
Every sound begins with an onset. However, the precise definition of what constitutes the 'sound onset' is less clear [19]. It is possible to analyse a sound onset based on the physics of the sound production mechanism. In the case of a trumpet blowing a pitched note, for example, there is a short period of time at the beginning of the note when the vibrating lips of the player are not influenced by the acoustics of the instrument. At some later time a coupled interaction begins, which leads to the steady-state pitched note. It may be argued that the onset portion of the note occurs before full coupling between instrument and player, and the steady-state portion follows coupling. However, such a physical process is not necessarily perceived in the same clear order by the auditory system. A number of further factors, such as reverberant reflections, may contribute to the final waveform which reaches the ear.
The precise meaning of 'onset' in the context of perception can thus only be properly explored by studying the response of the auditory system to real sounds. What is clear is that the temporal fine-structure and frequency evolution of sound onsets vary widely, both in terms of perception [13] and from a generative standpoint. A drum hit, for example, clearly involves a different kind of physical onset than a slowly bowed violin string, and would be expected to produce a different sensation of 'onset' in a listener. We henceforth refer to the perceptual onset simply as the onset, and, in seeking to explore it with an auditory model, define it as a sudden and rapid rise in signal energy as seen by the sound receptor (in this case the cochlea). This may be a rise from a zero-level, or a pronounced increase from one level to a higher level.
In this work the (perceptual) onset is simulated using a spiking time-domain auditory model, based on the gammatone filterbank [20]. Section II provides an overview of the model and the coding scheme. In section III a method is outlined which uses the simulated spiking onset response as a descriptor for a musical sound classification task. Musical samples are sourced from the McGill dataset [21]. The classification task is then performed using a time-domain reservoir neural network known as the echo state network [22], and is outlined in section IV. Section V provides an overview of some initial classification results.
Fig. 1. A schematic diagram outlining the auditory model. Note that AN spike generation is shown for only one channel, and onset neurons/depressing synapses for a single sensitivity level and three input channels.

II. BIOLOGICALLY-INSPIRED CODING OF THE EARLY AUDITORY SIGNAL
There have been numerous attempts to design a sound onset detector [23], [24], [25], [26], [27]. The most common uses for such a detector have been in automatic music transcription [28], sound segmentation [10], [17], [29], lip synchronisation [30], monaural sound-source separation [5], [31] and sound direction finding [1], [32]. In this work an attempt was made to use a neural-like coding of the sound onset as a descriptor in a musical sound classification problem.
The onset detection technique was based on a biologically-inspired model of the mammalian auditory system, illustrated schematically in Fig. 1. The cochlear response was modelled with the widely used passive gammatone filterbank (A) [20]. The output from each gammatone filter was spike-encoded (B) using a zero-crossing based technique [33], the design of which was inspired by the phase-locked spiking behaviour observed in neurons which innervate the cochlea's inner hair cells (IHC) [34]. This encoding thus provides a crude simulation of the auditory nerve's (AN) early response to sound stimuli. The strong spiking onset response exhibited by certain neurons within the cochlear nucleus [14] was then modelled using an array of leaky integrate-and-fire (LIF) neurons, innervated by the simulated AN signal (C), as implemented in [33]. Example outputs from these processing stages are shown in Fig. 2.

A. Gammatone filtering
The first processing step of the auditory model was to filter the sounds using a gammatone filterbank [20]. This filterbank comprised $n_{\mathrm{channels}} = 100$ bandpass filters, the (roughly logarithmic) spacing and bandwidths of which were designed to mimic the first order response of the basilar membrane. The bandwidth at the 6 dB down points is approximately 20% of the centre frequency of each channel. Using 100 channels between 0.1 kHz and 10 kHz ensured considerable overlap between adjacent filters, as is the case in the cochlea. All sound samples used were sampled at 44.1 kHz with 16-bit resolution.
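The overlap claim can be checked with a little arithmetic. The following is a minimal sketch, assuming exactly logarithmic spacing of the centre frequencies (the text only says "roughly logarithmic") and treating the 20% bandwidth as a passband of plus or minus 10% around each centre frequency. The function names are illustrative and are not taken from the original implementation.

```python
import math

N_CHANNELS = 100            # number of bandpass filters (from the text)
F_LOW_HZ = 100.0            # lowest centre frequency, 0.1 kHz
F_HIGH_HZ = 10_000.0        # highest centre frequency, 10 kHz
BANDWIDTH_FRACTION = 0.20   # 6 dB bandwidth as a fraction of centre frequency


def centre_frequencies(n=N_CHANNELS, f_low=F_LOW_HZ, f_high=F_HIGH_HZ):
    """Centre frequencies with exactly logarithmic spacing (an assumption)."""
    ratio = (f_high / f_low) ** (1.0 / (n - 1))
    return [f_low * ratio ** k for k in range(n)]


def overlapping_neighbours(freqs, k, fraction=BANDWIDTH_FRACTION):
    """Count other channels whose centre frequency lies inside channel k's passband."""
    half_bw = 0.5 * fraction * freqs[k]
    return sum(
        1 for i, f in enumerate(freqs)
        if i != k and abs(f - freqs[k]) <= half_bw
    )


freqs = centre_frequencies()
# Fractional step between adjacent centre frequencies (a little under 5%).
spacing = freqs[1] / freqs[0] - 1.0
```

Under these assumptions adjacent centre frequencies differ by a little under 5%, far less than the 20% bandwidth, so the passband of a mid-range channel contains the centre frequencies of two neighbours on each side.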

B. AN-like spike encoding
The outputs from the filterbank channels were coded in a manner inspired by the neural coding within the mammalian auditory nerve. The output from each channel was spike-encoded over $n_{\mathrm{levels}} = 16$ sensitivity levels, leading to a total of $n_{\mathrm{channels}} \times n_{\mathrm{levels}} = 1600$ individual spike trains. The use of multiple sensitivity levels per channel provided information about the dynamic level changes of the signal across frequency and time.
Spikes were produced at positive-going zero-crossings of the filtered signals. For each detected zero-crossing $i$, the mean signal amplitude during the previous quarter cycle, $E_i$, was calculated and compared to the values $S_{j=1:16}$ described by the $n_{\mathrm{levels}} = 16$ sensitivity levels. If $E_i > S_j$ then a spike was produced at the $j$th sensitivity level. The sensitivity levels ran from small values at $j = 1$ (high sensitivity, low signal level required to produce a spike) to large values at $j = 16$ (low sensitivity, large signal level required to produce a spike), with a difference $\delta_{\mathrm{levels}}$ of 3 dB between levels. For any spike produced at level $j = k$, a spike was necessarily produced at all levels $j < k$. This representation is similar to that employed in [35], where Ghitza noted that it led to an improvement in automatic speech recognition in a noisy environment.
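The encoding rule can be sketched for a single channel as follows. This is an illustration under stated assumptions rather than the implementation of [33]: the lowest threshold `s1` is a hypothetical value, the 3 dB step is interpreted as an amplitude ratio of $10^{3/20}$, and the "mean signal amplitude" is taken as the mean absolute value over the quarter cycle before the crossing (the signal is negative there). Each spike is stored as the highest level $k$ reached, which implies spikes at all levels below $k$.

```python
import math

N_LEVELS = 16
DELTA_LEVELS_DB = 3.0


def sensitivity_levels(s1, n_levels=N_LEVELS, step_db=DELTA_LEVELS_DB):
    """Thresholds S_1..S_n, each step_db above the last (s1 is hypothetical)."""
    return [s1 * 10.0 ** (step_db * j / 20.0) for j in range(n_levels)]


def encode_channel(signal, fs, centre_hz, levels):
    """Spike-encode one filterbank channel.

    Returns a list of (sample_index, k) pairs, where k is the highest
    sensitivity level reached; spikes at levels 1..k are implied.
    """
    quarter = max(1, int(round(fs / (4.0 * centre_hz))))  # samples per quarter cycle
    spikes = []
    for i in range(1, len(signal)):
        if signal[i - 1] < 0.0 <= signal[i]:  # positive-going zero-crossing
            window = signal[max(0, i - quarter):i]
            e_i = sum(abs(v) for v in window) / len(window)
            k = sum(1 for s in levels if e_i > s)  # thresholds are increasing
            if k > 0:
                spikes.append((i, k))
    return spikes


fs = 44100
tone = [math.sin(2.0 * math.pi * 1000.0 * i / fs) for i in range(4410)]
levels = sensitivity_levels(s1=0.001)
loud = encode_channel(tone, fs, 1000.0, levels)
quiet = encode_channel([0.01 * v for v in tone], fs, 1000.0, levels)
```

In this example the full-scale 1 kHz tone exceeds every threshold, while the same tone attenuated by 40 dB only reaches the more sensitive levels, which is how the multi-level code carries level information.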

C. Onset detection
The AN-like representation described above does not emphasise onsets in the encoded sound signal, unlike the real mammalian auditory nerve [12]. However, its highly parallelised design makes it suitable for use with a secondary onset detection system. This system was inspired by the onset response behaviour exhibited by certain cells within the cochlear nucleus (octopus, and some bushy and stellate cells) [14].
The AN-like spike trains were passed through depressing synapses to a leaky integrate-and-fire (LIF) neuron layer. There was one LIF neuron per filterbank channel per sensitivity level (i.e. $n_{\mathrm{channels}} \times n_{\mathrm{levels}} = 1600$ onset neurons), and each neuron was innervated by AN spike trains from $n_{\mathrm{adj}}$ adjacent frequency bands (at the same sensitivity level) on either side of its centre frequency.
Fig. 2. Example plots showing the raw sound signal, AN-coded spikes and onset spikes for a single tone produced by a brass instrument. Onset spikes are clustered at the start of the note. Only the 5th sensitivity level is shown here. We call the overall pattern of onset spikes, across all channels and sensitivity levels, the onset fingerprint of the sound.
The synapse model was based on the 3-reservoir model used in [36] in the context of IHC-to-AN fibre transduction. A similar model has also been used in [37] to model rat neocortex synapses. The model employed three interconnected reservoirs of neurotransmitter. Reservoir $M$ represented the available presynaptic neurotransmitter, reservoir $C$ was the neurotransmitter currently in use, and reservoir $R$ contained neurotransmitter in the process of reuptake (i.e. used, but not yet available for reuse). The reservoir quantities were related by three first order differential equations as follows:
\[ \frac{dM}{dt} = \beta R - \gamma M, \qquad \frac{dC}{dt} = \gamma M - \alpha C, \qquad \frac{dR}{dt} = \alpha C - \beta R, \]
where $\alpha$ and $\beta$ are rate constants, and $\gamma$ was positive during an AN spike and zero otherwise. The differential equations were evaluated at each time sample as the AN spike train signals were fed to the onset layer through the depressing synapses. The loss and manufacture of neurotransmitter were not modelled, and the amount of post-synaptic depolarisation was assumed to be directly proportional to $C$.
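The behaviour of such a synapse can be illustrated with a forward-Euler sketch. It assumes the reservoir dynamics take the form $dM/dt = \beta R - \gamma M$, $dC/dt = \gamma M - \alpha C$, $dR/dt = \alpha C - \beta R$ (release from $M$ to $C$ at rate $\gamma$, reuptake from $C$ to $R$ at rate $\alpha$, recovery from $R$ to $M$ at rate $\beta$). The rate constants and the spike train below are illustrative values only, not those used in the study.

```python
def simulate_synapse(spike_train, dt, alpha, beta, gamma_on):
    """Forward-Euler integration of a three-reservoir depressing synapse.

    spike_train: sequence of 0/1 values, one per time sample.
    Returns a list of (M, C, R) tuples, one per sample.
    """
    m, c, r = 1.0, 0.0, 0.0  # all transmitter starts in the available pool
    states = []
    for spike in spike_train:
        gamma = gamma_on if spike else 0.0  # release only during an AN spike
        dm = beta * r - gamma * m
        dc = gamma * m - alpha * c
        dr = alpha * c - beta * r
        m += dt * dm
        c += dt * dc
        r += dt * dr
        states.append((m, c, r))
    return states


dt = 1.0 / 44100
# Hypothetical input: one single-sample spike roughly every millisecond.
train = [1 if i % 44 == 0 else 0 for i in range(441)]
states = simulate_synapse(train, dt, alpha=100.0, beta=9.0, gamma_on=22050.0)
```

With a slow recovery rate, each successive spike finds less transmitter in $M$, so the amount released into the cleft shrinks rapidly after the first few spikes. The total $M + C + R$ stays constant because loss and manufacture are not modelled.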
Innervation of each onset neuron in channel $b$ and sensitivity level $j$ from $n_{\mathrm{adj}}$ adjacent channels on either side resulted in a total input to the neuron of
\[ I_{b,j}(t) = \sum_{h=b-n_{\mathrm{adj}}}^{b+n_{\mathrm{adj}}} w \, C_{h,j}(t), \]
where $w$ was the weight of each synapse (the same for all inputs here) and $C_{h,j}$ was the neurotransmitter currently in use in the cleft between the AN input from channel $h$, at sensitivity level $j$, and the onset neuron. An $n_{\mathrm{adj}}$ value of 3 was used, so that each onset neuron was innervated by 7 AN channels.
Assuming the signal in a given bandpass channel $b$ was strong enough to produce AN spikes at sensitivity level $j$, the corresponding onset neuron for channel $b$, at sensitivity level $j$, would receive at least $F_b$ spikes per second (where $F_b$ is the centre frequency of the channel). In the case of multiple co-innervating adjacent channels on each onset neuron ($n_{\mathrm{adj}} > 0$), as used in this study, this rate would normally be greater due to contributions from higher frequency channels. However, depletion of the available neurotransmitter reservoir $M$, in conjunction with a slow reservoir recovery rate, meant that an excitatory post-synaptic potential (EPSP) would only be produced for the first few incoming AN spikes. The recovery rate was purposefully set low to ensure that synapses did not continue to produce EPSPs much beyond the initial sound onset.
The synapse weights $w$ were further set to ensure that a single EPSP was insufficient to cause the onset neuron to fire. This ensured that multiple EPSPs from adjacent synapses were required for the neuron potential to be large enough to fire. The neurons employed were also leaky [33], [38], meaning that the EPSPs needed to be close to concurrent for an action potential, or 'onset spike', to be produced. The overall aim was to ensure that onset spikes were only produced by sudden, cross-frequency rises in signal energy.
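The coincidence requirement can be demonstrated with a minimal leaky integrate-and-fire sketch. The membrane time constant, threshold, time step and pulse sizes are hypothetical values chosen so that one input pulse alone stays below threshold; they are not the parameters of [33].

```python
def lif_onset_neuron(drive, dt, tau, threshold, refractory=0.0):
    """Leaky integrate-and-fire neuron driven by a summed synaptic input.

    drive: total synaptic input per time sample.
    Returns the sample indices at which the neuron fires.
    """
    v = 0.0
    hold = 0  # remaining refractory samples
    fired = []
    for i, x in enumerate(drive):
        if hold > 0:
            hold -= 1
            continue
        v += dt * (-v / tau + x)  # leak towards zero plus input
        if v >= threshold:
            fired.append(i)
            v = 0.0
            hold = int(round(refractory / dt))
    return fired


dt, tau, threshold = 0.001, 0.005, 1.0
single = [600.0] + [0.0] * 20        # one EPSP-like pulse on its own
concurrent = [1800.0] + [0.0] * 20   # three pulses arriving together
spread = [0.0] * 21                  # three pulses 10 ms apart
for i in (0, 10, 20):
    spread[i] = 600.0

fired_single = lif_onset_neuron(single, dt, tau, threshold)
fired_concurrent = lif_onset_neuron(concurrent, dt, tau, threshold)
fired_spread = lif_onset_neuron(spread, dt, tau, threshold)
```

Only the concurrent case crosses threshold. The leak removes most of each isolated pulse before the next one arrives, so inputs spread over time do not sum to a spike.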

III. REPRESENTATIONS OF THE ONSET
The problem of musical sound classification has been the subject of extensive study. The most common approach to the task has been to calculate a range of descriptors for a sound based on its Fourier components [15], [16], [39], [40], [41]. Such descriptors may be based on analysis of the whole sound, or upon just the steady state and/or the initial transient portion of the sound. Cepstral coefficients are a particularly popular quantity, and have shown good performance with certain classification tasks [42]. A mixture of frequency and time-domain quantities has also been proposed and shown to be up to 90% successful in a 15 class task [43].
Most of the outlined approaches have used standard signal processing techniques to calculate a large number of descriptors (approximately 30 to 50), which form a one-dimensional feature vector $D$. Many sounds can be quickly processed, and a standard feed-forward learning framework may be employed to classify the sounds based on their $D$ vector [42], [43]. Although such techniques can be remarkably successful, their underpinnings are somewhat removed from the spiking, highly parallelised nature of the mammalian auditory perception and learning systems. The work presented here is an attempt to work within a more biologically realistic framework, both for the formation of sound descriptors, and for the task of sound learning and classification itself.
The auditory model described in section II takes raw sound as an input and provides a simplified representation of the onset response within the cochlear nucleus as an output. The objective of this work was to use this onset response as a descriptor in a musical sound classification task similar to that presented in [43]. The key feature of this approach is that it operates purely in the time domain, and produces onset spikes which are also in the time domain. If the principal advantage of the method was to be exploited, namely the retention of precise timing and frequency information during the sound onset, then the standard feed-forward classification procedures (such as back-propagating or radial basis function neural networks) were unsuitable.
It was thus proposed that the classification descriptor be based entirely on the pattern of onset spikes, which we call the onset fingerprint, and that the descriptor should remain as a time-domain representation of the onset. In order to exploit such a temporal representation, a neural network which operated in the time domain was required. The echo state network approach, though originally developed for time-series prediction [22], has also proven to be a popular choice for similar classification tasks [44], [45], and was employed here. Its recurrent, temporal nature was also appropriate for the biologically-inspired framework of the present study.
It would be possible to use the raw onset fingerprint as the time-domain onset descriptor. However, in the present study a simplified form was used which reduced the large 3-dimensional onset array (over time, channels and sensitivity levels) to a smaller 2-dimensional time-series matrix. This was done to reduce the number of input channels required by the echo state classification network (see section IV). For each processed sound, the spikes of the entire onset fingerprint were first grouped to identify the onset feature which corresponded to the start of the musical note. This was important as certain sounds, such as a flute tone played with heavy vibrato, can produce onset spikes during the steady state due to the rather large amplitude variations introduced by the vibrato. Here the onset descriptor was limited exclusively to the initial onset transient. The grouping procedure examined the time separation between onset spikes within each onset fingerprint. Groups of onset spikes separated by more than a critical time period $\delta t_{\mathrm{group}}$ (here set to 20 ms) were treated as separate onset events. Only the first onset event grouping was picked out as the descriptor, which we term the initial onset fingerprint (IOF, see Fig. 2).
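The grouping step amounts to cutting the time-ordered spike list at the first gap longer than the critical period. A minimal sketch follows, assuming each onset spike is stored as a `(time_s, channel, level)` tuple; this storage format and the function name are illustrative choices.

```python
def initial_onset_fingerprint(spikes, max_gap=0.020):
    """Return the spikes belonging to the first onset event.

    spikes: iterable of (time_s, channel, level) tuples.
    A gap of more than max_gap seconds between consecutive spikes
    (pooled over all channels and levels) starts a new onset event.
    """
    ordered = sorted(spikes, key=lambda s: s[0])
    if not ordered:
        return []
    group = [ordered[0]]
    for spike in ordered[1:]:
        if spike[0] - group[-1][0] > max_gap:
            break  # later events (e.g. vibrato-induced) are discarded
        group.append(spike)
    return group


example = [(0.105, 3, 2), (0.100, 1, 4), (0.112, 2, 1), (0.300, 1, 1), (0.305, 2, 2)]
iof = initial_onset_fingerprint(example)
```

Here the spikes near 0.1 s form the initial onset fingerprint, and the later cluster at 0.3 s is treated as a separate event and dropped.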
Fig. 4. A schematic diagram outlining the structure of the echo state network [22]. A single input layer consisting of $n_{\mathrm{channels}}$ nodes connects directly into a large 1500 unit untrained reservoir layer. Only connections from the reservoir layer to the output layer are trained. The green node in the output layer illustrates the manner in which the network is trained to flag a certain class according to the current input time sequence.
The IOF was further processed to reduce the number of temporal sample points. This was achieved by time-slicing the IOF into windows of duration $\delta t_{\mathrm{step}}$ (2 ms used here). The spiking onset behaviour (across all sensitivity levels) within each time window of each filter channel $b$ was examined, and only a single spike at the least sensitive level reached (corresponding to the maximum signal level) was retained. If no spikes occurred, a zero was recorded. In this manner the onset data within each time window was reduced to a single vector, as shown in Fig. 3. For an IOF lasting 24 ms, this resulted in a 12-step initial onset fingerprint time-series (IOFTS) $T_{i=1:12}$ (where $i$ indexes time-step), with each step comprising an $n_{\mathrm{channels}}$-sized vector. Without this reduction to a 2D time-series, the same IOF, in its raw onset state, would require a $1058 \times n_{\mathrm{channels}} \times n_{\mathrm{levels}}$ 3-dimensional matrix (where 1058 is the original number of time samples). More sophisticated methods for performing this step, such as PCA, are currently under investigation. The key outcome was that the raw IOF, while coded in a reduced space to give the IOFTS, remained a time-domain representation of the sound onset. This was the signal used as input to the time-domain neural network (see section IV-A) to solve the classification task outlined in section IV-B.
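The time-slicing reduction can be sketched as below. The sketch assumes spikes are `(time_s, channel, level)` tuples and that the value retained per window and channel is the index of the least sensitive level that spiked, with zero meaning no spike. The text does not state the exact numerical value that was stored, so this is an interpretation.

```python
def iof_to_time_series(spikes, n_channels, step=0.002):
    """Reduce an initial onset fingerprint to a 2-D time-series (IOFTS).

    spikes: list of (time_s, channel, level) tuples from one IOF.
    Returns a list of windows, each a list of n_channels values holding
    the highest level index that spiked in that window (0 if none).
    """
    if not spikes:
        return []
    t0 = min(s[0] for s in spikes)
    t1 = max(s[0] for s in spikes)
    n_steps = int((t1 - t0) / step) + 1
    series = [[0] * n_channels for _ in range(n_steps)]
    for t, channel, level in spikes:
        i = int((t - t0) / step)  # index of the 2 ms window
        series[i][channel] = max(series[i][channel], level)
    return series


demo = [(0.1000, 0, 3), (0.1005, 0, 7), (0.1031, 2, 5)]
iofts = iof_to_time_series(demo, n_channels=4)
```

The sizes quoted in the text follow directly: a 24 ms IOF gives 24 / 2 = 12 windows of 100 values (1200 numbers), whereas at 44.1 kHz the same 24 ms spans about 1058 samples, so the raw array would hold 1058 x 100 x 16 entries.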

A. The echo state network approach
The echo state network (ESN) approach to recurrent neural networks has grown in popularity over the past decade [22]. It represents an implementation of reservoir computing, where a large, fixed and interconnected reservoir layer is perturbed by one or more input signals, as illustrated in Fig. 4. A trained linear combination of the nonlinear responses of the reservoir units is used as the learning framework. This approach is related to support vector machine techniques, which transform data from an input space into a (much) higher dimensional feature space [46], within which the data is easier to separate. Such networks have proved particularly effective at certain kinds of time-series learning problems [46]. Time-domain classification problems have also been addressed using the ESN approach, in particular for speech recognition [44], [45], [47].
In this work an ESN was used as a framework for classifying the time-series (IOFTS, see section III) obtained from the onset fingerprints of single musical instrument notes. A Matlab implementation of the ESN method developed by Jaeger et al. was employed [48]. The input layer consisted of 100 nodes, one for each of the $n_{\mathrm{channels}}$ filter channels (see Fig. 4), each fed by the corresponding channel of the IOFTS illustrated in Fig. 3. The reservoir layer consisted of 1500 randomly connected and weighted additive sigmoid neurons, the interconnections of which were untrained. In this initial implementation the neurons were non-leaky. The output layer consisted of $n_{\mathrm{classes}}$ nodes, one for each of the musical instrument classes (5 in this case). Only connections between the reservoir layer and the output nodes were trained, using a Delta-rule method.
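The structure described above (fixed random input and reservoir weights, with only the linear readout trained by a delta rule) can be illustrated with a deliberately tiny pure-Python sketch. It is not the Matlab toolbox of [48]: the class name `TinyESN`, the reservoir size of 20, the weight ranges, the 10% connectivity and the learning rate are all hypothetical choices made to keep the example small.

```python
import math
import random


class TinyESN:
    """Toy echo state network: fixed random reservoir, delta-rule readout."""

    def __init__(self, n_in, n_res, n_out, seed=0):
        rng = random.Random(seed)
        # Input and reservoir weights are random and never trained.
        self.w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                     for _ in range(n_res)]
        self.w_res = [[rng.uniform(-0.5, 0.5) if rng.random() < 0.1 else 0.0
                       for _ in range(n_res)] for _ in range(n_res)]
        self.w_out = [[0.0] * n_res for _ in range(n_out)]  # trained readout
        self.state = [0.0] * n_res

    def step(self, u):
        """Advance the reservoir by one time-step and return the outputs."""
        new_state = []
        for w_in_row, w_res_row in zip(self.w_in, self.w_res):
            a = sum(w * x for w, x in zip(w_in_row, u))
            a += sum(w * x for w, x in zip(w_res_row, self.state))
            new_state.append(math.tanh(a))  # non-leaky sigmoid unit
        self.state = new_state
        return [sum(w * x for w, x in zip(row, self.state))
                for row in self.w_out]

    def train_step(self, u, target, rate=0.05):
        """One online delta-rule update of the readout weights."""
        y = self.step(u)
        for k, row in enumerate(self.w_out):
            err = target[k] - y[k]
            for i in range(len(row)):
                row[i] += rate * err * self.state[i]
        return y


esn = TinyESN(n_in=2, n_res=20, n_out=2)
patterns = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
for _ in range(500):
    for u, target in patterns:
        esn.train_step(u, target)
```

After training on the two alternating toy inputs, the largest output component indicates which input is currently being presented, which is the same read-out rule used for classification later in the paper.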

B. The classification task
The classification task used musical instrument samples drawn from the McGill Master Samples dataset [21]. This dataset is comprised of high quality recordings of orchestral musical instruments, generally playing isolated notes. Rather than trying to classify individual instruments, the present work attempted to use the initial onset fingerprint time-series descriptor to differentiate between instruments based on their excitation technique, as in [43]. This is an easier task, but as outlined in section I it is both physically and taxonomically relevant, and so provides a useful test case for the method.
The instrument categories chosen each involve a different excitation mechanism [49], and so may be expected to produce a different kind of perceptual onset. The five instrument categories ($n_{\mathrm{classes}} = 5$) used were brass, reed (both single and double reed), plucked string, bowed string and struck string. These classes are summarised in Table II, together with a note of their mean onset durations. A total of 2397 individual sounds were used, with approximately 450 sounds per class. The data was split into a 70%/30% training/testing ratio, with 10 different random permutations run through the echo state network classifier and analysed separately.
To assemble the dataset, the sounds were first processed individually to obtain a set of stand-alone initial onset fingerprint time-series $T^s_i$, where $s$ indexes each sound, and thus runs from 1 to 2397 in the present study. The temporal duration of each $T^s_i$, i.e. the number of time-steps $i$, varied between approximately 10 and 50 (20-100 ms) depending on the nature of the onset (see Fig. 3 for an illustration of the IOFTS). The training dataset was assembled by randomly picking 70% of the individual time-series sequences $T^s_i$. These sequences were concatenated together one after the other in a random order, separated by short periods of 'silence', to create a single overall input training sequence $R^{Tr}$.
A teaching sequence $G^{Tr}$ was created with the same number of time-steps as $R^{Tr}$, consisting of $n_{\mathrm{classes}}$ parallel sequences (one for each instrument class). At each time-step $i$, the teacher signal $G^{Tr}_i$ was zero everywhere except in the sequence corresponding to the class of the current IOFTS, which had a value of unity.
Periods of 'silence', composed of 10 time-steps with zeros across all channels and outputs, were inserted between each time-series $T^s$ within the training/teaching sequences $R^{Tr}$ and $G^{Tr}$. Inserting silence in this manner was found to aid the classification success, most likely because it allowed the network to revert to a 'rest state' between being stimulated by the successive time-series $T^s$. Further study into the nature of this effect is ongoing. The testing sequences, composed of the input signal $R^{Te}$ and the target signal $G^{Te}$, were similarly assembled from the remaining 30% of the onset data.
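The assembly of the input and teacher sequences can be sketched as follows. The function name, the list-of-lists data layout and the returned `spans` bookkeeping (start index, end index and class of each IOFTS) are illustrative assumptions added for convenience.

```python
import random


def assemble_sequences(examples, n_channels, n_classes, silence_steps=10, seed=0):
    """Concatenate IOFTS examples into one input and one teacher sequence.

    examples: list of (iofts, class_index) pairs, where iofts is a list of
    n_channels-long vectors. Returns (inputs, teacher, spans), where spans
    lists (start, end, class_index) for every example in the sequence.
    """
    order = list(examples)
    random.Random(seed).shuffle(order)  # random presentation order
    inputs, teacher, spans = [], [], []
    for iofts, label in order:
        if inputs:  # silence goes between examples, not before the first
            for _ in range(silence_steps):
                inputs.append([0.0] * n_channels)
                teacher.append([0.0] * n_classes)
        start = len(inputs)
        for vec in iofts:
            inputs.append(list(vec))
            target = [0.0] * n_classes
            target[label] = 1.0  # unity in the current class only
            teacher.append(target)
        spans.append((start, len(inputs), label))
    return inputs, teacher, spans


toy = [([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], 2), ([[1.0, 1.0], [0.2, 0.0]], 4)]
inputs, teacher, spans = assemble_sequences(toy, n_channels=2, n_classes=5)
```

For the two toy examples (3 and 2 time-steps) the result is a 15-step sequence: 5 steps of data plus one 10-step silent gap in which both the input and the teacher signal are zero.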

C. Analysis of the ESN classification task
The echo state network described in section IV-A was trained/tested ten times with different 70%/30% splits of the five-class musical instrument dataset outlined in section IV-B and Table II. During the training phase the network was stimulated with the training sequence $R^{Tr}$. At each time-step the difference between the measured signal in the output layer $M^{Tr}$ and the target output $G^{Tr}$ was used to refine the weights between the reservoir layer and the output layer.
During the testing phase the trained network was stimulated with the testing sequence $R^{Te}$ and the observed output layer signal $M^{Te}$ was recorded. No training was performed. The network's classification at each time-step $i$ was deduced by identifying the largest component of the $M^{Te}$ signal.
The comparison between target and measured output class was performed separately for each $T^s$ (IOFTS) within the overall testing sequence $R^{Te}$. In order to allow the network time to compute the class of the current $T^s$, this evaluation was performed only during the final 50% of each IOFTS. The class computed by the network for this final 50% portion of an IOFTS was taken as the most frequently occurring class in the output signal $M^{Te}$ during this time period. This class was then treated as the network's prediction of the instrument class for the current IOFTS, and was directly compared with the true class recorded in the teacher signal $G^{Te}$ for the testing data.
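The decision rule for a single IOFTS can be sketched as a per-step argmax followed by a majority vote over the final half. The function name and arguments are illustrative, and the handling of ties (the class seen first among the tied ones wins) is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def classify_iofts(outputs, start, end, fraction=0.5):
    """Predict the class of one IOFTS from the network output signal.

    outputs: list of per-time-step output vectors for the whole sequence.
    start, end: index range [start, end) occupied by this IOFTS.
    Only the final `fraction` of the range is evaluated.
    """
    n_eval = max(1, int((end - start) * fraction))
    window = outputs[end - n_eval:end]
    # Per-step decision: index of the largest output component.
    winners = [max(range(len(o)), key=o.__getitem__) for o in window]
    # Overall decision: most frequently occurring class in the window.
    return Counter(winners).most_common(1)[0][0]


measured = [
    [0.9, 0.1, 0.0], [0.8, 0.1, 0.1], [0.6, 0.2, 0.2],  # early steps favour class 0
    [0.2, 0.1, 0.7], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3],  # final half: classes 2, 2, 1
]
predicted = classify_iofts(measured, start=0, end=6)
```

In this toy output the first half of the IOFTS favours class 0, but only the final three steps are counted, so the prediction is class 2.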

V. RESULTS OF THE CLASSIFICATION TASK
A. Classification success rates
Fig. 5 shows a mean confusion matrix, produced from the ten random 70%/30% cross-validation splits of the data, for the five class problem described in section IV-B (and Table II). It is important to note that this classification result was based exclusively on the initial onset fingerprint representation produced by the auditory model outlined in section II. No information regarding the steady state timbre was used.
The reed (Rd), bowed string (SB) and struck string (SS) classes were all identified correctly more frequently than not. The bowed string, which produced the longest mean onset duration of all five classes (see Table II), was particularly well identified (more than three-quarters of the time). The reed, which also featured a relatively long mean onset duration, was also identified with reasonable success.
Fig. 5. A mean confusion matrix calculated from testing data (30% of the total data), using ten cross-validations of the 5 class musical instrument type identification task (see Table II).
However, the confusion matrix revealed an overall classification success of approximately 45%. This low value was clearly influenced in part by the poor performance of the network in identifying the brass (Bs) and plucked string (SP) classes. Such a result suggests that in its present condition the onset fingerprint/echo state network method under-performed, at least with instrument classification problems, relative to results reported elsewhere which use the entire audio signal [43]. However, the work reported here represents an initial pilot study, the results of which may improve with further refinement of both the onset fingerprinting method and the implementation of the echo state network. It may also be the case that there are limits to the feasibility of identifying musical instrument types based on such a reduced representation of their initial transients. Indeed, while there is much evidence in the literature which reports on the significance of the onset for sound identification tasks [15], [16], this has always been in combination with other features of the sound. The fact that it can be difficult to identify the musical instrument type in the absence of the original sound onset does not necessarily imply that the sound onset alone may permit a successful identification. However, the relative success of the technique with three of the five instrument classes does suggest that the technique, with suitable refinement, may prove useful.

B. Analysis of the network learning strategy
It is interesting to note that the two instrument classes with the longest mean onset durations, the bowed string and the reed groups (see Table II), were the most accurately identified. There may be two factors at play here. Firstly, a longer onset means more average training time spent by the network in learning the onset fingerprints from these groups. Secondly, the way in which the network was configured to learn may have favoured longer onsets. During the training phase the network was set to learn continuously, including during the 'silent' periods between the successive $T^s$ sequences within the overall $R^{Tr}$ signal. A key feature of the ESN lies in its memory of previous states. By allowing the network to continue to train during the silent periods, regardless of the immediately previous target class, it is possible that its ability to accurately identify the most recent time-steps of the previous $T^s$ sequence was disrupted. This would be proportionally less significant for onsets of longer duration. This theory is further supported by the fact that progressively reducing the testing classification period (see section IV-C) towards the end of the onset did not increase the success rate. This is in direct contrast to what would be expected if the network was consistently and successfully classifying the current $T^s$. The 50% quotient used here was found to be about as good as could be obtained within the current framework.
Work is currently under way to alter the network's learning pattern during the training phase. In particular, the network will stop learning during all silent portions of the training signal $R^{Tr}$. It will also be prevented from learning during the first part of each $T^s$ within $R^{Tr}$, in a further attempt to prevent echoed states from recent $T^s$ sequences from interfering with the current input signal.

VI. CONCLUSIONS AND FURTHER WORK
The technique of onset fingerprinting was used to form sound descriptors for a five class musical instrument identification task. The initial results presented here suggest that such a method may prove useful as an initial classifier which, in combination with further parameterisation of the auditory signal, could allow a robust biologically-inspired classification framework to be developed. Very recent initial experiments suggest that using leaky neurons in the reservoir layer, coupled with resetting of the network between sound examples, can appreciably improve performance.
On their own the results here appear relatively poor in terms of the method's overall success rate. Further work will be required to determine if such a representation of the auditory signal, based on the onset fingerprint technique, can prove to be robust in isolation from the steady-state period of a sound. A more advanced implementation of a reservoir network, such as an echo state network with periodically engaged learning, may prove to be more suited to highly variable onset fingerprint patterns than the continuous online learning employed so far. Development of a more traditional Fourier-based description of the sound onset is also ongoing. This will allow a more detailed comparative picture to emerge of the true success of the present biologically-inspired technique.
Alternative methods for reducing/encoding the full onset fingerprint as a time-series are currently under investigation. In particular, principal component analysis (PCA) may prove to be a more useful technique for capturing the detail of the onset fingerprint than the simple time-windowing method employed here. Sound descriptors which involve aspects of the steady state timbre, in combination with the onset fingerprinting technique, are also under development. The objective throughout remains to develop descriptors which are biologically-inspired representations of the auditory signal. The remarkable success of the ear with all auditory tasks remains a high benchmark at which to aim.