Efficient coding of time-relative structure using spikes

Evan Smith1 Michael S. Lewicki2

evan+@cnbc.cmu.edu lewicki@cnbc.cmu.edu

Departments of Psychology1 & Computer Science2 Center for the Neural Basis of Cognition Carnegie Mellon University

To whom correspondence should be addressed.

Abstract

Nonstationary acoustic features provide essential cues for many auditory tasks including sound localization, auditory stream analysis, and speech recognition. These features can be best characterized relative to a precise point in time such as the onset of a sound or the beginning of a harmonic periodicity. Extracting these types of features is a difficult problem. Part of the difficulty is that with standard block-based signal analysis methods the representation is sensitive to the arbitrary alignment of the blocks with respect to the signal. Convolutional techniques such as shift-invariant transformations can reduce this sensitivity, but these do not yield a code that is efficient, i.e., one that forms a nonredundant representation of the underlying structure. Here, we develop a non-block-based method for signal representation that is both time-relative and efficient. Signals are represented using a linear superposition of time-shiftable kernel functions, each with an associated magnitude and temporal position. Signal decomposition in this method is a nonlinear process that consists of optimizing the kernel function scaling coefficients and temporal positions to form an efficient, shift-invariant representation. We demonstrate the properties of this representation for the purpose of characterizing structure in various types of nonstationary acoustic signals. The computational problem investigated here has direct relevance to neural coding at the auditory nerve and to the more general issue of how to encode complex, time-varying signals with a population of spiking neurons.

1 Introduction

Nonstationary and time-relative acoustic structures such as transients, timing relations among acoustic events, and harmonic periodicities provide essential cues for many types of auditory processing. In sound localization, human subjects can reliably detect interaural time differences as small as 10 µs, which corresponds to a binaural sound source shift of about 1 degree (Blauert, 1997). In comparison, the sampling interval for an audio CD sampled at 44.1 kHz is 22.7 µs. Auditory grouping cues, such as common onset and offset, harmonic comodulation, and sound source location, all rely on accurate representation of timing and periodicity (Slaney and Lyon, 1993). Time-relative structure is also crucial for the recognition of consonants and many types of transient, nonstationary sounds. Neurophysiological research in the auditory brainstem of mammals has found cells capable of conveying precise phase information up to 4 kHz or of tracking the quickly varying envelope of a high-frequency sound (Oertel, 1999). The importance of these acoustic cues has long been recognized, but extracting them from natural signals still poses many challenges because the problem is fundamentally ill-posed. In natural acoustic environments, with multiple sound sources and background noises, acoustic events are not directly observable and must be inferred using numerous ambiguous cues.
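To put the two timing figures above side by side, the short calculation below computes the CD sampling interval and expresses the 10 µs detection threshold as a fraction of one sample. It is only an illustrative sketch; the variable names are chosen here and do not come from the paper.

```python
# Illustrative arithmetic only; variable names are assumptions, not the paper's.
fs_cd_hz = 44_100.0                  # audio CD sampling rate
itd_threshold_us = 10.0              # smallest detectable interaural time difference

# Sampling interval in microseconds: 1 / fs, converted from seconds.
sampling_interval_us = 1e6 / fs_cd_hz

# The detection threshold expressed in units of one CD sample.
threshold_in_samples = itd_threshold_us / sampling_interval_us

print(f"sampling interval: {sampling_interval_us:.2f} us")
print(f"ITD threshold: {threshold_in_samples:.2f} samples")
```

The threshold is less than half of one sampling interval, which is why a representation tied to a coarse time grid cannot directly express timing at this resolution.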

Another reason for the difficulty in obtaining these cues is that most approaches to signal representation are block-based, i.e., the signal is processed piecewise in a series of discrete blocks. Transients and nonstationary periodicities in the signal can be temporally smeared across blocks. Large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Signal analysis techniques such as windowing or the choice of the transform can reduce these effects, but it would be preferable if the representation were insensitive to signal shifts.
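The alignment sensitivity described above can be reproduced with a few lines of NumPy. The following is a minimal sketch under assumed settings (a 16 kHz sampling rate, non-overlapping 512-sample Hamming-windowed blocks, and a synthetic decaying 2 kHz burst as the transient); none of these values or the helper `block_spectra` are taken from the paper. The same signal is analyzed twice, with the block grid offset by half a block the second time.

```python
import numpy as np

def block_spectra(signal, block_len, hop, offset):
    """Magnitude spectra of Hamming-windowed blocks whose grid starts at `offset`."""
    window = np.hamming(block_len)
    starts = range(offset, len(signal) - block_len + 1, hop)
    return np.array([np.abs(np.fft.rfft(window * signal[s:s + block_len]))
                     for s in starts])

# Hypothetical test signal: one short transient (decaying 2 kHz burst) in silence.
fs = 16_000
t = np.arange(64) / fs
signal = np.zeros(4096)
signal[1000:1064] = np.exp(-800.0 * t) * np.sin(2 * np.pi * 2000.0 * t)

block_len = hop = 512  # non-overlapping blocks, assumed for simplicity

# Same signal, two arbitrary alignments of the block grid.
spectra_a = block_spectra(signal, block_len, hop, offset=0)
spectra_b = block_spectra(signal, block_len, hop, offset=256)

# Spectral energy per block under each alignment.
energy_a = (spectra_a ** 2).sum(axis=1)
energy_b = (spectra_b ** 2).sum(axis=1)

print("alignment A, energy per block:", np.round(energy_a, 3))
print("alignment B, energy per block:", np.round(energy_b, 3))
```

With the first alignment the burst straddles a block boundary, so it is split across two blocks and attenuated by the window edges; with the second it falls near the center of a single block. The event is identical in both cases, yet its block-wise representation changes substantially.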

Shift-invariance alone, however, is not a sufficient constraint on designing a general sound processing algorithm. Another important constraint is coding efficiency or, equivalently, the ability of the representation to capture underlying structure in the signal. A desirable code should reduce the information rate from the raw signal so that the underlying structures are more directly observable. Signal processing algorithms can be viewed as methods for progressively reducing the information rate until one is left with only the information of interest. We can make a distinction between the observable information rate, i.e., the rate of the observable variables, and the intrinsic information rate, i.e., the rate of the underlying structure of interest. In speech, the observable information rate of the waveform samples is about 50,000 bits per second, but the intrinsic rate of the underlying words is only around 200 bits per second (Rabiner and Levinson, 1981). Information reduction can be achieved either by selecting only the desired information (and discarding everything else) or by removing redundancy, e.g., the temporal correlations between samples. Removing redundancy reduces the observable information rate while preserving the intrinsic information.
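As a toy illustration of reducing the observable rate by removing temporal correlations, the sketch below generates a strongly correlated sequence and codes the error of a one-tap linear predictor instead of the raw samples. This is a generic textbook example, not the method developed in this paper; the first-order autoregressive signal, the coefficient `rho`, and the Gaussian high-resolution approximation used to convert variances into bits are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated signal: first-order autoregressive process.
n, rho = 50_000, 0.95
noise = rng.standard_normal(n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = rho * x[i - 1] + noise[i]

# Least-squares one-tap predictor: predict x[i] from x[i - 1].
coef = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])
residual = x[1:] - coef * x[:-1]

# Under a Gaussian, fine-quantization approximation, coding the residual
# instead of the raw samples saves 0.5 * log2(variance ratio) bits per sample.
bits_saved_per_sample = 0.5 * np.log2(x.var() / residual.var())

print(f"estimated predictor coefficient: {coef:.3f}")
print(f"approximate saving: {bits_saved_per_sample:.2f} bits per sample")
```

The residual, together with the predictor coefficient and the first sample, is enough to reconstruct the sequence, so the saving comes from removing redundancy rather than from discarding information.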

In this paper, we investigate algorithms for fitting an efficient, shift-invariant representation to natural sound signals. The outline of the paper is as follows. The next section describes the motivations behind this approach and illustrates some of the shortcomings of current methods. After defining the model for signal representation, we present different algorithms for signal decomposition and contrast their complexity. Next we illustrate the properties of the representation on various types of speech sounds. We then present a measure of coding efficiency and compare these algorithms to traditional methods for signal representation. Finally, we discuss the relevance of the computational issues discussed here to spike coding and signal representation at the auditory nerve.

2 Representing Nonstationary Acoustic Structure

Encoding the acoustic signal is the first step in any algorithm for performing an auditory task. There are numerous approaches to this problem, which differ both in their computational complexity and in what aspects of signal structure are extracted. Ultimately, the choice about what the representation encodes depends on the tasks that need to be performed. In the ideal case, the encoding process extracts only that information which is necessary to perform the task and suppresses noise or unrelated information. A “generalist” approach, like that taken by most mammalian auditory systems, requires a representation which is efficient for a wide range of signals. As natural sounds contain both relatively stationary harmonic structure (e.g., animal vocalizations) and nonstationary transient structure (e.g., crunching leaves and twigs), this generalist approach requires a code capable of efficiently representing these disparate sound classes (Lewicki, 2002a). Here we seek an auditory representation that is useful for a variety of different tasks.

2.1 Block-based Representations

Most approaches to signal representation are block-based, i.e., signal processing takes place on a series of overlapping, discrete blocks. This not only obscures transients and periodicities in the signal, but can also have the effect that, for nonstationary signals, small time shifts produce large changes in the representation, depending on whether and where a particular acoustic event falls within the block. Figure 1 illustrates the sensitivity of block-based representations to small shifts in speech signals. The upper panel shows a short speech waveform sectioned into blocks using two sequences of Hamming windows (solid and dashed curves). Each window spans approximately 30 ms (512 samples) and successive blocks (A1, A2, etc.) are shifted by 10 ms. The Figure 1: Block-based representations are sensitive to temporal shifts. The top panel shows