Journal of Neuroscience
Articles, Behavioral/Systems/Cognitive

Sparse Representations for the Cocktail Party Problem

Hiroki Asari, Barak A. Pearlmutter, and Anthony M. Zador

Journal of Neuroscience, 12 July 2006, 26(28) 7477-7490; https://doi.org/10.1523/JNEUROSCI.1563-06.2006

Author affiliations: 1 Watson School of Biological Sciences, 2 Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, and 3 Hamilton Institute, National University of Ireland Maynooth, County Kildare, Ireland

Abstract

A striking feature of many sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have ∼30,000 cochlear neurons, but at least 1000 times as many neurons in the auditory cortex. Such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise. We instead posit that they directly subserve computations of interest. Here we show how sparse overcomplete linear representations can directly solve difficult acoustic signal processing problems, taking as an example monaural source separation that uses solely the cues provided by the differential filtering imposed on a source by its path from its origin to the cochlea [the head-related transfer function (HRTF)]. In contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space. The experimentally testable predictions that arise from this model, including a novel method for estimating the optimal stimulus of a neuron using data from a multineuron recording experiment, are generic and apply to a wide range of sensory computations.

Introduction

Animals in nature confront an acoustic environment consisting of sounds from a rich, indeed often bewildering, combination of sources. Survival depends on responding appropriately to potential threats, food sources, and mates (e.g., at a cocktail party), while at the same time ignoring the many irrelevant sound sources that may constitute the majority of the acoustic energy received. Source separation, or "stream segregation," is therefore one of the central problems in acoustic processing that organisms must solve. Animals must confront many of the same challenges in solving this problem as do artificial systems, and the insights gained from the one can be applied to the other. However, little is currently known about how animals solve this problem (but see Fishman et al., 2004; Micheyl et al., 2005), and no artificial system can solve it in a general setting.

Animals exploit a variety of binaural and monaural cues to separate acoustic sources (Bregman, 1990). For example, two tones occurring simultaneously are more likely to be grouped together perceptually (i.e., perceived as arising from the same source) than the same notes occurring sequentially. Such grouping makes sense under the assumption that the auditory system is trying to discover the statistically independent causes of the acoustic signals received at the ears (Bell and Sejnowski, 1995, 1997; Lewicki and Sejnowski, 2000; Simoncelli and Olshausen, 2001); simultaneous onset of two tones is unlikely to arise purely by chance, so it is more parsimonious to assume that the tones were caused by a single source (e.g., as harmonics of a single fundamental frequency). Many of the spectral, temporal, and spatial cues used for stream segregation can be interpreted in this context.

A striking feature of this and many other sensory processing problems is that there appear to be many more neurons engaged in the internal representations of the signal than in its transduction. For example, humans have only ∼30,000 cochlear neurons, but at least 1000 times as many neurons in the auditory cortex. Although such apparently redundant internal representations have sometimes been proposed as necessary to overcome neuronal noise, here we posit that they contribute to computation.

To extract the behaviorally relevant information embedded in natural acoustic environments, animals must be able to separate auditory streams originating from distinct acoustic sources (the "cocktail party problem"). The auditory cortex has orders of magnitude more neurons than the cochlea; thus, many different patterns of cortical activity may faithfully represent any given pattern of cochlear activity. We propose that the cortex exploits this excess "representational bandwidth" (DeWeese et al., 2005), or the excess degrees of freedom, by selecting the sparsest representation within an overcomplete set of features. This model suggests how this excess representational bandwidth can be used for computation, instead of merely to overcome neuronal noise as is usually assumed. We illustrate this model by showing how sparseness can be used to separate sources perceived monaurally. The model makes testable predictions about the dynamic nature of representations in the auditory cortex. Our results support the idea that sparse representations may underlie efficient computations in the auditory cortex.

Our approach is to adopt a practical computational framework for the cocktail party problem and then explore the testable implications that follow. Here we describe a model of how the auditory system can exploit one particular sort of monaural segregation cue, namely the spectral cues introduced by the differential filtering imposed by the head-related transfer function (HRTF). Note that in contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space; our model assumes that the locations of the sources have already been determined by other mechanisms.

The model posits that the neural representation of an acoustic stimulus is overcomplete in the sense that there are many more neurons available than are needed to represent the stimulus with high fidelity (Olshausen and Field, 1997; Lee et al., 1999; Lewicki and Sejnowski, 2000; Zibulevsky and Pearlmutter, 2001). Because the representation is overcomplete, there are many patterns of neural activity that all faithfully encode any given stimulus. We show that constraining neural activity to be sparse selects one of these representations and that the resulting pattern of neural activity solves the source separation problem, even when multiple sources are audible to only a single ear. The framework is quite general and can serve as a starting point for understanding how cortical circuits might exploit other sensory cues as well.

Materials and Methods

All programming was done in Matlab (The MathWorks, Natick, MA).

HRTF.

The HRTF is the filter imposed by the head and the detailed shape of the ear on sounds received at the cochlea. The HRTF depends on the spatial position (both the relative azimuth and elevation) of the source (Yost et al., 1996). At some frequencies, the HRTF can attenuate sound from one location by as much as 40 dB more than from another.

Although every individual has his or her own HRTF, the basic characteristics of HRTFs are similar across individuals. We used a representative left human pinna HRTF downloaded from http://www.itakura.nuee.nagoya-u.ac.jp/HRTF/ (Nishino et al., 2001).

Spectral basis for sources via non-negative matrix factorization.

We tested our algorithm on mixtures of musical sources. We used non-negative matrix factorization (NMF) to obtain basis elements for each source.

NMF is an algorithm for factorizing a data matrix under an elementwise non-negativity constraint (Lee and Seung, 1999). The original data matrix is given as an n × m matrix V, each column of which here contains the n data values for one of m spectrogram segments. The data matrix V is approximated by NMF as V ≈ QH, where the dimensions of the factors Q and H are n × r and r × m, respectively. The rank of factorization, r, is chosen so that nr + rm < nm, so as to compress the original nm elements in the matrix V into a smaller number of elements, nr in Q plus rm in H. Each column of Q contains one of the basis spectrograms, and the matrix H represents the coefficients for reconstructing the columns of the original data matrix V in this basis. The Q matrices obtained by NMF for each individual source were concatenated to form an overcomplete source-space basis matrix Q̄ = [Q1 | Q2 | … ]. Each column of Q̄ was then filtered through each HRTF and the results concatenated to form the feature matrix D = [h1(t) ∗ Q̄ | … | hN(t) ∗ Q̄].
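In outline, the feature matrix can be assembled as follows. This is a minimal Matlab sketch, not the original code: it assumes Q is a cell array holding the per-source NMF bases (75 × 15 each) and G is a matrix whose ith column holds the ith HRTF's magnitude response sampled at the 75 spectrogram band centers, a simplifying assumption under which filtering a single spectrogram slice reduces to a per-band gain.

    % Concatenate per-source NMF bases into the source-space basis Qbar,
    % then filter every column through each HRTF to build D.
    Qbar = cat(2, Q{:});                     % 75 x 75 source-space basis
    [n, nBasis] = size(Qbar);
    nLoc = size(G, 2);                       % number of source locations
    D = zeros(n, nBasis * nLoc);
    for i = 1:nLoc
        cols = (i - 1) * nBasis + (1:nBasis);
        D(:, cols) = diag(G(:, i)) * Qbar;   % features d_ij = h_i * q_j
    end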

In our experiments, the spectrograms in the data matrix V were obtained from music, natural sounds, or speech: commercial audio CDs (instrumental solos, classical and jazz, one each on cello, clarinet, trumpet, harp, and harpsichord, for a total of five), the audio CDs The Diversity of Animal Sounds and Sounds of Neotropical Rainforest Mammals (Cornell Laboratory of Ornithology, Ithaca, NY), and spoken poetry (Dylan Thomas, T. S. Eliot, Frank O'Hara, and William Butler Yeats on the commercial audio CD Poetry Speaks: Hear Great Poets Read Their Work from Tennyson to Plath, Sourcebooks Inc., 2001), respectively. Samples of 100–150 s were taken, stereo channels averaged, and the signal down-sampled from the original 44.1 kHz to 8 kHz. Log-scaled spectrograms were generated using a custom Matlab routine (available upon request) with a bin size of 5 ms and 75 frequency bands ranging from 55 to 3951 Hz in steps of 1/12 octave. Each column of V held a strip of spectrogram, yielding a dimensionality of n = 75, and m = 5000 samples were used for training. Note that the training samples were distinct from those used for testing. Specifically, we used 10,000 samples to assess the representational sparseness achieved by the NMF basis (see Fig. 2) and 20,000 random combinations of three sources (drawn from the 10,000 samples in Fig. 2) to assess separation performance (see Fig. 4).

Each NMF run consisted of 500 iterations with 10 restarts from random initial conditions, with the restart that yielded the minimum total error chosen. The factorization rank was r = 15. Concatenating the five Q matrices for the five instruments yielded a dictionary of 75 basis elements, each of which was filtered by each of three different HRTFs, resulting in a feature matrix D with 225 columns. The source locations were randomly chosen but 90° apart from each other in the simulations (in Fig. 3, the three sources were located to the listener's left, center, and right, corresponding to the HRTFs for azimuth −90°, 0°, and 90°, respectively, with zero elevation). The analyses of the natural sounds and speech sounds were performed in a similar manner, with 5000 training samples for each data matrix V.
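The restart logic might look like the following sketch, which assumes the Statistics and Machine Learning Toolbox routine nnmf rather than the paper's custom implementation; V and the chosen variable names are as above.

    % Factorize V (75 x 5000) at rank r = 15, keeping the best of 10
    % random restarts as measured by Frobenius reconstruction error.
    r = 15;
    opts = statset('MaxIter', 500);
    bestErr = Inf;
    for restart = 1:10
        [Q0, H0] = nnmf(V, r, 'algorithm', 'mult', 'options', opts);
        err = norm(V - Q0 * H0, 'fro');
        if err < bestErr
            bestErr = err; Q = Q0; H = H0;
        end
    end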

Minimization.

Pseudoinverses (L2-norm minimization) were computed with the pinv routine in Matlab, which uses an algorithm based on singular value decomposition. The linear programming problem (Eq. 14) was solved using the Matlab Optimization Toolbox linprog routine. We did not impose a non-negativity constraint on the coefficients. As a result, the dense solution consists of negative coefficients as well as positive ones, whereas all of the substantially nonzero elements are positive for the sparse solution. Figure 6 shows the absolute values of the dense solution coefficients.
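For concreteness, the two solutions can be computed as follows (a sketch for the zero-noise constraint y = Dc; the split c = c⁺ − c⁻ is the standard device for casting L1 minimization as a linear program, and the variable names are illustrative):

    % Dense (minimum-L2) and sparse (minimum-L1) representations of y.
    c_dense = pinv(D) * y;              % least-squares / pseudoinverse
    [n, m] = size(D);
    f = ones(2 * m, 1);                 % sum(cp) + sum(cm) = ||c||_1
    Aeq = [D, -D];                      % enforce D * (cp - cm) = y
    z = linprog(f, [], [], Aeq, y, zeros(2 * m, 1), []);
    c_sparse = z(1:m) - z(m + 1:end);   % recover signed coefficients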

The linear programming problem given in Equation 14 can be sensitive to noise. We therefore solved an augmented version of it that included a noise model. In particular, we assumed that the total amount of noise was bounded. Thus, Equation 14 was reformulated as the following:

$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \|\mathbf{c}\|_1 \quad \text{subject to} \quad \|\mathbf{y} - \mathbf{D}\mathbf{c}\|_p \le \beta \qquad (1)$$

where β is proportional to the noise level and p = 1, 2, or ∞. Letting β → 0 is equivalent to assuming that the noise is very small, and the solution converges to the zero-noise solution, Equation 14.

The Gaussian noise case, p = 2, can be solved by semidefinite programming methods (Fletcher, 1985). Both p = 1 and p = ∞ can be solved using linear programming. All approaches yield qualitatively similar results.

The solutions presented here all used p = 1. For this case, noise vectors e⁺ and e⁻ are introduced and included in the optimization, allowing Equation 1 to be rewritten in standard form (writing c = c⁺ − c⁻ with c⁺, c⁻ ≥ 0):

$$\min_{\mathbf{c}^{\pm},\,\mathbf{e}^{\pm} \ge 0} \; \mathbf{1}^{\top}(\mathbf{c}^{+} + \mathbf{c}^{-}) \quad \text{subject to} \quad \mathbf{D}(\mathbf{c}^{+} - \mathbf{c}^{-}) + \mathbf{e}^{+} - \mathbf{e}^{-} = \mathbf{y}, \quad \mathbf{1}^{\top}(\mathbf{e}^{+} + \mathbf{e}^{-}) \le \beta$$
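A sketch of how this p = 1 problem maps onto the linprog routine mentioned above, extending the zero-noise sketch; variable names are again illustrative:

    % Noise-tolerant L1 minimization: minimize ||c||_1 subject to
    % D*(cp - cm) + ep - em = y and sum(ep + em) <= beta.
    [n, m] = size(D);
    f = [ones(2 * m, 1); zeros(2 * n, 1)];      % only ||c||_1 is penalized
    Aeq = [D, -D, eye(n), -eye(n)];
    Aineq = [zeros(1, 2 * m), ones(1, 2 * n)];  % total noise bound
    z = linprog(f, Aineq, beta, Aeq, y, zeros(2 * m + 2 * n, 1), []);
    c = z(1:m) - z(m + 1:2 * m);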

We typically examined four different noise levels (log₁₀(‖y‖₁/β) = 1, 2, 3, 4) and selected the one with the best separation performance on average as the result (see Figs. 4, 5).

Signal-to-noise ratios (SNRs) were calculated as the reciprocal of the average across sources of 〈(xi(t) − x̂i(t))²〉/〈xi(t)²〉, where xi(t) is the original spectrogram of the ith source, x̂i(t) is its estimate when recovered from the mixture, and the average 〈·〉 is over time.
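In code, the measure reads as follows (a sketch; X and Xhat are hypothetical nSources × T matrices holding the original and recovered spectrograms, with time along the second dimension):

    % SNR: reciprocal of the across-source average of the relative error.
    relErr = mean((X - Xhat).^2, 2) ./ mean(X.^2, 2);  % per-source, time-avg
    snr_dB = 10 * log10(1 / mean(relErr));             % reported in decibels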

To measure the sparseness of the representations achieved with the NMF basis elements (see Fig. 2) and its relationship to separation performance (see Fig. 4), we introduced a "sparseness index," defined as the number of nonzero elements in the presence of a single source divided by the dimension size. This index is unity for a dense representation and approaches zero as the representation becomes sparser. The noise level was log₁₀(‖y‖₁/β) = 1 in Figure 2B, resulting in reconstruction SNRs of 18.3 ± 3.8, 16.0 ± 3.0, and 18.0 ± 3.6 (median ± interquartile range in decibels) for the music, natural sound, and speech ensembles, respectively.
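A minimal sketch of the index, assuming c is the coefficient vector for a single test sample and n the stimulus dimensionality (75 here); the tolerance defining "nonzero" is our assumption, as the text does not specify one:

    % Sparseness index: fraction of substantially nonzero coefficients,
    % relative to the stimulus dimensionality n.
    tol = 1e-6 * max(abs(c));                 % hypothetical threshold
    sparsenessIndex = nnz(abs(c) > tol) / n;  % 1 = dense; -> 0 = sparser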

Estimation of linear encoders and decoders.

Given a set of stimuli y_k (for k = 1, 2, …) and the corresponding responses c_k generated using Equation 14, the optimal linear decoding filter D̂ was estimated by solving the following regression problem:

$$\hat{\mathbf{D}} = \arg\min_{\mathbf{D}} \sum_k \|\mathbf{y}_k - \mathbf{D}\mathbf{c}_k\|_2^2$$

where ‖·‖₂ denotes the L2 (Euclidean) norm. Similarly, the optimal linear encoder Ê was obtained by solving the following equation:

$$\hat{\mathbf{E}} = \arg\min_{\mathbf{E}} \sum_k \|\mathbf{c}_k - \mathbf{E}\mathbf{y}_k\|_2^2$$

Note that we used a fraction of the elements in c_k for the linear filter estimation and showed in Figures 7 and 8 the average results over 200 random samplings of neurons. Also note that the ith column of D̂ and the ith row of Ê correspond to the optimal linear decoder and encoder for the ith neuron, respectively.
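Both regressions are ordinary least-squares problems, e.g. (a sketch; Y is a hypothetical n × K matrix with one stimulus per column and C the m × K matrix of corresponding responses):

    % Matlab's mrdivide solves each regression in the least-squares sense.
    Dhat = Y / C;    % decoder: minimizes sum_k ||y_k - D * c_k||_2^2
    Ehat = C / Y;    % encoder: minimizes sum_k ||c_k - E * y_k||_2^2
    % Column i of Dhat is neuron i's optimal linear decoder;
    % row i of Ehat is its optimal linear encoder.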

In Figure 7, we used a 1168 × 3600 feature matrix D, each column of which held a feature spanning 16 time bins (96 ms), with a bin size of 6 ms and 73 frequency bands ranging between 55 and 3520 Hz in steps of 1/12 octave. As the original feature of a target neuron, we chose one obtained from the cello ensemble, and we therefore used cello sounds as input stimuli in the simulation.

Asymmetry of sparse representations.

To illustrate the asymmetry of linear encoding and decoding in the framework of our model, we ran simulations in 25 dimensions with 75 neurons. In the simulations, the threefold-overcomplete features (a 25 × 75 feature matrix D) were first generated randomly on the unit hypersphere. Neural activities for sample stimuli drawn from a Gaussian distribution were then determined by Equation 14.

For simulated single-unit data (see Fig. 8A), we computed the mutual information I(c, s) between the simulated neural responses c and the stimulus s using I(c, s) = H(c) − H(c|s) = H(c), where H(c) is the response entropy and H(c|s), the conditional entropy of the response given the stimulus, is zero because the relationship between stimuli and responses was deterministic. Thus the mutual information between the single neuron and the stimulus was simply equal to the response entropy, which we estimated by direct binning from the histogram of neural responses. We compared this information with either the mutual information between the optimal linear estimate of the response given the stimulus and the actual response (encoding), or that between the optimal linear estimate of the stimulus given the response and the actual stimulus (decoding). For these information estimates, we used the Gaussian approximation to bound the entropy of the reconstruction error (Bialek et al., 1991). We then normalized these linear information estimates by the full mutual information to obtain the "reconstruction quality":

$$\text{reconstruction quality} = \frac{I_{\text{linear}}}{I(c, s)}$$

For multiunit data (see Fig. 8B), the computation of the full mutual information (rather than the linear approximation) was computationally intractable. We therefore computed the following simpler measure of the reconstruction quality of the models:

$$\text{reconstruction quality} = 1 - \left\langle \frac{\|\text{error}\|_2}{\|\text{signal}\|_2} \right\rangle$$

where ‖·‖₂ denotes the L2 norm, and 〈·〉 denotes the mean over the data. Note that the measure is based on the relative length of the model errors and that it gives zero for pure noise and one for perfect reconstruction.
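A sketch of this measure, with S a hypothetical matrix of true stimuli (one column per sample) and Shat the corresponding model reconstructions:

    % Reconstruction quality: one minus the mean relative error length.
    relLen = sqrt(sum((S - Shat).^2, 1)) ./ sqrt(sum(S.^2, 1));
    quality = 1 - mean(relLen);    % 0 for pure noise, 1 for perfect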

Context dependence of spectrotemporal receptive fields.

In Figure 10, for demonstration purposes, we used a 1168 × 3600 feature matrix D (the same one as in Fig. 7) and used two different sets of 300 active features (i.e., 1168 × 300 packed matrices D_k; see Appendix A.3) to estimate the spectrotemporal receptive fields (STRFs) for the two different contexts. Note that some of the features were active in both contexts (including the one shown in Fig. 10A), whereas others were active in only one context.

Results

Our main goals are to explore a model of computation with sparse representations and to generate new experimentally testable predictions from this model. To make our model concrete, we consider a specific computation: a special case of the monaural cocktail party problem in which the HRTF provides the critical cue for disentangling sources. We focus on this special case not because it is of central importance from a psychophysical perspective (in a general setting, the HRTF is typically just one of many cues, and often not the most important) but rather because this problem provides a convenient way to illustrate the key predictions. The same sparse framework can be generalized to exploit other cues for source separation and to other sensory processing problems (e.g., vision) as well.

The presentation is organized as follows. First, we define the particular source separation problem we consider, in which there are several sources and a single ear. Next, we show how a sparse overcomplete representation, such as that seen at the cortical level in the auditory system, can be used to separate the sources. Finally, we identify experimentally testable predictions of the model.

Problem formulation

We have all experienced the basic cocktail party problem as a part of everyday life: we stand in a room full of people chatting, chairs scraping, fans humming, and so forth, and strain to understand the words of a single interlocutor. This familiar but challenging scenario is interesting precisely because it tests the limits of what we humans can achieve. The cocktail party is, however, just an extreme example of a more general problem that the auditory system constantly confronts. It is rare that we can listen to an acoustic source without interference from other sources, yet our auditory system filters the interfering sources out of our conscious perception so effectively that we are often almost unaware of them. The apparent effortlessness with which we solve the cocktail party problem is deceptive and is a testament to the effectiveness of our auditory system. Indeed, the problem of background noise represents one of the main factors limiting the widespread practical adoption of artificial speech recognition systems.

The auditory system uses a wide variety of psychophysical cues to segregate auditory streams (Bregman, 1990), including both binaural and monaural cues. Many monaural cues have been identified, such as common onset time or comodulation of stimulus power in different parts of the spectrum.

For simplicity, we focus here on just one set of cues: those provided by the differential filtering imposed on a source by its path from its origin in space to the cochlea. This filtering is caused both by the head and the detailed shape of the ear (the HRTF) and by the environment on sources at different positions in space (Yost et al., 1996). The HRTF is important for generating a three-dimensional experience of sound, so that acoustic sources that bypass the HRTF (e.g., those presented with headphones) are typically perceived unnaturally, as though arising inside the head (Wightman and Kistler, 1989; Kulkarni and Colburn, 1998). Whereas the importance of the HRTF in sound localization has been studied extensively (Knudsen and Konishi, 1979; Wightman and Kistler, 1989; Wenzel et al., 1993; Hofman and Opstal, 2002), its role in source separation as such has not. In contrast to much previous work, the HRTF is used here to separate auditory streams rather than to localize them in space.

It is often reasonable to assume that sound arriving from different locations should be treated as arising from distinct sources. For the purposes of the present study, all sounds from a given position are defined to belong to the same source, and any sounds from a different position are defined to belong to different sources. We emphasize that although sound localization (the process by which an animal determines where in space a source is located) is related to source separation (the process by which an animal extracts different auditory streams from a single waveform), the two computations are distinct; neither is necessary nor sufficient for the other. Here we focus on the separation problem and assume that source localization occurs by other mechanisms.

The particular source separation problem we consider is as follows. Suppose there are N acoustic sources located at known distinct positions in space, with x_i(t) being the time course of the stimulus sound pressure of the ith source at its point of origin. Associated with each position is a known filter given by h_i(t). In what follows we will refer to h_i(t) as the HRTF, but in general, h_i(t) will include not just the filtering of the head and external ear but also the filter function of the acoustic environment (reverberation, etc.).

The signal y(t) at the ear is the sum of the filtered signals:

$$y(t) = \sum_{i=1}^{N} h_i(t) * x_i(t) = \sum_{i=1}^{N} \tilde{x}_i(t)$$

where ∗ indicates convolution and x̃_i(t) = h_i(t) ∗ x_i(t) is the ith source in isolation after filtering. (We can say that x_i(t) is the ith source measured in source space, whereas x̃_i(t) is the same source measured in sensor space.) The organism's goal in source separation is to recover the underlying sources x_i(t) from the signal y(t), using knowledge of the directional filters h_i(t). For example, if x_alice(t) and x_bob(t) are speech streams generated by two speakers (sources), Alice and Bob, at a cocktail party, the goal is to disentangle these two streams using the only signal available, the sum y(t) = h_alice(t) ∗ x_alice(t) + h_bob(t) ∗ x_bob(t). Note that the actual spatial locations of the sources are not computed during the separation; we do not address the localization problem in this paper.
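This generative model is simple to simulate (a sketch; x is a hypothetical cell array of equal-length source waveforms and h a cell array of HRTF impulse responses for their positions):

    % Form the monaural mixture y(t) = sum_i h_i(t) * x_i(t).
    T = length(x{1});
    y = zeros(T, 1);
    for i = 1:numel(x)
        xf = conv(h{i}, x{i});    % source i in sensor space, x~_i(t)
        y = y + xf(1:T);          % truncate to the common length
    end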

The particular monaural version of this problem that we consider here is a special, more difficult, case of the binaural (or, in artificial systems, the multiple microphone) problem.

Neural representation for source separation

How might a neural system solve the source separation problem described above? We begin by assuming that each short segment (e.g., 5 ms) of each acoustic source x̃_i(t) (as it sounds at the cochlea) is represented in the activities c_ij of a population of neurons indexed by components j and source positions i:

$$\tilde{x}_i(t) = \sum_j c_{ij}\, d_{ij}(t)$$

where d_ij(t) are stimulus "features," i.e., elements of a (not necessarily orthogonal, and possibly overcomplete) linear basis. We will interpret the neural activities c_ij as the spike rate of the corresponding neurons during each segment. The signal y(t) is then given by the following:

$$y(t) = \sum_{ij} c_{ij}\, d_{ij}(t) \qquad (9)$$

We have introduced Equation 9 as an analytic model: given a stimulus y(t), find a set of features d_ij(t) and neural activities c_ij that represent that stimulus; if the features span the stimulus space, such a representation will always exist. Below we will focus on the case in which the feature set permits a sparse representation, i.e., where only a few of the neural activities c_ij are significantly nonzero. (Although "sparse" might colloquially refer to the case in which most of the activities c_ij are exactly zero, here we use a generalized notion of sparseness, common in the literature, which requires only that most activities be close to zero.)

Equation 9 assumes a linear relationship between an auditory stimulus y(t) and its neural representation in terms of features d_ij(t). The assumption of linearity is common in both visual and auditory physiology. For example, it is often assumed that a population of neurons in cortical area V1 represents a visual scene in terms of a collection of oriented edges; in this case, the scene and the features in Equation 9 would be rewritten as functions of spatial rather than temporal coordinates, but the formulation would be otherwise identical. Similarly, in auditory physiology, stimuli are sometimes represented as a weighted sum of basis elements such as moving ripples (Kowalski et al., 1996; Klein et al., 2000); in the context of Equation 9, this implies assuming a one-to-one correspondence between a basis element (derived from the ripple basis) d_ij(t) and the firing rate c_ij of a corresponding neuron.

To relate the neural representation of the signal y(t) in Equation 9 to the sources x_i(t), we further assume that each source can be expressed as a linear combination of (not necessarily orthogonal) basis elements q_j(t):

$$x_i(t) = \sum_j c_{ij}\, q_j(t) \qquad (10)$$

where the basis elements q_j(t) are related to the features d_ij(t) by convolution with each filter h_i(t):

$$d_{ij}(t) = h_i(t) * q_j(t) \qquad (11)$$

Combining these expressions, the signal y(t) received at the ear is related to the sum of the filtered sources by the following:

$$y(t) = \sum_i h_i(t) * x_i(t) = \sum_{ij} c_{ij}\, h_i(t) * q_j(t) = \sum_{ij} c_{ij}\, d_{ij}(t)$$

There are thus more features d_ij(t) in the neural representation than there are basis elements q_j(t). In particular, if there are known to be N sources, then there are N-fold more features d_ij(t) than basis elements q_j(t). As before, Equations 9–11 represent an analytic model: given a set of features d_ij(t) [or equivalently a set of basis elements q_j(t) and position-dependent filters h_i(t)] and an input y(t), find an appropriate set of neural activities c_ij.

The basis elements q_j(t) reflect statistical correlations within sources; each source typically consists of several such elements. These basis elements can be thought of as an internal model of the components of acoustic sources, in the same way that edges might be thought of as components of visual sources (objects). Because the neural representation involves prefiltering with the HRTF (Eq. 11), the coefficient c_ij associated with feature d_ij(t) is then better thought of as representing the hypothesis that an element q_j(t) is present at position i. In the same way, neurons in the primary visual cortex can be thought of as representing the hypothesis (d_ij) that an oriented edge (q_j) is present at a particular position (i) in the visual field. In other words, the elements q_j(t) reflect only the properties of the stimulus, whereas the features d_ij(t) arise from the interaction of these elements with the sense organs.

A population of neural activities satisfying Equations 9–11 has effectively solved the source separation problem, because a given source i can be reconstructed merely by summing over all neurons associated with position i. This formulation therefore recasts source separation into a new problem: finding the appropriate neural activities c_ij. Such a representation, if it could be found, is especially appealing because it permits the sources to be reconstructed (by Eq. 10) directly in terms of the stimulus elements q_j(t) as they sound at the source (i.e., before filtering); the reconstruction is therefore invariant to changes in stimulus position. In the next section, we will show that sparseness provides the key to specifying the appropriate representation.

For notational and computational convenience we discretize time and rewrite Equation 9 in matrix form (using boldface to indicate vectors and matrices):

$$\mathbf{y} = \mathbf{D}\mathbf{c}$$

where y is a column vector whose N_row elements correspond to the discrete-time sampled elements y(t_k), c is a column vector of length N_col representing the complete neural activity pattern c_ij, and D is an N_row × N_col matrix whose columns d_ij hold the features with elements d_ij(t_k).

Sparse neural representation of sources

Source separation thus requires finding the neural activities c such that the neural representation represents the sources x_i as closely as possible. We assume that the neural representation is overcomplete (Olshausen and Field, 1997; Riesenhuber and Poggio, 2000), i.e., that the number of neurons (features) is large (N_col > N_row). In this case, many different neural activity patterns c could represent the stimulus y equally well (Fig. 1A). However, the goal is not merely to represent the stimulus y, but to find a representation in which the underlying sources x_i are apparent and from which they can be readily recovered.

Figure 1.

Overcomplete representation in two dimensions. A, Three nonorthogonal feature vectors d_ij in N = 2 dimensions constitute an overcomplete representation, offering many possible ways to represent a data point y with no error. B, The conventional solution is given by the pseudoinverse, which yields a dense representation because it minimizes the squared sum of the neural activity, Σ_ij c_ij². This representation invokes all features about evenly. C, The sparse solution invokes at most N = 2 features because it minimizes Σ_ij |c_ij|. D, Comparison of neural activity for the two cases. For the dense representation, all three neurons participate about equally, whereas for the sparse representation, activity is concentrated in neuron 2.

Because the neuronal population does not have access to the sources themselves, but only to their sum y, not enough information is available to recover the sources uniquely. The source separation problem is thus ill posed. (In the same way, knowing that the sum of two scalars a and b is 12 is not sufficient to recover a and b, and any choice for a and b that satisfies a + b = 12 is a possible solution.) The problem can be made well posed by adding additional constraints (regularizers) on the responses, as is often done in computational vision (Poggio et al., 1985). Here we consider a sparseness regularizer on the neural representation (Olshausen and Field, 1996, 1997; Bell and Sejnowski, 1997; Chen et al., 1998; Lee et al., 1999; Lewicki and Sejnowski, 2000; Vinje and Gallant, 2000; Simoncelli and Olshausen, 2001; Zibulevsky and Pearlmutter, 2001; Hahnloser et al., 2002; Olshausen and O'Connor, 2002). In neural terms, this sparseness assumption corresponds to representing the acoustic stimulus y in terms of the minimum number of spikes (Fig. 1C), a biologically appealing constraint that leads to an energy-efficient representation (Levy and Baxter, 1996; Laughlin and Sejnowski, 2003). Thus we assume that the neural representation c satisfies (see Appendix A.1):

$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \|\mathbf{c}\|_1 \quad \text{subject to} \quad \mathbf{y} = \mathbf{D}\mathbf{c} \qquad (14)$$

Equation 14 specifies a linear programming problem with a single global optimum. Formally, the solution minimizes the L1 norm ‖c‖₁ = Σ_ij |c_ij| of the solution vector. In practice, the problem we consider allows for reconstruction noise (see Eq. 1).

Dense neural representation of sources

An alternative regularizer is that implicit in the pseudoinverse (Strang, 1988), corresponding to the usual least-squares solution (see Appendix A.1):

$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \|\mathbf{c}\|_2^2 \quad \text{subject to} \quad \mathbf{y} = \mathbf{D}\mathbf{c}, \qquad \text{i.e.,} \quad \hat{\mathbf{c}} = \mathbf{D}^{+}\mathbf{y}$$

The pseudoinverse finds the solution c that minimizes the L2 norm, i.e., the squared neural activity Σ_ij c_ij² (Fig. 1B). However, it is not obvious why it would be useful for the brain to minimize this quantity, which has units of spikes squared, rather than some other quantity (such as spikes; see below). Moreover, we show below that it fails in practice to separate the sources successfully.
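The two-dimensional example of Figure 1 can be reproduced in a few lines (a sketch; the feature angles are illustrative, and the stimulus is chosen to coincide with feature 2):

    % Three unit-length features in 2-D; the sparse solution should
    % concentrate all activity in neuron 2, the dense solution should not.
    ang = [10 45 80];
    D2 = [cosd(ang); sind(ang)];            % 2 x 3 overcomplete basis
    y2 = [cosd(45); sind(45)];              % stimulus at 45 degrees
    c_dense = pinv(D2) * y2;                % all three roughly active
    z = linprog(ones(6, 1), [], [], [D2, -D2], y2, zeros(6, 1), []);
    c_sparse = z(1:3) - z(4:6);             % concentrated in neuron 2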

Separation of harmonic sources

Successful source separation based on Equation 14 requires that two conditions be satisfied. First, the sources must be sparsely representable, as is the case with natural auditory stimuli (Attias and Schreiner, 1997; Lewicki, 2002; Klein et al., 2003; Smith and Lewicki, 2006). Second, the sources must have spectral correlations matched to the HRTF. We found that the model was able to separate acoustic sources consisting of mixtures of music, natural sounds, and speech.

Finding a feature set

We used NMF to generate a set of basis features from spectrograms obtained from samples of solo instrumental music, natural sounds, and speech (Fig. 2). NMF is an algorithm for factorizing a data matrix (a matrix whose columns contain the snippets of solos) under non-negativity constraints (Lee and Seung, 1999). In contrast to some other decomposition approaches, such as principal component analysis, NMF often yields representations in which the elements are fairly local, and which can be interpreted as “parts.”

Figure 2.

NMF can be used to find the parts of sound ensembles. A, NMF basis elements for three sound classes (music, natural sounds, and speech), aligned in columns by peak frequency. Note that power is concentrated in the fundamental frequency, but higher harmonics are clearly visible. Also note that each column, which reflects statistical correlations present in the sources, is an example of q_j(t) defined in Equation 10; it is the filtered versions d_ij(t) that form the neural representation in Equation 11. B, The ability of the NMF bases in A to represent sounds in a sparse model is quantified in terms of the "sparseness index," defined as the number of nonzero elements in the presence of a single source divided by the dimension size. This index is unity for a dense representation and approaches zero as the representation becomes sparser. The distribution of the sparseness index was 0.61 ± 0.27, 0.64 ± 0.17, and 0.49 ± 0.13 (median ± interquartile range) for music, natural sounds, and speech, respectively, over 10,000 test samples; see Materials and Methods for details.

When applied to music, NMF typically yielded elements suggestive of musical notes, each with a strong fundamental frequency and weaker harmonics at higher frequencies. In many cases, listeners could easily use timbre to identify the instrument from which a particular element was derived. When applied to sounds from other ensembles (natural sounds and speech), NMF yielded elements that had rich harmonic structure, but it was not in general easy to “interpret” the elements (e.g., as vowels). Nonetheless, these elements captured aspects of the statistical structure of the underlying ensemble of sounds and led to sparse representations of the ensembles (Fig. 2B).

The choice of NMF in this context was merely a matter of convenience; we could have used any basis that captured the spectral correlations in the sources and permitted a sparse representation. Finding good overcomplete dictionaries from samples of a stimulus ensemble is a subject of ongoing research (Kreutz-Delgado et al., 2003). We do not imagine that NMF is the "algorithm" by which features are established in real neural circuits; such features must surely arise through a complex interaction of genetic and environmental cues. We need not, therefore, expect to find a precise correspondence between the features obtained by NMF and those observed in the auditory cortex. In this respect, our results complement previous work on finding the features underlying auditory or visual scenes (Bell and Sejnowski, 1997; Olshausen and Field, 1996, 1997; Schwartz and Simoncelli, 2001; Lewicki, 2002); the emphasis here is not on the elements themselves but rather on how they work together to form a representation that separates sources.

Separation

To test the ability of the model to separate sources, we generated digital mixtures of three sources positioned at three distinct positions in space (Fig. 3). In the left column are the spectrograms of the sources at their origin. Two of the sources (a harp playing the note “D”; center and bottom) were chosen to be identical; this example is thus particularly challenging, because the only cue for separating the sources is the filtering imposed by the HRTF.

Figure 3.

Separation of three musical sources. Three musical instruments at three distinct spatial locations were filtered (by h1, …, h3, respectively) and summed to produce the input y, and then separated using a sparse overcomplete representation to produce the output. Note that two of the sources (a harp playing the note "D," center and bottom) were chosen to be identical; this example is thus particularly challenging, because the only cue for separating the sources is the filtering imposed by the HRTF. Nevertheless, separation was good (compare left and right columns).

Separation was nevertheless quite successful (Fig. 3, compare left and right columns). These results were typical: whenever the underlying assumptions about the sparseness of the stimulus were satisfied, sources consisting of mixtures of music, natural sounds, or speech were all separated well (Fig. 4). Separation worked particularly well for mixtures of sparsely representable sources (i.e., smaller sparseness index values), whereas it did not work for sources that were not sparsely represented (i.e., larger sparseness index values). Figure 5 shows that separation without differential prefiltering by the HRTF was unsuccessful, as was separation using the Gaussian prior instead of the sparseness prior (dense representation).

Figure 4.

Performance of different separation approaches with three sources. The separation performance (SNR across sources) is shown as a function of the sum of the sparseness indices of the three sources (average over 20,000 sample sets). Note that the sparse prior (black) always outperforms the dense prior (gray) and that excellent separation was achieved especially when the sources were sparsely representable. Also note that the model does not depend strongly on choosing the basis carefully, as demonstrated by the good performance of the "combined" example, in which a concatenated basis was taken from all the ensembles.

Figure 5.

Separation performance for different source locations. Using a typical example of three novel stimuli (a trumpet and two identical harps), separation performance (y-axis) was examined for all possible spatial combinations of the three sources (from 0° to 120° apart; x-axis). The average performance is shown under either the sparse (black) or the dense (gray) prior. Note that separation was unsuccessful at angle zero, where differential filtering cannot be exploited, whereas performance improves as the sources are spaced farther apart.

The neural representations underlying separation provide insight into these results. Figure 6A shows the representations of each of the three sources (the same as in Fig. 3) presented in isolation. In each panel, the activity in a population of 225 neurons (corresponding to the 225 features d_ij = h_i ∗ q_j) is indicated by the intensity of points on a 15 × 15 grid. Because the sources occupy three positions i, there are three copies of the basis q_j in each panel (corresponding to the three filters h_i). The activity patterns are sparse; only a relatively small number of units are active in each representation. Note that because the middle and right sources (sources 2 and 3, respectively) in this example were chosen to be identical, the middle and right neural representations differ only by a shift.

Figure 6.

Neural representations underlying source separation. Each panel shows the activity of a population of 225 neurons, corresponding to the 225 features d_ij = h_i ∗ q_j. The intensity of each dot in the 15 × 15 grid is proportional to the log of the firing rate of each neuron. Because the sources occupy three positions i, there are three copies of the basis q_j in each panel (corresponding to the three filters h_i). The copies are arranged from left to right for convenience and separated by vertical lines. However, the arrangement is for purposes of illustration only; we do not mean to imply any spatial organization of sources within the cortex. The sources are the same as in the previous figure. A, Sparse representations of the three sources (corresponding to the original spectrograms in Fig. 3) presented in isolation. Only a relatively small number of units are active in each panel. B, Sparse representation of the mixed sources (input spectrogram in Fig. 3). Note that activity is approximately the sum of the activities of the isolated sources in A. C, Dense representation of the mixed sources. Note that most units are active.

The procedure for recovering a source from such a representation is straightforward: the estimate of the left source (source 1) is simply the summed activity of the left third of the neurons (those representing features prefiltered by the HRTF corresponding to the leftmost position in space), and likewise for the middle and right thirds. The HRTF can thus be seen as a kind of "tag" for grouping together elements from a single source. This suggests dividing source separation into two conceptually distinct steps (although in practice the steps occur simultaneously). In the first step, the stimuli are decomposed into the appropriate features. In the second step, the features are tagged and bundled together with other features from the same source. It is for this bundling step that the HRTF, along with prior knowledge of the source locations, is essential.

The failure of the dense representation to separate sources (Fig. 4) results from a failure of the first step. Instead of decomposing the sources into a small number of features, the dense representation (Fig. 6C) assumes that each instrument contributed about equally to the received signal and thus finds a representation in which a large fraction of neurons are active. That is, instead of "explaining" the sources in terms of two harps and a trumpet, the dense representation also finds some clarinet, some cello, and so on, at all positions. This is intrinsic to the dense solution, because it finds the "minimum power" solution in which neural activity is spread among the population (Fig. 1B).

The failure of even the sparse approach when the spectral cues induced by the HRTF are absent (Fig. 5, leftmost point showing 0° separation) results from a failure at the second step. That is, the sparse approach finds a useful decomposition at the first step even without the HRTF, but in the absence of HRTF cues the active features are not tagged, and thus the features cannot be assigned appropriately to distinct sources. Other psychophysical cues relevant for source separation, such as common onset time, might provide alternative or additional tags in this same framework. A more general formulation of source separation might allow tagging on longer time scales, so that a feature active at one moment might be more (or less) likely to be active the next, reflecting the fact that sources tend to persist, but we do not pursue that approach further here.

Experimental predictions

Our model of sparse representations makes at least three experimentally testable predictions.

Optimal feature estimation requires multineuron recording

In this model, the firing rate of a given neuron {ij} is maximized when there is a perfect match between the stimulus and the feature of that neuron, i.e., when y = d_ij. Because the feature d_ij is used in the linear reconstruction of the stimulus from the neural activities (Eq. 9), one might imagine that the optimal stimulus (i.e., the stimulus that maximizes the firing rate) can be obtained by estimating the optimal linear decoder of the target neuron considered alone. Experiments based on this idea have shown that the optimal linear decoder can sometimes drive neurons in the auditory cortex to fire vigorously (deCharms et al., 1998).

Surprisingly, this model predicts that the linear estimate of the decoder obtained in this way is not the optimal stimulus, although the optimal decoder is linear. Instead, finding the optimal stimulus requires recording from all of the neurons involved in the representation. This follows from the fact that we have assumed that the features are not orthogonal (see also Appendix A.2). Note that in this model, optimal decoding (Eq. 9) need not take neural correlations into account, even when they are present.

This first prediction is illustrated by a simulation (Fig. 7). The y-axis shows the firing rate of a target neuron (normalized to its maximum firing rate) in response to the presentation of the stimulus that matches the optimal linear decoder constructed by recording the activity of the target neuron and a variable number of other neurons. When the optimal linear decoder is estimated from only the target neuron, the firing rate is submaximal. As the number of neurons used to estimate the optimal linear decoder is increased (x-axis), the response of the target neuron converges to unity, indicating that the optimal decoder has converged to the feature of the target neuron.

Figure 7.

Prediction 1: stimulus optimization requires multineuron recording. The y-axis shows the simulated firing rate of a target neuron (normalized to its maximum firing rate) in response to the presentation of the optimal linear decoder constructed by recording the activity of the target neuron and a variable number of other neurons. When the optimal linear decoder is estimated from only the target neuron, the firing rate is submaximal. As the number of neurons used in this simulation to estimate the optimal linear decoder is increased (x-axis), the response of the target neuron converges to unity, indicating that the optimal decoder has converged to the feature of the target neuron.

Figure 7 represents a novel and testable prediction of the model: jointly estimating the optimal linear decoder from a population of neurons should yield a stimulus that is closer to optimal. Moreover, it also leads to a novel experimental approach for finding the optimal stimulus. Note that although in principle the activity of all neurons involved in the representation must be recorded, in practice the activity of even a few can be useful. With modern techniques (e.g., tetrodes) for isolating the activity of several nearby neurons, this approach might be practical.

Linear decoding and nonlinear encoding

A second testable prediction of the model is that there should be an asymmetry between encoding and decoding: the optimal encoding function is nonlinear, but the optimal decoding function is linear. Here “decoding” refers to the process of “reading out” a neural representation (e.g., by forming an estimate or reconstruction of the stimulus), whereas “encoding” refers to the process by which the nervous system constructs a pattern of neural activities from a stimulus. Surprisingly, however, this asymmetry emerges only for populations of neurons; the optimal linear encoder and decoder of an isolated neuron perform about equally, and both underperform the optimal nonlinear decoder (Fig. 8).

Figure 8.

Prediction 2: linear decoding outperforms linear encoding for multineuron but not single-neuron experiments. A, The mutual information between the response of a simulated neuron and the stimulus (Total) was compared with the mutual information between the stimulus and the optimal linear estimate of the stimulus obtained from the activity of a single neuron (Dec) and between the actual response and an optimal linear estimate of the response obtained from the stimulus (Enc). The two linear estimates were comparable, and both captured only a fraction of the total information, indicating that encoding and decoding are comparable for single neurons. B, Decoding outperforms encoding in a simulated multineuron experiment. The reconstruction quality is plotted as a function of the number of neurons used to estimate the optimal linear decoder (dark curve) or the optimal linear encoder (light curve). The reconstruction quality is a normalized measure of the accuracy of reconstruction, defined as 1 − 〈‖error‖₂/‖signal‖₂〉; see Materials and Methods for details. Encoding and decoding perform comparably when only a few neurons are recorded, but as the number of neurons recorded increases, the reconstruction quality of decoding grows faster. When the activity of all neurons involved in the representation is recorded (75 in this simulation), decoding is perfect.

The fact that optimal decoding of a neuronal population is linear (i.e., that the optimal linear decoder of the neuronal population response provides perfect reconstruction of the stimulus under the model, so that no nonlinear model can do better) is a direct consequence of our fundamental assumption (Eq. 9) that the neural representation is a linear combination of features. The linearity of neural decoding does not imply that the neural encoding function (the inverse transformation from the stimulus to the response) need be linear, and in general it is not.

Sparseness induces a nonlinear encoding function; more precisely, it induces a piecewise linear encoding function (Fig. 9). Sparseness implies that at most N_row out of the possible N_col features d_ij are active in the representation of a particular stimulus; the precise subset of active neurons changes for different stimuli. Piecewise linearity arises because the encoding function is linear for all stimuli that activate the same subset of features, but changes for different subsets (see also Appendix A.2). Note that not just any nonlinear function can be implemented. For example, any saturating nonlinearities must be introduced by a preprocessor, because doubling the stimulus y necessarily doubles the neural representation c, i.e., y = Dc implies 2y = D(2c).
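The piecewise-linear tuning of Figure 9B can be visualized directly (a sketch reusing the two-dimensional setup above; each neuron's activation is linear in y within a region of stimulus space, and the slopes of the resulting curves change only where the active subset of neurons changes):

    % Sweep unit-length stimuli over angle and record sparse coefficients.
    ang = [10 45 80];
    D2 = [cosd(ang); sind(ang)];
    thetas = 0:2:90;
    C = zeros(3, numel(thetas));
    for k = 1:numel(thetas)
        yk = [cosd(thetas(k)); sind(thetas(k))];
        z = linprog(ones(6, 1), [], [], [D2, -D2], yk, zeros(6, 1), []);
        C(:, k) = z(1:3) - z(4:6);
    end
    plot(thetas, C');    % at most two neurons active at any angle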

Figure 9.

Encoding is nonlinear (piecewise linear). A, Three features in two dimensions constitute an overcomplete basis. A sample signal y is indicated with an asterisk. B, Tuning curves for the three features are piecewise linear. The firing rate of each of the three units in A is given as a function of angle for stimuli of unit length; the point y in A is at ∼45°. Because the sample space is two-dimensional, any given point is represented by at most two active neurons. Decoding is linear: the point y is recovered by a weighted sum of the features, with the corresponding neural activities constituting the weights. Encoding, however, is nonlinear: the slope of the activation functions of all active neurons can change at the boundaries, whenever any neuron becomes active or inactive. The basic intuition shown here generalizes to the other examples in this paper, in which the dimensionality of the space (given by the number of elements in the spectrogram) is much higher.

The prediction that there is an asymmetry between the linearity of the decoding function and the nonlinearity of the encoding function can be tested experimentally (Fig. 8). Given an ensemble of stimulus–response pairs (i.e., the neural responses to an ensemble of sounds) obtained from a population of neurons, the model predicts that a linear stimulus reconstruction approach (i.e., a decoding model) will outperform a linear “forward” (i.e., encoding) model, but only if the optimal linear reconstructors are estimated from a population of neurons.

The idea that a linear approximation is better suited for the neural decoding than encoding function was first exploited to estimate the information rate of fly visual neurons (Bialek et al., 1991). In contrast, our model predicts that, if the neural representation is sparse and overcomplete, then the asymmetry should emerge only in multineuron recordings. To our knowledge, this asymmetry has not been tested for high-level auditory representations. Our model thus makes a strong prediction: linear decoding does not provide an advantage over linear encoding for single neuron experiments, whereas the former outperforms the latter for multineuron experiments.

Context dependence of STRFs

A third prediction that follows from the piecewise linearity of the encoding function is that the linear component of receptive fields should depend on the acoustic context. Following conventional usage in auditory physiology, we will use the term STRF to refer only to the linear component of the encoding function, although the encoding function itself may be highly nonlinear (Kowalski et al., 1996; Theunissen et al., 2000, 2001). (In visual physiology, STRF refers to the "spatial temporal receptive field," but the quantities are analogous.) The STRF is the analog (in a high-dimensional input space) of the slope of the tuning curve of a neuron in one dimension.

In an experimental setting, piecewise linearity predicts that the STRF should depend on the acoustic context. We define the acoustic context of a feature d_ij with respect to a stimulus y as the collection of other features activated simultaneously by that stimulus. In music, for example, the features tend to resemble musical notes, and the acoustic context can be thought of as the set of notes (e.g., in a chord) that accompany a given note. Figure 10 shows the STRF of the same neuron (a trumpet feature) in two different contexts (either clarinet or flute). The gross features of the STRF (e.g., the excitatory band around 880 Hz) are preserved in both contexts, but the secondary features (e.g., the addition of an inhibitory sideband) are context sensitive. Changes in the STRF for different features and different contexts can be larger or smaller than in this example. Stimulus context thus changes the neural encoding function, suggestive of the nonclassical receptive field modulation observed in the visual and auditory cortices (David et al., 2004; Valentine and Eggermont, 2004).

Figure 10.

Prediction 3: dependence of the STRF on context. A, Spectrogram of the trumpet feature, showing a strong fundamental at ∼880 Hz and some higher harmonics. B, C, The STRFs corresponding to the feature in A when that feature is activated in two different contexts (clarinet or flute played simultaneously), derived under the assumption of a sparse neural representation. The STRF provides the encoding from the stimulus to neural activity. The color at any point of the STRF indicates the value (in spikes per second) of the kernel, which is convolved with the spectrogram of the stimulus to generate a neural response. Under the sparse assumption, the encoding is piecewise linear, and the STRFs shown are two of the many possible pieces. Each STRF is obtained from the appropriate row of the matrix D_k (see Appendix A.3). D, The difference between the two STRFs. Note that they share the same basic harmonic structure but differ in details such as the relative contributions of the excitatory and inhibitory sidebands. The differences can be as large as the STRFs themselves.

Context dependence as defined here is stronger than simple nonlinearity. Specifically, the prediction is that there should exist extended subregions of stimulus space in which the encoding function of a given target neuron is one linear function and across some boundary in stimulus space switch to a second linear function. These boundaries are demarcated by the activation of another (nontarget) neuron in the population and the deactivation of a second (nontarget) neuron (Fig. 9). This prediction could be tested using a multineuron recording technique.

The locally linear encoding induced by sparseness may help reconcile some of the apparent contradictions in the auditory literature. STRFs obtained using a "moving ripple" basis can predict responses to linear combinations of basis elements (Kowalski et al., 1996). However, linear encoding (STRF) models fail to predict neural responses when the stimulus domain is extended to include a wide selection of complex sounds (Linden et al., 2003; Machens et al., 2004), consistent with the idea that ripples represent a subspace within which encoding is linear. Context sensitivity may also provide an explanation for a proposed neural correlate of comodulation masking release, in which the addition of a pure tone can suppress the response to temporally modulated noise (Nelken et al., 1999); this form of contextual modulation cannot be explained by any purely linear encoding model.

Discussion

Our main result is that the appropriate “sparse” neural representation implicitly separates a mixture of sound sources into its constituent auditory streams. In this model, sources at different positions in space were separated with only monaural information by exploiting the differential filtering imposed by the HRTF, under the assumption that the source locations have already been identified by other mechanisms. This model provides a possible explanation for an important question about cortical organization: Why are there so many more neurons in the auditory (or visual) cortex than in the cochlea (or retina)? The answer we provide, motivated by the ability of an overcomplete sparse representation to separate sources, is potentially quite general and may be applicable to other brain regions as well.

This model was motivated foremost by the computational demands of source separation. Source separation is a complex computation, and we could no more expect to solve it in its entirety here than we could expect to solve its visual analog (scene segmentation) or any of the many other challenging problems in computational vision. We have instead concentrated on a restricted form of the problem involving only the spatial cues introduced by the HRTF, with the expectation that the same framework can be generalized to understand how other cues might be used.

Sparseness provides a powerful and useful constraint on neural activities. Our results complement and extend previous work on finding features that permit an efficient representation of auditory or visual scenes (Olshausen and Field, 1996, 1997; Bell and Sejnowski, 1997; Schwartz and Simoncelli, 2001; Lewicki, 2002; Klein et al., 2003; Smith and Lewicki, 2006). In our framework, the HRTF "tags" these features so that they can be assigned to the appropriate source. Other psychophysical cues important for acoustic stream segregation, such as common onset time, could be used in a similar way.

Efficient coding and sparseness

Our approach is compatible with the "efficient coding hypothesis" (Barlow, 1961), according to which the goal of sensory processing is to construct an efficient representation of the sensory environment. Building on this hypothesis, we used NMF to derive a set of basis elements with which auditory stimuli could be represented sparsely. Although we did not test bases obtained using other approaches, we do not expect that our results would be sensitive to the particular method used to find the basis; any sparse basis would likely have worked.
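
For concreteness, the following sketch derives a nonnegative basis from a bank of magnitude spectrograms with off-the-shelf NMF (scikit-learn). The data matrix S, its dimensions, and the number of components are placeholders, not the corpus or parameters used in this study.

    import numpy as np
    from sklearn.decomposition import NMF

    # S: nonnegative spectrogram data, shape (frequency bins, time frames).
    # Random data stands in here for a corpus of natural sounds.
    rng = np.random.default_rng(0)
    S = rng.random((128, 2000))

    # Factor S ~ W @ H with W, H >= 0: columns of W are candidate spectral
    # features; rows of H are their nonnegative activations over time.
    model = NMF(n_components=64, init="nndsvd", max_iter=500)
    W = model.fit_transform(S)    # basis elements, shape (128, 64)
    H = model.components_         # activations, shape (64, 2000)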

The principle of efficient, or sparse, coding has been used to predict receptive field properties of both auditory and visual neurons (Olshausen and Field, 1996; Bell and Sejnowski, 1997; Simoncelli and Olshausen, 2001; Lewicki, 2002; Smith and Lewicki, 2006). However, the focus of the present work is not on the receptive field properties themselves but rather on how the resulting sparse representation can subserve a computation. Our predictions are therefore not about the detailed structure of receptive fields but rather about how receptive fields interact.

The motivation for sparseness here is not coding [or metabolic (Levy and Baxter, 1996; Laughlin and Sejnowski, 2003)] efficiency per se, but rather performance on a particular computational problem: source separation. There are contexts in which coding efficiency imposes important constraints on representation; for example, the retina compresses visual information collected at 10⁸ photoreceptors into a signal that is carried by only ∼10⁶ fibers in the optic nerve. For source separation, however, sparseness provides a mathematical instantiation of Occam's razor: it allows a search for the most likely interpretation to be conducted by searching for the sparsest interpretation.
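
With nonnegative activities, the search for the sparsest exact interpretation reduces to a linear program: minimize the total activity subject to Dc = y, c ≥ 0. A minimal sketch in this spirit (the dictionary D, the planted sparse cause, and all sizes are stand-ins):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    D = rng.random((40, 200))      # overcomplete dictionary; columns are features
    c_true = np.zeros(200)
    c_true[rng.choice(200, size=5, replace=False)] = 1.0   # a few active causes
    y = D @ c_true                 # observed stimulus

    # Minimize sum(c) subject to D c = y and c >= 0: the sparsest
    # (minimum total activity) interpretation of y.
    res = linprog(c=np.ones(200), A_eq=D, b_eq=y, bounds=(0, None), method="highs")
    print("active neurons in solution:", int(np.sum(res.x > 1e-8)))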

Sparse encoding implies that most stimuli should elicit only modest firing in most neurons, as has been observed experimentally for both simple and complex auditory and visual stimuli (Vinje and Gallant, 2000; DeWeese et al., 2003; Machens et al., 2004), but it does not imply that responses must be weak for all stimuli. Sparseness implies merely that stimuli elicit only a small number of spikes across the neuronal population; indeed, a neuron encoding some particular feature d will fire maximally when the stimulus d is presented. Sparseness in this model is therefore a constraint on the activity of the population of neurons involved in a representation, rather than on the activity of any single neuron. Our results are thus fully consistent with experiments indicating that it is sometimes possible to optimize stimuli online to obtain high firing rates (deCharms et al., 1998; Barbour and Wang, 2003), because in our framework such a stimulus is the feature associated with the neuron.

Directly assessing the sparseness of a neuronal representation experimentally is difficult. The key issue is how many neurons (or spikes) participate in the representation of a typical stimulus. Ideally this would be measured by recording all spikes from all neurons simultaneously, but this is not possible using the experimental techniques currently available. Nevertheless, there is growing evidence that natural auditory (DeWeese et al., 2003; Machens et al., 2004) and visual (Baddeley et al., 1997; Vinje and Gallant, 2000) stimuli activate only a relatively small number of neurons (Olshausen and Field, 2004). Thus, the representational sparseness assumed by this model should be viewed as at least provisionally consistent with the current experimental evidence about cortical representations.

Overcomplete representations as a model of cortex

Although there is nothing in the model that explicitly ties it to one or another brain area, we think it most likely that, at least in mammals, the operations we describe occur in the cortex, rather than at subcortical stations. First, receptive fields in auditory cortex are heterogeneous and often have the broad and complex spectrotemporal structure (Sutter, 2000) required to exploit the HRTF.

Second, and more significantly, auditory cortex has the characteristics expected from an overcomplete representation. There are ∼30,000 auditory nerve fibers but more than 1000 times as many auditory cortex neurons. Assuming that the “representational fidelity” (the amount of information that can be represented by a single spike) of neurons in cortex is comparable with that at the periphery, the “representational capacity” of cortex is far in excess of what is needed to form merely a complete representation. This model suggests a way in which this excess representational bandwidth can be used for computation, instead of merely to overcome neuronal noise, as has sometimes been proposed.

Although we do not know how sparse cortical representations are achieved, it seems likely that the underlying circuitry involves lateral interactions. Indeed, "sparsification" is similar to divisive normalization approaches, which are motivated by both circuit and computational considerations (Schwartz and Simoncelli, 2001). Explicit circuit dynamics and connection weights can be obtained using gradient descent to minimize the total neural activity in Equation 14.
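
One standard way to realize such dynamics, offered here only as an illustrative stand-in for the circuit the model would require, is iterative shrinkage-thresholding: each unit integrates a reconstruction-error signal fed back through the dictionary, and a pointwise threshold silences weakly driven units.

    import numpy as np

    def ista(D, y, lam=0.1, n_iter=500):
        """Gradient descent on ||y - Dc||^2/2 with penalty lam*sum(c), c >= 0.

        The soft threshold after each gradient step plays the role of a
        sparsifying lateral/feedback interaction.
        """
        eta = 1.0 / np.linalg.norm(D, 2) ** 2      # step size from the spectral norm
        c = np.zeros(D.shape[1])
        for _ in range(n_iter):
            c = c - eta * (D.T @ (D @ c - y))      # feed back reconstruction error
            c = np.maximum(c - eta * lam, 0.0)     # nonnegative soft threshold
        return c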

Model predictions

We have identified three clear experimental predictions of the model. First, the optimal linear decoder for a target neuron, estimated from an experiment in which the activities of multiple neurons are recorded, should correspond to the stimulus that maximizes that neuron's firing rate. Second, there should be an asymmetry between the performance of the optimal linear encoder and decoder, but this asymmetry should become evident only in interpreting multineuron recording experiments; the model predicts that the optimal linear encoder and decoder in a single-neuron experiment both underperform the "true" optimal (i.e., nonlinear) decoder. Finally, the STRF should be dynamically influenced by acoustic context. These predictions can be used to test (and falsify) the model.

Perhaps the most surprising prediction of this model is that stimulus optimization for a target neuron can be improved by recording from other neurons involved in the representation. To our knowledge, online stimulus optimization using data from more than one neuron has not been previously proposed, but it could be practical using modern multineuron recording techniques.

We have assumed that the neural decoding function (the transformation from the neural response to the stimulus) is linear. However, we have shown that sparseness implies that the neural encoding function (the inverse transformation from the stimulus to the response) is in general nonlinear (piecewise linear). This asymmetry, which emerges only in multineuron recording experiments, is a strong and testable prediction of the model.

This asymmetry further implies the context dependence of the STRF. We speculate that this may explain why linear (STRF) encoding models work well in restricted domains (Kowalski et al., 1996) but fail for richer stimulus ensembles (Linden et al., 2003; Machens et al., 2004). However, context dependence has not yet been tested directly.

Relationship to independent component analysis

Our formulation of the source separation problem (Eq. 7) differs in two respects from the one usually considered in the independent component analysis (ICA) literature. First, we have assumed prefiltering of each source by a known filter h, whereas in the usual formulation, the weighting of each source is given by an unknown scalar. Second, in most formulations the receiver is assumed to have access to the sources via several sensors, each of which is exposed to a different (linear) combination of the sources, whereas here we assume only a single sensor with input y.

Most approaches to solving such multisensor formulations focus on recovering an "unmixing matrix" that inverts the mixing matrix governing the weighting of each source at each sensor. In such cases, it is generally sufficient to assume simply that the sources are statistically independent (Comon et al., 1991; Comon, 1994; Bell and Sejnowski, 1995; Belouchrani et al., 1997; Amari and Cichocki, 1998). If, however, there is only a single sensor, the problem is degenerate and such approaches fail: separating multiple sources from a single sensor requires assumptions stronger than simple independence (Cauwenberghs, 1999; Hochreiter and Mozer, 2001; Roweis, 2001; Jang and Lee, 2003; Smaragdis, 2004). The intermediate case (at least two sensors but more sources than sensors) simplifies the problem considerably (Rickard and Dietrich, 2000; Bofill and Zibulevsky, 2001; Linsker, 2001), because binaural cues as well as monaural ones can be used.
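
The contrast is easy to demonstrate with a toy example: given as many sensors as sources, a standard ICA routine recovers the sources from independence alone, whereas a single sensor admits no unmixing matrix at all. (The signals and mixing matrix below are arbitrary illustrations, not from this study.)

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * np.pi * 3 * t)              # source 1: sinusoid
    s2 = np.sign(np.sin(2 * np.pi * 5 * t))     # source 2: square wave
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.5],
                  [0.4, 1.0]])                  # mixing weights at two sensors
    X = S @ A.T                                 # two-sensor observations

    S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
    # With only one sensor (a single column of X), the mixing matrix is not
    # invertible and independence alone cannot separate the sources.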

Recent advances in ICA have emphasized the utility of sparse overcomplete representations for source separation problems in acoustic, visual, and other domains (Farid and Adelson, 1999; Lee et al., 1999; Lewicki and Sejnowski, 2000; Zibulevsky and Pearlmutter, 2001; Levin and Weiss, 2004; Li et al., 2004). Here we have built on these ideas and developed a novel approach to separating multiple "augmented" (prefiltered) signals combined at a single sensor.

Our framework can be generalized to exploit other cues used in single-sensor separation, such as common onset time. It can also readily be extended to make use of binaural information: each HRTF becomes a single-input, two-output filter, and the lengths of the column vectors corresponding to the post-HRTF dictionary elements and of the observation vector are doubled. Interaural time and level disparities can then be used to separate sources. Information from two (or more) sensors can thus be naturally incorporated.
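
A sketch of this binaural construction: each dictionary element is convolved with the left- and right-ear impulse responses for its source position, and the two filtered copies are stacked into a single column of doubled length (the filters and element below are random placeholders for measured HRTFs and learned features):

    import numpy as np

    def binaural_column(h_left, h_right, q):
        """Stack left- and right-ear filtered copies of feature q into one column."""
        return np.concatenate([np.convolve(h_left, q), np.convolve(h_right, q)])

    rng = np.random.default_rng(2)
    q = rng.standard_normal(64)        # a dictionary element (waveform)
    h_L = rng.standard_normal(16)      # left-ear impulse response (placeholder)
    h_R = rng.standard_normal(16)      # right-ear impulse response (placeholder)

    d = binaural_column(h_L, h_R, q)   # post-HRTF column, twice the monaural length
    # Interaural time and level disparities appear as relative shifts and
    # scalings between the two halves of d.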

HRTF and source separation

The model assumes that sources have a statistical structure consisting of spectral correlations that can be exploited by filtering by the HRTF. One novel contribution of this work is its specific proposal for how the HRTF can be used for source separation, a process related to but distinct from sound localization. Spectral cues are not strictly required for sound localization: binaural cues can support robust localization even in the absence of spectral cues. Conversely, source separation can proceed when spectral cues are weak, or indeed even when spatial cues are completely absent, as for example when picking out a violin from within a concerto played over a single speaker. This illustrates a general principle: no single cue is essential to source separation, and the auditory system will promiscuously exploit whatever cues are available.

Nevertheless, it is clear that HRTF cues, when present, help in source separation (Yost et al., 1996). We have shown how neural systems can exploit these cues using sparse representations. We speculate that sparseness may represent a general adaptation used by the nervous system to separate acoustic sources, and that similar principles may also be involved in source separation in other modalities, such as olfaction.

Appendix

A.1. Notes on regularizers

L0 minimization

Imposing sparseness by minimizing the L1 norm is not the only possible choice. One natural alternative is the L0 norm, which minimizes the total number of active neurons (the total number of nonzero activities cij) rather than the total number of spikes. Although this constraint also seems biologically sensible, it leads to a computationally intractable (NP-complete) combinatorial problem (Donoho and Elad, 2003); moreover, in many cases it leads to the same solution as the minimum-L1 solution (Li et al., 2004), particularly in the presence of a noise model. We therefore consider only the L1 solution here.
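
The combinatorial character of the L0 problem can be seen directly: an exact search must try every candidate support, of which there are (N choose k) at each size k, whereas the L1 problem is a single convex program. A brute-force toy sketch (feasible only for very small dictionaries):

    import numpy as np
    from itertools import combinations

    def l0_search(D, y, k_max=3, tol=1e-8):
        """Exhaustive minimum-L0 solve: try every support of size 1..k_max."""
        n = D.shape[1]
        for k in range(1, k_max + 1):
            for support in combinations(range(n), k):    # (n choose k) subsets
                Dk = D[:, list(support)]
                c, *_ = np.linalg.lstsq(Dk, y, rcond=None)
                if np.linalg.norm(Dk @ c - y) < tol:
                    return support, c                    # smallest support found
        return None                                      # no exact fit up to k_max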

Conditioning of dense representation

From Equation 15 it is clear that the uniqueness of the solution depends on the invertibility and condition number (ratio of largest to smallest singular values) of D, whose columns dij = hiqj depend in turn on both the filters hi and the source elements qj. In particular, if there is no filter hi, as in the usual formulation of source separation, the columns of D are identical, and the solution is therefore degenerate. The greater the difference between the filtered copies of the sources (the columns of D), the smaller the condition number of D and the more numerically stable the problem.
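
This can be checked numerically: with no differential filtering, the two sources contribute identical columns and the dictionary is singular, while distinct filters restore a finite condition number. (The filters and source elements below are random stand-ins for hi and qj.)

    import numpy as np

    rng = np.random.default_rng(3)
    Q = rng.standard_normal((32, 8))         # source elements q_j as columns

    # No per-source filters: both sources contribute the same columns.
    D_flat = np.hstack([Q, Q])
    print(np.linalg.cond(D_flat))            # enormous: the problem is degenerate

    # Distinct filters h_i for the two sources restore well-posedness.
    h1, h2 = rng.standard_normal(5), rng.standard_normal(5)

    def filt(h, Q):
        # Convolve each source element with filter h, truncated to 32 samples.
        return np.column_stack([np.convolve(h, Q[:, j])[:32] for j in range(Q.shape[1])])

    D_hrtf = np.hstack([filt(h1, Q), filt(h2, Q)])
    print(np.linalg.cond(D_hrtf))            # finite condition number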

Probabilistic interpretation of regularizers

Interpreted probabilistically, the regularizers on neural representations (Eqs. 14 and 15) correspond to maximum likelihood estimates using different a priori assumptions about the processes generating the stimuli, whose estimates are represented as the neural activities participating in a representation (Fig. 1D). The pseudoinverse assumes that the underlying causes represented by the activities cij were drawn from a Gaussian distribution, whereas the sparseness regularizer assumes instead a Laplacian distribution: p(cij) ∝ e^(−|cij|). Because a Laplacian distribution has more elements very close to (and very far from) zero than does a Gaussian with the same variance, it corresponds to a sparser description in terms of cij. Note that without a noise term, the maximum likelihood estimates using any prior yield perfect (zero reconstruction error) representations of the stimulus; the prior here is on the distribution of the underlying causes represented by the coefficients cij, rather than on the distribution of reconstruction errors (as for example in robust fitting methods). Only when a noise term is added (Eq. 1) do the neuronal activities cij cease to represent the stimulus perfectly.
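
In symbols, this is the usual maximum a posteriori calculation; with Gaussian observation noise of variance σ² (Eq. 1), the two priors yield, respectively, a quadratic (pseudoinverse-like) and an L1 (sparse) penalty (prior scale constants are absorbed into R):

    \hat{c} = \arg\max_{c}\, p(c \mid y)
            = \arg\min_{c}\, \left[ -\log p(y \mid c) - \log p(c) \right]
            = \arg\min_{c}\, \left[ \frac{1}{2\sigma^{2}} \lVert y - Dc \rVert_{2}^{2} + R(c) \right],

    \text{with } R(c) = \tfrac{1}{2}\textstyle\sum_{ij} c_{ij}^{2} \ \text{(Gaussian prior)}
    \quad \text{or} \quad R(c) = \textstyle\sum_{ij} \lvert c_{ij} \rvert \ \text{(Laplacian prior)}.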

A.2. Asymmetry of sparse representations

Why does an overcomplete sparse representation predict an asymmetry between encoding and decoding (see Results, Linear decoding and nonlinear encoding)? To understand this, let us begin by considering the more familiar case of a complete (but not overcomplete) orthonormal representation, such as a Fourier or ripple basis. In this case, there is perfect symmetry between the encoding and decoding transformations. (These transformations are referred to as "analysis" and "synthesis" in the wavelet literature.) That is, if the columns of D in the decoding equation y = Dc (Eq. 13) are orthonormal, the inverse (encoding) transformation is given by c = DTy, where we have used the fact that the inverse of an orthonormal matrix is its transpose, DT = D−1. In this case, the encoding and decoding filters (the rows and columns of D) are identical. This explains the familiar symmetry between the forward and inverse Fourier transforms, in which the rows and columns of D are sinusoids.

If the columns of a square matrix D are not orthonormal, its inverse (if it exists) is not equal to its transpose. However, the encoding transformation c = D−1y is still linear, because it can be expressed as a linear combination of the columns of D−1. Even in the overcomplete case (Eq. 15), linearity of encoding is preserved if the dense representation is used, because in that case the encoding filters are the rows of the pseudoinverse of D.
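
Both linear cases are easy to verify numerically; the sketch below uses a random orthonormal basis and a random overcomplete dictionary in place of the bases discussed here:

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.standard_normal(32)

    # Complete orthonormal case: the transpose is the exact inverse.
    D, _ = np.linalg.qr(rng.standard_normal((32, 32)))   # random orthonormal basis
    c = D.T @ y                                          # encoding by transpose
    assert np.allclose(D @ c, y)                         # decoding reconstructs y

    # Overcomplete case: the dense (minimum-norm) encoding is still linear.
    D_over = rng.standard_normal((32, 96))
    c_dense = np.linalg.pinv(D_over) @ y                 # pseudoinverse encoding
    assert np.allclose(D_over @ c_dense, y)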

In the sparse overcomplete case, asymmetry arises because the inverse transformation is in general nonlinear. Here, the representation is found by solving the optimization problem of Eq. 14. Note that this in turn explains why stimulus optimization requires multineuron recordings (see Results, Optimal feature estimation requires multineuron recording). Specifically, because the neural responses cannot be explained by a single (or a global) linear encoding function, the feature (or the stimulus that elicits the maximum response) cannot be estimated correctly by a linear regression using single-unit data.

A.3. Context dependence of STRFs

To understand the context dependence of STRFs (see Results, Context dependence of STRFs), we consider the encoding function associated with the sparse representation of a particular stimulus yk. Consider the "packed matrix" Dk, whose columns are the subset of features involved in the sparse representation of the stimulus yk, i.e., only the columns corresponding to the nonzero elements of ck. This matrix satisfies the following decoding relationship: yk = Dkc̃k, where c̃k consists of ck without its zero elements. However, because of the sparseness (L1) prior, the matrix Dk is at most full rank, being constructed from at most Nrow features. It is thus not overcomplete, and the encoding can therefore be specified by a matrix, using the pseudoinverse Dk+:

c̃k = Dk+yk. (Eq. 17)

Thus, the encoding function is continuous and piecewise linear, with the linear segments defined by the pseudoinverses Dk+, and the discontinuities in the first derivative occurring whenever the stimulus activates (or deactivates) a feature (i.e., an element of c becomes nonzero or zero) and thereby changes Dk.

The STRF for the ith neuron is then obtained from the ith row of the pseudoinverse of the packed matrix Dk. (Recall that, following convention, we use the term STRF to refer only to the linear component of the encoding function from stimulus to response.)
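
A minimal sketch of this construction (the function name is ours; c is assumed to be a sparse code for stimulus y, e.g., obtained from the linear program above):

    import numpy as np

    def packed_encoder(D, c, tol=1e-8):
        """Return active feature indices and the local linear encoder pinv(Dk).

        Row i of the returned matrix is the STRF of the neuron corresponding
        to active feature active[i], valid in the context set by this stimulus.
        """
        active = np.flatnonzero(np.abs(c) > tol)   # nonzero activities
        Dk = D[:, active]                          # packed matrix of active columns
        return active, np.linalg.pinv(Dk)          # c~ = pinv(Dk) @ y  (Eq. 17)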

Three comments on the STRF estimation are in order (see Materials and Methods, Context dependence of STRFs). First, we note that constructing the matrix Dk requires knowledge of the solution ck, so this does not actually constitute an algorithm for finding ck. Second, in the special case of the dense representation, both encoding and decoding are linear. In this case, the encoding function for any stimulus (Eq. 11) is simply given by the pseudoinverse of the full matrix D.

Finally, we note that even if two STRFs obtained from the responses of a single neuron to two different stimulus ensembles differ, and each accounts perfectly for the data to which it was fit, this does not rule out the possibility that there might exist some third STRF that perfectly fits the amalgamation of the two datasets. Consider two stimuli yk and yk′, which activate only features in ck and ck′, respectively. The pseudoinverse Dk+ corresponding to the first stimulus will be valid (Eq. 17) for any stimulus composed of any subset of features that are nonzero in ck, but it will not be valid for any features that were not active. Therefore, any stimulus consisting of sums of features in the union of the feature sets ck and ck′ will yield incorrect results if either of the corresponding packed pseudoinverse matrices is used. However, a new pseudoinverse matrix constructed from features in the union of the feature sets ck and ck′ could yield correct values for a superset of the stimulus ensembles for which either of the original two STRFs was valid.

Such supersets can only be constructed from a number of features that is limited by the rank of D. That is, there does not exist a single matrix Dk that can be used in Eq. 17 to generate the correct c for all stimuli. For the example in Figure 10, we confirmed that there does not exist an STRF that includes both as special cases.

Footnotes

  • This work was supported by the Higher Education Authority of Ireland (An tÚdarás Um Ard-Oideachas) and Science Foundation Ireland Grant 00/PI.1/C067 (B.A.P.), a Farish-Gerry Fellowship (H.A.), and the Sloan Foundation, the Mathers Foundation, the National Institutes of Health, the Packard Foundation, and the Redwood Neuroscience Institute (A.M.Z.).

  • Correspondence should be addressed to Anthony M. Zador, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724. Email: zador{at}cshl.edu

References

  1. Amari S, Cichocki A (1998) Adaptive blind signal processing: neural network approaches. Proc IEEE 86:2026–2048.
  2. Attias H, Schreiner C (1997) Temporal low-order statistics of natural sounds. In: Advances in neural information processing systems 9 (Mozer MC, Jordan MI, Petsche T, eds), pp 27–33. Cambridge, MA: MIT.
  3. Baddeley R, Abbott LF, Booth MC, Sengpiel F, Freeman T, Wakeman EA, Rolls ET (1997) Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc R Soc Lond B Biol Sci 264:1775–1783.
  4. Barbour DL, Wang X (2003) Auditory cortical responses elicited in awake primates by random spectrum stimuli. J Neurosci 23:7194–7206.
  5. Barlow HB (1961) Possible principles underlying the transformations of sensory messages. In: Sensory communication (Rosenblith WA, ed), pp 217–234. Cambridge, MA: MIT.
  6. Bell AJ, Sejnowski TJ (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Comput 7:1129–1159.
  7. Bell AJ, Sejnowski TJ (1997) The "independent components" of natural scenes are edge filters. Vision Res 37:3327–3338.
  8. Belouchrani A, Abed Meraim K, Cardoso J-F, Moulines É (1997) A blind source separation technique based on second order statistics. IEEE Trans Signal Proc 45:434–444.
  9. Bialek W, Rieke F, de Ruyter van Stevenick RR, Warland D (1991) Reading a neural code. Science 252:1854–1857.
  10. Bofill P, Zibulevsky M (2001) Underdetermined blind source separation using sparse representations. Signal Proc 81:2353–2362.
  11. Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. Cambridge, MA: MIT.
  12. Cauwenberghs G (1999) Monaural separation of independent acoustical components. In: Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, Vol 5, pp 62–65. Orlando, FL: IEEE.
  13. Chen SS, Donoho DL, Saunders MA (1998) Atomic decomposition by basis pursuit. SIAM J Sci Comput 20:33–61.
  14. Comon P (1994) Independent component analysis: a new concept. Signal Proc 36:287–314.
  15. Comon P, Jutten C, Hérault J (1991) Blind separation of sources, part II: problem statement. Signal Proc 24:11–20.
  16. David SV, Vinje WE, Gallant JL (2004) Natural stimulus statistics alter the receptive field structure of V1 neurons. J Neurosci 24:6991–7006.
  17. deCharms RC, Blake DT, Merzenich MM (1998) Optimizing sound features for cortical neurons. Science 280:1439–1443.
  18. DeWeese MR, Wehr M, Zador AM (2003) Binary spiking in auditory cortex. J Neurosci 23:7940–7949.
  19. DeWeese MR, Hromadka T, Zador AM (2005) Reliability and representational bandwidth in the auditory cortex. Neuron 48:479–488.
  20. Donoho DL, Elad M (2003) Maximal sparsity representation via l1 minimization. Proc Natl Acad Sci USA 100:2197–2202.
  21. Farid H, Adelson EH (1999) Separating reflections from images by use of independent components analysis. J Opt Soc Am 16:2136–2145.
  22. Fishman YI, Arezzo JC, Steinschneider M (2004) Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am 116:656–670.
  23. Fletcher R (1985) Semidefinite matrix constraints in optimization. SIAM J Control Opt 23:493–513.
  24. Hahnloser RH, Kozhevnikov AA, Fee MS (2002) An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419:65–70.
  25. Hochreiter S, Mozer MC (2001) Monaural separation and classification of mixed signals: a support-vector regression perspective. Paper presented at the 3rd International Conference on Independent Component Analysis and Blind Signal Separation, San Diego.
  26. Hofman PM, Opstal AJV (2002) Bayesian reconstruction of sound localization cues from responses to random spectra. Biol Cybern 86:305–316.
  27. Jang G-J, Lee T-W (2003) A maximum likelihood approach to single-channel source separation. J Mach Learn Res 4:1365–1392.
  28. Klein D, Konig P, Kording KP (2003) Sparse spectrotemporal coding of sounds. J Appl Signal Proc 7:659–667.
  29. Klein DJ, Depireux DA, Simon JZ, Shamma SA (2000) Robust spectrotemporal reverse correlation for the auditory system: optimizing stimulus design. J Comput Neurosci 9:85–111.
  30. Knudsen EI, Konishi M (1979) Mechanisms of sound localization in the barn owl. J Comp Physiol 133:13–21.
  31. Kowalski N, Depireux DA, Shamma SA (1996) Analysis of dynamic spectra in ferret primary auditory cortex. II. Prediction of unit responses to arbitrary dynamic spectra. J Neurophysiol 76:3524–3534.
  32. Kreutz-Delgado K, Murray JF, Rao BD, Engan K, Lee T-W, Sejnowski TJ (2003) Dictionary learning algorithms for sparse representation. Neural Comput 15:349–396.
  33. Kulkarni A, Colburn HS (1998) Role of spectral detail in sound-source localization. Nature 396:747–749.
  34. Laughlin SB, Sejnowski TJ (2003) Communication in neuronal networks. Science 301:1870–1874.
  35. Lee DD, Seung HS (1999) Learning the parts of objects with non-negative matrix factorization. Nature 401:788–791.
  36. Lee T-W, Lewicki MS, Girolami M, Sejnowski TJ (1999) Blind source separation of more sources than mixtures using overcomplete representations. IEEE Sign Proc Lett 4:87–90.
  37. Levin A, Weiss Y (2004) User assisted separation of reflections from a single image using a sparsity prior. In: Computer vision–ECCV 2004: 8th European Conference on Computer Vision, Proceedings, Part 1 (Pajdla T, Matas J, eds), pp 602–613. Berlin: Springer.
  38. Levy WB, Baxter RA (1996) Energy efficient neural codes. Neural Comput 8:531–543.
  39. Lewicki MS (2002) Efficient coding of natural sounds. Nat Neurosci 5:356–363.
  40. Lewicki MS, Sejnowski TJ (2000) Learning overcomplete representations. Neural Comput 12:337–365.
  41. Li Y, Cichocki A, Amari S (2004) Analysis of sparse representation and blind source separation. Neural Comput 16:1193–1234.
  42. Linden JF, Liu RC, Sahani M, Schreiner CE, Merzenich MM (2003) Spectrotemporal structure of receptive fields in areas AI and AAF of mouse auditory cortex. J Neurophysiol 90:2660–2675.
  43. Linsker R (2001) Separation of a mixture of acoustic sources into its components. U.S. Patent 6,317,703 (International Business Machines Corporation, assignee).
  44. Machens CK, Wehr MS, Zador AM (2004) Linearity of cortical receptive fields measured with natural sounds. J Neurosci 24:1089–1100.
  45. Micheyl C, Tian B, Carlyon RP, Rauschecker JP (2005) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148.
  46. Nelken I, Rotman Y, Yosef OB (1999) Responses of auditory-cortex neurons to structural features of natural sounds. Nature 397:154–157.
  47. Nishino T, Nakai Y, Takeda K, Itakura F (2001) Estimating head related transfer function using multiple regression analysis (in Japanese). IEICE Trans A J84-A:260–268.
  48. Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607–609.
  49. Olshausen BA, Field DJ (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res 37:3311–3325.
  50. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14:481–487.
  51. Olshausen BA, O'Connor KN (2002) A new window on sound. Nat Neurosci 5:292–293.
  52. Poggio T, Torre V, Koch C (1985) Computational vision and regularization theory. Nature 317:314–319.
  53. Rickard ST, Dietrich F (2000) DOA estimation of many W-disjoint orthogonal sources from two mixtures using DUET. In: Proceedings of the 10th IEEE Workshop on Statistical Signal and Array Processing (SSAP2000) (Jemison B, ed), pp 311–314. Pocono Manor, PA: IEEE.
  54. Riesenhuber M, Poggio T (2000) Models of object recognition. Nat Neurosci 3[Suppl]:1199–1204.
  55. Roweis ST (2001) One microphone source separation. In: Advances in neural information processing systems 13 (Leen TK, Dietterich TG, Tresp V, eds), pp 793–799. Cambridge, MA: MIT.
  56. Schwartz O, Simoncelli EP (2001) Natural signal statistics and sensory gain control. Nat Neurosci 4:819–825.
  57. Simoncelli EP, Olshausen BA (2001) Natural image statistics and neural representation. Annu Rev Neurosci 24:1193–1216.
  58. Smaragdis P (2004) Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In: Fifth International Conference on Independent Component Analysis, LNCS 3195 (Puntonet CG, Prieto A, eds), pp 494–499. Berlin: Springer.
  59. Smith EC, Lewicki MS (2006) Efficient auditory coding. Nature 439:978–982.
  60. Strang G (1988) Linear algebra and its applications, Ed 3. Belmont, CA: Thomson Brooks/Cole.
  61. Sutter ML (2000) Shapes and level tolerances of frequency tuning curves in primary auditory cortex: quantitative measures and population codes. J Neurophysiol 84:1012–1025.
  62. Theunissen FE, Sen K, Doupe AJ (2000) Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J Neurosci 20:2315–2331.
  63. Theunissen FE, David SV, Singh NC, Hsu A, Vinje WE, Gallant JL (2001) Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network 12:289–316.
  64. Valentine PA, Eggermont JJ (2004) Stimulus dependence of spectro-temporal receptive fields in cat primary auditory cortex. Hear Res 196:119–133.
  65. Vinje WE, Gallant JL (2000) Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287:1273–1276.
  66. Wenzel EM, Arruda M, Kistler DJ, Wightman FL (1993) Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 94:111–123.
  67. Wightman FL, Kistler DJ (1989) Headphone simulation of free-field listening. II. Psychophysical validation. J Acoust Soc Am 85:868–878.
  68. Yost WA, Dye RH Jr, Sheft S (1996) A simulated "cocktail party" with up to three sound sources. Percept Psychophys 58:1026–1036.
  69. Zibulevsky M, Pearlmutter BA (2001) Blind source separation by sparse decomposition in a signal dictionary. Neural Comput 13:863–882.