[SciPy-user] Frequency content of a transient signal

Tue Jul 22 11:55:44 EDT 2008

2008/7/22 Nils Wagner <nwagner at iam.uni-stuttgart.de>:

> I made some progress (See test_stft.py).
> But, how do I choose proper values for the parameter
> NFFT,noverlap ?

Suppose you have a time series with 4096 points and duration 1 s. If
you take a real FFT, you will get a complex array of length about
2048. In each bin of this array is a complex number representing the
amplitude and phase of a sinusoid at a particular frequency. The
frequencies are evenly spaced from zero (a constant, necessarily real)
up to (4096/2)/(1 s), that is, 2048 Hz. The spacing is linear, so bin
37 is 37 Hz (in this example). If you don't care about the phase and
just want to find the power, you can take the (squared) amplitude of
the complex number in each bin.

This is useful, but it interprets *everything* as a sinusoid. If you
have an array containing ten seconds of noise, one second of a tone,
and ten more seconds of noise, doing the above will produce a spectrum
in which sinusoids are very carefully chosen to cancel everywhere
except during that one second. Correct, and occasionally useful. but
hard to interpret. Instead, what you often want is something like the
human ear: changes on a short timescale are interpreted in the Fourier
domain, as tones, but changes on a longer scale remain in the time
domain (tones turning on and off). This is a slightly messy sort of
problem; be aware that you are introducing a timescale of transition,
and things get very confusing near that timescale. (For the human ear,
this frequency is somewhere between a few Hertz and maybe thirty
Hertz.) Also, wavelet transforms are designed to deal with this sort
of problem in a systematic way, but their results are more complicated
to interpret still.

So let's use a Fourier transform to compute a "dynamical power
spectrum". For illustrative purposes, let's suppose we have a data
stream at 48000 samples per second and we want to roughly mimic the
human ear. So we want a lowest frequency of about 10 Hz; let's take
chunks of 4096 samples, then. Now a first approach would simply chop
the time sttream into 4096-sample chunks, FFT each one, and take the
(squared) amplitude. This gives you a frequency spectrum with a
spectral resolution of (48 kHz/2)/4096, roughly five Hertz, and a time
resolution of a tenth of a second. If you want more time resolution
take shorter FFTs; if you want more frequency resolution take longer
FFTs. You can't have both (even changing the sample rate just changes
your highest frequency, not the resolution). This is a fundamental
mathematical limitation, the same one that produces the uncertainty
principle in quantum mechanics. But subject to that limitation, you've
got a dynamic power spectrum.

What about that overlap business? Well, suppose you have a tone that
turns on for a tenth of a second. Ideally, you'd see it in one FFT but
not in any of the others, and it'd be nice and strong. But what if its
start time falls in the middle of a chunk? Then you'll see it half as
strong in two adjacent chunks - much less visible. So what people
often do is make successive chunks overlap. Let's say you do that by a
factor of two. Then your tone, instead of falling in the middle of a
chunk, appears half as strong in one chunk, at full power in the next,
and half as strong in the third. Of course misalignments are still
possible, but the worst possible misalignment now puts three-quarters
of the tone in each block. The less power you want to risk losing, the
more overlap you should use (and, of course, the more time and space
it takes). Also, you should note that you won't get any more time
resolution this way - it's like (in fact it is) interpolating the
power spectrum in the time dimension. You get more measurements, but
the narrowest possible peak is still about a tenth of a second wide.

A similar phenomenon occurs in the frequency domain (often called
"scalloping"). If you take such an FFT, you get the coefficients
needed to express your signal in terms of sinusoids at frequencies (48
kHz/2)/4096 * i. What happens if the true frequency of your tone falls
halfway between two of these frequencies? Well, it appears as two
sinusoids, one on either side at about half the power (actually there
are more further out, but they are even weaker). If you're looking to
detect tones, this means that you're more sensitive to some
frequencies than others. You can address this with a similar
"interpolation" procedure. In this case you'd take your 4096 samples
and copy them into an array of size, say, 8192; you'd fill the rest of
the array with the mean value of your samples (or zero, but this
slightly reduces the junk at the low frequencies). Now when you do
your FFT, you get twice as many coefficients; the new ones interpolate
the old ones, reducing the "scalloping". As before, the more
interpolation you do, the less you lose in sensitivity, but the more
it costs in space and time.

How should you choose the FFT size and the overlap? Well, to pick the
FFT size, decide on the time resolution you want to have, keeping in
mind that the reciprocal of this is (roguhly) the frequency resolution
you will get. To pick the overlap, you need to think about space and
time costs, but overlapping by a factor of two is a good habit, or
four if it's not too slow. Whether you want to do interpolation in the
frequency domain is up to you, but it's fairly cheap. I'd go with a
factor of two.

Good luck,
Anne