Spectrograms and how to get a good result

This is something I spent a lot of time working on for Oxford Wave Research’s SpectrumView app.

I saw a question on Stack Overflow today covering exactly the problems I spent a good while working out over time, so I thought I'd share my response on my blog for posterity. It now follows:

Well, it all depends on the frequency range you're after. An FFT works by taking 2^n samples and providing you with 2^(n-1) real and imaginary numbers. I have to admit I'm quite hazy on what exactly these values represent (I've got a friend who has promised to go through it all with me in return for a loan I made him when he had financial issues ;)) other than an angle around a circle. Effectively each real/imaginary pair describes the amplitude and phase of a sine and cosine for that frequency bin, from which the original 2^n samples can be perfectly reconstructed.

Anyway, this has the huge advantage that you can calculate the magnitude of each bin by taking the Euclidean distance of the real and imaginary parts (sqrtf( (real * real) + (imag * imag) )). This provides you with an unnormalised magnitude value for each frequency band.
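As a rough sketch, that magnitude calculation could look something like this (assuming the FFT output has already been packed into a std::complex< float > buffer; the function name is just for illustration):

    #include <cmath>
    #include <complex>
    #include <vector>

    // Convert complex FFT output into per-bin magnitudes using the
    // euclidean distance of the real and imaginary parts.
    std::vector< float > BinMagnitudes( const std::vector< std::complex< float > >& bins )
    {
        std::vector< float > mags( bins.size() );
        for ( size_t i = 0; i < bins.size(); ++i )
        {
            const float real = bins[ i ].real();
            const float imag = bins[ i ].imag();
            mags[ i ] = sqrtf( (real * real) + (imag * imag) );
        }
        return mags;
    }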

So let's take an order 10 FFT (2^10). You input 1024 samples. You FFT those samples and you get 512 real and imaginary values back (the particular ordering of those values depends on the FFT algorithm you use). Those 512 bins cover the range from 0Hz up to the Nyquist frequency, so for a 44.1kHz audio file each bin represents 22050/512 Hz, or ~43Hz per bin.
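As a quick sanity check on those numbers, the bin width and the centre frequency of any bin fall straight out of the sample rate and FFT size (a small sketch; the names are just illustrative):

    // Width of each frequency bin in Hz, e.g. 44100.0f / 1024 = ~43Hz
    // for an order 10 FFT at 44.1kHz.
    float BinWidth( float sampleRate, unsigned int fftSize )
    {
        return sampleRate / (float)fftSize;
    }

    // Centre frequency of bin k.
    float BinFrequency( unsigned int k, float sampleRate, unsigned int fftSize )
    {
        return (float)k * BinWidth( sampleRate, fftSize );
    }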

One thing that should stand out from this is that if you use more samples (from what's called the time domain, or the spatial domain when dealing with multi-dimensional signals such as images) you get better frequency resolution (in what's called the frequency domain), but at the cost of time resolution. You sacrifice one for the other. This is just the way things go and you will have to live with it.

Basically you will need to tune the frequency bins and time/spatial resolution to get the data you require.

First, a bit of nomenclature: the 1024 time-domain samples I referred to earlier are called your window. Generally when performing this sort of process you will want to slide the window on by some amount to get the next 1024 samples to FFT. The obvious thing to do would be to take samples 0->1023, then 1024->2047, and so forth. This unfortunately doesn't give the best results. Ideally you want to overlap the windows to some degree so that you get a smoother frequency change over time. Most commonly people slide the window on by half a window size, i.e. your first window will be 0->1023, the second 512->1535, and so on and so forth.
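A minimal sketch of that sliding window, with the per-window processing left as a callback (the names here are mine, not from SpectrumView):

    #include <functional>
    #include <vector>

    // Slide a window of 'windowSize' samples over the input, advancing by
    // 'hopSize' samples each time. hopSize = windowSize / 2 gives the 50%
    // overlap described above (0->1023, then 512->1535, and so on).
    void SlideWindows( const std::vector< float >& samples,
                       unsigned int windowSize, unsigned int hopSize,
                       const std::function< void ( const float*, unsigned int ) >& processWindow )
    {
        for ( size_t start = 0; start + windowSize <= samples.size(); start += hopSize )
        {
            processWindow( &samples[ start ], windowSize );
        }
    }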

Now this brings up one further problem. While this information provides for perfect inverse-FFT signal reconstruction, it leaves you with the problem that frequencies leak into surrounding bins to some extent (this is known as spectral leakage). To solve this issue some mathematicians (far more intelligent than me) came up with the concept of a window function. A window function provides far better frequency isolation in the frequency domain, though it leads to a loss of information in the time domain (i.e. it's impossible to perfectly reconstruct the signal after you have used a window function, AFAIK).

Now there are various types of window function, ranging from the rectangular window (which effectively does nothing to the signal) to various functions that provide far better frequency isolation (though some may also suppress surrounding frequencies that may be of interest to you!!). There is, alas, no one-size-fits-all, but I'm a big fan (for spectrograms) of the Blackman-Harris window function. I think it gives the best looking results!
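For reference, a small sketch of generating the 4-term Blackman-Harris coefficients (these are the commonly quoted coefficients; you multiply each time-domain sample by the matching coefficient before the FFT):

    #include <cmath>
    #include <vector>

    // 4-term Blackman-Harris window. Multiply each sample in the window by
    // the matching coefficient before performing the FFT.
    std::vector< float > BlackmanHarrisWindow( unsigned int size )
    {
        const float kPi = 3.14159265358979f;
        const float a0 = 0.35875f, a1 = 0.48829f, a2 = 0.14128f, a3 = 0.01168f;

        std::vector< float > window( size );
        for ( unsigned int n = 0; n < size; ++n )
        {
            const float x = (2.0f * kPi * n) / (float)(size - 1);
            window[ n ] = a0 - a1 * cosf( x ) + a2 * cosf( 2.0f * x ) - a3 * cosf( 3.0f * x );
        }
        return window;
    }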

However, as I mentioned earlier, the FFT provides you with an unnormalised spectrum. To normalise the spectrum (after the Euclidean distance calculation) you need to divide all the values by a normalisation factor (I go into more detail here).

This normalisation will provide you with a value between 0 and 1, so you could easily multiply this value by 100 to get your 0 to 100 scale.
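As a rough sketch of that step (using half the sum of the window coefficients as one common choice of normalisation factor, which works out to N/2 for a rectangular window; the linked post covers the details of the factor itself):

    #include <numeric>
    #include <vector>

    // Normalise raw magnitudes into roughly the 0..1 range. The factor used
    // here (half the sum of the window coefficients) is one common choice,
    // not necessarily the exact factor described in the linked post.
    void NormaliseMagnitudes( std::vector< float >& mags, const std::vector< float >& window )
    {
        const float norm = std::accumulate( window.begin(), window.end(), 0.0f ) * 0.5f;
        for ( float& m : mags )
        {
            m /= norm;
        }
    }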

This, however, is not where it ends. The spectrum you get from this is rather unsatisfying. This is because you are looking at the magnitude using a linear scale. Unfortunately the human ear hears using a logarithmic scale. This rather causes issues with how a spectrogram/spectrum looks.

To get round this you need to convert these 0 to 1 values (I'll call such a value 'x') to the decibel scale. The standard transformation is 20.0f * log10f( x ). This will provide you with a value whereby 1 converts to 0 and 0 converts to -infinity. Your magnitudes are now on the appropriate logarithmic scale. However, it's not always that helpful on its own.

At this point you need to look at the original sample bit depth. With 16-bit sampling you get values between -32768 and 32767, which means your dynamic range is fabsf( 20.0f * log10f( 1.0f / 65536.0f ) ), or ~96.33dB. So now we have this value.

Take the values we've got from the dB calculation above and add this 96.33 value to them; the maximum amplitude (0) is now 96.33. Now divide by that same value and you have values ranging from -infinity to 1.0f. Clamp the lower end to 0 and you have a range from 0 to 1; multiply that by 100 and you have your final 0 to 100 range.
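Putting the whole 0 to 100 mapping together as a small sketch:

    #include <algorithm>
    #include <cmath>

    // Map a normalised magnitude (0..1) onto a 0..100 scale via the dB
    // domain, assuming 16-bit source material (~96.33dB of dynamic range).
    float MagnitudeToPercent( float x )
    {
        const float kDynamicRange = fabsf( 20.0f * log10f( 1.0f / 65536.0f ) ); // ~96.33dB

        float dB = 20.0f * log10f( x );                 // 1 -> 0dB, 0 -> -infinity
        dB = ( dB + kDynamicRange ) / kDynamicRange;    // 0dB -> 1.0, -96.33dB -> 0.0
        dB = std::max( dB, 0.0f );                      // clamp anything quieter to 0
        return dB * 100.0f;                             // final 0 to 100 value
    }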

And that is much more of a monster post than I had originally intended, but it should give you a good grounding in how to generate a good spectrum/spectrogram for an input signal.

and breathe

Further reading

As an aside, I found Kiss FFT far easier to use; my code to perform a forward FFT is as follows:

    #include <complex>
    #include "kiss_fftr.h"

    // Set up a real-input forward FFT of size 2^fftOrder.
    CFFT::CFFT( unsigned int fftOrder ) :
        BaseFFT( fftOrder )
    {
        mFFTSetupFwd = kiss_fftr_alloc( 1 << fftOrder, 0, NULL, NULL );
    }

    // Forward FFT: 'pIn' is 2^fftOrder real samples, 'pOut' receives the
    // complex frequency bins. kiss_fft_cpx and std::complex< float > share
    // the same layout, hence the cast.
    bool CFFT::ForwardFFT( std::complex< float >* pOut, const float* pIn, unsigned int num )
    {
        kiss_fftr( mFFTSetupFwd, pIn, (kiss_fft_cpx*)pOut );
        return true;
    }
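One thing worth noting with kiss_fftr is that it writes (nfft/2) + 1 complex bins (the extra one being the Nyquist bin), so pOut needs to be sized accordingly, and the config returned by kiss_fftr_alloc should be released with kiss_fftr_free when you're done with it.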

8 Responses to “Spectrograms and how to get a good result”

  1. James says:

    Excellent blog on FFT for spectral displays; your app is brilliant. Did you use Kiss FFT in the iPhone app? If so, was it better than the Apple iOS FFT in terms of speed as well?

  2. Goz says:

    Hi James,

    I didn’t actually use KissFFT on SpectrumView. I have used it when I was playing with porting the app to Android, and it was incredibly quick on an HTC Desire.

    I should at some point compare the performance of the two, but I’d be very surprised if KissFFT was faster because the vDSP routines are well optimised. AFAIK they are hand-optimised assembler routines, which would be next to impossible to beat with straight C code (unless vDSP is using a much poorer algorithm).

    That said, I found the HTC Desire significantly more capable on the rendering than the iPhone (which seems counter-intuitive). Generating the bitmap seemed way faster.

    And thanks for the comments on the app! Always nice to see people giving good feedback 🙂

    Oscar

  3. James says:

    Thanks Oscar,

    I have been using C# WPF to create a similar spectral display using a rendered bitmap, then overlaying a new bitmap with an offset to get the scrolling feature. Is this similar to how you would implement it in iOS, or are there some better graphics features that can be used?

    I’m up to speed with understanding the FFT parts but I struggle a little with graphics.

  4. Goz says:

    In iOS I implement it in a not hugely efficient way.

    Basically I shift each row to the left by 1 pixel and then fill in the last column. It would be significantly more efficient to draw a new column 1 pixel to the right each time (wrapping round when you hit the edge of your image) and render it with something like OpenGL’s wrap mode.

    I’m not aware of a way to do this on iPhone however 🙁

  5. Liu says:

    Can you describe how to select a normalisation factor?

  6. Goz says:

    Hi Liu, sorry about the slow response; I never got the notification 🙁 I explain how to calculate the normalisation factor in my Stack Overflow post linked in the blog 🙂

  7. Bart says:

    Good post. It’s a nice collection of knowledge you’d otherwise have to look for in many different places.
    One thing though. When you described dB normalisation to 0:1 range:
    “Take the values we’ve got from the dB calculation above. *ADD* this -96.33 value to it. Obviously the maximum amplitude (0) is now 96.33.”
    You said ADD but I believe you meant SUBTRACT since we want the result to be positive and adding a negative dynamic range factor to already negative dB scale would produce the opposite.

  8. Goz says:

    Yes you’re quite right. I’ll fix the post 🙂 Thanks!
