How robots listen: Classifying Audio

If it quacks like a duck, it probably is a duck

This is the second blog in a series of deep learning blogs. The previous one, “Robots writing stories: a closer look at language generation”, can be found here.

10 minute read
Level: slightly technical

Practically every piece of music with somewhat controversial content gets a “radio edit” released alongside it. These versions are edited for broadcast on radio and television, with everything that is considered objectionable removed. Now, think of all the combined hours that went into making them. Or think of transcriptionists who listen to hours upon hours of recordings. Many of those hours could be saved if the transcriptionist were supported by an algorithm that does part of the transcribing for them.
Audio classification can be used in situations like these. Noise removal, curse-word detection and automatic classification of music or speech are just some examples of what audio classification can do. With more and more successful applications surfacing every day, this sub-field of deep learning has a bright future ahead. In this slightly technical blog, we’ll explain the basic approaches to audio classification.

How do we hear?

Human hearing

Before we look at how we can classify audio, it is a good idea to discuss what sound actually is.

Sound is composed of changes in air pressure, which cause a vibration at a certain frequency, measured in Hertz (Hz). A pure tone consists of a perfect wave, but most sounds combine numerous frequencies into a more complex wave.

In our ears, some processing happens to make sense of these waves. Sounds are funneled by our outer ears to the eardrum, which vibrates to replicate the sound waves. In the middle ear, a number of small bones transmit these vibrations to the cochlea in the inner ear. The cochlea is shaped like a spiral and contains special hair cells that generate electrical signals. Due to the cochlea’s shape, different hair cells respond to different frequencies. Typically, these cells react to frequencies between 20 Hz and 20 kHz, although this range tends to shrink with age. Finally, the auditory system in our brain constructs a sound out of the electrical signals sent by the hair cells in the cochlea.

Now that we understand how humans interpret sounds, we can take a look at the different AI methods that we can use to mimic this human process.

Raw audio

Sound wave of speech

One evident approach is to take the entire sound wave as input and learn to create a prediction from it. In this way, we attempt to mimic the functioning of the entire ear and the auditory system.

Before we can apply math to sound waves, we need to turn them from continuous waves into a series of samples. A sound wave is therefore sampled at a fixed interval, giving us a number for the air pressure at each point in time. Sampling is commonly done at 44100 Hz, roughly double the highest frequency humans can hear, which is the minimum rate needed to capture that full range. This gives us 44100 input values for each second of audio to classify.
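As a concrete illustration, here is a minimal sketch of loading one second of audio at 44.1 kHz with the librosa library; the filename is just a placeholder.

```python
# A minimal sketch of sampling, assuming librosa is installed and that
# "example.wav" (a placeholder filename) exists and is at least 1 s long.
import librosa

# Load one second of audio, resampled to the common 44.1 kHz rate.
samples, sample_rate = librosa.load("example.wav", sr=44100, duration=1.0)

print(sample_rate)    # 44100
print(samples.shape)  # (44100,) -> 44100 air-pressure values per second
```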

In order to process a sound wave, we can use a convolutional neural network (CNN), a deep learning method. This type of network combines inputs layer by layer based on how close together they are. After the first layer, the combined inputs, called features, describe small audio patterns, such as a pitch going up or down. After multiple layers, the features describe more complex auditory structures, such as instruments or parts of speech. Classifications can then be made based on these complex features.
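To make this more tangible, below is a rough sketch of such a 1D convolutional network in PyTorch. The layer sizes and the number of classes are illustrative assumptions, not a prescription.

```python
# A rough sketch of a 1D CNN over raw audio, using PyTorch.
# All layer sizes and the number of classes are assumptions for illustration.
import torch
import torch.nn as nn

class RawAudioCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers combine neighbouring samples into small audio features.
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
            # Deeper layers describe more complex structures (instruments, phonemes).
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                      # x: (batch, 1, 44100)
        h = self.features(x).squeeze(-1)       # (batch, 64)
        return self.classifier(h)              # class scores

wave_net = RawAudioCNN()
scores = wave_net(torch.randn(8, 1, 44100))    # one second of audio per example
print(scores.shape)                            # torch.Size([8, 10])
```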

While this method is quite intuitive and can perform very well in some cases, the input size is an issue. Since 44100 input values are created per second, the network often receives hundreds of thousands of inputs, and filtering those inputs to find the correct information is not easy, even when using deep learning.

Audio as an image

A different way to process and classify audio is to use techniques from the Computer Vision domain. Since there is a lot of research interest in processing images and video, this domain has been developing very rapidly. And just like audio methods, vision methods have to retrieve information from a large number of inputs.

To apply Computer Vision techniques, we need to transform the audio into an image. This is done by converting the sound wave into a spectrogram.

By applying a Fourier transform to short, successive slices of the sound wave, we can extract the frequencies each slice is composed of, along with their volume. A spectrogram thus shows time on the x-axis, frequency on the y-axis, and volume in dB as the value of each point.

Humans notice smaller differences at lower frequencies. For example, the difference between 100 Hz and 110 Hz is clearly audible, while the difference between 1000 Hz and 1010 Hz is much harder to hear. Therefore, we use a special logarithmic scale that groups frequencies based on how they sound to humans. This scale is called the Mel scale, and the resulting image is a Mel spectrogram.
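As an illustration, here is a minimal sketch of computing a Mel spectrogram with librosa; the filename and parameter values are again just placeholders.

```python
# A minimal sketch of turning a sound wave into a Mel spectrogram with librosa.
# "example.wav" and n_mels=128 are placeholder choices.
import librosa
import numpy as np

samples, sr = librosa.load("example.wav", sr=44100)

# Short-time Fourier transforms over small windows give the frequency content
# per moment; the Mel filter bank groups frequencies the way humans hear them,
# and power_to_db converts the volume to decibels.
mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (128 frequency bands, number of time frames)
```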

Spectrogram of a violin

Now that we have converted the sound into an image, we can apply models from the visual domain to classify the sounds. By creating an image, we can also greatly reduce the number of inputs.
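As a sketch of what this can look like, one could feed Mel spectrograms into a standard image network such as a ResNet; the single-channel tweak and class count below are assumptions for illustration, not the only way to do it.

```python
# A sketch of reusing an image model for spectrograms, assuming torchvision
# is available; input size and number of classes are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

spec_net = resnet18(num_classes=10)
# Spectrograms have one channel instead of the three (RGB) channels of photos.
spec_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# A batch of 8 Mel spectrograms: 128 frequency bands x 256 time frames.
spectrograms = torch.randn(8, 1, 128, 256)
print(spec_net(spectrograms).shape)  # torch.Size([8, 10])
```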

Since the data is already preprocessed in this approach, the image network can be seen as mimicking just the auditory system in the brain, not the ears.

Classifying spectrograms can often work very well, but using visual classification methods for audio also has some drawbacks.

  • Firstly, a pixel in an image usually belongs to a single object, whereas a pixel in a spectrogram can have multiple sources. For example, two instruments can play the same tone at the same time. It is therefore much harder to separate multiple sources in a spectrogram than to distinguish multiple objects in an image.
  • Secondly, the x- and y-axis are treated as if they were equivalent, while they have very different meanings (time versus frequency). In some cases, this hurts performance or makes it difficult to define a good loss function.
  • Lastly, spectrograms need to be extremely accurate in order to be converted back into audio that sounds right to humans. An image is still recognizable if some pixels are misplaced or a section is blocked out, but if a spectrogram contains some noise, the audio it represents is ruined. This limits the options for data augmentation and causes poor performance in audio generation.

Best of both worlds?

Both the classification of sound waves and spectrograms have their respective disadvantages. There is no clear winner between the two and performance really depends on what the problem is and which dataset is used. Therefore, both neural networks are often combined in an ensemble to perform even better.

To train such an ensemble, we first train the individual networks (one for classifying sound waves, one for classifying spectrograms). After these networks have been trained to classify audio, we remove the last layer of each network. These are the layers that make the classification decision based on high-level features.

Instead, we train a new layer that takes features from both the networks as input and makes a decision on all of these features.

Fusing the features in this way often performs much better than simply averaging the predictions of the two networks, although it takes a bit more effort to code and train.
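A rough sketch of such a fusion model is shown below, reusing the two example networks from the earlier sketches; the feature sizes follow from those examples and are otherwise arbitrary.

```python
# A rough sketch of feature fusion, assuming the two example networks above
# (RawAudioCNN and the modified resnet18) have already been trained.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, wave_net, spec_net, n_classes=10):
        super().__init__()
        # Keep only the feature extractors; the old decision layers are dropped.
        self.wave_features = wave_net.features
        self.spec_features = nn.Sequential(*list(spec_net.children())[:-1])
        # A new layer makes the decision based on the combined features
        # (64 wave features + 512 spectrogram features in these sketches).
        self.classifier = nn.Linear(64 + 512, n_classes)

    def forward(self, wave, spec):
        w = self.wave_features(wave).flatten(1)   # (batch, 64)
        s = self.spec_features(spec).flatten(1)   # (batch, 512)
        return self.classifier(torch.cat([w, s], dim=1))

fused = FusionClassifier(wave_net, spec_net)
scores = fused(torch.randn(8, 1, 44100), torch.randn(8, 1, 128, 256))
print(scores.shape)                               # torch.Size([8, 10])
```

In practice, the two feature extractors would typically be frozen or only lightly fine-tuned while the new fusion layer is trained.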

Conclusion

Audio Conclusion

Audio classification is not an easy task and unlike some other AI problems, there is not yet a "default" approach to solving it.

In this blog, we explained the two main approaches to audio classification and their downsides. We also showed how performance can be improved by combining them.

With these methods, there has been a lot of progress in audio classification over the last few years. We have seen the rise of voice assistants, instrument separation, and many other applications.

However, whether AI will ever be able to hear just as well as humans remains a question.
