speechbrain.processing.speech_augmentation module

Classes for mutating speech data for data augmentation.

This module provides classes that produce realistic distortions of speech data for training speech processing models. The distortions include added noise, added reverberation, speed changes, and more. All the classes are subclasses of torch.nn.Module, so they can sit inside an end-to-end differentiable pipeline and gradients can be backpropagated through them. In addition, all operations are expected to be performed on the GPU (where available) for efficiency.

Authors
  • Peter Plantinga 2020

Summary

Classes:

AddBabble

Simulate babble noise by mixing the signals in a batch.

AddNoise

This class additively combines a noise signal with the input signal.

AddReverb

This class convolves an audio signal with an impulse response.

DoClip

This class mimics audio clipping by clamping the input tensor.

DropChunk

This class drops portions of the input signal.

DropFreq

This class drops a random frequency from the signal.

Resample

This class resamples an audio signal using sinc-based interpolation.

SpeedPerturb

Slightly speed up or slow down an audio signal.

Reference

class speechbrain.processing.speech_augmentation.AddNoise(csv_file=None, csv_keys=None, sorting='random', num_workers=0, snr_low=0, snr_high=0, pad_noise=False, mix_prob=1.0, start_index=None, normalize=False, replacements={})[source]

Bases: torch.nn.modules.module.Module

This class additively combines a noise signal with the input signal.

Parameters
  • csv_file (str) – The name of a csv file containing the location of the noise audio files. If none is provided, white noise will be used.

  • csv_keys (list, None, optional) – Default: None. One data entry for the noise data should be specified. If None, the csv file is expected to have only one data entry.

  • sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.

  • num_workers (int) – Number of workers in the DataLoader (See PyTorch DataLoader docs).

  • snr_low (int) – The low end of the mixing ratios, in decibels.

  • snr_high (int) – The high end of the mixing ratios, in decibels.

  • pad_noise (bool) – If True, copy noise signals that are shorter than their corresponding clean signals so as to cover the whole clean signal. Otherwise, leave the noise un-padded.

  • mix_prob (float) – The probability that a batch of signals will be mixed with a noise signal. By default, every batch is mixed with noise.

  • start_index (int) – The index in the noise waveforms to start from. By default, chooses a random index in [0, len(noise) - len(waveforms)].

  • normalize (bool) – If True, output noisy signals that exceed [-1,1] will be normalized to [-1,1].

  • replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
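The snr_low and snr_high parameters bound the signal-to-noise ratio, in decibels, at which noise is mixed in. The following plain-Python sketch shows the textbook arithmetic for reaching a target SNR by scaling the noise. It is an illustration only: the helper names (power, mix_at_snr, measure_snr_db) are hypothetical, and AddNoise itself works on batched tensors and may scale the signals differently.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio is `snr_db`, then add it."""
    # Target noise power: P_clean / 10^(snr_db / 10)
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]


def measure_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


# Toy signals: a sine wave as "speech" and an alternating sequence as "noise".
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 * (-1) ** t for t in range(100)]

noisy = mix_at_snr(clean, noise, snr_db=10)
print(round(measure_snr_db(clean, noisy), 3))
```

With snr_low equal to snr_high, every mix would use the same ratio; a wider range would correspond to drawing snr_db at random between the two bounds.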

Example

>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import AddNoise
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> noisifier = AddNoise('samples/noise_samples/noise.csv')
>>> noisy = noisifier(clean, torch.ones(1))
forward(waveforms, lengths)[source]
Parameters
  • waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (tensor) – Shape should be a single dimension, [batch].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.AddReverb(csv_file, sorting='random', reverb_prob=1.0, rir_scale_factor=1.0, replacements={})[source]

Bases: torch.nn.modules.module.Module

This class convolves an audio signal with an impulse response.

Parameters
  • csv_file (str) – The name of a csv file containing the location of the impulse response files.

  • sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.

  • reverb_prob (float) – The chance that the audio signal will be reverbed. By default, every batch is reverbed.

  • rir_scale_factor (float) – It compresses or dilates the given impulse response. If 0 < scale_factor < 1, the impulse response is compressed (less reverb), while if scale_factor > 1 it is dilated (more reverb).

  • replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
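Convolving a signal with an impulse response can be sketched in plain Python as below. This is a minimal illustration of discrete convolution, not the AddReverb implementation: the convolve helper is hypothetical, it ignores rir_scale_factor, and truncating the result to the input length is an assumption made here so the output length matches the input.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution; output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


dry = [1.0, 0.5, 0.25]

# A unit impulse leaves the signal unchanged.
identity_ir = [1.0]

# Direct path plus one echo, two samples later at half amplitude.
echo_ir = [1.0, 0.0, 0.5]

full = convolve(dry, echo_ir)
wet = full[: len(dry)]  # assumption: keep the original length

print(full)
print(wet)
```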

Example

>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import AddReverb
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> reverb = AddReverb('samples/rir_samples/rirs.csv')
>>> reverbed = reverb(clean, torch.ones(1))
forward(waveforms, lengths)[source]
Parameters
  • waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (tensor) – Shape should be a single dimension, [batch].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.SpeedPerturb(orig_freq, speeds=[90, 100, 110], perturb_prob=1.0)[source]

Bases: torch.nn.modules.module.Module

Slightly speed up or slow down an audio signal.

Resample the audio signal at a rate close to the original rate to obtain a slightly slower or slightly faster signal. This technique is outlined in the paper “Audio Augmentation for Speech Recognition”.

Parameters
  • orig_freq (int) – The frequency of the original signal.

  • speeds (list) – The speeds that the signal should be changed to, as a percentage of the original signal (i.e. speeds is divided by 100 to get a ratio).

  • perturb_prob (float) – The chance that the batch will be speed-perturbed. By default, every batch is perturbed.
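The relationship between speeds, orig_freq, and perturb_prob can be sketched as follows. This is a hedged illustration: target_frequencies and choose_speed are hypothetical helpers, and the integer division by 100 is an assumption based on the description of speeds as a percentage of the original signal.

```python
import random


def target_frequencies(orig_freq, speeds):
    """Map each speed percentage to the sampling rate used for resampling."""
    return [orig_freq * speed // 100 for speed in speeds]


def choose_speed(speeds, perturb_prob, rng):
    """Pick a speed for one batch, or 100 (no change) if the batch is skipped."""
    if rng.random() >= perturb_prob:
        return 100
    return rng.choice(speeds)


print(target_frequencies(16000, [90, 100, 110]))

rng = random.Random(0)
print(choose_speed([90, 100, 110], perturb_prob=1.0, rng=rng))
```

Note that a speed below 100 maps to a lower target rate and therefore to fewer output samples, which is consistent with the shapes shown in the example on this page.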

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import SpeedPerturb
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90])
>>> clean = signal.unsqueeze(0)
>>> perturbed = perturbator(clean)
>>> clean.shape
torch.Size([1, 52173])
>>> perturbed.shape
torch.Size([1, 46956])
forward(waveform)[source]
Parameters

waveform (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.Resample(orig_freq=16000, new_freq=16000, lowpass_filter_width=6)[source]

Bases: torch.nn.modules.module.Module

This class resamples an audio signal using sinc-based interpolation.

It is a modification of the resample function from torchaudio (https://pytorch.org/audio/transforms.html#resample)

Parameters
  • orig_freq (int) – The sampling frequency of the input signal.

  • new_freq (int) – The new sampling frequency after this operation is performed.

  • lowpass_filter_width (int) – Controls the sharpness of the filter. Larger values give a sharper filter but are less efficient. Values from 4 to 10 are allowed.
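The effect of orig_freq and new_freq on the output length can be estimated with a short calculation. The sketch below is an assumption-laden illustration: reduced_ratio and expected_length are hypothetical helpers, and the ceiling rounding is inferred from the shapes shown in the examples on this page rather than from a documented guarantee.

```python
from math import gcd


def reduced_ratio(orig_freq, new_freq):
    """Reduce the two rates by their greatest common divisor."""
    g = gcd(orig_freq, new_freq)
    return orig_freq // g, new_freq // g


def expected_length(num_samples, orig_freq, new_freq):
    """Estimated output length: ceil(num_samples * new_freq / orig_freq)."""
    # Ceiling division using integers only, to avoid float rounding.
    return -((-num_samples * new_freq) // orig_freq)


print(reduced_ratio(16000, 8000))
print(expected_length(52173, 16000, 8000))
```

Working with the reduced ratio (for instance 2:1 instead of 16000:8000) keeps the interpolation pattern short, since the same filter offsets repeat every few samples.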

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import Resample
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> resampler = Resample(orig_freq=16000, new_freq=8000)
>>> resampled = resampler(signal)
>>> signal.shape
torch.Size([1, 52173])
>>> resampled.shape
torch.Size([1, 26087])
forward(waveforms)[source]
Parameters

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.AddBabble(speaker_count=3, snr_low=0, snr_high=0, mix_prob=1)[source]

Bases: torch.nn.modules.module.Module

Simulate babble noise by mixing the signals in a batch.

Parameters
  • speaker_count (int) – The number of signals to mix with the original signal.

  • snr_low (int) – The low end of the mixing ratios, in decibels.

  • snr_high (int) – The high end of the mixing ratios, in decibels.

  • mix_prob (float) – The probability that the batch of signals will be mixed with babble noise. By default, every batch is mixed.
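One way to build babble from a batch is to sum, for each signal, speaker_count other signals taken from the same batch. The sketch below illustrates that idea with plain lists. The babble_for_batch helper and the rotate-through-the-batch selection are assumptions for illustration, and all signals are assumed to be padded to the same length.

```python
def babble_for_batch(batch, speaker_count):
    """For each signal, sum `speaker_count` neighbouring signals in the batch."""
    size = len(batch)
    babble = []
    for i in range(size):
        mixed = [0.0] * len(batch[i])
        for shift in range(1, speaker_count + 1):
            # Wrap around the batch so every signal gets the same speaker count.
            other = batch[(i + shift) % size]
            mixed = [m + o for m, o in zip(mixed, other)]
        babble.append(mixed)
    return babble


# Four toy "utterances", each a constant so the sums are easy to read.
batch = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [8.0, 8.0]]
babble = babble_for_batch(batch, speaker_count=2)
print(babble)
```

The babble for each utterance would then be scaled to an SNR between snr_low and snr_high and added to that utterance. If speaker_count is not smaller than the batch size, the wrap-around would pull in the utterance itself.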

Example

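>>> from speechbrain.processing.speech_augmentation import AddBabble
>>> # ExtendedCSVDataset and make_dataloader are assumed to be in scope already.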
>>> babbler = AddBabble()
>>> dataset = ExtendedCSVDataset(
...     csvpath='samples/audio_samples/csv_example3.csv',
... )
>>> loader = make_dataloader(dataset, batch_size=5)
>>> speech, lengths = next(iter(loader)).at_position(0)
>>> noisy = babbler(speech, lengths)
forward(waveforms, lengths)[source]
Parameters
  • waveforms (tensor) – A batch of audio signals to process, with shape [batch, time] or [batch, time, channels].

  • lengths (tensor) – The length of each audio in the batch, with shape [batch].

Return type

Tensor with processed waveforms.

training: bool
class speechbrain.processing.speech_augmentation.DropFreq(drop_freq_low=1e-14, drop_freq_high=1, drop_count_low=1, drop_count_high=2, drop_width=0.05, drop_prob=1)[source]

Bases: torch.nn.modules.module.Module

This class drops a random frequency from the signal.

The purpose of this class is to teach models to rely on all parts of the signal, not just a few frequency bands.

Parameters
  • drop_freq_low (float) – The low end of frequencies that can be dropped, as a fraction of the sampling rate / 2.

  • drop_freq_high (float) – The high end of frequencies that can be dropped, as a fraction of the sampling rate / 2.

  • drop_count_low (int) – The low end of number of frequencies that could be dropped.

  • drop_count_high (int) – The high end of number of frequencies that could be dropped.

  • drop_width (float) – The width of the frequency band to drop, as a fraction of the sampling_rate / 2.

  • drop_prob (float) – The probability that the batch of signals will have a frequency dropped. By default, every batch has frequencies dropped.
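Because the frequency parameters are fractions of half the sampling rate, it can help to convert them to hertz. The following sketch does that conversion; band_in_hz is a hypothetical helper, and centring the dropped band on the chosen frequency is an assumption made for illustration.

```python
def band_in_hz(center_fraction, width_fraction, sample_rate):
    """Convert a dropped band from fractions of sample_rate / 2 to hertz."""
    nyquist = sample_rate / 2
    center = center_fraction * nyquist
    half_width = width_fraction * nyquist / 2
    # Clamp to the range of representable frequencies.
    low = max(0.0, center - half_width)
    high = min(nyquist, center + half_width)
    return low, high


# A band at half of sample_rate / 2, with the default drop_width of 0.05.
low, high = band_in_hz(0.5, 0.05, sample_rate=16000)
print(round(low), round(high))
```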

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import DropFreq
>>> dropper = DropFreq()
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> dropped_signal = dropper(signal.unsqueeze(0))
forward(waveforms)[source]
Parameters

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.DropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=10, drop_start=0, drop_end=None, drop_prob=1, noise_factor=0.0)[source]

Bases: torch.nn.modules.module.Module

This class drops portions of the input signal.

Using DropChunk as an augmentation strategy helps a model learn to rely on all parts of the signal, since it can’t expect a given part to be present.

Parameters
  • drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.

  • drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.

  • drop_count_low (int) – The low end of number of times that the signal can be dropped to zero.

  • drop_count_high (int) – The high end of number of times that the signal can be dropped to zero.

  • drop_start (int) – The first index for which dropping will be allowed.

  • drop_end (int) – The last index for which dropping will be allowed.

  • drop_prob (float) – The probability that the batch of signals will have a portion dropped. By default, every batch has portions dropped.

  • noise_factor (float) – The factor relative to average amplitude of an utterance to use for scaling the white noise inserted. 1 keeps the average amplitude the same, while 0 inserts all 0’s.
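The way these parameters interact can be sketched on a plain list. This is an illustrative approximation only: drop_chunks is a hypothetical helper, the bounds are treated as inclusive, and only the noise_factor=0 case (chunks set to zero) is covered.

```python
import random


def drop_chunks(signal, drop_length_low, drop_length_high,
                drop_count_low, drop_count_high,
                drop_start=0, drop_end=None, rng=None):
    """Zero out a random number of random-length chunks inside the allowed span."""
    rng = rng or random.Random()
    end = len(signal) if drop_end is None else drop_end
    out = list(signal)
    for _ in range(rng.randint(drop_count_low, drop_count_high)):
        # Never drop more than the allowed span can hold.
        length = min(rng.randint(drop_length_low, drop_length_high),
                     end - drop_start)
        start = rng.randint(drop_start, end - length)
        for i in range(start, start + length):
            out[i] = 0.0
    return out


signal = [1.0] * 1000
dropped = drop_chunks(signal, 10, 50, 1, 3,
                      drop_start=100, drop_end=200, rng=random.Random(0))
print(dropped.count(0.0))
```

Restricting drop_start and drop_end to a narrow span, as above, confines every dropped chunk to that span; samples outside it are left untouched.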

Example

>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import DropChunk
>>> dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=0.)
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> length = torch.ones(1)
>>> dropped_signal = dropper(signal, length)
>>> float(dropped_signal[:, 150])
0.0
forward(waveforms, lengths)[source]
Parameters
  • waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (tensor) – Shape should be a single dimension, [batch].

Return type

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.processing.speech_augmentation.DoClip(clip_low=0.5, clip_high=1, clip_prob=1)[source]

Bases: torch.nn.modules.module.Module

This class mimics audio clipping by clamping the input tensor.

Parameters
  • clip_low (float) – The low end of amplitudes for which to clip the signal.

  • clip_high (float) – The high end of amplitudes for which to clip the signal.

  • clip_prob (float) – The probability that the batch of signals will have a portion clipped. By default, every batch has portions clipped.
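Clipping by clamping can be sketched in a few lines of plain Python. The clip helper below is hypothetical; drawing a single threshold uniformly between clip_low and clip_high and clamping symmetrically around zero are assumptions made for illustration.

```python
import random


def clip(signal, clip_low, clip_high, rng=None):
    """Clamp every sample to [-threshold, threshold] for a random threshold."""
    rng = rng or random.Random()
    threshold = rng.uniform(clip_low, clip_high)
    return [max(-threshold, min(threshold, x)) for x in signal]


# With clip_low == clip_high the threshold is fixed, so the result is predictable.
clipped = clip([-0.8, -0.2, 0.0, 0.3, 0.9], clip_low=0.5, clip_high=0.5)
print(clipped)
```

Samples already inside the threshold pass through unchanged; only the peaks are flattened, which is what distinguishes clipping from rescaling.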

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.processing.speech_augmentation import DoClip
>>> clipper = DoClip(clip_low=0.01, clip_high=0.01)
>>> signal = read_audio('samples/audio_samples/example1.wav')
>>> clipped_signal = clipper(signal.unsqueeze(0))
>>> "%.2f" % clipped_signal.max()
'0.01'
training: bool
forward(waveforms)[source]
Parameters

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type

Tensor of shape [batch, time] or [batch, time, channels].