speechbrain.augment.time_domain module

Time-Domain Sequential Data Augmentation Classes

This module contains classes designed for augmenting sequential data in the time domain. It is particularly useful for enhancing the robustness of neural models during training. The available data distortions include adding noise, applying reverberation, adjusting playback speed, and more. All classes are implemented as torch.nn.Module, enabling end-to-end differentiability and gradient backpropagation.
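Because each augmentation is a torch.nn.Module, gradients can be backpropagated through it; a minimal sketch using DropFreq:

>>> import torch
>>> from speechbrain.augment.time_domain import DropFreq
>>> x = torch.randn(2, 16000, requires_grad=True)
>>> loss = DropFreq()(x).sum()
>>> loss.backward()
>>> x.grad.shape
torch.Size([2, 16000])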

Authors:
  • Peter Plantinga (2020)
  • Mirco Ravanelli (2023)

Summary

Classes:

AddNoise

This class additively combines a noise signal with the input signal.

AddReverb

This class convolves an audio signal with an impulse response.

ChannelDrop

This class drops random channels in the multi-channel input waveform.

ChannelSwap

This class randomly swaps N channels.

CutCat

This class combines segments (of equal length in time) of the time series contained in the batch.

DoClip

This class mimics audio clipping by clamping the input tensor.

DropBitResolution

This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32.

DropChunk

This class drops portions of the input signal.

DropFreq

This class drops a random frequency from the signal.

FastDropChunk

This class drops portions of the input signal.

RandAmp

This class multiplies the signal by a random amplitude.

Resample

This class resamples audio using the torchaudio resampler based on sinc interpolation.

SpeedPerturb

Slightly speed up or slow down an audio signal.

Functions:

pink_noise_like

Creates a sequence of pink noise (also known as 1/f noise).

Reference

class speechbrain.augment.time_domain.AddNoise(csv_file=None, csv_keys=None, sorting='random', num_workers=0, snr_low=0, snr_high=0, pad_noise=False, start_index=None, normalize=False, noise_funct=torch.randn_like, replacements={}, noise_sample_rate=16000, clean_sample_rate=16000)[source]

Bases: Module

This class additively combines a noise signal to the input signal.

Parameters:
  • csv_file (str) – The name of a csv file containing the location of the noise audio files. If none is provided, white noise will be used.

  • csv_keys (list, None, optional) – Default: None. One data entry for the noise data should be specified. If None, the csv file is expected to have only one data entry.

  • sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.

  • num_workers (int) – Number of workers in the DataLoader (See PyTorch DataLoader docs).

  • snr_low (int) – The low end of the mixing ratios, in decibels.

  • snr_high (int) – The high end of the mixing ratios, in decibels.

  • pad_noise (bool) – If True, copy noise signals that are shorter than their corresponding clean signals so as to cover the whole clean signal. Otherwise, leave the noise un-padded.

  • start_index (int) – The index in the noise waveforms to start from. By default, chooses a random index in [0, len(noise) - len(waveforms)].

  • normalize (bool) – If True, output noisy signals that exceed [-1,1] will be normalized to [-1,1].

  • noise_funct (callable) – The function used to draw a noise sample. It is used only when no csv file with noise sequences is provided. By default, torch.randn_like is used (i.e., white noise is sampled). In general, it must be a function that takes the original waveform as input and returns a tensor with the corresponding noise to add (e.g., see pink_noise_like).

  • replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.

  • noise_sample_rate (int) – The sample rate of the noise audio signals, so noise can be resampled to the clean sample rate if necessary.

  • clean_sample_rate (int) – The sample rate of the clean audio signals, so noise can be resampled to the clean sample rate if necessary.

Example

>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> noisifier = AddNoise('tests/samples/annotation/noise.csv',
...                     replacements={'noise_folder': 'tests/samples/noise'})
>>> noisy = noisifier(clean, torch.ones(1))
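When no csv file is given, the noise is drawn from noise_funct; for example, pink noise can be added by passing pink_noise_like. A minimal sketch (the SNR bounds here are arbitrary):

>>> pink_noisifier = AddNoise(noise_funct=pink_noise_like, snr_low=5, snr_high=15)
>>> noisy_pink = pink_noisifier(clean, torch.ones(1))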
forward(waveforms, lengths)[source]
Parameters:
  • waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (tensor) – Shape should be a single dimension, [batch].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.AddReverb(csv_file, sorting='random', num_workers=0, rir_scale_factor=1.0, replacements={}, reverb_sample_rate=16000, clean_sample_rate=16000)[source]

Bases: Module

This class convolves an audio signal with an impulse response.

Parameters:
  • csv_file (str) – The name of a csv file containing the location of the impulse response files.

  • sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.

  • num_workers (int) – Number of workers in the DataLoader (See PyTorch DataLoader docs).

  • rir_scale_factor (float) – It compresses or dilates the given impulse response. If 0 < scale_factor < 1, the impulse response is compressed (less reverb), while if scale_factor > 1 it is dilated (more reverb).

  • replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.

  • reverb_sample_rate (int) – The sample rate of the corruption signals (RIRs), so that they can be resampled to the clean sample rate if necessary.

  • clean_sample_rate (int) – The sample rate of the clean signals, so that the corruption signals can be resampled to the clean sample rate before convolution.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> reverb = AddReverb('tests/samples/annotation/RIRs.csv',
...                     replacements={'rir_folder': 'tests/samples/RIRs'})
>>> reverbed = reverb(clean)
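A stronger reverberation can be simulated by dilating the impulse response; a minimal sketch (the scale factor 2.0 is arbitrary):

>>> strong_reverb = AddReverb('tests/samples/annotation/RIRs.csv',
...                           rir_scale_factor=2.0,
...                           replacements={'rir_folder': 'tests/samples/RIRs'})
>>> more_reverbed = strong_reverb(clean)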
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.SpeedPerturb(orig_freq, speeds=[90, 100, 110])[source]

Bases: Module

Slightly speed up or slow down an audio signal.

Resample the audio signal at a rate that is similar to the original rate, to achieve a slightly slower or slightly faster signal. This technique is outlined in the paper: “Audio Augmentation for Speech Recognition”

Parameters:
  • orig_freq (int) – The frequency of the original signal.

  • speeds (list) – The speeds that the signal should be changed to, as a percentage of the original signal (i.e. speeds is divided by 100 to get a ratio).

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90])
>>> clean = signal.unsqueeze(0)
>>> perturbed = perturbator(clean)
>>> clean.shape
torch.Size([1, 52173])
>>> perturbed.shape
torch.Size([1, 46956])
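When several speeds are given, one of them is sampled at random at each call; a minimal sketch:

>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90, 100, 110])
>>> perturbed = perturbator(clean)  # output length depends on the sampled speed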
forward(waveform)[source]
Parameters:

waveform (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.Resample(orig_freq=16000, new_freq=16000, *args, **kwargs)[source]

Bases: Module

This class resamples audio using the torchaudio resampler based on sinc interpolation.

Parameters:
  • orig_freq (int) – the sampling frequency of the input signal.

  • new_freq (int) – the new sampling frequency after this operation is performed.

  • *args – additional arguments forwarded to the torchaudio.transforms.Resample constructor.

  • **kwargs – additional keyword arguments forwarded to the torchaudio.transforms.Resample constructor.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> resampler = Resample(orig_freq=16000, new_freq=8000)
>>> resampled = resampler(signal)
>>> signal.shape
torch.Size([1, 52173])
>>> resampled.shape
torch.Size([1, 26087])
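Additional options of torchaudio.transforms.Resample, such as lowpass_filter_width, can be forwarded directly; a minimal sketch:

>>> hq_resampler = Resample(orig_freq=16000, new_freq=8000, lowpass_filter_width=64)
>>> resampled_hq = hq_resampler(signal)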
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.DropFreq(drop_freq_low=1e-14, drop_freq_high=1, drop_freq_count_low=1, drop_freq_count_high=3, drop_freq_width=0.05)[source]

Bases: Module

This class drops a random frequency from the signal.

The purpose of this class is to teach models to rely on all parts of the signal, not just on a few frequency bands.

Parameters:
  • drop_freq_low (float) – The low end of frequencies that can be dropped, as a fraction of the sampling rate / 2.

  • drop_freq_high (float) – The high end of frequencies that can be dropped, as a fraction of the sampling rate / 2.

  • drop_freq_count_low (int) – The low end of the number of frequencies that could be dropped.

  • drop_freq_count_high (int) – The high end of the number of frequencies that could be dropped.

  • drop_freq_width (float) – The width of the frequency band to drop, as a fraction of the sampling rate / 2.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropFreq()
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> dropped_signal = dropper(signal.unsqueeze(0))
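Because the bounds are fractions of the Nyquist frequency (sample rate / 2), a band in Hz can be mapped explicitly; a minimal sketch assuming a 16 kHz sample rate:

>>> # drop frequencies between 2 kHz and 4 kHz (Nyquist = 8 kHz)
>>> band_dropper = DropFreq(drop_freq_low=2000 / 8000, drop_freq_high=4000 / 8000)
>>> band_dropped = band_dropper(signal.unsqueeze(0))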
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.DropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=3, drop_start=0, drop_end=None, noise_factor=0.0)[source]

Bases: Module

This class drops portions of the input signal.

Using DropChunk as an augmentation strategy helps models learn to rely on all parts of the signal, since they cannot expect a given part to always be present.

Parameters:
  • drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.

  • drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.

  • drop_count_low (int) – The low end of the number of times that the signal can be dropped to zero.

  • drop_count_high (int) – The high end of the number of times that the signal can be dropped to zero.

  • drop_start (int) – The first index for which dropping will be allowed.

  • drop_end (int) – The last index for which dropping will be allowed.

  • noise_factor (float) – The factor, relative to the average amplitude of the utterance, used to scale the white noise inserted in place of the dropped chunks. A value of 1 keeps the average amplitude the same, while 0 inserts silence (all zeros).

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=0.)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> length = torch.ones(1)
>>> dropped_signal = dropper(signal, length)
>>> float(dropped_signal[:, 150])
0.0
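With a non-zero noise_factor, dropped chunks are filled with scaled white noise rather than zeros; a minimal sketch:

>>> noisy_dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=1.0)
>>> noisy_dropped = noisy_dropper(signal, length)  # dropped span is noise, not silence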
forward(waveforms, lengths)[source]
Parameters:
  • waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (tensor) – Shape should be a single dimension, [batch].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.FastDropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=10, drop_start=0, drop_end=None, n_masks=1000)[source]

Bases: Module

This class drops portions of the input signal. It differs from DropChunk in that the dropping masks are precomputed the first time the forward function is called; all subsequent calls only shuffle and apply them. This makes the code faster and more suitable for data augmentation of large batches.

It can be used only for fixed-length sequences.

Parameters:
  • drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.

  • drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.

  • drop_count_low (int) – The low end of the number of times that the signal can be dropped to zero.

  • drop_count_high (int) – The high end of the number of times that the signal can be dropped to zero.

  • drop_start (int) – The first index for which dropping will be allowed.

  • drop_end (int) – The last index for which dropping will be allowed.

  • n_masks (int) – The number of precomputed masks.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = FastDropChunk(drop_start=100, drop_end=200)
>>> signal = torch.rand(10, 250, 22)
>>> dropped_signal = dropper(signal)
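Since the masks are precomputed from the first input, subsequent batches must keep the same fixed length; a minimal sketch of reusing the cached masks:

>>> for _ in range(3):
...     dropped_signal = dropper(torch.rand(10, 250, 22))  # same shape reuses the masks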
initialize_masks(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Returns:

dropped_masks (tensor) – Tensor of size [n_masks, time] with the dropped chunks. Dropped regions are set to 0.

forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.DoClip(clip_low=0.5, clip_high=0.5)[source]

Bases: Module

This class mimics audio clipping by clamping the input tensor. First, it normalizes the waveforms between -1 and 1. Then, clipping is applied. Finally, the original amplitude is restored.

Parameters:
  • clip_low (float) – The low end of amplitudes for which to clip the signal.

  • clip_high (float) – The high end of amplitudes for which to clip the signal.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> clipper = DoClip(clip_low=0.01, clip_high=0.01)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clipped_signal = clipper(signal.unsqueeze(0))
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.RandAmp(amp_low=0.5, amp_high=1.5)[source]

Bases: Module

This class multiplies the signal by a random amplitude. First, the signal is normalized to have an amplitude between -1 and 1. Then, it is multiplied by a random factor.

Parameters:
  • amp_low (float) – The minimum amplitude multiplication factor.

  • amp_high (float) – The maximum amplitude multiplication factor.

Example

>>> from speechbrain.dataio.dataio import read_audio
>>> rand_amp = RandAmp(amp_low=0.25, amp_high=1.75)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> output_signal = rand_amp(signal.unsqueeze(0))
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.ChannelDrop(drop_rate=0.1)[source]

Bases: Module

This class drops random channels in the multi-channel input waveform.

Parameters:

drop_rate (float) – The channel dropout factor.

Example

>>> signal = torch.rand(4, 256, 8)
>>> ch_drop = ChannelDrop(drop_rate=0.5)
>>> output_signal = ch_drop(signal)
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.ChannelSwap(min_swap=0, max_swap=0)[source]

Bases: Module

This class randomly swaps N channels.

Parameters:
  • min_swap (int) – The minimum number of channels to swap.

  • max_swap (int) – The maximum number of channels to swap.

Example

>>> signal = torch.rand(4, 256, 8)
>>> ch_swap = ChannelSwap()
>>> output_signal = ch_swap(signal)
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
class speechbrain.augment.time_domain.CutCat(min_num_segments=2, max_num_segments=10)[source]

Bases: Module

This class combines segments (of equal length in time) of the time series contained in the batch. Proposed for EEG signals in https://doi.org/10.1016/j.neunet.2021.05.032.

Parameters:
  • min_num_segments (int) – The minimum number of segments to combine. Default is 2.

  • max_num_segments (int) – The maximum number of segments to combine. Default is 10.

Example

>>> signal = torch.ones((4, 256, 22)) * torch.arange(4).reshape((4, 1, 1,))
>>> cutcat = CutCat()
>>> output_signal = cutcat(signal)
forward(waveforms)[source]
Parameters:

waveforms (tensor) – Shape should be [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels].

training: bool
speechbrain.augment.time_domain.pink_noise_like(waveforms, alpha_low=1.0, alpha_high=1.0, sample_rate=50)[source]

Creates a sequence of pink noise (also known as 1/f noise). The pink noise is obtained by multiplying the spectrum of a white noise sequence by a factor (1/f^alpha). The alpha factor controls the decay in the frequency domain (alpha=0 yields white noise, alpha>>0 adds low-frequency noise). It is randomly sampled between alpha_low and alpha_high. With a negative alpha, this function generates blue noise.

Parameters:
  • waveforms (torch.Tensor) – The original waveform. It is just used to infer the shape.

  • alpha_low (float) – The minimum value for the alpha spectral smoothing factor.

  • alpha_high (float) – The maximum value for the alpha spectral smoothing factor.

  • sample_rate (float) – The sample rate of the original signal.

Example

>>> waveforms = torch.randn(4,257,10)
>>> noise = pink_noise_like(waveforms)
>>> noise.shape
torch.Size([4, 257, 10])
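A negative alpha generates blue noise instead; a minimal sketch:

>>> blue_noise = pink_noise_like(waveforms, alpha_low=-1.0, alpha_high=-1.0)
>>> blue_noise.shape
torch.Size([4, 257, 10])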
class speechbrain.augment.time_domain.DropBitResolution(target_dtype='random')[source]

Bases: Module

This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32. This process loses information and can be used for data augmentation.

Parameters:

target_dtype (str) – One of “int16”, “int8”, “float16”. If “random”, the bit resolution is randomly selected among the options listed above.

Example

>>> dropper = DropBitResolution()
>>> signal = torch.rand(4, 16000)
>>> signal_dropped = dropper(signal)
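The target resolution can also be fixed instead of sampled; a minimal sketch:

>>> int8_dropper = DropBitResolution(target_dtype='int8')
>>> signal_int8 = int8_dropper(signal)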
training: bool
forward(float32_tensor)[source]
Parameters:

float32_tensor (torch.Tensor) – Float32 tensor with shape [batch, time] or [batch, time, channels].

Return type:

Tensor of shape [batch, time] or [batch, time, channels] (float32).