speechbrain.augment.time_domain module
Time-Domain Sequential Data Augmentation Classes
This module contains classes designed for augmenting sequential data in the time domain.
It is particularly useful for enhancing the robustness of neural models during training.
The available data distortions include adding noise, applying reverberation, adjusting playback speed, and more.
All classes are implemented as torch.nn.Module, enabling end-to-end differentiability and gradient backpropagation.
Authors:
- Peter Plantinga (2020)
- Mirco Ravanelli (2023)
- Gianfranco Dumoulin Bertucci (2025)
Summary
Classes:
AddNoise | This class additively combines a noise signal with the input signal.
AddReverb | This class convolves an audio signal with an impulse response.
ChannelDrop | This class drops random channels in the multi-channel input waveform.
ChannelSwap | This class randomly swaps N channels.
CutCat | This class combines segments (with equal length in time) of the time series contained in the batch.
DoClip | This class mimics audio clipping by clamping the input tensor.
DropBitResolution | This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32.
DropChunk | This class drops portions of the input signal.
DropFreq | This class drops a random frequency from the signal.
FastDropChunk | This class drops portions of the input signal.
RandAmp | This class multiplies the signal by a random amplitude.
Resample | This class resamples audio using the torchaudio resampler based on sinc interpolation.
SignFlip | Flip the sign of a signal.
SpeedPerturb | Slightly speed up or slow down an audio signal.
Functions:
pink_noise_like | Creates a sequence of pink noise (also known as 1/f).
Reference
- class speechbrain.augment.time_domain.AddNoise(csv_file=None, csv_keys=None, sorting='random', num_workers=0, snr_low=0, snr_high=0, pad_noise=False, start_index=None, normalize=False, noise_funct=torch.randn_like, replacements={}, noise_sample_rate=16000, clean_sample_rate=16000)
Bases: Module
This class additively combines a noise signal with the input signal.
- Parameters:
csv_file (str) – The name of a csv file containing the location of the noise audio files. If none is provided, white noise will be used.
csv_keys (list, None, optional) – Default: None. One data entry for the noise data should be specified. If None, the csv file is expected to have only one data entry.
sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.
num_workers (int) – Number of workers in the DataLoader (see the PyTorch DataLoader docs).
snr_low (int) – The low end of the mixing ratios, in decibels.
snr_high (int) – The high end of the mixing ratios, in decibels.
pad_noise (bool) – If True, copy noise signals that are shorter than their corresponding clean signals so as to cover the whole clean signal. Otherwise, leave the noise un-padded.
start_index (int) – The index in the noise waveforms to start from. By default, chooses a random index in [0, len(noise) - len(waveforms)].
normalize (bool) – If True, output noisy signals that exceed [-1, 1] will be normalized to [-1, 1].
noise_funct (funct object) – Function used to draw a noise sample. It is used only if no csv file containing the noise sequences is provided. By default, torch.randn_like is used (to sample white noise). In general, it must be a function that takes the original waveform as input and returns a tensor with the corresponding noise to add (e.g., see pink_noise_like).
replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
noise_sample_rate (int) – The sample rate of the noise audio signals, so noise can be resampled to the clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean audio signals, so noise can be resampled to the clean sample rate if necessary.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import AddNoise
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> clean = signal.unsqueeze(0)  # [batch, time]
>>> noisifier = AddNoise(
...     "tests/samples/annotation/noise.csv",
...     replacements={"noise_folder": "tests/samples/noise"},
... )
>>> noisy = noisifier(clean, torch.ones(1))
- forward(waveforms, lengths)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
lengths (torch.Tensor) – Shape should be a single dimension, [batch].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
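As a rough illustration of what a mixing ratio in decibels means, the sketch below scales a noise sequence so that the clean-to-noise power ratio matches a target SNR before adding it. It is plain Python for readability, the helper name mix_at_snr and the 0 to 15 dB range are hypothetical, and SpeechBrain's own scaling inside AddNoise may differ.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (in dB)."""
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    # SNR_dB = 10 * log10(clean_power / scaled_noise_power)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


random.seed(0)
# One tenth of a second of a 440 Hz tone at a hypothetical 16 kHz sample rate.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [random.gauss(0.0, 1.0) for _ in clean]
# Draw the SNR uniformly between hypothetical snr_low=0 and snr_high=15.
snr_db = random.uniform(0, 15)
noisy = mix_at_snr(clean, noise, snr_db)
```

A lower SNR value yields a larger gain on the noise, so the clean signal is harder to recover.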
- class speechbrain.augment.time_domain.AddReverb(csv_file, sorting='random', num_workers=0, rir_scale_factor=1.0, replacements={}, reverb_sample_rate=16000, clean_sample_rate=16000)
Bases: Module
This class convolves an audio signal with an impulse response.
- Parameters:
csv_file (str) – The name of a csv file containing the location of the impulse response files.
sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.
num_workers (int) – Number of workers in the DataLoader (see the PyTorch DataLoader docs).
rir_scale_factor (float) – It compresses or dilates the given impulse response. If 0 < rir_scale_factor < 1, the impulse response is compressed (less reverb), while if rir_scale_factor > 1 it is dilated (more reverb).
replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
reverb_sample_rate (int) – The sample rate of the corruption signals (RIRs), so that they can be resampled to the clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean signals, so that the corruption signals can be resampled to the clean sample rate before convolution.
Example
>>> from speechbrain.augment.time_domain import AddReverb
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> clean = signal.unsqueeze(0)  # [batch, time]
>>> reverb = AddReverb(
...     "tests/samples/annotation/RIRs.csv",
...     replacements={"rir_folder": "tests/samples/RIRs"},
... )
>>> reverbed = reverb(clean)
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
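To make "convolves an audio signal with an impulse response" concrete, here is a minimal plain-Python sketch of a direct convolution truncated to the input length. The function name and the toy impulse response are hypothetical; AddReverb itself loads real impulse responses from the csv file and may apply rescaling and resampling that this sketch omits.

```python
def convolve_with_rir(signal, rir):
    """Direct-form convolution, truncated to the length of the input."""
    out = [0.0] * len(signal)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(rir):
            if i + j < len(out):
                out[i + j] += sample * tap
    return out


# A unit impulse followed by silence makes the impulse response visible.
dry = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.5, 0.25]  # toy response: direct path plus two decaying echoes
wet = convolve_with_rir(dry, rir)
```

Here the output starts with the three taps of the toy response, followed by silence, which is the defining property of an impulse response.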
- class speechbrain.augment.time_domain.SpeedPerturb(orig_freq, speeds=[90, 100, 110], device='cpu')
Bases: Module
Slightly speed up or slow down an audio signal.
Resample the audio signal at a rate that is similar to the original rate, to achieve a slightly slower or slightly faster signal. This technique is outlined in the paper: "Audio Augmentation for Speech Recognition".
- Parameters:
Example
>>> from speechbrain.augment.time_domain import SpeedPerturb
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90])
>>> clean = signal.unsqueeze(0)
>>> perturbed = perturbator(clean)
>>> clean.shape
torch.Size([1, 52173])
>>> perturbed.shape
torch.Size([1, 57971])
- forward(waveform)
- Parameters:
waveform (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
torch.Tensor of shape [batch, time] or [batch, time, channels].
- class speechbrain.augment.time_domain.Resample(orig_freq=16000, new_freq=16000, *args, **kwargs)
Bases: Module
This class resamples audio using the torchaudio resampler based on sinc interpolation.
- Parameters:
orig_freq (int) – The sampling frequency of the input signal.
new_freq (int) – The new sampling frequency after this operation is performed.
*args – Additional arguments forwarded to the torchaudio.transforms.Resample constructor.
**kwargs – Additional keyword arguments forwarded to the torchaudio.transforms.Resample constructor.
Example
>>> from speechbrain.augment.time_domain import Resample
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> signal = signal.unsqueeze(0)  # [batch, time]
>>> resampler = Resample(orig_freq=16000, new_freq=8000)
>>> resampled = resampler(signal)
>>> signal.shape
torch.Size([1, 52173])
>>> resampled.shape
torch.Size([1, 26087])
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
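The shapes in the example above are consistent with an output length equal to the input length scaled by new_freq / orig_freq and rounded up. The sketch below reproduces that arithmetic; the helper name is hypothetical and the rounding rule is an assumption inferred from the example, not a statement about every resampler configuration.

```python
import math
from fractions import Fraction


def resampled_length(num_samples, orig_freq, new_freq):
    """Estimate the number of output samples after resampling."""
    # Fraction keeps the ratio exact, so the rounding is not affected by
    # floating-point error.
    ratio = Fraction(new_freq, orig_freq)
    return math.ceil(num_samples * ratio)


# 52173 samples at 16 kHz, resampled to 8 kHz as in the example above.
length_at_8k = resampled_length(52173, 16000, 8000)
```

For these values the result is 26087, matching resampled.shape in the example.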
- class speechbrain.augment.time_domain.DropFreq(drop_freq_low=1e-14, drop_freq_high=1, drop_freq_count_low=1, drop_freq_count_high=3, drop_freq_width=0.05, epsilon=1e-12)
Bases: Module
This class drops a random frequency from the signal.
The purpose of this class is to teach models to learn to rely on all parts of the signal, not just a few frequency bands.
- Parameters:
drop_freq_low (float) – The low end of frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_high (float) – The high end of frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_count_low (int) – The low end of number of frequencies that could be dropped.
drop_freq_count_high (int) – The high end of number of frequencies that could be dropped.
drop_freq_width (float) – The width of the frequency band to drop, as a fraction of the sampling rate / 2.
epsilon (float) – A small positive value to prevent issues such as filtering 0 Hz, division by zero, or other numerical instabilities. This value sets the absolute minimum for normalized frequencies used in the filter. The default value is 1e-12.
Example
>>> from speechbrain.augment.time_domain import DropFreq
>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropFreq()
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> dropped_signal = dropper(signal.unsqueeze(0))
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
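Because the DropFreq parameters are expressed as fractions of the sampling rate / 2, it can help to convert them to Hz. The sketch below does so for one dropped band, assuming a hypothetical 16 kHz sample rate and treating drop_freq_width as the full width of the band centred on the dropped frequency (an assumption; the class may define the band edges differently).

```python
def drop_band_hz(center_fraction, width_fraction, sample_rate):
    """Convert a normalized centre frequency and width to a band in Hz."""
    nyquist = sample_rate / 2
    center = center_fraction * nyquist
    half_width = width_fraction * nyquist / 2
    # Keep the band inside the representable range [0, nyquist].
    low = max(center - half_width, 0.0)
    high = min(center + half_width, nyquist)
    return low, high


# A dropped frequency at half the Nyquist rate with the default width of 0.05.
low_hz, high_hz = drop_band_hz(0.5, 0.05, 16000)
```

With these values the band is roughly 3800 Hz to 4200 Hz, i.e. 400 Hz wide around 4 kHz.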
- class speechbrain.augment.time_domain.DropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=3, drop_start=0, drop_end=None, noise_factor=0.0)
Bases: Module
This class drops portions of the input signal.
Using DropChunk as an augmentation strategy helps models learn to rely on all parts of the signal, since they can't expect a given part to be present.
- Parameters:
drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.
drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.
drop_count_low (int) – The low end of number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping will be allowed.
drop_end (int) – The last index for which dropping will be allowed.
noise_factor (float) – The factor relative to the average amplitude of an utterance to use for scaling the white noise inserted. 1 keeps the average amplitude the same, while 0 inserts all zeros.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import DropChunk
>>> from speechbrain.dataio.dataio import read_audio
>>> dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=0.0)
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> signal = signal.unsqueeze(0)  # [batch, time]
>>> length = torch.ones(1)
>>> dropped_signal = dropper(signal, length)
>>> float(dropped_signal[:, 150])
0.0
- forward(waveforms, lengths)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
lengths (torch.Tensor) – Shape should be a single dimension, [batch].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
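The interaction of the length, count, and start/end parameters can be sketched in plain Python as follows. This is an illustration under assumptions, not the class's implementation: the helper name is hypothetical, chunk boundaries are treated as half-open, and the noise_factor option is left out so dropped samples are simply set to zero.

```python
import random


def drop_chunks(signal, length_low, length_high, count_low, count_high,
                drop_start=0, drop_end=None, rng=random):
    """Zero out a random number of random-length chunks of a 1-D signal."""
    out = list(signal)
    end = len(out) if drop_end is None else drop_end
    for _ in range(rng.randint(count_low, count_high)):
        # Never draw a chunk longer than the region where dropping is allowed.
        length = min(rng.randint(length_low, length_high), end - drop_start)
        start = rng.randint(drop_start, end - length)
        for i in range(start, start + length):
            out[i] = 0.0
    return out


rng = random.Random(0)
signal = [1.0] * 1000
dropped = drop_chunks(signal, 10, 50, 1, 3, drop_start=100, drop_end=200, rng=rng)
```

Samples outside the 100 to 200 window are never modified, which mirrors how drop_start and drop_end restrict where chunks may fall.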
- class speechbrain.augment.time_domain.FastDropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=10, drop_start=0, drop_end=None, n_masks=1000)
Bases: Module
This class drops portions of the input signal. The difference from DropChunk is that the dropping masks are pre-computed the first time the forward function is called. On all subsequent calls, the masks are only shuffled and applied. This makes the code faster and more suitable for data augmentation of large batches.
It can be used only for fixed-length sequences.
- Parameters:
drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.
drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.
drop_count_low (int) – The low end of number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping will be allowed.
drop_end (int) – The last index for which dropping will be allowed.
n_masks (int) – The number of precomputed masks.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import FastDropChunk
>>> dropper = FastDropChunk(drop_start=100, drop_end=200)
>>> signal = torch.rand(10, 250, 22)
>>> dropped_signal = dropper(signal)
- initialize_masks(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
dropped_masks (torch.Tensor) – Tensor of size [n_masks, time] with the dropped chunks. Dropped regions are assigned to 0.
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
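The precompute-then-shuffle idea can be sketched as below: masks are built once for a fixed sequence length, and each later call only shuffles them and multiplies them into the batch. The helper names are hypothetical, each mask holds a single dropped chunk for brevity, and the batch is assumed to contain no more rows than there are masks.

```python
import random


def make_masks(n_masks, time, length_low, length_high, rng):
    """Precompute binary masks: 1.0 keeps a sample, 0.0 drops it."""
    masks = []
    for _ in range(n_masks):
        mask = [1.0] * time
        length = rng.randint(length_low, min(length_high, time))
        start = rng.randint(0, time - length)
        mask[start:start + length] = [0.0] * length
        masks.append(mask)
    return masks


def apply_masks(batch, masks, rng):
    """Shuffle the precomputed masks and apply one to each row of the batch."""
    rng.shuffle(masks)
    return [[x * m for x, m in zip(row, mask)] for row, mask in zip(batch, masks)]


rng = random.Random(0)
masks = make_masks(n_masks=8, time=250, length_low=10, length_high=50, rng=rng)
batch = [[1.0] * 250 for _ in range(4)]
out = apply_masks(batch, masks, rng)
```

Because the masks have a fixed time dimension, this approach only works when every sequence has the same length, which matches the fixed-length restriction noted for FastDropChunk.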
- class speechbrain.augment.time_domain.DoClip(clip_low=0.5, clip_high=0.5)
Bases: Module
This class mimics audio clipping by clamping the input tensor. First, it normalizes the waveforms to the range from -1 to 1. Then, clipping is applied. Finally, the original amplitude is restored.
- Parameters:
Example
>>> from speechbrain.augment.time_domain import DoClip
>>> from speechbrain.dataio.dataio import read_audio
>>> clipper = DoClip(clip_low=0.01, clip_high=0.01)
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> clipped_signal = clipper(signal.unsqueeze(0))
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
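The normalize, clamp, and restore steps described for DoClip can be sketched in plain Python. The function name is hypothetical, the clipping value is fixed here rather than sampled between clip_low and clip_high, and "restoring the original amplitude" is read as rescaling so that the clipped peak matches the original peak, which is an assumption about the class's behavior.

```python
def clip_waveform(waveform, clip_value):
    """Normalize to [-1, 1], clamp to +/- clip_value, restore the peak."""
    peak = max(abs(x) for x in waveform)  # assumed non-zero
    normalized = [x / peak for x in waveform]
    clipped = [max(-clip_value, min(clip_value, x)) for x in normalized]
    # Rescale so the clipped signal reaches the original peak again.
    return [x * peak / clip_value for x in clipped]


wave = [0.0, 0.2, 0.8, -0.4, 0.1]
clipped = clip_waveform(wave, clip_value=0.5)
```

In this toy input, both the 0.8 sample and the -0.4 sample end up at the peak magnitude, which is the flattening effect of clipping.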
- class speechbrain.augment.time_domain.RandAmp(amp_low=0.5, amp_high=1.5)
Bases: Module
This class multiplies the signal by a random amplitude. First, the signal is normalized to have amplitude between -1 and 1. Then it is multiplied by a random number.
- Parameters:
Example
>>> from speechbrain.augment.time_domain import RandAmp
>>> from speechbrain.dataio.dataio import read_audio
>>> rand_amp = RandAmp(amp_low=0.25, amp_high=1.75)
>>> signal = read_audio("tests/samples/single-mic/example1.wav")
>>> output_signal = rand_amp(signal.unsqueeze(0))
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
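A minimal plain-Python sketch of the two steps described for RandAmp, peak normalization followed by multiplication with a random factor, might look like this. The function name is hypothetical and the factor is assumed to be drawn uniformly between the two bounds.

```python
import random


def random_amplitude(waveform, amp_low, amp_high, rng):
    """Peak-normalize a waveform, then scale it by a random factor."""
    peak = max(abs(x) for x in waveform)  # assumed non-zero
    factor = rng.uniform(amp_low, amp_high)
    return [x / peak * factor for x in waveform], factor


rng = random.Random(0)
wave = [0.0, 0.2, -0.8, 0.4]
scaled, factor = random_amplitude(wave, amp_low=0.5, amp_high=1.5, rng=rng)
```

After normalization the peak is 1, so the peak of the output equals the sampled factor regardless of the input's original level.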
- class speechbrain.augment.time_domain.ChannelDrop(drop_rate=0.1)
Bases: Module
This class drops random channels in the multi-channel input waveform.
- Parameters:
drop_rate (float) – The channel dropout factor.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import ChannelDrop
>>> signal = torch.rand(4, 256, 8)
>>> ch_drop = ChannelDrop(drop_rate=0.5)
>>> output_signal = ch_drop(signal)
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
- class speechbrain.augment.time_domain.ChannelSwap(min_swap=0, max_swap=0)
Bases: Module
This class randomly swaps N channels.
- Parameters:
Example
>>> import torch
>>> from speechbrain.augment.time_domain import ChannelSwap
>>> signal = torch.rand(4, 256, 8)
>>> ch_swap = ChannelSwap()
>>> output_signal = ch_swap(signal)
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
- class speechbrain.augment.time_domain.CutCat(min_num_segments=2, max_num_segments=10)
Bases: Module
This class combines segments (with equal length in time) of the time series contained in the batch. Proposed for EEG signals in https://doi.org/10.1016/j.neunet.2021.05.032.
- Parameters:
Example
>>> import torch
>>> from speechbrain.augment.time_domain import CutCat
>>> signal = torch.ones((4, 256, 22)) * torch.arange(4).reshape((4, 1, 1))
>>> cutcat = CutCat()
>>> output_signal = cutcat(signal)
- forward(waveforms)
- Parameters:
waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].
- Returns:
Tensor of shape [batch, time] or [batch, time, channels].
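One way to picture "combining segments of equal length" is the plain-Python sketch below, which cuts the time axis into equal segments and fills each one from a randomly chosen example in the batch. The helper name and the random choice of donor example are assumptions for illustration; the reference does not specify how CutCat pairs segments with examples.

```python
import random


def cut_cat(batch, num_segments, rng):
    """Rebuild each row from equal-length segments of rows in the batch."""
    time = len(batch[0])
    seg_len = time // num_segments
    out = []
    for row in batch:
        new_row = list(row)  # any leftover tail keeps the original samples
        for s in range(num_segments):
            start = s * seg_len
            donor = rng.choice(batch)
            new_row[start:start + seg_len] = donor[start:start + seg_len]
        out.append(new_row)
    return out


rng = random.Random(0)
# Four constant rows (values 0, 1, 2, 3), like the signal in the example above.
batch = [[float(i)] * 256 for i in range(4)]
num_segments = rng.randint(2, 10)  # hypothetical min/max number of segments
mixed = cut_cat(batch, num_segments, rng)
```

With constant-valued rows, each segment of the output reveals which example it was taken from.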
- speechbrain.augment.time_domain.pink_noise_like(waveforms, alpha_low=1.0, alpha_high=1.0, sample_rate=50)
Creates a sequence of pink noise (also known as 1/f). The pink noise is obtained by multiplying the spectrum of a white noise sequence by a factor (1/f^alpha). The alpha factor controls the decrease factor in the frequency domain (alpha=0 adds white noise, alpha>>0 adds low frequency noise). It is randomly sampled between alpha_low and alpha_high. With negative alpha this function generates blue noise.
- Parameters:
waveforms (torch.Tensor) – The original waveform. It is just used to infer the shape.
alpha_low (float) – The minimum value for the alpha spectral smoothing factor.
alpha_high (float) – The maximum value for the alpha spectral smoothing factor.
sample_rate (float) – The sample rate of the original signal.
- Returns:
pink_noise – Pink noise in the shape of the input tensor.
- Return type:
torch.Tensor
Example
>>> import torch
>>> from speechbrain.augment.time_domain import pink_noise_like
>>> waveforms = torch.randn(4, 257, 10)
>>> noise = pink_noise_like(waveforms)
>>> noise.shape
torch.Size([4, 257, 10])
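The spectral shaping described above (multiplying the spectrum of white noise by 1/f^alpha) can be sketched with a naive DFT in plain Python. This is an illustration only: the function name is hypothetical, frequency is measured in bin indices rather than Hz, alpha is fixed instead of sampled between alpha_low and alpha_high, and any normalization that pink_noise_like applies is omitted.

```python
import cmath
import math
import random


def shaped_noise(num_samples, alpha, rng):
    """Return (white, shaped), where shaped has a 1/f^alpha spectrum."""
    n = num_samples
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Forward DFT (O(n^2), acceptable for a short illustration).
    spectrum = [
        sum(white[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    shaped_spectrum = []
    for k, value in enumerate(spectrum):
        f = min(k, n - k)  # symmetric bin index keeps the output real
        scale = 1.0 if f == 0 else 1.0 / (f ** alpha)  # leave DC untouched
        shaped_spectrum.append(value * scale)
    # Inverse DFT; the imaginary part is numerical noise and is discarded.
    shaped = [
        (sum(shaped_spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]
    return white, shaped


white, pink = shaped_noise(64, alpha=1.0, rng=random.Random(0))
```

With alpha=0 every scale factor is 1 and the white noise is returned unchanged, while larger alpha values attenuate high frequencies more strongly.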
- class speechbrain.augment.time_domain.DropBitResolution(target_dtype='random')
Bases: Module
This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32. This process loses information and can be used for data augmentation.
- Parameters:
target_dtype (str) – One of "int16", "int8", "float16". If "random", the bit resolution is randomly selected among the options listed above.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import DropBitResolution
>>> dropper = DropBitResolution()
>>> signal = torch.rand(4, 16000)
>>> signal_dropped = dropper(signal)
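The information loss from a round trip through a lower resolution can be illustrated on a single value with the standard library. The helper names are hypothetical, samples are assumed to lie in [-1, 1], and the integer scaling shown here (full-scale mapping to the largest positive integer) is an assumption rather than the class's documented behavior.

```python
import struct


def through_int(x, bits):
    """Quantize a value in [-1, 1] to a signed integer grid and back."""
    scale = 2 ** (bits - 1) - 1  # 32767 for int16, 127 for int8
    return round(x * scale) / scale


def through_float16(x):
    """Round-trip a value through IEEE half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


x = 0.123456789
err_int16 = abs(through_int(x, 16) - x)
err_int8 = abs(through_int(x, 8) - x)
err_float16 = abs(through_float16(x) - x)
```

The int8 round trip loses the most precision here, since its grid has only 127 positive steps, whereas the int16 error is bounded by half of 1/32767.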
- class speechbrain.augment.time_domain.SignFlip(flip_prob=0.5)
Bases: Module
Flip the sign of a signal.
This module negates all the values in a tensor with a given probability. If the sign is not flipped, the original signal is returned unchanged. This technique is outlined in the paper: "CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals" https://arxiv.org/pdf/2106.13695
- Parameters:
flip_prob (float) – The probability with which to flip the sign of the signal. Default is 0.5.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import SignFlip
>>> x = torch.tensor([1, 2, 3, 4, 5])
>>> flip = SignFlip(flip_prob=1)  # 100% chance to flip sign
>>> flip(x)
tensor([-1, -2, -3, -4, -5])
- forward(waveform)
- Parameters:
waveform (torch.Tensor) – Input tensor representing a waveform; the shape does not matter.
- Returns:
The output tensor with the same shape as the input, where the sign of all values in the tensor has been flipped with probability flip_prob.
- Return type:
torch.Tensor