speechbrain.augment.time_domain module
Time-Domain Sequential Data Augmentation Classes
This module contains classes designed for augmenting sequential data in the time domain.
It is particularly useful for enhancing the robustness of neural models during training.
The available data distortions include adding noise, applying reverberation, adjusting playback speed, and more.
All classes are implemented as torch.nn.Module, enabling end-to-end differentiability and gradient backpropagation.
Authors: - Peter Plantinga (2020) - Mirco Ravanelli (2023)
Summary
Classes:
| AddNoise | This class additively combines a noise signal with the input signal. |
| AddReverb | This class convolves an audio signal with an impulse response. |
| ChannelDrop | This function drops random channels in the multi-channel input waveform. |
| ChannelSwap | This function randomly swaps N channels. |
| CutCat | This function combines segments (with equal length in time) of the time series contained in the batch. |
| DoClip | This function mimics audio clipping by clamping the input tensor. |
| DropBitResolution | This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32. |
| DropChunk | This class drops portions of the input signal. |
| DropFreq | This class drops a random frequency from the signal. |
| FastDropChunk | This class drops portions of the input signal. |
| RandAmp | This function multiplies the signal by a random amplitude. |
| Resample | This class resamples audio using the torchaudio resampler based on sinc interpolation. |
| SpeedPerturb | Slightly speed up or slow down an audio signal. |
Functions:
| pink_noise_like | Creates a sequence of pink noise (also known as 1/f). |
Reference
- class speechbrain.augment.time_domain.AddNoise(csv_file=None, csv_keys=None, sorting='random', num_workers=0, snr_low=0, snr_high=0, pad_noise=False, start_index=None, normalize=False, noise_funct=<built-in method randn_like of type object>, replacements={}, noise_sample_rate=16000, clean_sample_rate=16000)[source]
Bases: Module
This class additively combines a noise signal with the input signal.
- Parameters:
csv_file (str) – The name of a csv file containing the location of the noise audio files. If none is provided, white noise will be used.
csv_keys (list, None, optional) – Default: None. One data entry for the noise data should be specified. If None, the csv file is expected to have only one data entry.
sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.
num_workers (int) – Number of workers in the DataLoader (See PyTorch DataLoader docs).
snr_low (int) – The low end of the mixing ratios, in decibels.
snr_high (int) – The high end of the mixing ratios, in decibels.
pad_noise (bool) – If True, copy noise signals that are shorter than their corresponding clean signals so as to cover the whole clean signal. Otherwise, leave the noise un-padded.
start_index (int) – The index in the noise waveforms to start from. By default, chooses a random index in [0, len(noise) - len(waveforms)].
normalize (bool) – If True, output noisy signals that exceed [-1,1] will be normalized to [-1,1].
noise_funct (funct object) – Function used to draw a noise sample. It is used only when no csv file containing noise sequences is provided. By default, torch.randn_like is used (to sample white noise). In general, it must be a function that takes the original waveform as input and returns a tensor with the corresponding noise to add (e.g., see pink_noise_like).
replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
noise_sample_rate (int) – The sample rate of the noise audio signals, so noise can be resampled to the clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean audio signals, so noise can be resampled to the clean sample rate if necessary.
Example
>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import AddNoise
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> noisifier = AddNoise('tests/samples/annotation/noise.csv',
...                      replacements={'noise_folder': 'tests/samples/noise'})
>>> noisy = noisifier(clean, torch.ones(1))
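The role of snr_low and snr_high can be illustrated with a small PyTorch sketch of mixing at a target signal-to-noise ratio. This is only an illustration under stated assumptions, not SpeechBrain's implementation: the helper name mix_at_snr is hypothetical, and measuring amplitude as RMS is an assumption made for the example.

```python
import torch


def mix_at_snr(clean, noise, snr_db):
    """Hypothetical helper: add `noise` to `clean` at a target SNR (in dB).

    Both tensors have shape [batch, time].
    """
    # RMS amplitude per batch item, kept as [batch, 1] for broadcasting.
    clean_rms = clean.pow(2).mean(dim=1, keepdim=True).sqrt()
    noise_rms = noise.pow(2).mean(dim=1, keepdim=True).sqrt().clamp(min=1e-14)
    # SNR_dB = 20 * log10(clean_rms / noise_rms), solved for the noise RMS.
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    return clean + noise * (target_noise_rms / noise_rms)


torch.manual_seed(0)
clean = torch.randn(2, 16000)
noise = torch.randn_like(clean)  # white noise, as when no csv_file is given
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

A higher SNR value means quieter noise relative to the clean signal; at 0 dB both have the same RMS amplitude.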
- class speechbrain.augment.time_domain.AddReverb(csv_file, sorting='random', num_workers=0, rir_scale_factor=1.0, replacements={}, reverb_sample_rate=16000, clean_sample_rate=16000)[source]
Bases: Module
This class convolves an audio signal with an impulse response.
- Parameters:
csv_file (str) – The name of a csv file containing the location of the impulse response files.
sorting (str) – The order to iterate the csv file, from one of the following options: random, original, ascending, and descending.
num_workers (int) – Number of workers in the DataLoader (See PyTorch DataLoader docs).
rir_scale_factor (float) – It compresses or dilates the given impulse response. If 0 < scale_factor < 1, the impulse response is compressed (less reverb), while if scale_factor > 1 it is dilated (more reverb).
replacements (dict) – A set of string replacements to carry out in the csv file. Each time a key is found in the text, it will be replaced with the corresponding value.
reverb_sample_rate (int) – The sample rate of the corruption signals (rirs), so that they can be resampled to clean sample rate if necessary.
clean_sample_rate (int) – The sample rate of the clean signals, so that the corruption signals can be resampled to the clean sample rate before convolution.
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import AddReverb
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clean = signal.unsqueeze(0) # [batch, time]
>>> reverb = AddReverb('tests/samples/annotation/RIRs.csv',
...                    replacements={'rir_folder': 'tests/samples/RIRs'})
>>> reverbed = reverb(clean)
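Convolving a signal with an impulse response can be sketched with an FFT-based linear convolution. The helper convolve_with_rir and the toy impulse response are hypothetical, and the sketch leaves out the resampling, rir_scale_factor handling, and any amplitude rescaling that the class may perform.

```python
import torch


def convolve_with_rir(waveform, rir):
    """Hypothetical helper: linear convolution via FFT, cut to the input length.

    waveform: [batch, time], rir: [rir_len].
    """
    n = waveform.shape[1] + rir.shape[0] - 1  # length of the full convolution
    spec = torch.fft.rfft(waveform, n=n) * torch.fft.rfft(rir, n=n)
    wet = torch.fft.irfft(spec, n=n)
    return wet[:, : waveform.shape[1]]


# Toy impulse response: direct path plus one echo, 100 samples later, at half amplitude.
rir = torch.zeros(200)
rir[0] = 1.0
rir[100] = 0.5

torch.manual_seed(0)
dry = torch.randn(1, 1000)
wet = convolve_with_rir(dry, rir)
```

With this toy response, each output sample equals the dry sample plus half of the sample 100 steps earlier.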
- class speechbrain.augment.time_domain.SpeedPerturb(orig_freq, speeds=[90, 100, 110])[source]
Bases: Module
Slightly speed up or slow down an audio signal.
Resample the audio signal at a rate that is similar to the original rate, to achieve a slightly slower or slightly faster signal. This technique is outlined in the paper: “Audio Augmentation for Speech Recognition”
- Parameters:
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import SpeedPerturb
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> perturbator = SpeedPerturb(orig_freq=16000, speeds=[90])
>>> clean = signal.unsqueeze(0)
>>> perturbed = perturbator(clean)
>>> clean.shape
torch.Size([1, 52173])
>>> perturbed.shape
torch.Size([1, 46956])
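The output length in the example above is consistent with treating each entry of speeds as a percentage of the original sampling rate and rounding the resampled length up. The sketch below reproduces that arithmetic. Both helper names are hypothetical, and the rounding rule is an assumption inferred from the shapes shown, not a statement about the library's internals.

```python
import math


def perturbed_freq(orig_freq, speed_percent):
    """Hypothetical helper: target rate when `speed_percent` is a percentage."""
    return orig_freq * speed_percent // 100


def perturbed_length(num_samples, speed_percent):
    """Hypothetical helper: resampled length, assuming the resampler rounds up."""
    return math.ceil(num_samples * speed_percent / 100)


for speed in (90, 100, 110):
    print(speed, perturbed_freq(16000, speed), perturbed_length(52173, speed))
```

For speed 90 this gives a target rate of 14400 Hz and 46956 samples, matching perturbed.shape in the example.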
- class speechbrain.augment.time_domain.Resample(orig_freq=16000, new_freq=16000, *args, **kwargs)[source]
Bases: Module
This class resamples audio using the torchaudio resampler based on sinc interpolation.
- Parameters:
orig_freq (int) – the sampling frequency of the input signal.
new_freq (int) – the new sampling frequency after this operation is performed.
*args – additional arguments forwarded to the torchaudio.transforms.Resample constructor.
**kwargs – additional keyword arguments forwarded to the torchaudio.transforms.Resample constructor.
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import Resample
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> resampler = Resample(orig_freq=16000, new_freq=8000)
>>> resampled = resampler(signal)
>>> signal.shape
torch.Size([1, 52173])
>>> resampled.shape
torch.Size([1, 26087])
- class speechbrain.augment.time_domain.DropFreq(drop_freq_low=1e-14, drop_freq_high=1, drop_freq_count_low=1, drop_freq_count_high=3, drop_freq_width=0.05)[source]
Bases: Module
This class drops a random frequency from the signal.
The purpose of this class is to teach models to rely on all parts of the signal, not just a few frequency bands.
- Parameters:
drop_freq_low (float) – The low end of frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_high (float) – The high end of frequencies that can be dropped, as a fraction of the sampling rate / 2.
drop_freq_count_low (int) – The low end of number of frequencies that could be dropped.
drop_freq_count_high (int) – The high end of number of frequencies that could be dropped.
drop_freq_width (float) – The width of the frequency band to drop, as a fraction of the sampling rate / 2.
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import DropFreq
>>> dropper = DropFreq()
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> dropped_signal = dropper(signal.unsqueeze(0))
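The idea of removing a narrow band, with frequencies expressed as fractions of the Nyquist frequency (sampling rate / 2), can be illustrated with a plain FFT mask. This sketch deliberately swaps in spectral masking as the simplest stand-in and makes no claim about how DropFreq filters the signal; the helper name drop_band is hypothetical.

```python
import math

import torch


def drop_band(waveform, center, width=0.05):
    """Hypothetical helper: zero a band of normalized frequencies.

    `center` and `width` are fractions of the Nyquist frequency.
    waveform: [batch, time].
    """
    spec = torch.fft.rfft(waveform, dim=-1)
    # Normalized frequency of each rFFT bin: 0 at DC, 1 at Nyquist.
    freqs = torch.linspace(0.0, 1.0, spec.shape[-1])
    keep = (freqs - center).abs() > width / 2
    spec = spec * keep.to(waveform.dtype)
    return torch.fft.irfft(spec, n=waveform.shape[-1], dim=-1)


n = 1600
t = torch.arange(n) / n
low = torch.sin(2 * math.pi * 200 * t)   # 0.25 of Nyquist for this window
high = torch.sin(2 * math.pi * 600 * t)  # 0.75 of Nyquist for this window
filtered = drop_band((low + high).unsqueeze(0), center=0.75)
```

After masking the band around 0.75, the result is approximately the low tone alone.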
- class speechbrain.augment.time_domain.DropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=3, drop_start=0, drop_end=None, noise_factor=0.0)[source]
Bases: Module
This class drops portions of the input signal.
Using DropChunk as an augmentation strategy helps a model learn to rely on all parts of the signal, since it can’t expect a given part to be present.
- Parameters:
drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.
drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.
drop_count_low (int) – The low end of number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping will be allowed.
drop_end (int) – The last index for which dropping will be allowed.
noise_factor (float) – The factor relative to average amplitude of an utterance to use for scaling the white noise inserted. 1 keeps the average amplitude the same, while 0 inserts all 0’s.
Example
>>> import torch
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import DropChunk
>>> dropper = DropChunk(drop_start=100, drop_end=200, noise_factor=0.)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0) # [batch, time]
>>> length = torch.ones(1)
>>> dropped_signal = dropper(signal, length)
>>> float(dropped_signal[:, 150])
0.0
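How the length and count ranges interact can be seen in a simplified sketch that zeroes a random number of random-length chunks per utterance. The helper drop_chunks is hypothetical, and it ignores drop_start, drop_end, noise_factor, and relative lengths, all of which the class supports.

```python
import torch


def drop_chunks(waveform, length_low=100, length_high=1000, count_low=1, count_high=3):
    """Hypothetical helper: zero random chunks of a [batch, time] tensor."""
    out = waveform.clone()
    batch, time = out.shape
    for i in range(batch):
        # torch.randint excludes the upper bound, hence the "+ 1".
        n_chunks = int(torch.randint(count_low, count_high + 1, (1,)))
        for _ in range(n_chunks):
            chunk_len = int(torch.randint(length_low, length_high + 1, (1,)))
            start = int(torch.randint(0, max(1, time - chunk_len), (1,)))
            out[i, start : start + chunk_len] = 0.0
    return out


torch.manual_seed(0)
signal = torch.rand(4, 16000) + 0.1  # strictly positive, so zeros mark dropped samples
dropped = drop_chunks(signal)
```

Chunks may overlap, so the number of zeroed samples per utterance lies between the shortest single chunk and the sum of the longest chunks.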
- class speechbrain.augment.time_domain.FastDropChunk(drop_length_low=100, drop_length_high=1000, drop_count_low=1, drop_count_high=10, drop_start=0, drop_end=None, n_masks=1000)[source]
Bases: Module
This class drops portions of the input signal. The difference from DropChunk is that the dropping masks are pre-computed the first time the forward function is called. For all subsequent calls, the masks are only shuffled and applied. This makes the code faster and more suitable for data augmentation of large batches.
It can be used only for fixed-length sequences.
- Parameters:
drop_length_low (int) – The low end of lengths for which to set the signal to zero, in samples.
drop_length_high (int) – The high end of lengths for which to set the signal to zero, in samples.
drop_count_low (int) – The low end of number of times that the signal can be dropped to zero.
drop_count_high (int) – The high end of number of times that the signal can be dropped to zero.
drop_start (int) – The first index for which dropping will be allowed.
drop_end (int) – The last index for which dropping will be allowed.
n_masks (int) – The number of precomputed masks.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import FastDropChunk
>>> dropper = FastDropChunk(drop_start=100, drop_end=200)
>>> signal = torch.rand(10, 250, 22)
>>> dropped_signal = dropper(signal)
- initialize_masks(waveforms)[source]
- waveforms: tensor
Shape should be [batch, time] or [batch, time, channels].
- dropped_masks: tensor
Tensor of size [n_masks, time] with the dropped chunks. Dropped regions are assigned to 0.
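The precompute-then-reuse pattern described for FastDropChunk can be sketched as follows. The class PrecomputedDropMasks is hypothetical and much simpler than the real module: it uses one fixed-length chunk per mask and assumes the batch size does not exceed the number of masks.

```python
import torch


class PrecomputedDropMasks:
    """Hypothetical sketch: build masks on the first call, then only reuse them."""

    def __init__(self, n_masks=1000, drop_length=50):
        self.n_masks = n_masks
        self.drop_length = drop_length
        self.masks = None

    def __call__(self, waveforms):
        """waveforms: [batch, time] with a fixed time length across calls."""
        time = waveforms.shape[1]
        if self.masks is None:  # executed on the first call only
            self.masks = torch.ones(self.n_masks, time)
            starts = torch.randint(0, time - self.drop_length, (self.n_masks,))
            for m, s in enumerate(starts.tolist()):
                self.masks[m, s : s + self.drop_length] = 0.0
        # Later calls only pick a random subset of the stored masks.
        picks = torch.randperm(self.n_masks)[: waveforms.shape[0]]
        return waveforms * self.masks[picks]


torch.manual_seed(0)
dropper = PrecomputedDropMasks(n_masks=100)
out = dropper(torch.ones(8, 250))
```

Because the masks have a fixed time dimension, this approach only works when every batch has the same sequence length, which matches the restriction stated above.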
- class speechbrain.augment.time_domain.DoClip(clip_low=0.5, clip_high=0.5)[source]
Bases: Module
This function mimics audio clipping by clamping the input tensor. First, it normalizes the waveforms to the range from -1 to 1. Then, clipping is applied. Finally, the original amplitude is restored.
- Parameters:
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import DoClip
>>> clipper = DoClip(clip_low=0.01, clip_high=0.01)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> clipped_signal = clipper(signal.unsqueeze(0))
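The three steps named in the description (normalize, clip, restore) can be sketched as below. The helper clip_waveform is hypothetical and uses a fixed clip value, and restoring the amplitude by multiplying back by the original peak is an assumption; the class may restore the scale differently.

```python
import torch


def clip_waveform(waveform, clip_value=0.5):
    """Hypothetical helper: waveform is [batch, time], clip_value in (0, 1]."""
    # 1) Normalize each waveform so that its peak absolute value is 1.
    peak = waveform.abs().amax(dim=1, keepdim=True).clamp(min=1e-14)
    normalized = waveform / peak
    # 2) Clamp the normalized signal.
    clipped = normalized.clamp(-clip_value, clip_value)
    # 3) Undo the normalization (assumed meaning of "restore").
    return clipped * peak


signal = torch.tensor([[0.0, 1.0, -2.0, 4.0]])
print(clip_waveform(signal, clip_value=0.5).tolist())  # [[0.0, 1.0, -2.0, 2.0]]
```

Only the sample whose normalized magnitude exceeds the clip value (4.0, normalized to 1.0) is changed.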
- class speechbrain.augment.time_domain.RandAmp(amp_low=0.5, amp_high=1.5)[source]
Bases: Module
This function multiplies the signal by a random amplitude. First, the signal is normalized to have amplitude between -1 and 1. Then it is multiplied by a random number.
- Parameters:
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> from speechbrain.augment.time_domain import RandAmp
>>> rand_amp = RandAmp(amp_low=0.25, amp_high=1.75)
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> output_signal = rand_amp(signal.unsqueeze(0))
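A minimal sketch of the normalize-then-scale behaviour described above follows. The helper random_amplitude is hypothetical, and drawing one uniform gain per batch item is an assumption made for the illustration.

```python
import torch


def random_amplitude(waveform, amp_low=0.5, amp_high=1.5):
    """Hypothetical helper: waveform is [batch, time]."""
    # Normalize so that the peak absolute value of each item is 1.
    peak = waveform.abs().amax(dim=1, keepdim=True).clamp(min=1e-14)
    normalized = waveform / peak
    # One gain per batch item, uniform in [amp_low, amp_high).
    gain = torch.rand(waveform.shape[0], 1) * (amp_high - amp_low) + amp_low
    return normalized * gain


torch.manual_seed(0)
x = torch.randn(8, 1000)
y = random_amplitude(x, amp_low=0.25, amp_high=1.75)
```

After this transform the peak of each item equals its sampled gain, so it falls between amp_low and amp_high.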
- class speechbrain.augment.time_domain.ChannelDrop(drop_rate=0.1)[source]
Bases: Module
This function drops random channels in the multi-channel input waveform.
- Parameters:
drop_rate (float) – The channel dropout factor.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import ChannelDrop
>>> signal = torch.rand(4, 256, 8)
>>> ch_drop = ChannelDrop(drop_rate=0.5)
>>> output_signal = ch_drop(signal)
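Channel dropout can be sketched as a random binary mask over the channel axis. The helper drop_channels is hypothetical, and drawing an independent mask for every batch item is an assumption; the class may instead share one mask across the batch.

```python
import torch


def drop_channels(waveform, drop_rate=0.1):
    """Hypothetical helper: waveform is [batch, time, channels]."""
    batch, _, channels = waveform.shape
    # A channel is kept when its random draw is at least drop_rate.
    keep = (torch.rand(batch, 1, channels) >= drop_rate).to(waveform.dtype)
    return waveform * keep  # the mask broadcasts over the time axis


torch.manual_seed(0)
x = torch.ones(4, 256, 8)
y = drop_channels(x, drop_rate=0.5)
```

Each channel of each item is either left untouched or zeroed over its whole duration.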
- class speechbrain.augment.time_domain.ChannelSwap(min_swap=0, max_swap=0)[source]
Bases: Module
This function randomly swaps N channels.
- Parameters:
Example
>>> import torch
>>> from speechbrain.augment.time_domain import ChannelSwap
>>> signal = torch.rand(4, 256, 8)
>>> ch_swap = ChannelSwap()
>>> output_signal = ch_swap(signal)
- class speechbrain.augment.time_domain.CutCat(min_num_segments=2, max_num_segments=10)[source]
Bases: Module
This function combines segments (with equal length in time) of the time series contained in the batch. Proposed for EEG signals in https://doi.org/10.1016/j.neunet.2021.05.032.
- Parameters:
Example
>>> import torch
>>> from speechbrain.augment.time_domain import CutCat
>>> signal = torch.ones((4, 256, 22)) * torch.arange(4).reshape((4, 1, 1))
>>> cutcat = CutCat()
>>> output_signal = cutcat(signal)
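One way to picture combining equal-length segments across a batch is to cut the time axis into segments and fill each segment from a randomly permuted batch order. This is a loose sketch of the idea, not the procedure from the cited paper or the class itself; the helper cut_and_concat and its per-segment permutation are assumptions.

```python
import torch


def cut_and_concat(batch, num_segments=4):
    """Hypothetical helper: batch is [batch, time, channels]."""
    seg_len = batch.shape[1] // num_segments
    out = batch.clone()
    for s in range(num_segments):
        # Every segment takes its content from a fresh permutation of the items.
        perm = torch.randperm(batch.shape[0])
        start, stop = s * seg_len, (s + 1) * seg_len
        out[:, start:stop] = batch[perm, start:stop]
    return out  # samples past num_segments * seg_len are left unchanged


torch.manual_seed(0)
sig = torch.ones((4, 256, 22)) * torch.arange(4).reshape((4, 1, 1))
mixed = cut_and_concat(sig, num_segments=4)
```

Using a batch where item i is filled with the value i, as in the example above, makes it easy to see which item each segment came from.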
- speechbrain.augment.time_domain.pink_noise_like(waveforms, alpha_low=1.0, alpha_high=1.0, sample_rate=50)[source]
Creates a sequence of pink noise (also known as 1/f). The pink noise is obtained by multiplying the spectrum of a white noise sequence by a factor (1/f^alpha). The alpha factor controls the decrease factor in the frequency domain (alpha=0 adds white noise, alpha>>0 adds low frequency noise). It is randomly sampled between alpha_low and alpha_high. With negative alpha this function generates blue noise.
- Parameters:
waveforms (torch.Tensor) – The original waveform. It is just used to infer the shape.
alpha_low (float) – The minimum value for the alpha spectral smoothing factor.
alpha_high (float) – The maximum value for the alpha spectral smoothing factor.
sample_rate (float) – The sample rate of the original signal.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import pink_noise_like
>>> waveforms = torch.randn(4, 257, 10)
>>> noise = pink_noise_like(waveforms)
>>> noise.shape
torch.Size([4, 257, 10])
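The spectral shaping described above (white noise whose spectrum is multiplied by 1/f^alpha) can be sketched as follows. The helper shaped_noise_like is hypothetical, uses a fixed alpha instead of sampling it, and uses the FFT bin index as a stand-in for frequency, so its output scale will not match pink_noise_like.

```python
import torch


def shaped_noise_like(waveform, alpha=1.0):
    """Hypothetical helper: waveform is [batch, time]; only its shape is used."""
    white = torch.randn_like(waveform)
    spec = torch.fft.rfft(white, dim=1)
    # Bin index as frequency; starting at 1 avoids dividing by zero at DC.
    f = torch.arange(1, spec.shape[1] + 1, dtype=waveform.dtype)
    shaped = spec / f.pow(alpha)
    return torch.fft.irfft(shaped, n=waveform.shape[1], dim=1)


torch.manual_seed(0)
template = torch.zeros(4, 4096)
noise = shaped_noise_like(template, alpha=1.0)
```

With alpha=0 the divisor is 1 everywhere and the output is plain white noise; larger alpha values concentrate energy at low frequencies.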
- class speechbrain.augment.time_domain.DropBitResolution(target_dtype='random')[source]
Bases: Module
This class transforms a float32 tensor into a lower resolution one (e.g., int16, int8, float16) and then converts it back to a float32. This process loses information and can be used for data augmentation.
- Parameters:
target_dtype (str) – One of “int16”, “int8”, “float16”. If “random”, the bit resolution is randomly selected among the options listed above.
Example
>>> import torch
>>> from speechbrain.augment.time_domain import DropBitResolution
>>> dropper = DropBitResolution()
>>> signal = torch.rand(4, 16000)
>>> signal_dropped = dropper(signal)
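The information loss from a lower-resolution round trip can be seen in a small sketch for the int16 case. The helper roundtrip_int16 is hypothetical and assumes the input lies in [-1, 1]; the scaling used by the class may differ.

```python
import torch


def roundtrip_int16(waveform):
    """Hypothetical helper: float32 in [-1, 1] -> int16 -> float32."""
    scale = torch.iinfo(torch.int16).max  # 32767
    quantized = (waveform * scale).round().to(torch.int16)
    return quantized.to(torch.float32) / scale


torch.manual_seed(0)
signal = torch.rand(4, 16000) * 2 - 1
restored = roundtrip_int16(signal)
max_error = (restored - signal).abs().max()
```

The result is float32 again, but every sample has been snapped to one of the int16 levels, so it differs from the input by at most about half a quantization step.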