speechbrain.augment.freq_domain module
Frequency-Domain Sequential Data Augmentation Classes
This module comprises classes tailored for augmenting sequential data in the frequency domain, such as spectrograms and mel spectrograms. Its primary purpose is to enhance the resilience of neural models during the training process.
Authors:
- Peter Plantinga (2020)
- Mirco Ravanelli (2023)
Summary

Classes:

- RandomShift — Shifts the input tensor by a random amount, allowing for either a time or frequency (or channel) shift depending on the specified axis.
- SpectrogramDrop — This class drops slices of the input spectrogram.
- Warping — Apply time or frequency warping to a spectrogram.
Reference
- class speechbrain.augment.freq_domain.SpectrogramDrop(drop_length_low=5, drop_length_high=15, drop_count_low=1, drop_count_high=3, replace='zeros', dim=1)[source]
Bases:
Module
This class drops slices of the input spectrogram.
Using SpectrogramDrop as an augmentation strategy helps a model learn to rely on all parts of the signal, since it can't expect a given part to be present.

- Reference:
- Parameters:
drop_length_low (int) – The low end of lengths for which to drop the spectrogram, in samples.
drop_length_high (int) – The high end of lengths for which to drop the signal, in samples.
drop_count_low (int) – The low end of number of times that the signal can be dropped.
drop_count_high (int) – The high end of number of times that the signal can be dropped.
replace (str) –

- 'zeros': Masked values are replaced with zeros.
- 'mean': Masked values are replaced with the mean value of the spectrogram.
- 'rand': Masked values are replaced with random numbers ranging between the maximum and minimum values of the spectrogram.
- 'cutcat': Masked values are replaced with chunks from other signals in the batch.
- 'swap': Masked values are replaced with other chunks from the same sentence.
- 'random_selection': A random selection among the approaches above.
dim (int) – Corresponding dimension to mask. If dim=1, we apply time masking. If dim=2, we apply frequency masking.
Example
>>> # time-masking
>>> drop = SpectrogramDrop(dim=1)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = drop(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
>>> # frequency-masking
>>> drop = SpectrogramDrop(dim=2)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = drop(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
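To illustrate the idea behind the class, here is a minimal, hypothetical sketch of time masking in plain torch (not the library's implementation): a random chunk along the time axis is selected and replaced with zeros or with the spectrogram's mean, mirroring the 'zeros' and 'mean' replace modes described above. The function name `drop_time_chunk` is an assumption for illustration only.

```python
import torch

def drop_time_chunk(spec, drop_length_low=5, drop_length_high=15, replace="zeros"):
    """Sketch: mask one random time slice of a [batch, time, fea] spectrogram.

    This is an illustrative simplification, not the SpeechBrain implementation.
    """
    batch, time, _ = spec.shape
    out = spec.clone()
    for b in range(batch):
        # Pick a random mask length and starting position per batch element.
        length = torch.randint(drop_length_low, drop_length_high + 1, (1,)).item()
        start = torch.randint(0, time - length + 1, (1,)).item()
        if replace == "zeros":
            out[b, start : start + length, :] = 0.0
        elif replace == "mean":
            out[b, start : start + length, :] = spec[b].mean()
    return out

spectrogram = torch.rand(4, 150, 40)
masked = drop_time_chunk(spectrogram, replace="mean")
print(masked.shape)  # torch.Size([4, 150, 40])
```

Note that the shape is preserved: masking only overwrites values, so the augmented spectrogram can be fed to the same model unchanged.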
- forward(spectrogram)[source]
Apply the SpectrogramDrop augmentation to the input spectrogram.
This method randomly drops chunks of the input spectrogram to augment the data.
- Parameters:
spectrogram (torch.Tensor) – Input spectrogram of shape [batch, time, fea].
- Returns:
Augmented spectrogram of shape [batch, time, fea].
- Return type:
torch.Tensor
- class speechbrain.augment.freq_domain.Warping(warp_window=5, warp_mode='bicubic', dim=1)[source]
Bases:
Module
Apply time or frequency warping to a spectrogram.
If dim=1, time warping is applied; if dim=2, frequency warping is applied. This implementation selects a center and a window length to perform warping. It ensures that the temporal dimension remains unchanged by upsampling or downsampling the affected regions accordingly.

- Reference:
- Parameters:
Example
>>> # Time-warping
>>> warp = Warping()
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = warp(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
>>> # Frequency-warping
>>> warp = Warping(dim=2)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = warp(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
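The "upsampling or downsampling the affected regions" step can be sketched with plain torch interpolation (a simplified, hypothetical version, not the library code): the spectrogram is split at a center point, the center is randomly shifted within the warp window, and each half is resampled so the two halves still sum to the original time length. The function name `time_warp` and the use of linear interpolation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def time_warp(spec, warp_window=5):
    """Sketch: warp a [batch, time, fea] spectrogram along the time axis.

    Illustrative only; the actual Warping class may differ (e.g. bicubic mode).
    """
    batch, time, fea = spec.shape
    center = time // 2
    # Shift the warp center by up to warp_window frames in either direction.
    shift = torch.randint(-warp_window, warp_window + 1, (1,)).item()
    new_center = center + shift
    # interpolate expects [batch, channels, length]; treat features as channels.
    x = spec.transpose(1, 2)  # [batch, fea, time]
    left = F.interpolate(x[:, :, :center], size=new_center,
                         mode="linear", align_corners=True)
    right = F.interpolate(x[:, :, center:], size=time - new_center,
                          mode="linear", align_corners=True)
    # The stretched and compressed halves add back up to the original length.
    return torch.cat([left, right], dim=2).transpose(1, 2)

spectrogram = torch.rand(4, 150, 40)
print(time_warp(spectrogram).shape)  # torch.Size([4, 150, 40])
```

Because one half is stretched while the other is compressed by the same number of frames, the output always matches the input shape, which is the invariant the class description emphasizes.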
- class speechbrain.augment.freq_domain.RandomShift(min_shift=0, max_shift=0, dim=1)[source]
Bases:
Module
Shifts the input tensor by a random amount, allowing for either a time or frequency (or channel) shift depending on the specified axis. It is crucial to calibrate the minimum and maximum shifts according to the requirements of your specific task. We recommend using small shifts to preserve information integrity. Using large shifts may result in the loss of significant data and could potentially lead to misalignments with corresponding labels.
- Parameters:
Example
>>> # time shift
>>> signal = torch.zeros(4, 100, 80)
>>> signal[0, 50, :] = 1
>>> rand_shift = RandomShift(dim=1, min_shift=-10, max_shift=10)
>>> lengths = torch.tensor([0.2, 0.8, 0.9, 1.0])
>>> output_signal, lengths = rand_shift(signal, lengths)
>>> # frequency shift
>>> signal = torch.zeros(4, 100, 80)
>>> signal[0, :, 40] = 1
>>> rand_shift = RandomShift(dim=2, min_shift=-10, max_shift=10)
>>> lengths = torch.tensor([0.2, 0.8, 0.9, 1.0])
>>> output_signal, lengths = rand_shift(signal, lengths)
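The core operation can be sketched in a few lines of plain torch (an illustrative simplification, not the library code): draw a shift from [min_shift, max_shift] and roll the tensor along the chosen axis. Note this sketch wraps values circularly via torch.roll and does not update relative lengths, whereas the actual RandomShift also returns adjusted lengths; the function name `random_shift` is an assumption for illustration.

```python
import torch

def random_shift(signal, min_shift=-10, max_shift=10, dim=1):
    """Sketch: circularly shift a tensor by a random amount along one axis.

    Illustrative only; the real class also handles relative lengths and may
    pad rather than wrap.
    """
    shift = torch.randint(min_shift, max_shift + 1, (1,)).item()
    return torch.roll(signal, shifts=shift, dims=dim)

signal = torch.zeros(4, 100, 80)
signal[0, 50, :] = 1
shifted = random_shift(signal, dim=1)
print(shifted.shape)  # torch.Size([4, 100, 80])
```

This also makes the documentation's warning concrete: a large shift moves the impulse at frame 50 far from its original position, so any frame-aligned labels would no longer line up with the shifted signal.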