speechbrain.augment.freq_domain module

Frequency-Domain Sequential Data Augmentation Classes

This module comprises classes tailored for augmenting sequential data in the frequency domain, such as spectrograms and mel spectrograms. Its primary purpose is to enhance the resilience of neural models during the training process.

Authors:

  • Peter Plantinga (2020)

  • Mirco Ravanelli (2023)

Summary

Classes:

RandomShift

Shifts the input tensor by a random amount, allowing for either a time or frequency (or channel) shift depending on the specified axis.

SpectrogramDrop

This class drops slices of the input spectrogram.

Warping

Apply time or frequency warping to a spectrogram.

Reference

class speechbrain.augment.freq_domain.SpectrogramDrop(drop_length_low=5, drop_length_high=15, drop_count_low=1, drop_count_high=3, replace='zeros', dim=1)[source]

Bases: Module

This class drops slices of the input spectrogram.

Using SpectrogramDrop as an augmentation strategy helps a model learn to rely on all parts of the signal, since it cannot expect any given part to be present.

Reference:

https://arxiv.org/abs/1904.08779

Parameters:
  • drop_length_low (int) – The low end of lengths of slices to drop from the spectrogram, in bins along the masked dimension.

  • drop_length_high (int) – The high end of lengths of slices to drop from the spectrogram, in bins along the masked dimension.

  • drop_count_low (int) – The low end of the number of slices that can be dropped.

  • drop_count_high (int) – The high end of the number of slices that can be dropped.

  • replace (str) – Specifies how the masked values are filled:

    • ‘zeros’: Masked values are replaced with zeros.

    • ‘mean’: Masked values are replaced with the mean value of the spectrogram.

    • ‘rand’: Masked values are replaced with random numbers ranging between the maximum and minimum values of the spectrogram.

    • ‘cutcat’: Masked values are replaced with chunks from other signals in the batch.

    • ‘swap’: Masked values are replaced with other chunks from the same sentence.

    • ‘random_selection’: A random selection among the approaches above.

  • dim (int) – Corresponding dimension to mask. If dim=1, we apply time masking. If dim=2, we apply frequency masking.

Example

>>> import torch
>>> from speechbrain.augment.freq_domain import SpectrogramDrop
>>> # time-masking
>>> drop = SpectrogramDrop(dim=1)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = drop(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
>>> # frequency-masking
>>> drop = SpectrogramDrop(dim=2)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = drop(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
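The masking step itself is easy to sketch in plain PyTorch. The helper below is illustrative only, not the SpeechBrain implementation: `drop_time_slices` is a hypothetical name, and it assumes a fixed drop length and count instead of sampling them from the low/high ranges described above.

```python
import torch

def drop_time_slices(spectrogram, drop_length=10, drop_count=2, replace="zeros"):
    """Illustrative SpecAugment-style time masking (not the SpeechBrain
    implementation): fill random time slices with zeros or the mean."""
    batch, time_steps, _ = spectrogram.shape
    out = spectrogram.clone()
    fill = 0.0 if replace == "zeros" else spectrogram.mean()
    for b in range(batch):
        for _ in range(drop_count):
            # Pick a random start so the slice fits inside the time axis.
            start = torch.randint(0, time_steps - drop_length + 1, (1,)).item()
            out[b, start:start + drop_length, :] = fill
    return out

spec = torch.rand(4, 150, 40)
masked = drop_time_slices(spec, drop_length=10, drop_count=2)
print(masked.shape)  # torch.Size([4, 150, 40])
```

Masking along dim=2 (frequency) works the same way, with the slice taken over the feature axis instead of the time axis.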
forward(spectrogram)[source]

Apply the SpectrogramDrop augmentation to the input spectrogram.

This method randomly drops chunks of the input spectrogram to augment the data.

Parameters:

spectrogram (torch.Tensor) – Input spectrogram of shape [batch, time, fea].

Returns:

Augmented spectrogram of shape [batch, time, fea].

Return type:

torch.Tensor

training: bool
class speechbrain.augment.freq_domain.Warping(warp_window=5, warp_mode='bicubic', dim=1)[source]

Bases: Module

Apply time or frequency warping to a spectrogram.

If dim=1, time warping is applied; if dim=2, frequency warping is applied. The implementation selects a random center and a window length, then upsamples one side of the center and downsamples the other so that the size of the warped dimension remains unchanged.

Reference:

https://arxiv.org/abs/1904.08779

Parameters:
  • warp_window (int, optional) – The width of the warping window. Default is 5.

  • warp_mode (str, optional) – The interpolation mode used for warping. Default is “bicubic”.

  • dim (int, optional) – Dimension along which to apply warping (1 for time, 2 for frequency). Default is 1.

Example

>>> import torch
>>> from speechbrain.augment.freq_domain import Warping
>>> # Time-warping
>>> warp = Warping()
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = warp(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
>>> # Frequency-warping
>>> warp = Warping(dim=2)
>>> spectrogram = torch.rand(4, 150, 40)
>>> print(spectrogram.shape)
torch.Size([4, 150, 40])
>>> out = warp(spectrogram)
>>> print(out.shape)
torch.Size([4, 150, 40])
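The warping idea can be sketched with `torch.nn.functional.interpolate`. The helper below (`time_warp` is a hypothetical name, not the SpeechBrain implementation) picks a random center, moves it by at most `warp_window` steps, and resamples the two sides so the total length is preserved:

```python
import torch
import torch.nn.functional as F

def time_warp(spectrogram, warp_window=5):
    """Illustrative SpecAugment-style time warping (not the SpeechBrain
    implementation): stretch one side of a random center and compress
    the other so the time dimension keeps its original size."""
    batch, time_steps, fea = spectrogram.shape
    # interpolate expects [batch, channels, width]; treat features as channels.
    x = spectrogram.transpose(1, 2)  # [batch, fea, time]
    center = torch.randint(warp_window + 1, time_steps - warp_window, (1,)).item()
    warped = center + torch.randint(-warp_window, warp_window + 1, (1,)).item()
    left = F.interpolate(x[:, :, :center], size=warped,
                         mode="linear", align_corners=False)
    right = F.interpolate(x[:, :, center:], size=time_steps - warped,
                          mode="linear", align_corners=False)
    return torch.cat([left, right], dim=2).transpose(1, 2)

spec = torch.rand(4, 150, 40)
out = time_warp(spec)
print(out.shape)  # torch.Size([4, 150, 40])
```

Because the left and right segments are resampled to complementary lengths, the output always matches the input shape, which is what allows warping to be applied without touching the labels.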
forward(spectrogram)[source]

Apply warping to the input spectrogram.

Parameters:

spectrogram (torch.Tensor) – Input spectrogram with shape [batch, time, fea].

Returns:

Augmented spectrogram with shape [batch, time, fea].

Return type:

torch.Tensor

training: bool
class speechbrain.augment.freq_domain.RandomShift(min_shift=0, max_shift=0, dim=1)[source]

Bases: Module

Shifts the input tensor by a random amount, applying a time, frequency, or channel shift depending on the specified axis. It is crucial to calibrate the minimum and maximum shifts to the requirements of your specific task. We recommend small shifts, which preserve information integrity; large shifts may discard significant information and lead to misalignment with the corresponding labels.

Parameters:
  • min_shift (int) – The minimum shift, in steps along the chosen dimension.

  • max_shift (int) – The maximum shift, in steps along the chosen dimension.

  • dim (int) – The dimension to shift.

Example

>>> import torch
>>> from speechbrain.augment.freq_domain import RandomShift
>>> # time shift
>>> signal = torch.zeros(4, 100, 80)
>>> signal[0, 50, :] = 1
>>> rand_shift = RandomShift(dim=1, min_shift=-10, max_shift=10)
>>> lengths = torch.tensor([0.2, 0.8, 0.9, 1.0])
>>> output_signal, lengths = rand_shift(signal, lengths)
>>> # frequency shift
>>> signal = torch.zeros(4, 100, 80)
>>> signal[0, :, 40] = 1
>>> rand_shift = RandomShift(dim=2, min_shift=-10, max_shift=10)
>>> lengths = torch.tensor([0.2, 0.8, 0.9, 1.0])
>>> output_signal, lengths = rand_shift(signal, lengths)
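The shift itself can be sketched with `torch.roll`. The helper below is a hypothetical `random_shift` function, not the SpeechBrain implementation (which also takes and returns the lengths tensor): it rolls the tensor along the chosen dimension and zeros the wrapped-around region so content does not leak from one edge to the other.

```python
import torch

def random_shift(x, min_shift=-10, max_shift=10, dim=1):
    """Illustrative random shift (not the SpeechBrain implementation):
    shift along `dim`, zero-filling the vacated region."""
    shift = torch.randint(min_shift, max_shift + 1, (1,)).item()
    out = torch.roll(x, shifts=shift, dims=dim)
    # Zero the values that wrapped around during the roll.
    if shift > 0:
        out.narrow(dim, 0, shift).zero_()
    elif shift < 0:
        out.narrow(dim, x.size(dim) + shift, -shift).zero_()
    return out, shift

signal = torch.zeros(4, 100, 80)
signal[0, 50, :] = 1  # impulse at time step 50
shifted, shift = random_shift(signal, min_shift=-10, max_shift=10, dim=1)
# The impulse originally at time step 50 now sits at 50 + shift.
print(shifted[0, 50 + shift, :].sum())  # tensor(80.)
```

This also illustrates why large shifts are risky: everything pushed past the tensor boundary is replaced with zeros and permanently lost.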
forward(waveforms, lengths)[source]

Apply a random shift to the input tensor along the configured dimension.

Parameters:
  • waveforms (torch.Tensor) – Shape should be [batch, time] or [batch, time, channels].

  • lengths (torch.Tensor) – Shape should be a single dimension, [batch].

Return type:

Tuple of the shifted tensor, with shape [batch, time] or [batch, time, channels], and the lengths tensor.

training: bool