speechbrain.lobes.augment module
Combinations of processing algorithms to implement common augmentations.
- Examples:
SpecAugment
Environmental corruption (noise, reverberation)
- Authors
Peter Plantinga 2020
Jianyuan Zhong 2020
Summary
Classes:
EnvCorrupt | Environmental Corruptions for speech signals: noise, reverb, babble.
SpecAugment | An implementation of the SpecAugment algorithm.
TimeDomainSpecAugment | A time-domain approximation of the SpecAugment algorithm.
Reference
- class speechbrain.lobes.augment.SpecAugment(time_warp=True, time_warp_window=5, time_warp_mode='bicubic', freq_mask=True, freq_mask_width=(0, 20), n_freq_mask=2, time_mask=True, time_mask_width=(0, 100), n_time_mask=2, replace_with_zero=True)[source]
Bases:
Module
An implementation of the SpecAugment algorithm.
- Reference:
- Parameters
time_warp (bool) – Whether to apply time warping.
time_warp_window (int) – Time warp window.
time_warp_mode (str) – Interpolation mode for time warping (default “bicubic”).
freq_mask (bool) – Whether to apply frequency masking.
n_freq_mask (int) – Number of frequency masks.
time_mask (bool) – Whether to apply time masking.
n_time_mask (int) – Number of time masks.
replace_with_zero (bool) – If True, replace masked values with 0; otherwise replace them with the mean of the input tensor.
Example
>>> import torch
>>> from speechbrain.lobes.augment import SpecAugment
>>> aug = SpecAugment()
>>> a = torch.rand([8, 120, 80])
>>> a = aug(a)
>>> print(a.shape)
torch.Size([8, 120, 80])
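To make the masking parameters concrete, the following is a minimal, dependency-free sketch of frequency and time masking on a single spectrogram. It is not the library's implementation: the helper name `mask_bands` is hypothetical, plain Python lists stand in for tensors, there is no batch dimension, and the inclusive treatment of the width bounds is an assumption. Time warping is not covered.

```python
import random


def mask_bands(spec, axis, width_range, n_mask, replace_with_zero=True, rng=None):
    """Mask `n_mask` random bands of one spectrogram (a list of time frames,
    each a list of frequency bins) along axis "time" or "freq".

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    n_time, n_freq = len(spec), len(spec[0])
    size = n_time if axis == "time" else n_freq
    # Fill value: zero, or the mean over the whole input.
    fill = 0.0 if replace_with_zero else sum(map(sum, spec)) / (n_time * n_freq)
    out = [list(frame) for frame in spec]
    # Clamp the width bounds so a mask never exceeds the axis size.
    low, high = (min(w, size) for w in width_range)
    for _ in range(n_mask):
        width = rng.randint(low, high)  # assumption: both bounds inclusive
        start = rng.randint(0, size - width)
        for t in range(n_time):
            for f in range(n_freq):
                if start <= (t if axis == "time" else f) < start + width:
                    out[t][f] = fill
    return out


spec = [[1.0] * 80 for _ in range(120)]  # one utterance: 120 frames x 80 bins
rng = random.Random(0)
masked = mask_bands(spec, "freq", (0, 20), 2, rng=rng)
masked = mask_bands(masked, "time", (0, 100), 2, rng=rng)
print(len(masked), len(masked[0]))  # 120 80: masking never changes the shape
```

The width ranges and mask counts used here mirror the defaults in the class signature (freq_mask_width=(0, 20), time_mask_width=(0, 100), two masks each).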
- class speechbrain.lobes.augment.TimeDomainSpecAugment(perturb_prob=1.0, drop_freq_prob=1.0, drop_chunk_prob=1.0, speeds=[95, 100, 105], sample_rate=16000, drop_freq_count_low=0, drop_freq_count_high=3, drop_chunk_count_low=0, drop_chunk_count_high=5, drop_chunk_length_low=1000, drop_chunk_length_high=2000, drop_chunk_noise_factor=0)[source]
Bases:
Module
A time-domain approximation of the SpecAugment algorithm.
This augmentation module implements three augmentations in the time domain:
Drop chunks of the audio (zero amplitude or white noise)
Drop frequency bands (with band-drop filters)
Speed perturbation (via resampling to a slightly different rate)
- Parameters
perturb_prob (float from 0 to 1) – The probability that a batch will have speed perturbation applied.
drop_freq_prob (float from 0 to 1) – The probability that a batch will have frequencies dropped.
drop_chunk_prob (float from 0 to 1) – The probability that a batch will have chunks dropped.
speeds (list of ints) – A set of different speeds, given as percentages of the original speed, to use to perturb each batch. See speechbrain.processing.speech_augmentation.SpeedPerturb.
sample_rate (int) – Sampling rate of the input waveforms.
drop_freq_count_low (int) – Lowest number of frequencies that could be dropped.
drop_freq_count_high (int) – Highest number of frequencies that could be dropped.
drop_chunk_count_low (int) – Lowest number of chunks that could be dropped.
drop_chunk_count_high (int) – Highest number of chunks that could be dropped.
drop_chunk_length_low (int) – Lowest length of chunks that could be dropped.
drop_chunk_length_high (int) – Highest length of chunks that could be dropped.
drop_chunk_noise_factor (float) – The noise factor used to scale the white noise inserted, relative to the average amplitude of the utterance. Default 0 (no noise inserted).
Example
>>> import torch
>>> from speechbrain.lobes.augment import TimeDomainSpecAugment
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = TimeDomainSpecAugment(speeds=[80])
>>> feats = feature_maker(inputs, torch.ones(10))
>>> feats.shape
torch.Size([10, 12800])
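The output shape in the example follows from the speed value: with speeds=[80], the 16000-sample inputs come out with 12800 samples. The arithmetic can be sketched as below. The helper name `perturbed_length` is hypothetical, and the rule (length scaled by speed / 100) is inferred from the example; for other speeds the exact length produced by the resampler may differ by rounding.

```python
def perturbed_length(n_samples, speed_percent):
    """Approximate number of samples after speed perturbation.

    Assumes the length scales by speed_percent / 100, which is consistent
    with the documented example (16000 samples at speed 80).
    """
    return n_samples * speed_percent // 100


print(perturbed_length(16000, 80))  # 12800, matching the example above

# Lengths to expect for the default speeds=[95, 100, 105] under the same assumption.
for speed in (95, 100, 105):
    print(speed, perturbed_length(16000, speed))
```

Because the number of samples can change, code that consumes the output should read the length from the returned tensor instead of assuming it equals the input length.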
- forward(waveforms, lengths)[source]
Returns the distorted waveforms.
- Parameters
waveforms (torch.Tensor) – The waveforms to distort.
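How the drop_chunk_* parameters interact can be illustrated with a small standalone sketch. This is not the library's code: `drop_chunks` is a hypothetical helper that works on a plain Python list, samples the chunk count and chunk lengths uniformly with inclusive bounds (an assumption), and always fills with zeros, which corresponds to drop_chunk_noise_factor=0.

```python
import random


def drop_chunks(waveform, count_low=0, count_high=5,
                length_low=1000, length_high=2000, rng=None):
    """Return a copy of `waveform` with a random number of chunks zeroed out.

    Hypothetical helper for illustration; lengths are in samples.
    """
    rng = rng or random.Random()
    out = list(waveform)
    n_chunks = rng.randint(count_low, count_high)
    for _ in range(n_chunks):
        # Never drop more samples than the waveform contains.
        length = min(rng.randint(length_low, length_high), len(out))
        start = rng.randint(0, len(out) - length)
        out[start:start + length] = [0.0] * length
    return out


wave = [1.0] * 16000  # one second of audio at 16 kHz
dropped = drop_chunks(wave, rng=random.Random(0))
print(len(dropped), dropped.count(0.0))  # length is unchanged; zeros mark dropped samples
```

With the default bounds, at most 5 chunks of at most 2000 samples each are removed, so no more than 10000 of the 16000 samples are zeroed; chunks may overlap, in which case fewer samples are affected.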
- class speechbrain.lobes.augment.EnvCorrupt(reverb_prob=1.0, babble_prob=1.0, noise_prob=1.0, openrir_folder=None, openrir_max_noise_len=None, reverb_csv=None, noise_csv=None, noise_num_workers=0, babble_speaker_count=0, babble_snr_low=0, babble_snr_high=0, noise_snr_low=0, noise_snr_high=0, rir_scale_factor=1.0, reverb_sample_rate=16000, noise_sample_rate=16000, clean_sample_rate=16000)[source]
Bases:
Module
Environmental Corruptions for speech signals: noise, reverb, babble.
- Parameters
reverb_prob (float from 0 to 1) – The probability that each batch will have reverberation applied.
babble_prob (float from 0 to 1) – The probability that each batch will have babble added.
noise_prob (float from 0 to 1) – The probability that each batch will have noise added.
openrir_folder (str) – If provided, download and prepare openrir to this location. The reverberation csv and noise csv will come from here unless overridden by the reverb_csv or noise_csv arguments.
openrir_max_noise_len (float) – The maximum length in seconds for a noise segment from openrir. Only takes effect if openrir_folder is used for noises. Cuts longer noises into segments equal to or less than this length.
reverb_csv (str) – A prepared csv file for loading room impulse responses.
noise_csv (str) – A prepared csv file for loading noise data.
noise_num_workers (int) – Number of workers to use for loading noises.
babble_speaker_count (int) – Number of speakers to use for babble. Must be less than batch size.
babble_snr_low (int) – Lowest generated SNR of reverbed signal to babble.
babble_snr_high (int) – Highest generated SNR of reverbed signal to babble.
noise_snr_low (int) – Lowest generated SNR of babbled signal to noise.
noise_snr_high (int) – Highest generated SNR of babbled signal to noise.
rir_scale_factor (float) – It compresses or dilates the given impulse response. If 0 < rir_scale_factor < 1, the impulse response is compressed (less reverb), while if rir_scale_factor > 1 it is dilated (more reverb).
reverb_sample_rate (int) – Sample rate of input audio signals (rirs) used for reverberation.
noise_sample_rate (int) – Sample rate of input audio signals used for adding noise.
clean_sample_rate (int) – Sample rate of original (clean) audio signals.
Example
>>> import torch
>>> from speechbrain.lobes.augment import EnvCorrupt
>>> inputs = torch.randn([10, 16000])
>>> corrupter = EnvCorrupt(babble_speaker_count=9)
>>> feats = corrupter(inputs, torch.ones(10))
- forward(waveforms, lengths)[source]
Returns the distorted waveforms.
- Parameters
waveforms (torch.Tensor) – The waveforms to distort.
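The SNR parameters (babble_snr_low/high, noise_snr_low/high) describe a range from which a signal-to-noise ratio in dB is drawn. The sketch below shows one generic way to mix noise at a target SNR by matching RMS levels. It is an illustration under stated assumptions, not the scaling used inside EnvCorrupt: `rms` and `mix_at_snr` are hypothetical helpers, signals are plain Python lists, and the noise is assumed to be non-silent.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so that
    20 * log10(rms(clean) / rms(scaled noise)) equals `snr_db`.

    Hypothetical helper; assumes rms(noise) > 0.
    """
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [c + gain * n for c, n in zip(clean, noise)]


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Draw an SNR from a [low, high] range, as the *_snr_low / *_snr_high pairs suggest.
snr_db = rng.uniform(0, 15)
mixed = mix_at_snr(clean, noise, snr_db)

# Recover the SNR actually obtained from the added component.
added = [m - c for m, c in zip(mixed, clean)]
print(round(snr_db, 2), round(20 * math.log10(rms(clean) / rms(added)), 2))
```

Note that all four SNR bounds default to 0 in the class signature, so a wider range has to be passed explicitly if varied noise levels are wanted.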