speechbrain.lobes.features module
Basic feature pipelines.
- Authors
Mirco Ravanelli 2020
Peter Plantinga 2020
Sarthak Yadav 2020
Summary
Classes:
Fbank – Generate features for input to the speech pipeline.
Leaf – This class implements the LEAF audio frontend.
MFCC – Generate features for input to the speech pipeline.
Reference
- class speechbrain.lobes.features.Fbank(deltas=False, context=False, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=40, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)
Bases: Module
Generate Mel filterbank (Fbank) features for input to the speech pipeline.
- Parameters
deltas (bool (default: False)) – Whether or not to append derivatives and second derivatives to the features.
context (bool (default: False)) – Whether or not to append forward and backward contexts to the features.
requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.
sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.
f_min (int (default: 0)) – Lowest frequency for the Mel filters.
f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.
win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.
hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.
n_fft (int (default: 400)) – Number of samples to use in each STFT.
n_mels (int (default: 40)) – Number of Mel filters.
filter_shape (str (default: triangular)) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).
param_change_factor (float (default: 1.0)) – If requires_grad=True (the filters are not frozen), this parameter affects the speed at which the filter parameters (i.e., central_freqs and bands) can change. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.
param_rand_factor (float (default: 0.0)) – This parameter can be used to randomly change the filter parameters (i.e., central frequencies and bands) during training. It is thus a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., if the central frequency is 100 Hz, it can be randomly changed to any value from 85 Hz to 115 Hz).
left_frames (int (default: 5)) – Number of frames of left context to add.
right_frames (int (default: 5)) – Number of frames of right context to add.
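The ±15% figure quoted for param_rand_factor can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper names jitter_bounds and jitter are hypothetical, and drawing the perturbation uniformly is an assumption made for illustration, not a description of how the library samples it.

```python
import random


def jitter_bounds(value, rand_factor):
    """Return the (low, high) range allowed by a relative random factor."""
    return value * (1.0 - rand_factor), value * (1.0 + rand_factor)


def jitter(value, rand_factor, rng):
    """Scale value by a factor drawn from [1 - rand_factor, 1 + rand_factor].

    Uniform sampling is an assumption made for this illustration.
    """
    scale = 1.0 + rng.uniform(-rand_factor, rand_factor)
    return value * scale


rng = random.Random(0)

# A 100 Hz central frequency with a factor of 0.15: roughly 85 Hz to 115 Hz.
low, high = jitter_bounds(100.0, 0.15)
samples = [jitter(100.0, 0.15, rng) for _ in range(1000)]

# A factor of 0.0 leaves the value unchanged.
unchanged = jitter(100.0, 0.0, rng)
```

Every value in samples stays inside the (low, high) range, which is the behaviour the parameter description promises for the filter parameters.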
Example
>>> import torch
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = Fbank()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 40])
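The 101 frames in the Fbank example above follow from the hop length. The sketch below reproduces that count in plain Python. The helper name num_frames is hypothetical, and the formula assumes a centered (padded) STFT, in which the number of frames is one plus the number of whole hops that fit in the signal.

```python
def num_frames(num_samples, sample_rate=16000, hop_length_ms=10):
    """Estimate the number of STFT frames for a waveform.

    Assumes a centered STFT, so frames = num_samples // hop + 1.
    """
    # Convert the hop length from milliseconds to samples.
    hop_samples = int(round(sample_rate * hop_length_ms / 1000))
    return num_samples // hop_samples + 1


# One second of 16 kHz audio with a 10 ms hop (160 samples).
frames = num_frames(16000)
print(frames)  # 101
```

Combined with n_mels=40 and a batch of 10 waveforms, this matches the torch.Size([10, 101, 40]) shape shown in the example.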
- class speechbrain.lobes.features.MFCC(deltas=True, context=True, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=23, n_mfcc=20, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)
Bases: Module
Generate MFCC (Mel-frequency cepstral coefficient) features for input to the speech pipeline.
- Parameters
deltas (bool (default: True)) – Whether or not to append derivatives and second derivatives to the features.
context (bool (default: True)) – Whether or not to append forward and backward contexts to the features.
requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.
sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.
f_min (int (default: 0)) – Lowest frequency for the Mel filters.
f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.
win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.
hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.
n_fft (int (default: 400)) – Number of samples to use in each STFT.
n_mels (int (default: 23)) – Number of filters to use for creating filterbank.
n_mfcc (int (default: 20)) – Number of output coefficients.
filter_shape (str (default: 'triangular')) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).
param_change_factor (float (default: 1.0)) – If requires_grad=True (the filters are not frozen), this parameter affects the speed at which the filter parameters (i.e., central_freqs and bands) can change. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.
param_rand_factor (float (default: 0.0)) – This parameter can be used to randomly change the filter parameters (i.e., central frequencies and bands) during training. It is thus a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., if the central frequency is 100 Hz, it can be randomly changed to any value from 85 Hz to 115 Hz).
left_frames (int (default: 5)) – Number of frames of left context to add.
right_frames (int (default: 5)) – Number of frames of right context to add.
Example
>>> import torch
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = MFCC()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 660])
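The last dimension of 660 in the MFCC example above can be derived from the defaults. The sketch below assumes that deltas triples the per-frame size (static, first, and second derivatives) and that context concatenates left_frames + right_frames + 1 frames. The helper name feature_dim is hypothetical.

```python
def feature_dim(n_coeffs, deltas, context, left_frames=5, right_frames=5):
    """Estimate the per-frame feature size after deltas and context.

    Assumes deltas appends first and second derivatives (x3) and that
    context stacks the current frame with its left and right neighbours.
    """
    dim = n_coeffs
    if deltas:
        dim *= 3
    if context:
        dim *= left_frames + right_frames + 1
    return dim


# MFCC defaults: n_mfcc=20, deltas=True, context=True.
mfcc_dim = feature_dim(20, deltas=True, context=True)
print(mfcc_dim)  # 660

# Fbank defaults: n_mels=40, deltas=False, context=False.
fbank_dim = feature_dim(40, deltas=False, context=False)
print(fbank_dim)  # 40
```

Both values agree with the shapes shown in the Fbank and MFCC examples: 20 × 3 × 11 = 660 for MFCC, and 40 for Fbank.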
- class speechbrain.lobes.features.Leaf(out_channels, window_len: float = 25.0, window_stride: float = 10.0, sample_rate: int = 16000, input_shape=None, in_channels=None, min_freq=60.0, max_freq=None, use_pcen=True, learnable_pcen=True, use_legacy_complex=False, skip_transpose=False, n_fft=512)
Bases: Module
This class implements the LEAF audio frontend from
Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A LEARNABLE FRONTEND FOR AUDIO CLASSIFICATION”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)
- Parameters
out_channels (int) – Number of output channels.
window_len (float) – Length of the filter window in milliseconds.
window_stride (float) – Stride of the filters in milliseconds.
sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.
min_freq (float) – Lowest possible frequency (in Hz) for a filter.
max_freq (float) – Highest possible frequency (in Hz) for a filter.
use_pcen (bool) – If True (default), a per-channel energy normalization layer is used.
learnable_pcen (bool) – If True (default), the per-channel energy normalization layer is learnable.
use_legacy_complex (bool) – If False, the torch.complex64 data type is used for the Gabor impulse responses. If True, computation is performed on two real-valued tensors.
skip_transpose (bool) – If False, uses the batch x time x channel convention of speechbrain. If True, uses the batch x channel x time convention.
Example
>>> import torch
>>> inp_tensor = torch.rand([10, 8000])
>>> leaf = Leaf(
...     out_channels=40, window_len=25., window_stride=10., in_channels=1
... )
>>> out_tensor = leaf(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
- forward(x)
Returns the learned LEAF features.
- Parameters
x (torch.Tensor of shape (batch, time, 1) or (batch, time)) – Batch of input signals. 2D or 3D tensors are expected.
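The two layouts described for skip_transpose can be summarised with a small shape helper. This is a sketch only: leaf_output_shape is a hypothetical name, and it assumes the chosen convention applies to the tensor returned by forward. The frame count is taken as given rather than computed.

```python
def leaf_output_shape(batch, num_frames, out_channels, skip_transpose=False):
    """Return the expected output layout for the given convention.

    skip_transpose=False: (batch, time, channel), the speechbrain convention.
    skip_transpose=True:  (batch, channel, time).
    """
    if skip_transpose:
        return (batch, out_channels, num_frames)
    return (batch, num_frames, out_channels)


# Values taken from the Leaf example: 10 signals, 50 frames, 40 channels.
default_shape = leaf_output_shape(10, 50, 40)
print(default_shape)  # (10, 50, 40)

transposed_shape = leaf_output_shape(10, 50, 40, skip_transpose=True)
print(transposed_shape)  # (10, 40, 50)
```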