speechbrain.lobes.features module

Basic feature pipelines.

  • Mirco Ravanelli 2020

  • Peter Plantinga 2020

  • Sarthak Yadav 2020






class speechbrain.lobes.features.Fbank(deltas=False, context=False, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=40, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)[source]

Bases: Module

Generate features for input to the speech pipeline.

  • deltas (bool (default: False)) – Whether or not to append derivatives and second derivatives to the features.

  • context (bool (default: False)) – Whether or not to append forward and backward contexts to the features.

  • requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.

  • sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.

  • f_min (int (default: 0)) – Lowest frequency for the Mel filters.

  • f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.

  • win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.

  • hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.

  • n_fft (int (default: 400)) – Number of samples to use in each STFT.

  • n_mels (int (default: 40)) – Number of Mel filters.

  • filter_shape (str (default: triangular)) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).

  • param_change_factor (float (default: 1.0)) – If requires_grad=True (i.e., the filters are not frozen), this parameter controls how quickly the filter parameters (i.e., central_freqs and bands) can change. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.

  • param_rand_factor (float (default: 0.0)) – Randomly perturbs the filter parameters (i.e., central frequencies and bands) during training, acting as a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., a central frequency of 100 Hz can be randomly changed to any value from 85 Hz to 115 Hz).

  • left_frames (int (default: 5)) – Number of frames of left context to add.

  • right_frames (int (default: 5)) – Number of frames of right context to add.
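
The ±15% range quoted for param_rand_factor can be checked with plain arithmetic. The following is a minimal sketch of that calculation only; it is not how SpeechBrain draws the random perturbation, and the helper name perturbation_range is hypothetical.

```python
def perturbation_range(value, param_rand_factor):
    """Return the (low, high) interval reachable by a relative perturbation.

    Hypothetical helper for illustration; not part of SpeechBrain.
    """
    delta = value * param_rand_factor  # maximum absolute deviation
    return value - delta, value + delta


# A 100 Hz central frequency with param_rand_factor=0.15
low, high = perturbation_range(100.0, 0.15)
print(f"{low:.1f} Hz to {high:.1f} Hz")

# param_rand_factor=0 leaves the value unchanged
print(perturbation_range(100.0, 0.0))
```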


>>> import torch
>>> from speechbrain.lobes.features import Fbank
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = Fbank()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 40])
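
The 101 frames in the shape above follow from hop_length when it is read in milliseconds. A minimal sketch of that arithmetic, assuming the underlying STFT pads the signal so that frames are centered on multiples of the hop size (this page does not state that explicitly); the helper name expected_num_frames is hypothetical.

```python
def expected_num_frames(n_samples, sample_rate=16000, hop_length_ms=10):
    """Frame count for a centered STFT: one frame per full hop, plus one.

    Hypothetical helper; the centered-padding behaviour is an assumption.
    """
    # Convert the hop from milliseconds to samples (10 ms at 16 kHz = 160).
    hop_samples = int(round(sample_rate * hop_length_ms / 1000))
    return 1 + n_samples // hop_samples


# One second of 16 kHz audio, as in the example above
print(expected_num_frames(16000))
```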

Returns a set of features generated from the input waveforms.


wav (tensor) – A batch of audio signals to transform to features.

training: bool
class speechbrain.lobes.features.MFCC(deltas=True, context=True, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=23, n_mfcc=20, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)[source]

Bases: Module

Generate features for input to the speech pipeline.

  • deltas (bool (default: True)) – Whether or not to append derivatives and second derivatives to the features.

  • context (bool (default: True)) – Whether or not to append forward and backward contexts to the features.

  • requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.

  • sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.

  • f_min (int (default: 0)) – Lowest frequency for the Mel filters.

  • f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.

  • win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.

  • hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.

  • n_fft (int (default: 400)) – Number of samples to use in each STFT.

  • n_mels (int (default: 23)) – Number of filters to use for creating filterbank.

  • n_mfcc (int (default: 20)) – Number of output coefficients.

  • filter_shape (str (default: triangular)) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).

  • param_change_factor (float (default: 1.0)) – If requires_grad=True (i.e., the filters are not frozen), this parameter controls how quickly the filter parameters (i.e., central_freqs and bands) can change. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.

  • param_rand_factor (float (default: 0.0)) – Randomly perturbs the filter parameters (i.e., central frequencies and bands) during training, acting as a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., a central frequency of 100 Hz can be randomly changed to any value from 85 Hz to 115 Hz).

  • left_frames (int (default: 5)) – Number of frames of left context to add.

  • right_frames (int (default: 5)) – Number of frames of right context to add.


>>> import torch
>>> from speechbrain.lobes.features import MFCC
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = MFCC()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 660])
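
The last dimension of 660 is the product of n_mfcc, the delta expansion, and the context window. A minimal sketch of that bookkeeping, assuming deltas triples the feature size (static, first, and second derivatives) and context stacks left_frames + right_frames + 1 frames; the helper name mfcc_feature_dim is hypothetical.

```python
def mfcc_feature_dim(n_mfcc=20, deltas=True, context=True,
                     left_frames=5, right_frames=5):
    """Per-frame feature size implied by the MFCC options.

    Hypothetical helper for illustration; not part of SpeechBrain.
    """
    dim = n_mfcc
    if deltas:
        # Static coefficients plus first and second derivatives.
        dim *= 3
    if context:
        # Current frame plus left and right neighbours.
        dim *= left_frames + right_frames + 1
    return dim


print(mfcc_feature_dim())                             # defaults: 20 * 3 * 11
print(mfcc_feature_dim(deltas=False, context=False))  # plain coefficients
```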

Returns a set of mfccs generated from the input waveforms.


wav (tensor) – A batch of audio signals to transform to features.

training: bool
class speechbrain.lobes.features.Leaf(out_channels, window_len: float = 25.0, window_stride: float = 10.0, sample_rate: int = 16000, input_shape=None, in_channels=None, min_freq=60.0, max_freq=None, use_pcen=True, learnable_pcen=True, use_legacy_complex=False, skip_transpose=False, n_fft=512)[source]

Bases: Module

This class implements the LEAF audio frontend from

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A LEARNABLE FRONTEND FOR AUDIO CLASSIFICATION”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)

  • out_channels (int) – Number of output channels.

  • window_len (float) – Length of the filter window in milliseconds.

  • window_stride (float) – Stride of the filters in milliseconds.

  • sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.

  • min_freq (float) – Lowest possible frequency (in Hz) for a filter.

  • max_freq (float) – Highest possible frequency (in Hz) for a filter.

  • use_pcen (bool) – If True (default), a per-channel energy normalization layer is used.

  • learnable_pcen (bool) – If True (default), the per-channel energy normalization layer is learnable.

  • use_legacy_complex (bool) – If False, the torch.complex64 data type is used for Gabor impulse responses. If True, computation is performed on two real-valued tensors.

  • skip_transpose (bool) – If False, uses the batch x time x channel convention of SpeechBrain. If True, uses the batch x channel x time convention.


>>> import torch
>>> from speechbrain.lobes.features import Leaf
>>> inp_tensor = torch.rand([10, 8000])
>>> leaf = Leaf(
...     out_channels=40, window_len=25., window_stride=10., in_channels=1
... )
>>> out_tensor = leaf(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
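
The 50 time steps in this shape are consistent with a 10 ms stride applied to 8000 samples at the default 16 kHz sampling rate. A minimal sketch of that arithmetic, assuming the strided stage pads its input so the output length is the sample count divided by the stride, rounded up (an assumption about the internals, not stated on this page); the helper name leaf_num_frames is hypothetical.

```python
import math


def leaf_num_frames(n_samples, sample_rate=16000, window_stride_ms=10.0):
    """Approximate number of output time steps for a strided frontend.

    Hypothetical helper; the round-up padding behaviour is an assumption.
    """
    # Convert the stride from milliseconds to samples (10 ms at 16 kHz = 160).
    stride_samples = int(round(sample_rate * window_stride_ms / 1000))
    return math.ceil(n_samples / stride_samples)


# Half a second of 16 kHz audio, as in the example above
print(leaf_num_frames(8000))
```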

Returns the learned LEAF features.


x (torch.Tensor of shape (batch, time, 1) or (batch, time)) – batch of input signals. 2d or 3d tensors are expected.

training: bool