speechbrain.lobes.features module

Basic feature pipelines.

Authors

Mirco Ravanelli 2020
Peter Plantinga 2020
Sarthak Yadav 2020
Sylvain de Langen 2024

Summary

Classes:

`Fbank`	Generate features for input to the speech pipeline.
`Leaf`	This class implements the LEAF audio frontend from Zeghidour et al., “LEAF: A Learnable Frontend for Audio Classification” (ICLR 2021).
`MFCC`	Generate features for input to the speech pipeline.
`StreamingFeatureWrapper`	Wraps an arbitrary filter so that it can be used in a streaming fashion (i.e. on a per-chunk basis), by remembering context and making "clever" use of padding.
`StreamingFeatureWrapperContext`	Streaming metadata for the feature extractor.
`VocalFeatures`	Estimates the vocal characteristics of a signal in four categories of features: autocorrelation-based, period-based (jitter/shimmer), spectrum-based, and MFCCs.

Functions:

`moving_average`	Computes moving average on a given dimension.
`upalign_value`	If `x` is not a multiple of `to`, round it up to the next multiple of `to`.

Reference

class speechbrain.lobes.features.Fbank(deltas=False, context=False, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=40, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)

Bases: Module

Generate features for input to the speech pipeline.

Parameters:

deltas (bool (default: False)) – Whether or not to append derivatives and second derivatives to the features.
context (bool (default: False)) – Whether or not to append forward and backward contexts to the features.
requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.
sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.
f_min (int (default: 0)) – Lowest frequency for the Mel filters.
f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.
n_fft (int (default: 400)) – Number of samples to use in each stft.
n_mels (int (default: 40)) – Number of Mel filters.
filter_shape (str (default: triangular)) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).
param_change_factor (float (default: 1.0)) – If requires_grad=True, this parameter controls how quickly the filter parameters (i.e., central_freqs and bands) can change during training. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.
param_rand_factor (float (default: 0.0)) – This parameter can be used to randomly change the filter parameters (i.e., central frequencies and bands) during training. It thus acts as a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., if the central frequency is 100 Hz, it can randomly vary between 85 Hz and 115 Hz).
left_frames (int (default: 5)) – Number of frames of left context to add.
right_frames (int (default: 5)) – Number of frames of right context to add.
win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.
hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.
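
Two of the values described above reduce to simple arithmetic. The sketch below is illustrative only: `default_f_max` and `rand_factor_range` are hypothetical helper names that are not part of SpeechBrain, and the range computation assumes the variation is applied symmetrically around the nominal value, as the 100 Hz example suggests.

```python
def default_f_max(sample_rate: int) -> int:
    """Fallback used when f_max is left as None (sample_rate // 2)."""
    return sample_rate // 2


def rand_factor_range(nominal: float, param_rand_factor: float) -> tuple:
    """Bounds of the random variation around a nominal filter parameter.

    Assumption: the variation is symmetric around the nominal value.
    """
    delta = nominal * param_rand_factor
    return (nominal - delta, nominal + delta)


f_max = default_f_max(16000)                 # 8000 Hz for 16 kHz audio
low, high = rand_factor_range(100.0, 0.15)   # roughly 85 Hz and 115 Hz
```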

Example

>>> import torch
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = Fbank()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 40])
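
The 101 frames in the example above are consistent with a centered STFT, for which the frame count is `num_samples // hop_samples + 1`. The following sketch reproduces that arithmetic. `ms_to_samples` and `num_frames` are hypothetical helpers, and the centered-padding formula is an assumption inferred from the example output, not taken from the implementation.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(sample_rate / 1000.0 * duration_ms))


def num_frames(num_samples: int, hop_length_ms: float, sample_rate: int) -> int:
    """Frame count for a centered STFT (assumed framing convention)."""
    hop_samples = ms_to_samples(hop_length_ms, sample_rate)
    return num_samples // hop_samples + 1


win_samples = ms_to_samples(25, 16000)   # 400 samples, matching n_fft=400
frames = num_frames(16000, 10, 16000)    # 101 frames for one second of audio
```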

forward(wav)

Returns a set of features generated from the input waveforms.

Parameters:: wav (torch.Tensor) – A batch of audio signals to transform to features.
Returns:: fbanks
Return type:: torch.Tensor

get_filter_properties() → FilterProperties

class speechbrain.lobes.features.MFCC(deltas=True, context=True, requires_grad=False, sample_rate=16000, f_min=0, f_max=None, n_fft=400, n_mels=23, n_mfcc=20, filter_shape='triangular', param_change_factor=1.0, param_rand_factor=0.0, left_frames=5, right_frames=5, win_length=25, hop_length=10)

Bases: Module

Generate features for input to the speech pipeline.

Parameters:

deltas (bool (default: True)) – Whether or not to append derivatives and second derivatives to the features.
context (bool (default: True)) – Whether or not to append forward and backward contexts to the features.
requires_grad (bool (default: False)) – Whether to allow parameters (i.e. fbank centers and spreads) to update during training.
sample_rate (int (default: 16000)) – Sampling rate for the input waveforms.
f_min (int (default: 0)) – Lowest frequency for the Mel filters.
f_max (int (default: None)) – Highest frequency for the Mel filters. Note that if f_max is not specified it will be set to sample_rate // 2.
n_fft (int (default: 400)) – Number of samples to use in each stft.
n_mels (int (default: 23)) – Number of filters to use for creating filterbank.
n_mfcc (int (default: 20)) – Number of output coefficients.
filter_shape (str (default: triangular)) – Shape of the filters (‘triangular’, ‘rectangular’, ‘gaussian’).
param_change_factor (float (default: 1.0)) – If requires_grad=True, this parameter controls how quickly the filter parameters (i.e., central_freqs and bands) can change during training. When high (e.g., param_change_factor=1), the filters change a lot during training. When low (e.g., param_change_factor=0.1), the filter parameters are more stable during training.
param_rand_factor (float (default: 0.0)) – This parameter can be used to randomly change the filter parameters (i.e., central frequencies and bands) during training. It thus acts as a form of regularization. param_rand_factor=0 has no effect, while param_rand_factor=0.15 allows random variations within ±15% of the standard values of the filter parameters (e.g., if the central frequency is 100 Hz, it can randomly vary between 85 Hz and 115 Hz).
left_frames (int (default: 5)) – Number of frames of left context to add.
right_frames (int (default: 5)) – Number of frames of right context to add.
win_length (float (default: 25)) – Length (in ms) of the sliding window used to compute the STFT.
hop_length (float (default: 10)) – Length (in ms) of the hop of the sliding window used to compute the STFT.

Example

>>> import torch
>>> inputs = torch.randn([10, 16000])
>>> feature_maker = MFCC()
>>> feats = feature_maker(inputs)
>>> feats.shape
torch.Size([10, 101, 660])
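
The last dimension of 660 follows from the defaults: 20 coefficients, tripled by the appended first and second derivatives, then stacked over 5 left frames, the current frame, and 5 right frames. The helper below is a hypothetical sketch of that bookkeeping. It assumes context windows are applied after the deltas are appended, which is consistent with the example output.

```python
def mfcc_feature_dim(
    n_mfcc: int = 20,
    deltas: bool = True,
    context: bool = True,
    left_frames: int = 5,
    right_frames: int = 5,
) -> int:
    """Size of the last output dimension under the assumptions above."""
    dim = n_mfcc
    if deltas:
        dim *= 3  # static coefficients + first + second derivatives
    if context:
        dim *= left_frames + right_frames + 1  # neighbours plus current frame
    return dim


dim_default = mfcc_feature_dim()                            # 20 * 3 * 11
dim_plain = mfcc_feature_dim(deltas=False, context=False)   # coefficients only
```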

forward(wav)

Returns a set of mfccs generated from the input waveforms.

Parameters:: wav (torch.Tensor) – A batch of audio signals to transform to features.
Returns:: mfccs
Return type:: torch.Tensor

class speechbrain.lobes.features.Leaf(out_channels, window_len: float = 25.0, window_stride: float = 10.0, sample_rate: int = 16000, input_shape=None, in_channels=None, min_freq=60.0, max_freq=None, use_pcen=True, learnable_pcen=True, use_legacy_complex=False, skip_transpose=False, n_fft=512)

Bases: Module

This class implements the LEAF audio frontend from

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)

Parameters:

out_channels (int) – Number of output channels.
window_len (float) – Length of the filter window in milliseconds.
window_stride (float) – Stride of the filters in milliseconds.
sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.
input_shape (tuple) – Expected shape of the inputs.
in_channels (int) – Expected number of input channels.
min_freq (float) – Lowest possible frequency (in Hz) for a filter.
max_freq (float) – Highest possible frequency (in Hz) for a filter.
use_pcen (bool) – If True (default), a per-channel energy normalization layer is used.
learnable_pcen (bool) – If True (default), the per-channel energy normalization layer is learnable.
use_legacy_complex (bool) – If False, the torch.complex64 data type is used for Gabor impulse responses. If True, computation is performed on two real-valued torch.Tensors.
skip_transpose (bool) – If False, uses the batch x time x channel convention of speechbrain. If True, uses the batch x channel x time convention.
n_fft (int) – Number of FFT bins.

Example

>>> inp_tensor = torch.rand([10, 8000])
>>> leaf = Leaf(
...     out_channels=40, window_len=25.0, window_stride=10.0, in_channels=1
... )
>>> out_tensor = leaf(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
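
The 50 output frames in the example are what one would expect if the stride of 10 ms is converted to samples and the input is padded so that no samples are dropped. The sketch below shows that arithmetic. `leaf_num_frames` is a hypothetical helper, and the "same"-style padding is an assumption inferred from the example output.

```python
import math


def leaf_num_frames(
    num_samples: int, window_stride_ms: float = 10.0, sample_rate: int = 16000
) -> int:
    """Expected number of output frames, assuming 'same'-style padding."""
    stride_samples = int(sample_rate * window_stride_ms // 1000)
    return math.ceil(num_samples / stride_samples)


frames = leaf_num_frames(8000)  # 8000 samples at a 160-sample stride
```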

forward(x)

Returns the learned LEAF features

Parameters:: x (torch.Tensor of shape (batch, time, 1) or (batch, time)) – batch of input signals. 2d or 3d tensors are expected.
Returns:: outputs
Return type:: torch.Tensor

speechbrain.lobes.features.upalign_value(x, to: int) → int: If x is not a multiple of to, round it up to the next multiple of to.
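
As a rough illustration, the behaviour can be sketched in plain Python. This assumes the function rounds a non-negative `x` up to the nearest multiple of `to`, which is how the description reads. `upalign_value_sketch` is a stand-in name, not the library function itself.

```python
def upalign_value_sketch(x: int, to: int) -> int:
    """Round x up to the next multiple of `to` (assumed behaviour)."""
    remainder = x % to
    if remainder == 0:
        return x  # already aligned, nothing to do
    return x + to - remainder


aligned = [upalign_value_sketch(v, 8) for v in (0, 7, 8, 9)]
# aligned == [0, 8, 8, 16]
```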

class speechbrain.lobes.features.StreamingFeatureWrapperContext(left_context: Tensor | None)

Bases: object

Streaming metadata for the feature extractor. Holds some past context frames.

left_context: Tensor | None: Cached left frames to be inserted as left padding for the next chunk. Initially None then gets updated from the last frames of the current chunk. See the relevant forward function for details.

class speechbrain.lobes.features.StreamingFeatureWrapper(module: Module, properties: FilterProperties)

Bases: Module

Wraps an arbitrary filter so that it can be used in a streaming fashion (i.e. on a per-chunk basis), by remembering context and making “clever” use of padding.

Parameters:

module (torch.nn.Module) – The filter to wrap; e.g. a module list that constitutes a sequential feature extraction pipeline. The module is assumed to pad its inputs, e.g. the output of a convolution with a stride of 1 would end up with the same frame count as the input.
properties (FilterProperties) – The effective filter properties of the provided module. This is used to determine padding and caching.

get_required_padding() → int: Computes the number of padding/context frames that need to be injected at the past and future of the input signal in the forward pass.

get_output_count_per_pad_frame() → int: Computes the exact number of produced frames (along the time dimension) per input pad frame.

get_recommended_final_chunk_count(frames_per_chunk: int) → int

Get the recommended number of zero chunks to inject at the end of an input stream depending on the filter properties of the extractor.

The number of injected chunks is chosen to ensure that the filter has output frames centered on the last input frames. See also forward().

Parameters:: frames_per_chunk (int) – The number of frames per chunk, i.e. the size of the time dimension passed to forward().
Returns:: Recommended number of chunks.
Return type:: int

forward(chunk: Tensor, context: StreamingFeatureWrapperContext, *extra_args, **extra_kwargs) → Tensor

Forward pass for the streaming feature wrapper.

For the first chunk, 0-padding is inserted at the past of the input. For any chunk (including the first), some future frames get truncated and cached to be inserted as left context for the next chunk in time.

For further explanations, see the comments in the code.

Note that due to how the padding is implemented, you may want to call this with a chunk's worth of zeros (potentially more for filters with large windows) at the end of your input so that the final frames have a chance to get processed by the filter. See get_recommended_final_chunk_count(). This is not really an issue when processing endless streams, but when processing files, omitting the zero chunks could result in truncated outputs.

Parameters:

chunk (torch.Tensor) – Chunk of input of shape [batch size, time]; typically a raw waveform. Normally, in a chunkwise streaming scenario, time = (stride-1) * chunk_size where chunk_size is the desired output frame count.
context (StreamingFeatureWrapperContext) – Mutable streaming context object; should be reused for subsequent calls in the same streaming session.
*extra_args (tuple)
**extra_kwargs (dict) – Args to be passed to the module.

Returns:

Processed chunk of shape [batch size, output frames]. This shape is equivalent to the shape of module(chunk).

Return type:

torch.Tensor
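
The recommendation to append zero chunks when processing a finite file can be sketched without any library dependency. The helper below only prepares the chunk sequence. In real use each chunk would be a tensor of shape [batch size, time] passed to forward() together with the same context object, and `final_zero_chunks` would come from get_recommended_final_chunk_count(frames_per_chunk). `make_stream_chunks` is a hypothetical name, and zero-padding the last partial chunk is an assumption of this sketch.

```python
def make_stream_chunks(samples, frames_per_chunk: int, final_zero_chunks: int):
    """Split a 1-D sequence into fixed-size chunks and append zero chunks."""
    chunks = []
    for start in range(0, len(samples), frames_per_chunk):
        chunk = list(samples[start:start + frames_per_chunk])
        # Pad the last, possibly shorter, chunk with zeros (sketch assumption).
        chunk += [0.0] * (frames_per_chunk - len(chunk))
        chunks.append(chunk)
    # Trailing all-zero chunks give the filter a chance to flush final frames.
    chunks.extend([0.0] * frames_per_chunk for _ in range(final_zero_chunks))
    return chunks


chunks = make_stream_chunks([1.0] * 10, frames_per_chunk=4, final_zero_chunks=1)
# Three data chunks (the last one zero-padded) followed by one all-zero chunk.
```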

get_filter_properties() → FilterProperties

make_streaming_context() → StreamingFeatureWrapperContext

class speechbrain.lobes.features.VocalFeatures(min_f0_Hz: int = 80, max_f0_Hz: int = 300, step_size: float = 0.01, window_size: float = 0.05, sample_rate: int = 16000, log_scores: bool = True, eps: float = 0.001, sma_neighbors: int = 3, n_mels: int = 23, n_mfcc: int = 4)

Bases: Module

Estimates the vocal characteristics of a signal in four categories of features:

Autocorrelation-based
Period-based (jitter/shimmer)
Spectrum-based
MFCCs

Parameters:

min_f0_Hz (int) – The minimum allowed fundamental frequency, to reduce octave errors. Default is 80 Hz, based on human voice standard frequency range.
max_f0_Hz (int) – The maximum allowed fundamental frequency, to reduce octave errors. Default is 300 Hz, based on human voice standard frequency range.
step_size (float) – The time between analysis windows (in seconds).
window_size (float) – The size of the analysis window (in seconds). Must be long enough to contain at least 4 periods at the minimum frequency.
sample_rate (int) – The number of samples in a second.
log_scores (bool) – Whether to represent the jitter/shimmer/hnr/gne on a log scale, as these features are typically close to zero.
eps (float) – The minimum value before log transformation, default of 1e-3 results in a maximum value of 30 dB.
sma_neighbors (int) – Number of frames to average – default 3
n_mels (int (default: 23)) – Number of filters to use for creating filterbank.
n_mfcc (int (default: 4)) – Number of output coefficients
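
Two of the constraints above can be checked numerically. The default window of 0.05 s is exactly four periods at the 80 Hz minimum, and an eps of 1e-3 corresponds to a 30 dB ceiling if the log scale is taken to be `-10 * log10(value)`. Both helpers are hypothetical, and the dB formula is an assumption chosen to match the stated 30 dB figure.

```python
import math


def min_window_size(min_f0_hz: float, periods: int = 4) -> float:
    """Shortest window (in seconds) holding `periods` cycles at min_f0_hz."""
    return periods / min_f0_hz


def max_log_score_db(eps: float) -> float:
    """Largest dB value reachable when scores are floored at eps (assumed scale)."""
    return -10.0 * math.log10(eps)


required_window = min_window_size(80)   # 0.05 s, equal to the default window_size
ceiling_db = max_log_score_db(1e-3)     # about 30 dB
```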

Example

>>> audio = torch.rand(1, 16000)
>>> feature_maker = VocalFeatures()
>>> vocal_features = feature_maker(audio)
>>> vocal_features.shape
torch.Size([1, 96, 17])

forward(audio: Tensor)

Compute voice features.

Parameters:

audio (torch.Tensor) – The audio signal to be converted to voice features.

Returns:

features –

A [batch, frame, 13+n_mfcc] tensor with the following features per-frame.

autocorr_f0: A per-frame estimate of the f0 in Hz.
autocorr_hnr: harmonicity-to-noise ratio for each frame.
periodic_jitter: Average deviation in period length.
periodic_shimmer: Average deviation in amplitude per period.
gne: The glottal-to-noise-excitation ratio.
spectral_centroid: “center-of-mass” for spectral frames.
spectral_spread: avg distance from centroid for spectral frames.
spectral_skew: asymmetry of spectrum about the centroid.
spectral_kurtosis: tailedness of spectrum.
spectral_entropy: The peakiness of the spectrum.
spectral_flatness: The ratio of geometric mean to arithmetic mean.
spectral_crest: The ratio of spectral maximum to arithmetic mean.
spectral_flux: The 2-normed diff between successive spectral values.
mfcc_{0-n_mfcc}: The mel cepstral coefficients.

Return type:

torch.Tensor
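
The size of the last dimension, 13 + n_mfcc, can be made concrete by listing the per-frame feature names. The sketch below assumes the features appear in the order given above and that the MFCCs are indexed from zero. `vocal_feature_names` is a hypothetical helper, not a SpeechBrain function.

```python
BASE_FEATURES = [
    "autocorr_f0", "autocorr_hnr",
    "periodic_jitter", "periodic_shimmer", "gne",
    "spectral_centroid", "spectral_spread", "spectral_skew",
    "spectral_kurtosis", "spectral_entropy", "spectral_flatness",
    "spectral_crest", "spectral_flux",
]


def vocal_feature_names(n_mfcc: int = 4) -> list:
    """Names for each index of the last dimension (assumed ordering)."""
    return BASE_FEATURES + [f"mfcc_{i}" for i in range(n_mfcc)]


names = vocal_feature_names()
# len(names) == 17, matching the last dimension in the class example.
```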

speechbrain.lobes.features.moving_average(features, dim=1, n=3)

Computes moving average on a given dimension.

Parameters:

features (torch.Tensor) – The feature tensor to smooth out.
dim (int) – The time dimension (for smoothing).
n (int) – The number of points in the moving average

Returns:

smoothed_features – The features after the moving average is applied.

Return type:

torch.Tensor

Example

>>> feats = torch.tensor([[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]])
>>> moving_average(feats)
tensor([[0.5000, 0.3333, 0.6667, 0.3333, 0.6667, 0.3333, 0.5000]])
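
The edge values in the example (0.5 rather than 0.3333) suggest that each output averages only the neighbours that actually exist, without counting padding. The pure-Python sketch below reproduces the example output under that assumption for an odd `n`. It is an illustration of the arithmetic, not the tensor-based implementation, and `moving_average_sketch` is a hypothetical name.

```python
def moving_average_sketch(values, n: int = 3):
    """Centered moving average that shrinks the window at the edges."""
    half = n // 2
    smoothed = []
    for i in range(len(values)):
        # Keep only the neighbours that fall inside the sequence.
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


feats = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
result = [round(v, 4) for v in moving_average_sketch(feats)]
# result == [0.5, 0.3333, 0.6667, 0.3333, 0.6667, 0.3333, 0.5]
```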