speechbrain.processing.signal_processing module
Low level signal processing utilities
- Authors
Peter Plantinga 2020
Francois Grondin 2020
William Aris 2020
Samuele Cornell 2020
Sarthak Yadav 2022
Summary
Functions:
compute_amplitude: Compute amplitude of a batch of waveforms.
convolve1d: Use torch.nn.functional to perform 1d padding and conv.
dB_to_amplitude: Returns the amplitude ratio, converted from decibels.
gabor_impulse_response: Function for generating gabor impulse responses as used by GaborConv1d proposed in LEAF.
gabor_impulse_response_legacy_complex: Function for generating gabor impulse responses without using the complex64 dtype, as used by GaborConv1d proposed in LEAF.
mean_std_norm: This function normalizes the mean and std of the input waveform.
normalize: This function normalizes a signal to unitary average or peak amplitude.
notch_filter: Returns a notch filter constructed from a high-pass and low-pass filter.
overlap_and_add: Reconstructs a signal from a framed representation (adapted from Conv-TasNet).
rescale: This function performs signal rescaling to a target level.
resynthesize: Function for resynthesizing waveforms from enhanced magnitudes.
reverberate: General function to contaminate a given signal with reverberation given a Room Impulse Response (RIR).
Reference
- speechbrain.processing.signal_processing.compute_amplitude(waveforms, lengths=None, amp_type='avg', scale='linear')[source]
Compute amplitude of a batch of waveforms.
- Parameters:
waveforms (tensor) – The waveforms used for computing amplitude. Shape should be [time], [batch, time], or [batch, time, channels].
lengths (tensor) – The lengths of the waveforms excluding the padding. Shape should be a single dimension, [batch].
amp_type (str) – Whether to compute the “avg” (average) or “peak” amplitude. Choose between [“avg”, “peak”].
scale (str) – Whether to compute amplitude in “dB” or “linear” scale. Choose between [“linear”, “dB”].
- Returns:
The average amplitude of the waveforms.
Example
>>> signal = torch.sin(torch.arange(16000.0)).unsqueeze(0)
>>> compute_amplitude(signal, signal.size(1))
tensor([[0.6366]])
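The behaviour above can be reproduced with plain torch; the sketch below is a hypothetical re-implementation for illustration (it ignores the `lengths` argument and assumes time is the last dimension), not the SpeechBrain source:

```python
import math

import torch

def compute_amplitude_sketch(waveforms, amp_type="avg", scale="linear"):
    # "avg": mean absolute value over time; "peak": max absolute value.
    if amp_type == "avg":
        amp = torch.mean(torch.abs(waveforms), dim=-1, keepdim=True)
    else:  # "peak"
        amp = torch.max(torch.abs(waveforms), dim=-1, keepdim=True)[0]
    if scale == "dB":
        # 20 * log10 converts a linear amplitude to decibels;
        # clamping avoids log of zero on silent inputs.
        amp = 20 * torch.log10(torch.clamp(amp, min=1e-14))
    return amp

signal = torch.sin(torch.arange(16000.0)).unsqueeze(0)
print(compute_amplitude_sketch(signal))              # ~tensor([[0.6366]])
print(compute_amplitude_sketch(signal, scale="dB"))  # same amplitude in dB
```

The 0.6366 value is simply the mean of |sin| over a uniformly sampled phase, 2/π.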
- speechbrain.processing.signal_processing.normalize(waveforms, lengths=None, amp_type='avg', eps=1e-14)[source]
This function normalizes a signal to unitary average or peak amplitude.
- Parameters:
waveforms (tensor) – The waveforms to normalize. Shape should be [batch, time] or [batch, time, channels].
lengths (tensor) – The lengths of the waveforms excluding the padding. Shape should be a single dimension, [batch].
amp_type (str) – Whether to normalize with respect to “avg” or “peak” amplitude. Choose between [“avg”, “peak”]. Note: for “avg” clipping is not prevented and can occur.
eps (float) – A small number to add to the denominator to prevent NaN.
- Returns:
waveforms – Normalized level waveform.
- Return type:
tensor
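In essence, normalization divides the waveform by its computed amplitude. A minimal torch-only sketch (illustrative, ignoring `lengths`):

```python
import torch

def normalize_sketch(waveforms, amp_type="avg", eps=1e-14):
    # Divide each waveform by its average (or peak) absolute amplitude;
    # eps guards against division by zero on silent inputs.
    if amp_type == "avg":
        den = torch.mean(torch.abs(waveforms), dim=-1, keepdim=True)
    else:  # "peak"
        den = torch.max(torch.abs(waveforms), dim=-1, keepdim=True)[0]
    return waveforms / (den + eps)

x = 0.1 * torch.randn(2, 16000)
peaked = normalize_sketch(x, amp_type="peak")
print(peaked.abs().max())  # ~1.0: peak normalization bounds the signal at unit amplitude
```

This also shows why “avg” normalization can clip: dividing by the average amplitude leaves individual samples free to exceed 1.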
- speechbrain.processing.signal_processing.mean_std_norm(waveforms, dims=1, eps=1e-06)[source]
- This function normalizes the mean and std of the input waveform (along the specified axis).
- Parameters:
waveforms (tensor) – The waveforms to normalize.
dims (int) – The dimension along which mean and std are computed.
eps (float) – A small number added to the std to avoid division by zero.
- Returns:
waveforms – Normalized level waveform.
- Return type:
tensor
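This is the usual standardization step. A sketch under the assumption that it is a plain mean/std normalization along `dims` (illustrative, not the SpeechBrain source):

```python
import torch

def mean_std_norm_sketch(waveforms, dims=1, eps=1e-6):
    # Subtract the mean and divide by the standard deviation along `dims`;
    # eps keeps the division stable for near-constant inputs.
    mean = waveforms.mean(dim=dims, keepdim=True)
    std = waveforms.std(dim=dims, keepdim=True)
    return (waveforms - mean) / (std + eps)

x = 3.0 * torch.randn(4, 16000) + 5.0   # non-zero mean, non-unit variance
y = mean_std_norm_sketch(x)
print(y.mean(dim=1))  # ~0 for every utterance
print(y.std(dim=1))   # ~1 for every utterance
```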
- speechbrain.processing.signal_processing.rescale(waveforms, lengths, target_lvl, amp_type='avg', scale='linear')[source]
This function performs signal rescaling to a target level.
- Parameters:
waveforms (tensor) – The waveforms to rescale. Shape should be [batch, time] or [batch, time, channels].
lengths (tensor) – The lengths of the waveforms excluding the padding. Shape should be a single dimension, [batch].
target_lvl (float) – Target level in dB or linear scale.
amp_type (str) – Whether one wants to rescale with respect to “avg” or “peak” amplitude. Choose between [“avg”, “peak”].
scale (str) – whether target_lvl belongs to linear or dB scale. Choose between [“linear”, “dB”].
- Returns:
waveforms – Rescaled waveforms.
- Return type:
tensor
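Conceptually, rescaling is normalization followed by multiplication with the target level. An illustrative torch-only sketch (ignoring `lengths`, not the SpeechBrain source):

```python
import torch

def rescale_sketch(waveforms, target_lvl, amp_type="avg", scale="linear"):
    # Normalize to unit amplitude, then scale to the target level,
    # converting a dB target to a linear ratio first.
    if amp_type == "avg":
        amp = torch.mean(torch.abs(waveforms), dim=-1, keepdim=True)
    else:  # "peak"
        amp = torch.max(torch.abs(waveforms), dim=-1, keepdim=True)[0]
    if scale == "dB":
        target_lvl = 10 ** (target_lvl / 20)  # dB -> linear amplitude
    return target_lvl * waveforms / (amp + 1e-14)

x = torch.randn(1, 16000)
y = rescale_sketch(x, target_lvl=0.5, amp_type="peak")
print(y.abs().max())  # ~0.5
```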
- speechbrain.processing.signal_processing.convolve1d(waveform, kernel, padding=0, pad_type='constant', stride=1, groups=1, use_fft=False, rotation_index=0)[source]
Use torch.nn.functional to perform 1d padding and conv.
- Parameters:
waveform (tensor) – The tensor to perform operations on.
kernel (tensor) – The filter to apply during convolution.
padding (int or tuple) – The padding (pad_left, pad_right) to apply. If an integer is passed instead, this is passed to the conv1d function and pad_type is ignored.
pad_type (str) – The type of padding to use. Passed directly to torch.nn.functional.pad, see PyTorch documentation for available options.
stride (int) – The number of units to move each time convolution is applied. Passed to conv1d. Has no effect if use_fft is True.
groups (int) – This option is passed to conv1d to split the input into groups for convolution. Input channels should be divisible by the number of groups.
use_fft (bool) – When use_fft is True, the convolution is computed in the spectral domain using complex multiplication. This is more efficient on CPU when the kernel is large (e.g. reverberation). WARNING: Without padding, circular convolution occurs. This makes little difference in the case of reverberation, but may make more difference with other kernels.
rotation_index (int) – This option only applies if use_fft is True. If so, the kernel is rolled by this amount before convolution to shift the output location.
- Returns:
The convolved waveform.
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0).unsqueeze(2)
>>> kernel = torch.rand(1, 10, 1)
>>> signal = convolve1d(signal, kernel, padding=(9, 0))
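The trade-off behind use_fft can be seen directly in torch: direct convolution via conv1d and spectral multiplication give the same result, provided both are zero-padded to the full output length to avoid the circular wrap-around the docstring warns about. Illustrative sketch, not the SpeechBrain code:

```python
import torch
import torch.nn.functional as F

signal = torch.randn(1, 1, 64)   # [batch, channel, time]
kernel = torch.randn(1, 1, 16)

# Direct: left-pad by kernel_size - 1 and flip the kernel, so conv1d's
# cross-correlation becomes a causal convolution.
direct = F.conv1d(F.pad(signal, (15, 0)), kernel.flip(-1))

# FFT: linear convolution length is time + kernel - 1; truncate afterwards.
n = signal.shape[-1] + kernel.shape[-1] - 1
spec = torch.fft.rfft(signal, n) * torch.fft.rfft(kernel, n)
fft_conv = torch.fft.irfft(spec, n)[..., :signal.shape[-1]]

print(torch.allclose(direct, fft_conv, atol=1e-4))  # True
```

Skipping the zero-padding to length n would make the FFT path wrap the kernel tail around to the start of the signal — harmless for a decaying RIR, visible for other kernels.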
- speechbrain.processing.signal_processing.reverberate(waveforms, rir_waveform, rescale_amp='avg')[source]
General function to contaminate a given signal with reverberation given a Room Impulse Response (RIR). It performs convolution between RIR and signal, but without changing the original amplitude of the signal.
- Parameters:
waveforms (tensor) – The waveforms to reverberate. Shape should be [batch, time] or [batch, time, channels].
rir_waveform (tensor) – RIR tensor, shape should be [time, channels].
rescale_amp (str) – Whether the reverberated signal is rescaled: None means no rescaling, otherwise rescale with respect to the original signal’s “peak” or “avg” amplitude. Choose between [None, “avg”, “peak”].
- Returns:
waveforms – Reverberated signal.
- Return type:
tensor
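A hypothetical torch-only sketch of what reverberation amounts to: convolve the dry signal with the RIR (here via FFT), then rescale so the dry signal's amplitude is preserved. The RIR values below are made up for illustration:

```python
import torch

dry = torch.randn(1, 16000)
rir = torch.zeros(1, 2000)
rir[0, 0] = 1.0      # direct path
rir[0, 1500] = 0.4   # a single late reflection

# Full linear convolution via the FFT, truncated back to the dry length.
n = dry.shape[-1] + rir.shape[-1] - 1
wet = torch.fft.irfft(torch.fft.rfft(dry, n) * torch.fft.rfft(rir, n), n)
wet = wet[..., :dry.shape[-1]]

# Rescale to the dry signal's peak so loudness is unchanged.
wet = wet * dry.abs().max() / (wet.abs().max() + 1e-14)
print(wet.shape)  # torch.Size([1, 16000])
```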
- speechbrain.processing.signal_processing.dB_to_amplitude(SNR)[source]
Returns the amplitude ratio, converted from decibels.
- Parameters:
SNR (float) – The ratio in decibels to convert.
Example
>>> round(dB_to_amplitude(SNR=10), 3)
3.162
>>> dB_to_amplitude(SNR=0)
1.0
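Both outputs follow from the standard dB-to-linear amplitude conversion, 10 ** (dB / 20); a one-line equivalent for reference:

```python
# Amplitude (not power) conversion: 10 dB -> sqrt(10) ~ 3.162, 0 dB -> 1.0.
def db_to_amp(db):
    return 10 ** (db / 20)

print(round(db_to_amp(10), 3))  # 3.162
print(db_to_amp(0))             # 1.0
```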
- speechbrain.processing.signal_processing.notch_filter(notch_freq, filter_width=101, notch_width=0.05)[source]
Returns a notch filter constructed from a high-pass and low-pass filter.
(from https://tomroelandts.com/articles/how-to-create-simple-band-pass-and-band-reject-filters)
- Parameters:
notch_freq (float) – Frequency to put the notch at, as a fraction of sampling_rate / 2. The range of possible inputs is 0 to 1.
filter_width (int) – Filter width in samples. Longer filters have smaller transition bands, but are less efficient.
notch_width (float) – Width of the notch, as a fraction of sampling_rate / 2.
Example
>>> from speechbrain.dataio.dataio import read_audio
>>> signal = read_audio('tests/samples/single-mic/example1.wav')
>>> signal = signal.unsqueeze(0).unsqueeze(2)
>>> kernel = notch_filter(0.25)
>>> notched_signal = convolve1d(signal, kernel)
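The band-reject construction referenced above can be sketched in plain torch: a windowed-sinc low-pass below the notch plus a spectrally inverted low-pass (i.e. a high-pass) above it. This is a hypothetical re-implementation, not the SpeechBrain source:

```python
import torch

def notch_kernel_sketch(notch_freq, filter_width=101, notch_width=0.05):
    # Frequencies are fractions of sampling_rate / 2 (Nyquist).
    t = torch.arange(filter_width) - filter_width // 2
    window = torch.hann_window(filter_width, periodic=False)

    def lowpass(cutoff):
        # Windowed sinc, normalized to unit gain at DC.
        h = cutoff * torch.sinc(cutoff * t) * window
        return h / h.sum()

    low = lowpass(notch_freq - notch_width)
    high = -lowpass(notch_freq + notch_width)
    high[filter_width // 2] += 1.0  # spectral inversion: delta minus low-pass
    return low + high

kernel = notch_kernel_sketch(0.25)
H = torch.fft.rfft(kernel, 1024).abs()
freqs = torch.fft.rfftfreq(1024) * 2          # as fractions of Nyquist
notch_bin = torch.argmin((freqs - 0.25).abs())
print(H[0].item(), H[notch_bin].item())       # DC gain ~1, notch attenuated
```

Summing a low-pass and a high-pass works here because both kernels are linear-phase with the same center, so their real frequency responses add directly.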
- speechbrain.processing.signal_processing.overlap_and_add(signal, frame_step)[source]
Taken from https://github.com/kaituoxu/Conv-TasNet/blob/master/src/utils.py
Reconstructs a signal from a framed representation. Adds potentially overlapping frames of a signal with shape [..., frames, frame_length], offsetting subsequent frames by frame_step. The resulting tensor has shape [..., output_size] where output_size = (frames - 1) * frame_step + frame_length.
- Args:
signal – A […, frames, frame_length] Tensor. All dimensions may be unknown, and rank must be at least 2.
frame_step – An integer denoting overlap offsets. Must be less than or equal to frame_length.
- Returns:
A Tensor with shape […, output_size] containing the overlap-added frames of signal’s inner-most two dimensions. output_size = (frames - 1) * frame_step + frame_length
Example
>>> signal = torch.randn(5, 20)
>>> overlapped = overlap_and_add(signal, 20)
>>> overlapped.shape
torch.Size([100])
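The example above uses frame_step equal to frame_length, so frames are simply concatenated; with frame_step smaller than frame_length the overlapping regions are summed. A loop-based reference sketch (the SpeechBrain version, adapted from Conv-TasNet, is vectorized but computes the same thing):

```python
import torch

def overlap_and_add_sketch(signal, frame_step):
    *outer, frames, frame_length = signal.shape
    output_size = (frames - 1) * frame_step + frame_length
    out = torch.zeros(*outer, output_size)
    for i in range(frames):
        # Each frame is shifted by frame_step and accumulated.
        out[..., i * frame_step : i * frame_step + frame_length] += signal[..., i, :]
    return out

frames = torch.ones(4, 10)
out = overlap_and_add_sketch(frames, 5)
print(out.shape)   # torch.Size([25])  -> (4 - 1) * 5 + 10
print(out[5:10])   # tensor([2., 2., 2., 2., 2.]) where adjacent frames overlap
```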
- speechbrain.processing.signal_processing.resynthesize(enhanced_mag, noisy_inputs, stft, istft, normalize_wavs=True)[source]
Function for resynthesizing waveforms from enhanced magnitudes.
- Parameters:
enhanced_mag (torch.Tensor) – Predicted spectral magnitude, should be three dimensional.
noisy_inputs (torch.Tensor) – The noisy waveforms before any processing, to extract phase.
stft (torch.nn.Module) – Module for computing the STFT for extracting phase.
istft (torch.nn.Module) – Module for computing the iSTFT for resynthesis.
normalize_wavs (bool) – Whether to normalize the output wavs before returning them.
- Returns:
enhanced_wav – The resynthesized waveforms of the enhanced magnitudes with noisy phase.
- Return type:
torch.Tensor
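A torch-only sketch of the resynthesis idea: keep the noisy signal's phase, swap in the predicted magnitude, and invert. The real function uses SpeechBrain STFT/ISTFT modules rather than torch.stft/torch.istft, and the values below are illustrative:

```python
import torch

n_fft, hop = 512, 128
noisy = torch.randn(1, 16000)
window = torch.hann_window(n_fft)

noisy_spec = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)
phase = noisy_spec / (noisy_spec.abs() + 1e-14)    # unit-magnitude phase term

# Stand-in for a model's output: here just the noisy magnitude itself,
# so resynthesis should reproduce the input.
enhanced_mag = noisy_spec.abs()

enhanced_wav = torch.istft(enhanced_mag * phase, n_fft, hop,
                           window=window, length=noisy.shape[-1])
print(enhanced_wav.shape)  # torch.Size([1, 16000])
```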
- speechbrain.processing.signal_processing.gabor_impulse_response(t, center, fwhm)[source]
Function for generating gabor impulse responses as used by GaborConv1d proposed in
Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)
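A generic Gabor impulse response is a Gaussian envelope modulating a complex sinusoid, which is the building block GaborConv1d learns in LEAF. The exact parameterization of `center` and `fwhm` in SpeechBrain may differ; the values below are illustrative assumptions:

```python
import math

import torch

t = torch.arange(-200.0, 201.0)   # filter support in samples
center = 0.3                      # carrier frequency in radians per sample
fwhm = 50.0                       # full width at half maximum of the envelope

sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))  # FWHM -> std dev
envelope = torch.exp(-t ** 2 / (2.0 * sigma ** 2))     # Gaussian window
carrier = torch.exp(1j * center * t)                   # complex sinusoid
gabor = envelope * carrier                             # complex64 response
print(gabor.shape, gabor.dtype)   # torch.Size([401]) torch.complex64
```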
- speechbrain.processing.signal_processing.gabor_impulse_response_legacy_complex(t, center, fwhm)[source]
Function for generating gabor impulse responses without using the complex64 dtype, as used by GaborConv1d proposed in
Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)