speechbrain.lobes.models.Tacotron2 module
Neural network modules for the Tacotron2 end-to-end neural Text-to-Speech (TTS) model
Authors * Georges Abous-Rjeili 2021 * Artem Ploujnikov 2021
Summary
Classes:
Attention – The Tacotron attention layer.
ConvNorm – A 1D convolution layer with Xavier initialization
Decoder – The Tacotron decoder
Encoder – The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM
LinearNorm – A linear layer with Xavier initialization
LocationLayer – A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer
Loss – The Tacotron loss implementation
LossStats – alias of TacotronLoss
Postnet – The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization and a tanh activation, with batch normalization.
Prenet – The Tacotron pre-net module consisting of a specified number of normalized (Xavier-initialized) linear layers
Tacotron2 – The Tacotron2 text-to-speech model, based on the NVIDIA implementation.
TextMelCollate – Zero-pads model inputs and targets based on number of frames per step
Functions:
dynamic_range_compression – Dynamic range compression for audio signals
get_mask_from_lengths – Creates a mask from a tensor of lengths
infer – An inference hook for pretrained synthesizers
mel_spectogram – Calculates the mel spectrogram for a raw audio signal
Reference
- class speechbrain.lobes.models.Tacotron2.LinearNorm(in_dim, out_dim, bias=True, w_init_gain='linear')[source]
Bases:
Module
A linear layer with Xavier initialization
- Parameters
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LinearNorm
>>> layer = LinearNorm(in_dim=5, out_dim=3)
>>> x = torch.randn(3, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 3])
- forward(x)[source]
Computes the forward pass
- Parameters
x (torch.Tensor) – a (batch, features) input tensor
- Returns
output – the linear layer output
- Return type
- class speechbrain.lobes.models.Tacotron2.ConvNorm(in_channels, out_channels, kernel_size=1, stride=1, padding=None, dilation=1, bias=True, w_init_gain='linear')[source]
Bases:
Module
A 1D convolution layer with Xavier initialization
- Parameters
in_channels (int) – the number of input channels
out_channels (int) – the number of output channels
kernel_size (int) – the kernel size
stride (int) – the convolutional stride
padding (int) – the amount of padding to include. If not provided, it will be calculated as dilation * (kernel_size - 1) / 2
dilation (int) – the dilation of the convolution
bias (bool) – whether or not to use a bias
w_init_gain (str) – the weight initialization gain type, 'linear' by default (see torch.nn.init.calculate_gain)
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import ConvNorm
>>> layer = ConvNorm(in_channels=10, out_channels=5, kernel_size=3)
>>> x = torch.randn(3, 10, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 5, 5])
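The default padding rule quoted in the parameter list, dilation * (kernel_size - 1) / 2, keeps the sequence length unchanged for odd kernel sizes at stride 1. The following is a minimal sketch of that arithmetic in plain Python. The helper names same_padding and conv1d_output_length are hypothetical and not part of the module, and the odd-kernel check is a choice made for this sketch rather than documented behaviour.

```python
def same_padding(kernel_size: int, dilation: int = 1) -> int:
    """Padding given by the rule dilation * (kernel_size - 1) / 2."""
    if kernel_size % 2 != 1:
        # Assumption of this sketch: the rule only yields an integer
        # amount of padding for odd kernel sizes.
        raise ValueError("kernel_size must be odd for this rule")
    return dilation * (kernel_size - 1) // 2


def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


padding = same_padding(kernel_size=3)
out_len = conv1d_output_length(5, kernel_size=3, padding=padding)
print(padding, out_len)  # 1 5
```

With kernel_size=3 the time axis keeps its length of 5, which agrees with the (3, 5, 5) output shape in the example above.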
- forward(signal)[source]
Computes the forward pass
- Parameters
signal (torch.Tensor) – the input to the convolutional layer
- Returns
output – the output
- Return type
- class speechbrain.lobes.models.Tacotron2.LocationLayer(attention_n_filters=32, attention_kernel_size=31, attention_dim=128)[source]
Bases:
Module
A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer
- Parameters
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LocationLayer
>>> layer = LocationLayer()
>>> attention_weights_cat = torch.randn(3, 2, 64)
>>> processed_attention = layer(attention_weights_cat)
>>> processed_attention.shape
torch.Size([3, 64, 128])
- forward(attention_weights_cat)[source]
Performs the forward pass for the attention layer
- Parameters
attention_weights_cat (torch.Tensor) – the concatenated (previous and cumulative) attention weights
- Returns
processed_attention (torch.Tensor) – the attention layer output
- class speechbrain.lobes.models.Tacotron2.Attention(attention_rnn_dim=1024, embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31)[source]
Bases:
Module
The Tacotron attention layer. Location-based attention is used.
- Parameters
attention_rnn_dim (int) – the dimension of the RNN to which the attention layer is applied
embedding_dim (int) – the embedding dimension
attention_dim (int) – the dimension of the memory cell
attention_location_n_filters (int) – the number of location filters
attention_location_kernel_size (int) – the kernel size of the location layer
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import (
...     Attention, get_mask_from_lengths)
>>> layer = Attention()
>>> attention_hidden_state = torch.randn(2, 1024)
>>> memory = torch.randn(2, 173, 512)
>>> processed_memory = torch.randn(2, 173, 128)
>>> attention_weights_cat = torch.randn(2, 2, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mask = get_mask_from_lengths(memory_lengths)
>>> attention_context, attention_weights = layer(
...     attention_hidden_state,
...     memory,
...     processed_memory,
...     attention_weights_cat,
...     mask
... )
>>> attention_context.shape, attention_weights.shape
(torch.Size([2, 512]), torch.Size([2, 173]))
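To illustrate how a padding mask typically enters an attention layer of this kind, here is a minimal sketch in plain Python for a single utterance. It assumes that energies at padded positions are replaced with negative infinity before the softmax, and that the context is the attention-weighted sum of the memory rows. The function name masked_attention is hypothetical, and the layer's actual implementation may differ in detail.

```python
import math


def masked_attention(energies, memory, padded):
    """Softmax over unpadded positions, then a weighted sum of memory rows.

    Assumes at least one position is not padded.
    """
    # Padded positions get -inf so that exp() sends their weight to 0
    masked = [-math.inf if is_pad else e for e, is_pad in zip(energies, padded)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in masked]
    total = sum(exps)
    weights = [v / total for v in exps]
    # Context vector: sum over time of weights[t] * memory[t]
    dim = len(memory[0])
    context = [
        sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)
    ]
    return context, weights


energies = [0.0, 0.0, 5.0]  # one utterance, three encoder steps
memory = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
padded = [False, False, True]  # the last step is padding
context, weights = masked_attention(energies, memory, padded)
print(weights)  # [0.5, 0.5, 0.0]
print(context)  # [0.5, 0.5]
```

The padded step has the largest raw energy, yet it receives zero weight and does not contribute to the context.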
- get_alignment_energies(query, processed_memory, attention_weights_cat)[source]
Computes the alignment energies
- Parameters
query (torch.Tensor) – decoder output (batch, n_mel_channels * n_frames_per_step)
processed_memory (torch.Tensor) – processed encoder outputs (B, T_in, attention_dim)
attention_weights_cat (torch.Tensor) – cumulative and previous attention weights (B, 2, max_time)
- Returns
alignment – (batch, max_time)
- Return type
- forward(attention_hidden_state, memory, processed_memory, attention_weights_cat, mask)[source]
Computes the forward pass
- Parameters
attention_hidden_state (torch.Tensor) – attention rnn last output
memory (torch.Tensor) – encoder outputs
processed_memory (torch.Tensor) – processed encoder outputs
attention_weights_cat (torch.Tensor) – previous and cumulative attention weights
mask (torch.Tensor) – binary mask for padded data
- Returns
result – a (attention_context, attention_weights) tuple
- Return type
- class speechbrain.lobes.models.Tacotron2.Prenet(in_dim=80, sizes=[256, 256], dropout=0.5)[source]
Bases:
Module
The Tacotron pre-net module consisting of a specified number of normalized (Xavier-initialized) linear layers
- Parameters
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Prenet
>>> layer = Prenet()
>>> x = torch.randn(862, 2, 80)
>>> output = layer(x)
>>> output.shape
torch.Size([862, 2, 256])
- forward(x)[source]
Computes the forward pass for the prenet
- Parameters
x (torch.Tensor) – the prenet inputs
- Returns
output – the output
- Return type
- class speechbrain.lobes.models.Tacotron2.Postnet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5)[source]
Bases:
Module
The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization and a tanh activation, with batch normalization. Depending on configuration, the postnet may either refine the MEL spectrogram or upsample it to a linear spectrogram
- Parameters
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Postnet
>>> layer = Postnet()
>>> x = torch.randn(2, 80, 861)
>>> output = layer(x)
>>> output.shape
torch.Size([2, 80, 861])
- forward(x)[source]
Computes the forward pass of the postnet
- Parameters
x (torch.Tensor) – the postnet input (usually a MEL spectrogram)
- Returns
output – the postnet output (a refined MEL spectrogram or a linear spectrogram depending on how the model is configured)
- Return type
- class speechbrain.lobes.models.Tacotron2.Encoder(encoder_n_convolutions=3, encoder_embedding_dim=512, encoder_kernel_size=5)[source]
Bases:
Module
The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM
- Parameters
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Encoder
>>> layer = Encoder()
>>> x = torch.randn(2, 512, 128)
>>> input_lengths = torch.tensor([128, 83])
>>> outputs = layer(x, input_lengths)
>>> outputs.shape
torch.Size([2, 128, 512])
- forward(x, input_lengths)[source]
Computes the encoder forward pass
- Parameters
x (torch.Tensor) – a batch of inputs (sequence embeddings)
input_lengths (torch.Tensor) – a tensor of input lengths
- Returns
outputs – the encoder output
- Return type
- infer(x, input_lengths)[source]
Performs a forward step in the inference context
- Parameters
x (torch.Tensor) – a batch of inputs (sequence embeddings)
input_lengths (torch.Tensor) – a tensor of input lengths
- Returns
outputs – the encoder output
- Return type
- class speechbrain.lobes.models.Tacotron2.Decoder(n_mel_channels=80, n_frames_per_step=1, encoder_embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, attention_rnn_dim=1024, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, early_stopping=True)[source]
Bases:
Module
The Tacotron decoder
- Parameters
n_mel_channels (int) – the number of channels in the MEL spectrogram
n_frames_per_step (int) – the number of frames in the spectrogram for each time step of the decoder
encoder_embedding_dim (int) – the dimension of the encoder embedding
attention_location_n_filters (int) – the number of filters in location-based attention
attention_location_kernel_size (int) – the kernel size of location-based attention
attention_rnn_dim (int) – RNN dimension for the attention layer
decoder_rnn_dim (int) – the decoder RNN dimension
prenet_dim (int) – the dimension of the prenet (inner and output layers)
max_decoder_steps (int) – the maximum number of decoder steps for the longest utterance expected for the model
gate_threshold (float) – the fixed threshold to which the outputs of the decoders will be compared
p_attention_dropout (float) – dropout probability for attention layers
Example
>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Decoder
>>> layer = Decoder()
>>> memory = torch.randn(2, 173, 512)
>>> decoder_inputs = torch.randn(2, 80, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mel_outputs, gate_outputs, alignments = layer(
...     memory, decoder_inputs, memory_lengths)
>>> mel_outputs.shape, gate_outputs.shape, alignments.shape
(torch.Size([2, 80, 173]), torch.Size([2, 173]), torch.Size([2, 173, 173]))
- get_go_frame(memory)[source]
Gets all zeros frames to use as first decoder input
- Parameters
memory (torch.Tensor) – encoder outputs
- Returns
decoder_input – all zeros frames
- Return type
- initialize_decoder_states(memory)[source]
Initializes attention rnn states, decoder rnn states, attention weights, attention cumulative weights, attention context, stores memory and stores processed memory
- Parameters
memory (torch.Tensor) – Encoder outputs
mask (torch.Tensor) – Mask for padded data if training, expects None for inference
- Returns
result – a tuple of tensors (attention_hidden, attention_cell, decoder_hidden, decoder_cell, attention_weights, attention_weights_cum, attention_context, processed_memory)
- Return type
- parse_decoder_inputs(decoder_inputs)[source]
Prepares decoder inputs, i.e. mel outputs
- Parameters
decoder_inputs (torch.Tensor) – inputs used for teacher-forced training, i.e. mel-specs
- Returns
decoder_inputs – processed decoder inputs
- Return type
- parse_decoder_outputs(mel_outputs, gate_outputs, alignments)[source]
Prepares decoder outputs for output
- Parameters
mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs
gate_outputs (torch.Tensor) – gate output energies
alignments (torch.Tensor) – the alignment tensor
- Returns
mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs
gate_outputs (torch.Tensor) – gate output energies
alignments (torch.Tensor) – the alignment tensor
- decode(decoder_input, attention_hidden, attention_cell, decoder_hidden, decoder_cell, attention_weights, attention_weights_cum, attention_context, memory, processed_memory, mask)[source]
Decoder step using stored states, attention and memory
- Parameters
decoder_input (torch.Tensor) – previous mel output
attention_hidden (torch.Tensor) – the hidden state of the attention module
attention_cell (torch.Tensor) – the attention cell state
decoder_hidden (torch.Tensor) – the decoder hidden state
decoder_cell (torch.Tensor) – the decoder cell state
attention_weights (torch.Tensor) – the attention weights
attention_weights_cum (torch.Tensor) – cumulative attention weights
attention_context (torch.Tensor) – the attention context tensor
memory (torch.Tensor) – the memory tensor
processed_memory (torch.Tensor) – the processed memory tensor
mask (torch.Tensor)
- Returns
mel_output (torch.Tensor) – the MEL-scale outputs
gate_output (torch.Tensor) – gate output energies
attention_weights (torch.Tensor) – attention weights
- forward(memory, decoder_inputs, memory_lengths)[source]
Decoder forward pass for training
- Parameters
memory (torch.Tensor) – Encoder outputs
decoder_inputs (torch.Tensor) – Decoder inputs for teacher forcing, i.e. mel-specs
memory_lengths (torch.Tensor) – Encoder output lengths for attention masking.
- Returns
mel_outputs (torch.Tensor) – mel outputs from the decoder
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder
- infer(memory, memory_lengths)[source]
Decoder inference
- Parameters
memory (torch.Tensor) – Encoder outputs
- Returns
mel_outputs (torch.Tensor) – mel outputs from the decoder
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder
mel_lengths (torch.Tensor) – the length of MEL spectrograms
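The interplay between gate_threshold and max_decoder_steps during inference can be sketched as follows for a single utterance. This is a simplified illustration in plain Python, assuming the gate energies are passed through a sigmoid and compared with the threshold after each generated frame. The function decode_length is hypothetical, and the real decoder works on batches and tracks a separate length for each item.

```python
import math


def decode_length(gate_energies, gate_threshold=0.5, max_decoder_steps=1000):
    """Returns the number of frames emitted before generation stops."""
    steps = 0
    for energy in gate_energies:
        steps += 1  # a frame is emitted at every step
        stop_probability = 1.0 / (1.0 + math.exp(-energy))  # sigmoid
        if stop_probability > gate_threshold:
            break  # the gate fired: the utterance is considered complete
        if steps >= max_decoder_steps:
            break  # safety limit for utterances that never trigger the gate
    return steps


print(decode_length([-3.0, -2.0, 4.0, -1.0]))           # 3
print(decode_length([-3.0] * 10, max_decoder_steps=4))  # 4
```

In the first call the gate fires on the third frame. In the second call the gate never fires, so the step limit ends generation.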
- class speechbrain.lobes.models.Tacotron2.Tacotron2(mask_padding=True, n_mel_channels=80, n_symbols=148, symbols_embedding_dim=512, encoder_kernel_size=5, encoder_n_convolutions=3, encoder_embedding_dim=512, attention_rnn_dim=1024, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, n_frames_per_step=1, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, decoder_no_early_stopping=False)[source]
Bases:
Module
The Tacotron2 text-to-speech model, based on the NVIDIA implementation.
This class is the main entry point for the model. It instantiates all submodules, which in turn manage the individual neural network layers.
Simplified structure: input -> word embedding -> encoder -> attention -> decoder (+ prenet) -> postnet -> output
The prenet takes the decoder output from the previous time step as its input; its output is concatenated with the attention output and fed to the decoder.
- Parameters
mask_padding (bool) – whether or not to mask pad-outputs of tacotron
n_mel_channels (int) – number of mel channels for constructing the spectrogram
n_symbols (int) – number of accepted char symbols defined in textToSequence
symbols_embedding_dim (int) – number of embedding dimensions for symbols fed to nn.Embedding
encoder_kernel_size (int) – size of the kernel processing the embeddings
encoder_n_convolutions (int) – number of convolution layers in the encoder
encoder_embedding_dim (int) – number of kernels in the encoder; this is also the dimension of the bidirectional LSTM in the encoder
attention_rnn_dim (int) – input dimension
attention_dim (int) – number of hidden representations in attention
attention_location_n_filters (int) – number of 1-D convolution filters in attention
attention_location_kernel_size (int) – length of the 1-D convolution filters
n_frames_per_step (int) – only 1 generated mel-frame per step is supported for the decoder as of now
decoder_rnn_dim (int) – number of units in the 2 stacked unidirectional LSTMs
prenet_dim (int) – dimension of the linear prenet layers
max_decoder_steps (int) – maximum number of steps/frames the decoder generates before stopping
p_attention_dropout (float) – attention dropout probability
p_decoder_dropout (float) – decoder dropout probability
gate_threshold (float) – cut-off level; any output probability above it is considered complete and stops generation, so that outputs have variable length
decoder_no_early_stopping (bool) – determines early stopping of the decoder along with gate_threshold. The logical inverse of this is fed to the decoder
postnet_embedding_dim (int) – number of postnet filters
postnet_kernel_size (int) – 1-D size of the postnet kernel
postnet_n_convolutions (int) – number of convolution layers in the postnet
Example
>>> import torch
>>> _ = torch.manual_seed(213312)
>>> from speechbrain.lobes.models.Tacotron2 import Tacotron2
>>> model = Tacotron2(
...     mask_padding=True,
...     n_mel_channels=80,
...     n_symbols=148,
...     symbols_embedding_dim=512,
...     encoder_kernel_size=5,
...     encoder_n_convolutions=3,
...     encoder_embedding_dim=512,
...     attention_rnn_dim=1024,
...     attention_dim=128,
...     attention_location_n_filters=32,
...     attention_location_kernel_size=31,
...     n_frames_per_step=1,
...     decoder_rnn_dim=1024,
...     prenet_dim=256,
...     max_decoder_steps=32,
...     gate_threshold=0.5,
...     p_attention_dropout=0.1,
...     p_decoder_dropout=0.1,
...     postnet_embedding_dim=512,
...     postnet_kernel_size=5,
...     postnet_n_convolutions=5,
...     decoder_no_early_stopping=False
... )
>>> _ = model.eval()
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> outputs, output_lengths, alignments = model.infer(inputs, input_lengths)
>>> outputs.shape, output_lengths.shape, alignments.shape
(torch.Size([2, 80, 1]), torch.Size([2]), torch.Size([2, 1, 5]))
- parse_output(outputs, output_lengths, alignments_dim=None)[source]
Masks the padded part of output
- Parameters
outputs (list) – a list of tensors - raw outputs
output_lengths (torch.Tensor) – a tensor representing the lengths of all outputs
alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training
- Returns
result – a (mel_outputs, mel_outputs_postnet, gate_outputs, alignments) tuple with the original outputs - with the mask applied
- Return type
- forward(inputs, alignments_dim=None)[source]
Model forward pass for training
- Parameters
- Returns
mel_outputs (torch.Tensor) – mel outputs from the decoder
mel_outputs_postnet (torch.Tensor) – mel outputs from postnet
gate_outputs (torch.Tensor) – gate outputs from the decoder
alignments (torch.Tensor) – sequence of attention weights from the decoder
output_lengths (torch.Tensor) – length of the output without padding
- infer(inputs, input_lengths)[source]
Produces outputs
- Parameters
inputs (torch.Tensor) – text or phonemes converted
input_lengths (torch.Tensor) – the lengths of input parameters
- Returns
mel_outputs_postnet (torch.Tensor) – final mel output of tacotron 2
mel_lengths (torch.Tensor) – length of mels
alignments (torch.Tensor) – sequence of attention weights
- speechbrain.lobes.models.Tacotron2.get_mask_from_lengths(lengths, max_len=None)[source]
Creates a mask from a tensor of lengths
- Parameters
lengths (torch.Tensor) – a tensor of sequence lengths
max_len (int) – the maximum length, i.e. the last dimension of the mask tensor. If not provided, it will be calculated automatically
- Returns
mask (torch.Tensor) – the mask
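The idea of building a mask from lengths can be sketched in plain Python with nested lists. The helper padding_mask below is hypothetical, and it assumes that True marks padded positions. The polarity used by get_mask_from_lengths itself is not stated in this reference and should be checked against the source before relying on it.

```python
def padding_mask(lengths, max_len=None):
    """Builds a (batch, max_len) boolean mask; True marks padded positions."""
    if max_len is None:
        max_len = max(lengths)  # default: the longest sequence in the batch
    return [
        [position >= length for position in range(max_len)]
        for length in lengths
    ]


mask = padding_mask([3, 1])
print(mask)  # [[False, False, False], [False, True, True]]
```

Passing max_len explicitly widens the mask beyond the longest sequence, which helps when every batch must share the same last dimension.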
- speechbrain.lobes.models.Tacotron2.infer(model, text_sequences, input_lengths)[source]
An inference hook for pretrained synthesizers
- Parameters
model (Tacotron2) – the tacotron model
text_sequences (torch.Tensor) – encoded text sequences
input_lengths (torch.Tensor) – input lengths
- Returns
result – (mel_outputs_postnet, mel_lengths, alignments) - the exact model output
- Return type
- speechbrain.lobes.models.Tacotron2.LossStats
alias of
TacotronLoss
- class speechbrain.lobes.models.Tacotron2.Loss(guided_attention_sigma=None, gate_loss_weight=1.0, guided_attention_weight=1.0, guided_attention_scheduler=None, guided_attention_hard_stop=None)[source]
Bases:
Module
The Tacotron loss implementation
The loss consists of an MSE loss on the spectrogram, a BCE gate loss and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal
The output of the module is a LossStats tuple, which includes both the total loss and the individual losses
- Parameters
guided_attention_sigma (float) – The guided attention sigma factor, controlling the “width” of the mask
gate_loss_weight (float) – The constant by which the gate loss will be multiplied
guided_attention_weight (float) – The weight for the guided attention
guided_attention_scheduler (callable) – The scheduler class for the guided attention loss
guided_attention_hard_stop (int) – The number of epochs after which guided attention will be completely turned off
Example
>>> import torch
>>> _ = torch.manual_seed(42)
>>> from speechbrain.lobes.models.Tacotron2 import Loss
>>> loss = Loss(guided_attention_sigma=0.2)
>>> mel_target = torch.randn(2, 80, 861)
>>> gate_target = torch.randn(1722, 1)
>>> mel_out = torch.randn(2, 80, 861)
>>> mel_out_postnet = torch.randn(2, 80, 861)
>>> gate_out = torch.randn(2, 861)
>>> alignments = torch.randn(2, 861, 173)
>>> targets = mel_target, gate_target
>>> model_outputs = mel_out, mel_out_postnet, gate_out, alignments
>>> input_lengths = torch.tensor([173, 91])
>>> target_lengths = torch.tensor([861, 438])
>>> loss(model_outputs, targets, input_lengths, target_lengths, 1)
TacotronLoss(loss=tensor(4.8566), mel_loss=tensor(4.0097), gate_loss=tensor(0.8460), attn_loss=tensor(0.0010), attn_weight=tensor(1.))
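The guided attention term is described above as pushing the attention matrix towards the diagonal, with sigma controlling the width of the mask. A common formulation, assumed here and not confirmed by this reference, penalizes each alignment entry with the soft mask W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2)), where N is the input length and T the target length. The sketch below implements that formulation in plain Python for one unpadded alignment of shape (target_length, input_length). Both function names are hypothetical.

```python
import math


def guided_attention_weights(input_length, target_length, sigma=0.2):
    """Soft mask that is 0 on the diagonal and approaches 1 far from it."""
    return [
        [
            1.0 - math.exp(
                -((n / input_length - t / target_length) ** 2) / (2 * sigma ** 2)
            )
            for n in range(input_length)
        ]
        for t in range(target_length)
    ]


def guided_attention_loss(alignment, sigma=0.2):
    """Mean of alignment * mask over one (target_length, input_length) matrix."""
    target_length, input_length = len(alignment), len(alignment[0])
    weights = guided_attention_weights(input_length, target_length, sigma)
    total = sum(
        a * w
        for row_a, row_w in zip(alignment, weights)
        for a, w in zip(row_a, row_w)
    )
    return total / (target_length * input_length)


diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
reversed_alignment = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal))            # 0.0, nothing to penalize
print(guided_attention_loss(reversed_alignment))  # positive, mass is off-diagonal
```

A smaller sigma narrows the band around the diagonal that goes unpenalized, so the same off-diagonal alignment yields a larger loss.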
- forward(model_output, targets, input_lengths, target_lengths, epoch)[source]
Computes the loss
- Parameters
model_output (tuple) – the output of the model’s forward(): (mel_outputs, mel_outputs_postnet, gate_outputs, alignments)
targets (tuple) – the targets
input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths
target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths
epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used
- Returns
result – the total loss - and individual losses (mel and gate)
- Return type
LossStats
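How the individual terms combine into the total is not spelled out here. One plausible reading, assumed for this sketch, is a weighted sum of the mel, gate and guided attention losses. The function total_loss is hypothetical. The figures are taken from the class example, where the reported total is 4.8566.

```python
def total_loss(mel_loss, gate_loss, attn_loss,
               gate_loss_weight=1.0, guided_attention_weight=1.0):
    """Weighted sum of the three loss terms (an assumption of this sketch)."""
    return (
        mel_loss
        + gate_loss_weight * gate_loss
        + guided_attention_weight * attn_loss
    )


# Values copied from the TacotronLoss tuple printed in the class example
total = total_loss(mel_loss=4.0097, gate_loss=0.8460, attn_loss=0.0010)
print(abs(total - 4.8566) < 1e-3)  # True
```

The small remaining difference is consistent with the printed tensors having been rounded to four decimal places.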
- get_attention_loss(alignments, input_lengths, target_lengths, epoch)[source]
Computes the attention loss
- Parameters
alignments (torch.Tensor) – the alignment matrix from the model
input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths
target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths
epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used
- Returns
attn_loss – the attention loss value
- Return type
- class speechbrain.lobes.models.Tacotron2.TextMelCollate(n_frames_per_step=1)[source]
Bases:
object
Zero-pads model inputs and targets based on number of frames per step
- Parameters
n_frames_per_step (int) – the number of output frames per step
- Returns
result – a tuple of tensors to be used as inputs/targets (text_padded, input_lengths, mel_padded, gate_padded, output_lengths, len_x)
- Return type
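The padding logic of a collate function of this kind can be sketched in plain Python. This is a simplified illustration under several assumptions: texts are lists of symbol ids, each spectrogram is a list of frames (time-major, unlike the channel-major tensors used elsewhere in this module), the padded spectrogram length is rounded up to a multiple of n_frames_per_step, and the gate target is 1 from the last real frame onwards. The function collate is hypothetical, and it omits sorting and the len_x element of the real return value.

```python
def collate(texts, mels, n_frames_per_step=1):
    """Zero-pads a batch of (text, mel) pairs to common lengths."""
    input_lengths = [len(t) for t in texts]
    max_text = max(input_lengths)
    text_padded = [t + [0] * (max_text - len(t)) for t in texts]

    output_lengths = [len(m) for m in mels]
    max_frames = max(output_lengths)
    remainder = max_frames % n_frames_per_step
    if remainder:
        # Round up so that the frames split evenly into decoder steps
        max_frames += n_frames_per_step - remainder

    n_mel = len(mels[0][0])
    mel_padded = [
        m + [[0.0] * n_mel for _ in range(max_frames - len(m))] for m in mels
    ]
    # Gate target: 0 while speaking, 1 from the last real frame onwards
    gate_padded = [
        [1.0 if i >= len(m) - 1 else 0.0 for i in range(max_frames)]
        for m in mels
    ]
    return text_padded, input_lengths, mel_padded, gate_padded, output_lengths


texts = [[5, 6, 7], [8]]
mels = [[[0.1, 0.2]] * 3, [[0.3, 0.4]] * 5]
text_padded, input_lengths, mel_padded, gate_padded, output_lengths = collate(
    texts, mels, n_frames_per_step=2
)
print(text_padded)        # [[5, 6, 7], [8, 0, 0]]
print(len(mel_padded[0])) # 6
print(gate_padded[0])     # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

With n_frames_per_step=2 the longest spectrogram (5 frames) is padded to 6 frames so that it divides evenly into decoder steps.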
- speechbrain.lobes.models.Tacotron2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]
Dynamic range compression for audio signals
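The formula is not given in this reference. In Tacotron2 implementations, dynamic range compression is commonly a clamped logarithm, log(max(x, clip_val) * C), and the sketch below assumes that form. It operates on a plain list of floats rather than a tensor, and the function name compress is hypothetical.

```python
import math


def compress(x, C=1, clip_val=1e-5):
    """Clamped log compression: log(max(value, clip_val) * C) per element."""
    # Clamping avoids log(0) for silent (zero-energy) bins
    return [math.log(max(value, clip_val) * C) for value in x]


compressed = compress([1.0, 0.0])
print(compressed[0])                         # 0.0
print(compressed[1] == math.log(1e-5))       # True
```

The clip_val floor bounds the output from below, so silent regions map to a fixed negative value instead of negative infinity.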
- speechbrain.lobes.models.Tacotron2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]
Calculates the mel spectrogram for a raw audio signal
- Parameters
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – whether to do dynamic range compression
audio (torch.Tensor) – input audio signal