speechbrain.lobes.models.ECAPA_TDNN module
A popular speaker recognition and diarization model.
- Authors
Hwidong Na 2020
Summary
Classes:
AttentiveStatisticsPooling | This class implements an attentive statistics pooling layer for each channel.
BatchNorm1d | 1D batch normalization.
Classifier | This class implements the cosine similarity on top of features.
Conv1d | 1D convolution.
ECAPA_TDNN | An implementation of the speaker embedding model from the ECAPA-TDNN paper.
Res2NetBlock | An implementation of Res2NetBlock with dilation.
SEBlock | An implementation of the squeeze-and-excitation block.
SERes2NetBlock | An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.
TDNNBlock | An implementation of TDNN.
Reference
- class speechbrain.lobes.models.ECAPA_TDNN.Conv1d(*args, **kwargs)
Bases: Conv1d
1D convolution. Transposition is skipped to improve efficiency, so the input is expected in (batch, channel, time) layout.
- class speechbrain.lobes.models.ECAPA_TDNN.BatchNorm1d(*args, **kwargs)
Bases: BatchNorm1d
1D batch normalization. Transposition is skipped to improve efficiency, so the input is expected in (batch, channel, time) layout.
- class speechbrain.lobes.models.ECAPA_TDNN.TDNNBlock(in_channels, out_channels, kernel_size, dilation, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)
Bases: Module
An implementation of TDNN.
- Parameters:
in_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
kernel_size (int) – The kernel size of the TDNN block.
dilation (int) – The dilation of the TDNN block.
activation (torch class) – A class for constructing the activation layers.
groups (int) – The number of groups in the TDNN block's convolution.
dropout (float) – Rate of channel dropout during training.
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import TDNNBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
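To make the kernel_size and dilation parameters concrete, here is a minimal, dependency-free sketch of how many input frames one dilated 1D kernel spans, and how much total padding keeps the time dimension unchanged (as in the example above, where 120 frames go in and 120 come out). The helper names are hypothetical and are not part of SpeechBrain; a stride of 1 is assumed.

```python
def dilated_kernel_span(kernel_size: int, dilation: int) -> int:
    """Number of input frames covered by one dilated kernel (stride 1)."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return dilation * (kernel_size - 1) + 1


def length_preserving_padding(kernel_size: int, dilation: int) -> int:
    """Total padding (left + right) that keeps output length == input length."""
    return dilated_kernel_span(kernel_size, dilation) - 1


def output_length(input_length: int, kernel_size: int, dilation: int,
                  total_padding: int) -> int:
    """Output length of a stride-1 convolution with the given total padding."""
    span = dilated_kernel_span(kernel_size, dilation)
    return input_length + total_padding - span + 1


# Settings from the example above: kernel_size=3, dilation=1, 120 frames.
padding = length_preserving_padding(3, 1)
print(dilated_kernel_span(3, 1), output_length(120, 3, 1, padding))
```

Raising the dilation widens the span without adding weights, which is why later layers can see more temporal context at the same kernel size.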
- class speechbrain.lobes.models.ECAPA_TDNN.Res2NetBlock(in_channels, out_channels, scale=8, kernel_size=3, dilation=1, dropout=0.0)
Bases: Module
An implementation of Res2NetBlock with dilation.
- Parameters:
in_channels (int) – The number of channels expected in the input.
out_channels (int) – The number of output channels.
scale (int) – The scale of the Res2Net block.
kernel_size (int) – The kernel size of the Res2Net block.
dilation (int) – The dilation of the Res2Net block.
dropout (float) – Rate of channel dropout during training.
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Res2NetBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
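The role of the scale parameter can be illustrated with a small sketch. It assumes the usual hierarchical Res2Net wiring: the channels are split into scale equal groups, the first group passes through unchanged, and each later group is transformed after adding the previous group's output. The function name and the transform argument (a stand-in for the dilated convolution) are hypothetical; this is not SpeechBrain's code, and the time axis is omitted for brevity.

```python
from typing import Callable, List


def res2net_mix(channels: List[float], scale: int,
                transform: Callable[[List[float]], List[float]]) -> List[float]:
    """Apply Res2Net-style hierarchical mixing to a flat list of channel values."""
    if scale < 1 or len(channels) % scale != 0:
        raise ValueError("number of channels must be divisible by scale")
    width = len(channels) // scale
    groups = [channels[i * width:(i + 1) * width] for i in range(scale)]

    outputs = []
    previous = None
    for index, group in enumerate(groups):
        if index == 0:
            current = list(group)          # first group: identity
        elif index == 1:
            current = transform(group)     # second group: transform only
        else:
            # Later groups also receive the previous group's output.
            summed = [a + b for a, b in zip(group, previous)]
            current = transform(summed)
        outputs.append(current)
        previous = current

    # Concatenate the groups back along the channel axis.
    return [value for group in outputs for value in group]


def double(values: List[float]) -> List[float]:
    """Toy transform used only for illustration."""
    return [2.0 * v for v in values]


print(res2net_mix([1.0, 2.0, 3.0, 4.0], scale=4, transform=double))
```

Because each group feeds the next, later groups effectively pass through several transforms, so one block mixes several temporal resolutions.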
- class speechbrain.lobes.models.ECAPA_TDNN.SEBlock(in_channels, se_channels, out_channels)
Bases: Module
An implementation of the squeeze-and-excitation block.
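- Parameters:
in_channels (int) – The number of input channels.
se_channels (int) – The number of output channels after squeeze.
out_channels (int) – The number of output channels.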
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SEBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> se_layer = SEBlock(64, 16, 64)
>>> lengths = torch.rand((8,))
>>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
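As a rough illustration of what squeeze-and-excitation does, the sketch below averages each channel over time (squeeze), passes the result through a small bottleneck, and rescales each channel by a sigmoid gate (excitation). It is a simplified sketch, not SpeechBrain's implementation: biases and length masking are omitted, the output channel count equals the input channel count, and the function and weight names are hypothetical.

```python
import math
from typing import List


def squeeze_excite(x: List[List[float]],
                   w_squeeze: List[List[float]],
                   w_excite: List[List[float]]) -> List[List[float]]:
    """x holds C channels of T frames each.

    w_squeeze has shape (S, C) and w_excite has shape (C, S), where S is the
    bottleneck width (the role played by se_channels).
    """
    # Squeeze: one average value per channel.
    pooled = [sum(channel) / len(channel) for channel in x]
    # Bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))
              for row in w_squeeze]
    # Expansion followed by a sigmoid: one gate in (0, 1) per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_excite]
    # Excite: rescale every frame of a channel by that channel's gate.
    return [[gate * value for value in channel]
            for gate, channel in zip(gates, x)]


# Two channels, two frames, bottleneck width 1, all-zero weights.
features = [[1.0, 3.0], [2.0, 4.0]]
print(squeeze_excite(features, [[0.0, 0.0]], [[0.0], [0.0]]))
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each value is halved; the shape of the output matches the input, as in the example above.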
- class speechbrain.lobes.models.ECAPA_TDNN.AttentiveStatisticsPooling(channels, attention_channels=128, global_context=True)
Bases: Module
This class implements an attentive statistics pooling layer for each channel. It returns the concatenated mean and standard deviation of the input tensor.
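- Parameters:
channels (int) – The number of input channels.
attention_channels (int) – The number of attention channels.
global_context (bool) – Whether to use global context.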
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import AttentiveStatisticsPooling
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> asp_layer = AttentiveStatisticsPooling(64)
>>> lengths = torch.rand((8,))
>>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 1, 128])
- forward(x, lengths=None)
Calculates the mean and standard deviation for a batch (input tensor).
- Parameters:
x (torch.Tensor) – Tensor of shape [N, C, L].
lengths (torch.Tensor) – The corresponding relative lengths of the inputs.
- Returns:
pooled_stats – Mean and standard deviation of the batch.
- Return type:
torch.Tensor
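The pooling step can be sketched without any framework. The code below assumes the attention scores are already given (in the real layer they would come from a small network whose hidden size is attention_channels), turns them into weights with a softmax over time, and computes a weighted mean and standard deviation per channel. Concatenating the two doubles the channel count, which is consistent with 64 input channels producing 128 outputs in the example above. All names here are hypothetical.

```python
import math
from typing import List, Tuple


def attentive_stats(frames: List[float], scores: List[float],
                    eps: float = 1e-12) -> Tuple[float, float]:
    """Weighted mean and std of one channel, with softmax-normalised scores."""
    # Softmax over time (shifted by the maximum for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    mean = sum(w * x for w, x in zip(weights, frames))
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, frames))
    # Clamp before the square root to avoid a zero or negative argument.
    return mean, math.sqrt(max(variance, eps))


def pool_channels(x: List[List[float]],
                  scores: List[List[float]]) -> List[float]:
    """x and scores hold C channels of T frames; returns 2 * C statistics."""
    stats = [attentive_stats(f, s) for f, s in zip(x, scores)]
    means = [m for m, _ in stats]
    stds = [s for _, s in stats]
    return means + stds


# Uniform scores reduce to the ordinary mean and population std.
print(attentive_stats([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]))
```

Frames with higher scores contribute more to both statistics, so the pooled vector emphasises the parts of the utterance the attention considers informative.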
- class speechbrain.lobes.models.ECAPA_TDNN.SERes2NetBlock(in_channels, out_channels, res2net_scale=8, se_channels=128, kernel_size=1, dilation=1, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)
Bases: Module
An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.
- Parameters:
in_channels (int) – Expected size of input channels.
out_channels (int) – The number of output channels.
res2net_scale (int) – The scale of the Res2Net block.
se_channels (int) – The number of output channels after squeeze.
kernel_size (int) – The kernel size of the TDNN blocks.
dilation (int) – The dilation of the Res2Net block.
activation (torch class) – A class for constructing the activation layers.
groups (int) – Number of blocked connections from input channels to output channels.
dropout (float) – Rate of channel dropout during training.
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SERes2NetBlock
>>> x = torch.rand(8, 120, 64).transpose(1, 2)
>>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
>>> out = conv(x).transpose(1, 2)
>>> out.shape
torch.Size([8, 120, 64])
- class speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN(input_size, device='cpu', lin_neurons=192, activation=<class 'torch.nn.modules.activation.ReLU'>, channels=[512, 512, 512, 512, 1536], kernel_sizes=[5, 3, 3, 3, 1], dilations=[1, 2, 3, 4, 1], attention_channels=128, res2net_scale=8, se_channels=128, global_context=True, groups=[1, 1, 1, 1, 1], dropout=0.0)
Bases: Module
An implementation of the speaker embedding model described in the paper "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification" (https://arxiv.org/abs/2005.07143).
- Parameters:
input_size (int) – Expected size of the input dimension.
device (str) – Device used, e.g., "cpu" or "cuda".
lin_neurons (int) – Number of neurons in linear layers.
activation (torch class) – A class for constructing the activation layers.
channels (list of ints) – Output channels for each TDNN/SERes2Net layer.
kernel_sizes (list of ints) – List of kernel sizes for each layer.
dilations (list of ints) – List of dilations for kernels in each layer.
attention_channels (int) – The number of attention channels.
res2net_scale (int) – The scale of the Res2Net block.
se_channels (int) – The number of output channels after squeeze.
global_context (bool) – Whether to use global context.
groups (list of ints) – List of groups for kernels in each layer.
dropout (float) – Rate of channel dropout during training.
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN
>>> input_feats = torch.rand([5, 120, 80])
>>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
>>> outputs = compute_embedding(input_feats)
>>> outputs.shape
torch.Size([5, 1, 192])
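To see how the default channels list relates to the output shape in the example above, the following sketch traces tensor shapes through an assumed layer order taken from the ECAPA-TDNN paper: an initial TDNN block, a series of SE-Res2Net blocks, channel-wise concatenation of their outputs, an aggregation TDNN block, attentive statistics pooling, and a final projection to lin_neurons. This ordering is an assumption, not a reading of the library's source. The function name is hypothetical, and every convolution is assumed to preserve the time dimension.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, int, int]


def trace_shapes(batch: int, time: int, input_size: int,
                 channels: Sequence[int] = (512, 512, 512, 512, 1536),
                 lin_neurons: int = 192) -> List[Tuple[str, Shape]]:
    """Return (stage, shape) pairs for the assumed layer order."""
    steps = [("input (batch, time, channel)", (batch, time, input_size))]
    # Convolutional blocks operate on (batch, channel, time).
    steps.append(("transpose", (batch, input_size, time)))
    steps.append(("initial TDNN block", (batch, channels[0], time)))

    block_widths = []
    for index, width in enumerate(channels[1:-1], start=1):
        steps.append((f"SE-Res2Net block {index}", (batch, width, time)))
        block_widths.append(width)

    # Assumed aggregation: concatenate SE-Res2Net outputs along channels.
    steps.append(("concatenate block outputs",
                  (batch, sum(block_widths), time)))
    steps.append(("aggregation TDNN block", (batch, channels[-1], time)))
    # Pooling concatenates mean and std: channels double, time collapses.
    steps.append(("attentive statistics pooling",
                  (batch, 2 * channels[-1], 1)))
    steps.append(("final projection", (batch, lin_neurons, 1)))
    steps.append(("transpose", (batch, 1, lin_neurons)))
    return steps


for stage, shape in trace_shapes(batch=5, time=120, input_size=80):
    print(f"{stage:32s} {shape}")
```

Under these assumptions, the three 512-channel block outputs concatenate to 1536 channels, which matches the last entry of the default channels list, and the final shape agrees with the (5, 1, 192) shown in the example.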
- forward(x, lengths=None)
Returns the embedding vector.
- Parameters:
x (torch.Tensor) – Tensor of shape (batch, time, channel).
lengths (torch.Tensor) – Corresponding relative lengths of inputs.
- Returns:
x – Embedding vector.
- Return type:
torch.Tensor
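The lengths argument holds relative lengths, that is, each utterance's length as a fraction of the padded time axis. A minimal sketch of how such values might be turned into a per-frame validity mask is shown below; the function name is hypothetical, and the exact rounding used inside the library may differ.

```python
from typing import List, Sequence


def relative_lengths_to_mask(relative_lengths: Sequence[float],
                             num_frames: int) -> List[List[bool]]:
    """One boolean row per utterance; True marks a frame inside the utterance."""
    mask = []
    for rel in relative_lengths:
        if not 0.0 <= rel <= 1.0:
            raise ValueError("relative lengths must lie in [0, 1]")
        # Number of valid frames, kept as a float and compared per frame.
        valid = rel * num_frames
        mask.append([frame < valid for frame in range(num_frames)])
    return mask


# A batch of two utterances padded to 4 frames: one full, one half length.
print(relative_lengths_to_mask([1.0, 0.5], num_frames=4))
```

A mask like this lets pooling and squeeze-and-excitation ignore padded frames when computing statistics over time.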
- class speechbrain.lobes.models.ECAPA_TDNN.Classifier(input_size, device='cpu', lin_blocks=0, lin_neurons=192, out_neurons=1211)
Bases: Module
This class implements the cosine similarity on top of features.
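- Parameters:
input_size (int) – Expected size of the input dimension.
device (str) – Device used, e.g., "cpu" or "cuda".
lin_blocks (int) – Number of linear layers.
lin_neurons (int) – Number of neurons in linear layers.
out_neurons (int) – Number of classes.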
Example
>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Classifier
>>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
>>> outputs = torch.tensor(
...     [[1.0, -1.0], [-9.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
... )
>>> outputs = outputs.unsqueeze(1)
>>> cos = classify(outputs)
>>> (cos < -1.0).long().sum()
tensor(0)
>>> (cos > 1.0).long().sum()
tensor(0)
- forward(x)
Returns the output probabilities over speakers.
- Parameters:
x (torch.Tensor) – Torch tensor.
- Returns:
out – Output probabilities over speakers.
- Return type:
torch.Tensor
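The example above checks that every score lies in [-1, 1], which follows from how cosine similarity is defined. As a sketch of the idea (not the library's code), the snippet below L2-normalises an embedding and one weight vector per class and takes their dot products. The function names and the toy class weights are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean norm (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def cosine_scores(embedding: Sequence[float],
                  class_weights: Sequence[Sequence[float]]) -> List[float]:
    """Cosine similarity between one embedding and each class weight vector."""
    unit_embedding = l2_normalize(embedding)
    scores = []
    for weights in class_weights:
        unit_weights = l2_normalize(weights)
        scores.append(sum(a * b for a, b in zip(unit_embedding, unit_weights)))
    return scores


# Toy class weights for two classes; the inputs mirror the example above.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
for features in ([1.0, -1.0], [-9.0, 1.0], [0.9, 0.1], [0.1, 0.9]):
    print(cosine_scores(features, toy_weights))
```

Because both vectors have unit norm, each score depends only on the angle between the embedding and the class weights, not on their magnitudes.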