speechbrain.lobes.models.ECAPA_TDNN module

A popular speaker recognition and diarization model.

Authors

Hwidong Na 2020

Summary

Classes:

`AttentiveStatisticsPooling`	This class implements an attentive statistics pooling layer for each channel.
`BatchNorm1d`	1D batch normalization.
`Classifier`	This class implements the cosine similarity on top of the features.
`Conv1d`	1D convolution.
`ECAPA_TDNN`	An implementation of the ECAPA-TDNN speaker embedding model.
`Res2NetBlock`	An implementation of Res2NetBlock with dilation.
`SEBlock`	An implementation of the squeeze-and-excitation block.
`SERes2NetBlock`	An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.
`TDNNBlock`	An implementation of TDNN.

Reference

class speechbrain.lobes.models.ECAPA_TDNN.Conv1d(*args, **kwargs)[source]

Bases: Conv1d

1D convolution. The transpose is skipped to improve efficiency, so inputs are expected in (batch, channel, time) layout.

class speechbrain.lobes.models.ECAPA_TDNN.BatchNorm1d(*args, **kwargs)[source]

Bases: BatchNorm1d

1D batch normalization. The transpose is skipped to improve efficiency, so inputs are expected in (batch, channel, time) layout.

class speechbrain.lobes.models.ECAPA_TDNN.TDNNBlock(in_channels, out_channels, kernel_size, dilation, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)[source]

Bases: Module

An implementation of TDNN.

Parameters:

in_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
kernel_size (int) – The kernel size of the TDNN block.
dilation (int) – The dilation of the TDNN block.
activation (torch class) – A class for constructing the activation layers.
groups (int) – The number of groups in the TDNN block's convolution.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import TDNNBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])

forward(x)[source]: Processes the input tensor x and returns an output tensor.
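
The example keeps the time dimension at 120 frames, which is consistent with padding that preserves length. As a rough illustration of what kernel_size and dilation control, the sketch below computes how many input frames one dilated kernel spans and how much total padding keeps the length unchanged at stride 1. The helper names are hypothetical and not part of SpeechBrain; the formulas are the standard ones for dilated 1D convolution.

```python
def dilated_kernel_span(kernel_size: int, dilation: int) -> int:
    """Number of consecutive input frames spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def same_total_padding(kernel_size: int, dilation: int) -> int:
    """Total padding (left plus right) that keeps the length unchanged at stride 1."""
    return dilated_kernel_span(kernel_size, dilation) - 1


def output_length(length: int, kernel_size: int, dilation: int, total_padding: int) -> int:
    """Output length of a stride-1 1D convolution."""
    return length + total_padding - dilated_kernel_span(kernel_size, dilation) + 1


# kernel_size=3 with dilation=2 looks at frames t-2, t, t+2: a span of 5 frames.
span = dilated_kernel_span(kernel_size=3, dilation=2)
padding = same_total_padding(kernel_size=3, dilation=2)
frames_out = output_length(120, kernel_size=3, dilation=2, total_padding=padding)
print(span, padding, frames_out)  # 5 4 120
```

Larger dilations widen the temporal context of a layer without adding parameters, which is why the default ECAPA_TDNN configuration listed further down increases the dilation from layer to layer.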

class speechbrain.lobes.models.ECAPA_TDNN.Res2NetBlock(in_channels, out_channels, scale=8, kernel_size=3, dilation=1, dropout=0.0)[source]

Bases: Module

An implementation of Res2NetBlock with dilation.

Parameters:

in_channels (int) – The number of channels expected in the input.
out_channels (int) – The number of output channels.
scale (int) – The scale of the Res2Net block.
kernel_size (int) – The kernel size of the Res2Net block.
dilation (int) – The dilation of the Res2Net block.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Res2NetBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])

forward(x)[source]: Processes the input tensor x and returns an output tensor.
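
To make the role of scale more concrete, here is a minimal pure-Python sketch of the general Res2Net scheme: the channels are split into scale groups, the first group passes through unchanged, and each later group is processed after adding the previous group's output. This is an assumption about the general design, not a copy of the module's code; `op` is a hypothetical placeholder for the dilated TDNN sub-block, and a single frame is represented as a flat list of channel values.

```python
from typing import Callable, List

Chunk = List[float]


def split_channels(x: Chunk, scale: int) -> List[Chunk]:
    """Split the channel values of one frame into `scale` equal groups."""
    if len(x) % scale != 0:
        raise ValueError("number of channels must be divisible by scale")
    width = len(x) // scale
    return [x[i * width:(i + 1) * width] for i in range(scale)]


def res2net_sketch(x: Chunk, scale: int, op: Callable[[Chunk], Chunk]) -> Chunk:
    """Hierarchical processing: each group also sees the previous group's output."""
    outputs: List[Chunk] = []
    y: Chunk = []
    for i, chunk in enumerate(split_channels(x, scale)):
        if i == 0:
            y = chunk                      # first group is passed through
        elif i == 1:
            y = op(chunk)                  # second group: op only
        else:
            y = op([a + b for a, b in zip(chunk, y)])  # add previous output first
        outputs.append(y)
    # Concatenate the groups back into one channel vector.
    return [v for out in outputs for v in out]


# Placeholder op (doubling) standing in for the learned sub-block.
result = res2net_sketch([1.0] * 8, scale=4, op=lambda c: [2.0 * v for v in c])
print(result)  # [1.0, 1.0, 2.0, 2.0, 6.0, 6.0, 14.0, 14.0]
```

The output has as many channels as the input, and later groups accumulate the effect of several applications of `op`, which is how the block mixes several temporal scales.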

class speechbrain.lobes.models.ECAPA_TDNN.SEBlock(in_channels, se_channels, out_channels)[source]

Bases: Module

An implementation of the squeeze-and-excitation block.

Parameters:

in_channels (int) – The number of input channels.
se_channels (int) – The number of output channels after squeeze.
out_channels (int) – The number of output channels.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SEBlock
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> se_layer = SEBlock(64, 16, 64)
>>> lengths = torch.rand((8,))
>>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])

forward(x, lengths=None)[source]: Processes the input tensor x and returns an output tensor.

class speechbrain.lobes.models.ECAPA_TDNN.AttentiveStatisticsPooling(channels, attention_channels=128, global_context=True)[source]

Bases: Module

This class implements an attentive statistics pooling layer for each channel. It returns the concatenated mean and standard deviation of the input tensor.

Parameters:

channels (int) – The number of input channels.
attention_channels (int) – The number of attention channels.
global_context (bool) – Whether to use global context.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import AttentiveStatisticsPooling
>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> asp_layer = AttentiveStatisticsPooling(64)
>>> lengths = torch.rand((8,))
>>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 1, 128])

forward(x, lengths=None)[source]

Calculates the attention-weighted mean and standard deviation of the input tensor over the time dimension.

Parameters:

x (torch.Tensor) – Tensor of shape [N, C, L].
lengths (torch.Tensor) – The corresponding relative lengths of the inputs.

Returns:

pooled_stats – Concatenated mean and standard deviation of the batch.

Return type:

torch.Tensor
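
The pooled output in the example has 128 channels for a 64-channel input because the mean and the standard deviation are concatenated. The following pure-Python sketch illustrates attention-weighted statistics for a few channels. It assumes the attention scores are already given and normalized with a softmax over time; how the module computes those scores is not shown here, and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores over time so the weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats(frames: List[float], scores: List[float], eps: float = 1e-12) -> Tuple[float, float]:
    """Attention-weighted mean and standard deviation of one channel."""
    weights = softmax(scores)
    mean = sum(w * x for w, x in zip(weights, frames))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, frames))
    return mean, math.sqrt(max(var, eps))


def pool(channel_frames: List[List[float]], channel_scores: List[List[float]]) -> List[float]:
    """Concatenate the per-channel means followed by the per-channel stds."""
    stats = [attentive_stats(f, s) for f, s in zip(channel_frames, channel_scores)]
    return [m for m, _ in stats] + [s for _, s in stats]


frames = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 10.0, 10.0]]
uniform = [[0.0, 0.0, 0.0, 0.0]] * 2  # equal scores reduce to the plain mean and std
pooled = pool(frames, uniform)
print(len(pooled))  # 4: two means followed by two standard deviations
```

With non-uniform scores, frames that receive higher attention contribute more to both statistics, which is the point of the attentive variant.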

class speechbrain.lobes.models.ECAPA_TDNN.SERes2NetBlock(in_channels, out_channels, res2net_scale=8, se_channels=128, kernel_size=1, dilation=1, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1, dropout=0.0)[source]

Bases: Module

An implementation of the building block in ECAPA-TDNN, i.e., TDNN-Res2Net-TDNN-SEBlock.

Parameters:

in_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
res2net_scale (int) – The scale of the Res2Net block.
se_channels (int) – The number of output channels after squeeze.
kernel_size (int) – The kernel size of the TDNN blocks.
dilation (int) – The dilation of the Res2Net block.
activation (torch class) – A class for constructing the activation layers.
groups (int) – Number of blocked connections from input channels to output channels.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import SERes2NetBlock
>>> x = torch.rand(8, 120, 64).transpose(1, 2)
>>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
>>> out = conv(x).transpose(1, 2)
>>> out.shape
torch.Size([8, 120, 64])

forward(x, lengths=None)[source]: Processes the input tensor x and returns an output tensor.

class speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN(input_size, device='cpu', lin_neurons=192, activation=<class 'torch.nn.modules.activation.ReLU'>, channels=[512, 512, 512, 512, 1536], kernel_sizes=[5, 3, 3, 3, 1], dilations=[1, 2, 3, 4, 1], attention_channels=128, res2net_scale=8, se_channels=128, global_context=True, groups=[1, 1, 1, 1, 1], dropout=0.0)[source]

Bases: Module

An implementation of the speaker embedding model described in the paper “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification” (https://arxiv.org/abs/2005.07143).

Parameters:

input_size (int) – Expected size of the input dimension.
device (str) – Device used, e.g., “cpu” or “cuda”.
lin_neurons (int) – Number of neurons in linear layers.
activation (torch class) – A class for constructing the activation layers.
channels (list of ints) – Output channels for each TDNN/SERes2Net layer.
kernel_sizes (list of ints) – List of kernel sizes for each layer.
dilations (list of ints) – List of dilations for kernels in each layer.
attention_channels (int) – The number of attention channels.
res2net_scale (int) – The scale of the Res2Net block.
se_channels (int) – The number of output channels after squeeze.
global_context (bool) – Whether to use global context.
groups (list of ints) – List of groups for kernels in each layer.
dropout (float) – Rate of channel dropout during training.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN
>>> input_feats = torch.rand([5, 120, 80])
>>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
>>> outputs = compute_embedding(input_feats)
>>> outputs.shape
torch.Size([5, 1, 192])

forward(x, lengths=None)[source]

Returns the embedding vector.

Parameters:

x (torch.Tensor) – Tensor of shape (batch, time, channel).
lengths (torch.Tensor) – Corresponding relative lengths of inputs.

Returns:

x – Embedding vector.

Return type:

torch.Tensor
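
The default channels, kernel_sizes, dilations and groups lists all have five entries, which suggests one entry per layer. Assuming that reading, the sketch below pairs the lists up into per-layer settings and rejects lists of unequal length before a model is built. The helper `per_layer_settings` is hypothetical and is not part of SpeechBrain.

```python
from typing import Dict, List


def per_layer_settings(
    channels: List[int],
    kernel_sizes: List[int],
    dilations: List[int],
    groups: List[int],
) -> List[Dict[str, int]]:
    """Combine the per-layer lists into one settings dict per layer."""
    expected = len(channels)
    for name, values in (
        ("kernel_sizes", kernel_sizes),
        ("dilations", dilations),
        ("groups", groups),
    ):
        if len(values) != expected:
            raise ValueError(f"{name} has {len(values)} entries, expected {expected}")
    return [
        {"channels": c, "kernel_size": k, "dilation": d, "groups": g}
        for c, k, d, g in zip(channels, kernel_sizes, dilations, groups)
    ]


# Default values taken from the class signature above.
layers = per_layer_settings(
    channels=[512, 512, 512, 512, 1536],
    kernel_sizes=[5, 3, 3, 3, 1],
    dilations=[1, 2, 3, 4, 1],
    groups=[1, 1, 1, 1, 1],
)
for index, spec in enumerate(layers):
    print(index, spec)
```

Checking the lists up front gives a clearer error than a shape mismatch raised somewhere inside the network.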

class speechbrain.lobes.models.ECAPA_TDNN.Classifier(input_size, device='cpu', lin_blocks=0, lin_neurons=192, out_neurons=1211)[source]

Bases: Module

This class implements the cosine similarity on top of the features.

Parameters:

input_size (int) – Expected size of input dimension.
device (str) – Device used, e.g., “cpu” or “cuda”.
lin_blocks (int) – Number of linear layers.
lin_neurons (int) – Number of neurons in linear layers.
out_neurons (int) – Number of classes.

Example

>>> import torch
>>> from speechbrain.lobes.models.ECAPA_TDNN import Classifier
>>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
>>> outputs = torch.tensor(
...     [[1.0, -1.0], [-9.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
... )
>>> outputs = outputs.unsqueeze(1)
>>> cos = classify(outputs)
>>> (cos < -1.0).long().sum()
tensor(0)
>>> (cos > 1.0).long().sum()
tensor(0)

forward(x)[source]

Returns the output scores over speakers. Because the class implements cosine similarity, the scores lie in [-1, 1], as the example checks.

Parameters:

x (torch.Tensor) – Torch tensor.

Returns:

out – Output scores over speakers.

Return type:

torch.Tensor
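
The bounds checked in the example (no value below -1 or above 1) follow from cosine similarity between normalized vectors. As a minimal sketch, assuming each class is represented by one weight vector, the scores can be computed as below. This is plain Python for illustration with hypothetical function names, not the module's implementation.

```python
import math
from typing import List


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def cosine_scores(features: List[float], class_weights: List[List[float]]) -> List[float]:
    """Cosine similarity between one feature vector and each class weight vector."""
    unit_features = l2_normalize(features)
    return [
        sum(a * b for a, b in zip(unit_features, l2_normalize(weights)))
        for weights in class_weights
    ]


# Hypothetical weights for two classes; the feature vector comes from the example above.
scores = cosine_scores([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(s, 4) for s in scores])  # [0.7071, -0.7071]
```

Because both sides are normalized, the magnitude of the input does not affect the result: the input [-9.0, 1.0] from the example is scored by direction only.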