speechbrain.lobes.models.ECAPA_TDNN module

A popular speaker recognition and diarization model.

Authors
  • Hwidong Na 2020

Summary

Classes:

AttentiveStatisticsPooling

This class implements an attentive statistics pooling layer for each channel.

BatchNorm1d

1D batch normalization.

Classifier

This class implements cosine similarity on top of speaker embedding features.

Conv1d

1D convolution.

ECAPA_TDNN

An implementation of the speaker embedding model from the ECAPA-TDNN paper.

Res2NetBlock

An implementation of the Res2Net block with dilation.

SEBlock

An implementation of squeeze-and-excitation block.

SERes2NetBlock

An implementation of the ECAPA-TDNN building block, i.e., TDNN-Res2Net-TDNN-SEBlock.

TDNNBlock

An implementation of a TDNN (time-delay neural network) block.

Reference

class speechbrain.lobes.models.ECAPA_TDNN.Conv1d(*args, **kwargs)[source]

Bases: Conv1d

1D convolution. Skip transpose is used to improve efficiency: inputs are expected in [batch, channels, time] format, so no internal transpose is performed.
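
A minimal usage sketch (assuming the in_channels/out_channels/kernel_size keywords of the parent speechbrain.nnet.CNN.Conv1d; inputs are channels-first because of skip_transpose):

>>> inp_tensor = torch.rand([8, 64, 120])  # [batch, channels, time]
>>> conv = Conv1d(in_channels=64, out_channels=64, kernel_size=3)
>>> conv(inp_tensor).shape
torch.Size([8, 64, 120])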

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.BatchNorm1d(*args, **kwargs)[source]

Bases: BatchNorm1d

1D batch normalization. Skip transpose is used to improve efficiency: inputs are expected in [batch, channels, time] format, so no internal transpose is performed.
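
A minimal usage sketch (assuming the input_size keyword of the parent speechbrain.nnet.normalization.BatchNorm1d; inputs are channels-first because of skip_transpose):

>>> inp_tensor = torch.rand([8, 64, 120])  # [batch, channels, time]
>>> norm = BatchNorm1d(input_size=64)
>>> norm(inp_tensor).shape
torch.Size([8, 64, 120])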

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.TDNNBlock(in_channels, out_channels, kernel_size, dilation, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1)[source]

Bases: Module

An implementation of a TDNN (time-delay neural network) block.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – The number of output channels.

  • kernel_size (int) – The kernel size of the TDNN blocks.

  • dilation (int) – The dilation of the TDNN block.

  • activation (torch class) – A class for constructing the activation layers.

  • groups (int) – The groups size of the TDNN blocks.

Example

>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
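
A conceptual equivalent built from plain PyTorch layers (hypothetical and simplified, e.g. zero padding instead of SpeechBrain's reflect padding): a dilated Conv1d, an activation, then BatchNorm1d.

>>> block = torch.nn.Sequential(
...     torch.nn.Conv1d(64, 64, kernel_size=3, dilation=1, padding='same'),
...     torch.nn.ReLU(),
...     torch.nn.BatchNorm1d(64),
... )
>>> block(torch.rand([8, 64, 120])).shape  # [batch, channels, time]
torch.Size([8, 64, 120])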
forward(x)[source]

Processes the input tensor x and returns an output tensor.

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.Res2NetBlock(in_channels, out_channels, scale=8, kernel_size=3, dilation=1)[source]

Bases: Module

An implementation of the Res2Net block with dilation.

Parameters
  • in_channels (int) – The number of channels expected in the input.

  • out_channels (int) – The number of output channels.

  • scale (int) – The scale of the Res2Net block, i.e., the number of channel groups.

  • kernel_size (int) – The kernel size of the Res2Net block.

  • dilation (int) – The dilation of the Res2Net block.

Example

>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
>>> out_tensor = layer(inp_tensor).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
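
A conceptual sketch of the hierarchical design (hypothetical and simplified: the real block uses a separate TDNNBlock per group rather than one shared plain convolution): channels are split into scale groups, each group after the first is convolved after adding the previous group's output, and the results are re-concatenated.

>>> conv = torch.nn.Conv1d(16, 16, kernel_size=3, dilation=3, padding='same')
>>> x = torch.rand([8, 64, 120])
>>> chunks = list(x.chunk(4, dim=1))  # scale=4 groups of 16 channels each
>>> ys = [chunks[0]]                  # the first group passes through
>>> for i in range(1, 4):
...     inp = chunks[i] if i == 1 else chunks[i] + ys[-1]
...     ys.append(conv(inp))
>>> torch.cat(ys, dim=1).shape        # channel count is preserved
torch.Size([8, 64, 120])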
forward(x)[source]

Processes the input tensor x and returns an output tensor.

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.SEBlock(in_channels, se_channels, out_channels)[source]

Bases: Module

An implementation of squeeze-and-excitation block.

Parameters
  • in_channels (int) – The number of input channels.

  • se_channels (int) – The number of output channels after squeeze.

  • out_channels (int) – The number of output channels.

Example

>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> se_layer = SEBlock(64, 16, 64)
>>> lengths = torch.rand((8,))
>>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 120, 64])
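
A conceptual sketch of squeeze-and-excitation (hypothetical and simplified; length masking is ignored): squeeze by averaging over time, excite through a bottleneck with a sigmoid gate, then rescale the channels of the input.

>>> x = torch.rand([8, 64, 120])                 # [N, C, L]
>>> s = x.mean(dim=2, keepdim=True)              # squeeze: [N, C, 1]
>>> w1 = torch.nn.Conv1d(64, 16, kernel_size=1)  # bottleneck down
>>> w2 = torch.nn.Conv1d(16, 64, kernel_size=1)  # bottleneck up
>>> gate = torch.sigmoid(w2(torch.relu(w1(s))))  # excitation: [N, C, 1]
>>> (x * gate).shape                             # channel-wise rescaling
torch.Size([8, 64, 120])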
forward(x, lengths=None)[source]

Processes the input tensor x and returns an output tensor.

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.AttentiveStatisticsPooling(channels, attention_channels=128, global_context=True)[source]

Bases: Module

This class implements an attentive statistics pooling layer for each channel. It returns the concatenated mean and std of the input tensor.

Parameters
  • channels (int) – The number of input channels.

  • attention_channels (int) – The number of attention channels.

  • global_context (bool) – Whether to use the global context (time-pooled mean and std of the input) when computing attention.

Example

>>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
>>> asp_layer = AttentiveStatisticsPooling(64)
>>> lengths = torch.rand((8,))
>>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
>>> out_tensor.shape
torch.Size([8, 1, 128])
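
A conceptual sketch of the pooling itself (hypothetical and simplified; masking and the global-context features are ignored): attention weights are softmaxed over time, and the attention-weighted mean and standard deviation are concatenated per channel.

>>> x = torch.rand([8, 64, 120])  # [N, C, L]
>>> alpha = torch.softmax(torch.rand([8, 64, 120]), dim=2)
>>> mean = (alpha * x).sum(dim=2)
>>> var = (alpha * (x - mean.unsqueeze(2)) ** 2).sum(dim=2)
>>> std = torch.sqrt(var.clamp(min=1e-12))
>>> torch.cat([mean, std], dim=1).unsqueeze(2).shape  # [N, 2*C, 1]
torch.Size([8, 128, 1])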
forward(x, lengths=None)[source]

Calculates mean and std for a batch (input tensor).

Parameters

x (torch.Tensor) – Tensor of shape [N, C, L].

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.SERes2NetBlock(in_channels, out_channels, res2net_scale=8, se_channels=128, kernel_size=1, dilation=1, activation=<class 'torch.nn.modules.activation.ReLU'>, groups=1)[source]

Bases: Module

An implementation of the ECAPA-TDNN building block, i.e., TDNN-Res2Net-TDNN-SEBlock.

Parameters
  • in_channels (int) – The number of input channels.

  • out_channels (int) – The number of output channels.

  • res2net_scale (int) – The scale of the Res2Net block.

  • se_channels (int) – The number of output channels after squeeze in the SE block.

  • kernel_size (int) – The kernel size of the TDNN blocks.

  • dilation (int) – The dilation of the Res2Net block.

  • activation (torch class) – A class for constructing the activation layers.

  • groups (int) – Number of blocked connections from input channels to output channels.

Example

>>> x = torch.rand(8, 120, 64).transpose(1, 2)
>>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
>>> out = conv(x).transpose(1, 2)
>>> out.shape
torch.Size([8, 120, 64])
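
A conceptual sketch of the residual structure (hypothetical and simplified; a 1x1 convolution stands in for the full TDNN-Res2Net-TDNN-SE chain): the chain output is added back to the input, which is projected with a 1x1 convolution whenever in_channels differs from out_channels.

>>> x = torch.rand([8, 64, 120])
>>> body = torch.nn.Conv1d(64, 128, kernel_size=1)      # stand-in chain
>>> shortcut = torch.nn.Conv1d(64, 128, kernel_size=1)  # 1x1 projection
>>> (body(x) + shortcut(x)).shape
torch.Size([8, 128, 120])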
forward(x, lengths=None)[source]

Processes the input tensor x and returns an output tensor.

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.ECAPA_TDNN(input_size, device='cpu', lin_neurons=192, activation=<class 'torch.nn.modules.activation.ReLU'>, channels=[512, 512, 512, 512, 1536], kernel_sizes=[5, 3, 3, 3, 1], dilations=[1, 2, 3, 4, 1], attention_channels=128, res2net_scale=8, se_channels=128, global_context=True, groups=[1, 1, 1, 1, 1])[source]

Bases: Module

An implementation of the speaker embedding model described in “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification” (https://arxiv.org/abs/2005.07143).

Parameters
  • input_size (int) – Expected size of the input feature dimension.

  • device (str) – Device used, e.g., “cpu” or “cuda”.

  • lin_neurons (int) – Number of neurons in the linear layers.

  • activation (torch class) – A class for constructing the activation layers.

  • channels (list of ints) – Output channels for each TDNN/SERes2Net layer.

  • kernel_sizes (list of ints) – List of kernel sizes for each layer.

  • dilations (list of ints) – List of dilations for the kernels in each layer.

  • attention_channels (int) – The number of attention channels in the attentive statistics pooling.

  • res2net_scale (int) – The scale of the Res2Net blocks.

  • se_channels (int) – The number of output channels after squeeze in the SE blocks.

  • global_context (bool) – Whether to use global context in the attentive statistics pooling.

  • groups (list of ints) – List of groups for the kernels in each layer.

Example

>>> input_feats = torch.rand([5, 120, 80])
>>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
>>> outputs = compute_embedding(input_feats)
>>> outputs.shape
torch.Size([5, 1, 192])
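
A sketch of handling padded batches (hypothetical length values; lengths are relative in (0, 1], and padded frames are excluded from the attentive pooling):

>>> lengths = torch.tensor([1.0, 0.8, 0.9, 1.0, 0.7])
>>> outputs = compute_embedding(input_feats, lengths)
>>> outputs.shape
torch.Size([5, 1, 192])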
forward(x, lengths=None)[source]

Returns the embedding vector.

Parameters

x (torch.Tensor) – Tensor of shape (batch, time, channel).

training: bool
class speechbrain.lobes.models.ECAPA_TDNN.Classifier(input_size, device='cpu', lin_blocks=0, lin_neurons=192, out_neurons=1211)[source]

Bases: Module

This class implements cosine similarity on top of speaker embedding features.

Parameters
  • input_size (int) – Expected size of the input dimension.

  • device (str) – Device used, e.g., “cpu” or “cuda”.

  • lin_blocks (int) – Number of linear layers.

  • lin_neurons (int) – Number of neurons in linear layers.

  • out_neurons (int) – Number of classes.

Example

>>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
>>> outputs = torch.tensor([ [1., -1.], [-9., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> outputs = outputs.unsqueeze(1)
>>> cos = classify(outputs)
>>> (cos < -1.0).long().sum()
tensor(0)
>>> (cos > 1.0).long().sum()
tensor(0)
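
A hypothetical prediction step on top of the cosine scores, picking the highest-scoring class per utterance:

>>> predictions = cos.squeeze(1).argmax(dim=-1)
>>> predictions.shape
torch.Size([4])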
training: bool
forward(x)[source]

Returns the output scores (cosine similarities) over speakers.

Parameters

x (torch.Tensor) – Torch tensor.