speechbrain.inference.speaker module

Specifies the inference interfaces for speaker recognition modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

SpeakerRecognition

A ready-to-use model for speaker recognition.

Reference

class speechbrain.inference.speaker.SpeakerRecognition(*args, **kwargs)[source]

Bases: EncoderClassifier

A ready-to-use model for speaker recognition. It can be used to perform speaker verification with verify_batch().

Example

```python
>>> import torchaudio
>>> from speechbrain.inference.speaker import SpeakerRecognition
>>> # Model is downloaded from the speechbrain HuggingFace repo
>>> tmpdir = getfixture("tmpdir")
>>> verification = SpeakerRecognition.from_hparams(
...     source="speechbrain/spkrec-ecapa-voxceleb",
...     savedir=tmpdir,
... )
>>> # Perform verification
>>> signal, fs = torchaudio.load("tests/samples/single-mic/example1.wav")
>>> signal2, fs = torchaudio.load("tests/samples/single-mic/example2.flac")
>>> score, prediction = verification.verify_batch(signal, signal2)
```

MODULES_NEEDED = ['compute_features', 'mean_var_norm', 'embedding_model', 'mean_var_norm_emb']

verify_batch(wavs1, wavs2, wav1_lens=None, wav2_lens=None, threshold=0.25)[source]

Performs speaker verification using the cosine similarity between the speaker embeddings of the two inputs.

Returns the score and the decision (0: different speakers, 1: same speaker).

Parameters:
  • wavs1 (torch.Tensor) – Tensor containing the first speech waveform (batch, time). The sample rate must be fs=16000 Hz.

  • wavs2 (torch.Tensor) – Tensor containing the second speech waveform (batch, time). The sample rate must be fs=16000 Hz.

  • wav1_lens (torch.Tensor) – Tensor containing the relative length of each waveform in the wavs1 batch (e.g., [0.8 0.6 1.0]).

  • wav2_lens (torch.Tensor) – Tensor containing the relative length of each waveform in the wavs2 batch (e.g., [0.8 0.6 1.0]).

  • threshold (float) – Threshold applied to the cosine similarity score to decide whether the speakers are different (0) or the same (1).
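
The relative lengths expected by wav1_lens and wav2_lens can be derived from the unpadded sample counts. The following is a minimal sketch in plain Python, assuming that shorter waveforms are right-padded with zeros and that each relative length is the original length divided by the longest length in the batch. The helper name `pad_and_relative_lengths` is hypothetical and not part of SpeechBrain; in practice the two results would be converted to tensors before being passed to verify_batch().

```python
def pad_and_relative_lengths(waveforms, pad_value=0.0):
    """Right-pad 1-D waveforms to a common length and return relative lengths.

    Hypothetical helper for illustration only (not a SpeechBrain function).
    """
    if not waveforms:
        raise ValueError("need at least one waveform")
    max_len = max(len(w) for w in waveforms)
    if max_len == 0:
        raise ValueError("waveforms must not all be empty")
    # Pad every waveform on the right so all rows have max_len samples.
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    # Relative length = original length / longest length in the batch.
    rel_lens = [len(w) / max_len for w in waveforms]
    return padded, rel_lens


# Three toy "waveforms" with 8, 6, and 10 samples.
batch, lens = pad_and_relative_lengths([[0.1] * 8, [0.2] * 6, [0.3] * 10])
print(lens)  # [0.8, 0.6, 1.0]
```

The printed values match the [0.8 0.6 1.0] example given for wav1_lens and wav2_lens.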

Returns:

  • score – The cosine similarity score on which the binary verification decision is based.

  • prediction – 1 if the two input signals are from the same speaker, 0 otherwise.
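
How the score and the threshold relate can be illustrated with a small sketch in plain Python. It assumes that the score is a cosine similarity between two embedding vectors (higher means more alike) and that the decision is `score > threshold`; both are assumptions made for illustration, and the function names `cosine_score` and `decide` are hypothetical rather than part of the library.

```python
import math


def cosine_score(emb1, emb2, eps=1e-6):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(emb1, emb2))
    norm1 = math.sqrt(sum(a * a for a in emb1))
    norm2 = math.sqrt(sum(b * b for b in emb2))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm1 * norm2, eps)


def decide(score, threshold=0.25):
    """Return 1 (same speaker) if the score exceeds the threshold, else 0."""
    return 1 if score > threshold else 0


same = cosine_score([1.0, 0.0], [1.0, 0.0])       # identical direction
different = cosine_score([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
print(same, decide(same))            # 1.0 1
print(different, decide(different))  # 0.0 0
```

Raising the threshold makes the decision stricter: fewer pairs are accepted as the same speaker.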

verify_files(path_x, path_y, **kwargs)[source]

Performs speaker verification on two audio files using cosine similarity.

Returns the score and the decision (0: different speakers, 1: same speaker).

Returns:

  • score – The cosine similarity score on which the binary verification decision is based.

  • prediction – 1 if the two input signals are from the same speaker, 0 otherwise.

training: bool