speechbrain.inference.speaker module

Specifies the inference interfaces for speaker recognition modules.

Authors:

Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023
Adel Moumen 2023
Pradnya Kandarkar 2023

Summary

Classes:

SpeakerRecognition

A ready-to-use model for speaker recognition.

Reference

class speechbrain.inference.speaker.SpeakerRecognition(*args, **kwargs)[source]

Bases: EncoderClassifier

A ready-to-use model for speaker recognition. It can be used to perform speaker verification with verify_batch().

Example

```python
>>> import torchaudio
>>> from speechbrain.inference.speaker import SpeakerRecognition
>>> # Model is downloaded from the speechbrain HuggingFace repo
>>> # getfixture is a pytest doctest fixture; outside pytest, pass any writable directory as savedir
>>> tmpdir = getfixture("tmpdir")
>>> verification = SpeakerRecognition.from_hparams(
...     source="speechbrain/spkrec-ecapa-voxceleb",
...     savedir=tmpdir,
... )

>>> # Perform verification
>>> signal, fs = torchaudio.load("tests/samples/single-mic/example1.wav")
>>> signal2, fs = torchaudio.load("tests/samples/single-mic/example2.flac")
>>> score, prediction = verification.verify_batch(signal, signal2)
```

MODULES_NEEDED = ['compute_features', 'mean_var_norm', 'embedding_model', 'mean_var_norm_emb']

verify_batch(wavs1, wavs2, wav1_lens=None, wav2_lens=None, threshold=0.25)[source]

Performs speaker verification with cosine similarity.

Returns the score and the decision (0: different speakers, 1: same speaker).

Parameters:

wavs1 (torch.Tensor) – Tensor containing the first speech waveform (batch, time). Make sure the sample rate is 16000 Hz.
wavs2 (torch.Tensor) – Tensor containing the second speech waveform (batch, time). Make sure the sample rate is 16000 Hz.
wav1_lens (torch.Tensor) – Tensor containing the relative length of each waveform in the wavs1 batch (e.g., [0.8, 0.6, 1.0]).
wav2_lens (torch.Tensor) – Tensor containing the relative length of each waveform in the wavs2 batch (e.g., [0.8, 0.6, 1.0]).
threshold (float) – Threshold applied to the cosine similarity score to decide whether the speakers are different (0) or the same (1).
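
The relative lengths can be derived from the sample counts of the unpadded signals. The following is a minimal sketch, assuming the batch is zero-padded to its longest signal; `relative_lengths` is a hypothetical helper and not part of SpeechBrain.

```python
def relative_lengths(num_samples):
    """Return each signal's length as a fraction of the longest signal."""
    longest = max(num_samples)
    return [n / longest for n in num_samples]


# Three signals of 12800, 9600, and 16000 samples, padded to 16000 samples.
lens = relative_lengths([12800, 9600, 16000])
print(lens)  # [0.8, 0.6, 1.0]
```

Wrapping the result in `torch.tensor(lens)` gives a tensor in the form expected for wav1_lens or wav2_lens.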

Returns:

score – The cosine similarity score on which the binary verification decision is based.
prediction – 1 if the two input signals are from the same speaker, and 0 otherwise.
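
How the score relates to the decision can be illustrated without the model. The sketch below assumes that a higher cosine score means more similar voices and that the decision is a strict greater-than comparison against the threshold. The two-dimensional vectors are toy stand-ins for speaker embeddings, and `cosine_score` and `decide` are hypothetical names.

```python
import math


def cosine_score(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def decide(emb1, emb2, threshold=0.25):
    """Return (score, prediction); prediction is 1 when score > threshold (assumed rule)."""
    score = cosine_score(emb1, emb2)
    return score, int(score > threshold)


same = decide([1.0, 0.0], [1.0, 0.0])       # vectors pointing the same way
different = decide([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
print(same)       # (1.0, 1)
print(different)  # (0.0, 0)
```

Raising the threshold makes the "same speaker" decision stricter; lowering it makes it more permissive.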

verify_files(path_x, path_y, **kwargs)[source]

Performs speaker verification with cosine similarity on two audio files.

Returns the score and the decision (0: different speakers, 1: same speaker).

Returns:

score – The cosine similarity score on which the binary verification decision is based.
prediction – 1 if the two input signals are from the same speaker, and 0 otherwise.
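
A file-based comparison might be wrapped as follows. This is a sketch, not a reference usage: `compare_speakers` is a hypothetical helper, the default `savedir` is an arbitrary choice, and the conversion with `float()` and `bool()` assumes that both returned values are single-element tensors.

```python
def compare_speakers(path_x, path_y, savedir="pretrained_models/spkrec-ecapa-voxceleb"):
    """Return (score, same_speaker) for two audio files."""
    # Imported lazily so that defining this helper does not require SpeechBrain.
    from speechbrain.inference.speaker import SpeakerRecognition

    # Downloads the pretrained model on first use and caches it in savedir.
    verification = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir=savedir,
    )
    score, prediction = verification.verify_files(path_x, path_y)
    # Assumption: both values are single-element tensors.
    return float(score), bool(prediction)
```

For example, `compare_speakers("tests/samples/single-mic/example1.wav", "tests/samples/single-mic/example2.flac")` compares the two sample files used in the class example.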

training: bool