speechbrain.inference.diarization module

Specifies the inference interfaces for diarization modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

Speech_Emotion_Diarization

A ready-to-use SED interface (audio -> emotions and their durations)

Reference

class speechbrain.inference.diarization.Speech_Emotion_Diarization(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use SED interface (audio -> emotions and their durations)

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> from speechbrain.inference.diarization import Speech_Emotion_Diarization
>>> tmpdir = getfixture("tmpdir")
>>> sed_model = Speech_Emotion_Diarization.from_hparams(
...     source="speechbrain/emotion-diarization-wavlm-large", savedir=tmpdir
... )
>>> sed_model.diarize_file("speechbrain/emotion-diarization-wavlm-large/example.wav")

MODULES_NEEDED = ['input_norm', 'wav2vec', 'output_mlp']
diarize_file(path)[source]

Get emotion diarization of a spoken utterance.

Parameters:

path (str) – Path to the audio file to diarize.

Returns:

list of dict – The detected emotions and their temporal boundaries.

Return type:

List[Dict]
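
A hedged sketch of consuming the result, reusing sed_model from the class example above (the per-segment keys start, end, and emotion are assumptions based on the return description, not guaranteed by this page):

>>> result = sed_model.diarize_file(
...     "speechbrain/emotion-diarization-wavlm-large/example.wav"
... )
>>> for segment in result:  # assumed: one dict per emotion segment
...     print(segment["start"], segment["end"], segment["emotion"])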

encode_batch(wavs, wav_lens)[source]

Encodes audio into fine-grained emotional embeddings.

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch; a tensor of shape [batch]. The longest waveform has relative length 1.0, the others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.tensor
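
For illustration, a minimal sketch of batching two signals of unequal length, reusing sed_model from the class example (the 16 kHz sample rate and the 2-D [batch, time] layout are assumptions; the shape note above also allows a trailing channels dimension):

>>> import torch
>>> wavs = torch.zeros(2, 16000)         # zero-padded batch (assumed 16 kHz)
>>> wavs[0] = torch.randn(16000)         # full-length signal
>>> wavs[1, :8000] = torch.randn(8000)   # half-length signal, rest is padding
>>> wav_lens = torch.tensor([1.0, 0.5])  # len(waveform) / max_length
>>> embeddings = sed_model.encode_batch(wavs, wav_lens)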

diarize_batch(wavs, wav_lens, batch_id)[source]

Get emotion diarization of a batch of waveforms.

The waveforms should already be in the model’s expected format. In most cases you can obtain a correctly converted signal by calling: normalized = EncoderDecoderASR.normalizer(signal, sample_rate).

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch; a tensor of shape [batch]. The longest waveform has relative length 1.0, the others len(waveform) / max_length. Used for ignoring padding.

  • batch_id (list) – ID of each item in the batch (e.g., file names).

Returns:

list of dict – The detected emotions and their temporal boundaries.

Return type:

List[Dict]
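
A hedged sketch of a direct batch call, following the same padding convention (the ids are hypothetical placeholders):

>>> import torch
>>> wavs = torch.randn(2, 16000)          # already-normalized signals
>>> wav_lens = torch.tensor([1.0, 0.75])  # relative lengths, as above
>>> batch_id = ["utt1.wav", "utt2.wav"]   # hypothetical item ids
>>> result = sed_model.diarize_batch(wavs, wav_lens, batch_id)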

preds_to_diarization(prediction, batch_id)[source]

Convert frame-wise predictions into a dictionary of diarization results.

Returns:

A dictionary with the start/end of each emotion

Return type:

dictionary
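
To make the conversion concrete, here is a minimal standalone illustration of the idea (not this class’s implementation): frame-wise labels at a fixed frame shift become merged (start, end, emotion) runs.

>>> preds = ["a", "a", "n", "n", "a"]  # hypothetical frame-wise labels
>>> frame_shift = 0.02                 # seconds per frame (assumed value)
>>> segments = []
>>> for i, label in enumerate(preds):
...     start = round(i * frame_shift, 2)
...     end = round((i + 1) * frame_shift, 2)
...     if segments and segments[-1][2] == label:
...         segments[-1][1] = end      # extend the current run
...     else:
...         segments.append([start, end, label])
>>> segments
[[0.0, 0.04, 'a'], [0.04, 0.08, 'n'], [0.08, 0.1, 'a']]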

forward(wavs, wav_lens, batch_id)[source]

Get emotion diarization for a batch of waveforms.

is_overlapped(end1, start2)[source]

Returns True if segments are overlapping.

Parameters:
  • end1 (float) – End time of the first segment.

  • start2 (float) – Start time of the second segment.

Returns:

overlapped – True if the segments overlap, else False.

Return type:

bool

Example

>>> from speechbrain.processing import diarization as diar
>>> diar.is_overlapped(5.5, 3.4)
True
>>> diar.is_overlapped(5.5, 6.4)
False

merge_ssegs_same_emotion_adjacent(lol)[source]

Merge adjacent sub-segments if they have the same emotion.

Parameters:

lol (list of list) – Each inner list contains [utt_id, sseg_start, sseg_end, emo_label].

Returns:

new_lol – The segment list with adjacent segments of the same emotion merged.

Return type:

list of list

Example

>>> from speechbrain.utils.EDER import merge_ssegs_same_emotion_adjacent
>>> lol=[['u1', 0.0, 7.0, 'a'],
... ['u1', 7.0, 9.0, 'a'],
... ['u1', 9.0, 11.0, 'n'],
... ['u1', 11.0, 13.0, 'n'],
... ['u1', 13.0, 15.0, 'n'],
... ['u1', 15.0, 16.0, 'a']]
>>> merge_ssegs_same_emotion_adjacent(lol)
[['u1', 0.0, 9.0, 'a'], ['u1', 9.0, 15.0, 'n'], ['u1', 15.0, 16.0, 'a']]

training: bool