speechbrain.processing.diarization module

This script contains basic functions used for speaker diarization. It has an optional dependency on the open-source scikit-learn (sklearn) library. A few scikit-learn functions are modified in this script as required.


This code is written using the following:

Authors:

  • Nauman Dawalatabad 2020




Classes:

  • Spec_Clust_unorm – This class implements the spectral clustering with unnormalized affinity matrix.

  • Spec_Cluster – Performs spectral clustering using sklearn on embeddings.

Functions:

  • distribute_overlap – Distributes the overlapped speech equally among the adjacent segments with different speakers.

  • do_AHC – Performs Agglomerative Hierarchical Clustering on embeddings.

  • do_kmeans_clustering – Performs kmeans clustering on embeddings.

  • do_spec_clustering – Performs spectral clustering on embeddings.

  • get_oracle_num_spkrs – Returns actual number of speakers in a recording from the ground-truth.

  • is_overlapped – Returns True if segments are overlapping.

  • merge_ssegs_same_speaker – Merge adjacent sub-segs from the same speaker.

  • prepare_subset_csv – Prepares csv for a given recording ID.

  • read_rttm – Reads and returns RTTM in list format.

  • spectral_clustering_sb – Performs spectral clustering.

  • spectral_embedding_sb – Returns spectral embeddings.

  • write_ders_file – Write the final DERs for individual recording.

  • write_rttm – Writes the segment list in RTTM format (A standard NIST format).



speechbrain.processing.diarization.read_rttm(rttm_file_path)

Reads and returns RTTM in list format.


rttm_file_path (str) – Path to the RTTM file to be read.

Returns:

rttm – List containing rows of RTTM file.

Return type:

list
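As a rough illustration of reading an RTTM file into a list of rows, the following minimal sketch uses only the standard library. The helper name read_rttm_rows is hypothetical and is not part of speechbrain; it assumes that each non-empty line of the file becomes one list entry.

```python
import os
import tempfile


def read_rttm_rows(rttm_file_path):
    """Hypothetical helper: return the non-empty lines of an RTTM file."""
    with open(rttm_file_path, "r", encoding="utf-8") as f:
        # Strip the trailing newline and skip blank lines.
        return [line.rstrip("\n") for line in f if line.strip()]


# Write a tiny RTTM file to a temporary location and read it back.
example_lines = [
    "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>",
    "SPEAKER ES2011a 0 5.5 1.5 <NA> <NA> ES2011a.A <NA> <NA>",
]
with tempfile.NamedTemporaryFile("w", suffix=".rttm", delete=False) as tmp:
    tmp.write("\n".join(example_lines) + "\n")
    tmp_path = tmp.name

rows = read_rttm_rows(tmp_path)
os.remove(tmp_path)
```

Each returned row can then be split on whitespace to access individual RTTM fields.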


speechbrain.processing.diarization.write_ders_file(ref_rttm, DER, out_der_file)[source]

Writes the final DERs for individual recordings.

  • ref_rttm (str) – Reference RTTM file.

  • DER (array) – Array containing DER values of each recording.

  • out_der_file (str) – File to write the DERs.

speechbrain.processing.diarization.prepare_subset_csv(full_diary_csv, rec_id, out_csv_file)[source]

Prepares a csv for a given recording ID.

  • full_diary_csv (csv) – Full csv containing all the recordings.

  • rec_id (str) – The recording ID for which the csv has to be prepared.

  • out_csv_file (str) – Path of the output csv file.
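The idea of extracting one recording from a full csv can be sketched as follows. This is only an illustration under stated assumptions: the helper subset_rows is hypothetical, and it assumes the csv has an "ID" column whose values start with the recording ID followed by an underscore. The actual column layout used by speechbrain is not shown on this page.

```python
import csv
import io


def subset_rows(csv_text, rec_id, id_column="ID"):
    """Hypothetical helper: keep rows whose ID column starts with rec_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Assumption: segment IDs look like "<rec_id>_<index>".
        if row[id_column].startswith(rec_id + "_"):
            writer.writerow(row)
    return out.getvalue()


full_csv = (
    "ID,start,stop\n"
    "ES2011a_0,0.0,3.0\n"
    "ES2011a_1,1.5,4.5\n"
    "ES2011b_0,0.0,3.0\n"
)
subset_csv = subset_rows(full_csv, "ES2011a")
```

Working on in-memory text keeps the sketch self-contained; writing subset_csv to out_csv_file would be the remaining step.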

speechbrain.processing.diarization.is_overlapped(end1, start2)[source]

Returns True if segments are overlapping.

  • end1 (float) – End time of the first segment.

  • start2 (float) – Start time of the second segment.


Returns:

overlapped – True if the segments overlap, else False.

Return type:

bool


>>> from speechbrain.processing import diarization as diar
>>> diar.is_overlapped(5.5, 3.4)
True
>>> diar.is_overlapped(5.5, 6.4)
False

speechbrain.processing.diarization.merge_ssegs_same_speaker(lol)

Merges adjacent sub-segments from the same speaker.


lol (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].

Returns:

new_lol – Adjacent segments from the same speaker ID, merged.

Return type:

list of list


>>> from speechbrain.processing import diarization as diar
>>> lol=[['r1', 5.5, 7.0, 's1'],
... ['r1', 6.5, 9.0, 's1'],
... ['r1', 8.0, 11.0, 's1'],
... ['r1', 11.5, 13.0, 's2'],
... ['r1', 14.0, 15.0, 's2'],
... ['r1', 14.5, 15.0, 's1']]
>>> diar.merge_ssegs_same_speaker(lol)
[['r1', 5.5, 11.0, 's1'], ['r1', 11.5, 13.0, 's2'], ['r1', 14.0, 15.0, 's2'], ['r1', 14.5, 15.0, 's1']]
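The merging behaviour shown in the example can be approximated with a short pure-Python sketch. The helper merge_same_speaker is hypothetical and is not the library implementation; it assumes the segments are sorted by start time and treats two neighbouring segments as overlapping when the second starts no later than the first ends.

```python
def merge_same_speaker(segments):
    """Hypothetical helper: merge overlapping neighbours with the same speaker."""
    merged = []
    for rec_id, start, end, spkr in segments:
        last = merged[-1] if merged else None
        if last is not None and last[3] == spkr and start <= last[2]:
            # Same speaker and overlapping: extend the previous segment.
            last[2] = max(last[2], end)
        else:
            merged.append([rec_id, start, end, spkr])
    return merged


segments = [
    ["r1", 5.5, 7.0, "s1"],
    ["r1", 6.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s1"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 14.0, 15.0, "s2"],
    ["r1", 14.5, 15.0, "s1"],
]
merged = merge_same_speaker(segments)
```

Note that the two s2 segments stay separate because they do not overlap in time.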

speechbrain.processing.diarization.distribute_overlap(lol)

Distributes the overlapped speech equally among the adjacent segments with different speakers.


lol (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].

Returns:

new_lol – The segments with each overlapped part divided equally between the adjacent segments with different speaker IDs.

Return type:

list of list


>>> from speechbrain.processing import diarization as diar
>>> lol = [['r1', 5.5, 9.0, 's1'],
... ['r1', 8.0, 11.0, 's2'],
... ['r1', 11.5, 13.0, 's2'],
... ['r1', 12.0, 15.0, 's1']]
>>> diar.distribute_overlap(lol)
[['r1', 5.5, 8.5, 's1'], ['r1', 8.5, 11.0, 's2'], ['r1', 11.5, 12.5, 's2'], ['r1', 12.5, 15.0, 's1']]
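A minimal sketch of the equal split, assuming time-sorted segments: when two neighbouring segments with different speakers overlap, both boundaries are moved to the midpoint of the overlap. The helper split_overlap is hypothetical and only illustrates the idea.

```python
def split_overlap(segments):
    """Hypothetical helper: split each overlap equally between neighbours."""
    out = [list(seg) for seg in segments]  # copy so the input is untouched
    for prev, cur in zip(out, out[1:]):
        different_speakers = prev[3] != cur[3]
        overlapping = cur[1] <= prev[2]
        if different_speakers and overlapping:
            # Give half of the overlapped region to each segment.
            midpoint = (cur[1] + prev[2]) / 2
            prev[2] = midpoint
            cur[1] = midpoint
    return out


segments = [
    ["r1", 5.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s2"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 12.0, 15.0, "s1"],
]
result = split_overlap(segments)
```

Segments from the same speaker are left alone here, since merging them is a separate step.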
speechbrain.processing.diarization.write_rttm(segs_list, out_rttm_file)[source]

Writes the segment list in RTTM format (a standard NIST format).

  • segs_list (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].

  • out_rttm_file (str) – Path of the output RTTM file.
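For orientation, an RTTM SPEAKER line has ten whitespace-separated fields: type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, and signal lookahead time. The sketch below formats one segment as such a line. The helper to_rttm_line, the channel value "0", and the rounding to four decimals are assumptions for illustration, not necessarily what write_rttm does.

```python
def to_rttm_line(rec_id, start, end, spkr_id, channel="0"):
    """Hypothetical helper: format one segment as an RTTM SPEAKER line."""
    onset = round(start, 4)
    duration = round(end - start, 4)  # RTTM stores duration, not end time
    fields = [
        "SPEAKER", rec_id, channel, str(onset), str(duration),
        "<NA>", "<NA>", spkr_id, "<NA>", "<NA>",
    ]
    return " ".join(fields)


segs_list = [["r1", 5.5, 8.5, "s1"], ["r1", 8.5, 11.0, "s2"]]
rttm_lines = [to_rttm_line(*seg) for seg in segs_list]
```

The resulting lines can be written to out_rttm_file one per line.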

speechbrain.processing.diarization.get_oracle_num_spkrs(rec_id, spkr_info)[source]

Returns the actual number of speakers in a recording from the ground truth. This can be used when the oracle number of speakers is assumed to be known.

  • rec_id (str) – Recording ID for which the number of speakers has to be obtained.

  • spkr_info (list) – Header lines of the RTTM file, each starting with SPKR-INFO.


>>> from speechbrain.processing import diarization as diar
>>> spkr_info = ['SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>',
... 'SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.B <NA> <NA>',
... 'SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.C <NA> <NA>',
... 'SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.D <NA> <NA>',
... 'SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.A <NA> <NA>',
... 'SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.B <NA> <NA>',
... 'SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.C <NA> <NA>']
>>> diar.get_oracle_num_spkrs('ES2011a', spkr_info)
4
>>> diar.get_oracle_num_spkrs('ES2011b', spkr_info)
3
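Counting the oracle speakers amounts to counting the SPKR-INFO header lines that belong to a recording. The sketch below shows one way to do this; count_speakers is a hypothetical helper, and matching on the second whitespace-separated field is an assumption about how the recording ID is identified.

```python
def count_speakers(rec_id, spkr_info):
    """Hypothetical helper: count SPKR-INFO lines belonging to rec_id."""
    count = 0
    for line in spkr_info:
        fields = line.split()
        # fields[0] is the record type, fields[1] is the recording ID.
        if len(fields) > 1 and fields[0] == "SPKR-INFO" and fields[1] == rec_id:
            count += 1
    return count


template = "SPKR-INFO {rec} 0 <NA> <NA> <NA> unknown {rec}.{spk} <NA> <NA>"
spkr_info = [template.format(rec="ES2011a", spk=s) for s in "ABCD"]
spkr_info += [template.format(rec="ES2011b", spk=s) for s in "ABC"]

num_a = count_speakers("ES2011a", spkr_info)
num_b = count_speakers("ES2011b", spkr_info)
```

Comparing the whole field rather than searching for a substring avoids counting recordings whose IDs merely share a prefix.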
speechbrain.processing.diarization.spectral_embedding_sb(adjacency, n_components=8, norm_laplacian=True, drop_first=True)[source]

Returns spectral embeddings.

  • adjacency (array-like or sparse graph) – Shape (n_samples, n_samples). The adjacency matrix of the graph to embed.

  • n_components (int) – The dimension of the projection subspace.

  • norm_laplacian (bool) – If True, then compute the normalized Laplacian.

  • drop_first (bool) – Whether to drop the first eigenvector.


Returns:

embedding – Spectral embeddings for each sample.

Return type:

array



>>> import numpy as np
>>> from speechbrain.processing import diarization as diar
>>> affinity = np.array([[1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
... [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
... [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
... [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
>>> embs = diar.spectral_embedding_sb(affinity, 3)
>>> # Notice similar embeddings
>>> print(np.around(embs , decimals=3))
[[ 0.075  0.244  0.285]
 [ 0.083  0.356 -0.203]
 [ 0.083  0.356 -0.203]
 [ 0.26  -0.149  0.154]
 [ 0.29  -0.218 -0.11 ]
 [ 0.29  -0.218 -0.11 ]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.167 -0.044  0.316]]

speechbrain.processing.diarization.spectral_clustering_sb(affinity, n_clusters=8, n_components=None, random_state=None, n_init=10)[source]

Performs spectral clustering.

  • affinity (matrix) – Affinity matrix.

  • n_clusters (int) – Number of clusters for kmeans.

  • n_components (int) – Number of components to retain while estimating spectral embeddings.

  • random_state (int) – Seed for the pseudo-random number generator used by kmeans.

  • n_init (int) – Number of times the k-means algorithm will be run with different centroid seeds.


Returns:

labels – Cluster label for each sample.

Return type:

array



>>> import numpy as np
>>> from speechbrain.processing import diarization as diar
>>> affinity = np.array([[1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
... [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
... [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
... [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
... [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
>>> labs = diar.spectral_clustering_sb(affinity, 3)
>>> # print (labs) # [2 2 2 1 1 1 0 0 0 0]

class speechbrain.processing.diarization.Spec_Cluster(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol='auto', assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)[source]

Bases: SpectralClustering

Performs spectral clustering using sklearn on embeddings.

perform_sc(X, n_neighbors=10)[source]

Performs spectral clustering using sklearn on embeddings.

  • X (array (n_samples, n_features)) – Embeddings to be clustered.

  • n_neighbors (int) – Number of neighbors in estimating affinity matrix.

Reference:

https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/cluster/_spectral.py

class speechbrain.processing.diarization.Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)[source]

Bases: object

This class implements spectral clustering with an unnormalized affinity matrix. It is useful when the affinity matrix is based on cosine similarities.


Reference:

Von Luxburg, U. A tutorial on spectral clustering. Stat Comput 17, 395–416 (2007). https://doi.org/10.1007/s11222-007-9033-z


>>> import numpy as np
>>> from speechbrain.processing import diarization as diar
>>> clust = diar.Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)
>>> emb = [[ 2.1, 3.1, 4.1, 4.2, 3.1],
... [ 2.2, 3.1, 4.2, 4.2, 3.2],
... [ 2.0, 3.0, 4.0, 4.1, 3.0],
... [ 8.0, 7.0, 7.0, 8.1, 9.0],
... [ 8.1, 7.1, 7.2, 8.1, 9.2],
... [ 8.3, 7.4, 7.0, 8.4, 9.0],
... [ 0.3, 0.4, 0.4, 0.5, 0.8],
... [ 0.4, 0.3, 0.6, 0.7, 0.8],
... [ 0.2, 0.3, 0.2, 0.3, 0.7],
... [ 0.3, 0.4, 0.4, 0.4, 0.7],]
>>> # Estimating similarity matrix
>>> sim_mat = clust.get_sim_mat(emb)
>>> print (np.around(sim_mat[5:,5:], decimals=3))
[[1.    0.957 0.961 0.904 0.966]
 [0.957 1.    0.977 0.982 0.997]
 [0.961 0.977 1.    0.928 0.972]
 [0.904 0.982 0.928 1.    0.976]
 [0.966 0.997 0.972 0.976 1.   ]]
>>> # Pruning
>>> pruned_sim_mat = clust.p_pruning(sim_mat, 0.3)
>>> print (np.around(pruned_sim_mat[5:,5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.    0.982 0.997]
 [0.    0.977 1.    0.    0.972]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.    0.976 1.   ]]
>>> # Symmetrization
>>> sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
>>> print (np.around(sym_pruned_sim_mat[5:,5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.489 0.982 0.997]
 [0.    0.489 1.    0.    0.486]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.486 0.976 1.   ]]
>>> # Laplacian
>>> laplacian = clust.get_laplacian(sym_pruned_sim_mat)
>>> print (np.around(laplacian[5:,5:], decimals=3))
[[ 1.999  0.     0.     0.     0.   ]
 [ 0.     2.468 -0.489 -0.982 -0.997]
 [ 0.    -0.489  0.975  0.    -0.486]
 [ 0.    -0.982  0.     1.958 -0.976]
 [ 0.    -0.997 -0.486 -0.976  2.458]]
>>> # Spectral Embeddings
>>> spec_emb, num_of_spk = clust.get_spec_embs(laplacian, 3)
>>> print(num_of_spk)
3
>>> # Clustering
>>> clust.cluster_embs(spec_emb, num_of_spk)
>>> # print (clust.labels_) # [0 0 0 2 2 2 1 1 1 1]
>>> # Complete spectral clustering
>>> clust.do_spec_clust(emb, k_oracle=3, p_val=0.3)
>>> # print(clust.labels_) # [0 0 0 2 2 2 1 1 1 1]

do_spec_clust(X, k_oracle, p_val)[source]

Function for spectral clustering.

  • X (array) – (n_samples, n_features). Embeddings extracted from the model.

  • k_oracle (int) – Number of speakers (when the oracle number of speakers is used).

  • p_val (float) – Fraction of values to retain in each row when pruning the affinity matrix.


get_sim_mat(X)

Returns the similarity matrix based on cosine similarities.


X (array) – (n_samples, n_features). Embeddings extracted from the model.

Returns:

M – (n_samples, n_samples). Similarity matrix with cosine similarities between each pair of embeddings.

Return type:

array
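A pairwise cosine-similarity matrix can be sketched in plain Python as follows. The helper cosine_similarity_matrix is hypothetical and only illustrates the computation; it assumes no embedding is the zero vector, since that would cause a division by zero.

```python
import math


def cosine_similarity_matrix(X):
    """Hypothetical helper: pairwise cosine similarities between rows of X."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    n = len(X)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            # Cosine similarity: dot product divided by the product of norms.
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim_mat = cosine_similarity_matrix(emb)
```

The result is symmetric with ones (up to rounding) on the diagonal, matching the shape of the matrix printed in the class example.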


p_pruning(A, pval)[source]

Refines the affinity matrix by zeroing less similar values.

  • A (array) – (n_samples, n_samples). Affinity matrix.

  • pval (float) – Fraction of values to be retained in each row of the affinity matrix.


Returns:

A – (n_samples, n_samples). Pruned affinity matrix based on pval.

Return type:

array
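Row-wise pruning can be illustrated with a small sketch: in each row, the smallest entries are set to zero so that roughly a pval fraction of the entries survives. The helper prune_rows is hypothetical, and computing the number of zeroed entries as int((1 - pval) * n) is an assumption chosen to match the class example, where pval=0.3 on 10 samples keeps 3 values per row.

```python
def prune_rows(A, pval):
    """Hypothetical helper: zero the smallest entries in each row of A."""
    n = len(A)
    n_zero = int((1 - pval) * n)  # number of entries to zero per row
    pruned = [list(row) for row in A]  # copy so the input is untouched
    for row in pruned:
        # Indices of the n_zero smallest values in this row.
        smallest = sorted(range(n), key=lambda j: row[j])[:n_zero]
        for j in smallest:
            row[j] = 0.0
    return pruned


A = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
pruned = prune_rows(A, 0.5)
```

Because each row is pruned independently, the result is in general not symmetric, which is why the class example symmetrizes the pruned matrix before computing the Laplacian.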



get_laplacian(M)

Returns the unnormalized Laplacian for the given affinity matrix.


M (array) – (n_samples, n_samples) Affinity matrix.

Returns:

L – (n_samples, n_samples) Laplacian matrix.

Return type:

array
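The unnormalized Laplacian is L = D - M, where D is the diagonal degree matrix. The printed Laplacian in the class example suggests that self-similarities are excluded from the degree, since each diagonal entry equals the sum of the off-diagonal similarities in its row (for instance 0.489 + 0.486 = 0.975). The sketch below follows that reading; unnormalized_laplacian is a hypothetical helper, not the library code.

```python
def unnormalized_laplacian(M):
    """Hypothetical helper: L = D - M with self-similarities ignored."""
    n = len(M)
    # Copy M with a zeroed diagonal so self-similarity adds no degree.
    W = [[0.0 if i == j else M[i][j] for j in range(n)] for i in range(n)]
    degrees = [sum(abs(w) for w in row) for row in W]
    return [
        [degrees[i] if i == j else -W[i][j] for j in range(n)]
        for i in range(n)
    ]


M = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
lap = unnormalized_laplacian(M)
```

With non-negative similarities, every row of the resulting matrix sums to zero, which is a quick sanity check for a Laplacian of this form.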


get_spec_embs(L, k_oracle=4)[source]

Returns spectral embeddings and estimates the number of speakers using the maximum eigengap.

  • L (array (n_samples, n_samples)) – Laplacian matrix.

  • k_oracle (int) – Number of speakers when the condition is oracle number of speakers, else None.


Returns:

  • emb (array (n_samples, n_components)) – Spectral embedding for each sample with n eigen components.

  • num_of_spk (int) – Estimated number of speakers. If the condition is set to the oracle number of speakers, then k_oracle is returned.

cluster_embs(emb, k)[source]

Clusters the embeddings using kmeans.

  • emb (array (n_samples, n_components)) – Spectral embedding for each sample with n Eigen components.

  • k (int) – Number of clusters for kmeans.


Returns:

self.labels_ – Labels for each sample embedding.

Return type:

array



getEigenGaps(eig_vals)

Returns the differences (gaps) between adjacent eigenvalues.


eig_vals (list) – List of eigenvalues.

Returns:

eig_vals_gap_list – List of differences (gaps) between adjacent eigenvalues.

Return type:

list
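The eigengap heuristic can be sketched generically: sort the Laplacian eigenvalues in ascending order, take the differences between neighbours, and choose the number of clusters at the largest gap. Both helpers below are hypothetical; the exact indexing and the min_num_spkrs/max_num_spkrs bounds applied by get_spec_embs are not shown on this page and may differ.

```python
def eigen_gaps(eig_vals):
    """Hypothetical helper: differences between adjacent eigenvalues."""
    return [float(b) - float(a) for a, b in zip(eig_vals, eig_vals[1:])]


def estimate_num_clusters(eig_vals):
    """Pick k at the largest gap between ascending eigenvalues."""
    gaps = eigen_gaps(eig_vals)
    # A large gap after the k-th smallest eigenvalue suggests k clusters.
    return gaps.index(max(gaps)) + 1


# Three near-zero eigenvalues followed by a jump suggest three clusters.
eig_vals = [0.0, 0.01, 0.02, 0.9, 1.1]
gaps = eigen_gaps(eig_vals)
k = estimate_num_clusters(eig_vals)
```

In practice the estimate would also be clamped to a plausible range of speaker counts.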


speechbrain.processing.diarization.do_spec_clustering(diary_obj, out_rttm_file, rec_id, k, pval, affinity_type, n_neighbors)[source]

Performs spectral clustering on embeddings. This function calls specific clustering algorithms as per affinity.

  • diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.

  • out_rttm_file (str) – Path of the output RTTM file.

  • rec_id (str) – Recording ID for the recording under processing.

  • k (int) – Number of speakers (None, if it has to be estimated).

  • pval (float) – pval for pruning the affinity matrix.

  • affinity_type (str) – Type of similarity to be used to get affinity matrix (cos or nn).

  • n_neighbors (int) – Number of neighbors in estimating affinity matrix.

speechbrain.processing.diarization.do_kmeans_clustering(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)[source]

Performs kmeans clustering on embeddings.

  • diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.

  • out_rttm_file (str) – Path of the output RTTM file.

  • rec_id (str) – Recording ID for the recording under processing.

  • k_oracle (int) – Number of speakers (None, if it has to be estimated).

  • p_val (float) – p_val for pruning the affinity matrix. Used only when the number of speakers is unknown.

Note that this is just for experiment. Prefer spectral clustering for better clustering results.

speechbrain.processing.diarization.do_AHC(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)[source]

Performs Agglomerative Hierarchical Clustering on embeddings.

  • diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.

  • out_rttm_file (str) – Path of the output RTTM file.

  • rec_id (str) – Recording ID for the recording under processing.

  • k_oracle (int) – Number of speakers (None, if it has to be estimated).

  • p_val (float) – p_val for pruning the affinity matrix. Used only when the number of speakers is unknown.

Note that this is just for experiment. Prefer spectral clustering for better clustering results.