speechbrain.integrations.alignment.diarization module

This module contains basic functions used for speaker diarization. It depends on the open-source scikit-learn (sklearn) library, and a few scikit-learn functions are modified here as required.

Reference

This code is based on the following:

Von Luxburg, U. A tutorial on spectral clustering. Stat Comput 17, 395–416 (2007). https://doi.org/10.1007/s11222-007-9033-z
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/cluster/_spectral.py
https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/spectral_opt.py

Authors

Nauman Dawalatabad 2020

Summary

Classes:

`Spec_Clust_unorm`	This class implements spectral clustering with an unnormalized affinity matrix.
`Spec_Cluster`	Performs spectral clustering using sklearn on embeddings.

Functions:

`distribute_overlap`	Distributes the overlapped speech equally among the adjacent segments with different speakers.
`do_AHC`	Performs Agglomerative Hierarchical Clustering on embeddings.
`do_kmeans_clustering`	Performs kmeans clustering on embeddings.
`do_spec_clustering`	Performs spectral clustering on embeddings.
`get_oracle_num_spkrs`	Returns actual number of speakers in a recording from the ground-truth.
`is_overlapped`	Returns True if segments are overlapping.
`merge_ssegs_same_speaker`	Merge adjacent sub-segs from the same speaker.
`prepare_subset_csv`	Prepares csv for a given recording ID.
`read_rttm`	Reads and returns RTTM in list format.
`spectral_clustering_sb`	Performs spectral clustering.
`spectral_embedding_sb`	Returns spectral embeddings.
`write_ders_file`	Writes the final DERs for individual recordings.
`write_rttm`	Writes the segment list in RTTM format (a standard NIST format).

Reference

speechbrain.integrations.alignment.diarization.read_rttm(rttm_file_path)[source]

Reads and returns RTTM in list format.

Parameters: rttm_file_path (str) – Path to the RTTM file to be read.
Returns: rttm – List containing rows of RTTM file.
Return type: list
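
To illustrate what each returned row holds, the sketch below splits one RTTM line into named fields. The field order follows the NIST RTTM convention (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead). `RttmRow` and `parse_rttm_line` are hypothetical helpers written for this illustration and are not part of this module.

```python
from typing import NamedTuple


class RttmRow(NamedTuple):
    """Subset of the ten space-separated RTTM fields (hypothetical helper)."""

    rec_id: str
    channel: str
    onset: float
    duration: float
    speaker: str


def parse_rttm_line(line: str) -> RttmRow:
    # Field order: type, file ID, channel, onset, duration,
    # orthography, speaker type, speaker name, confidence, lookahead.
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not a SPEAKER line: {line!r}")
    return RttmRow(
        rec_id=fields[1],
        channel=fields[2],
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


row = parse_rttm_line("SPEAKER recording_0 0 0.0 1.0 <NA> <NA> speaker_0 <NA> <NA>")
print(row.rec_id, row.onset, row.duration, row.speaker)
```

Rows returned by `read_rttm` are plain strings, so a helper of this kind would be applied to each row after reading.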

speechbrain.integrations.alignment.diarization.write_ders_file(ref_rttm, DER, out_der_file)[source]

Writes the final DERs for individual recordings.

Parameters:

ref_rttm (str) – Reference RTTM file.
DER (array) – Array containing DER values of each recording.
out_der_file (str) – File to write the DERs.

Example

>>> rttm_file = getfixture("tmpdir").join("testfile.rttm")
>>> der_file = getfixture("tmpdir").join("der.txt")
>>> segs_list = [["recording_0", 0.0, 1.0, "speaker_0"]]
>>> write_rttm(segs_list, rttm_file)
>>> rttm = read_rttm(rttm_file)
>>> print(rttm)
['SPEAKER recording_0 0 0.0 1.0 <NA> <NA> speaker_0 <NA> <NA>']
>>> write_ders_file(rttm_file, [23.5], der_file)
>>> der_text = der_file.read()
>>> print(der_text.strip())
OVERALL  23.5

speechbrain.integrations.alignment.diarization.prepare_subset_csv(full_diary_csv, rec_id, out_csv_file)[source]

Prepares csv for a given recording ID.

Parameters:

full_diary_csv (csv) – Full csv containing all the recordings.
rec_id (str) – The recording ID for which the csv is prepared.
out_csv_file (str) – Path of the output csv file.

speechbrain.integrations.alignment.diarization.is_overlapped(end1, start2)[source]

Returns True if segments are overlapping.

Parameters:

end1 (float) – End time of the first segment.
start2 (float) – Start time of the second segment.

Returns:

overlapped – True if the segments overlap, else False.

Return type:

bool

Example

>>> is_overlapped(5.5, 3.4)
True
>>> is_overlapped(5.5, 6.4)
False

speechbrain.integrations.alignment.diarization.merge_ssegs_same_speaker(lol)[source]

Merge adjacent sub-segs from the same speaker.

Parameters: lol (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].
Returns: new_lol – Segment list in which adjacent segments from the same speaker ID are merged.
Return type: list of list

Example

>>> lol = [
...     ["r1", 5.5, 7.0, "s1"],
...     ["r1", 6.5, 9.0, "s1"],
...     ["r1", 8.0, 11.0, "s1"],
...     ["r1", 11.5, 13.0, "s2"],
...     ["r1", 14.0, 15.0, "s2"],
...     ["r1", 14.5, 15.0, "s1"],
... ]
>>> merge_ssegs_same_speaker(lol)
[['r1', 5.5, 11.0, 's1'], ['r1', 11.5, 13.0, 's2'], ['r1', 14.0, 15.0, 's2'], ['r1', 14.5, 15.0, 's1']]
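
The behavior in the example above can be approximated by the following minimal sketch. It assumes that two consecutive segments are merged only when they share a speaker ID and the second one starts before the first one ends, which is consistent with the example output. `merge_adjacent` is a hypothetical name and not the module's implementation.

```python
def merge_adjacent(segments):
    """Merge consecutive overlapping segments that share a speaker ID.

    Each segment is [rec_id, start, end, spkr_id]. Hypothetical sketch.
    """
    merged = []
    for rec_id, start, end, spkr_id in segments:
        last = merged[-1] if merged else None
        if last is not None and last[3] == spkr_id and start <= last[2]:
            # Same speaker and overlapping: extend the previous segment.
            last[2] = max(last[2], end)
        else:
            merged.append([rec_id, start, end, spkr_id])
    return merged


segments = [
    ["r1", 5.5, 7.0, "s1"],
    ["r1", 6.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s1"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 14.0, 15.0, "s2"],
    ["r1", 14.5, 15.0, "s1"],
]
merged_segments = merge_adjacent(segments)
print(merged_segments)
```

The two "s2" segments stay separate because they do not overlap in time.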

speechbrain.integrations.alignment.diarization.distribute_overlap(lol)[source]

Distributes the overlapped speech equally among the adjacent segments with different speakers.

Parameters: lol (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].
Returns: new_lol – Segment list in which each overlapped part is divided equally between the adjacent segments with different speaker IDs.
Return type: list of list

Example

>>> lol = [
...     ["r1", 5.5, 9.0, "s1"],
...     ["r1", 8.0, 11.0, "s2"],
...     ["r1", 11.5, 13.0, "s2"],
...     ["r1", 12.0, 15.0, "s1"],
... ]
>>> distribute_overlap(lol)
[['r1', 5.5, 8.5, 's1'], ['r1', 8.5, 11.0, 's2'], ['r1', 11.5, 12.5, 's2'], ['r1', 12.5, 15.0, 's1']]
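
One way to read the example above: when two consecutive segments from different speakers overlap, the boundary moves to the midpoint of the overlapped region. The sketch below implements that reading. `split_overlaps` is a hypothetical name, and the module's own rounding behavior is not reproduced here.

```python
def split_overlaps(segments):
    """Split each overlap between consecutive different-speaker segments in half.

    Each segment is [rec_id, start, end, spkr_id]. Hypothetical sketch.
    """
    out = [list(seg) for seg in segments]  # copy so the input is not modified
    for prev, cur in zip(out, out[1:]):
        overlap = prev[2] - cur[1]
        if overlap > 0 and prev[3] != cur[3]:
            midpoint = cur[1] + overlap / 2.0
            prev[2] = midpoint  # previous segment ends at the midpoint
            cur[1] = midpoint  # current segment starts at the midpoint
    return out


segments = [
    ["r1", 5.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s2"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 12.0, 15.0, "s1"],
]
distributed = split_overlaps(segments)
print(distributed)
```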

speechbrain.integrations.alignment.diarization.write_rttm(segs_list, out_rttm_file)[source]

Writes the segment list in RTTM format (a standard NIST format).

Parameters:

segs_list (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].
out_rttm_file (str) – Path of the output RTTM file.
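
As a rough illustration of the output format, the sketch below turns one segment into one RTTM line. It assumes the RTTM convention that the fourth and fifth fields are onset and duration (end minus start), a fixed channel of 0, and `<NA>` for unused fields, which matches the line shown in the `write_ders_file` example. `segment_to_rttm_line` is a hypothetical helper, and the rounding to four decimals is an assumption.

```python
def segment_to_rttm_line(segment):
    """Format [rec_id, start, end, spkr_id] as one RTTM SPEAKER line (sketch)."""
    rec_id, start, end, spkr_id = segment
    duration = round(end - start, 4)  # assumed rounding, for readability only
    return f"SPEAKER {rec_id} 0 {start} {duration} <NA> <NA> {spkr_id} <NA> <NA>"


line = segment_to_rttm_line(["recording_0", 0.0, 1.0, "speaker_0"])
print(line)
```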

speechbrain.integrations.alignment.diarization.get_oracle_num_spkrs(rec_id, spkr_info)[source]

Returns the actual number of speakers in a recording from the ground truth. Use this when the oracle number of speakers is assumed.

Parameters:

rec_id (str) – Recording ID for which the number of speakers has to be obtained.
spkr_info (list) – Header lines of the RTTM file, each starting with SPKR-INFO.

Returns:

num_spkrs – Number of speakers in the recording.

Return type:

int

Example

>>> spkr_info = [
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.B <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.C <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.D <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.A <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.B <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.C <NA> <NA>",
... ]
>>> get_oracle_num_spkrs("ES2011a", spkr_info)
4
>>> get_oracle_num_spkrs("ES2011b", spkr_info)
3
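
The counting in the example above can be sketched as follows. The sketch assumes that the second field of each SPKR-INFO line is the recording ID and the eighth field is the speaker name, as in the lines shown. `count_speakers` is a hypothetical helper, and the module may match recording IDs differently.

```python
def count_speakers(rec_id, spkr_info):
    """Count distinct speaker names listed for rec_id in SPKR-INFO lines (sketch)."""
    speakers = set()
    for line in spkr_info:
        fields = line.split()
        if len(fields) >= 8 and fields[0] == "SPKR-INFO" and fields[1] == rec_id:
            speakers.add(fields[7])  # eighth field holds the speaker name
    return len(speakers)


spkr_info = [
    "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>",
    "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.B <NA> <NA>",
    "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.A <NA> <NA>",
]
num_a = count_speakers("ES2011a", spkr_info)
num_b = count_speakers("ES2011b", spkr_info)
print(num_a, num_b)
```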

speechbrain.integrations.alignment.diarization.spectral_embedding_sb(adjacency, n_components=8, norm_laplacian=True, drop_first=True)[source]

Returns spectral embeddings.

Parameters:

adjacency (array-like or sparse graph) – The adjacency matrix of the graph to embed, with shape (n_samples, n_samples).
n_components (int) – The dimension of the projection subspace.
norm_laplacian (bool) – If True, then compute normalized Laplacian.
drop_first (bool) – Whether to drop the first eigenvector.

Returns:

embedding – Spectral embeddings for each sample.

Return type:

array

Example

>>> affinity = np.array(
...     [
...         [1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...     ]
... )
>>> embs = spectral_embedding_sb(affinity, 3)
>>> # Notice similar embeddings
>>> print(np.around(embs, decimals=3))
[[ 0.075  0.244  0.285]
 [ 0.083  0.356 -0.203]
 [ 0.083  0.356 -0.203]
 [ 0.26  -0.149  0.154]
 [ 0.29  -0.218 -0.11 ]
 [ 0.29  -0.218 -0.11 ]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.167 -0.044  0.316]]

speechbrain.integrations.alignment.diarization.spectral_clustering_sb(affinity, n_clusters=8, n_components=None, random_state=None, n_init=10)[source]

Performs spectral clustering.

Parameters:

affinity (matrix) – Affinity matrix.
n_clusters (int) – Number of clusters for kmeans.
n_components (int) – Number of components to retain while estimating spectral embeddings.
random_state (int) – A pseudo random number generator used by kmeans.
n_init (int) – Number of times the k-means algorithm will be run with different centroid seeds.

Returns:

labels – Cluster label for each sample.

Return type:

array

Example

>>> affinity = np.array(
...     [
...         [1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...     ]
... )
>>> labs = spectral_clustering_sb(affinity, 3)
>>> print(labs)
[1 1 1 0 0 0 2 2 2 2]

class speechbrain.integrations.alignment.diarization.Spec_Cluster(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol='auto', assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)[source]

Bases: SpectralClustering

Performs spectral clustering using sklearn on embeddings.

perform_sc(X, n_neighbors=10)[source]

Performs spectral clustering using sklearn on embeddings.

Parameters:

X (array (n_samples, n_features)) – Embeddings to be clustered.
n_neighbors (int) – Number of neighbors in estimating affinity matrix.

Returns:

Spec_Cluster

Reference

https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/cluster/_spectral.py

class speechbrain.integrations.alignment.diarization.Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)[source]

Bases: object

This class implements spectral clustering with an unnormalized affinity matrix. It is useful when the affinity matrix is based on cosine similarities.

Parameters:

min_num_spkrs (int) – Minimum number of expected speakers.
max_num_spkrs (int) – Maximum number of expected speakers.

Reference

Von Luxburg, U. A tutorial on spectral clustering. Stat Comput 17, 395–416 (2007). https://doi.org/10.1007/s11222-007-9033-z

Example

>>> clust = Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)
>>> emb = [
...     [2.1, 3.1, 4.1, 4.2, 3.1],
...     [2.2, 3.1, 4.2, 4.2, 3.2],
...     [2.0, 3.0, 4.0, 4.1, 3.0],
...     [8.0, 7.0, 7.0, 8.1, 9.0],
...     [8.1, 7.1, 7.2, 8.1, 9.2],
...     [8.3, 7.4, 7.0, 8.4, 9.0],
...     [0.3, 0.4, 0.4, 0.5, 0.8],
...     [0.4, 0.3, 0.6, 0.7, 0.8],
...     [0.2, 0.3, 0.2, 0.3, 0.7],
...     [0.3, 0.4, 0.4, 0.4, 0.7],
... ]
>>> # Estimating similarity matrix
>>> sim_mat = clust.get_sim_mat(emb)
>>> print(np.around(sim_mat[5:, 5:], decimals=3))
[[1.    0.957 0.961 0.904 0.966]
 [0.957 1.    0.977 0.982 0.997]
 [0.961 0.977 1.    0.928 0.972]
 [0.904 0.982 0.928 1.    0.976]
 [0.966 0.997 0.972 0.976 1.   ]]
>>> # Pruning
>>> pruned_sim_mat = clust.p_pruning(sim_mat, 0.3)
>>> print(np.around(pruned_sim_mat[5:, 5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.    0.982 0.997]
 [0.    0.977 1.    0.    0.972]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.    0.976 1.   ]]
>>> # Symmetrization
>>> sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
>>> print(np.around(sym_pruned_sim_mat[5:, 5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.489 0.982 0.997]
 [0.    0.489 1.    0.    0.486]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.486 0.976 1.   ]]
>>> # Laplacian
>>> laplacian = clust.get_laplacian(sym_pruned_sim_mat)
>>> print(np.around(laplacian[5:, 5:], decimals=3))
[[ 1.999  0.     0.     0.     0.   ]
 [ 0.     2.468 -0.489 -0.982 -0.997]
 [ 0.    -0.489  0.975  0.    -0.486]
 [ 0.    -0.982  0.     1.958 -0.976]
 [ 0.    -0.997 -0.486 -0.976  2.458]]
>>> # Spectral Embeddings
>>> spec_emb, num_of_spk = clust.get_spec_embs(laplacian, 3)
>>> print(num_of_spk)
3
>>> # Clustering
>>> clust.cluster_embs(spec_emb, num_of_spk)
>>> print(clust.labels_)
[0 0 0 2 2 2 1 1 1 1]
>>> # Complete spectral clustering
>>> clust.do_spec_clust(emb, k_oracle=3, p_val=0.3)
>>> print(clust.labels_)
[2 2 2 1 1 1 0 0 0 0]

do_spec_clust(X, k_oracle, p_val)[source]

Runs the complete spectral clustering on the embeddings and stores the cluster labels in labels_.

Parameters:

X (array) – (n_samples, n_features). Embeddings extracted from the model.
k_oracle (int) – Number of speakers, when the oracle number of speakers is used.
p_val (float) – p percent value used to prune the affinity matrix.

get_sim_mat(X)[source]

Returns the similarity matrix based on cosine similarities.

Parameters: X (array) – (n_samples, n_features). Embeddings extracted from the model.
Returns: M – (n_samples, n_samples). Similarity matrix with cosine similarities between each pair of embeddings.
Return type: array
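
For reference, cosine similarity between two embeddings is their dot product divided by the product of their norms. The plain-Python sketch below builds the full pairwise matrix under that definition. `cosine_similarity_matrix` is a hypothetical name, and the module's own implementation may rely on a library routine instead.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the (n_samples, n_samples) cosine similarity matrix (sketch)."""
    norms = [math.sqrt(sum(v * v for v in emb)) for emb in embeddings]
    return [
        [
            sum(a * b for a, b in zip(emb_i, emb_j)) / (norm_i * norm_j)
            for emb_j, norm_j in zip(embeddings, norms)
        ]
        for emb_i, norm_i in zip(embeddings, norms)
    ]


emb = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = cosine_similarity_matrix(emb)
print(sim)
```

Parallel vectors score 1 regardless of length and orthogonal vectors score 0, which is why the diagonal of the similarity matrix in the class example is all ones.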

p_pruning(A, pval)[source]

Refine the affinity matrix by zeroing less similar values.

Parameters:

A (array) – (n_samples, n_samples). Affinity matrix.
pval (float) – Proportion of the largest values to be retained in each row of the affinity matrix.

Returns:

A – (n_samples, n_samples). Pruned affinity matrix based on pval.

Return type:

array
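
The class example, where `pval=0.3` leaves three non-zero values in each row of a 10x10 matrix, suggests that pruning zeroes the smallest (1 - pval) share of entries in every row. The sketch below implements that reading on nested lists. `prune_rows` is a hypothetical name, and the truncation with `int` is an assumption.

```python
def prune_rows(affinity, pval):
    """Zero the smallest (1 - pval) fraction of entries in each row (sketch)."""
    n = len(affinity)
    n_zero = int((1 - pval) * n)  # number of entries to zero per row
    pruned = []
    for row in affinity:
        # Column indices sorted from the smallest to the largest value.
        order = sorted(range(n), key=lambda j: row[j])
        drop = set(order[:n_zero])
        pruned.append([0.0 if j in drop else v for j, v in enumerate(row)])
    return pruned


affinity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
]
pruned = prune_rows(affinity, 0.5)
print(pruned)
```

Because each row is pruned independently, the result is generally not symmetric, which is why the class example symmetrizes the matrix afterwards.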

get_laplacian(M)[source]

Returns the unnormalized Laplacian for the given affinity matrix.

Parameters: M (array) – (n_samples, n_samples) Affinity matrix.
Returns: L – (n_samples, n_samples) Laplacian matrix.
Return type: array
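
The unnormalized graph Laplacian is L = D - W, where W is the affinity matrix and D is the diagonal matrix of row degrees. The numbers in the class example are consistent with the diagonal of the affinity matrix being ignored when computing degrees, so the sketch below makes that assumption. `unnormalized_laplacian` is a hypothetical name.

```python
def unnormalized_laplacian(affinity):
    """Return L = D - W with self-similarities excluded from W (sketch)."""
    n = len(affinity)
    laplacian = []
    for i in range(n):
        # Degree: sum of absolute off-diagonal affinities in row i.
        degree = sum(abs(affinity[i][j]) for j in range(n) if j != i)
        laplacian.append(
            [degree if i == j else -affinity[i][j] for j in range(n)]
        )
    return laplacian


affinity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
lap = unnormalized_laplacian(affinity)
print(lap)
```

With non-negative affinities every row of the result sums to zero, a basic property of the unnormalized Laplacian.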

get_spec_embs(L, k_oracle=4)[source]

Returns spectral embeddings and estimates the number of speakers using the maximum eigengap.

Parameters:

L (array (n_samples, n_samples)) – Laplacian matrix.
k_oracle (int) – Number of speakers when the condition is oracle number of speakers, else None.

Returns:

emb (array (n_samples, n_components)) – Spectral embedding for each sample with n Eigen components.
num_of_spk (int) – Estimated number of speakers. If the condition is set to the oracle number of speakers then returns k_oracle.

cluster_embs(emb, k)[source]

Clusters the embeddings using kmeans.

Parameters:

emb (array (n_samples, n_components)) – Spectral embedding for each sample with n Eigen components.
k (int) – Number of clusters to kmeans.

getEigenGaps(eig_vals)[source]

Returns the differences (gaps) between adjacent eigenvalues.

Parameters: eig_vals (list) – List of eigenvalues.
Returns: eig_vals_gap_list – List of differences (gaps) between adjacent eigenvalues.
Return type: list
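
The eigengap heuristic can be sketched as follows: with eigenvalues of the Laplacian sorted in ascending order, the position of the largest gap between adjacent values is taken as the number of clusters. This sketch assumes the estimate is searched within the first `max_spk` gaps and clamped from below by `min_spk`, mirroring the `min_num_spkrs` and `max_num_spkrs` parameters of the class. `eigen_gaps` and `estimate_num_speakers` are hypothetical names, not the module's code.

```python
def eigen_gaps(eig_vals):
    """Differences between adjacent eigenvalues (sketch)."""
    return [float(b) - float(a) for a, b in zip(eig_vals, eig_vals[1:])]


def estimate_num_speakers(eig_vals, min_spk=2, max_spk=10):
    """Pick the cluster count at the largest eigengap (sketch)."""
    gaps = eigen_gaps(eig_vals)
    if not gaps:
        return min_spk
    candidates = gaps[:max_spk]
    # Gap i separates eigenvalue i from eigenvalue i + 1, so the largest
    # gap at index i suggests i + 1 clusters.
    k = candidates.index(max(candidates)) + 1
    return max(k, min_spk)


eig_vals = [0.0, 0.01, 0.02, 0.9, 1.0]  # ascending eigenvalues
num_spk = estimate_num_speakers(eig_vals)
print(num_spk)
```

Here three eigenvalues are close to zero before the jump to 0.9, so the estimate is three speakers.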

speechbrain.integrations.alignment.diarization.do_spec_clustering(diary_obj, out_rttm_file, rec_id, k, pval, affinity_type, n_neighbors)[source]

Performs spectral clustering on embeddings. This function calls a specific clustering algorithm depending on the affinity type.

Parameters:

diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k (int) – Number of speakers (None, if it has to be estimated).
pval (float) – pval for pruning affinity matrix.
affinity_type (str) – Type of similarity to be used to get affinity matrix (cos or nn).
n_neighbors (int) – Number of neighbors to use for clustering.

speechbrain.integrations.alignment.diarization.do_kmeans_clustering(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)[source]

Performs kmeans clustering on embeddings.

Parameters:

diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k_oracle (int) – Number of speakers (None, if it has to be estimated).
p_val (float) – pval for pruning the affinity matrix. Used only when the number of speakers is unknown. Note that this is experimental; prefer spectral clustering for better clustering results.

speechbrain.integrations.alignment.diarization.do_AHC(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)[source]

Performs Agglomerative Hierarchical Clustering on embeddings.

Parameters:

diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k_oracle (int) – Number of speakers (None, if it has to be estimated).
p_val (float) – pval for pruning the affinity matrix. Used only when the number of speakers is unknown. Note that this is experimental; prefer spectral clustering for better clustering results.