speechbrain.integrations.alignment.diarization module
This module contains basic functions used for speaker diarization. It depends on the open-source scikit-learn (sklearn) library, and a few scikit-learn functions are modified here as required.
Reference
This code is written using the following:
Von Luxburg, U. A tutorial on spectral clustering. Stat Comput 17, 395–416 (2007). https://doi.org/10.1007/s11222-007-9033-z
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/cluster/_spectral.py
https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/spectral_opt.py
- Authors
Nauman Dawalatabad 2020
Summary
Classes:
Spec_Clust_unorm | This class implements the spectral clustering with unnormalized affinity matrix.
Spec_Cluster | Performs spectral clustering using sklearn on embeddings.
Functions:
distribute_overlap | Distributes the overlapped speech equally among the adjacent segments with different speakers.
do_AHC | Performs Agglomerative Hierarchical Clustering on embeddings.
do_kmeans_clustering | Performs kmeans clustering on embeddings.
do_spec_clustering | Performs spectral clustering on embeddings.
get_oracle_num_spkrs | Returns actual number of speakers in a recording from the ground-truth.
is_overlapped | Returns True if segments are overlapping.
merge_ssegs_same_speaker | Merge adjacent sub-segs from the same speaker.
prepare_subset_csv | Prepares csv for a given recording ID.
read_rttm | Reads and returns RTTM in list format.
spectral_clustering_sb | Performs spectral clustering.
spectral_embedding_sb | Returns spectral embeddings.
write_ders_file | Write the final DERs for individual recording.
write_rttm | Writes the segment list in RTTM format (A standard NIST format).
Reference
- speechbrain.integrations.alignment.diarization.read_rttm(rttm_file_path)
Reads and returns RTTM in list format.
- speechbrain.integrations.alignment.diarization.write_ders_file(ref_rttm, DER, out_der_file)
Write the final DERs for individual recording.
- Parameters:
Example
>>> rttm_file = getfixture("tmpdir").join("testfile.rttm")
>>> der_file = getfixture("tmpdir").join("der.txt")
>>> segs_list = [["recording_0", 0.0, 1.0, "speaker_0"]]
>>> write_rttm(segs_list, rttm_file)
>>> rttm = read_rttm(rttm_file)
>>> print(rttm)
['SPEAKER recording_0 0 0.0 1.0 <NA> <NA> speaker_0 <NA> <NA>']
>>> write_ders_file(rttm_file, [23.5], der_file)
>>> der_text = der_file.read()
>>> print(der_text.strip())
OVERALL 23.5
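The RTTM line printed in the example above has ten space-separated fields (type, file ID, channel, turn onset, turn duration, orthography, speaker type, speaker name, confidence, signal lookahead time). As a minimal sketch of how such a line could be split into its useful parts, the snippet below defines a hypothetical helper, `parse_rttm_line`, and a hypothetical `RttmTurn` record. Neither is part of this module, and `read_rttm` itself only returns the raw lines as a list.

```python
from typing import NamedTuple


class RttmTurn(NamedTuple):
    """Hypothetical record holding the informative fields of one SPEAKER line."""

    rec_id: str
    onset: float
    duration: float
    speaker: str


def parse_rttm_line(line):
    """Split one RTTM SPEAKER line into recording ID, onset, duration, speaker."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"Not an RTTM SPEAKER line: {line!r}")
    # Field 3 is the turn onset and field 4 the turn duration (both in seconds).
    return RttmTurn(fields[1], float(fields[3]), float(fields[4]), fields[7])


turn = parse_rttm_line(
    "SPEAKER recording_0 0 0.0 1.0 <NA> <NA> speaker_0 <NA> <NA>"
)
print(turn.rec_id, turn.onset, turn.duration, turn.speaker)
```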
- speechbrain.integrations.alignment.diarization.prepare_subset_csv(full_diary_csv, rec_id, out_csv_file)
Prepares csv for a given recording ID.
- speechbrain.integrations.alignment.diarization.is_overlapped(end1, start2)
Returns True if segments are overlapping.
- Parameters:
- Returns:
overlapped – True if the segments overlap, else False.
- Return type:
Example
>>> is_overlapped(5.5, 3.4)
True
>>> is_overlapped(5.5, 6.4)
False
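The check in the example above amounts to comparing the end of the first segment with the start of the second. A minimal sketch follows, using the hypothetical name `segments_overlap`. Treating two segments that merely touch (start2 == end1) as overlapping is an assumption made here, since the example does not cover that boundary case.

```python
def segments_overlap(end1, start2):
    """Return True if a segment starting at start2 begins before end1.

    Assumption: segments that exactly touch (start2 == end1) count as
    overlapping; the documented example does not settle this case.
    """
    return start2 <= end1


print(segments_overlap(5.5, 3.4))
print(segments_overlap(5.5, 6.4))
```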
- speechbrain.integrations.alignment.diarization.merge_ssegs_same_speaker(lol)
Merge adjacent sub-segs from the same speaker.
- Parameters:
lol (list of list) – Each list contains [rec_id, sseg_start, sseg_end, spkr_id].
- Returns:
new_lol – new_lol contains adjacent segments merged from the same speaker ID.
- Return type:
Example
>>> lol = [
...     ["r1", 5.5, 7.0, "s1"],
...     ["r1", 6.5, 9.0, "s1"],
...     ["r1", 8.0, 11.0, "s1"],
...     ["r1", 11.5, 13.0, "s2"],
...     ["r1", 14.0, 15.0, "s2"],
...     ["r1", 14.5, 15.0, "s1"],
... ]
>>> merge_ssegs_same_speaker(lol)
[['r1', 5.5, 11.0, 's1'], ['r1', 11.5, 13.0, 's2'], ['r1', 14.0, 15.0, 's2'], ['r1', 14.5, 15.0, 's1']]
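In the example above, consecutive segments are merged only when they share a speaker and also overlap in time: the two `s2` segments stay separate because of the gap between 13.0 and 14.0. A minimal sketch of that behavior is given below. The function name `merge_same_speaker` is hypothetical and the code is an illustration consistent with the example, not the module's implementation.

```python
def merge_same_speaker(segments):
    """Merge consecutive, overlapping segments that share a speaker.

    Each segment is [rec_id, start, end, spkr_id]; input order is preserved.
    """
    merged = []
    for rec_id, start, end, spkr in segments:
        if merged:
            prev = merged[-1]
            # Merge only if the speaker matches and this segment starts
            # before the previous (possibly already merged) one ends.
            if prev[3] == spkr and start <= prev[2]:
                prev[2] = max(prev[2], end)
                continue
        merged.append([rec_id, start, end, spkr])
    return merged


lol = [
    ["r1", 5.5, 7.0, "s1"],
    ["r1", 6.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s1"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 14.0, 15.0, "s2"],
    ["r1", 14.5, 15.0, "s1"],
]
print(merge_same_speaker(lol))
```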
- speechbrain.integrations.alignment.diarization.distribute_overlap(lol)
Distributes the overlapped speech equally among the adjacent segments with different speakers.
- Parameters:
lol (list of list) – Each list has the structure [rec_id, sseg_start, sseg_end, spkr_id].
- Returns:
new_lol – The segments with each overlapped part divided equally between the adjacent segments with different speaker IDs.
- Return type:
Example
>>> lol = [
...     ["r1", 5.5, 9.0, "s1"],
...     ["r1", 8.0, 11.0, "s2"],
...     ["r1", 11.5, 13.0, "s2"],
...     ["r1", 12.0, 15.0, "s1"],
... ]
>>> distribute_overlap(lol)
[['r1', 5.5, 8.5, 's1'], ['r1', 8.5, 11.0, 's2'], ['r1', 11.5, 12.5, 's2'], ['r1', 12.5, 15.0, 's1']]
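The example above moves the boundary between two overlapping segments of different speakers to the midpoint of the overlapped region. A minimal sketch of that idea follows. The name `split_overlap_equally` is hypothetical, and any rounding the module may apply to the new boundaries is not reproduced here.

```python
def split_overlap_equally(segments):
    """Give each of two overlapping neighbors half of the overlapped region.

    Each segment is [rec_id, start, end, spkr_id]. Only consecutive segments
    with different speakers are adjusted; the input list is not modified.
    """
    out = [list(seg) for seg in segments]
    for prev, cur in zip(out, out[1:]):
        if prev[3] != cur[3] and cur[1] < prev[2]:
            # The overlap spans [cur start, prev end]; cut it at its midpoint.
            midpoint = (prev[2] + cur[1]) / 2.0
            prev[2] = midpoint
            cur[1] = midpoint
    return out


lol = [
    ["r1", 5.5, 9.0, "s1"],
    ["r1", 8.0, 11.0, "s2"],
    ["r1", 11.5, 13.0, "s2"],
    ["r1", 12.0, 15.0, "s1"],
]
print(split_overlap_equally(lol))
```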
- speechbrain.integrations.alignment.diarization.write_rttm(segs_list, out_rttm_file)
Writes the segment list in RTTM format (a standard NIST format).
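As an illustration of the mapping from a segment to an RTTM line, the sketch below formats one [rec_id, start, end, spkr_id] entry as a SPEAKER record. RTTM stores the turn onset and the turn duration rather than the end time. The helper name `segment_to_rttm_line`, the fixed channel value 0, and the rounding to four decimals are assumptions made for this sketch.

```python
def segment_to_rttm_line(seg):
    """Format [rec_id, start, end, spkr_id] as one RTTM SPEAKER line."""
    rec_id, start, end, spkr = seg
    # RTTM expects onset and duration; rounding to 4 decimals is an assumption.
    duration = round(end - start, 4)
    return f"SPEAKER {rec_id} 0 {start} {duration} <NA> <NA> {spkr} <NA> <NA>"


print(segment_to_rttm_line(["recording_0", 0.0, 1.0, "speaker_0"]))
```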
- speechbrain.integrations.alignment.diarization.get_oracle_num_spkrs(rec_id, spkr_info)
Returns the actual number of speakers in a recording from the ground truth. Use this when the number of speakers is assumed to be known (the oracle condition).
- Parameters:
- Returns:
num_spkrs
- Return type:
Example
>>> spkr_info = [
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.B <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.C <NA> <NA>",
...     "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.D <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.A <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.B <NA> <NA>",
...     "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.C <NA> <NA>",
... ]
>>> get_oracle_num_spkrs("ES2011a", spkr_info)
4
>>> get_oracle_num_spkrs("ES2011b", spkr_info)
3
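The counts in the example above can be reproduced by collecting the distinct speaker names on the SPKR-INFO lines that belong to the requested recording. The sketch below uses the hypothetical name `count_oracle_speakers`. Reading the speaker name from the eighth field is based on the lines shown in the example, and the module may count differently internally.

```python
def count_oracle_speakers(rec_id, spkr_info):
    """Count distinct speaker names on SPKR-INFO lines of one recording."""
    speakers = set()
    for line in spkr_info:
        fields = line.split()
        if len(fields) > 7 and fields[0] == "SPKR-INFO" and fields[1] == rec_id:
            speakers.add(fields[7])  # speaker name, e.g. "ES2011a.A"
    return len(speakers)


spkr_info = [
    "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.A <NA> <NA>",
    "SPKR-INFO ES2011a 0 <NA> <NA> <NA> unknown ES2011a.B <NA> <NA>",
    "SPKR-INFO ES2011b 0 <NA> <NA> <NA> unknown ES2011b.A <NA> <NA>",
]
print(count_oracle_speakers("ES2011a", spkr_info))
```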
- speechbrain.integrations.alignment.diarization.spectral_embedding_sb(adjacency, n_components=8, norm_laplacian=True, drop_first=True)
Returns spectral embeddings.
- Parameters:
adjacency (array-like or sparse graph) – The adjacency matrix of the graph to embed, of shape (n_samples, n_samples).
n_components (int) – The dimension of the projection subspace.
norm_laplacian (bool) – If True, then compute normalized Laplacian.
drop_first (bool) – Whether to drop the first eigenvector.
- Returns:
embedding – Spectral embeddings for each sample.
- Return type:
array
Example
>>> affinity = np.array(
...     [
...         [1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...     ]
... )
>>> embs = spectral_embedding_sb(affinity, 3)
>>> # Notice similar embeddings
>>> print(np.around(embs, decimals=3))
[[ 0.075  0.244  0.285]
 [ 0.083  0.356 -0.203]
 [ 0.083  0.356 -0.203]
 [ 0.26  -0.149  0.154]
 [ 0.29  -0.218 -0.11 ]
 [ 0.29  -0.218 -0.11 ]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.198 -0.084 -0.122]
 [-0.167 -0.044  0.316]]
- speechbrain.integrations.alignment.diarization.spectral_clustering_sb(affinity, n_clusters=8, n_components=None, random_state=None, n_init=10)
Performs spectral clustering.
- Parameters:
affinity (matrix) – Affinity matrix.
n_clusters (int) – Number of clusters for kmeans.
n_components (int) – Number of components to retain while estimating spectral embeddings.
random_state (int) – Seed for the pseudo-random number generator used by kmeans.
n_init (int) – Number of times the k-means algorithm will be run with different centroid seeds.
- Returns:
labels – Cluster label for each sample.
- Return type:
array
Example
>>> affinity = np.array(
...     [
...         [1, 1, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
...         [0.5, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...         [0.5, 0, 0, 0, 0, 0, 1, 1, 1, 1],
...     ]
... )
>>> labs = spectral_clustering_sb(affinity, 3)
>>> print(labs)
[1 1 1 0 0 0 2 2 2 2]
- class speechbrain.integrations.alignment.diarization.Spec_Cluster(n_clusters=8, *, eigen_solver=None, n_components=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol='auto', assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, verbose=False)
Bases: SpectralClustering
Performs spectral clustering using sklearn on embeddings.
- perform_sc(X, n_neighbors=10)
Performs spectral clustering using sklearn on embeddings.
- Parameters:
X (array (n_samples, n_features)) – Embeddings to be clustered.
n_neighbors (int) – Number of neighbors in estimating affinity matrix.
- Returns:
Spec_Cluster
Reference
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/cluster/_spectral.py
- class speechbrain.integrations.alignment.diarization.Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)
Bases: object
This class implements the spectral clustering with unnormalized affinity matrix. Useful when affinity matrix is based on cosine similarities.
- Parameters:
Example
>>> clust = Spec_Clust_unorm(min_num_spkrs=2, max_num_spkrs=10)
>>> emb = [
...     [2.1, 3.1, 4.1, 4.2, 3.1],
...     [2.2, 3.1, 4.2, 4.2, 3.2],
...     [2.0, 3.0, 4.0, 4.1, 3.0],
...     [8.0, 7.0, 7.0, 8.1, 9.0],
...     [8.1, 7.1, 7.2, 8.1, 9.2],
...     [8.3, 7.4, 7.0, 8.4, 9.0],
...     [0.3, 0.4, 0.4, 0.5, 0.8],
...     [0.4, 0.3, 0.6, 0.7, 0.8],
...     [0.2, 0.3, 0.2, 0.3, 0.7],
...     [0.3, 0.4, 0.4, 0.4, 0.7],
... ]
>>> # Estimating similarity matrix
>>> sim_mat = clust.get_sim_mat(emb)
>>> print(np.around(sim_mat[5:, 5:], decimals=3))
[[1.    0.957 0.961 0.904 0.966]
 [0.957 1.    0.977 0.982 0.997]
 [0.961 0.977 1.    0.928 0.972]
 [0.904 0.982 0.928 1.    0.976]
 [0.966 0.997 0.972 0.976 1.   ]]
>>> # Pruning
>>> pruned_sim_mat = clust.p_pruning(sim_mat, 0.3)
>>> print(np.around(pruned_sim_mat[5:, 5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.    0.982 0.997]
 [0.    0.977 1.    0.    0.972]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.    0.976 1.   ]]
>>> # Symmetrization
>>> sym_pruned_sim_mat = 0.5 * (pruned_sim_mat + pruned_sim_mat.T)
>>> print(np.around(sym_pruned_sim_mat[5:, 5:], decimals=3))
[[1.    0.    0.    0.    0.   ]
 [0.    1.    0.489 0.982 0.997]
 [0.    0.489 1.    0.    0.486]
 [0.    0.982 0.    1.    0.976]
 [0.    0.997 0.486 0.976 1.   ]]
>>> # Laplacian
>>> laplacian = clust.get_laplacian(sym_pruned_sim_mat)
>>> print(np.around(laplacian[5:, 5:], decimals=3))
[[ 1.999  0.     0.     0.     0.   ]
 [ 0.     2.468 -0.489 -0.982 -0.997]
 [ 0.    -0.489  0.975  0.    -0.486]
 [ 0.    -0.982  0.     1.958 -0.976]
 [ 0.    -0.997 -0.486 -0.976  2.458]]
>>> # Spectral Embeddings
>>> spec_emb, num_of_spk = clust.get_spec_embs(laplacian, 3)
>>> print(num_of_spk)
3
>>> # Clustering
>>> clust.cluster_embs(spec_emb, num_of_spk)
>>> print(clust.labels_)
[0 0 0 2 2 2 1 1 1 1]
>>> # Complete spectral clustering
>>> clust.do_spec_clust(emb, k_oracle=3, p_val=0.3)
>>> print(clust.labels_)
[2 2 2 1 1 1 0 0 0 0]
- get_sim_mat(X)
Returns the similarity matrix based on cosine similarities.
- Parameters:
X (array) – (n_samples, n_features). Embeddings extracted from the model.
- Returns:
M – (n_samples, n_samples). Similarity matrix with cosine similarities between each pair of embeddings.
- Return type:
array
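For reference, a cosine similarity matrix can be computed by normalizing each embedding to unit length and taking all pairwise dot products. The sketch below does this with NumPy. The function name `cosine_similarity_matrix` and the small clipping constant that guards against zero-length vectors are assumptions of this sketch, and `get_sim_mat` may be implemented differently.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Return the (n_samples, n_samples) matrix of pairwise cosine similarities."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Clip the norms to avoid dividing by zero for all-zero embeddings.
    unit = X / np.clip(norms, 1e-12, None)
    return unit @ unit.T


emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(np.around(cosine_similarity_matrix(emb), decimals=3))
```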
- p_pruning(A, pval)
Refines the affinity matrix by zeroing the less similar values.
- Parameters:
A (array) – (n_samples, n_samples). Affinity matrix.
pval (float) – Fraction of the largest values to retain in each row of the affinity matrix.
- Returns:
A – (n_samples, n_samples). Pruned affinity matrix based on pval.
- Return type:
array
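Row-wise pruning of this kind can be sketched as follows: in every row, the smallest entries are set to zero so that roughly a fraction pval of the entries survives. The name `prune_rows` is hypothetical, and computing the number of zeroed entries as int((1 - pval) * n_samples) is an assumption, chosen because it agrees with the class example in which pval=0.3 leaves three non-zero entries per row of a 10 x 10 matrix. Unlike a possible in-place implementation, this sketch works on a copy.

```python
import numpy as np


def prune_rows(A, pval):
    """Zero the smallest entries of each row, keeping about a fraction pval."""
    A = np.array(A, dtype=float)  # np.array copies, so the input is untouched
    n_zero = int((1 - pval) * A.shape[0])
    for i in range(A.shape[0]):
        smallest = np.argsort(A[i])[:n_zero]  # indices of the lowest values
        A[i, smallest] = 0.0
    return A


A = np.array(
    [
        [1.0, 0.2, 0.9, 0.1],
        [0.2, 1.0, 0.3, 0.8],
        [0.9, 0.3, 1.0, 0.4],
        [0.1, 0.8, 0.4, 1.0],
    ]
)
print(prune_rows(A, 0.5))
```

The pruned matrix is in general no longer symmetric, which is why the class example averages it with its transpose before computing the Laplacian.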
- get_laplacian(M)
Returns the un-normalized Laplacian for the given affinity matrix.
- Parameters:
M (array) – (n_samples, n_samples) Affinity matrix.
- Returns:
L – (n_samples, n_samples) Laplacian matrix.
- Return type:
array
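The un-normalized Laplacian is L = D - M, where D is the diagonal degree matrix. In the class example, each diagonal entry of the printed Laplacian equals the sum of the off-diagonal affinities in its row (for instance 0.489 + 0.486 = 0.975), which suggests that self-similarities are excluded. The sketch below follows that reading. The name `unnormalized_laplacian` is hypothetical and zeroing the diagonal first is an assumption inferred from that output.

```python
import numpy as np


def unnormalized_laplacian(M):
    """Return L = D - M after removing self-similarities from M."""
    M = np.array(M, dtype=float)  # work on a copy
    np.fill_diagonal(M, 0.0)  # assumption: self-loops are ignored
    D = np.diag(np.abs(M).sum(axis=1))  # degree of each node
    return D - M


M = np.array(
    [
        [1.0, 0.5, 0.0],
        [0.5, 1.0, 0.25],
        [0.0, 0.25, 1.0],
    ]
)
print(unnormalized_laplacian(M))
```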
- get_spec_embs(L, k_oracle=4)
Returns spectral embeddings and estimates the number of speakers using the maximum eigengap.
- Parameters:
L (array (n_samples, n_samples)) – Laplacian matrix.
k_oracle (int) – Number of speakers when the condition is oracle number of speakers, else None.
- Returns:
emb (array (n_samples, n_components)) – Spectral embedding for each sample with n Eigen components.
num_of_spk (int) – Estimated number of speakers. If the condition is set to the oracle number of speakers then returns k_oracle.
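The maximum-eigengap heuristic (see Von Luxburg, 2007) picks the number of clusters k at the largest jump between consecutive eigenvalues of the Laplacian, sorted in ascending order, and uses the first k eigenvectors as the embedding. The sketch below illustrates the idea with NumPy. The function name `eigengap_embedding` and its `max_num_spkrs` argument are hypothetical, and the sketch omits details such as enforcing a minimum number of speakers, so it should not be read as the method's actual implementation.

```python
import numpy as np


def eigengap_embedding(laplacian, k_oracle=None, max_num_spkrs=10):
    """Return (embedding, k) from the eigenvectors of a symmetric Laplacian."""
    # eigh returns the eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    if k_oracle is not None:
        k = k_oracle
    else:
        # gaps[i] is the jump between eigenvalue i and i + 1; the largest
        # jump after the (i + 1)-th eigenvalue suggests i + 1 clusters.
        gaps = np.diff(eigvals[: max_num_spkrs + 1])
        k = int(np.argmax(gaps)) + 1
    return eigvecs[:, :k], k


# Toy affinity with three disconnected groups of sizes 3, 3, and 4.
affinity = np.zeros((10, 10))
for lo, hi in [(0, 3), (3, 6), (6, 10)]:
    affinity[lo:hi, lo:hi] = 1.0
np.fill_diagonal(affinity, 0.0)
toy_laplacian = np.diag(affinity.sum(axis=1)) - affinity

emb, num_of_spk = eigengap_embedding(toy_laplacian)
print(num_of_spk, emb.shape)
```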
- speechbrain.integrations.alignment.diarization.do_spec_clustering(diary_obj, out_rttm_file, rec_id, k, pval, affinity_type, n_neighbors)
Performs spectral clustering on embeddings. This function calls the specific clustering algorithm that matches the affinity type.
- Parameters:
diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k (int) – Number of speakers (None if it has to be estimated).
pval (float) – pval for pruning the affinity matrix.
affinity_type (str) – Type of similarity to be used to get the affinity matrix (cos or nn).
n_neighbors (int) – Number of neighbors to use for clustering.
- speechbrain.integrations.alignment.diarization.do_kmeans_clustering(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)
Performs kmeans clustering on embeddings.
- Parameters:
diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k_oracle (int) – Number of speakers (None if it has to be estimated).
p_val (float) – pval for pruning the affinity matrix. Used only when the number of speakers is unknown. Note that this is for experimentation only; prefer spectral clustering for better clustering results.
- speechbrain.integrations.alignment.diarization.do_AHC(diary_obj, out_rttm_file, rec_id, k_oracle=4, p_val=0.3)
Performs Agglomerative Hierarchical Clustering on embeddings.
- Parameters:
diary_obj (StatObject_SB type) – Contains embeddings in diary_obj.stat1 and segment IDs in diary_obj.segset.
out_rttm_file (str) – Path of the output RTTM file.
rec_id (str) – Recording ID for the recording under processing.
k_oracle (int) – Number of speakers (None if it has to be estimated).
p_val (float) – pval for pruning the affinity matrix. Used only when the number of speakers is unknown. Note that this is for experimentation only; prefer spectral clustering for better clustering results.