speechbrain.integrations.k2_fsa.align module

Forced alignment using k2 for CTC models. This module provides an abstract class, Aligner, for forced alignment with k2, and a concrete class, CTCAligner, for forced alignment with a pre-trained CTC model and a tokeniser (CTCTextEncoder). The blank symbol must be index 0 in the tokeniser’s vocabulary.

To implement their own aligner, users can follow CTCAligner as a template. Two methods of the Aligner class must be implemented:

encode_texts: encode texts (List[str]) to a list of lists of token indexes (List[List[int]]).

get_log_prob_and_targets: get the log-probabilities (torch.Tensor), their lengths (torch.Tensor) and the targets (List[List[int]]) from audio files and transcripts.

The align method is implemented in the Aligner class, so users do not need to implement it. Forced alignment can be run on three kinds of input, each at two levels (tokens and words):

One audio file and one transcript at a time.

A batch of audio files and transcripts.

A csv file containing the audio file paths and transcripts. The csv file should follow the standard speechbrain csv format, with a header line as follows: ID, duration, wav, spk_id, wrd
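As a rough illustration of the csv input described above, the sketch below writes a two-row file with the listed header using Python's standard csv module. The helper name write_example_csv and every value in the rows (IDs, durations, paths, speaker labels, transcripts) are made up for illustration.

```python
import csv
import io

# Header order follows the speechbrain csv format named above.
FIELDS = ["ID", "duration", "wav", "spk_id", "wrd"]


def write_example_csv(handle):
    """Write two illustrative rows; every value below is hypothetical."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(
        {"ID": "utt1", "duration": "2.87", "wav": "audio/utt1.wav",
         "spk_id": "spk1", "wrd": "HELLO WORLD"}
    )
    writer.writerow(
        {"ID": "utt2", "duration": "3.10", "wav": "audio/utt2.wav",
         "spk_id": "spk2", "wrd": "HELLO SPEECHBRAIN"}
    )


# Write to an in-memory buffer and read it back to confirm the layout.
buffer = io.StringIO()
write_example_csv(buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

Writing to a real file works the same way with open(path, "w", newline="") in place of the in-memory buffer.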

When token-level alignment is conducted on a single audio file, the aligning method returns a list of integers, one per output frame, where each integer is the index of a token in the tokeniser’s vocabulary. For a batch of audio files it returns a list of such lists. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], the result for a batch of two files may look like [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]. For a csv file, align_csv_to_tokens writes the alignments to an output file in the format <audio id> <token alignment>.
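A frame-level token alignment can be turned back into the token sequence it encodes by standard CTC collapsing: merge runs of identical labels, then drop the blank. The sketch below is not part of the module; collapse_alignment is a hypothetical helper and the alignment is made up, using the example vocabulary above.

```python
from itertools import groupby
from typing import List

BLANK = 0  # the module requires the blank symbol to be index 0


def collapse_alignment(frame_tokens: List[int]) -> List[int]:
    """Merge runs of identical frame labels, then remove blanks."""
    return [token for token, _ in groupby(frame_tokens) if token != BLANK]


vocab = ["<blank>", "<unk>", "a", "b", "c"]
alignment = [0, 2, 2, 0, 3, 3, 3, 0, 0, 4]  # one label per frame
tokens = collapse_alignment(alignment)       # [2, 3, 4]
symbols = [vocab[i] for i in tokens]         # ['a', 'b', 'c']
```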

When word-level alignment is conducted on a single audio file, the aligning method returns a list of tuples (start (inclusive), end (inclusive), word (str)); for a batch it returns a list of such lists. For example, if the transcript is ‘hello world’, there are 20 frames in the audio file, and frame_shift is None, the result for a batch of one file may look like [[(3, 10, ‘hello’), (11, 16, ‘world’)]]. If frame_shift is not None, start and end are given in seconds instead of frames. For a csv file, align_csv_to_words writes the results to an output csv file with the columns [‘ID’, ‘word’, ‘start’, ‘end’], where start and end are in seconds unless frame_shift is None, in which case they are in frames.
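The relation between frame indexes and seconds can be sketched as follows. This assumes that a time is simply the frame index multiplied by frame_shift, with no offset added to the end frame; the library may compute it differently, and frames_to_seconds is a hypothetical helper.

```python
from typing import List, Tuple


def frames_to_seconds(
    word_alignment: List[Tuple[int, int, str]], frame_shift: float
) -> List[Tuple[float, float, str]]:
    """Scale start and end frame indexes by the frame shift (in seconds)."""
    return [
        (start * frame_shift, end * frame_shift, word)
        for start, end, word in word_alignment
    ]


in_frames = [(3, 10, "hello"), (11, 16, "world")]
in_seconds = frames_to_seconds(in_frames, frame_shift=0.02)
```

Because the products are floating-point values, compare them with a tolerance rather than with exact equality.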

Author:

Zeyu Zhao 2024

Summary

Classes:

`Aligner`	Abstract class for aligner.
`CTCAligner`	Aligner class for CTC models. There are six methods designed to be applied by users directly: align_audio_to_tokens, align_audio_to_words, align_batch_to_tokens, align_batch_to_words, align_csv_to_tokens, align_csv_to_words. For more details, please refer to the documentation of each method.

Reference

class speechbrain.integrations.k2_fsa.align.Aligner[source]

Bases: ABC

Abstract class for aligner.

To implement your own aligner, you need to implement two methods:

encode_texts: encode texts (List[str]) to a list of lists of token indexes (List[List[int]]).
get_log_prob_and_targets: get the log-probabilities (torch.Tensor), their lengths (torch.Tensor) and the targets (List[List[int]]) from audio files and transcripts.

The align method is implemented in the Aligner class, so users do not need to implement it. We support three different ways of conducting forced alignment:

One audio file and one transcript at a time.

A batch of audio files and transcripts.

A csv file containing the audio file paths and transcripts.

When token-level alignment is conducted, for one single audio file, the aligning method will return a list of integers, where each integer represents the index of the token in the tokeniser’s vocabulary. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], then the returned list of integers may look like [0, 1, 2, 3, 4].

For a batch of audio files, the aligning method will return a list of lists of integers, where each integer represents the index of the token in the tokeniser’s vocabulary. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], then the returned list of lists of integers may look like [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]].

For a csv file, the aligning method (align_csv_to_tokens) writes the token alignments to an output file in the format <audio id> <token alignment>.

When word-level alignment is conducted, for one single audio file, the aligning method will return a list of tuples, where each tuple represents (start_frame (int, inclusive), end_frame (int, inclusive), word (str)). For example, if the transcript is ‘hello world’, and there are 20 frames in the audio file, then the returned list of tuples may look like [(3, 10, ‘hello’), (11, 16, ‘world’)]. If the frame_shift argument of the word-level method is None, the start and end will be in frames; otherwise they will be in seconds.

For a batch of audio files, the aligning method will return a list of lists of tuples, where each tuple represents (start_frame (int, inclusive), end_frame (int, inclusive), word (str)). For example, if the transcripts are [‘hello world’, ‘hello speechbrain’], and there are 20 frames in each audio file, then the returned list of lists of tuples may look like [[(3, 10, ‘hello’), (11, 16, ‘world’)], [(3, 10, ‘hello’), (11, 20, ‘speechbrain’)]].

For an input of csv file, the aligning method will return nothing but save the alignment results to a csv file. The columns of the csv file are [‘ID’, ‘word’, ‘start’, ‘end’]. The start and end are in seconds if frame_shift is not None, and in frames otherwise.

abstractmethod encode_texts(texts: List[str]) → List[List[int]][source]

Encode texts to list of tokens.

Parameters: texts (List[str], the texts to be encoded.)
Return type: List[List[int]], the encoded texts.
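The contract of encode_texts can be illustrated with a toy character-level encoder. This is only a sketch under stated assumptions: the vocabulary below is hypothetical, the function is written standalone rather than as a method of an Aligner subclass, and unknown characters are mapped to ‘<unk>’. Index 0 is reserved for the blank symbol and is never produced by the encoder.

```python
from typing import Dict, List

# Hypothetical character vocabulary; index 0 is reserved for the blank symbol.
VOCAB = ["<blank>", "<unk>", " ", "a", "b", "c"]
TOKEN_TO_ID: Dict[str, int] = {token: i for i, token in enumerate(VOCAB)}


def encode_texts(texts: List[str]) -> List[List[int]]:
    """Map each character to its vocabulary index, falling back to <unk>."""
    unk = TOKEN_TO_ID["<unk>"]
    return [[TOKEN_TO_ID.get(ch, unk) for ch in text] for text in texts]


encoded = encode_texts(["ab c", "cax"])  # [[3, 4, 2, 5], [5, 3, 1]]
```

In a real subclass, the same logic would live in the encode_texts method and use the tokeniser that belongs to the model.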

abstractmethod get_log_prob_and_targets(audio_files: List[str], transcripts: List[str]) → (torch.Tensor, torch.Tensor)[source]

Get the log-probabilities, their lengths, and the encoded targets from audio files and transcripts.

Parameters:

audio_files (List[str], the input audio file paths.)
transcripts (List[str], the input transcripts.)

Returns:

torch.Tensor (the log-probabilities over the tokens.)
torch.Tensor (the lengths of the log-probabilities.)
list (the encoded targets.)

align(log_prob: Tensor, log_prob_len: Tensor, targets: List[List[int]]) → List[List[int]][source]

Align targets to log_probs.

Parameters:

log_prob (torch.Tensor) – A tensor of shape (N, T, C) containing the log-probabilities. Please make sure that index 0 of the C dimension corresponds to the blank symbol.
log_prob_len (torch.Tensor) – A tensor of shape (N,) containing the lengths of the log_probs. This is needed because the log_probs may have been padded. All elements in this tensor must be integers and <= T.
targets (list) – A list of lists of integers containing the targets. Note that the targets should not contain the blank symbol. The blank symbol is assumed to be index 0 in log_prob.

Returns:

alignments

Return type:

List[List[int]], containing the alignments.
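The constraints on targets listed above can be checked before calling align. The following is a minimal sketch and not part of the library: check_targets is a hypothetical helper, and it takes the lengths as plain integers (for example from log_prob_len.tolist()). Besides the blank check, it applies the usual CTC requirement, assuming the standard CTC topology, that a target needs at least one frame per token plus one extra frame between adjacent identical tokens.

```python
from typing import List

BLANK = 0  # blank is assumed to be index 0, as required by align


def check_targets(targets: List[List[int]], lengths: List[int]) -> None:
    """Raise ValueError if a target cannot be aligned under standard CTC rules."""
    if len(targets) != len(lengths):
        raise ValueError("targets and lengths must have the same batch size")
    for i, (target, num_frames) in enumerate(zip(targets, lengths)):
        if BLANK in target:
            raise ValueError(f"target {i} contains the blank symbol")
        # Adjacent identical tokens need a blank frame between them.
        repeats = sum(a == b for a, b in zip(target, target[1:]))
        if len(target) + repeats > num_frames:
            raise ValueError(f"target {i} is too long for {num_frames} frames")


check_targets([[2, 3, 4], [2, 2]], lengths=[5, 3])  # passes silently
```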

align_batch(audio_files: List[str], transcripts: List[str]) → List[List[int]][source]

Align a batch of audio files to their transcripts.

Parameters:

audio_files (List[str], the input audio file paths.)
transcripts (List[str], the input transcripts.)

Return type:

List[List[int]], the alignments.

get_word_alignment(alignments: List[List[int]], transcripts: List[str]) → List[List[Tuple[int, int, str]]][source]

Get word alignment from character alignment.

Parameters:

alignments (List[List[int]], the character alignments.)
transcripts (List[str], the input transcripts.)

Returns:

List[List[Tuple[int, int, str]]], the word alignments.
Each tuple contains the start (inclusive) and end (inclusive) frame index of the word, and the word itself.
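One way to derive word spans from a character alignment is sketched below. It is not necessarily how get_word_alignment is implemented: it assumes a character-level vocabulary with a dedicated space token, and the ids BLANK = 0 and SPACE = 1 as well as the helper name word_spans are hypothetical. A word starts at the first frame carrying one of its characters and ends at the last such frame; frames labelled with the space token close the current word.

```python
from typing import List, Optional, Tuple

BLANK, SPACE = 0, 1  # hypothetical token ids; blank must be index 0


def word_spans(alignment: List[int], transcript: str) -> List[Tuple[int, int, str]]:
    """Return (start_frame, end_frame, word) with both frames inclusive."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    end = 0
    for frame, token in enumerate(alignment):
        if token == SPACE:
            if start is not None:  # a space frame closes the open word
                spans.append((start, end))
                start = None
        elif token != BLANK:
            if start is None:
                start = frame      # first character frame of a new word
            end = frame            # last character frame seen so far
    if start is not None:
        spans.append((start, end))
    words = transcript.split()
    if len(spans) != len(words):
        raise ValueError("alignment and transcript disagree on the word count")
    return [(s, e, w) for (s, e), w in zip(spans, words)]


# Ids: 2 = 'a', 3 = 'b'; the transcript "ab ba" spans 11 frames.
spans = word_spans([0, 2, 2, 0, 3, 1, 1, 3, 0, 2, 0], "ab ba")
# [(1, 4, 'ab'), (7, 9, 'ba')]
```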

align_audio_to_tokens(audio_file: str, transcript: str) → List[int][source]

Align audio to tokens.

Parameters:

audio_file (str, the input audio file path.)
transcript (str, the input transcript.)

Returns:

alignment – Note that the length of the alignments is the same as the number of frames in the audio file, i.e., the length of the output of the NN model.

Return type:

List[int], the token-level alignments for the audio file.

align_audio_to_words(audio_file: str, transcript: str, frame_shift: float = 0.02) → List[Tuple[int, int, str]][source]

Align audio to words.

Parameters:

audio_file (str, the input audio file path.)
transcript (str, the input transcript.)
frame_shift (float, the frame shift in seconds, default to 0.02.)

Returns:

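alignment – Each tuple contains the start (inclusive) and end (inclusive) of the word, and the word itself. Start and end are in seconds when frame_shift is set, as in the CTCAligner example, and in frames when frame_shift is None.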

Return type:

List[Tuple[int, int, str]], the word-level alignments for the audio file.

align_batch_to_tokens(audio_files: List[str], transcripts: List[str]) → List[List[int]][source]

Align a batch of audio files to tokens.

Parameters:

audio_files (List[str], the input audio files.)
transcripts (List[str], the input transcripts.)

Returns:

alignments – Note that the length of the alignments is the same as the number of frames in the audio file, i.e., the length of the output of the NN model.

Return type:

List[List[int]], the token-level alignments for the audio files.

align_batch_to_words(audio_files: List[str], transcripts: List[str], frame_shift: float = 0.02) → List[List[Tuple[int, int, str]]][source]

Align a batch of audio files to words.

Parameters:

audio_files (List[str], the input audio files.)
transcripts (List[str], the input transcripts.)
frame_shift (float, the frame shift in seconds, default to 0.02.)

Returns:

alignments (List[List[Tuple[int, int, str]]], the word-level alignments for the audio files.) – Each tuple contains the start (inclusive) and end (inclusive) of the word, and the word itself.
Note that the batch size should be small enough to fit into the GPU memory.
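When a list of files is too large for a single batch, it can be split into smaller batches before alignment. The sketch below is a plain-Python illustration; iter_batches is a hypothetical helper, not part of the module, and the file names are made up.

```python
from typing import Iterator, List, Tuple


def iter_batches(
    audio_files: List[str], transcripts: List[str], batch_size: int = 4
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (audio_files, transcripts) slices of at most batch_size items."""
    if len(audio_files) != len(transcripts):
        raise ValueError("audio_files and transcripts must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(audio_files), batch_size):
        stop = start + batch_size
        yield audio_files[start:stop], transcripts[start:stop]


files = [f"utt{i}.wav" for i in range(5)]
texts = [f"TEXT {i}" for i in range(5)]
batches = list(iter_batches(files, texts, batch_size=2))
# Each pair could then be passed on, e.g. aligner.align_batch_to_words(*pair).
```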

align_csv_to_tokens(input_csv: str, output_file: str, batch_size: int = 4)[source]

Align all the audio files in the input_csv and write the token alignments to output_file. The output file will have the format: <audio id> <token alignment>

Parameters:

input_csv (str, the input csv file.)
output_file (str, the output file.)
batch_size (int, the batch size, default 4.)

align_csv_to_words(input_csv: str, output_csv: str, batch_size: int = 4, frame_shift: float = 0.02)[source]

Align all the audio files in the input_csv and write the word alignments to output_csv. The output file will have the format: <audio id> <word> <start> <end>

Parameters:

input_csv (str, the input csv file.)
output_csv (str, the output csv file.)
batch_size (int, the batch size, default 4.)
frame_shift (float, the frame shift in seconds at the output end of the NN model, default 0.02.)
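The word alignments written by this method can be read back and grouped per audio ID. The sketch below assumes that the output is a comma-separated file with a header row naming the columns ID, word, start, and end; the exact layout is an assumption, the sample content is made up, and group_word_alignments is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple


def group_word_alignments(handle: TextIO) -> Dict[str, List[Tuple[float, float, str]]]:
    """Group (start, end, word) tuples by audio ID, keeping file order."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["ID"]].append(
            (float(row["start"]), float(row["end"]), row["word"])
        )
    return dict(grouped)


# Made-up sample standing in for a file written by align_csv_to_words.
sample = io.StringIO(
    "ID,word,start,end\n"
    "utt1,HELLO,0.06,0.2\n"
    "utt1,WORLD,0.22,0.32\n"
    "utt2,HELLO,0.1,0.3\n"
)
alignments = group_word_alignments(sample)
```

For a file on disk, pass open(path, newline="") instead of the in-memory sample.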

class speechbrain.integrations.k2_fsa.align.CTCAligner(model: Module, tokenizer: CTCTextEncoder, device: device = device(type='cpu'))[source]

Bases: Aligner

Aligner class for CTC models. There are six methods designed to be applied by users directly:

align_audio_to_tokens

align_audio_to_words

align_batch_to_tokens

align_batch_to_words

align_csv_to_tokens

align_csv_to_words

For more details, please refer to the documentation of each method.

Parameters:

model (torch.nn.Module, the model applied for alignment.)
tokenizer (sb.dataio.encoder.CTCTextEncoder, the tokenizer used for encoding the text.)
device (torch.device, the device to run the model on, default torch.device("cpu").)

Example

>>> import torch
>>> from speechbrain.inference import EncoderASR
>>> from speechbrain.integrations.k2_fsa.align import CTCAligner
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-librispeech",
...     savedir="pretrained_models/asr-wav2vec2-librispeech",
... )
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> aligner = CTCAligner(
...     model=asr_model, tokenizer=asr_model.tokenizer, device=device
... )
>>> audio_files = ["tests/samples/ASR/spk1_snt1.wav"]
>>> transcripts = ["THE CHILD ALMOST HURT THE SMALL DOG"]
>>> # align one audio file to tokens
>>> # alignment = aligner.align_audio_to_tokens(audio_files[0], transcripts[0])
>>> # align one audio file to words
>>> alignment = aligner.align_audio_to_words(
...     audio_files[0], transcripts[0], frame_shift=0.02
... )
>>> alignment
[(0.04, 0.1, 'THE'), (0.26, 0.6, 'CHILD'), (0.84, 1.18, 'ALMOST'), (1.380..., 1.58, 'HURT'), (1.84, 1.880..., 'THE'), (2.04, 2.32, 'SMALL'), (2.46, 2.72, 'DOG')]
>>> # align a batch of audio files to tokens
>>> # alignments = aligner.align_batch_to_tokens(audio_files, transcripts)
>>> # align a batch of audio files to words
>>> # alignments = aligner.align_batch_to_words(audio_files, transcripts, frame_shift=0.02)
>>> # align a csv file to tokens
>>> # aligner.align_csv_to_tokens("samples/audio_samples/example.csv", "samples/audio_samples/example_token_alignment.txt")
>>> # align a csv file to words
>>> # aligner.align_csv_to_words("samples/audio_samples/example.csv", "samples/audio_samples/example_word_alignment.csv", frame_shift=0.02)

encode_texts(texts: List[str]) → List[List[int]][source]

Encode texts to list of tokens.

Parameters: texts (List[str], the texts to be encoded.)
Return type: List[List[int]], the encoded texts.

Note

This method is specific to the tokeniser used in the model. In this case, we use the CTCTextEncoder.

get_log_prob_and_targets(audio_files: List[str], transcripts: List[str]) → (torch.Tensor, torch.Tensor)[source]

Get the log-probabilities, their lengths, and the encoded targets from audio files and transcripts.

Parameters:

audio_files (List[str], the input audio file paths.)
transcripts (List[str], the input transcripts.)

Returns:

torch.Tensor (the log-probabilities over the tokens.)
torch.Tensor (the lengths of the log-probabilities.)
list (the encoded targets.)