speechbrain.integrations.k2_fsa.align module
Force alignment using k2 for CTC models. This module provides an abstract class, Aligner, for force alignment using k2. It also provides a concrete class, CTCAligner, which performs force alignment with a pre-trained CTC model and a tokeniser (CTCTextEncoder). Note that the blank symbol must be index 0 in the tokeniser’s vocabulary.
Users can simply mimic the usage of CTCAligner to implement their own aligner. There are two methods in the Aligner class that users need to implement:
- encode_texts: encode texts (List[str]) to a list of lists of token indexes (List[List[int]]).
- get_log_prob_and_targets: get log-probabilities (torch.Tensor), its length (torch.Tensor) and targets (List[List[int]]) from audio files and transcripts.
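As a rough illustration of what encode_texts is expected to return, the sketch below encodes texts with a toy character vocabulary whose blank symbol sits at index 0. The vocabulary, the `<unk>` fallback and the standalone function are assumptions made for illustration; a real subclass would delegate to its own tokeniser.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; index 0 is reserved for the blank symbol.
VOCAB: List[str] = ["<blank>", "<unk>", "a", "b", "c"]
TOKEN_TO_INDEX: Dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def encode_texts(texts: List[str]) -> List[List[int]]:
    """Map each text to a list of token indexes, one index per character."""
    unk = TOKEN_TO_INDEX["<unk>"]
    # Characters missing from the vocabulary fall back to <unk>.
    return [[TOKEN_TO_INDEX.get(ch, unk) for ch in text] for text in texts]


encoded = encode_texts(["abc", "cab", "ad"])
```

The encoded targets never contain index 0, which matches the requirement that targets exclude the blank symbol.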
The align method is implemented in the Aligner class, so users do not need to implement it. We support three different ways of conducting force alignment:
- One audio file and one transcript at a time.
- A batch of audio files and transcripts.
- A csv file containing the audio file paths and transcripts. In this case, the csv file should follow the standard speechbrain csv format with a header line as follows: ID, duration, wav, spk_id, wrd

Each of these is available at two different levels (tokens and words).
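A minimal sketch of preparing such a csv file with the standard library is shown below. The rows, paths, durations and speaker IDs are placeholders, and the sketch assumes the columns are comma-separated without surrounding spaces.

```python
import csv
import io

FIELDS = ["ID", "duration", "wav", "spk_id", "wrd"]

# Hypothetical rows; every value here is a placeholder.
rows = [
    {"ID": "utt1", "duration": "2.87", "wav": "audio/utt1.wav",
     "spk_id": "spk1", "wrd": "HELLO WORLD"},
    {"ID": "utt2", "duration": "3.10", "wav": "audio/utt2.wav",
     "spk_id": "spk2", "wrd": "HELLO SPEECHBRAIN"},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the file back and keep the columns an aligner would need.
buffer.seek(0)
pairs = [(r["ID"], r["wav"], r["wrd"]) for r in csv.DictReader(buffer)]
```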
When token-level alignment is conducted, for one single audio file or a batch of audio files, the aligning method will return a list of lists of integers, where each integer represents the index of the token in the tokeniser’s vocabulary. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], then the returned list of lists of integers may look like [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]. For a csv file input, the aligning method will return a dictionary (Dict[str, List[int]]), where the keys are the IDs of the audio files and the values are the lists of token indexes.
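Since a token-level alignment holds one token index per frame, it can be collapsed into per-token frame spans. The helper below is a hypothetical post-processing sketch, not part of the module; it assumes the blank index is 0 and reports inclusive frame ranges.

```python
from itertools import groupby
from typing import List, Tuple


def collapse_alignment(
    alignment: List[int], blank: int = 0
) -> List[Tuple[int, int, int]]:
    """Return (token, start_frame, end_frame) for every non-blank run."""
    segments = []
    frame = 0
    for token, run in groupby(alignment):
        length = sum(1 for _ in run)
        if token != blank:
            # Both frame indexes are inclusive.
            segments.append((token, frame, frame + length - 1))
        frame += length
    return segments


segments = collapse_alignment([0, 2, 2, 0, 3, 3, 3, 0, 4])
```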
When word-level alignment is conducted, for one single audio file or a batch of audio files, the aligning method will return a list of lists of tuples, where each tuple represents (start_frame (int, inclusive), end_frame (int, inclusive), word (str)). For example, if the transcript is ‘hello world’, and there are 20 frames in the audio file, then the returned list of lists of tuples may look like [[(3, 10, ‘hello’), (11, 16, ‘world’)]]. For a csv file input, the aligning method will return a pandas.DataFrame with the columns [‘ID’, ‘word’, ‘start’, ‘end’], where start and end are in seconds. However, if the frame_shift passed to align_csv_to_words is None, then start and end will be in frames.
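The relationship between frames and seconds can be sketched as follows. This assumes a time is simply the frame index multiplied by frame_shift; the helper name and the rounding are additions for readability and are not taken from the module.

```python
from typing import List, Optional, Tuple, Union

Number = Union[int, float]


def frames_to_seconds(
    words: List[Tuple[int, int, str]], frame_shift: Optional[float] = 0.02
) -> List[Tuple[Number, Number, str]]:
    """Convert (start_frame, end_frame, word) tuples to seconds."""
    if frame_shift is None:
        # No frame shift given: keep the frame indexes unchanged.
        return list(words)
    return [
        (round(start * frame_shift, 4), round(end * frame_shift, 4), word)
        for start, end, word in words
    ]


in_seconds = frames_to_seconds([(3, 10, "hello"), (11, 16, "world")])
in_frames = frames_to_seconds([(3, 10, "hello")], frame_shift=None)
```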
- Author:
Zeyu Zhao 2024
Summary
Classes:
Aligner | Abstract class for aligner.
CTCAligner | Aligner class for CTC models. There are six methods designed to be applied by users directly: align_audio_to_tokens, align_audio_to_words, align_batch_to_tokens, align_batch_to_words, align_csv_to_tokens, align_csv_to_words. For more details, please refer to the documentation of each method.
Reference
- class speechbrain.integrations.k2_fsa.align.Aligner
Bases: ABC
Abstract class for aligner.
- To implement your own aligner, you need to implement two methods:
  - encode_texts: encode texts (List[str]) to a list of lists of token indexes (List[List[int]]).
  - get_log_prob_and_targets: get log-probabilities (torch.Tensor), its length (torch.Tensor) and targets (List[List[int]])
The align method is implemented in the Aligner class, so users do not need to implement it. We support three different ways of conducting force alignment:
- One audio file and one transcript at a time.
- A batch of audio files and transcripts.
- A csv file containing the audio file paths and transcripts.
When token-level alignment is conducted, for one single audio file, the aligning method will return a list of integers, where each integer represents the index of the token in the tokeniser’s vocabulary. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], then the returned list of integers may look like [0, 1, 2, 3, 4].
For a batch of audio files, the aligning method will return a list of lists of integers, where each integer represents the index of the token in the tokeniser’s vocabulary. For example, if the tokeniser’s vocabulary is [‘<blank>’, ‘<unk>’, ‘a’, ‘b’, ‘c’], then the returned list of lists of integers may look like [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]].
For an input of csv file, the aligning method will return a dictionary (Dict[str, List[int]]), where the keys are the IDs of the audio files and the values are the list of token indexes.
When word-level alignment is conducted, for one single audio file, the aligning method will return a list of tuples, where each tuple represents (start_frame (int, inclusive), end_frame (int, inclusive), word (str)). For example, if the transcript is ‘hello world’, and there are 20 frames in the audio file, then the returned list of tuples may look like [(3, 10, ‘hello’), (11, 16, ‘world’)]. If the frame_shift passed to align_csv_to_words is None, then the start and end will be in frames; otherwise they will be in seconds.
For a batch of audio files, the aligning method will return a list of lists of tuples, where each tuple represents (start_frame (int, inclusive), end_frame (int, inclusive), word (str)). For example, if the transcripts are [‘hello world’, ‘hello speechbrain’], and there are 20 frames in each audio file, then the returned list of lists of tuples may look like [[(3, 10, ‘hello’), (11, 16, ‘world’)], [(3, 10, ‘hello’), (11, 20, ‘speechbrain’)]].
For a csv file input, the aligning method will return nothing but save the alignment results to a csv file. The columns of the csv file are [‘ID’, ‘word’, ‘start’, ‘end’]. The start and end are in seconds if frame_shift is not None; otherwise they are in frames.
- abstractmethod encode_texts(texts: List[str]) -> List[List[int]]
Encode texts to list of tokens.
- abstractmethod get_log_prob_and_targets(audio_files: List[str], transcripts: List[str]) -> (torch.Tensor, torch.Tensor)
Align transcripts to input_speech.
- align(log_prob: Tensor, log_prob_len: Tensor, targets: List[List[int]]) -> List[List[int]]
Align targets to log_probs.
- Parameters:
log_prob (torch.Tensor) – A tensor of shape (N, T, C) containing the log-probabilities. Please make sure that index 0 of the C dimension corresponds to the blank symbol.
log_prob_len (torch.Tensor) – A tensor of shape (N,) containing the lengths of the log_probs. This is needed because the log_probs may have been padded. All elements in this tensor must be integers and <= T.
targets (list) – A list of lists of integers containing the targets. Note that the targets should not contain the blank symbol. The blank symbol is assumed to be index 0 in log_prob.
- Returns:
alignments
- Return type:
List[List[int]], containing the alignments.
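To make the input and output of align concrete, the sketch below computes a CTC forced alignment for a single utterance with a plain Viterbi search in pure Python. It is not the k2-based procedure this module uses; the function name and the list-of-lists input are assumptions for illustration. It takes per-frame log-probabilities with the blank at index 0 and a target without blanks, and returns one token index per frame.

```python
import math
from typing import List


def ctc_forced_align(
    log_prob: List[List[float]], target: List[int], blank: int = 0
) -> List[int]:
    """Return one token index per frame for the best CTC path through target."""
    # Interleave blanks: [blank, y1, blank, y2, ..., yL, blank].
    states = [blank]
    for token in target:
        states += [token, blank]
    num_frames, num_states = len(log_prob), len(states)

    neg_inf = -math.inf
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_prob[0][states[0]]
    if num_states > 1:
        score[0][1] = log_prob[0][states[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different tokens.
            preds = [s]
            if s >= 1:
                preds.append(s - 1)
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                preds.append(s - 2)
            best = max(preds, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state s cannot be reached at frame t
            score[t][s] = score[t - 1][best] + log_prob[t][states[s]]
            back[t][s] = best

    # A path must end in the last token or in the trailing blank.
    finals = [num_states - 1] + ([num_states - 2] if num_states > 1 else [])
    s = max(finals, key=lambda p: score[num_frames - 1][p])
    if score[num_frames - 1][s] == neg_inf:
        raise ValueError("not enough frames to emit every target token")

    # Follow the back-pointers from the last frame to the first.
    path = [s]
    for t in range(num_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return [states[s] for s in path]


# Toy input: 4 frames, 3 classes (0 = blank), target tokens [1, 2].
probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
log_prob = [[math.log(p) for p in frame] for frame in probs]
frames = ctc_forced_align(log_prob, [1, 2])
```

The returned list has the same length as the number of frames, which mirrors the shape of the alignments described above.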
- align_batch(audio_files: List[str], transcripts: List[str]) -> List[List[int]]
Align targets to log_probs.
- get_word_alignment(alignments: List[List[int]], transcripts: List[str]) -> List[List[Tuple[int, int, str]]]
Get word alignment from character alignment.
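One way to derive word spans from a character-level alignment is sketched below. It is a guess at the general idea rather than the method's actual code, and it assumes that every character of the transcript, including the space between words, is exactly one token, that words are separated by single spaces, and that the blank index is 0.

```python
from itertools import groupby
from typing import List, Tuple


def word_alignment(
    alignment: List[int], transcript: str, blank: int = 0
) -> List[Tuple[int, int, str]]:
    """Return (start_frame, end_frame, word) with inclusive frame indexes."""
    # Inclusive frame span of every non-blank run, in order of appearance.
    spans = []
    frame = 0
    for token, run in groupby(alignment):
        length = sum(1 for _ in run)
        if token != blank:
            spans.append((frame, frame + length - 1))
        frame += length

    if len(spans) != len(transcript):
        raise ValueError("expected one aligned run per transcript character")

    words = []
    pos = 0
    for word in transcript.split(" "):
        if word:
            chars = spans[pos:pos + len(word)]
            words.append((chars[0][0], chars[-1][1], word))
        pos += len(word) + 1  # also step over the space token
    return words


# Hypothetical vocabulary: 0 = blank, 1 = space, 2 = h, 3 = i, 4 = y, 5 = o.
result = word_alignment([0, 2, 2, 3, 0, 1, 4, 4, 5, 0], "hi yo")
```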
- align_audio_to_tokens(audio_file: str, transcript: str) -> List[int]
Align audio to tokens.
- Parameters:
- Returns:
alignment – Note that the length of the alignments is the same as the number of frames in the audio file, i.e., the length of the output of the NN model.
- Return type:
List[int], the token-level alignments for the audio file.
- align_audio_to_words(audio_file: str, transcript: str, frame_shift: float = 0.02) -> List[Tuple[int, int, str]]
Align audio to words.
- align_batch_to_tokens(audio_files: List[str], transcripts: List[str]) -> List[List[int]]
Align a batch of audio files to tokens.
- Parameters:
- Returns:
alignments – Note that the length of the alignments is the same as the number of frames in the audio file, i.e., the length of the output of the NN model.
- Return type:
List[List[int]], the token-level alignments for the audio files.
- align_batch_to_words(audio_files: List[str], transcripts: List[str], frame_shift: float = 0.02) -> List[List[Tuple[int, int, str]]]
Align a batch of audio files to words.
- Parameters:
- Returns:
alignments (List[List[Tuple[int, int, str]]], the word-level alignments for the audio files.) – Each tuple contains the start (inclusive) and end (inclusive) frame index of the word, and the word itself.
Note that the batch size should be small enough to fit into the GPU memory.
- align_csv_to_tokens(input_csv: str, output_file: str, batch_size: int = 4)
Align all the audio files in the input_csv and write the token alignments to output_file. The output file will have the format: <audio id> <token alignment>
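The batching and the output format can be pictured with the sketch below. Both helpers are hypothetical, and writing the token indexes separated by single spaces is an assumption, since the exact separator is not stated here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int = 4) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def format_line(audio_id: str, alignment: List[int]) -> str:
    """Render one output line as '<audio id> <token alignment>'."""
    return audio_id + " " + " ".join(str(token) for token in alignment)


# Hypothetical utterance IDs, grouped with the default batch size of 4.
ids = ["utt1", "utt2", "utt3", "utt4", "utt5", "utt6"]
grouped = [list(batch) for batch in batches(ids, batch_size=4)]
line = format_line("utt1", [0, 2, 2, 0, 3])
```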
- class speechbrain.integrations.k2_fsa.align.CTCAligner(model: Module, tokenizer: CTCTextEncoder, device: device = device(type='cpu'))
Bases: Aligner
Aligner class for CTC models. There are six methods designed to be applied by users directly:
- align_audio_to_tokens
- align_audio_to_words
- align_batch_to_tokens
- align_batch_to_words
- align_csv_to_tokens
- align_csv_to_words
For more details, please refer to the documentation of each method.
- Parameters:
model (torch.nn.Module) – The model applied for alignment.
tokenizer (sb.dataio.encoder.CTCTextEncoder) – The tokenizer used for encoding the text.
device (torch.device) – The device to run the model on, default torch.device("cpu").
Example
>>> import torch
>>> from speechbrain.inference import EncoderASR
>>> from speechbrain.integrations.k2_fsa.align import CTCAligner
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-librispeech",
...     savedir="pretrained_models/asr-wav2vec2-librispeech",
... )
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> aligner = CTCAligner(
...     model=asr_model, tokenizer=asr_model.tokenizer, device=device
... )
>>> audio_files = ["tests/samples/ASR/spk1_snt1.wav"]
>>> transcripts = ["THE CHILD ALMOST HURT THE SMALL DOG"]
>>> # align one audio file to tokens
>>> # alignment = aligner.align_audio_to_tokens(audio_files[0], transcripts[0])
>>> # align one audio file to words
>>> alignment = aligner.align_audio_to_words(
...     audio_files[0], transcripts[0], frame_shift=0.02
... )
>>> alignment
[(0.04, 0.1, 'THE'), (0.26, 0.6, 'CHILD'), (0.84, 1.18, 'ALMOST'), (1.380..., 1.58, 'HURT'), (1.84, 1.880..., 'THE'), (2.04, 2.32, 'SMALL'), (2.46, 2.72, 'DOG')]
>>> # align a batch of audio files to tokens
>>> # alignments = aligner.align_batch_to_tokens(audio_files, transcripts)
>>> # align a batch of audio files to words
>>> # alignments = aligner.align_batch_to_words(audio_files, transcripts, frame_shift=0.02)
>>> # align a csv file to tokens
>>> # aligner.align_csv_to_tokens("samples/audio_samples/example.csv", "samples/audio_samples/example_token_alignment.txt")
>>> # align a csv file to words
>>> # aligner.align_csv_to_words("samples/audio_samples/example.csv", "samples/audio_samples/example_word_alignment.csv", frame_shift=0.02)
- encode_texts(texts: List[str]) -> List[List[int]]
Encode texts to list of tokens.
- Parameters:
texts (List[str]) – The texts to be encoded.
- Return type:
List[List[int]], the encoded texts.
Note
This method is specific to the tokeniser used in the model. In this case, we use the CTCTextEncoder.