Perform CTC segmentation to align utterances within audio files.
This uses the ctc-segmentation Python package. Install it with pip or see the installation instructions at https://github.com/lumaku/ctc-segmentation
Align text to audio using CTC segmentation.
Task object for CTC segmentation.
- class speechbrain.alignment.ctc_segmentation.CTCSegmentationTask
Task object for CTC segmentation.
This object is automatically generated and acts as a container for results of a CTCSegmentation object.
When formatted with str(·), this object returns the results in kaldi-style segments file format. The human-readable output can be configured with the printing options.
- Utterance texts, one per line, without the leading utterance name that kaldi-style text has.
Ground truth matrix (CTC segmentation).
Utterance separator for the Ground truth matrix.
Time marks of the corresponding chars.
Estimated alignment of chars/tokens.
Calculated segments as: (start, end, confidence score).
CTC Segmentation configuration object.
Name of aligned audio file (Optional). If given, name is considered when generating the text. Default: “utt”.
The list of utterance names (Optional). This list should have the same length as the number of utterances.
CTC posterior log probabilities (Optional).
Properties for printing
Include the confidence score. Default: True.
Include utterance text. Default: True.
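As an illustration of the output format and the two printing options described above, here is a minimal sketch of how (start, end, confidence score) segments could be rendered as kaldi-style segments lines. The function name format_segments, the numbering scheme for generated utterance names, and the number formats are assumptions for illustration, not the class's actual implementation.

```python
def format_segments(segments, texts, name="utt", utt_ids=None,
                    print_confidence_score=True, print_utterance_text=True):
    """Render segments as kaldi-style lines (hypothetical helper).

    Each line is: <utterance name> <file name> <start> <end> [score] [text]
    """
    if utt_ids is None:
        # Assumed naming scheme: file name plus a running index.
        utt_ids = [f"{name}_{i:04d}" for i in range(len(segments))]
    lines = []
    for utt_id, (start, end, score), text in zip(utt_ids, segments, texts):
        fields = [utt_id, name, f"{start:.2f}", f"{end:.2f}"]
        if print_confidence_score:
            fields.append(f"{score:.4f}")
        if print_utterance_text:
            fields.append(text)
        lines.append(" ".join(fields))
    return "".join(line + "\n" for line in lines)
```

Switching off both printing options leaves only the four fields of a plain kaldi segments file.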
- text = None
- ground_truth_mat = None
- utt_begin_indices = None
- timings = None
- char_probs = None
- state_list = None
- segments = None
- config = None
- done = False
- name = 'utt'
- utt_ids = None
- lpz = None
- print_confidence_score = True
- print_utterance_text = True
Update object attributes.
Return a kaldi-style
- class speechbrain.alignment.ctc_segmentation.CTCSegmentation(asr_model: EncoderASR | EncoderDecoderASR, kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)
Align text to audio using CTC segmentation.
Initialize with given ASR model and parameters. If needed, parameters for CTC segmentation can be set with
set_config(·). Then call the instance as function to align text within an audio file.
- param asr_model:
SpeechBrain ASR interface. This requires a model that has a trained CTC layer for inference. Models with single-character tokens give a better time resolution. Please note that the inference complexity of Transformer models usually increases quadratically with audio length. It is therefore recommended to use RNN-based models, if available.
- type asr_model:
- param kaldi_style_text:
A kaldi-style text file includes the name of the utterance at the start of each line. If True, the utterance name is expected as the first word of each line. If False, utterance names are generated automatically. Set this option according to your input data. Default: True.
- type kaldi_style_text:
- param text_converter:
How CTC segmentation handles text. “tokenize”: Use the ASR model tokenizer to tokenize the text. “classic”: The text is preprocessed into text pieces, which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.
- type text_converter:
- param time_stamps:
Method used to calculate the time stamps. Both “fixed” and “auto” use the sample rate; they differ in the ratio of samples to one frame, which is either determined automatically for each inference (“auto”) or fixed at a ratio that the module determines initially (“fixed”) and that can be changed via the parameter
samples_to_frames_ratio. Recommended for longer audio files: “auto”.
- type time_stamps:
- param **ctc_segmentation_args:
Parameters for CTC segmentation. The full list of parameters is found in
>>> # using example file included in the SpeechBrain repository
>>> from speechbrain.pretrained import EncoderDecoderASR
>>> from speechbrain.alignment.ctc_segmentation import CTCSegmentation
>>> # load an ASR model
>>> pre_trained = "speechbrain/asr-transformer-transformerlm-librispeech"
>>> asr_model = EncoderDecoderASR.from_hparams(source=pre_trained)
>>> aligner = CTCSegmentation(asr_model, kaldi_style_text=False)
>>> # load data
>>> audio_path = "tests/samples/single-mic/example1.wav"
>>> text = ["THE BIRCH CANOE", "SLID ON THE", "SMOOTH PLANKS"]
>>> segments = aligner(audio_path, text, name="example1")
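The example above passes plain utterance texts with kaldi_style_text=False. To make the difference to kaldi-style input concrete, the following sketch separates utterance names from texts in both modes. The helper split_utterances and the pattern used for generated names are hypothetical and only illustrate the idea; the module's own parsing may differ.

```python
def split_utterances(text, kaldi_style_text=True, name="utt"):
    """Split input text into utterance names and utterance texts.

    Hypothetical helper: accepts a list of lines or a multiline string.
    """
    lines = text.splitlines() if isinstance(text, str) else list(text)
    lines = [line.strip() for line in lines if line.strip()]
    if kaldi_style_text:
        # The first word of each line is the utterance name.
        pairs = [line.split(maxsplit=1) for line in lines]
        utt_ids = [pair[0] for pair in pairs]
        texts = [pair[1] if len(pair) > 1 else "" for pair in pairs]
    else:
        # Assumed naming scheme for generated utterance names.
        utt_ids = [f"{name}_{i:04d}" for i in range(len(lines))]
        texts = lines
    return utt_ids, texts
```

With kaldi-style input, a line such as "utt_a THE BIRCH CANOE" yields the name "utt_a" and the text "THE BIRCH CANOE".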
To parallelize the computation with multiprocessing, these three steps can be separated: (1)
get_lpz: obtain the lpz, (2)
prepare_segmentation_task: prepare the task, and (3)
get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
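The three steps above could be wired together roughly as follows. This is a sketch, not the module's own code: align_in_parallel is a hypothetical helper, speech is assumed to be a one-dimensional array of samples so that len(speech) gives the number of sample points, and executor is assumed to be any object with a map method, such as a concurrent.futures.ProcessPoolExecutor. How the returned results are written back into each task is not covered here.

```python
def align_in_parallel(aligner, items, executor):
    """Run steps (1) and (2) locally, then fan out step (3).

    items: iterable of (name, speech, text) tuples (assumed layout).
    executor: any object with a map(fn, iterable) method.
    """
    tasks = []
    for name, speech, text in items:
        # (1) The ASR model is only needed here, in the calling process.
        lpz = aligner.get_lpz(speech)
        # (2) Bundle text, lpz, and configuration into a task object.
        task = aligner.prepare_segmentation_task(
            text, lpz, name=name, speech_len=len(speech)
        )
        tasks.append(task)
    # (3) get_segments is static, so workers do not need the aligner instance.
    results = list(executor.map(type(aligner).get_segments, tasks))
    return tasks, results
```

Keeping the model inference in one process and distributing only the segmentation step means the workers never have to load the ASR model.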
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127
More parameters are described in https://github.com/lumaku/ctc-segmentation
- fs = 16000
- kaldi_style_text = True
- samples_to_frames_ratio = None
- time_stamps = 'auto'
- choices_time_stamps = ['auto', 'fixed']
- text_converter = 'tokenize'
- choices_text_converter = ['tokenize', 'classic']
- warned_about_misconfiguration = False
- config = CtcSegmentationParameters( )
- __init__(asr_model: EncoderASR | EncoderDecoderASR, kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)
Initialize the CTCSegmentation module.
- set_config(time_stamps: str | None = None, fs: int | None = None, samples_to_frames_ratio: float | None = None, set_blank: int | None = None, replace_spaces_with_blanks: bool | None = None, kaldi_style_text: bool | None = None, text_converter: str | None = None, gratis_blank: bool | None = None, min_window_size: int | None = None, max_window_size: int | None = None, scoring_length: int | None = None)
Set CTC segmentation parameters.
Parameters for timing
Selects how the CTC index duration is estimated, and thus how the time stamps are calculated.
Sample rate. Usually derived from ASR model; use this parameter to overwrite the setting.
If you want to set the ratio of samples to CTC frames directly, set this parameter, and set
time_stamps to “fixed”. Note: If you want to calculate the time stamps from a model with fixed subsampling, set this parameter to:
subsampling_factor * frame_duration / 1000.
Parameters for text preparation
Index of blank in token list. Default: 0.
Inserts blanks between words, which is useful for handling long pauses between words. Only used in
text_converter="classic" preprocessing mode. Default: False.
Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.
How CTC segmentation handles text. Set at module initialization.
Parameters for alignment
Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.
Maximum window size. It should not be necessary to change this value.
If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.
Parameters for calculation of confidence score
Block length to calculate confidence score. The default value of 30 should be OK in most cases. 30 corresponds to roughly 1-2s of audio.
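To show what a block length means for the confidence score, here is a minimal sketch that assumes the score of an utterance is the lowest mean of per-frame log-probabilities over any block of scoring_length consecutive frames. Both that scoring rule and the helper name min_mean_score are assumptions for illustration; see the ctc-segmentation package for the actual calculation.

```python
def min_mean_score(frame_log_probs, scoring_length=30):
    """Lowest block-wise mean log-probability (illustrative only)."""
    n = len(frame_log_probs)
    if n == 0:
        return 0.0
    if n <= scoring_length:
        # The utterance is shorter than one block: average everything.
        return sum(frame_log_probs) / n
    # Slide a block of scoring_length frames over the utterance.
    block_means = [
        sum(frame_log_probs[t:t + scoring_length]) / scoring_length
        for t in range(n - scoring_length + 1)
    ]
    return min(block_means)
```

Under this assumption, a short badly aligned stretch pulls the score down even when the rest of the utterance fits well, which a plain average over the whole utterance would hide.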
- get_timing_config(speech_len=None, lpz_len=None)
Obtain parameters to determine time stamps.
Determine the ratio of encoded frames to sample points.
This method helps to determine the time a single encoded frame occupies. Because the sample rate already gives the number of samples per second, only the number of samples per encoded CTC frame is needed. This function estimates it by doing one inference, which is only needed once.
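The arithmetic described above can be worked through with a short sketch. It assumes that the duration of one encoded frame is the samples-per-frame ratio divided by the sample rate; the helper names and the input lengths are hypothetical.

```python
def frame_duration(speech_len, lpz_len, fs=16000):
    """Seconds covered by one encoded CTC frame (illustrative only)."""
    # Ratio of sample points to encoded frames, from one inference.
    samples_to_frames_ratio = speech_len / lpz_len
    return samples_to_frames_ratio / fs


def frame_to_seconds(frame_index, speech_len, lpz_len, fs=16000):
    """Convert a CTC frame index into a time stamp in seconds."""
    return frame_index * frame_duration(speech_len, lpz_len, fs)


# 10 s of 16 kHz audio encoded into 250 frames: 640 samples per frame.
duration = frame_duration(speech_len=160000, lpz_len=250)
time_stamp = frame_to_seconds(100, speech_len=160000, lpz_len=250)
```

Under the same assumption, a model with 0.03 s per frame would cover 240 s with 8000 frames, which is consistent with the "roughly 4 minutes" stated for min_window_size.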
- get_lpz(speech: Tensor | ndarray)
Obtain CTC posterior log probabilities for given speech data.
speech (Union[torch.Tensor, np.ndarray]) – Speech audio input.
Numpy vector with CTC log posterior probabilities.
- Return type:
- prepare_segmentation_task(text, lpz, name=None, speech_len=None)
Preprocess text, and gather text and lpz into a task object.
Text is pre-processed and tokenized depending on configuration. If
speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed to a multiprocessing computation.
It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalent word, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.
The text is tokenized based on the text_converter setting:
The “tokenize” method is more efficient and is the easiest choice for models based on Latin or Cyrillic script that only contain the main chars, [“a”, “b”, …], or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.
The “classic” method improves the accuracy of the alignments for models that contain longer tokens, but at a higher computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word “▁really” will be broken down into
['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
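The breakdown in the example above can be reproduced with a small sketch that keeps every prefix of a word that is itself a token. This only illustrates the idea of partial tokens; the helper partial_tokens and the toy vocabulary are hypothetical, and the actual preprocessing in the ctc-segmentation package is more involved.

```python
def partial_tokens(word, vocabulary):
    """Return all prefixes of `word` that appear in `vocabulary`."""
    return [
        word[:end]
        for end in range(1, len(word) + 1)
        if word[:end] in vocabulary
    ]


# Toy vocabulary (made up); "▁rea" and "▁reall" are deliberately absent.
toy_vocabulary = {"▁", "▁r", "▁re", "▁real", "▁really", "ly", "al"}
pieces = partial_tokens("▁really", toy_vocabulary)
```

Prefixes that are not tokens of the model are skipped, which is why the list in the text above has gaps.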
text (list) – List or multiline-string with utterance ground truths.
lpz (np.ndarray) – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).
name (str) – Audio file name that will be included in the segments output. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.
speech_len (int) – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of speech, and the length of lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.
Task object that can be passed to
CTCSegmentation.get_segments() in order to obtain alignments.
- Return type:
- static get_segments(task: CTCSegmentationTask)
Obtain segments for given utterance texts and CTC log posteriors.
- __call__(speech: Tensor | ndarray | str | Path, text: List[str] | str, name: str | None = None) CTCSegmentationTask
name (str) – Name of the file. Utterance names are derived from it.
Task object with segments. Apply str(·) or print(·) on it to obtain the segments list.
- Return type: