speechbrain.alignment.ctc_segmentation module

Perform CTC segmentation to align utterances within audio files.

This uses the ctc-segmentation Python package. Install it with pip or see the installation instructions at https://github.com/lumaku/ctc-segmentation

Summary

Classes:

CTCSegmentation

Align text to audio using CTC segmentation.

CTCSegmentationTask

Task object for CTC segmentation.

Reference

class speechbrain.alignment.ctc_segmentation.CTCSegmentationTask[source]

Bases: types.SimpleNamespace

Task object for CTC segmentation.

This object is automatically generated and acts as a container for results of a CTCSegmentation object.

When formatted with str(·), this object returns its results in a kaldi-style segments file format. The human-readable output can be configured with the printing options.

text : list

Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).

ground_truth_mat : np.ndarray

Ground truth matrix (CTC segmentation).

utt_begin_indices : np.ndarray

Utterance separator for the ground truth matrix.

timings : np.ndarray

Time marks of the corresponding chars.

state_list : list

Estimated alignment of chars/tokens.

segments : list

Calculated segments as: (start, end, confidence score).

config : CtcSegmentationParameters

CTC segmentation configuration object.

name : str

Name of the aligned audio file (Optional). If given, the name is considered when generating the text. Default: “utt”.

utt_ids : list

The list of utterance names (Optional). This list should have the same length as the number of utterances.

lpz : np.ndarray

CTC posterior log probabilities (Optional).

print_confidence_score : bool

Include the confidence score. Default: True.

print_utterance_text : bool

Include the utterance text. Default: True.

text = None
ground_truth_mat = None
utt_begin_indices = None
timings = None
char_probs = None
state_list = None
segments = None
config = None
done = False
name = 'utt'
utt_ids = None
lpz = None
print_confidence_score = True
print_utterance_text = True
set(**kwargs)[source]

Update object attributes.

__str__()[source]

Return a kaldi-style segments file (string).
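As a minimal sketch of that format (the helper name and column layout here are illustrative; the actual output of __str__ depends on the printing options above):

```python
def format_segment(utt_name, file_name, start, end, score=None, text=None):
    """Render one kaldi-style segments line: <utt> <file> <start> <end>,
    optionally followed by the confidence score and the utterance text."""
    line = f"{utt_name} {file_name} {start:.2f} {end:.2f}"
    if score is not None:
        line += f" {score:.4f}"  # print_confidence_score = True
    if text is not None:
        line += f" {text}"       # print_utterance_text = True
    return line

print(format_segment("example1_0000", "example1", 0.26, 1.73,
                     score=-0.27, text="THE BIRCH CANOE"))
# example1_0000 example1 0.26 1.73 -0.2700 THE BIRCH CANOE
```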

class speechbrain.alignment.ctc_segmentation.CTCSegmentation(asr_model: Union[speechbrain.pretrained.interfaces.EncoderASR, speechbrain.pretrained.interfaces.EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Bases: object

Align text to audio using CTC segmentation.

Initialize with given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as function to align text within an audio file.

Parameters
  • asr_model (EncoderDecoderASR) – Speechbrain ASR interface. This requires a model that has a trained CTC layer for inference. It is better to use a model with single-character tokens to get a better time resolution. Please note that the inference complexity with Transformer models usually increases quadratically with audio length. It is therefore recommended to use RNN-based models, if available.

  • kaldi_style_text (bool) – A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as first word at each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.

  • text_converter (str) – How CTC segmentation handles text. “tokenize”: Use the ASR model tokenizer to tokenize the text. “classic”: The text is preprocessed as text pieces which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.

  • time_stamps (str) – Choose the method by which the time stamps are calculated. Both “fixed” and “auto” use the sample rate; the ratio of samples per encoded frame is either determined automatically for each inference (“auto”) or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio (“fixed”). Recommended for longer audio files: “auto”.

  • **ctc_segmentation_args – Parameters for CTC segmentation. The full list of parameters is found in set_config.

Example

>>> # using example file included in the SpeechBrain repository
>>> from speechbrain.pretrained import EncoderDecoderASR
>>> from speechbrain.alignment.ctc_segmentation import CTCSegmentation
>>> # load an ASR model
>>> pre_trained = "speechbrain/asr-transformer-transformerlm-librispeech"
>>> asr_model = EncoderDecoderASR.from_hparams(source=pre_trained)
>>> aligner = CTCSegmentation(asr_model, kaldi_style_text=False)
>>> # load data
>>> audio_path = "./samples/audio_samples/example1.wav"
>>> text = ["THE BIRCH CANOE", "SLID ON THE", "SMOOTH PLANKS"]
>>> segments = aligner(audio_path, text, name="example1")

To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
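A structural sketch of that split, with a hypothetical stand-in for the model-free step (in real use, steps (1) and (2) call aligner.get_lpz and aligner.prepare_segmentation_task in the main process, and the worker calls the staticmethod CTCSegmentation.get_segments):

```python
from multiprocessing import Pool

def segment_worker(task):
    # Step (3): in real use, `return CTCSegmentation.get_segments(task)`.
    # Because get_segments is a staticmethod, the worker needs no ASR model,
    # only the (picklable) task object. This stand-in just echoes the task.
    return task["name"], task["n_frames"]

if __name__ == "__main__":
    # Steps (1) and (2) need the ASR model and stay in the main process:
    #   lpz = aligner.get_lpz(speech)
    #   task = aligner.prepare_segmentation_task(text, lpz, name=name)
    # Here the prepared tasks are mocked as plain dicts.
    tasks = [{"name": f"utt{i}", "n_frames": 100 * i} for i in (1, 2, 3)]
    with Pool(processes=2) as pool:
        results = pool.map(segment_worker, tasks)
    print(results)  # [('utt1', 100), ('utt2', 200), ('utt3', 300)]
```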

References

CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127

More parameters are described in https://github.com/lumaku/ctc-segmentation

fs = 16000
kaldi_style_text = True
samples_to_frames_ratio = None
time_stamps = 'auto'
choices_time_stamps = ['auto', 'fixed']
text_converter = 'tokenize'
choices_text_converter = ['tokenize', 'classic']
warned_about_misconfiguration = False
config = CtcSegmentationParameters()
__init__(asr_model: Union[speechbrain.pretrained.interfaces.EncoderASR, speechbrain.pretrained.interfaces.EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Initialize the CTCSegmentation module.

set_config(time_stamps: Optional[str] = None, fs: Optional[int] = None, samples_to_frames_ratio: Optional[float] = None, set_blank: Optional[int] = None, replace_spaces_with_blanks: Optional[bool] = None, kaldi_style_text: Optional[bool] = None, text_converter: Optional[str] = None, gratis_blank: Optional[bool] = None, min_window_size: Optional[int] = None, max_window_size: Optional[int] = None, scoring_length: Optional[int] = None)[source]

Set CTC segmentation parameters.

time_stamps : str

Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.

fs : int

Sample rate. Usually derived from the ASR model; use this parameter to overwrite the setting.

samples_to_frames_ratio : float

If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to “fixed”. Note: If you want to calculate the time stamps from a model with fixed subsampling, set this parameter to: subsampling_factor * frame_duration / 1000.

set_blank : int

Index of blank in the token list. Default: 0.

replace_spaces_with_blanks : bool

Inserts blanks between words, which is useful for handling long pauses between words. Only used in the text_converter="classic" preprocessing mode. Default: False.

kaldi_style_text : bool

Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.

text_converter : str

How CTC segmentation handles text. Set at module initialization.

min_window_size : int

Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are farther apart, increase this value, or decrease it for smaller audio files.

max_window_size : int

Maximum window size. It should not be necessary to change this value.

gratis_blank : bool

If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.

scoring_length : int

Block length used to calculate the confidence score. The default value of 30 should be OK in most cases; 30 corresponds to roughly 1-2 s of audio.
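For illustration, a typical reconfiguration might look like the fragment below (assuming aligner is an initialized CTCSegmentation instance, as in the class example above; the parameter names are from the set_config signature, but the chosen values are examples, not recommendations):

```python
aligner.set_config(
    time_stamps="fixed",
    samples_to_frames_ratio=640,  # e.g. a model with fixed subsampling
    gratis_blank=True,            # tolerate unrelated audio between utterances
    min_window_size=16000,        # utterances lie far apart in the audio
)
```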

get_timing_config(speech_len=None, lpz_len=None)[source]

Obtain parameters to determine time stamps.

estimate_samples_to_frames_ratio(speech_len=215040)[source]

Determine the ratio of encoded frames to sample points.

This method helps to determine the time that a single encoded frame occupies. As the sample rate already determines the number of samples per second, only the ratio of samples per encoded CTC frame is needed. This function estimates it by doing one inference, which is only needed once.

Parameters

speech_len (int) – Length of randomly generated speech vector for single inference. Default: 215040.

Returns

Estimated ratio.

Return type

int
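The arithmetic behind the estimate is a single division over one probe inference; the lpz length below is a hypothetical model output, not a value from a real inference:

```python
speech_len = 215040  # samples in the randomly generated probe signal (default)
lpz_len = 336        # CTC frames the (hypothetical) model returned for it
samples_to_frames_ratio = round(speech_len / lpz_len)
print(samples_to_frames_ratio)  # 640
```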

get_lpz(speech: Union[torch.Tensor, numpy.ndarray])[source]

Obtain CTC posterior log probabilities for given speech data.

Parameters

speech (Union[torch.Tensor, np.ndarray]) – Speech audio input.

Returns

Numpy vector with CTC log posterior probabilities.

Return type

np.ndarray

prepare_segmentation_task(text, lpz, name=None, speech_len=None)[source]

Preprocess text, and gather text and lpz into a task object.

Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed into a multiprocessing computation.

It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.

The text is tokenized based on the text_converter setting:

The “tokenize” method is more efficient and is the easiest for models based on Latin or Cyrillic script that only contain the main chars, [“a”, “b”, …], or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.

The “classic” method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word “▁really” will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
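The partial-token scan can be sketched as follows; the toy vocabulary is illustrative (a real vocabulary would be read from the ASR model's tokenizer), and the helper is a hypothetical simplification of the actual preprocessing:

```python
def partial_tokens(word, vocab):
    """Collect the prefixes of `word` that exist as tokens in `vocab`."""
    return [word[:i] for i in range(1, len(word) + 1) if word[:i] in vocab]

# Toy vocabulary; a real one would come from the ASR model's tokenizer.
vocab = {"▁", "▁r", "▁re", "▁real", "▁really", "▁s", "▁so"}
print(partial_tokens("▁really", vocab))
# ['▁', '▁r', '▁re', '▁real', '▁really']
```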

Parameters
  • text (list) – List or multiline-string with utterance ground truths.

  • lpz (np.ndarray) – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).

  • name (str) – Audio file name that will be included in the segments output. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.

  • speech_len (int) – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of the lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.

Returns

Task object that can be passed to CTCSegmentation.get_segments() in order to obtain alignments.

Return type

CTCSegmentationTask

static get_segments(task: speechbrain.alignment.ctc_segmentation.CTCSegmentationTask)[source]

Obtain segments for given utterance texts and CTC log posteriors.

Parameters

task (CTCSegmentationTask) – Task object that contains ground truth and CTC posterior probabilities.

Returns

Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.

Return type

dict

__call__(speech: Union[torch.Tensor, numpy.ndarray, str, pathlib.Path], text: Union[List[str], str], name: Optional[str] = None) speechbrain.alignment.ctc_segmentation.CTCSegmentationTask[source]

Align utterances.

Parameters
  • speech (Union[torch.Tensor, np.ndarray, str, Path]) – Audio file that can be given as path or as array.

  • text (Union[List[str], str]) – List or multiline-string with utterance ground truths. The required formatting depends on the setting kaldi_style_text.

  • name (str) – Name of the file. Utterance names are derived from it.

Returns

Task object with segments. Apply str(·) or print(·) on it to obtain the segments list.

Return type

CTCSegmentationTask