speechbrain.alignment.ctc_segmentation module

Perform CTC segmentation to align utterances within audio files.

This uses the ctc-segmentation Python package. Install it with pip or see the installation instructions at https://github.com/lumaku/ctc-segmentation

Summary

Classes:

CTCSegmentation

Align text to audio using CTC segmentation.

CTCSegmentationTask

Task object for CTC segmentation.

Reference

class speechbrain.alignment.ctc_segmentation.CTCSegmentationTask[source]

Bases: types.SimpleNamespace

Task object for CTC segmentation.

This object is automatically generated and acts as a container for results of a CTCSegmentation object.

When formatted with str(·), this object returns its results in a kaldi-style segments file format. The human-readable output can be configured with the printing options.

text : list

Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).

ground_truth_mat : np.ndarray

Ground truth matrix (CTC segmentation).

utt_begin_indices : np.ndarray

Utterance separator for the ground truth matrix.

timings : np.ndarray

Time marks of the corresponding chars.

state_list : list

Estimated alignment of chars/tokens.

segments : list

Calculated segments as: (start, end, confidence score).

config : CtcSegmentationParameters

CTC segmentation configuration object.

name : str

Name of the aligned audio file (Optional). If given, the name is considered when generating the text. Default: “utt”.

utt_ids : list

The list of utterance names (Optional). This list should have the same length as the number of utterances.

lpz : np.ndarray

CTC posterior log probabilities (Optional).

print_confidence_score : bool

Include the confidence score. Default: True.

print_utterance_text : bool

Include the utterance text. Default: True.

text = None
ground_truth_mat = None
utt_begin_indices = None
timings = None
char_probs = None
state_list = None
segments = None
config = None
done = False
name = 'utt'
utt_ids = None
lpz = None
print_confidence_score = True
print_utterance_text = True
set(**kwargs)[source]

Update object attributes.

__str__()[source]

Return a kaldi-style segments file (string).
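As a minimal sketch of that format (the helper name and column layout here are illustrative; the actual output of __str__ depends on the printing options above):

```python
def format_segment(utt_name, file_name, start, end, score=None, text=None):
    """Render one kaldi-style segments line: <utt> <file> <start> <end>,
    optionally followed by the confidence score and the utterance text."""
    line = f"{utt_name} {file_name} {start:.2f} {end:.2f}"
    if score is not None:
        line += f" {score:.4f}"  # print_confidence_score = True
    if text is not None:
        line += f" {text}"       # print_utterance_text = True
    return line

print(format_segment("example1_0000", "example1", 0.26, 1.73,
                     score=-0.27, text="THE BIRCH CANOE"))
# example1_0000 example1 0.26 1.73 -0.2700 THE BIRCH CANOE
```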

class speechbrain.alignment.ctc_segmentation.CTCSegmentation(asr_model: Union[speechbrain.pretrained.interfaces.EncoderASR, speechbrain.pretrained.interfaces.EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Bases: object

Align text to audio using CTC segmentation.

Initialize with given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as function to align text within an audio file.

Parameters
  • asr_model (EncoderDecoderASR) – Speechbrain ASR interface. This requires a model that has a trained CTC layer for inference. It is better to use a model with single-character tokens to get a better time resolution. Please note that the inference complexity with Transformer models usually increases quadratically with audio length. It is therefore recommended to use RNN-based models, if available.

  • kaldi_style_text (bool) – A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as first word at each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.

  • text_converter (str) – How CTC segmentation handles text. “tokenize”: Use the ASR model tokenizer to tokenize the text. “classic”: The text is preprocessed as text pieces which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.

  • time_stamps (str) – Choose the method by which the time stamps are calculated. Both “fixed” and “auto” use the sample rate; the ratio of samples per encoded frame is either determined automatically for each inference (“auto”) or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio (“fixed”). Recommended for longer audio files: “auto”.

  • **ctc_segmentation_args – Parameters for CTC segmentation. The full list of parameters is found in set_config.

Example

>>> # using example file included in the SpeechBrain repository
>>> from speechbrain.pretrained import EncoderDecoderASR
>>> from speechbrain.alignment.ctc_segmentation import CTCSegmentation
>>> # load an ASR model
>>> pre_trained = "speechbrain/asr-transformer-transformerlm-librispeech"
>>> asr_model = EncoderDecoderASR.from_hparams(source=pre_trained)
>>> aligner = CTCSegmentation(asr_model, kaldi_style_text=False)
>>> # load data
>>> audio_path = "./samples/audio_samples/example1.wav"
>>> text = ["THE BIRCH CANOE", "SLID ON THE", "SMOOTH PLANKS"]
>>> segments = aligner(audio_path, text, name="example1")

To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
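A structural sketch of that split, with a hypothetical stand-in for the model-free step (in real use, steps (1) and (2) call aligner.get_lpz and aligner.prepare_segmentation_task in the main process, and the worker calls the staticmethod CTCSegmentation.get_segments):

```python
from multiprocessing import Pool

def segment_worker(task):
    # Step (3): in real use, `return CTCSegmentation.get_segments(task)`.
    # Because get_segments is a staticmethod, the worker needs no ASR model,
    # only the (picklable) task object. This stand-in just echoes the task.
    return task["name"], task["n_frames"]

if __name__ == "__main__":
    # Steps (1) and (2) need the ASR model and stay in the main process:
    #   lpz = aligner.get_lpz(speech)
    #   task = aligner.prepare_segmentation_task(text, lpz, name=name)
    # Here the prepared tasks are mocked as plain dicts.
    tasks = [{"name": f"utt{i}", "n_frames": 100 * i} for i in (1, 2, 3)]
    with Pool(processes=2) as pool:
        results = pool.map(segment_worker, tasks)
    print(results)  # [('utt1', 100), ('utt2', 200), ('utt3', 300)]
```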

References

CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127

More parameters are described in https://github.com/lumaku/ctc-segmentation

fs = 16000
kaldi_style_text = True
samples_to_frames_ratio = None
time_stamps = 'auto'
choices_time_stamps = ['auto', 'fixed']
text_converter = 'tokenize'
choices_text_converter = ['tokenize', 'classic']
warned_about_misconfiguration = False
config = CtcSegmentationParameters()
__init__(asr_model: Union[speechbrain.pretrained.interfaces.EncoderASR, speechbrain.pretrained.interfaces.EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Initialize the CTCSegmentation module.

set_config(time_stamps: Optional[str] = None, fs: Optional[int] = None, samples_to_frames_ratio: Optional[float] = None, set_blank: Optional[int] = None, replace_spaces_with_blanks: Optional[bool] = None, kaldi_style_text: Optional[bool] = None, text_converter: Optional[str] = None, gratis_blank: Optional[bool] = None, min_window_size: Optional[int] = None, max_window_size: Optional[int] = None, scoring_length: Optional[int] = None)[source]

Set CTC segmentation parameters.

time_stamps : str

Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.

fs : int

Sample rate. Usually derived from the ASR model; use this parameter to overwrite the setting.

samples_to_frames_ratio : float

If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to “fixed”. Note: If you want to calculate the time stamps from a model with fixed subsampling, set this parameter to: subsampling_factor * frame_duration / 1000.

set_blank : int

Index of blank in the token list. Default: 0.

replace_spaces_with_blanks : bool

Inserts blanks between words, which is useful for handling long pauses between words. Only used in the text_converter="classic" preprocessing mode. Default: False.

kaldi_style_text : bool

Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.

text_converter : str

How CTC segmentation handles text. Set at module initialization.

min_window_size : int

Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are farther apart, increase this value, or decrease it for smaller audio files.

max_window_size : int

Maximum window size. It should not be necessary to change this value.

gratis_blank : bool

If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.

scoring_length : int

Block length used to calculate the confidence score. The default value of 30 should be OK in most cases; 30 corresponds to roughly 1-2 s of audio.
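For illustration, a typical reconfiguration might look like the fragment below (assuming aligner is an initialized CTCSegmentation instance, as in the class example above; the parameter names are from the set_config signature, but the chosen values are examples, not recommendations):

```python
aligner.set_config(
    time_stamps="fixed",
    samples_to_frames_ratio=640,  # e.g. a model with fixed subsampling
    gratis_blank=True,            # tolerate unrelated audio between utterances
    min_window_size=16000,        # utterances lie far apart in the audio
)
```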

get_timing_config(speech_len=None, lpz_len=None)[source]

Obtain parameters to determine time stamps.

estimate_samples_to_frames_ratio(speech_len=215040)[source]

Determine the ratio of encoded frames to sample points.

This method helps to determine the time that a single encoded frame occupies. As the sample rate already determines the number of samples per second, only the ratio of samples per encoded CTC frame is needed. This function estimates it by doing one inference, which is only needed once.

Parameters

speech_len (int) – Length of randomly generated speech vector for single inference. Default: 215040.

Returns

Estimated ratio.

Return type

int
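The arithmetic behind the estimate is a single division over one probe inference; the lpz length below is a hypothetical model output, not a value from a real inference:

```python
speech_len = 215040  # samples in the randomly generated probe signal (default)
lpz_len = 336        # CTC frames the (hypothetical) model returned for it
samples_to_frames_ratio = round(speech_len / lpz_len)
print(samples_to_frames_ratio)  # 640
```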

get_lpz(speech: Union[torch.Tensor, numpy.ndarray])[source]

Obtain CTC posterior log probabilities for given speech data.

Parameters

speech (Union[torch.Tensor, np.ndarray]) – Speech audio input.

Returns

Numpy vector with CTC log posterior probabilities.

Return type

np.ndarray

prepare_segmentation_task(text, lpz, name=None, speech_len=None)[source]

Preprocess text, and gather text and lpz into a task object.

Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed into a multiprocessing computation.

It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.

The text is tokenized based on the text_converter setting:

The “tokenize” method is more efficient and is the easiest for models based on Latin or Cyrillic script that only contain the main chars, [“a”, “b”, …], or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.

The “classic” method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word “▁really” will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
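The partial-token scan can be sketched as follows; the toy vocabulary is illustrative (a real vocabulary would be read from the ASR model's tokenizer), and the helper is a hypothetical simplification of the actual preprocessing:

```python
def partial_tokens(word, vocab):
    """Collect the prefixes of `word` that exist as tokens in `vocab`."""
    return [word[:i] for i in range(1, len(word) + 1) if word[:i] in vocab]

# Toy vocabulary; a real one would come from the ASR model's tokenizer.
vocab = {"▁", "▁r", "▁re", "▁real", "▁really", "▁s", "▁so"}
print(partial_tokens("▁really", vocab))
# ['▁', '▁r', '▁re', '▁real', '▁really']
```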

Parameters
  • text (list) – List or multiline-string with utterance ground truths.

  • lpz (np.ndarray) – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).

  • name (str) – Audio file name that will be included in the segments output. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.

  • speech_len (int) – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of the lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.

Returns

Task object that can be passed to CTCSegmentation.get_segments() in order to obtain alignments.

Return type

CTCSegmentationTask

static get_segments(task: speechbrain.alignment.ctc_segmentation.CTCSegmentationTask)[source]

Obtain segments for given utterance texts and CTC log posteriors.

Parameters

task (CTCSegmentationTask) – Task object that contains ground truth and CTC posterior probabilities.

Returns

Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.

Return type

dict

__call__(speech: Union[torch.Tensor, numpy.ndarray, str, pathlib.Path], text: Union[List[str], str], name: Optional[str] = None) speechbrain.alignment.ctc_segmentation.CTCSegmentationTask[source]

Align utterances.

Parameters
  • speech (Union[torch.Tensor, np.ndarray, str, Path]) – Audio file that can be given as path or as array.

  • text (Union[List[str], str]) – List or multiline-string with utterance ground truths. The required formatting depends on the setting kaldi_style_text.

  • name (str) – Name of the file. Utterance names are derived from it.

Returns

Task object with segments. Apply str(·) or print(·) on it to obtain the segments list.

Return type

CTCSegmentationTask