speechbrain.alignment.ctc_segmentation module
Perform CTC segmentation to align utterances within audio files.
This uses the ctc-segmentation Python package. Install it with pip, or see the installation instructions at https://github.com/lumaku/ctc-segmentation
Summary
Classes:
- CTCSegmentation: Align text to audio using CTC segmentation.
- CTCSegmentationTask: Task object for CTC segmentation.
Reference
- class speechbrain.alignment.ctc_segmentation.CTCSegmentationTask[source]
Bases:
SimpleNamespace
Task object for CTC segmentation.
This object is automatically generated and acts as a container for results of a CTCSegmentation object.
When formatted with str(·), this object returns the results in a kaldi-style segments file format. The human-readable output can be configured with the printing options.
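For instance, a minimal sketch of reading results out of a finished task object (the variable task is assumed to be a CTCSegmentationTask returned by a CTCSegmentation instance):

# 'task' is assumed to be a CTCSegmentationTask returned by a CTCSegmentation call.
print(task)  # kaldi-style segments lines, one per utterance
for start, end, score in task.segments:
    print(f"{start:.2f}s - {end:.2f}s (confidence {score:.3f})")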
Properties
- text : list
Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).
- ground_truth_mat : array
Ground truth matrix (CTC segmentation).
- utt_begin_indices : np.ndarray
Utterance separator for the ground truth matrix.
- timings : np.ndarray
Time marks of the corresponding chars.
- state_list : list
Estimated alignment of chars/tokens.
- segments : list
Calculated segments as: (start, end, confidence score).
- config : CtcSegmentationParameters
CTC segmentation configuration object.
- name : str
Name of the aligned audio file (optional). If given, the name is considered when generating the text. Default: "utt".
- utt_ids : list
The list of utterance names (optional). This list should have the same length as the number of utterances.
- lpz : np.ndarray
CTC posterior log probabilities (optional).
Properties for printing
- print_confidence_score : bool
Include the confidence score. Default: True.
- print_utterance_text : bool
Include the utterance text. Default: True.
- text = None
- ground_truth_mat = None
- utt_begin_indices = None
- timings = None
- char_probs = None
- state_list = None
- segments = None
- config = None
- done = False
- name = 'utt'
- utt_ids = None
- lpz = None
- print_confidence_score = True
- print_utterance_text = True
- class speechbrain.alignment.ctc_segmentation.CTCSegmentation(asr_model: Union[EncoderASR, EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]
Bases:
object
Align text to audio using CTC segmentation.
Usage
Initialize with the given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as a function to align text within an audio file.
Parameters
- asr_model : EncoderDecoderASR
Speechbrain ASR interface. This requires a model that has a trained CTC layer for inference. It is better to use a model with single-character tokens to get a better time resolution. Please note that the inference complexity with Transformer models usually increases quadratically with audio length. It is therefore recommended to use RNN-based models, if available.
- kaldi_style_text : bool
A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as the first word on each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.
- text_converter : str
How CTC segmentation handles text. "tokenize": Use the ASR model tokenizer to tokenize the text. "classic": The text is preprocessed into text pieces, which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: "tokenize".
- time_stamps : str
Choose the method by which the time stamps are calculated. Both "fixed" and "auto" use the sample rate; the ratio of samples to one frame is either determined automatically for each inference ("auto") or fixed at a certain ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio ("fixed"). Recommended for longer audio files: "auto".
- **ctc_segmentation_args
Parameters for CTC segmentation. The full list of parameters is found in set_config.
Example
>>> # using example file included in the SpeechBrain repository
>>> from speechbrain.pretrained import EncoderDecoderASR
>>> from speechbrain.alignment.ctc_segmentation import CTCSegmentation
>>> # load an ASR model
>>> pre_trained = "speechbrain/asr-transformer-transformerlm-librispeech"
>>> asr_model = EncoderDecoderASR.from_hparams(source=pre_trained)
>>> aligner = CTCSegmentation(asr_model, kaldi_style_text=False)
>>> # load data
>>> audio_path = "tests/samples/single-mic/example1.wav"
>>> text = ["THE BIRCH CANOE", "SLID ON THE", "SMOOTH PLANKS"]
>>> segments = aligner(audio_path, text, name="example1")
On multiprocessing
To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
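A minimal sketch of this three-step workflow, reusing the model and data from the Example above; the multiprocessing wiring with concurrent.futures and the final merge into the task are illustrative assumptions, not part of this module:

from concurrent.futures import ProcessPoolExecutor
from speechbrain.pretrained import EncoderDecoderASR
from speechbrain.alignment.ctc_segmentation import CTCSegmentation

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-transformer-transformerlm-librispeech"
)
aligner = CTCSegmentation(asr_model, kaldi_style_text=False)
speech = asr_model.load_audio("tests/samples/single-mic/example1.wav")
text = ["THE BIRCH CANOE", "SLID ON THE", "SMOOTH PLANKS"]

# (1) Run the network once to obtain the CTC log posteriors (lpz).
lpz = aligner.get_lpz(speech)
# (2) Pack text, lpz, and timing configuration into a serializable task object.
task = aligner.prepare_segmentation_task(
    text, lpz, name="example1", speech_len=speech.shape[0]
)
# (3) get_segments is a staticmethod, so it can run in a worker process.
with ProcessPoolExecutor(max_workers=1) as pool:
    result = pool.submit(CTCSegmentation.get_segments, task).result()
# Assumption: the returned dict keys name fields of the task (a SimpleNamespace),
# so they can be merged back before printing the segments.
for key, value in result.items():
    setattr(task, key, value)
print(task)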
References
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition, 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll, https://arxiv.org/abs/2007.09127
More parameters are described in https://github.com/lumaku/ctc-segmentation
- fs = 16000
- kaldi_style_text = True
- samples_to_frames_ratio = None
- time_stamps = 'auto'
- choices_time_stamps = ['auto', 'fixed']
- text_converter = 'tokenize'
- choices_text_converter = ['tokenize', 'classic']
- warned_about_misconfiguration = False
- config = CtcSegmentationParameters( )
- __init__(asr_model: Union[EncoderASR, EncoderDecoderASR], kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]
Initialize the CTCSegmentation module.
- set_config(time_stamps: Optional[str] = None, fs: Optional[int] = None, samples_to_frames_ratio: Optional[float] = None, set_blank: Optional[int] = None, replace_spaces_with_blanks: Optional[bool] = None, kaldi_style_text: Optional[bool] = None, text_converter: Optional[str] = None, gratis_blank: Optional[bool] = None, min_window_size: Optional[int] = None, max_window_size: Optional[int] = None, scoring_length: Optional[int] = None)[source]
Set CTC segmentation parameters. A short usage sketch follows the parameter list below.
Parameters for timing
- time_stamps : str
Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.
- fs : int
Sample rate. Usually derived from the ASR model; use this parameter to overwrite the setting.
- samples_to_frames_ratio : float
If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to "fixed". Note: If you want to calculate the time stamps from a model with fixed subsampling, set this parameter to: subsampling_factor * frame_duration / 1000.
Parameters for text preparation
- set_blank : int
Index of the blank in the token list. Default: 0.
- replace_spaces_with_blanks : bool
Insert blanks between words, which is useful for handling long pauses between words. Only used in the text_converter="classic" preprocessing mode. Default: False.
- kaldi_style_text : bool
Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.
- text_converter : str
How CTC segmentation handles text. Set at module initialization.
Parameters for alignment
- min_window_size : int
Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.
- max_window_size : int
Maximum window size. It should not be necessary to change this value.
- gratis_blank : bool
If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.
Parameters for calculation of confidence score
- scoring_length : int
Block length used to calculate the confidence score. The default value of 30 should be OK in most cases; 30 corresponds to roughly 1-2 s of audio.
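A hedged example of adjusting a few of these parameters on an existing aligner instance (the chosen values are illustrative, not recommendations):

# 'aligner' is assumed to be an initialized CTCSegmentation instance.
aligner.set_config(
    gratis_blank=True,      # zero transition cost for blank: tolerate unrelated audio between utterances
    min_window_size=16000,  # widen the minimum window when utterances are far apart
    scoring_length=20,      # shorter blocks for the confidence score (illustrative)
)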
- get_timing_config(speech_len=None, lpz_len=None)[source]
Obtain parameters to determine time stamps.
- estimate_samples_to_frames_ratio(speech_len=215040)[source]
Determine the ratio of encoded frames to sample points.
This method helps to determine the time that a single encoded frame occupies. As the sample rate already gives the number of samples per second, only the ratio of samples per encoded CTC frame is needed. This function estimates it by performing one inference, which is only needed once.
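For example, a small sketch of estimating the ratio once and then fixing it for subsequent alignments (assuming an initialized aligner as above):

# One inference to estimate how many samples one CTC output frame spans.
ratio = aligner.estimate_samples_to_frames_ratio()
# Reuse the estimate so later calls skip the extra inference.
aligner.set_config(time_stamps="fixed", samples_to_frames_ratio=ratio)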
- get_lpz(speech: Union[Tensor, ndarray])[source]
Obtain CTC posterior log probabilities for given speech data.
- Parameters
speech (Union[torch.Tensor, np.ndarray]) – Speech audio input.
- Returns
Numpy array with CTC log posterior probabilities.
- Return type
np.ndarray
- prepare_segmentation_task(text, lpz, name=None, speech_len=None)[source]
Preprocess text, and gather text and lpz into a task object.
Text is pre-processed and tokenized depending on the configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration are collected in a CTCSegmentationTask object. The resulting object can be serialized and passed on in a multiprocessing computation.
It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.
The text is tokenized based on the text_converter setting:
The "tokenize" method is more efficient and the easiest for models based on Latin or Cyrillic script that only contain the main chars (["a", "b", …]), or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.
The "classic" method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word "▁really" will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.
- Parameters
text (list) – List or multiline-string with utterance ground truths.
lpz (np.ndarray) – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).
name (str) – Audio file name that will be included in the segments output. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.
speech_len (int) – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference. Default: None.
- Returns
Task object that can be passed to CTCSegmentation.get_segments() in order to obtain alignments.
- Return type
CTCSegmentationTask
- static get_segments(task: CTCSegmentationTask)[source]
Obtain segments for given utterance texts and CTC log posteriors.
- Parameters
task (CTCSegmentationTask) – Task object that contains ground truth and CTC posterior probabilities.
- Returns
Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.
- Return type
dict
- __call__(speech: Union[Tensor, ndarray, str, Path], text: Union[List[str], str], name: Optional[str] = None) CTCSegmentationTask [source]
Align utterances.
- Parameters
speech (Union[torch.Tensor, np.ndarray, str, Path]) – Audio file that can be given as path or as array.
text (Union[List[str], str]) – List or multiline-string with utterance ground truths. The required formatting depends on the setting kaldi_style_text.
name (str) – Name of the file. Utterance names are derived from it.
- Returns
Task object with segments. Apply str(·) or print(·) on it to obtain the segments list.
- Return type
CTCSegmentationTask