speechbrain.integrations.nlp.spacy_pipeline module

Models and tooling for natural language processing using spaCy

Authors * Sylvain de Langen 2024

Summary

Classes:

SpacyPipeline

Wraps a spaCy pipeline with methods that make it easier to deal with SB's typical sentence format, and adds some convenience functions if you only care about a specific task.

Reference

class speechbrain.integrations.nlp.spacy_pipeline.SpacyPipeline(nlp: spacy.language.Language)

Bases: object

Wraps a spaCy pipeline with methods that make it easier to deal with SB’s typical sentence format, and adds some convenience functions if you only care about a specific task.

Parameters:

nlp (spacy.language.Language) – spaCy text processing pipeline to use.

Example

>>> # NOTE: To run this example, you must first download a pipeline, e.g.
>>> # spacy download en_core_web_sm
>>> ler_model = SpacyPipeline.from_name(
...     name="en_core_web_sm", exclude=["parser", "ner", "textcat"]
... )
>>> ler_model.lemmatize(["i", "am", "sitting"])
[['I'], ['be'], ['sit']]

static from_name(name, *args, **kwargs)

Create a pipeline by loading a model using spacy.load. Unlike with some other toolkits, specifying a HF hub name is not enough: to use a remote model, you must first download it explicitly (e.g. spacy download fr_core_news_md).

Note

If you only need a subset of the pipeline components, e.g. for lemmatization, consider excluding the others with the exclude=[...] argument (see https://spacy.io/usage/processing-pipelines#disabling).

Parameters:
  • name (str | Path) – Package name or model path.

  • *args – Extra positional arguments passed to spacy.load.

  • **kwargs – Extra keyword arguments passed to spacy.load.

Return type:

New SpacyPipeline
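
Because pipelines installed with spacy download are ordinary Python packages, one way to fail early with a clearer message is to check for the package before calling from_name. The following is a minimal sketch under that assumption; is_package_installed is a hypothetical helper, not part of SpeechBrain or spaCy, and it only applies when name is a package name rather than a model path.

```python
import importlib.util


def is_package_installed(package_name: str) -> bool:
    """Return True if a top-level Python package with this name can be found.

    Hypothetical helper: only meaningful for package names such as
    "en_core_web_sm", not for filesystem paths to a model directory.
    """
    return importlib.util.find_spec(package_name) is not None


pipeline_name = "en_core_web_sm"
if not is_package_installed(pipeline_name):
    # Point the user at the download command instead of failing later.
    print(f"Pipeline missing, try: python -m spacy download {pipeline_name}")
```

If the check passes, the pipeline can then be created with SpacyPipeline.from_name(name=pipeline_name) as in the class example.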

__call__(inputs: List[str] | List[List[str]]) → Iterator[spacy.tokens.Doc]

Processes a batch of sentences into an iterator of spaCy documents.

Parameters:

inputs (list of sentences (str or list of tokens)) – Batch of sentences to process. Each sentence is either a str or a list of tokens (list of str). Token lists do not need to follow the tokenization of this specific pipeline: the tokens are joined with spaces before processing.

Returns:

Iterator of documents for the passed sentences.

Return type:

iterator of spacy.tokens.Doc
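
The accepted input shapes can be illustrated without spaCy. The sketch below only mirrors the documented behaviour (a str is kept as-is, a token list is joined with spaces); to_text is a hypothetical name and this is not the wrapper's actual implementation.

```python
from typing import List, Union

# A sentence is either raw text or a list of tokens.
Sentence = Union[str, List[str]]


def to_text(sentence: Sentence) -> str:
    """Hypothetical helper: turn one sentence into a single string."""
    if isinstance(sentence, str):
        return sentence
    # Token lists are joined with spaces, as described for `inputs`.
    return " ".join(sentence)


# A flat list of str is a batch of three one-word sentences...
three_sentences = [to_text(s) for s in ["i", "am", "sitting"]]
# ...while a nested list is a batch holding one three-token sentence.
one_sentence = [to_text(s) for s in [["i", "am", "sitting"]]]

print(three_sentences)  # ['i', 'am', 'sitting']
print(one_sentence)     # ['i am sitting']
```

This distinction explains why the class example, which passes a flat list, yields one inner list per word.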

lemmatize(inputs: List[str] | List[List[str]]) → List[List[str]]

Lemmatizes a batch of sentences by running them through the pipeline and keeping only the lemmas, discarding all other outputs.

Parameters:

inputs (list of sentences (str or list of tokens)) – Batch of sentences to lemmatize. Each sentence is either a str or a list of tokens (list of str). Token lists do not need to follow the tokenization of this specific pipeline: the tokens are joined with spaces before processing.

Returns:

For each sentence, the sequence of extracted lemmas as `str`s.

Return type:

list of list of str
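
The return value holds one inner list per input sentence, so it can be paired back with the inputs by position. A small sketch of that, reusing the output shown in the class example rather than calling spaCy; pair_with_lemmas is a hypothetical helper introduced here for illustration.

```python
from typing import List, Tuple


def pair_with_lemmas(
    sentences: List[str], lemmas: List[List[str]]
) -> List[Tuple[str, str]]:
    """Hypothetical helper: pair each sentence with its space-joined lemmas."""
    if len(sentences) != len(lemmas):
        raise ValueError("Expected one list of lemmas per sentence")
    return [(sent, " ".join(lem)) for sent, lem in zip(sentences, lemmas)]


sentences = ["i", "am", "sitting"]
# Output copied from the class example, not recomputed here.
lemmas = [["I"], ["be"], ["sit"]]

pairs = pair_with_lemmas(sentences, lemmas)
print(pairs)  # [('i', 'I'), ('am', 'be'), ('sitting', 'sit')]
```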