speechbrain.integrations.nlp.spacy_pipeline module
Models and tooling for natural language processing using spaCy
Authors * Sylvain de Langen 2024
Summary
Classes:
SpacyPipeline | Wraps a spaCy pipeline with methods that make it easier to deal with SB's typical sentence format, and adds some convenience functions if you only care about a specific task.
Reference
- class speechbrain.integrations.nlp.spacy_pipeline.SpacyPipeline(nlp: spacy.language.Language)
Bases: object
Wraps a spaCy pipeline with methods that make it easier to deal with SB's typical sentence format, and adds some convenience functions if you only care about a specific task.
- Parameters:
nlp (spacy.language.Language) – spaCy text processing pipeline to use.
Example
>>> # NOTE: To run this example, you must first download a pipeline, e.g.
>>> # spacy download en_core_web_sm
>>> from speechbrain.integrations.nlp.spacy_pipeline import SpacyPipeline
>>> ler_model = SpacyPipeline.from_name(
...     name="en_core_web_sm", exclude=["parser", "ner", "textcat"]
... )
>>> ler_model.lemmatize(["i", "am", "sitting"])
[['I'], ['be'], ['sit']]
- static from_name(name, *args, **kwargs)
Create a pipeline by loading a model using spacy.load. Unlike other toolkits, you must explicitly download the model if you want to use a remote model (e.g. spacy download fr_core_news_md) rather than just specifying a HF hub name.
Note
If you only need a subset of modules enabled in the pipeline, e.g. for lemmatization, consider excluding the others using the exclude=[...] argument (see https://spacy.io/usage/processing-pipelines#disabling).
- Parameters:
name (str | Path) – Package name or model path.
*args – Extra positional arguments passed to spacy.load.
**kwargs – Extra keyword arguments passed to spacy.load.
- Return type:
New SpacyPipeline
- __call__(inputs: List[str] | List[List[str]]) → Iterator[spacy.tokens.Doc]
Processes a batch of sentences into an iterator of spaCy documents.
- Parameters:
inputs (list of sentences (str or list of tokens)) – Sentences to process, in the form of batches of lists of tokens (list of str) or a str. In the case of token lists, tokens do not need to be already tokenized for this specific sequence tagger, and they will be joined with spaces instead.
- Returns:
Iterator of documents for the passed sentences.
- Return type:
iterator of spacy.tokens.Doc
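The accepted input shapes can be easy to mix up, so here is a minimal sketch of the normalization described above, using only the standard library. It is an illustration of the documented behavior, not SpeechBrain's actual implementation, and the helper name to_sentence_strings is hypothetical. It assumes that a str item is treated as one whole sentence and that a list of tokens is joined with spaces before being handed to the pipeline.

```python
from typing import List, Union

# One batch item: either a raw sentence or a list of tokens.
Sentence = Union[str, List[str]]


def to_sentence_strings(inputs: List[Sentence]) -> List[str]:
    """Hypothetical helper: turn each batch item into a single string.

    A str item is kept as-is; a list of tokens is joined with spaces.
    """
    return [
        item if isinstance(item, str) else " ".join(item)
        for item in inputs
    ]


# A flat list of str is a batch of three one-word sentences.
flat = to_sentence_strings(["i", "am", "sitting"])

# A nested list is a batch holding a single three-token sentence.
nested = to_sentence_strings([["i", "am", "sitting"]])

print(flat)    # ['i', 'am', 'sitting']
print(nested)  # ['i am sitting']
```

Under these assumptions, passing ["i", "am", "sitting"] yields one result per word, which matches the per-word output shown in the class example, while wrapping the tokens in an outer list treats them as one sentence.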
- lemmatize(inputs: List[str] | List[List[str]]) → List[List[str]]
Lemmatize a batch of sentences by processing the input sentences, discarding other irrelevant outputs.
- Parameters:
inputs (list of sentences (str or list of tokens)) – Sentences to lemmatize, in the form of batches of lists of tokens (list of str) or a str. In the case of token lists, tokens do not need to be already tokenized for this specific sequence tagger, and they will be joined with spaces instead.
- Returns:
For each sentence, the sequence of extracted lemmas as `str`s.
- Return type: