speechbrain.integrations.huggingface.wordemb.transformer module

A convenience wrapper for retrieving word embeddings from HuggingFace Transformers models (e.g. BERT)

Authors
 * Artem Ploujnikov 2021

Summary

Exceptions:

MissingTransformersError

Raised when HuggingFace Transformers is not installed

Classes:

TransformerWordEmbeddings

A wrapper that retrieves word embeddings from a pretrained HuggingFace Transformers model (e.g. BERT).

Reference

class speechbrain.integrations.huggingface.wordemb.transformer.TransformerWordEmbeddings(model, tokenizer=None, layers=None, device=None)[source]

Bases: Module

A wrapper that retrieves word embeddings from a pretrained HuggingFace Transformers model (e.g. BERT)

Parameters:
  • model (str|nn.Module) – the underlying model instance or the name of the model to download

  • tokenizer (str|transformers.tokenization_utils_base.PreTrainedTokenizerBase) – a pretrained tokenizer, or the identifier used to retrieve one from HuggingFace

  • layers (int|list) – a list of layer indexes from which to construct an embedding, or the number of layers to use

  • device (str) – a torch device identifier. If provided, the model will be transferred onto that device
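
The layers argument accepts either form. As a rough illustration only, the sketch below assumes that an integer N selects the last N hidden layers, that omitting the argument falls back to DEFAULT_LAYERS, and that the selected layers are summed element-wise. None of these details is stated in this reference, and resolve_layers and combine_layers are hypothetical helpers, not part of SpeechBrain.

```python
from typing import List, Optional, Sequence, Union

DEFAULT_LAYERS = 4  # mirrors the class attribute documented in this module


def resolve_layers(layers: Optional[Union[int, Sequence[int]]]) -> List[int]:
    """Turn the `layers` argument into an explicit list of layer indexes.

    Assumption: an integer N means "the last N layers" (negative indexes),
    and None falls back to DEFAULT_LAYERS.
    """
    if layers is None:
        layers = DEFAULT_LAYERS
    if isinstance(layers, int):
        return list(range(-layers, 0))
    return list(layers)


def combine_layers(
    hidden_states: Sequence[Sequence[float]], layers: Sequence[int]
) -> List[float]:
    """Sum the selected layers element-wise (assumed pooling strategy)."""
    selected = [hidden_states[i] for i in layers]
    return [sum(values) for values in zip(*selected)]


# Toy "hidden states" for one token: 5 layers, embedding size 2
hidden_states = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
print(resolve_layers(2))  # [-2, -1]
print(combine_layers(hidden_states, resolve_layers(2)))  # [9.0, 2.0]
```

Passing an explicit list such as [0, 3] is returned unchanged by this helper, which is what lets a caller pick non-contiguous layers.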

Example

>>> from transformers import AutoTokenizer, AutoModel
>>> from speechbrain.integrations.huggingface.wordemb.transformer import (
...     TransformerWordEmbeddings,
... )
>>> model_name = "bert-base-uncased"
>>> tokenizer = AutoTokenizer.from_pretrained(
...     model_name, return_tensors="pt"
... )
>>> model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
>>> word_emb = TransformerWordEmbeddings(
...     model=model, layers=4, tokenizer=tokenizer
... )
>>> embedding = word_emb.embedding(
...     sentence="THIS IS A TEST SENTENCE", word="TEST"
... )
>>> embedding[:8]
tensor([ 3.4332, -3.6702,  0.5152, -1.9301,  0.9197,  2.1628, -0.2841, -0.3549])
>>> embeddings = word_emb.embeddings("This is cool")
>>> embeddings.shape
torch.Size([3, 768])
>>> embeddings[:, :3]
tensor([[-2.9078,  1.2496,  0.7269],
        [-0.9940, -0.6960,  1.4350],
        [-1.2401, -3.8237,  0.2740]])
>>> sentences = [
...     "This is the first test sentence",
...     "This is the second test sentence",
...     "A quick brown fox jumped over the lazy dog",
... ]
>>> batch_embeddings = word_emb.batch_embeddings(sentences)
>>> batch_embeddings.shape
torch.Size([3, 9, 768])
>>> batch_embeddings[:, :2, :3]
tensor([[[-5.0935, -1.2838,  0.7868],
         [-4.6889, -2.1488,  2.1380]],

        [[-4.4993, -2.0178,  0.9369],
         [-4.1760, -2.4141,  1.9474]],

        [[-1.0065,  1.4227, -2.6671],
         [-0.3408, -0.6238,  0.1780]]])

MSG_WORD = "'word' should be either a word or the index of a word"

DEFAULT_LAYERS = 4

forward(sentence, word=None)[source]

Retrieves the embedding for the specified word within a given sentence if a word is provided, or the embeddings of all words if only a sentence is given

Parameters:
  • sentence (str) – a sentence

  • word (str|int) – a word or a word’s index within the sentence. If a word is given and it occurs multiple times in the sentence, the first occurrence is used

Returns:

emb – the word embedding, or a tensor of all word embeddings if no word is given

Return type:

torch.Tensor
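
The dispatch described for forward can be pictured with a small stand-in class. This is a sketch of the documented behavior only, not the library's implementation: WordEmbeddingsSketch is hypothetical and returns strings instead of tensors so that it runs without any model.

```python
class WordEmbeddingsSketch:
    """Hypothetical stand-in that mimics only the dispatch logic of forward()."""

    def embedding(self, sentence, word):
        # Placeholder for the single-word lookup
        return f"embedding of {word!r}"

    def embeddings(self, sentence):
        # Placeholder for the all-words lookup
        return f"embeddings of all {len(sentence.split())} words"

    def forward(self, sentence, word=None):
        # A word (or word index) yields one embedding; otherwise all of them
        if word is not None:
            return self.embedding(sentence, word)
        return self.embeddings(sentence)

    __call__ = forward


sketch = WordEmbeddingsSketch()
print(sketch("THIS IS A TEST SENTENCE"))               # all words
print(sketch("THIS IS A TEST SENTENCE", word="TEST"))  # one word
print(sketch("THIS IS A TEST SENTENCE", word=0))       # index 0 is a valid word
```

The sketch compares against None rather than testing truthiness, so that an index of 0 still selects the first word.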

embedding(sentence, word)[source]

Retrieves a word embedding for the specified word within a given sentence

Parameters:
  • sentence (str) – a sentence

  • word (str|int) – a word or a word’s index within the sentence. If a word is given and it occurs multiple times in the sentence, the first occurrence is used

Returns:

emb – the word embedding

Return type:

torch.Tensor
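
To make the word argument concrete, here is a minimal sketch of how a string or an integer might be resolved to a word position. It assumes the sentence is split on whitespace, which this reference does not state, and resolve_word_index is a hypothetical helper rather than a SpeechBrain function.

```python
MSG_WORD = "'word' should be either a word or the index of a word"


def resolve_word_index(sentence: str, word) -> int:
    """Return the position of `word` in `sentence`.

    Assumption: words are separated by whitespace.
    """
    words = sentence.split()
    if isinstance(word, str):
        # list.index returns the first match, so repeated words
        # resolve to their first occurrence
        return words.index(word)
    if isinstance(word, int):
        return word
    raise ValueError(MSG_WORD)


print(resolve_word_index("the cat saw the dog", "the"))  # 0
print(resolve_word_index("the cat saw the dog", "dog"))  # 4
print(resolve_word_index("the cat saw the dog", 3))      # 3
```

To target a later occurrence of a repeated word, pass its index instead of the word itself.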

embeddings(sentence)[source]

Returns the model embeddings for all words in a sentence

Parameters:

sentence (str) – a sentence

Returns:

emb – a tensor of all word embeddings

Return type:

torch.Tensor

batch_embeddings(sentences)[source]

Returns embeddings for a collection of sentences

Parameters:

sentences (List[str]) – a list of strings corresponding to a batch of sentences

Returns:

emb – a (B x W x E) tensor, where
  • B – the batch dimension (samples)
  • W – the word dimension
  • E – the embedding dimension

Return type:

torch.Tensor
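
Because sentences in a batch differ in length, the W dimension has to match the longest sentence. The sketch below shows one way that shape could arise, using nested lists in place of tensors. Zero padding is an assumption here, and pad_batch is a hypothetical helper, not the library's code.

```python
from typing import List


def pad_batch(
    sentence_embs: List[List[List[float]]], pad_value: float = 0.0
) -> List[List[List[float]]]:
    """Pad per-sentence word embeddings to a common word count (B x W x E)."""
    max_words = max(len(sent) for sent in sentence_embs)
    emb_dim = len(sentence_embs[0][0])
    return [
        sent + [[pad_value] * emb_dim for _ in range(max_words - len(sent))]
        for sent in sentence_embs
    ]


sentences = [
    "This is the first test sentence",
    "This is the second test sentence",
    "A quick brown fox jumped over the lazy dog",
]
# Toy embeddings: one 2-dimensional vector per whitespace-separated word
toy = [[[0.1, 0.2] for _ in s.split()] for s in sentences]
padded = pad_batch(toy)
shape = (len(padded), len(padded[0]), len(padded[0][0]))
print(shape)  # (3, 9, 2)
```

With a real model the last dimension would be the model's hidden size (768 for bert-base-uncased in the example above) rather than 2.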

to(device)[source]

Transfers the model to the specified PyTorch device

exception speechbrain.integrations.huggingface.wordemb.transformer.MissingTransformersError[source]

Bases: Exception

Raised when HuggingFace Transformers is not installed

MESSAGE = 'This module requires HuggingFace Transformers'
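
An exception of this kind is typically raised from a guarded import. The following is a minimal sketch of that pattern, not the module's actual code: the exception class is redefined locally as a stand-in so the snippet runs without SpeechBrain, and require_transformers is a hypothetical helper.

```python
import importlib


class MissingTransformersError(Exception):
    """Local stand-in for the exception documented above."""

    MESSAGE = "This module requires HuggingFace Transformers"

    def __init__(self):
        super().__init__(self.MESSAGE)


def require_transformers(module_name: str = "transformers"):
    """Import the named module or raise MissingTransformersError."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise MissingTransformersError() from err


# Simulate a missing dependency with a module name that does not exist
try:
    require_transformers("a_module_that_is_not_installed")
except MissingTransformersError as err:
    print(err)  # This module requires HuggingFace Transformers
```

Chaining with "from err" keeps the original ImportError in the traceback, which helps when the import fails for a reason other than a missing package.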