speechbrain.wordemb.transformer module
A convenience wrapper for word embeddings retrieved from HuggingFace Transformers (e.g. BERT)
Authors
* Artem Ploujnikov 2021
Summary
Exceptions:
MissingTransformersError – Thrown when HuggingFace Transformers is not installed
Classes:
TransformerWordEmbeddings – A wrapper to retrieve word embeddings out of a pretrained Transformer model from HuggingFace Transformers (e.g. BERT)
Reference
- class speechbrain.wordemb.transformer.TransformerWordEmbeddings(model, tokenizer=None, layers=None, device=None)[source]
Bases: Module
A wrapper to retrieve word embeddings out of a pretrained Transformer model from HuggingFace Transformers (e.g. BERT)
- Parameters
model (str|nn.Module) – the underlying model instance or the name of the model to download
tokenizer (str|transformers.tokenization_utils_base.PreTrainedTokenizerBase) – a pretrained tokenizer, or the identifier of a tokenizer to retrieve from HuggingFace
layers (int|list) – a list of layer indices from which to construct an embedding, or the number of layers to use
device – a torch device identifier. If provided, the model will be transferred onto that device
Example
NOTE: Doctests are disabled because the dependency on the HuggingFace Transformers library is optional.
>>> from transformers import AutoTokenizer, AutoModel
>>> from speechbrain.wordemb.transformer import TransformerWordEmbeddings
>>> model_name = "bert-base-uncased"
>>> tokenizer = AutoTokenizer.from_pretrained(
...     model_name, return_tensors='pt')
>>> model = AutoModel.from_pretrained(
...     model_name,
...     output_hidden_states=True)
>>> word_emb = TransformerWordEmbeddings(
...     model=model,
...     layers=4,
...     tokenizer=tokenizer
... )
>>> embedding = word_emb.embedding(
...     sentence="THIS IS A TEST SENTENCE",
...     word="TEST"
... )
>>> embedding[:8]
tensor([ 3.4332, -3.6702,  0.5152, -1.9301,  0.9197,  2.1628, -0.2841, -0.3549])
>>> embeddings = word_emb.embeddings("This is cool")
>>> embeddings.shape
torch.Size([3, 768])
>>> embeddings[:, :3]
tensor([[-2.9078,  1.2496,  0.7269],
        [-0.9940, -0.6960,  1.4350],
        [-1.2401, -3.8237,  0.2739]])
>>> sentences = [
...     "This is the first test sentence",
...     "This is the second test sentence",
...     "A quick brown fox jumped over the lazy dog"
... ]
>>> batch_embeddings = word_emb.batch_embeddings(sentences)
>>> batch_embeddings.shape
torch.Size([3, 9, 768])
>>> batch_embeddings[:, :2, :3]
tensor([[[-5.0935, -1.2838,  0.7868],
         [-4.6889, -2.1488,  2.1380]],

        [[-4.4993, -2.0178,  0.9369],
         [-4.1760, -2.4141,  1.9474]],

        [[-1.0065,  1.4227, -2.6671],
         [-0.3408, -0.6238,  0.1780]]])
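Because model and tokenizer also accept HuggingFace identifiers (per the parameters above), the wrapper can also be constructed directly from a model name; a minimal sketch (the model name and device here are illustrative):

>>> from speechbrain.wordemb.transformer import TransformerWordEmbeddings
>>> # Construct by name: the model and tokenizer are retrieved from HuggingFace
>>> word_emb = TransformerWordEmbeddings(
...     model="bert-base-uncased",
...     tokenizer="bert-base-uncased",
...     layers=4,
...     device="cpu"
... )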
- MSG_WORD = "'word' should be either a word or the index of a word"
- DEFAULT_LAYERS = 4
- forward(sentence, word=None)[source]
Retrieves the embedding for the specified word within a given sentence if a word is provided, or the embeddings for all words if only a sentence is given
- Parameters
sentence (str) – a sentence
word (str|int) – a word, or the index of a word within the sentence; if omitted, embeddings for all words are returned
- Returns
emb – the word embedding
- Return type
torch.Tensor
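Since TransformerWordEmbeddings is a Module, calling the instance invokes forward; a short sketch, reusing the word_emb instance from the class example above (per the docstring, this behaves like embedding() when a word is given and like embeddings() otherwise):

>>> # With a word: a single word embedding
>>> emb_word = word_emb("THIS IS A TEST SENTENCE", word="TEST")
>>> # Without a word: one embedding per word in the sentence
>>> emb_all = word_emb("THIS IS A TEST SENTENCE")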
- embedding(sentence, word)[source]
Retrieves a word embedding for the specified word within a given sentence
- Parameters
sentence (str) – a sentence
word (str|int) – a word, or the index of a word within the sentence
- Returns
emb – the word embedding
- Return type
torch.Tensor
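MSG_WORD above suggests word may be passed either as a string or as a word index; a hedged sketch, reusing the word_emb instance from the class example (the integer form is inferred from MSG_WORD rather than shown in the original doctest):

>>> # By string: the first occurrence of "TEST" in the sentence
>>> emb = word_emb.embedding(sentence="THIS IS A TEST SENTENCE", word="TEST")
>>> # By index (inferred from MSG_WORD): word 3 is "TEST" in this sentence
>>> emb_by_index = word_emb.embedding(sentence="THIS IS A TEST SENTENCE", word=3)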
- embeddings(sentence)[source]
Returns the model embeddings for all words in a sentence
- Parameters
sentence (str) – a sentence
- Returns
emb – a tensor of all word embeddings
- Return type
torch.Tensor
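The class doctest shows one embedding row per word ("This is cool" yields a 3 x 768 tensor), so rows can be paired with whitespace-separated words; a small sketch under that assumption:

>>> sentence = "This is cool"
>>> embs = word_emb.embeddings(sentence)
>>> # One row per word: pair each word with its embedding vector
>>> for word, vec in zip(sentence.split(), embs):
...     print(word, vec.shape)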
- batch_embeddings(sentences)[source]
Returns embeddings for a collection of sentences
- Parameters
sentences (List[str]) – a list of strings corresponding to a batch of sentences
- Returns
emb – a (B x W x E) tensor, where
B – the batch dimension (samples)
W – the word dimension
E – the embedding dimension
- Return type
torch.Tensor
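In the class doctest, three sentences of 6, 6 and 9 words produce a (3 x 9 x 768) tensor, which suggests W is padded to the word count of the longest sentence in the batch; a sketch under that assumption:

>>> sentences = [
...     "This is the first test sentence",
...     "A quick brown fox jumped over the lazy dog"
... ]
>>> batch = word_emb.batch_embeddings(sentences)
>>> # batch.shape[0] == 2   (B: number of sentences)
>>> # batch.shape[1] == 9   (W: word count of the longest sentence)
>>> # batch.shape[2] == 768 (E: embedding size for bert-base)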