speechbrain.wordemb.transformer module
A convenience wrapper for word embeddings retrieved from HuggingFace Transformers (e.g. BERT)
Authors
* Artem Ploujnikov 2021
Summary
Exceptions:
MissingTransformersError – Thrown when HuggingFace Transformers is not installed
Classes:
TransformerWordEmbeddings – A wrapper to retrieve word embeddings out of a pretrained Transformer model from HuggingFace Transformers (e.g. BERT)
Reference
- class speechbrain.wordemb.transformer.TransformerWordEmbeddings(model, tokenizer=None, layers=None, device=None)[source]
Bases: Module
A wrapper to retrieve word embeddings out of a pretrained Transformer model from HuggingFace Transformers (e.g. BERT)
- Parameters
model (str|nn.Module) – the underlying model instance or the name of the model to download
tokenizer (str|transformers.tokenization_utils_base.PreTrainedTokenizerBase) – a pretrained tokenizer, or the identifier of a tokenizer to retrieve from HuggingFace
layers (int|list) – a list of layer indices from which to construct an embedding, or the number of layers to use
device – a torch device identifier. If provided, the model will be transferred onto that device
Example
NOTE: Doctests are disabled because the dependency on the HuggingFace Transformers library is optional.
>>> from transformers import AutoTokenizer, AutoModel
>>> from speechbrain.wordemb.transformer import TransformerWordEmbeddings
>>> model_name = "bert-base-uncased"
>>> tokenizer = AutoTokenizer.from_pretrained(
...     model_name, return_tensors='pt')
>>> model = AutoModel.from_pretrained(
...     model_name,
...     output_hidden_states=True)
>>> word_emb = TransformerWordEmbeddings(
...     model=model,
...     layers=4,
...     tokenizer=tokenizer
... )
>>> embedding = word_emb.embedding(
...     sentence="THIS IS A TEST SENTENCE",
...     word="TEST"
... )
>>> embedding[:8]
tensor([ 3.4332, -3.6702,  0.5152, -1.9301,  0.9197,  2.1628, -0.2841, -0.3549])
>>> embeddings = word_emb.embeddings("This is cool")
>>> embeddings.shape
torch.Size([3, 768])
>>> embeddings[:, :3]
tensor([[-2.9078,  1.2496,  0.7269],
        [-0.9940, -0.6960,  1.4350],
        [-1.2401, -3.8237,  0.2739]])
>>> sentences = [
...     "This is the first test sentence",
...     "This is the second test sentence",
...     "A quick brown fox jumped over the lazy dog"
... ]
>>> batch_embeddings = word_emb.batch_embeddings(sentences)
>>> batch_embeddings.shape
torch.Size([3, 9, 768])
>>> batch_embeddings[:, :2, :3]
tensor([[[-5.0935, -1.2838,  0.7868],
         [-4.6889, -2.1488,  2.1380]],

        [[-4.4993, -2.0178,  0.9369],
         [-4.1760, -2.4141,  1.9474]],

        [[-1.0065,  1.4227, -2.6671],
         [-0.3408, -0.6238,  0.1780]]])
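Because model and tokenizer also accept HuggingFace identifiers (per the parameters above), the wrapper can also be constructed directly from a model name; a minimal sketch (the model name and device here are illustrative):

>>> from speechbrain.wordemb.transformer import TransformerWordEmbeddings
>>> # Construct by name: the model and tokenizer are retrieved from HuggingFace
>>> word_emb = TransformerWordEmbeddings(
...     model="bert-base-uncased",
...     tokenizer="bert-base-uncased",
...     layers=4,
...     device="cpu"
... )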
- MSG_WORD = "'word' should be either a word or the index of a word"
- DEFAULT_LAYERS = 4
- forward(sentence, word=None)[source]
Retrieves the embedding for the specified word within a given sentence if a word is provided, or the embeddings for all words if only a sentence is given
- Parameters
sentence (str) – a sentence
word (str|int) – a word, or the index of a word within the sentence; if omitted, embeddings for all words are returned
- Returns
emb – the word embedding
- Return type
torch.Tensor
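Since TransformerWordEmbeddings is a Module, calling the instance invokes forward; a short sketch, reusing the word_emb instance from the class example above (per the docstring, this behaves like embedding() when a word is given and like embeddings() otherwise):

>>> # With a word: a single word embedding
>>> emb_word = word_emb("THIS IS A TEST SENTENCE", word="TEST")
>>> # Without a word: one embedding per word in the sentence
>>> emb_all = word_emb("THIS IS A TEST SENTENCE")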
- embedding(sentence, word)[source]
Retrieves a word embedding for the specified word within a given sentence
- Parameters
sentence (str) – a sentence
word (str|int) – a word, or the index of a word within the sentence
- Returns
emb – the word embedding
- Return type
torch.Tensor
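MSG_WORD above suggests word may be passed either as a string or as a word index; a hedged sketch, reusing the word_emb instance from the class example (the integer form is inferred from MSG_WORD rather than shown in the original doctest):

>>> # By string: the first occurrence of "TEST" in the sentence
>>> emb = word_emb.embedding(sentence="THIS IS A TEST SENTENCE", word="TEST")
>>> # By index (inferred from MSG_WORD): word 3 is "TEST" in this sentence
>>> emb_by_index = word_emb.embedding(sentence="THIS IS A TEST SENTENCE", word=3)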
- embeddings(sentence)[source]
Returns the model embeddings for all words in a sentence
- Parameters
sentence (str) – a sentence
- Returns
emb – a tensor of all word embeddings
- Return type
torch.Tensor
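The class doctest shows one embedding row per word ("This is cool" yields a 3 x 768 tensor), so rows can be paired with whitespace-separated words; a small sketch under that assumption:

>>> sentence = "This is cool"
>>> embs = word_emb.embeddings(sentence)
>>> # One row per word: pair each word with its embedding vector
>>> for word, vec in zip(sentence.split(), embs):
...     print(word, vec.shape)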
- batch_embeddings(sentences)[source]
Returns embeddings for a collection of sentences
- Parameters
sentences (List[str]) – a list of strings corresponding to a batch of sentences
- Returns
emb – a (B x W x E) tensor, where
B – the batch dimension (samples)
W – the word dimension
E – the embedding dimension
- Return type
torch.Tensor
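In the class doctest, three sentences of 6, 6 and 9 words produce a (3 x 9 x 768) tensor, which suggests W is padded to the word count of the longest sentence in the batch; a sketch under that assumption:

>>> sentences = [
...     "This is the first test sentence",
...     "A quick brown fox jumped over the lazy dog"
... ]
>>> batch = word_emb.batch_embeddings(sentences)
>>> # batch.shape[0] == 2   (B: number of sentences)
>>> # batch.shape[1] == 9   (W: word count of the longest sentence)
>>> # batch.shape[2] == 768 (E: embedding size for bert-base)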