speechbrain.integrations.huggingface.wordemb.transformer module
A convenience wrapper for word embeddings retrieved from HuggingFace Transformers models (e.g. BERT)
Authors * Artem Ploujnikov 2021
Summary
Exceptions:
Thrown when HuggingFace Transformers is not installed
Classes:
A wrapper to retrieve word embeddings from a pretrained HuggingFace Transformers model (e.g. BERT).
Reference
- class speechbrain.integrations.huggingface.wordemb.transformer.TransformerWordEmbeddings(model, tokenizer=None, layers=None, device=None)
Bases: Module
A wrapper to retrieve word embeddings from a pretrained HuggingFace Transformers model (e.g. BERT)
- Parameters:
model (str|nn.Module) – the underlying model instance, or the name of the model to download
tokenizer (str|transformers.tokenization_utils_base.PreTrainedTokenizerBase) – a pretrained tokenizer, or the identifier used to retrieve one from HuggingFace
layers (int|list) – a list of layer indexes from which to construct an embedding, or the number of layers to use
device (str) – a torch device identifier. If provided, the model is moved to that device
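The `layers` argument accepts either a count or an explicit list of indexes. The sketch below shows one way such an argument could be normalized and how the selected layers could be merged into a single vector. The helper names `resolve_layers` and `combine_layers` are hypothetical, and both the reading of an integer as "the last n layers" and the element-wise summation are assumptions that this page does not confirm.

```python
from typing import List, Sequence, Union


def resolve_layers(layers: Union[int, Sequence[int]]) -> List[int]:
    """Turn a layer count or an explicit index list into a list of indexes.

    Assumption: an integer n means "the last n layers", expressed as
    the negative indexes -n, ..., -1.
    """
    if isinstance(layers, int):
        return list(range(-layers, 0))
    return list(layers)


def combine_layers(
    hidden_states: Sequence[Sequence[float]],
    layers: Union[int, Sequence[int]],
) -> List[float]:
    """Sum the selected layers' vectors element-wise (assumed strategy)."""
    selected = [hidden_states[i] for i in resolve_layers(layers)]
    return [sum(values) for values in zip(*selected)]


# Toy hidden states for a single token: 5 layers, 2 dimensions each.
states = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 2.0]]

print(resolve_layers(4))               # [-4, -3, -2, -1]
print(combine_layers(states, 2))       # [7.0, 3.0]
print(combine_layers(states, [0, 1]))  # [1.0, 2.0]
```

With a real model, the per-layer vectors would come from the hidden states that the model returns when it is loaded with `output_hidden_states=True`, as in the example on this page.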
Example
>>> from transformers import AutoTokenizer, AutoModel
>>> from speechbrain.integrations.huggingface.wordemb.transformer import (
...     TransformerWordEmbeddings,
... )
>>> model_name = "bert-base-uncased"
>>> tokenizer = AutoTokenizer.from_pretrained(
...     model_name, return_tensors="pt"
... )
>>> model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
>>> word_emb = TransformerWordEmbeddings(
...     model=model, layers=4, tokenizer=tokenizer
... )
>>> embedding = word_emb.embedding(
...     sentence="THIS IS A TEST SENTENCE", word="TEST"
... )
>>> embedding[:8]
tensor([ 3.4332, -3.6702,  0.5152, -1.9301,  0.9197,  2.1628, -0.2841, -0.3549])
>>> embeddings = word_emb.embeddings("This is cool")
>>> embeddings.shape
torch.Size([3, 768])
>>> embeddings[:, :3]
tensor([[-2.9078,  1.2496,  0.7269],
        [-0.9940, -0.6960,  1.4350],
        [-1.2401, -3.8237,  0.2740]])
>>> sentences = [
...     "This is the first test sentence",
...     "This is the second test sentence",
...     "A quick brown fox jumped over the lazy dog",
... ]
>>> batch_embeddings = word_emb.batch_embeddings(sentences)
>>> batch_embeddings.shape
torch.Size([3, 9, 768])
>>> batch_embeddings[:, :2, :3]
tensor([[[-5.0935, -1.2838,  0.7868],
         [-4.6889, -2.1488,  2.1380]],
        [[-4.4993, -2.0178,  0.9369],
         [-4.1760, -2.4141,  1.9474]],
        [[-1.0065,  1.4227, -2.6671],
         [-0.3408, -0.6238,  0.1780]]])
- MSG_WORD = "'word' should be either a word or the index of a word"
- DEFAULT_LAYERS = 4
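- forward(sentence, word=None)
Retrieves the embedding of the specified word within the given sentence if a word is provided, or the embeddings of all words if only a sentence is given
- Parameters:
sentence (str) – a sentence
word (str|int) – a word or the index of a word within the sentence (optional)
- Returns:
emb – the word embedding, or a tensor of all word embeddings if no word is given
- Return type:
torch.Tensor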
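The dispatch that `forward` describes can be illustrated with a toy stand-in. `ToyWordEmbeddings` below is hypothetical and is not the SpeechBrain class: it returns plain Python lists with a fake one-dimensional "embedding" per word, purely to show how the optional `word` argument could select between the two code paths.

```python
from typing import List, Optional, Union


class ToyWordEmbeddings:
    """Hypothetical stand-in used only to illustrate the dispatch."""

    def embeddings(self, sentence: str) -> List[List[float]]:
        # Fake 1-dimensional "embedding" per word: the word's length.
        return [[float(len(w))] for w in sentence.split()]

    def embedding(self, sentence: str, word: Union[str, int]) -> List[float]:
        words = sentence.split()
        idx = word if isinstance(word, int) else words.index(word)
        return self.embeddings(sentence)[idx]

    def forward(
        self, sentence: str, word: Optional[Union[str, int]] = None
    ):
        # No word requested: return one vector per word in the sentence.
        if word is None:
            return self.embeddings(sentence)
        # Word (or word index) requested: return that single vector.
        return self.embedding(sentence, word)

    __call__ = forward


toy = ToyWordEmbeddings()
print(toy("This is cool"))               # [[4.0], [2.0], [4.0]]
print(toy("This is cool", word="cool"))  # [4.0]
print(toy("This is cool", word=1))       # [2.0]
```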
- embedding(sentence, word)
Retrieves the embedding of the specified word within the given sentence
- Parameters:
sentence (str) – a sentence
word (str|int) – a word or the index of a word within the sentence
- Returns:
emb – the word embedding
- Return type:
torch.Tensor
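The `MSG_WORD` constant suggests that `word` may be given either as a string or as an integer index. A minimal sketch of how such an argument could be resolved to a position in the sentence follows. The function name `resolve_word_index`, the whitespace split, the "first occurrence wins" rule, and the choice of `ValueError` are all assumptions made for illustration.

```python
from typing import List, Union

MSG_WORD = "'word' should be either a word or the index of a word"


def resolve_word_index(words: List[str], word: Union[str, int]) -> int:
    """Map a word or a word index to a position in `words`."""
    if isinstance(word, str):
        # Assumption: the first occurrence is used when a word repeats.
        # list.index raises ValueError if the word is not present.
        return words.index(word)
    if isinstance(word, int):
        return word
    # Anything else is rejected with the documented message.
    raise ValueError(MSG_WORD)


words = "THIS IS A TEST SENTENCE".split()
print(resolve_word_index(words, "TEST"))  # 3
print(resolve_word_index(words, 3))       # 3
```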
- embeddings(sentence)
Returns the model embeddings for all words in a sentence
- Parameters:
sentence (str) – a sentence
- Returns:
emb – a tensor of all word embeddings
- Return type:
torch.Tensor
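Models such as BERT operate on subword tokens, yet `embeddings` returns one vector per word (3 vectors for "This is cool" in the example on this page). The sketch below shows one common way to bridge that gap: averaging the vectors of the tokens that belong to each word. The averaging itself is an assumption, as this page does not state how token vectors are aggregated, and `average_by_word` is a hypothetical helper. The `word_ids` list mimics the token-to-word mapping that HuggingFace fast tokenizers expose through `BatchEncoding.word_ids()`, where special tokens map to `None`.

```python
from typing import Dict, List, Optional, Sequence


def average_by_word(
    token_vectors: Sequence[Sequence[float]],
    word_ids: Sequence[Optional[int]],
) -> List[List[float]]:
    """Average token vectors that share a word id (assumed aggregation)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is None:
            # Special tokens (e.g. [CLS], [SEP]) belong to no word.
            continue
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [s + v for s, v in zip(sums[word_id], vector)]
        counts[word_id] += 1
    return [[s / counts[w] for s in sums[w]] for w in sorted(sums)]


# Made-up tokenization: [CLS] em ##bed ##ding works [SEP] (two words).
word_ids = [None, 0, 0, 0, 1, None]
token_vectors = [
    [9.0, 9.0],  # [CLS]
    [1.0, 0.0],  # em
    [2.0, 3.0],  # ##bed
    [3.0, 6.0],  # ##ding
    [5.0, 5.0],  # works
    [9.0, 9.0],  # [SEP]
]
print(average_by_word(token_vectors, word_ids))  # [[2.0, 3.0], [5.0, 5.0]]
```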
- batch_embeddings(sentences)
Returns embeddings for a collection of sentences
- Parameters:
sentences (List[str]) – a list of strings corresponding to a batch of sentences
- Returns:
emb – a (B x W x E) tensor, where
B is the batch dimension (samples)
W is the word dimension
E is the embedding dimension
- Return type:
torch.Tensor
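The (B x W x E) shape implies that sentences of different lengths share a single word dimension W. In the example on this page, the three sentences have 6, 6 and 9 words and the result has W = 9. A minimal sketch of how per-sentence embeddings could be padded to a common length follows. The helper `pad_batch` is hypothetical and the zero padding value is an assumption, since this page does not say what fills the unused positions.

```python
from typing import List, Sequence


def pad_batch(
    sentence_embeddings: Sequence[Sequence[Sequence[float]]],
    pad_value: float = 0.0,
) -> List[List[List[float]]]:
    """Pad per-sentence word vectors to a common (B x W x E) layout."""
    max_words = max(len(sent) for sent in sentence_embeddings)
    emb_dim = len(sentence_embeddings[0][0])
    padded = []
    for sent in sentence_embeddings:
        # Append padding vectors until the sentence has max_words rows.
        padding = [[pad_value] * emb_dim for _ in range(max_words - len(sent))]
        padded.append([list(vector) for vector in sent] + padding)
    return padded


sentences = [
    "This is the first test sentence",
    "This is the second test sentence",
    "A quick brown fox jumped over the lazy dog",
]
# Toy 2-dimensional vectors, one per whitespace-separated word.
toy_embeddings = [[[1.0, 1.0] for _ in s.split()] for s in sentences]
batch = pad_batch(toy_embeddings)
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (3, 9, 2)
```

Here E is 2 only because the toy vectors are 2-dimensional; with `bert-base-uncased` the example on this page reports E = 768.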