{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_header",
    "tags": [
     "sb_auto_header"
    ]
   },
   "source": [
    "<!-- This cell is automatically updated by tools/tutorial-cell-updater.py -->\n",
    "<!-- The contents are initialized from tutorials/notebook-header.md -->\n",
    "\n",
    "[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/advanced/text-tokenizer.ipynb)\n",
    "to execute or view/download this notebook on\n",
    "[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/advanced/text-tokenizer.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Wa5O28sydb_U"
   },
   "source": [
    "# Text Tokenization\n",
    "\n",
    "## Why do we need tokenization?\n",
    "\n",
    "Almost all languages have a huge number of possible words. Machine learning tasks that process text have thus to support large vocabularies that might contain several thousands of words.  Dealing with such a large vocabulary, however, is critical.  The input and output embeddings  (e.g. one-hot-vectors) are normally huge vectors, leading to and increase memory consumption and memory usage. More importantly,  learning with such extremely sparse and high-dimensional embeddings might be sub-optimal.\n",
    "\n",
    "A naive alternative can be to simply use characters instead of words.\n",
    "The latter approach alleviates some of the aforementioned issues, but\n",
    "it requires processing a longer sequence  (that is critical as well from a machine learning point of view).\n",
    "\n",
    "Can we find a middle ground between words and characters? Yes, this is what the tokenizer is trying to do.\n",
    "\n",
    "One popular technique called **rule-based tokenization** (e.g. [spaCy](https://spacy.io)) allows splitting the text into smaller chunks based on grammar rules, spaces, and punctuation. Unfortunately, this approach is language-dependent and must be set for each language considered ...\n",
    "\n",
    "Another solution to get the best of both word-level and character-level tokenizations is a hybrid solution named **subword tokenization** relying on the principle that frequently-used words should not be split into smaller subwords, but rare words should be decomposed into meaningful (i.e. more frequent) subwords.\n",
    "\n",
    "\n",
    "SpeechBrain currently relies on a custom integration of the [*SentencePiece tokenizer*](https://github.com/google/sentencepiece) which treats the input as a raw input stream. The following tokenizer algorithms are supported:\n",
    "1. [BPE](https://web.archive.org/web/20230319172720/https://www.derczynski.com/papers/archive/BPE_Gage.pdf).\n",
    "2. [Unigram](https://arxiv.org/pdf/1804.10959.pdf) (Subword Regularization).\n",
    "\n",
    "\n",
    "The *SentencePiece tokenizer* is available at `speechbrain.tokenizer.SentencePiece`. In the following, we will describe all the aforementioned techniques, but first of all, let's install SpeechBrain.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "executionInfo": {
     "elapsed": 36501,
     "status": "ok",
     "timestamp": 1708531382261,
     "user": {
      "displayName": "adel moumen",
      "userId": "01620107593621714109"
     },
     "user_tz": -60
    },
    "id": "JSRmMsPvdkfu"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Installing SpeechBrain via pip\n",
    "BRANCH = 'develop'\n",
    "!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH\n",
    "\n",
    "# Clone SpeechBrain repository\n",
    "!git clone https://github.com/speechbrain/speechbrain/\n",
    "%cd /content/speechbrain/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0v2mq9wwfBeV"
   },
   "source": [
    "Let's also download a csv file to train our tokenizer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "executionInfo": {
     "elapsed": 4253,
     "status": "ok",
     "timestamp": 1708531386505,
     "user": {
      "displayName": "adel moumen",
      "userId": "01620107593621714109"
     },
     "user_tz": -60
    },
    "id": "e1vUAGGkfPl9"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "!wget https://www.dropbox.com/s/atg0zycfbacmwqi/dev-clean.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hz__Nn1Z1fxO"
   },
   "source": [
    "## Train sentencepiece tokenizer within SpeechBrain\n",
    "SentencePiece is a class that can be instantiated with few parameters\n",
    "\n",
    "\n",
    "*   **model_dir**: it is the directory where the trained tokenizer model is saved. The model will be saved as *`model_dir/model_type_vocab_size.model`*\n",
    "*   **vocab_sizes**: It is the vocabulary size for the chosen tokenizer type (BPE, Unigram). The vocab_size is optional for character tokenization and mandatory for BPE & unigram tokenization.\n",
    "* **csv_train**: It is the path of the csv file which is used to learn the tokenizer.\n",
    "* **csv_read**: It is the data entry (csv header) which contains the word sequence in the csv file.\n",
    "* **model_type**: It can be: word, char, bpe, or unigram tokenization.\n",
    "\n",
    "Let's now apply it to our dev-clean.csv."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "U-1IDruE0UO_"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from speechbrain.tokenizers.SentencePiece import SentencePiece"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UH5mHGkn110v"
   },
   "outputs": [],
   "source": [
    "spm = SentencePiece(model_dir=\"tokenizer_data\",\n",
    "                    vocab_size=2000,\n",
    "                    annotation_train=\"dev-clean.csv\",\n",
    "                    annotation_read=\"wrd\",\n",
    "                    model_type=\"bpe\",\n",
    "                    annotation_list_to_check=[\"dev-clean.csv\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SK0NPosDVtbk"
   },
   "outputs": [],
   "source": [
    "%less tokenizer_data/2000_bpe.vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jE1ubRuGeWKq"
   },
   "source": [
    "As you can see, SetencePiece lib is an unsupervised text tokenizer and detokenizer.  Some of the tokens have `_` symbols representing spaces. The sentence piece detokenization will simply merge the sequence of tokens and replace `_` with spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2j0CeZ0RKJQ4"
   },
   "source": [
    "### Advanced parameters\n",
    "* `character_coverage`: it is the number of characters covered by the model (value between [0.98 - 1]). default: 1.0 for languages with a small character set. It can be set to 0.995 for languages with rich characters set like Japanese or Chinese.\n",
    "* `bos_id/eos_id/pad_id/unk_id`: allow users to define specefic index for `bos/eos/pad and unk` tokens\n",
    "* `split_by_whitespace`: this parameter allows sentencepiece to extract crossword pieces and consider space as a unique token.\n",
    "* `num_sequences`: use at most `num_sequences` to train the tokenize (limit the training text for large datasets).\n",
    "* `csv_list_to_check`: List of csv files used for checking the accuracy of recovering words from the tokenizer.\n",
    "* `user_defined_symbols`: it is a string list (separated by comma ',') which force the insertion of specific vocabulary.\n",
    "\n",
    "As an example, if we set the `character_coverage` to `0.98` and reduce the `vocab_size`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EhEvT1JZ_hBA"
   },
   "outputs": [],
   "source": [
    "spm = SentencePiece(model_dir=\"tokenizer_data\",\n",
    "                    vocab_size=500,\n",
    "                    annotation_train=\"dev-clean.csv\",\n",
    "                    annotation_read=\"wrd\",\n",
    "                    model_type=\"unigram\",\n",
    "                    character_coverage=0.98,\n",
    "                    annotation_list_to_check=[\"dev-clean.csv\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OZ7bhmnpJoiO"
   },
   "source": [
    "As we can see, we are not able to recover all the words from the text because some characters are missing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QEZnDlKehEvi"
   },
   "source": [
    "## Loading a pre-trained sentence piece tokenizer within SpeechBrain\n",
    "Loading the sentencepiece tokenizer is very simple. We just need to specify the path of the model,  the `vocab_size`, and the `model_type`:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "l8wCKWfphfAy"
   },
   "outputs": [],
   "source": [
    "spm = SentencePiece(model_dir=\"tokenizer_data\",\n",
    "                    vocab_size=2000,\n",
    "                    model_type=\"bpe\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OOnHQACiiXKY"
   },
   "source": [
    "Now, we can directly use the tokenizer loaded from `tokenizer_data/2000_bpe.model`. This feature is very useful to replicate results. As an example, you can upload your tokenizer to the internet and someone else can download it to obtain the same tokenization as you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cLNxiVpkfgXo"
   },
   "source": [
    "## How to use the sentencepiece\n",
    "\n",
    "The SentencePiece object is available at `speechbrain.tokenizer.SentencePiece.sp`. By accessing this object, you can easily perform tokenization and detokenization. If interested in all the features of SentencePiece, please feel free to read the [official tutorial](https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb#scrollTo=uzBiPAm4ljor)\n",
    "\n",
    "Let's try to tokenize and detokenize some text!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WKyVd23AgnyU"
   },
   "outputs": [],
   "source": [
    "# Encode as pieces\n",
    "print(spm.sp.encode_as_pieces('THIS IS A TEST'))\n",
    "# Encode as ids\n",
    "print(spm.sp.encode_as_ids('THIS IS A TEST'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OvxCFkiYoBpH"
   },
   "outputs": [],
   "source": [
    "# Decode from ids\n",
    "print(spm.sp.decode_ids([244, 177, 3, 1, 97]))\n",
    "# Decode from pieces\n",
    "print(spm.sp.decode_pieces(['▁THIS', '▁IS', '▁A', '▁T', 'EST']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y5zbiLO4pAeN"
   },
   "source": [
    "## Use SpeechBrain SentencePiece with Pytorch\n",
    "We designed our SentencePiece wrapper to be used jointly to our data transform pipeline [(see the tutorial)](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html) and therefore deal with tensors.\n",
    "For that purpose, two options are available:\n",
    "1. Option 1: Generating token tensors directly from a word tensors + an external dictionary named `int2lab` (which maps your tensors to words).\n",
    "1. Option 2: If you use our DynamicDataset, the DynamicItem will automatically generate the token tensors.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wqpG4Ccoxo9y"
   },
   "source": [
    "### Example for option 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QeLJbRntpcfc"
   },
   "outputs": [],
   "source": [
    "# INPUTS\n",
    "# word vocab\n",
    "dict_int2lab = {1: \"HELLO\", 2: \"WORLD\", 3: \"GOOD\", 4:\"MORNING\"}\n",
    "# wrd tensors\n",
    "wrd_tensor = torch.Tensor([[1, 2, 0], [3,4,2]])\n",
    "# relative lens tensor (will help for dealing with padding)\n",
    "lens_tensor = torch.Tensor([0.75, 1.0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d0QLTr1GzJ_S"
   },
   "source": [
    "Our SentencePiece can be called like any other pytorch function with the tensors passed to the __call__ method. Parameters are given as:\n",
    "batch : it is a word_ids tensor (i.e. your words). Shape: [batch_size, max_seq_lenght]\n",
    "batch_lens: it is a relative length tensor. shape: [batch_size]\n",
    "int2lab: dictionary which maps the word_ids to the word.\n",
    "task:\n",
    "\"encode\": convert the word batch tensor into a token tensor.\n",
    "\"decode\": convert the token tensor into a list of word sequences.\n",
    "\"decode_from_list\": convert a list of token sequences to a list of word sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k4WDwuSIyleS"
   },
   "outputs": [],
   "source": [
    "encoded_seq_ids, encoded_seq_lens = spm(\n",
    "        wrd_tensor,\n",
    "        lens_tensor,\n",
    "        dict_int2lab,\n",
    "        \"encode\",\n",
    "    )\n",
    "# tokens tensor\n",
    "print(encoded_seq_ids)\n",
    "# relative lens token tensor\n",
    "print(encoded_seq_lens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jA7KAgZ3N3uj"
   },
   "source": [
    "Then we can simply decode it by simply specifying `\"decode\"` to the function!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "P0QijcAKyvSL"
   },
   "outputs": [],
   "source": [
    "# decode from torch tensors (batch, batch_lens)\n",
    "words_seq = spm(encoded_seq_ids, encoded_seq_lens, task=\"decode\")\n",
    "print(words_seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "L64IE3lH4wb6"
   },
   "source": [
    "### Example for option 2\n",
    "\n",
    "**Note:** please first read our dataio [tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html) to perfectly grasp the next lines.\n",
    "\n",
    "Here, we use a tokenizer to tokenize on-the-fly the text obtained from a .csv file. In the following example, we combined  it with the data_io pipeline of SpeechBrain.\n",
    "\n",
    "First, we define a DynamicItemDataset from our csv file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "S6-Os1Eb4ycu"
   },
   "outputs": [],
   "source": [
    "import speechbrain as sb\n",
    "train_set = sb.dataio.dataset.DynamicItemDataset.from_csv(\n",
    "        csv_path=\"dev-clean.csv\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "68xx4kgezPtX"
   },
   "outputs": [],
   "source": [
    "%less dev-clean.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xehgtxCmP1ir"
   },
   "source": [
    "Then, we define the text_pipeline (i.e. what is called for each sample gathered in a mini-batch). In the text_pipeline, we simply call our tokenizer to obtain the tokenized text!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "f00AeLtt5O9o"
   },
   "outputs": [],
   "source": [
    "    @sb.utils.data_pipeline.takes(\"wrd\")\n",
    "    @sb.utils.data_pipeline.provides(\n",
    "        \"wrd\", \"tokens_list\", \"tokens\"\n",
    "    )\n",
    "    def text_pipeline(wrd):\n",
    "        yield wrd\n",
    "        tokens_list = spm.sp.encode_as_ids(wrd)\n",
    "        yield tokens_list\n",
    "        tokens = torch.LongTensor(tokens_list)\n",
    "        yield tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ksltlAzaQNvt"
   },
   "source": [
    "Some more SpeechBrain stuff to finalize the data pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tVmLtHLO5ep9"
   },
   "outputs": [],
   "source": [
    "train_set.add_dynamic_item(text_pipeline)\n",
    "train_set.set_output_keys([\"wrd\", \"tokens\", \"tokens_list\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5D1lzSTCQV-S"
   },
   "source": [
    "Finally, we create a data loader that contains the defined transformation (i.e. tokenizer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ko0IBWvN8cox"
   },
   "outputs": [],
   "source": [
    "train_dataloader = sb.dataio.dataloader.make_dataloader(train_set, batch_size=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hQ63N5urQhys"
   },
   "source": [
    "Now, we can simply get our tokenized samples !!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7Me64J__yQsO"
   },
   "outputs": [],
   "source": [
    "b = next(iter(train_dataloader))\n",
    "print(b.wrd)\n",
    "print(b.tokens)\n",
    "print(b.tokens_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_footer",
    "tags": [
     "sb_auto_footer"
    ]
   },
   "source": [
    "## Citing SpeechBrain\n",
    "\n",
    "If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:\n",
    "\n",
    "```bibtex\n",
    "@misc{speechbrainV1,\n",
    "  title={Open-Source Conversational AI with {SpeechBrain} 1.0},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},\n",
    "  year={2024},\n",
    "  eprint={2407.00463},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={cs.LG},\n",
    "  url={https://arxiv.org/abs/2407.00463},\n",
    "}\n",
    "@misc{speechbrain,\n",
    "  title={{SpeechBrain}: A General-Purpose Speech Toolkit},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},\n",
    "  year={2021},\n",
    "  eprint={2106.04624},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={eess.AS},\n",
    "  note={arXiv:2106.04624}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}