{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_header",
    "tags": [
     "sb_auto_header"
    ]
   },
   "source": [
    "<!-- This cell is automatically updated by tools/tutorial-cell-updater.py -->\n",
    "<!-- The contents are initialized from tutorials/notebook-header.md -->\n",
    "\n",
    "[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/basics/introduction-to-speechbrain.ipynb)\n",
    "to execute or view/download this notebook on\n",
    "[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/basics/introduction-to-speechbrain.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dmf1KHEN6g32"
   },
   "source": [
    "# **Introduction to SpeechBrain**\n",
    "\n",
    "SpeechBrain is an **open-source** **all-in-one** speech toolkit based on **PyTorch**. It is designed to make the research and development of speech technology easier.\n",
    "\n",
    "## Motivation\n",
    "There are many speech and audio processing tasks of great practical and scientific interest.  \n",
    "\n",
    "In the past, the dominant approach was to develop a **different toolkit for each different task**. Nevertheless, learning several toolkits is **time-demanding**, might require knowledge of **different programming languages**,  and forces you to familiarize yourself with  **different code styles and standards** (e.g., data readers).\n",
    "\n",
    "Nowadays, most of these tasks can be implemented with the same **deep learning**  technology.\n",
    "We thus explicitly designed SpeechBrain to natively support **multiple speech processing tasks**. We think that this might make much easier the life of speech developers. Moreover, we think that the combination of different speech technologies in single **end-to-end** and **fully differential system** will be crucial in the development of future speech technologies.\n",
    "\n",
    "We did our best to design a toolkit which is:\n",
    "*   *Easy to use*\n",
    "*   *Easy to customize*\n",
    "*  *Flexible*\n",
    "* *Modular*\n",
    "* *Well-documented*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iiwk7738KFvt"
   },
   "source": [
    "## Supported Technologies\n",
    "\n",
    "You can thus use speechbrain to convert *speech-to-text*, to perform authentication using `speaker verification`, to enhance the quality of the speech signal, to combine the information from multiple microphones, and for many other things.\n",
    "\n",
    "More precisely, SpeechBrain currently supports many conversational AI technologies, including:\n",
    "\n",
    "- Speech Recognition\n",
    "- Speaker Recognition\n",
    "- Speech Separation\n",
    "- Speech Enhancement\n",
    "- Text-to-Speech\n",
    "- Vocoding\n",
    "- Spoken Language Understanding\n",
    "- Speech-to-Speech Translation\n",
    "- Speech Translation\n",
    "- Emotion Classification\n",
    "- Language Identification\n",
    "- Voice Activity Detection\n",
    "- Sound Classification\n",
    "- Self-Supervised Learning\n",
    "- Interpretabiliy\n",
    "- Speech Generation\n",
    "- Metric Learning\n",
    "- Alignment\n",
    "- Diarization\n",
    "- Language Modeling\n",
    "- Response Generation\n",
    "- Grapheme-to-Phoneme\n",
    "\n",
    "For all these tasks, we propose recipes on popular datasets that achieve **competitive** or state-of-the-art **performance**.\n",
    "\n",
    "SpeechBrain is an ongoing project (still in beta version) and we are building a large community to further expand the current functionalities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qbn5PX8mKHFz"
   },
   "source": [
    "## Installation\n",
    "\n",
    "There are essentially two ways to install SpeechBrain:\n",
    "*  **Local installation**: it is suggested if you want to modify the toolkit or train a full speech processing system from scratch.\n",
    "\n",
    "*  **Install via PyPI**: it is suggested when you wanna just use some core functionality of SpeechBrain in your project.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Z60RdK8V54dW"
   },
   "source": [
    "### Local Installation (Git clone)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NQ3rDQslkn12"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "# Installing SpeechBrain via pip\n",
    "BRANCH = 'develop'\n",
    "!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH\n",
    "\n",
    "# Clone SpeechBrain repository\n",
    "!git clone https://github.com/speechbrain/speechbrain/\n",
    "%cd /content/speechbrain/templates/speech_recognition/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XG16N39sfnJs"
   },
   "source": [
    "Once installed, you should be able to import the speechbrain project with python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hbMISpjh0s3e"
   },
   "outputs": [],
   "source": [
    "import speechbrain as sb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lxVO1Mj9MsSh"
   },
   "source": [
    "## Running an Experiment\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TpYzWszYjOZR"
   },
   "source": [
    "To run an experiment with SpeechBrain, the typical syntax is:\n",
    "\n",
    "```\n",
    "python train.py hparams.yaml\n",
    "```\n",
    "\n",
    "All the hyperparameters are summarized in a yaml file, while the main script for training is `train.py`.\n",
    "\n",
    "For instance, let's run one of the minimal examples made available with SpeechBrain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fCukWAmn5TN2"
   },
   "outputs": [],
   "source": [
    "%cd /content/speechbrain/tests/integration/ASR_CTC/\n",
    "!python example_asr_ctc_experiment.py hyperparams.yaml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PAdSpX2mn0YT"
   },
   "source": [
    "In this case,  we trained a CTC-based **speech recognizer** with a tiny dataset stored in the folder `samples`. As you can see, the training loss is very small, which indicates that the model is implemented correctly.\n",
    "The validation loss, instead, is high. This happens because, as expected, the dataset is too small to allow the network to generalize.\n",
    "\n",
    "For a more detailed description of the minimal examples, please see the tutorial on \"minimal examples step-by-step\".\n",
    "\n",
    "All the results of the experiments are stored in the `output_folder` defined in the yaml file. Here, you can find, the checkpoints, the trained models, a file summarizing the performance, and a logger.\n",
    "\n",
    "This way, you can compare your performance with the one achieved by us and you can have access to all the pre-trained models.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "19-Fpm4ArfP_"
   },
   "source": [
    "## Hyperparameter specification with YAML\n",
    "\n",
    "Machine learning systems often require the specification of several hyperparameters. In SpeechBrain, we do it with YAML. YAML allows us to specify the hyperparameters in an elegant,  flexible,  and transparent way.\n",
    "\n",
    "Let's see for instance this yaml snippet:\n",
    "\n",
    "\n",
    "```yaml\n",
    "dropout: 0.8\n",
    "compute_features: !new:speechbrain.lobes.features.MFCC\n",
    "    n_mels: 40\n",
    "    left_frames: 5\n",
    "    right_frames: 5\n",
    "\n",
    "model: !new:speechbrain.lobes.models.CRDNN.CRDNN\n",
    "   input_shape: [null, null, 440]\n",
    "   activation: !name:torch.nn.LeakyReLU []\n",
    "   dropout:  !ref <dropout>\n",
    "   cnn_blocks: 2\n",
    "   cnn_channels: (32, 16)\n",
    "   cnn_kernelsize: (3, 3)\n",
    "   time_pooling: True\n",
    "   rnn_layers: 2\n",
    "   rnn_neurons: 512\n",
    "   rnn_bidirectional: True\n",
    "   dnn_blocks: 2\n",
    "   dnn_neurons: 1024\n",
    "```\n",
    "\n",
    "As you can see, this is not just a plain list of hyperparameters. For each parameter, we specify the class (or function) that is going to use it. This makes the code **more transparent** and **easier to debug**.\n",
    "\n",
    "The YAML file contains all the information to initialize the classes when loading them. In SpeechBrain we load it with a special function called `load_hyperpyyaml`, which initializes for us all the declared classes. This makes the code extremely **readable** and **compact**.\n",
    "\n",
    "Our hyperpyyaml is an extension of the standard YAML. For an overview of all the supported functionalities, please take a look at the [YAML tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/hyperpyyaml.html).\n",
    "\n",
    "Note that all the hyperparameters can be overridden from the command line. For instance, to change the dropout factor:\n",
    "\n",
    "`python experiment.py params.yaml --dropout=0.5 `\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-yZGzRmFxJGg"
   },
   "source": [
    "## Experiment File\n",
    "The experiment file (e.g., `example_asr_ctc_experiment.py` in the example) trains a model by **combining** the functions or **classes declared in the yaml file**. This script defines the data processing pipeline and defines all the computations from the input signal to the final cost function. Everything is designed to be** easy to customize**.\n",
    "\n",
    "\n",
    "### Data Specification\n",
    "The user should prepare a data specification file (in **CSV** or JSON) format that reports all the data and the labels to process.\n",
    "\n",
    "For instance, in the minimal example ran before, the data specification file is this:\n",
    "\n",
    "\n",
    "```csv\n",
    "ID, duration, wav, wav_format, wav_opts, spk_id, spk_id_format, spk_id_opts, ali, ali_format, ali_opts, phn, phn_format, phn_opts,char,char_format,char_opts\n",
    "spk1_snt5,2.6,$data_folder/spk1_snt5.wav, wav, ,spk1,string, ,$data_folder/spk1_snt5.pkl,pkl, ,s ah n vcl d ey ih z dh ax vcl b eh s cl t cl p aa r dx ax v dh ax w iy cl,string, ,s u n d a y i s t h e b e s t p a r t o f t h e w e e k,string,\n",
    "spk2_snt5,1.98,$data_folder/spk2_snt5.wav, wav, ,spk2,string, ,$data_folder/spk2_snt5.pkl,pkl, ,vcl jh ah m cl p dh ax f eh n s ae n hh er iy ah cl p dh ax vcl b ae ng cl,string, ,k e n p a I r s l a c k f u l l f l a v o r,string,\n",
    "```\n",
    "\n",
    "You can open this file with a CSV reader for better rendering. For each row, you have an example with the corresponding paths to wav signal and labels.\n",
    "\n",
    "As an alternative, users can specify the data in a **JSON** format:\n",
    "\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"spk1_snt5\": {\n",
    "        \"wav\": \"{data_root}/spk1_snt5.wav\",\n",
    "        \"length\": 2.6,\n",
    "        \"spk_id\": \"spk1\",\n",
    "        \"ali\": \"{data_root}/spk1_snt5.pkl\",\n",
    "        \"phn\": \"s ah n vcl d ey ih z dh ax vcl b eh s cl t cl p aa r dx ax v dh ax w iy cl\",\n",
    "        \"char\": \"s u n d a y i s t h e b e s t p a r t o f t h e w e e k\"\n",
    "    },\n",
    "    \"spk2_snt5\": {\n",
    "        \"wav\": \"{data_root}/spk2_snt5.wav\",\n",
    "        \"length\": 1.98,\n",
    "        \"spk_id\": \"spk2\",\n",
    "        \"ali\": \"{data_root}/spk2_snt5.pkl\",\n",
    "        \"phn\": \"vcl jh ah m cl p dh ax f eh n s ae n hh er iy ah cl p dh ax vcl b ae ng cl\",\n",
    "        \"char\": \"k e n p a i r s l a c k f u l l f l a v o r\"\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "JSON is less compact than CSV but more flexible. For many applications, using the CSV file is enough. For more complex tasks (e.g, speaker diarization, speaker diarization + recognition), however, people might take advantage of the hierarchical structure offered by JSON.\n",
    "\n",
    "All datasets are formatted differently. In general, the users have to write a **data preparation** script that parses the target dataset and creates the data specification files. For all the proposed recipes, however, we also release the corresponding data preparation library.\n",
    "\n",
    "### Data processing pipeline\n",
    "Thanks to our Dynamic datasets, the data reading pipeline is fully customizable in the experiment file directly. For instance, in the minimal example, you can define the following intuitive function to read the audio file\n",
    "\n",
    "\n",
    "```python\n",
    "    # 2. Define audio pipeline:\n",
    "    @sb.utils.data_pipeline.takes(\"wav\")\n",
    "    @sb.utils.data_pipeline.provides(\"sig\")\n",
    "    def audio_pipeline(wav):\n",
    "        sig = sb.dataio.dataio.read_audio(wav)\n",
    "        return sig\n",
    "```\n",
    "\n",
    "The function takes in input the wav path and returns a signal read with the specified reader. In the variable `batch.sig` (see `example_asr_ctc_experiment.py`) you will have your batches of signals ready to be used. Note that here you can add any kind of processing (e.g, adding noise, speech change, dynamic mixing, etc) just by coding with the desired pipeline.\n",
    "\n",
    "A similar function should be written for all the entries that our script is supposed to process. The minimal example, for instance, reads a sequence of phoneme labels as well:\n",
    "\n",
    "\n",
    "```python\n",
    "    @sb.utils.data_pipeline.takes(\"phn\")\n",
    "    @sb.utils.data_pipeline.provides(\"phn_list\", \"phn_encoded\")\n",
    "    def text_pipeline(phn):\n",
    "        phn_list = phn.strip().split()\n",
    "        yield phn_list\n",
    "        phn_encoded = label_encoder.encode_sequence_torch(phn_list)\n",
    "        yield phn_encoded\n",
    "\n",
    "```\n",
    "Here, we read the phoneme list, separate each entry by space, and convert the list of phonemes to their corresponding indexes (using the label_encoder described [in this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html)).\n",
    "\n",
    "As you can see, we directly expose in the main script the data reading pipeline because this adds a lot of transparency and flexibility.\n",
    "\n",
    "### Custom forward and cost computation methods\n",
    "The other thing that users often want to customize is the sequence of computations that go from the input to the final cost function. In the experiment file, users are required to specify them in the `forward` and `compute_objectives` methods.\n",
    "In the minimal example, the forward method is defined as follows:\n",
    "\n",
    "\n",
    "```python\n",
    "    def compute_forward(self, batch, stage):\n",
    "        \"Given an input batch it computes the output probabilities.\"\n",
    "        wavs, lens = batch.sig\n",
    "        feats = self.hparams.compute_features(wavs)\n",
    "        feats = self.modules.mean_var_norm(feats, lens)\n",
    "        x = self.modules.model(feats)\n",
    "        x = self.modules.lin(x)\n",
    "        outputs = self.hparams.softmax(x)\n",
    "```\n",
    "\n",
    "The input is the variable batch that contains all the entries specified in the data loader (e.g, we have `batch.sig` and `batch.phn_encoded`). As you can see, we compute the features, we perform a mean and variance normalization, and we call the model. Finally, a linear transformation + softmax is applied.\n",
    "\n",
    "The compute objective function looks like this:\n",
    "\n",
    "```python\n",
    "    def compute_objectives(self, predictions, batch, stage):\n",
    "        \"Given the network predictions and targets computed the CTC loss.\"\n",
    "        predictions, lens = predictions\n",
    "        phns, phn_lens = batch.phn_encoded\n",
    "        loss = self.hparams.compute_cost(predictions, phns, lens, phn_lens)\n",
    "\n",
    "        if stage != sb.Stage.TRAIN:\n",
    "            seq = sb.decoders.ctc_greedy_decode(\n",
    "                predictions, lens, blank_id=self.hparams.blank_index\n",
    "            )\n",
    "            self.per_metrics.append(batch.id, seq, phns, target_len=phn_lens)\n",
    "\n",
    "        return loss\n",
    "```\n",
    "We take the predictions done in the forward step and compute a cost function using the encoded labels in batch.phn_encoded. During validation/test, we also perform actual decoding on the speech sequence (in this case using a greedy decoder and, in a more general case, using beam search) to monitor the performance.\n",
    "\n",
    "### Brain Class\n",
    "To make training easier, we implemented a simple trainer called **Brain class**. The Brain class defines a set of customizable routines that implement all the steps needed in standard **training and validation loops**. After defining the data pipeline, the forward, the compute objective, and other custom methods, you can call the fit method of the brain class for training (and the eval one for the test):\n",
    "\n",
    "```python\n",
    "    # Trainer initialization\n",
    "    ctc_brain = CTCBrain(hparams[\"modules\"], hparams[\"opt_class\"], hparams)\n",
    "\n",
    "    # Training/validation loop\n",
    "    ctc_brain.fit(\n",
    "        range(hparams[\"N_epochs\"]),\n",
    "        train_data,\n",
    "        valid_data,\n",
    "        train_loader_kwargs=hparams[\"dataloader_options\"],\n",
    "        valid_loader_kwargs=hparams[\"dataloader_options\"],\n",
    "    )\n",
    "    # Evaluation is run separately (now just evaluating on valid data)\n",
    "    ctc_brain.evaluate(valid_data)\n",
    "```\n",
    "For a more detailed description, take a look at the [Brain class tutorial here](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/brain-class.html\n",
    ").\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gOE1UED8z05V"
   },
   "source": [
    "## Pretrain and use\n",
    "Sometimes you might only want to use a pre-trained model rather than training it from scratch. For instance, you might want to transcribe an audio file, compute speaker embeddings, apply a voice activity detector, and doing many other operations in your scripts. To make this easier, we uploaded several models in [HuggingFace](https://huggingface.co/speechbrain/). The models uses inference classes to make inference easier. For instance, to transcribe an audio file with a model trained with librispeech you can simply do:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PQonIcrNhebC"
   },
   "outputs": [],
   "source": [
    "from speechbrain.inference.ASR import EncoderDecoderASR\n",
    "\n",
    "asr_model = EncoderDecoderASR.from_hparams(source=\"speechbrain/asr-crdnn-rnnlm-librispeech\", savedir=\"pretrained_models/asr-crdnn-rnnlm-librispeech\")\n",
    "asr_model.transcribe_file('speechbrain/asr-crdnn-rnnlm-librispeech/example.wav')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UzbhBKksjTH4"
   },
   "source": [
    "As you can see, in this case there is a matching between the text uttered by the speaker and the content of the audio file.\n",
    "We have similar functions for speaker recognition, speech separation, enhancement."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_YgdsT3x_xz8"
   },
   "source": [
    "## Folder Organization\n",
    "The main folder is organized in this way:\n",
    "\n",
    "*   **SpeechBrain** contains the main libraries of SpeechBrain. You can find here the core.py that implements core functionalities such as the Brain class. You also find here libraries for data loading, decoders, neural networks, signal processing, and many others. Under the folder lobe, you can find combinations of basic functionalities that we think are useful for speech and audio processing. For instance, you can find here the implementation of features like FBANKs and MFCCs, the data augmentation functions, as well as some popular neural networks used a lot in the recipes.\n",
    "*   **Recipes** contains training scripts for several speech datasets. For instance, you can find recipes for *LibriSpeech*, *TIMIT*, *VoxCeleb*, *VoiceBank*, and many others.\n",
    "*   **Samples** is a tiny speech dataset used for training minimal examples and to perform debug tests.\n",
    "*   **Test** is a collection of unit and integration tests that we use for debugging and continuous integration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "raC03quSSCYC"
   },
   "source": [
    "## Tensor Format\n",
    "All the tensors within SpeechBrain are formatted using the following convention:\n",
    "\n",
    "`tensor=(batch, time_steps, channels[optional])`\n",
    "\n",
    "The batch is always the first element, and time steps are always the second one. The remaining dimensions are channels, which are options and there might be as many as you need).\n",
    "\n",
    "Let's now some examples. For instance, let's try to compute the FBANKS of an input signal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "slYIC4fbXKER"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from speechbrain.lobes.features import Fbank\n",
    "\n",
    "signal = torch.rand([4, 16000]) # [batch, time]\n",
    "print(signal.shape)\n",
    "\n",
    "fbank_maker = Fbank()\n",
    "fbanks = fbank_maker(signal) # [batch, time, features]\n",
    "print(fbanks.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fZVzo5sUWUih"
   },
   "source": [
    "The `Fbank` function expects in input a signal formatted as `[batch, time]`. It returns the features in the format `[batch, time, features]`.\n",
    "\n",
    "Let's now try to compute the STFT of any audio signal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zto6xdG6ZRm1"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from speechbrain.dataio.dataio import read_audio\n",
    "from speechbrain.processing.features import STFT\n",
    "\n",
    "signal = torch.rand([4, 1600]) # [batch, time]\n",
    "print(signal.shape)\n",
    "\n",
    "compute_STFT = STFT(sample_rate=16000, win_length=25, hop_length=10, n_fft=400)\n",
    "signal_STFT = compute_STFT(signal) #[batch, time, channel1, channel2]\n",
    "print(signal_STFT.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tCLEwNFmaJpx"
   },
   "source": [
    "The output here is `[batch, time, channel1, channel2]`, where  `channel1` is the feature axis and `channel2` is the real and imaginary part."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sAWfaD24bAvU"
   },
   "source": [
    "**Why do we need a tensor format?**\n",
    "Defining a tensor format makes model combination easier. Many formats are possible. For SpeechBrain, we selected this one because it is commonly used in recurrent neural networks.\n",
    "\n",
    "In SpeechBrain, the basic building blocks of the neural networks (e.g, *RNN*, *CNN*, *normalization*, *pooling*, ...) are designed to support the same tensor format and can thus be combined smoothly.\n",
    "\n",
    "To convince you about that, let's try to combine a CNN and an RNN using SpeechBrain:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fURQxwJvcLf5"
   },
   "outputs": [],
   "source": [
    "from speechbrain.nnet.CNN import Conv1d\n",
    "from  speechbrain.nnet.RNN import LSTM\n",
    "\n",
    "inp_tensor = torch.rand([10, 15, 40])\n",
    "print(inp_tensor.shape)\n",
    "\n",
    "# CNN\n",
    "CNN = Conv1d(input_shape=inp_tensor.shape, out_channels=8, kernel_size=5)\n",
    "cnn_out = CNN(inp_tensor)\n",
    "print(cnn_out.shape)\n",
    "\n",
    "\n",
    "# RNN\n",
    "RNN = LSTM(input_shape=cnn_out.shape, hidden_size=256, num_layers=1)\n",
    "rnn_out, _ = RNN(cnn_out)\n",
    "print(rnn_out.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GUZcFaxqfAKd"
   },
   "source": [
    "The combination is done without any tensor reshaping (e.g, we don't have to transpose, squeeze, unsqueeze). The basic nnet functions are a wrapper of the original pytorch functions.  The difference is that er manage for you all the annoying tensor reshaping operations. This makes the code cleaner and easier to follow.\n",
    "Let's try to do the same operation with raw PyTorch:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZGedYMKMgDHA"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "inp_tensor = torch.rand([10, 15, 40])\n",
    "print(inp_tensor.shape)\n",
    "\n",
    "# CNN\n",
    "CNN = torch.nn.Conv1d(in_channels=40, out_channels=8, kernel_size=5)\n",
    "inp_tensor_tr = inp_tensor.transpose(1,2) # requires (N,C,L)\n",
    "cnn_out_tr = CNN(inp_tensor_tr)\n",
    "print(cnn_out_tr.shape)\n",
    "\n",
    "# RNN\n",
    "cnn_out_tr2 = cnn_out_tr.transpose(1,2)\n",
    "RNN = torch.nn.LSTM(input_size=8, hidden_size=256, num_layers=1)\n",
    "rnn_out, _ = RNN(cnn_out_tr2)\n",
    "print(rnn_out.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K_UyB9yzi0PY"
   },
   "source": [
    "The raw pytorch approach requires two transpose operations because of the different tensor formats used in CNN and RNN modules. In SpeechBrain, this is managed internally and users do not have to worry about it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sb_auto_footer",
    "tags": [
     "sb_auto_footer"
    ]
   },
   "source": [
    "## Citing SpeechBrain\n",
    "\n",
    "If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:\n",
    "\n",
    "```bibtex\n",
    "@misc{speechbrainV1,\n",
    "  title={Open-Source Conversational AI with {SpeechBrain} 1.0},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},\n",
    "  year={2024},\n",
    "  eprint={2407.00463},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={cs.LG},\n",
    "  url={https://arxiv.org/abs/2407.00463},\n",
    "}\n",
    "@misc{speechbrain,\n",
    "  title={{SpeechBrain}: A General-Purpose Speech Toolkit},\n",
    "  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},\n",
    "  year={2021},\n",
    "  eprint={2106.04624},\n",
    "  archivePrefix={arXiv},\n",
    "  primaryClass={eess.AS},\n",
    "  note={arXiv:2106.04624}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}