{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "sb_auto_header", "tags": [ "sb_auto_header" ] }, "source": [ "\n", "\n", "\n", "[Open in Google Colab](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/advanced/dynamic-batching.ipynb)\n", "to execute or view/download this notebook on\n", "[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/advanced/dynamic-batching.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "1lMT9knfou6A" }, "source": [ "# Dynamic Batching: What it is and why it is sometimes necessary\n", "\n", "Batching examples together is a crucial optimization that significantly accelerates training. Combined with distributed training across multiple GPUs, it makes it possible to train models with large parameter counts on extensive datasets in a matter of days instead of months.\n", "\n", "The conventional approach groups examples together using a fixed batch size. However, when inputs differ in size, as is often the case in audio or natural language processing (NLP) applications, each example in a batch must be padded to match the size of the largest one in that batch.\n", "\n", "While this is common practice, it becomes inefficient when the lengths of the examples vary widely: a substantial portion of the computation is then performed on padded values and is wasted. Dynamic batching addresses this issue by grouping variable-length sequences so that less computation is spent on padding.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "tILFmgtVDaJK" }, "source": [ "To illustrate this point, let's look at **MiniLibriSpeech**, which is a subset of LibriSpeech. 
Let's download this dataset, along with other tools from the [data-io tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html), which uses the same data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "executionInfo": { "elapsed": 23640, "status": "ok", "timestamp": 1718826449038, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "-Xb6KaE6DkvI" }, "outputs": [], "source": [ "%%capture\n", "# here we download the material needed for this tutorial: images and an example based on mini-librispeech\n", "!wget https://www.dropbox.com/s/b61lo6gkpuplanq/MiniLibriSpeechTutorial.tar.gz?dl=0\n", "!tar -xvzf MiniLibriSpeechTutorial.tar.gz?dl=0\n", "# downloading the mini_librispeech training data (train-clean-5)\n", "!wget https://www.openslr.org/resources/31/train-clean-5.tar.gz\n", "!tar -xvzf train-clean-5.tar.gz" ] }, { "cell_type": "markdown", "metadata": { "id": "TgEBlx-iVht4" }, "source": [ "Next, we install `speechbrain`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "executionInfo": { "elapsed": 114162, "status": "ok", "timestamp": 1718826563198, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "fVHJYKO8tOic" }, "outputs": [], "source": [ "%%capture\n", "# Installing SpeechBrain via pip\n", "BRANCH = 'develop'\n", "!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH" ] }, { "cell_type": "markdown", "metadata": { "id": "n6sGSbUUEitE" }, "source": [ "Now, let's look at how long each audio file in this dataset is and how the lengths are distributed.\n", "\n", "We can plot a histogram of the audio durations, reading each file's metadata with `torchaudio`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 490 }, "executionInfo": { "elapsed": 25614, "status": "ok", "timestamp": 1718826588807, "user": { "displayName": "adel moumen", 
"userId": "01620107593621714109" }, "user_tz": 240 }, "id": "eTcGLwrwEtQG", "outputId": "3322dbec-68bd-479e-b427-338295c25d2f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of audio files in MiniLibriSpeech train-clean-5: 1519\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAkMAAAHHCAYAAAC88FzIAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA7YklEQVR4nO3deVxV1f7/8fcBZHAARAVEEZxx1tCUKLFEcQhzyiH1qnnLDJwb1MypAeua4zXN7lXLtJuWmlpqpohWOKRZmuaUUyaaoeCQYLB/f/j1/DyCCnjgiPv1fDzO48HZe+21PwtI3q299j4WwzAMAQAAmJSTowsAAABwJMIQAAAwNcIQAAAwNcIQAAAwNcIQAAAwNcIQAAAwNcIQAAAwNcIQAAAwNcIQAAAwNcIQkAcWi0Xjxo3L07HBwcHq06fPXZ9z3LhxslgsOnv2bJ7quFW/95Pg4GA9/vjjji4jV5o1a6ZmzZoV+Hn79Omj4ODgAj8vcC8gDMG05s+fL4vFIovFom+++SbLfsMwFBgYKIvFkq9/UDdu3CiLxaJPP/00386RW7t371bnzp0VFBQkd3d3lStXTi1atNCMGTMcXZrD7d27V+PGjdPRo0cdXYrpHD161Prf7M2v//3vf44uD4WYi6MLABzN3d1dixYt0sMPP2yzPSEhQb/99pvc3NyyHPPXX3/JxSVv//ns379fTk65//+Quzlnbvr97rvv9Oijj6pChQp65pln5O/vrxMnTmjLli2aNm2aBg4caPcaCpO9e/dq/PjxatasWb7MpHz11Vd27/N+0717d7Vp08ZmW1hYmIOqwf2AMATTa9OmjZYsWaLp06fbhIJFixYpNDQ028tQ7u7ueT5fduEqJ+7mnDfLzMxUenq63N3ds/T7xhtvyMvLS9u3b5e3t7fNvjNnztitBjMwDENXrlyRh4dHjo9xdXXNx4ruDw888IB69uzp6DJwH+EyGUyve/fu+vPPP7Vu3TrrtvT0dH366ad66qmnsj3mVut3Dh06pD59+sjb21teXl7q27evLl++bHOsvdYMXXf27Fl16dJFnp6eKlWqlAYPHqwrV65kOTY2NlYLFy5UrVq15ObmpjVr1mTb7+HDh1WrVq0sQUiSfH19b9lv9erV5e7urtDQUG3atCnLsSdPntTTTz8tPz8/ubm5qVatWpo7d26WdmlpaRo7dqyqVKkiNzc3BQYG6qWXXlJaWlqWth999JEefPBBFS1aVCVLllTTpk2znVn55ptv9OCDD8rd3V2VKlXShx9+mKXN4cOHdfjw4SzbbzR//nw9+eSTkqRHH33Ueolm48aNkv7/GqW1a9eqYcOG8vDw0HvvvSdJmjdvnh577DH5+vrKzc1NNWvW1KxZs7Kc4+Y1Q9cvoy5evFhvvPGGypcvL3d3dzVv3lyHDh26bb03Wr16tSIiIlSiRAl5enqqUaNGWrRo0W2PyczM1NSpU1WrVi25u7vLz89P/fv317lz52zaff7552rbtq0CAgLk5uamypUr67XXXlNGRkaWsdWuXVt79+7Vo48+qqJFi6pcuXJ6++23czyO6y5duqT09PRcHwdkhzAE0wsODlZYWJg+/vhj67bVq1crJSVF3bp1y1VfXbp00YULFxQXF6cuXbpo/vz5Gj9+vL1LznLOK1euKC4uTm3atNH06dP17LPPZmm3YcMGDR06VF27dtW0ad
NueYknKChIO3bs0J49e3J0/oSEBA0ZMkQ9e/bUhAkT9Oeff6pVq1Y2x58+fVpNmjTR119/rdjYWE2bNk1VqlRRv379NHXqVGu7zMxMtWvXTpMmTVJ0dLRmzJih9u3ba8qUKeratavNecePH69evXqpSJEimjBhgsaPH6/AwEBt2LDBpt2hQ4fUuXNntWjRQu+8845KliypPn366Oeff7Zp17x5czVv3vy2Y23atKkGDRokSRo1apQWLFigBQsWqEaNGtY2+/fvV/fu3dWiRQtNmzZN9evXlyTNmjVLQUFBGjVqlN555x0FBgbq+eef18yZM3P0fZ44caKWLVumF154QSNHjtSWLVvUo0ePHB07f/58tW3bVsnJyRo5cqQmTpyo+vXrWwPxrfTv318vvviiwsPDNW3aNPXt21cLFy5UVFSUrl69atN/8eLFNWzYME2bNk2hoaEaM2aMRowYkaXPc+fOqVWrVqpXr57eeecdhYSE6OWXX9bq1atzNBbp2s++ePHicnd3V6NGjbi0iLtnACY1b948Q5Kxfft249///rdRokQJ4/Lly4ZhGMaTTz5pPProo4ZhGEZQUJDRtm1bm2MlGWPHjrW+Hzt2rCHJePrpp23adejQwShVqpTNtqCgIKN3797W9/Hx8YYkY8mSJbet91bnbNeunU27559/3pBk/PjjjzbHOjk5GT///PMd+/3qq68MZ2dnw9nZ2QgLCzNeeuklY+3atUZ6enq2x0oyvv/+e+u2Y8eOGe7u7kaHDh2s2/r162eULVvWOHv2rM3x3bp1M7y8vKzf9wULFhhOTk7G5s2bbdrNnj3bkGR8++23hmEYxsGDBw0nJyejQ4cORkZGhk3bzMxM69dBQUGGJGPTpk3WbWfOnDHc3NyM4cOH2xwXFBRkBAUFZRnjzZYsWWJIMuLj47Psu36+NWvWZNl3fYw3ioqKMipVqmSzLSIiwoiIiLC+v/77UaNGDSMtLc26fdq0aYYkY/fu3bet9/z580aJEiWMxo0bG3/99ZfNvhu/V71797YZ/+bNmw1JxsKFC22OWbNmTZbt2Y2tf//+RtGiRY0rV67YjE2S8eGHH1q3paWlGf7+/kanTp1uOw7DuPa71bJlS2PWrFnGihUrjKlTpxoVKlQwnJycjFWrVt3xeOBWmBkCdG125a+//tKqVat04cIFrVq16paXyG7nueees3n/yCOP6M8//1Rqaqq9Ss0iJibG5v31Bc5ffvmlzfaIiAjVrFnzjv21aNFCiYmJateunX788Ue9/fbbioqKUrly5bRixYos7cPCwhQaGmp9X6FCBT3xxBNau3atMjIyZBiGPvvsM0VHR8swDJ09e9b6ioqKUkpKinbu3ClJWrJkiWrUqKGQkBCbdo899pgkKT4+XpK0fPlyZWZmasyYMVkWo1ssFpv3NWvW1COPPGJ9X6ZMGVWvXl2//vqrTbujR4/a5Q6xihUrKioqKsv2G9cNpaSk6OzZs4qIiNCvv/6qlJSUO/bbt29fm/VE18d08zhutm7dOl24cEEjRozIsj7s5u/VjZYsWSIvLy+1aNHC5mcRGhqq4sWLW38WN4/twoULOnv2rB555BFdvnxZv/zyi02/xYsXt1nv4+rqqgcffPCO45Cu/W6tXbtWzz33nKKjozV48GD98MMPKlOmjIYPH37H44FbYQE1oGt/ICMjI7Vo0SJdvnxZGRkZ6ty5c677qVChgs37kiVLSrp2acDT09Mutd6satWqNu8rV64sJyenLH/YK1asmOM+GzVqpKVLlyo9PV0//vijli1bpilTpqhz587atWuXTai6+fySVK1aNV2+fFl//PGHnJycdP78ec2ZM0dz5szJ9nzXF2YfPHhQ+/btU5kyZW7b7vDhw3JycspRuLv5ZyJd+7ncvO7FXm71ff722281duxYJSYmZllHlpKSIi8vr9v2e7vfLUm6ePGiLl68aN3v7OysMmXKWNdB1a5dO1fjOHjwoFJSUr
KsE7vuxsX0P//8s0aPHq0NGzZkCf43B73y5ctnCWElS5bUTz/9ZH2flJRks9/Ly+uWi9B9fHzUt29fTZw4Ub/99pvKly9/58EBNyEMAf/nqaee0jPPPKOkpCS1bt062wXEd+Ls7JztdsMw7rK6nLvV/+3n5o6m61xdXdWoUSM1atRI1apVU9++fbVkyRKNHTs2x31kZmZKknr27KnevXtn26Zu3brWtnXq1NHkyZOzbRcYGJjLERT8zyS77/Phw4fVvHlzhYSEaPLkyQoMDJSrq6u+/PJLTZkyxfo9up07jWPSpEk269OCgoLuaqYrMzNTvr6+WrhwYbb7rwfW8+fPKyIiQp6enpowYYIqV64sd3d37dy5Uy+//HKWseXk51G2bFmbffPmzbvtTQfXfy+Sk5MJQ8gTwhDwfzp06KD+/ftry5Yt+uSTTxxdTo4dPHjQZjbi0KFDyszMtPszcBo2bChJOnXqVJbz3+zAgQMqWrSo9Q9miRIllJGRocjIyNueo3Llyvrxxx/VvHnz217CqVy5sjIzM7V3717rAuWCcru6bmXlypVKS0vTihUrbGZ4brzUdLf+8Y9/2Dwr63ooq1y5siRpz549qlKlSo77q1y5sr7++muFh4ffNkhv3LhRf/75p5YuXaqmTZtatx85ciS3Q7C68c5OSapVq9Zt21+/xHarGUXgTlgzBPyf4sWLa9asWRo3bpyio6MdXU6O3Xw30vWnRLdu3TpP/cXHx2c7a3J9DVL16tVtticmJlrX/EjSiRMn9Pnnn6tly5ZydnaWs7OzOnXqpM8++yzbO9T++OMP69ddunTRyZMn9f7772dp99dff+nSpUuSpPbt28vJyUkTJkzIMvOQ1xmfnNxaL0nFihWTdG1GJKeuz4bcWFtKSormzZuXuyJvo1KlSoqMjLS+wsPDJUktW7ZUiRIlFBcXl+WRC7f7XnXp0kUZGRl67bXXsuz7+++/rePPbmzp6el699138zyWG8cRGRlpnSm68XflupMnT2ru3LmqW7dulhklIKeYGQJucKvLOAXhs88+y7LYVLpW0+0uDx05ckTt2rVTq1atlJiYqI8++khPPfWU6tWrl6c6Bg4cqMuXL6tDhw4KCQlRenq6vvvuO33yyScKDg5W3759bdrXrl1bUVFRGjRokNzc3Kx/BG+8ZDNx4kTFx8ercePGeuaZZ1SzZk0lJydr586d+vrrr5WcnCxJ6tWrlxYvXqznnntO8fHxCg8PV0ZGhn755RctXrzY+vyeKlWq6JVXXtFrr72mRx55RB07dpSbm5u2b9+ugIAAxcXF5Xrc12+rv9Olpfr168vZ2VlvvfWWUlJS5ObmZn1+0K20bNlSrq6uio6OVv/+/XXx4kW9//778vX1zTLTZm+enp6aMmWK/vnPf6pRo0Z66qmnVLJkSf3444+6fPmyPvjgg2yPi4iIUP/+/RUXF6ddu3apZcuWKlKkiA4ePKglS5Zo2rRp6ty5sx566CGVLFlSvXv31qBBg2SxWLRgwYJ8uQz50ksvWS85BgQE6OjRo3rvvfd06dIlTZs2ze7ng3kQhoB7xK0+W6lZs2a3DUOffPKJ9ZkuLi4uio2N1b/+9a881zFp0iQtWbJEX375pebMmaP09HRVqFBBzz//vEaPHp1lLVVERITCwsI0fvx4HT9+XDVr1tT8+fOt64Akyc/PT9u2bdOECRO0dOlSvfvuuypVqpRq1aqlt956y9rOyclJy5cv15QpU/Thhx9q2bJlKlq0qCpVqqTBgwerWrVq1rYTJkxQxYoVNWPGDL3yyisqWrSo6tatq169euV57Dnh7++v2bNnKy4uTv369VNGRobi4+NvG4aqV6+uTz/9VKNHj9YLL7wgf39/DRgwQGXKlNHTTz+dr/VKUr9+/eTr66uJEyfqtddeU5EiRRQSEqKhQ4fe9rjZs2crNDRU7733nkaNGiUXFxcFBwerZ8+e1pmnUqVKadWqVRo+fLhGjx6tki
VLqmfPnmrevHm2d9XdjZYtW2r27NmaOXOmzp07J29vbzVt2lSjR4/WAw88YNdzwVwsRkGu7ARwX7FYLIqJidG///1vR5cCAHnGmiEAAGBqhCEAAGBqhCEAAGBqLKAGkGcsOQRwP2BmCAAAmBphCAAAmBqXyXTtM3h+//13lShRIk+P2gcAAAXPMAxduHBBAQEBcnLK+/wOYUjS77//nqcPgAQAAI534sSJu/qQXsKQrn2IpHTtm+np6engagAAQE6kpqYqMDDQ+nc8rwhD+v+fQu3p6UkYAgCgkLnbJS4soAYAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKZGGAIAAKbm4ugCAAAwu+ARX9itr6MT29qtL7NgZggAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgaYQgAAJgan1oPAEAe2POT5uFYzAwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTIwwBAABTc2gYiouLU6NGjVSiRAn5+vqqffv22r9/v02bK1euKCYmRqVKlVLx4sXVqVMnnT592qbN8ePH1bZtWxUtWlS+vr568cUX9ffffxfkUAAAQCHl0DCUkJCgmJgYbdmyRevWrdPVq1fVsmVLXbp0ydpm6NChWrlypZYsWaKEhAT9/vvv6tixo3V/RkaG2rZtq/T0dH333Xf64IMPNH/+fI0ZM8YRQwIAAIWMxTAMw9FFXPfHH3/I19dXCQkJatq0qVJSUlSmTBktWrRInTt3liT98ssvqlGjhhITE9WkSROtXr1ajz/+uH7//Xf5+flJkmbPnq2XX35Zf/zxh1xdXe943tTUVHl5eSklJUWenp75OkYAwP0heMQXji4hW0cntnV0CQXGXn+/76k1QykpKZIkHx8fSdKOHTt09epVRUZGWtuEhISoQoUKSkxMlCQlJiaqTp061iAkSVFRUUpNTdXPP/+c7XnS0tKUmppq8wIAAOZ0z4ShzMxMDRkyROHh4apdu7YkKSkpSa6urvL29rZp6+fnp6SkJGubG4PQ9f3X92UnLi5OXl5e1ldgYKCdRwMAAAqLeyYMxcTEaM+ePfrf//6X7+caOXKkUlJSrK8TJ07k+zkBAMC9ycXRBUhSbGysVq1apU2bNql8+fLW7f7+/kpPT9f58+dtZodOnz4tf39/a5tt27bZ9Hf9brPrbW7m5uYmNzc3O48CAAAURg6dGTIMQ7GxsVq2bJk2bNigihUr2uwPDQ1VkSJFtH79euu2/fv36/jx4woLC5MkhYWFaffu3Tpz5oy1zbp16+Tp6amaNWsWzEAAAECh5dCZoZiYGC1atEiff/65SpQoYV3j4+XlJQ8PD3l5ealfv34aNmyYfHx85OnpqYEDByosLExNmjSRJLVs2VI1a9ZUr1699PbbbyspKUmjR49WTEwMsz8AAOCOHBqGZs2aJUlq1qyZzfZ58+apT58+kqQpU6bIyclJnTp1UlpamqKiovTuu+9a2zo7O2vVqlUaMGCAwsLCVKxYMfXu3VsTJkwoqGEAAIBC7J56zpCj8JwhAEBu8Zwhx7svnzMEAABQ0AhDAADA1O6JW+sBACgI9+qlLTgWM0MAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAM
DUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUHBqGNm3apOjoaAUEBMhisWj58uU2+/v06SOLxWLzatWqlU2b5ORk9ejRQ56envL29la/fv108eLFAhwFAAAozBwahi5duqR69epp5syZt2zTqlUrnTp1yvr6+OOPbfb36NFDP//8s9atW6dVq1Zp06ZNevbZZ/O7dAAAcJ9wceTJW7durdatW9+2jZubm/z9/bPdt2/fPq1Zs0bbt29Xw4YNJUkzZsxQmzZtNGnSJAUEBNi9ZgAAcH+559cMbdy4Ub6+vqpevboGDBigP//807ovMTFR3t7e1iAkSZGRkXJyctLWrVtv2WdaWppSU1NtXgAAwJwcOjN0J61atVLHjh1VsWJFHT58WKNGjVLr1q2VmJgoZ2dnJSUlydfX1+YYFxcX+fj4KCkp6Zb9xsXFafz48fldPgDADoJHfOHoEnCfu6fDULdu3axf16lTR3Xr1lXlypW1ceNGNW/ePM/9jhw5UsOGDbO+T01NVWBg4F3VCgAACqd7/jLZjSpVqqTSpUvr0KFDkiR/f3+dOXPGps3ff/+t5OTkW64zkq6tQ/L09LR5AQAAcypUYei3337Tn3/+qbJly0qSwsLCdP78ee3YscPaZsOGDcrMzFTjxo0dVSYAAChEHHqZ7OLFi9ZZHkk6cuSIdu3aJR8fH/n4+Gj8+PHq1KmT/P39dfjwYb300kuqUqWKoqKiJEk1atRQq1at9Mwzz2j27Nm6evWqYmNj1a1bN+4kAwAAOeLQmaHvv/9eDRo0UIMGDSRJw4YNU4MGDTRmzBg5Ozvrp59+Urt27VStWjX169dPoaGh2rx5s9zc3Kx9LFy4UCEhIWrevLnatGmjhx9+WHPmzHHUkAAAQCHj0JmhZs2ayTCMW+5fu3btHfvw8fHRokWL7FkWAAAwkUK1ZggAAMDeCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDUCEMAAMDU7joMZWRkaNeuXTp37pw96gEAAChQuQ5DQ4YM0X//+19J14JQRESEHnjgAQUGBmrjxo32rg8AACBf5ToMffrpp6pXr54kaeXKlTpy5Ih++eUXDR06VK+88ordCwQAAMhPuQ5DZ8+elb+/vyTpyy+/1JNPPqlq1arp6aef1u7du+1eIAAAQH7KdRjy8/PT3r17lZGRoTVr1qhFixaSpMuXL8vZ2dnuBQIAAOQnl9we0LdvX3Xp0kVly5aVxWJRZGSkJGnr1q0KCQmxe4EAAAD5KddhaNy4capdu7ZOnDihJ598Um5ubpIkZ2dnjRgxwu4FAgAA5KdchyFJ6ty5c5ZtvXv3vutiAAAAClqenjOUkJCg6OhoValSRVWqVFG7du20efNme9cGAACQ73Idhj766CNFRkaqaNGiGjRokAYNGiQPDw81b95cixYtyo8aAQAA8o3FMAwjNwfUqFFDzz77rIYOHWqzffLkyXr//fe1b98+uxZYEFJTU+Xl5aWUlBR5eno6uhwAwA2CR3zh6BIKlaMT2zq6hAJjr7/fuZ4Z+vXXXxUdHZ1le7t27XTkyJE8FwIAAOAIuQ5DgYGBWr9+fZbtX3/9tQIDA+1SFAAAQEHJ9d1kw4cP16BBg7Rr1y499NBDkq
Rvv/1W8+fP17Rp0+xeIAAAQH7KdRgaMGCA/P399c4772jx4sWSrq0j+uSTT/TEE0/YvUAAAID8lKfnDHXo0EEdOnSwdy0AAAAFLk/PGQIAALhf5GhmyMfHRwcOHFDp0qVVsmRJWSyWW7ZNTk62W3EAAAD5LUdhaMqUKSpRooQkaerUqflZDwAAQIHKURi68XPH+AwyAABwP8lRGEpNTc1xhzzBGQAAFCY5CkPe3t63XSd0o4yMjLsqCAAAoCDlKAzFx8dbvz569KhGjBihPn36KCwsTJKUmJioDz74QHFxcflTJQAAQD7JURiKiIiwfj1hwgRNnjxZ3bt3t25r166d6tSpozlz5rCmCAAAFCq5fs5QYmKiGjZsmGV7w4YNtW3bNrsUBQAAUFDy9EGt77//fpbt//nPf/igVgAAUOjk+uM4pkyZok6dOmn16tVq3LixJGnbtm06ePCgPvvsM7sXCAAAkJ9yPTPUpk0bHThwQNHR0UpOTlZycrKio6N14MABtWnTJj9qBAAAyDd5+qDWwMBAvfnmm/auBQAAoMDlOgxt2rTptvubNm2a52IAAAAKWq7DULNmzbJsu/GBjDx0EQAAFCa5XjN07tw5m9eZM2e0Zs0aNWrUSF999VV+1AgAAJBvcj0z5OXllWVbixYt5OrqqmHDhmnHjh12KQwAAKAg5Hpm6Fb8/Py0f/9+e3UHAABQIHI9M/TTTz/ZvDcMQ6dOndLEiRNVv359e9UFAABQIHIdhurXry+LxSLDMGy2N2nSRHPnzrVbYQAAAAUh12HoyJEjNu+dnJxUpkwZubu7260oAACAgpLrMBQUFJQfdQAAADhEnp5AfenSJSUkJOj48eNKT0+32Tdo0CC7FAYAAFAQch2GfvjhB7Vp00aXL1/WpUuX5OPjo7Nnz6po0aLy9fUlDAEAgEIl17fWDx06VNHR0Tp37pw8PDy0ZcsWHTt2TKGhoZo0aVJ+1AgAAJBvch2Gdu3apeHDh8vJyUnOzs5KS0tTYGCg3n77bY0aNSo/agQAAMg3uQ5DRYoUkZPTtcN8fX11/PhxSdeeTH3ixAn7VgcAAJDPcr1mqEGDBtq+fbuqVq2qiIgIjRkzRmfPntWCBQtUu3bt/KgRAAAg3+R6ZujNN99U2bJlJUlvvPGGSpYsqQEDBuiPP/7QnDlz7F4gAABAfsr1zFDDhg2tX/v6+mrNmjV2LQgAAKAg2e2DWgEAAAqjPD10EQCAOwke8YWjSwByhJkhAABgaoQhAABgaoQhAABgankKQ7GxsUpOTrZ3LQAAAAUux2Hot99+s369aNEiXbx4UZJUp04dnjwNAAAKrRzfTRYSEqJSpUopPDxcV65c0YkTJ1ShQgUdPXpUV69ezc8aAQAA8k2OZ4bOnz+vJUuWKDQ0VJmZmWrTpo2qVaumtLQ0rV27VqdPn871yTdt2qTo6GgFBATIYrFo+fLlNvsNw9CYMWNUtmxZeXh4KDIyUgcPHrRpk5ycrB49esjT01Pe3t7q16+fddYKAADgTnIchq5evaoHH3xQw4cPl4eHh3744QfNmzdPzs7Omjt3ripWrKjq1avn6uSXLl1SvXr1NHPmzGz3v/3225o+fbpmz56trVu3qlixYoqKitKVK1esbXr06KGff/5Z69at06pVq7Rp0yY9++yzuaoDAACYV44vk3l7e6t+/foKDw9Xenq6/vrrL4WHh8vFxUWffPKJypUrp+3bt+fq5K1bt1br1q2z3WcYhqZOnarRo0friSeekCR9+OGH8vPz0/Lly9WtWzft27dPa9as0fbt260fEzJjxgy1adNGkyZNUkBAQK7qAQAA5pPjMHTy5EklJibqu+++099//63Q0FA1atRI6enp2rlzp8qXL6+HH37YboUdOXJESUlJioyMtG7z8vJS48aNlZiYqG7duikxMVHe3t42n5cWGRkpJycnbd26VR06dMi277S0NKWlpVnfp6am2q1uACjMeGo0zCjHl8lKly
6t6OhoxcXFqWjRotq+fbsGDhwoi8WiF154QV5eXoqIiLBbYUlJSZIkPz8/m+1+fn7WfUlJSfL19bXZ7+LiIh8fH2ub7MTFxcnLy8v6CgwMtFvdAACgcMnzQxe9vLzUpUsXFSlSRBs2bNCRI0f0/PPP27O2fDNy5EilpKRYXzwaAAAA88rTB7X+9NNPKleunCQpKChIRYoUkb+/v7p27Wq3wvz9/SVJp0+fVtmyZa3bT58+rfr161vbnDlzxua4v//+W8nJydbjs+Pm5iY3Nze71QoAAAqvPM0MBQYGysnp2qF79uzJl8tMFStWlL+/v9avX2/dlpqaqq1btyosLEySFBYWpvPnz2vHjh3WNhs2bFBmZqYaN25s95oAAMD9J08zQ/Zy8eJFHTp0yPr+yJEj2rVrl3x8fFShQgUNGTJEr7/+uqpWraqKFSvq1VdfVUBAgNq3by9JqlGjhlq1aqVnnnlGs2fP1tWrVxUbG6tu3bpxJxkAAMgRh4ah77//Xo8++qj1/bBhwyRJvXv31vz58/XSSy/p0qVLevbZZ3X+/Hk9/PDDWrNmjdzd3a3HLFy4ULGxsWrevLmcnJzUqVMnTZ8+vcDHAgAACieLYRiGo4twtNTUVHl5eSklJUWenp6OLgcAHIZb6wu/oxPbOrqEAmOvv995vpsMAADgfkAYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApkYYAgAApubi6AIAAID9BI/4wi79HJ3Y1i79FAbMDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFMjDAEAAFO7p8PQuHHjZLFYbF4hISHW/VeuXFFMTIxKlSql4sWLq1OnTjp9+rQDKwYAAIXNPR2GJKlWrVo6deqU9fXNN99Y9w0dOlQrV67UkiVLlJCQoN9//10dO3Z0YLUAAKCwcXF0AXfi4uIif3//LNtTUlL03//+V4sWLdJjjz0mSZo3b55q1KihLVu2qEmTJgVdKgAAKITu+ZmhgwcPKiAgQJUqVVKPHj10/PhxSdKOHTt09epVRUZGWtuGhISoQoUKSkxMdFS5AACgkLmnZ4YaN26s+fPnq3r16jp16pTGjx+vRx55RHv27FFSUpJcXV3l7e1tc4yfn5+SkpJu229aWprS0tKs71NTU/OjfAAAUAjc02GodevW1q/r1q2rxo0bKygoSIsXL5aHh0ee+42Li9P48ePtUSIAOFzwiC8cXQJQqN3zl8lu5O3trWrVqunQoUPy9/dXenq6zp8/b9Pm9OnT2a4xutHIkSOVkpJifZ04cSIfqwYAAPeyQhWGLl68qMOHD6ts2bIKDQ1VkSJFtH79euv+/fv36/jx4woLC7ttP25ubvL09LR5AQAAc7qnL5O98MILio6OVlBQkH7//XeNHTtWzs7O6t69u7y8vNSvXz8NGzZMPj4+8vT01MCBAxUWFsadZAAAIMfu6TD022+/qXv37vrzzz9VpkwZPfzww9qyZYvKlCkjSZoyZYqcnJzUqVMnpaWlKSoqSu+++66DqwYAAIWJxTAMw9FFOFpqaqq8vLyUkpLCJTMAhQ4LqJEfjk
5s6+gS7shef78L1ZohAAAAeyMMAQAAUyMMAQAAU7unF1ADwP2KdT7AvYOZIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGqEIQAAYGp8NhmA+569Pgfs6MS2dukHwL2FmSEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBq3FoPADlkr1v0AdxbmBkCAACmRhgCAACmxmUyk7LndD9P5QUAFGaEIQD3JNbnACgoXCYDAACmxswQ7hlcugMAOAJhCIDdcGkLQGHEZTIAAGBqhCEAAGBqhCEAAGBqrBkCCil7rc9hsTkAsyMMFSIsTgUAwP64TAYAAEyNMAQAAEyNMAQAAEyNMAQAAEyNBdS4ayzsLtz4+QEwO8IQUIAIHgBw7+EyGQAAMDVmhoA7YDYHAO5vzAwBAABTIwwBAABT4zIZ7ktc2gIA5BQzQwAAwNSYGcpnzFAAAHBvY2YIAACYGmEIAACYGmEIAACYGmEIAACYGmEIAACYGmEIAACYGmEIAACYGs8ZAgAAWdjzOXlHJ7a1W1/5gZkhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgaoQhAABgavdNGJo5c6aCg4Pl7u6uxo0ba9u2bY4uCQAAFAL3RRj65JNPNGzYMI0dO1Y7d+5UvXr1FBUVpTNnzji6NAAAcI+7L8LQ5MmT9cwzz6hv376qWbOmZs+eraJFi2ru3LmOLg0AANzjCn0YSk9P144dOxQZGWnd5uTkpMjISCUmJjqwMgAAUBi4OLqAu3X27FllZGTIz8/PZrufn59++eWXbI9JS0tTWlqa9X1KSookKTU11e71ZaZdtnufAAAUJvnx9/XGfg3DuKt+Cn0Yyou4uDiNHz8+y/bAwEAHVAMAwP3Na2r+9n/hwgV5eXnl+fhCH4ZKly4tZ2dnnT592mb76dOn5e/vn+0xI0eO1LBhw6zvMzMzlZycrFKlSsliseRbrampqQoMDNSJEyfk6emZb+e5F5hlrGYZp2SesZplnJJ5xmqWcUrmGev1cR4/flwWi0UBAQF31V+hD0Ourq4KDQ3V+vXr1b59e0nXws369esVGxub7TFubm5yc3Oz2ebt7Z3Plf5/np6e9/Uv6Y3MMlazjFMyz1jNMk7JPGM1yzgl84zVy8vLLuMs9GFIkoYNG6bevXurYcOGevDBBzV16lRdunRJffv2dXRpAADgHndfhKGuXbvqjz/+0JgxY5SUlKT69etrzZo1WRZVAwAA3Oy+CEOSFBsbe8vLYvcKNzc3jR07NssluvuRWcZqlnFK5hmrWcYpmWesZhmnZJ6x2nucFuNu70cDAAAoxAr9QxcBAADuBmEIAACYGmEIAACYGmEIAACYGmGoAMTFxalRo0YqUaKEfH191b59e+3fv9/RZeW7iRMnymKxaMiQIY4uJV+cPHlSPXv2VKlSpeTh4aE6dero+++/d3RZdpWRkaFXX31VFStWlIeHhypXrqzXXnvtrj8H6F6wadMmRUdHKyAgQBaLRcuXL7fZbxiGxowZo7Jly8rDw0ORkZE6ePCgY4q9C7cb59WrV/Xyyy+rTp06KlasmAICAvSPf/xDv//+u+MKvgt3+pne6LnnnpPFYtHUqVMLrD57yck49+3bp3bt2snLy0vFihVTo0aNdPz48YIv9i7daawXL15UbGysypcvLw8PD9WsWVOzZ8/O9XkIQwUgISFBMTEx2rJli9atW6erV6+qZcuWunTpkqNLyzfbt2/Xe++9p7p16zq6lHxx7tw5hYeHq0iRIlq9erX27t2rd955RyVLlnR0aXb11ltvadasWfr3v/+tffv26a233tLbb7+tGTNmOLq0u3bp0iXVq1dPM2fOzHb/22+/renTp2v27NnaunWrihUrpqioKF25cqWAK707txvn5c
uXtXPnTr366qvauXOnli5dqv3796tdu3YOqPTu3elnet2yZcu0ZcuWu/4IB0e50zgPHz6shx9+WCEhIdq4caN++uknvfrqq3J3dy/gSu/encY6bNgwrVmzRh999JH27dunIUOGKDY2VitWrMjdiQwUuDNnzhiSjISEBEeXki8uXLhgVK1a1Vi3bp0RERFhDB482NEl2d3LL79sPPzww44uI9+1bdvWePrpp222dezY0ejRo4eDKsofkoxly5ZZ32dmZhr+/v7Gv/71L+u28+fPG25ubsbHH3/sgArt4+ZxZmfbtm2GJOPYsWMFU1Q+udVYf/vtN6NcuXLGnj17jKCgIGPKlCkFXps9ZTfOrl27Gj179nRMQfkou7HWqlXLmDBhgs22Bx54wHjllVdy1TczQw6QkpIiSfLx8XFwJfkjJiZGbdu2VWRkpKNLyTcrVqxQw4YN9eSTT8rX11cNGjTQ+++/7+iy7O6hhx7S+vXrdeDAAUnSjz/+qG+++UatW7d2cGX568iRI0pKSrL5Hfby8lLjxo2VmJjowMryX0pKiiwWS4F+XmNByczMVK9evfTiiy+qVq1aji4nX2RmZuqLL75QtWrVFBUVJV9fXzVu3Pi2lwwLs4ceekgrVqzQyZMnZRiG4uPjdeDAAbVs2TJX/RCGClhmZqaGDBmi8PBw1a5d29Hl2N3//vc/7dy5U3FxcY4uJV/9+uuvmjVrlqpWraq1a9dqwIABGjRokD744ANHl2ZXI0aMULdu3RQSEqIiRYqoQYMGGjJkiHr06OHo0vJVUlKSJGX5SB8/Pz/rvvvRlStX9PLLL6t79+735Yd8vvXWW3JxcdGgQYMcXUq+OXPmjC5evKiJEyeqVatW+uqrr9ShQwd17NhRCQkJji7P7mbMmKGaNWuqfPnycnV1VatWrTRz5kw1bdo0V/3cNx/HUVjExMRoz549+uabbxxdit2dOHFCgwcP1rp16wrltencyMzMVMOGDfXmm29Kkho0aKA9e/Zo9uzZ6t27t4Ors5/Fixdr4cKFWrRokWrVqqVdu3ZpyJAhCggIuK/GiWuLqbt06SLDMDRr1ixHl2N3O3bs0LRp07Rz505ZLBZHl5NvMjMzJUlPPPGEhg4dKkmqX7++vvvuO82ePVsRERGOLM/uZsyYoS1btmjFihUKCgrSpk2bFBMTo4CAgFxdnWBmqADFxsZq1apVio+PV/ny5R1djt3t2LFDZ86c0QMPPCAXFxe5uLgoISFB06dPl4uLizIyMhxdot2ULVtWNWvWtNlWo0aNQnm3xu28+OKL1tmhOnXqqFevXho6dOh9P/Pn7+8vSTp9+rTN9tOnT1v33U+uB6Fjx45p3bp19+Ws0ObNm3XmzBlVqFDB+u/TsWPHNHz4cAUHBzu6PLspXbq0XFxcTPHv019//aVRo0Zp8uTJio6OVt26dRUbG6uuXbtq0qRJueqLmaECYBiGBg4cqGXLlmnjxo2qWLGio0vKF82bN9fu3btttvXt21chISF6+eWX5ezs7KDK7C88PDzL4xEOHDigoKAgB1WUPy5fviwnJ9v/Z3J2drb+3+f9qmLFivL399f69etVv359SVJqaqq2bt2qAQMGOLY4O7sehA4ePKj4+HiVKlXK0SXli169emWZKYiKilKvXr3Ut29fB1Vlf66urmrUqJEp/n26evWqrl69apd/owhDBSAmJkaLFi3S559/rhIlSljXHHh5ecnDw8PB1dlPiRIlsqyDKlasmEqVKnXfrY8aOnSoHnroIb355pvq0qWLtm3bpjlz5mjOnDmOLs2uoqOj9cYbb6hChQqqVauWfvjhB02ePFlPP/20o0u7axcvXtShQ4es748cOaJdu3bJx8dHFSpU0JAhQ/T666+ratWqqlixol599VUFBASoffv2jis6D243zrJly6pz587auXOnVq1apYyMDOu/Tz4+PnJ1dXVU2Xlyp5/pzUGvSJEi8vf3V/Xq1Qu61Ltyp3G++O
KL6tq1q5o2bapHH31Ua9as0cqVK7Vx40bHFZ1HdxprRESEXnzxRXl4eCgoKEgJCQn68MMPNXny5Nyd6K7uc0OOSMr2NW/ePEeXlu/u11vrDcMwVq5cadSuXdtwc3MzQkJCjDlz5ji6JLtLTU01Bg8ebFSoUMFwd3c3KlWqZLzyyitGWlqao0u7a/Hx8dn+d9m7d2/DMK7dXv/qq68afn5+hpubm9G8eXNj//79ji06D243ziNHjtzy36f4+HhHl55rd/qZ3qyw3lqfk3H+97//NapUqWK4u7sb9erVM5YvX+64gu/CncZ66tQpo0+fPkZAQIDh7u5uVK9e3XjnnXeMzMzMXJ3HYhj3waNkAQAA8ogF1AAAwNQIQwAAwNQIQwAAwNQIQwAAwNQIQwAAwNQIQwAAwNQIQwAAwNQIQwAAwNQIQwAKrfnz58vb27tAzrV//375+/vrwoULBXLuvXv3qnz58rp06VK+nQPANYQhALfVp08fWSwWWSwWFSlSRH5+fmrRooXmzp1boB/YGhwcrKlTp9ps69q1qw4cOFAg5x85cqQGDhyoEiVKFMi5a9asqSZNmuT+M5YA5BphCMAdtWrVSqdOndLRo0e1evVqPfrooxo8eLAef/xx/f3333nu1zCMuzrew8NDvr6+eT4+p44fP65Vq1apT58+BXruvn37atasWXf1PQJwZ4QhAHfk5uYmf39/lStXTg888IBGjRqlzz//XKtXr9b8+fMlSUePHpXFYtGuXbusx50/f14Wi8X6adkbN26UxWLR6tWrFRoaKjc3N33zzTc6fPiwnnjiCfn5+al48eJq1KiRvv76a2s/zZo107FjxzR06FDrLJWU/aWqWbNmqXLlynJ1dVX16tW1YMECm/0Wi0X/+c9/1KFDBxUtWlRVq1bVihUrbjv+xYsXq169eipXrpx1283nHjdunOrXr68FCxYoODhYXl5e6tatm/WyWnaOHTum6OholSxZUsWKFVOtWrX05ZdfWve3aNFCycnJSkhIuG19AO4OYQhAnjz22GOqV6+eli5dmutjR4wYoYkTJ2rfvn2qW7euLl68qDZt2mj9+vX64Ycf1KpVK0VHR+v48eOSpKVLl6p8+fKaMGGCTp06pVOnTmXb77JlyzR48GANHz5ce/bsUf/+/dW3b1/Fx8fbtBs/fry6dOmin376SW3atFGPHj2UnJx8y3o3b96shg0b3nFchw8f1vLly7Vq1SqtWrVKCQkJmjhx4i3bx8TEKC0tTZs2bdLu3bv11ltvqXjx4tb9rq6uql+/vjZv3nzHcwPIO8IQgDwLCQnR0aNHc33chAkT1KJFC1WuXFk+Pj6qV6+e+vfvr9q1a6tq1ap67bXXVLlyZeuMjY+Pj5ydnVWiRAn5+/vL398/234nTZqkPn366Pnnn1e1atU0bNgwdezYUZMmTbJp16dPH3Xv3l1VqlTRm2++qYsXL2rbtm23rPfYsWMKCAi447gyMzM1f/581a5dW4888oh69eql9evX37L98ePHFR4erjp16qhSpUp6/PHH1bRpU5s2AQEBOnbs2B3PDSDvCEMA8swwDOslq9y4eZbl4sWLeuGFF1SjRg15e3urePHi2rdvn3VmKKf27dun8PBwm23h4eHat2+fzba6detavy5WrJg8PT115syZW/b7119/yd3d/Y7nDw4Oti6wlqSyZcvett9Bgwbp9ddfV3h4uMaOHauffvopSxsPDw9dvnz5jucGkHeEIQB5tm/fPlWsWFGS5OR07Z8TwzCs+69evZrtccWKFbN5/8ILL2jZsmV68803tXnzZu3atUt16tRRenp6vtRdpEgRm/cWi+W2d8aVLl1a586ds3u///znP/Xrr7+qV69e2r17txo2bKgZM2bYtElOTlaZMmXueG4AeUcYApAnGzZs0O7du9WpUydJsv7BvnE9z42LqW/n22+/VZ8+fdShQwfVqVNH/v7+WS6/ubq6KiMj47b91KhRQ99++2
2WvmvWrJmjOm6lQYMG2rt37131cSuBgYF67rnntHTpUg0fPlzvv/++zf49e/aoQYMG+XJuANe4OLoAAPe+tLQ0JSUlKSMjQ6dPn9aaNWsUFxenxx9/XP/4xz8kXbuc06RJE02cOFEVK1bUmTNnNHr06Bz1X7VqVS1dulTR0dGyWCx69dVXs8yoBAcHa9OmTerWrZvc3NxUunTpLP28+OKL6tKlixo0aKDIyEitXLlSS5cutbkzLS+ioqL0z3/+UxkZGXJ2dr6rvm40ZMgQtW7dWtWqVdO5c+cUHx+vGjVqWPcfPXpUJ0+eVGRkpN3OCSArZoYA3NGaNWtUtmxZBQcHq1WrVoqPj9f06dP1+eef24SDuXPn6u+//1ZoaKiGDBmi119/PUf9T548WSVLltRDDz2k6OhoRUVF6YEHHrBpM2HCBB09elSVK1e+5WWj9u3ba9q0aZo0aZJq1aql9957T/PmzVOzZs3yPHZJat26tVxcXO46VN0sIyNDMTExqlGjhlq1aqVq1arp3Xffte7/+OOP1bJlSwUFBdn1vABsWYwbL/ADALI1c+ZMrVixQmvXri2Q86Wnp6tq1apatGhRlkXhAOyLy2QAkAP9+/fX+fPndeHCBZs7xvLL8ePHNWrUKIIQUACYGQIAAKbGmiEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBqhCEAAGBq/w9gUWVx4DhhfgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import torchaudio\n", "import numpy\n", "import glob\n", "import os\n", "\n", "# fetching all flac files in MiniLibriSpeech\n", "all_flacs = glob.glob(os.path.join(\"/content/LibriSpeech/train-clean-5\", \"**/*.flac\"), recursive=True)\n", "\n", "print(\"Number of audio files in MiniLibriSpeech train-clean-5: \", len(all_flacs))\n", "\n", "# step-by-step\n", "# collect durations\n", "all_durations = numpy.zeros(len(all_flacs))\n", "for i, audio in enumerate(all_flacs):\n", "    wav_meta = torchaudio.info(audio)\n", "    all_durations[i] = wav_meta.num_frames / wav_meta.sample_rate\n", "\n", "# plot histogram\n", "_ = plt.hist(all_durations, bins='auto')\n", "plt.title(\"MiniLibriSpeech: train-clean-5\")\n", "plt.xlabel(\"Duration (in s)\")\n", "plt.ylabel(\"# audios\")\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": { "id": "HYqdrRzYHCOf" }, "source": [ "We can see that most files are between 14 and 16 seconds long. Moreover, the file lengths vary widely.\n", "So if we draw a certain number of examples (e.g., 8) at random without any particular strategy, pad them, and batch them together, we will end up with many padded values.\n", "\n", "This way, we waste a significant portion of the computation on padded values.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "fZhST3jGLgRV" }, "source": [ "We can compute exactly how many samples belong to padding when iterating over the whole dataset with a fixed batch size.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "X0WcpKPoiDSi" }, "source": [ "Here we follow SpeechBrain's data preparation best practices.\n", "We parse all examples into a `.json` file so that parsing occurs only once and not at the start of each new experiment. Parsing many small files can take a lot of time on networked storage or slow physical hard drives." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "executionInfo": { "elapsed": 18596, "status": "ok", "timestamp": 1718826607399, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "r0RYowCdOHp3" }, "outputs": [], "source": [ "# prepare the LibriSpeech dataset using the pre-made parse_data.py script downloaded from\n", "# the data-io tutorial available here: https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html\n", "from parse_data import parse_to_json\n", "parse_to_json(\"/content/LibriSpeech/train-clean-5\")\n", "# this produces a manifest file, data.json" ] }, { "cell_type": "markdown", "metadata": { "id": "2j--OL9sideZ" }, "source": [ "We can briefly look at the resulting `.json` file. In particular, we are interested in the `length` field, which contains the length in samples of each audio file in the dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 28, "status": "ok", "timestamp": 1718826607399, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "N0Il1pGvhLZ9", "outputId": "efc95e9f-5e6e-40ec-8559-5c4b4e4b0357" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " },\n", " \"4640-19188-0038\": {\n", " \"file_path\": \"/content/LibriSpeech/train-clean-5/4640/19188/4640-19188-0038.flac\",\n", " \"words\": \"THE FIFTH MAN WAS SAVED\",\n", " \"spkID\": \"speaker_4640\",\n", " \"length\": 41200\n", " },\n", " \"4640-19188-0005\": {\n", " \"file_path\": \"/content/LibriSpeech/train-clean-5/4640/19188/4640-19188-0005.flac\",\n", " \"words\": \"COME SAID HE YOU MUST HAVE A LITTLE PITY DO YOU KNOW WHAT THE QUESTION IS HERE IT IS A QUESTION OF WOMEN SEE HERE ARE THERE WOMEN OR ARE THERE NOT ARE THERE CHILDREN OR ARE THERE NOT\",\n", " \"spkID\": \"speaker_4640\",\n", " \"length\": 247920\n", " },\n", " 
\"4640-19188-0035\": {\n", " \"file_path\": \"/content/LibriSpeech/train-clean-5/4640/19188/4640-19188-0035.flac\",\n", " \"words\": \"DO YOU DESIGNATE WHO IS TO REMAIN YES SAID THE FIVE CHOOSE WE WILL OBEY YOU MARIUS DID NOT BELIEVE THAT HE WAS CAPABLE OF ANOTHER EMOTION\",\n", " \"spkID\": \"speaker_4640\",\n", " \"length\": 184720\n", " }\n", "}" ] } ], "source": [ "!tail -n 20 data.json" ] }, { "cell_type": "markdown", "metadata": { "id": "9FNKDqIeisYY" }, "source": [ "We can use this `.json` manifest file to instantiate a SpeechBrain `DynamicItemDataset` object.\n", "\n", "If this is not clear refer to the [data-io tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html).\n", "\n", "We also define a `data-io pipeline` to read the audio file." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 2044, "status": "ok", "timestamp": 1718826609437, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "FBIoa_DQhNt2", "outputId": "35c9632c-4f14-472f-b43a-d5358d3859a8" }, "outputs": [ { "data": { "text/plain": [ "{'signal': tensor([ 7.9346e-04, 6.7139e-04, 4.8828e-04, ..., -2.1362e-04,\n", " -1.2207e-04, 3.0518e-05]),\n", " 'file_path': '/content/LibriSpeech/train-clean-5/3664/178355/3664-178355-0029.flac'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# initializing a sb dataset object from this json\n", "from speechbrain.dataio.dataset import DynamicItemDataset\n", "import speechbrain\n", "train_data = speechbrain.dataio.dataset.DynamicItemDataset.from_json(\"data.json\")\n", "# we define a pipeline to read audio\n", "@speechbrain.utils.data_pipeline.takes(\"file_path\")\n", "@speechbrain.utils.data_pipeline.provides(\"signal\")\n", "def audio_pipeline(file_path):\n", " sig = speechbrain.dataio.dataio.read_audio(file_path)\n", " return 
sig\n", "# setting the pipeline\n", "train_data.add_dynamic_item(audio_pipeline)\n", "train_data.set_output_keys([\"signal\", \"file_path\"])\n", "train_data[0]" ] }, { "cell_type": "markdown", "metadata": { "id": "4CsCgHaHjPj0" }, "source": [ "Voilà, we can now start iterating over this dataset using a torch `DataLoader`.\n", "By using `PaddedBatch` as the `collate_fn`, SpeechBrain will handle padding automatically for us. Neat!\n", "\n", "We can also define a simple function, `count_samples`, to count the samples that belong to padding in each batch." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1718826609438, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "lypSm786W1GH" }, "outputs": [], "source": [ "import torch\n", "import time\n", "from torch.utils.data import DataLoader\n", "from speechbrain.dataio.batch import PaddedBatch\n", "\n", "# counting the total number of padded values when batching the dataset with batch_size = 32\n", "batch_size = 32\n", "\n", "# PaddedBatch will pad audios to the right\n", "dataloader = DataLoader(train_data, collate_fn=PaddedBatch, batch_size=batch_size)\n", "\n", "def count_samples(dataloader):\n", "    true_samples = 0\n", "    padded_samples = 0\n", "    t1 = time.time()\n", "    for batch in dataloader:\n", "        # lens holds each example's length relative to the longest one in the batch\n", "        audio, lens = batch.signal\n", "\n", "        true_samples += torch.sum(audio.shape[-1]*lens).item()\n", "        padded_samples += torch.sum(audio.shape[-1]*(1-lens)).item()\n", "\n", "    elapsed = time.time() - t1\n", "    tot_samples = true_samples + padded_samples\n", "    return true_samples / tot_samples, padded_samples / tot_samples, elapsed\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 1087, "status": "ok", "timestamp": 1718826610518, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": 
"rlfkiahZa0Qx", "outputId": "205dd4c3-7e8a-46b4-f770-09ff090e410c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PaddedData(data=tensor([[ 7.9346e-04, 6.7139e-04, 4.8828e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00],\n", " [-9.7656e-04, -4.8828e-04, -2.7466e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00],\n", " [ 1.8311e-04, 9.1553e-05, 3.0518e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00],\n", " ...,\n", " [-4.8828e-04, -3.6621e-04, -4.8828e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00],\n", " [ 0.0000e+00, -6.1035e-05, -3.6621e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00],\n", " [-7.6294e-04, -8.8501e-04, -8.8501e-04, ..., 0.0000e+00,\n", " 0.0000e+00, 0.0000e+00]]), lengths=tensor([0.7254, 0.9600, 0.9525, 0.9864, 0.8919, 0.9579, 0.2834, 0.9282, 0.5404,\n", " 0.8429, 0.9552, 0.9667, 0.6845, 0.8650, 0.9164, 0.8892, 0.3215, 0.9579,\n", " 0.7363, 0.7172, 0.8601, 0.8959, 0.8529, 0.7826, 1.0000, 0.9325, 0.9818,\n", " 0.9679, 0.8974, 0.7914, 0.9912, 0.9319]))\n", "PaddedData(data=tensor([[ 0.0006, 0.0003, -0.0003, ..., 0.0000, 0.0000, 0.0000],\n", " [ 0.0005, 0.0004, 0.0005, ..., 0.0000, 0.0000, 0.0000],\n", " [ 0.0003, 0.0004, 0.0004, ..., 0.0000, 0.0000, 0.0000],\n", " ...,\n", " [-0.0055, -0.0057, -0.0051, ..., 0.0000, 0.0000, 0.0000],\n", " [ 0.0010, -0.0007, -0.0013, ..., 0.0000, 0.0000, 0.0000],\n", " [ 0.0015, 0.0007, 0.0022, ..., 0.0000, 0.0000, 0.0000]]), lengths=tensor([0.9501, 0.9389, 0.8989, 0.9055, 0.9780, 0.7591, 0.8813, 0.7880, 1.0000,\n", " 0.9442, 0.2604, 0.7607, 0.9253, 0.9048, 0.8974, 0.7514, 0.9895, 0.2610,\n", " 0.8360, 0.6321, 0.5701, 0.9231, 0.9764, 0.7725, 0.3549, 0.8633, 0.7337,\n", " 0.7446, 0.9309, 0.8590, 0.9262, 0.5115]))\n", "PaddedData(data=tensor([[ 0.0000, 0.0003, 0.0006, ..., 0.0000, 0.0000, 0.0000],\n", " [-0.0150, -0.0154, -0.0150, ..., 0.0000, 0.0000, 0.0000],\n", " [-0.0012, -0.0012, -0.0022, ..., 0.0000, 0.0000, 0.0000],\n", " ...,\n", " [-0.0011, -0.0031, -0.0020, ..., 0.0000, 
0.0000, 0.0000],\n", " [-0.0011, 0.0016, 0.0015, ..., 0.0000, 0.0000, 0.0000],\n", " [-0.0012, -0.0031, -0.0022, ..., 0.0000, 0.0000, 0.0000]]), lengths=tensor([0.7143, 0.9765, 0.8430, 0.9280, 0.8743, 0.9289, 0.8849, 0.6190, 0.8590,\n", " 1.0000, 0.7652, 0.3936, 0.7022, 0.8803, 0.7474, 0.9388, 0.9602, 0.8933,\n", " 0.9331, 0.9370, 0.8327, 0.8547, 0.7664, 0.6492, 0.7902, 0.8996, 0.8267,\n", " 0.9524, 0.8189, 0.8502, 0.6377, 0.7077]))\n" ] } ], "source": [ "for i, d in enumerate(dataloader):\n", "    print(d.signal)\n", "    # a few examples are enough to demonstrate what's going on here\n", "    if i == 2:\n", "        break" ] }, { "cell_type": "markdown", "metadata": { "id": "Qp0vH1iCjwXl" }, "source": [ "Let's count the samples when using a fixed batch size of 32 (as above), with the examples taken in the unsorted order of the manifest, which is effectively random with respect to length." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 11139, "status": "ok", "timestamp": 1718826621655, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "uKOZybJ0jtQM", "outputId": "78f4f270-940f-42e7-f268-1ea0b70d7bbc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random Sampling: % True samples 76.8, % of padding 23.2, Total time 11.06s\n" ] } ], "source": [ "percent_true, percent_padded, elapsed = count_samples(dataloader)\n", "print(\"Random Sampling: % True samples {:.1f}, % of padding {:.1f}, Total time {:.2f}s\".format(percent_true*100, percent_padded*100, elapsed))" ] }, { "cell_type": "markdown", "metadata": { "id": "Tid6zWnK2_as" }, "source": [ "We are wasting more than 20% of the computation in each training iteration on useless values which are only there to enable batched computations.\n", "\n", "Can we avoid such waste, speed up training, and consume less energy?\n", "\n", "Sure, we can simply sort the dataset according to the length of the examples in ascending or descending order and then 
batch the examples together.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10682, "status": "ok", "timestamp": 1718826632325, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "at-_sxv8w6hs", "outputId": "1fe00421-59ae-475b-87bd-98528e4cc1a7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After sorting: % True samples 98.8, % of padding 1.2, Total time 10.65\n" ] } ], "source": [ "# if you followed the data-io tutorial you already know that sorting is super simple:\n", "sorted_data = train_data.filtered_sorted(sort_key=\"length\")\n", "dataloader = DataLoader(sorted_data, collate_fn=PaddedBatch, batch_size=batch_size)\n", "percent_true, percent_padded, elapsed = count_samples(dataloader)\n", "print(\"After sorting: % True samples {:.1f}, % of padding {:.1f}, Total time {:.2f}\".format(percent_true*100, percent_padded*100, elapsed))" ] }, { "cell_type": "markdown", "metadata": { "id": "pmmBM6E_5CFh" }, "source": [ "That is quite a reduction. We now waste almost no compute on padded values, as we have minimized padding by taking audios with roughly the same length in each batch. Iterating over one epoch is also somewhat faster (10.65 s versus 11.06 s above).\n", "\n", "But this means that we must train with a sorted dataset.\n", "In some applications, this might hurt the performance as the network always sees the examples in the same order.\n", "\n", "In other applications, sorting the examples can instead bring better performance, as it can be seen as a sort of curriculum learning. This is the case, for example, for our TIMIT recipes.\n", "\n", "Dynamic Batching allows users to trade off between full random sampling of the examples and deterministic sampling from sorted examples."
] }, { "cell_type": "markdown", "metadata": { "id": "hOxnX03A1eS6" }, "source": [ "Another problem with a fixed batch size is that we under-utilize our resources on the shortest examples.\n", "Suppose we use a fixed batch size of 8, and our dataset is sorted in ascending order. This means we must have sufficient memory to train on the 8 longest examples. But we also train on the 8 shortest ones!\n", "In many instances, we can afford to batch a larger number of shorter examples together and optimize the GPU usage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "iEuEnbEz75sr" }, "source": [ "## SpeechBrain `DynamicBatchSampler` class" ] }, { "cell_type": "markdown", "metadata": { "id": "Mecsawr--Fff" }, "source": [ "SpeechBrain provides a useful abstraction to perform Dynamic Batching:\n", "\n", "---\n", "\n", "**DynamicBatchSampler**.\n", "\n", "In particular, with the right settings, it allows us to train large models even with 12 GB VRAM GPUs in a reasonable time. When using high-performance, high-VRAM GPUs, instead, it can significantly reduce training time.\n", "\n", "**This abstraction allows us to select a good trade-off between training speed, randomization of sampling, and VRAM usage.**\n", "\n", "It is up to you, depending on your application scenario and hardware, which of these characteristics should be prioritized.\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "j8M-LuNBobTk" }, "source": [ "`DynamicBatchSampler` is a subclass of `torch.utils.data.Sampler` and acts as a torch *batch sampler*.\n", "\n", "Being a batch sampler, it is simply a Python iterable that yields, at each iteration, a list containing the indexes of the examples which should be batched together by the `DataLoader` using the `collate_fn`. These indexes are used to fetch the actual examples from the `torch.utils.data.Dataset` using its `__getitem__` method.\n", "\n", "Here is an example with batch_size 2. 
The DataLoader is responsible for parallelizing the calls to the Dataset `__getitem__` method. The indexes of the examples are provided by the batch sampler.\n", "For more info, you can refer to the official [PyTorch documentation on torch.utils.data](https://pytorch.org/docs/stable/data.html)." ] }, { "cell_type": "markdown", "metadata": { "id": "TaRTseKUgz5D" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "O4OQaEM5kQMD" }, "source": [ "### Using `speechbrain.dataio.samplers.DynamicBatchSampler`" ] }, { "cell_type": "markdown", "metadata": { "id": "XFzGEwfJkZxx" }, "source": [ "`DynamicBatchSampler` has several input arguments upon instantiation and provides a great deal of flexibility.\n", "\n", "Using MiniLibriSpeech, we will illustrate in practice the effect of some of these arguments and how each of them changes the trade-off between speed, randomization, and VRAM usage." ] }, { "cell_type": "markdown", "metadata": { "id": "_tM5iEaYolWS" }, "source": [ "**NOTE:** you should be highly familiar with SpeechBrain [data-io](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html) to follow this tutorial."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1718826632325, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "lTogAfnOrEjp" }, "outputs": [], "source": [ "# initializing a sb dataset object from this json\n", "from speechbrain.dataio.dataset import DynamicItemDataset\n", "import speechbrain\n", "\n", "# we instantiate here the train data dataset from the json manifest file\n", "train_data = DynamicItemDataset.from_json(\"data.json\")\n", "\n", "# we define a pipeline to read audio\n", "@speechbrain.utils.data_pipeline.takes(\"file_path\")\n", "@speechbrain.utils.data_pipeline.provides(\"signal\")\n", "def audio_pipeline(file_path):\n", "    sig = speechbrain.dataio.dataio.read_audio(file_path)\n", "    return sig\n", "\n", "# setting the pipeline\n", "train_data.add_dynamic_item(audio_pipeline)\n", "train_data.set_output_keys([\"signal\", \"file_path\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "Ggol_VzvrwME" }, "source": [ "Crucially, to use `DynamicBatchSampler` **it is important that the manifest/dataset description file** (`json` or `csv`) **contains**, for each example, **an entry which specifies the duration or length of that example**.\n", "The `DynamicBatchSampler` will use this information to batch examples together efficiently."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1718826632325, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "X3JSZOecrzud", "outputId": "cb0a9bfe-6fef-4aad-c851-28eb9d71a35f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \"spkID\": \"speaker_4640\",\n", " \"length\": 247920\n", " },\n", " \"4640-19188-0035\": {\n", " \"file_path\": \"/content/LibriSpeech/train-clean-5/4640/19188/4640-19188-0035.flac\",\n", " \"words\": \"DO YOU DESIGNATE WHO IS TO REMAIN YES SAID THE FIVE CHOOSE WE WILL OBEY YOU MARIUS DID NOT BELIEVE THAT HE WAS CAPABLE OF ANOTHER EMOTION\",\n", " \"spkID\": \"speaker_4640\",\n", " \"length\": 184720\n", " }\n", "}" ] } ], "source": [ "!tail -n 10 data.json" ] }, { "cell_type": "markdown", "metadata": { "id": "elu2P2lar5lH" }, "source": [ "We can see that in this case we have a `length` key containing, for each audio, the length in samples." ] }, { "cell_type": "markdown", "metadata": { "id": "QFEPu-d6q8r-" }, "source": [ "#### Instantiating `DynamicBatchSampler`: Core Parameters\n", "\n", "---\n", "At its core, `DynamicBatchSampler` batches examples with similar lengths based on \"buckets\". Upon instantiation, based on the input args, several buckets are created. These buckets define a number of contiguous intervals, e.g. $0\\leq x < 200, 200 \\leq x < 400$ and so on. \n", "Examples whose lengths fall into the same bucket are treated as if they had the same length and can be batched together. 
In some way, we are \"quantizing\" the lengths of the examples in the dataset.\n", "\n", "In the Figure below we have N buckets, each defined by its right boundary.\n", "For each bucket, we can have a different `batch_size` because we can fit more examples falling in the leftmost bucket than in the rightmost one.\n", "\n", "For the first bucket, the batch size is 8 because 1725 // 200 = 8.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XilSLmHmtHYY" }, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5tQSe564wK2U" }, "source": [ "In the Figure below we illustrate how 14 examples with different lengths are \"bucketized\": 3 examples in the first bucket, 5 examples in the second, 2 in the third, 2 in the fourth and one in the last.\n", "\n", "One example is discarded because it is too long (its length is more than `max_batch_length`)." ] }, { "cell_type": "markdown", "metadata": { "id": "zXmT-3KWwIkV" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "eNco4lz-JurP" }, "source": [ "A minimal instantiation of `DynamicBatchSampler` requires at least four arguments:\n", "\n", "1. A `Dataset` object (`train_data` here; note it can also be a validation or test set).\n", "2. `max_batch_length`: the maximum length we want in a batch. This is the maximum aggregated length of all examples in a batch that we are going to allow, and it must be chosen carefully to avoid OOM errors.\n", "A higher number means we are going to have, on average, a higher batch size, so you must apply the same \"tricks\" as when the batch size is increased for standard fixed batch size training.
E.g. increase the learning rate.\n", "3. `num_buckets`: number of buckets one wishes to use. If just one bucket is used, all examples can be batched together, and dynamic batching in this instance is the same as uniform random sampling of the examples.\n", "If too many buckets are specified, the training will be slow because some buckets will be half empty.\n", "As a rule of thumb: `num_buckets` trades off speed against randomization.\n", "
Low number -> better randomization, High number -> faster training.\n", "\n", "4. `length_func`: function to be applied to each dataset element to get its length. In our case, we can see that the `.json` manifest contains a key *length* which specifies each audio length in samples. `length_func` can be used, for example, to convert this length into seconds or into the number of feature frames, so that `max_batch_length` and the bucket boundaries are no longer specified in samples.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xmNzc38_Hfow" }, "source": [ "We can specify `max_batch_length` in terms of seconds:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "executionInfo": { "elapsed": 164, "status": "ok", "timestamp": 1718826632486, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "cEk1NgP9HjwH" }, "outputs": [], "source": [ "from speechbrain.dataio.sampler import DynamicBatchSampler\n", "\n", "# max_batch_length is expressed in seconds here, because length_func\n", "# converts the length in samples into seconds (16 kHz audio)\n", "max_batch_len = 17*32\n", "\n", "dynamic_batcher = DynamicBatchSampler(\n", "    train_data,\n", "    max_batch_length=max_batch_len,\n", "    num_buckets=60,\n", "    length_func=lambda x: x[\"length\"] / 16000,\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 7, "status": "ok", "timestamp": 1718826632486, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "I-KXWrvq_ZuO", "outputId": "0fb5912d-6ca9-4c1a-f2e1-ca6a53d22e6b" }, "outputs": [ { "data": { "text/plain": [ "11.98" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dynamic_batcher._ex_lengths['0']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1718826632486, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, 
"user_tz": 240 }, "id": "OjLGSB-IhD0K", "outputId": "224bd1f5-c8cf-42c7-c54f-279ae8aa8726" }, "outputs": [ { "data": { "text/plain": [ "41" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(dynamic_batcher)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1718826632486, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "bwDemOO-U5G9", "outputId": "4b3251e0-81ba-403b-d4ec-47a052e45b2a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "34\n", "44\n", "34\n", "34\n", "34\n", "44\n", "34\n", "57\n", "34\n", "34\n", "38\n", "34\n", "38\n", "53\n", "38\n", "17\n", "71\n", "38\n", "34\n", "16\n", "34\n", "34\n", "35\n", "38\n", "34\n", "30\n", "38\n", "34\n", "34\n", "8\n", "30\n", "44\n", "34\n", "38\n", "53\n", "38\n", "26\n", "71\n", "38\n", "34\n", "34\n" ] } ], "source": [ "for b in dynamic_batcher:\n", " print(len(b))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 9886, "status": "ok", "timestamp": 1718826642370, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "qhlzdqqNUKTL", "outputId": "8e1f986c-7804-4435-f140-e68608e47141" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "511.5\n", "515.8\n", "509.3\n", "506.6\n", "491.4\n", "345.4\n", "516.5\n", "506.6\n", "514.2\n", "479.3\n", "478.4\n", "195.0\n", "74.9\n", "501.8\n", "514.2\n", "328.9\n", "510.0\n", "514.6\n", "270.3\n", "514.0\n", "517.8\n", "519.4\n", "507.3\n", "505.9\n", "508.4\n", "467.8\n", "517.4\n", "511.2\n", "514.0\n", "424.3\n", "512.6\n", "503.3\n", "241.4\n", "510.6\n", "506.0\n", "512.4\n", "512.5\n", "508.0\n", "516.7\n", "489.6\n", "513.6\n" ] } ], "source": [ "for b in 
dynamic_batcher:\n", "    print(\"%.1f\" % sum([train_data[i]['signal'].shape[0]/16000 for i in b]))" ] }, { "cell_type": "markdown", "metadata": { "id": "NfewXTPxuHI9" }, "source": [ "#### Using `DynamicBatchSampler`" ] }, { "cell_type": "markdown", "metadata": { "id": "6pMNyqs-vvAj" }, "source": [ "Once this special batch sampler is instantiated, it can be used in the standard PyTorch way by passing it as a `DataLoader` argument:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1718826642370, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "mwyIQz5ROaw5" }, "outputs": [], "source": [ "dataloader = DataLoader(train_data, batch_sampler=dynamic_batcher, collate_fn=PaddedBatch)\n", "# note that the batch size in the DataLoader cannot be specified when a batch sampler is used.\n", "# the batch size is handled by the batch_sampler and in this case is dynamic" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 429, "status": "ok", "timestamp": 1718826642798, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "1YWQI8bUwN2v", "outputId": "be2ff8ef-b5c5-41a5-cbcb-57886a57926a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([34])\n" ] } ], "source": [ "# we can iterate now over the data in an efficient way using dynamic batching.\n", "# our DynamicBatchSampler will sample the index of the examples such that padding is minimized\n", "# while PaddedBatch will handle the actual padding and batching.\n", "# everything happens in parallel thanks to the torch DataLoader.\n", "first_batch = next(iter(dataloader))\n", "print(first_batch.signal.lengths.shape)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, 
"executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1718826642798, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "jXOYNdw4wvD-", "outputId": "b35f4e40-14e9-4e50-ea7f-29d19749771a" }, "outputs": [ { "data": { "text/plain": [ "torch.Size([34, 255280])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_batch.signal.data.shape" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10431, "status": "ok", "timestamp": 1718826653227, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "bY0AwDiSxdVE", "outputId": "164f7843-d935-4d27-ef03-f57356cd2007" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "With Dynamic Batching: % True samples 92.1, % of padding 7.9, Total time 10.38s\n" ] } ], "source": [ "percent_true, percent_padded, elapsed = count_samples(dataloader)\n", "print(\"With Dynamic Batching: % True samples {:.1f}, % of padding {:.1f}, Total time {:.2f}s\".format(percent_true*100, percent_padded*100, elapsed))" ] }, { "cell_type": "markdown", "metadata": { "id": "5AuQKRKIxv-j" }, "source": [ "**The amount of padded values is significantly reduced compared with a fixed batch size and unsorted sampling (7.9% versus 23.2%).**\n", "\n", "It approaches what is obtained with fully deterministic sorting and a fixed batch size (1.2%).\n", "The difference is that, here, with the `DynamicBatchSampler` we can still allow for some randomness in the sampling strategy.\n", "\n", "Moreover, by varying the batch size we use our hardware to the fullest with each batch, which can significantly speed up training.\n", "\n", "We can look at the number of batches produced in one epoch:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, 
"executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1718826653227, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "2zjPhvgoZihU", "outputId": "79856ee9-5d37-42a6-e837-69a76b4e2864" }, "outputs": [ { "data": { "text/plain": [ "41" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(dynamic_batcher)" ] }, { "cell_type": "markdown", "metadata": { "id": "8guqeYr1Z9_q" }, "source": [ "Using the `DynamicBatchSampler` with the current parameters we have 41 batches.\n", "\n", "With a fixed batch size of 32, we would instead end up with:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1718826653227, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "EiO7KoM0Z1mi", "outputId": "040c1075-52c4-48bd-8b09-70ec8156bd97" }, "outputs": [ { "data": { "text/plain": [ "48" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train_data) // 32 + 1" ] }, { "cell_type": "markdown", "metadata": { "id": "1djPRPX7aKEc" }, "source": [ "so more training iterations, with more padded values --> longer training time." ] }, { "cell_type": "markdown", "metadata": { "id": "YxhMuUFeuSgx" }, "source": [ "Another way to use `DynamicBatchSampler` straightforwardly is by feeding it directly to the Brain class as an additional argument via `run_opts`. In this case, the Brain class will implicitly instantiate a `DataLoader` for you."
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1718826653228, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "6Ig9zJi0ugjn" }, "outputs": [], "source": [ "## dummy Brain class here with dummy model\n", "class SimpleBrain(speechbrain.Brain):\n", "    def compute_forward(self, batch, stage):\n", "        return model(batch[\"signal\"][0].unsqueeze(1))\n", "\n", "    def compute_objectives(self, predictions, batch, stage):\n", "        loss_dummy = torch.mean(predictions)\n", "        return loss_dummy" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 40473, "status": "ok", "timestamp": 1718826693699, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "-HdYfcSO_v1X", "outputId": "6f73a392-3faa-485a-c009-f77a12ea66b5" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1519/1519 [00:37<00:00, 40.24it/s, train_loss=-75.8]\n" ] } ], "source": [ "model = torch.nn.Conv1d(1, 1, 3)\n", "brain = SimpleBrain({\"model\": model}, opt_class=lambda x: torch.optim.SGD(x, 0.1), run_opts={\"batch_sampler\": dynamic_batcher})\n", "brain.fit(range(1), train_data)" ] }, { "cell_type": "markdown", "metadata": { "id": "FrsjDadz0AP_" }, "source": [ "### Advanced Parameters: Full control over randomness, training speed, and VRAM consumption.\n", "---\n", "So far we have explored the most basic input args for `DynamicBatchSampler`.\n", "Let's now look at more advanced parameters.\n", "\n", "#### Controlling Randomness\n", "\n", "\n", "Randomness in `DynamicBatchSampler` is controlled with `shuffle` and `batch_ordering`.\n", "\n", "`shuffle` is a flag:\n", "\n", "* if `True`, then dynamic batches are created based on random sampling (deterministically based on `epoch` and `seed` parameters) at each epoch 
(including upon `DynamicBatchSampler` instantiation, i.e. epoch 0);\n", "* if `False`, then dynamic batches are created taking the examples from the dataset as they are. If the dataset is sorted in ascending or descending order, this ordering is preserved. Note that if `False`, the batches will be created once and never change during training (their permutation can still change, however; see below).\n", "\n", "\n", "\n", "Batch permutation depends on `batch_ordering`:\n", "\n", "* `\"random\"` deterministically shuffles batches based on `epoch` and `seed` parameters\n", "* `\"ascending\"` and `\"descending\"` sort the batches based on the duration of the longest example in the batch.\n", "\n", "This argument is independent of `shuffle`. `shuffle` controls whether the examples are shuffled before the batches are created. `batch_ordering` instead controls the ordering of the batches after they have been created.\n", "For example, if set to `\"ascending\"`, the first batch returned by the batch sampler will be the one with the shortest example in the dataset (examples belonging to the leftmost bucket), while the last one will contain the longest example in the dataset.\n", "\n", "\n", "NOTE: when iterating the `DynamicBatchSampler` (calling its `__iter__` function):\n", "\n", "* dynamic batches are re-generated at each epoch if `shuffle == True`; or\n", "* dynamic batches are permuted at each epoch if `batch_ordering == \"random\"`\n", "\n", "\n", "\n", "Note that `num_buckets` also affects the randomization of training. As we stated before, if `num_buckets`-->1 we obtain full random sampling, as all examples can be batched together, at least if `shuffle` is `True` and `batch_ordering` is `\"random\"`. Curiously, even if `num_buckets` is very large we also obtain full random sampling if `shuffle` is `True` and `batch_ordering` is `\"random\"`, as practically every example in the dataset is batched alone (the batch size will be close to 1 and training very slow; you probably want to avoid this)."
] }, { "cell_type": "markdown", "metadata": { "id": "2GR81Tjmeqxl" }, "source": [ "Here we create the batches by first shuffling the examples (so the batches will be different at each epoch) and then sorting the batches, so that the one with the shortest example always comes first." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 292, "status": "ok", "timestamp": 1718826693986, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "A1brcnzleTp1", "outputId": "bc0a5a4f-51a3-44f2-ffda-b0043c343c3d" }, "outputs": [ { "data": { "text/plain": [ "torch.Size([71, 120480])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from speechbrain.dataio.sampler import DynamicBatchSampler\n", "\n", "max_batch_len = 17*32\n", "\n", "dynamic_batcher = DynamicBatchSampler(\n", "    train_data,\n", "    max_batch_length=max_batch_len,\n", "    num_buckets=60,\n", "    length_func=lambda x: x[\"length\"] / 16000,\n", "    shuffle=True,\n", "    batch_ordering=\"ascending\",\n", ")\n", "\n", "dataloader = DataLoader(train_data, batch_sampler=dynamic_batcher, collate_fn=PaddedBatch)\n", "\n", "first_batch = next(iter(dataloader))\n", "\n", "first_batch.signal[0].shape" ] }, { "cell_type": "markdown", "metadata": { "id": "A07LSqX7e9PN" }, "source": [ "We can instead use descending order:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 299, "status": "ok", "timestamp": 1718826694284, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "ONwRup3NfAzy", "outputId": "2d3e34e3-ae7c-42de-f16b-0911fc1b3b5c" }, "outputs": [ { "data": { "text/plain": [ "torch.Size([30, 276400])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from speechbrain.dataio.sampler 
import DynamicBatchSampler\n", "\n", "max_batch_len = 17*32\n", "\n", "dynamic_batcher = DynamicBatchSampler(\n", "    train_data,\n", "    max_batch_length=max_batch_len,\n", "    num_buckets=60,\n", "    length_func=lambda x: x[\"length\"] / 16000,\n", "    shuffle=True,\n", "    batch_ordering=\"descending\",\n", ")\n", "\n", "dataloader = DataLoader(train_data, batch_sampler=dynamic_batcher, collate_fn=PaddedBatch)\n", "\n", "first_batch = next(iter(dataloader))\n", "\n", "first_batch.signal[0].shape" ] }, { "cell_type": "markdown", "metadata": { "id": "XqAasFaEfIrQ" }, "source": [ "We can see that it now returns the batch with the longest example." ] }, { "cell_type": "markdown", "metadata": { "id": "vwXc_DTLJ4t0" }, "source": [ "##### Manually specifying the buckets" ] }, { "cell_type": "markdown", "metadata": { "id": "MXTgObmHJWYa" }, "source": [ "The argument `bucket_boundaries` can be used to manually specify how many buckets there are and what their boundaries are.\n", "\n", "Needless to say, this arg will supersede `num_buckets`.\n", "\n", "Let's see an example:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "executionInfo": { "elapsed": 2, "status": "ok", "timestamp": 1718826694284, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "npoZZCNaKZkO" }, "outputs": [], "source": [ "# trivial example: just one bucket\n", "# lengths are converted to seconds so that they use the same unit as max_batch_len\n", "dynamic_batcher = DynamicBatchSampler(\n", "    train_data,\n", "    max_batch_length=max_batch_len,\n", "    bucket_boundaries=[max_batch_len],\n", "    length_func=lambda x: x[\"length\"] / 16000,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "s9mYhtGiSH5q" }, "source": [ "It is easy to see that, with just one bucket, all examples can be batched together. 
Even the shortest ones with the longest ones.\n", "\n", "When just one bucket is used, the `DynamicBatchSampler` will be inefficient, as it will not minimize the amount of padding in each batch at all, and its behavior will be similar to having a fixed batch size.\n", "\n", "As we said previously, we have the maximal amount of randomness in each batch, as each example can be batched with any other one, regardless of its length.\n", "We can now see more clearly the trade-off between training speed and randomness.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "4pYvO6cqTVTz" }, "source": [ "Here, in a more practical example, we use the `bucket_boundaries` argument to specify a distribution for the buckets, given the distribution of the length of the audio files in our dataset, which we plotted before and which has a **reversed log-normal distribution**." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 12118, "status": "ok", "timestamp": 1718826706400, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "xaykn4vlUFLc", "outputId": "c3f0a8f5-a486-493a-adb6-b19e659a3bf3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "With Dynamic Batching: % True samples 89.8, % of padding 10.2, Total time 12.07\n", "\n" ] }, { "data": { "text/plain": [ "[512.8205128205128,\n", " 1025.6410256410256,\n", " 1538.4615384615386,\n", " 2051.2820512820513,\n", " 2564.102564102564,\n", " 3076.923076923077,\n", " 3589.74358974359,\n", " 4102.5641025641025,\n", " 4615.384615384615,\n", " 5128.205128205128,\n", " 5641.025641025641,\n", " 6153.846153846154,\n", " 6666.666666666667,\n", " 7179.48717948718,\n", " 7692.307692307692,\n", " 8205.128205128205,\n", " 8717.948717948719,\n", " 9230.76923076923,\n", " 9743.589743589744,\n", " 10256.410256410256,\n", " 10769.23076923077,\n", " 11282.051282051281,\n", " 11794.871794871795,\n", 
" 12307.692307692309,\n", " 12820.51282051282,\n", " 13333.333333333334,\n", " 13846.153846153846,\n", " 14358.97435897436,\n", " 14871.794871794871,\n", " 15384.615384615385,\n", " 15897.435897435897,\n", " 16410.25641025641,\n", " 16923.076923076922,\n", " 17435.897435897437,\n", " 17948.71794871795,\n", " 18461.53846153846,\n", " 18974.358974358973,\n", " 19487.17948717949,\n", " 20000.0]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "# number of buckets --> fewer buckets, more randomness\n", "n_buckets = 40\n", "\n", "# we can create linearly spaced bucket boundaries:\n", "# np.linspace returns n_buckets points; dropping the first one (0) leaves n_buckets - 1 boundaries\n", "max_batch_len = 20000\n", "buckets = np.linspace(0, max_batch_len, n_buckets)\n", "buckets_bounds = buckets[1:].tolist()\n", "dynamic_batcher = DynamicBatchSampler(\n", "    train_data,\n", "    max_batch_length=max_batch_len,\n", "    bucket_boundaries=buckets_bounds,\n", "    length_func=lambda x: x[\"length\"] / 160,  # length in units of 10 ms\n", ")\n", "\n", "dataloader = DataLoader(train_data, batch_sampler=dynamic_batcher, collate_fn=PaddedBatch)\n", "percent_true, percent_padded, elapsed = count_samples(dataloader)\n", "print(\"With Dynamic Batching: % True samples {:.1f}, % of padding {:.1f}, Total time {:.2f}\\n\".format(percent_true*100, percent_padded*100, elapsed))\n", "\n", "# display the bucket boundaries\n", "buckets_bounds" ] }, { "cell_type": "markdown", "metadata": { "id": "uqYwDqcDW4GE" }, "source": [ "*However*, having linearly spaced buckets when our length distribution is not uniform is sub-optimal.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "0dJX0aJvp1zc" }, "source": [ "Intuitively, a better way to generate the buckets is to space their boundaries exponentially, so that we employ coarser buckets for longer examples.\n", "Indeed, a given amount of padding has relatively less impact on longer examples."
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 10923, "status": "ok", "timestamp": 1718826717317, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "9rnF21yhqjUF", "outputId": "7f0d5533-a3f3-49c3-edd9-a4e599dd908e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "With Dynamic Batching: % True samples 94.0, % of padding 6.0, Total time 10.81\n", "\n" ] }, { "data": { "text/plain": [ "[200,\n", " 240.0,\n", " 288.0,\n", " 345.59999999999997,\n", " 414.71999999999997,\n", " 497.66399999999993,\n", " 597.1967999999999,\n", " 716.6361599999999,\n", " 859.9633919999999,\n", " 1031.9560703999998,\n", " 1238.3472844799996,\n", " 1486.0167413759996,\n", " 1783.2200896511995,\n", " 2139.8641075814394,\n", " 2567.836929097727,\n", " 3081.4043149172726,\n", " 3697.685177900727,\n", " 4437.222213480873,\n", " 5324.666656177047,\n", " 6389.599987412456,\n", " 7667.519984894947,\n", " 9201.023981873936,\n", " 11041.228778248722,\n", " 13249.474533898467,\n", " 15899.36944067816,\n", " 19079.24332881379,\n", " 22895.09199457655,\n", " 27474.110393491857,\n", " 32968.93247219023,\n", " 39562.71896662827,\n", " 47475.26275995393,\n", " 56970.31531194471,\n", " 68364.37837433365,\n", " 82037.25404920038,\n", " 98444.70485904044,\n", " 118133.64583084853,\n", " 141760.37499701823,\n", " 170112.44999642187,\n", " 204134.93999570623,\n", " 244961.92799484747,\n", " 293954.31359381694]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# number of buckets: fewer buckets means more randomness\n", "n_buckets = 40\n", "# create exponentially spaced bucket boundaries: each one is 1.2x the previous\n", "max_batch_len = 20000\n", "import numpy as np\n", "batch_multiplier = 1.2\n", "buckets_bounds = [200]\n", "for x in range(n_buckets):\n", " 
buckets_bounds.append(buckets_bounds[-1]*batch_multiplier)\n", "\n", 
"dynamic_batcher = DynamicBatchSampler(train_data,\n", " max_batch_length=max_batch_len,\n", " bucket_boundaries=buckets_bounds,\n", " length_func=lambda x: x[\"length\"] / 160) # length in units of 10 ms\n", "\n", "dataloader = DataLoader(train_data, batch_sampler=dynamic_batcher, collate_fn=PaddedBatch)\n", "percent_true, percent_padded, elapsed = count_samples(dataloader)\n", "print(\"With Dynamic Batching: % True samples {:.1f}, % of padding {:.1f}, Total time {:.2f}\\n\".format(percent_true*100, percent_padded*100, elapsed))\n", "\n", "# display the bucket boundaries used above\n", "buckets_bounds" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 5, "status": "ok", "timestamp": 1718826717317, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "ioyIneckuE1_", "outputId": "9b8bcd54-d640-499f-8b16-517a3817f6e5" }, "outputs": [ { "data": { "text/plain": [ "115" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(dynamic_batcher._batches)" ] }, { "cell_type": "markdown", "metadata": { "id": "cgRNd2latc2a" }, "source": [ "Using bucket boundaries that better match the length distribution reduces the amount of padding (here from 10.2% to 6.0%).\n", "\n", "---\n", "\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "executionInfo": { "elapsed": 8850, "status": "ok", "timestamp": 1718826726164, "user": { "displayName": "adel moumen", "userId": "01620107593621714109" }, "user_tz": 240 }, "id": "RtHZft5Vg010" }, "outputs": [], "source": [ "lengths = np.array([torchaudio.info(x).num_frames for x in all_flacs])\n", "from scipy.stats 
import beta\n", "lengths = (lengths - np.amin(lengths)) / (np.amax(lengths)- np.amin(lengths))\n", "lengths = np.clip(lengths, 1e-6, 1-1e-6)\n", "a, b, loc, upper = beta.fit(lengths, floc=0, fscale=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "HHAEaTMc9rg7" }, "source": [ "## How to find good hyper-parameters and speed up training with DynamicBatchSampler\n", "\n", "\n", "Training speed largely depends on:\n", "\n", "\n", "* `max_batch_length`: set this as high as possible without running into out-of-memory (OOM) errors.\n", "* `num_buckets`: avoid values that are too low or too high. As discussed above, with too few buckets short examples are batched together with long ones, which increases padding; with too many, almost every example ends up batched alone. In both cases, training will be extremely slow.\n", "\n", "\n", "Finding a good value for `max_batch_length`:\n", "\n", "\n", "1. Sort the dataset in descending order, set `shuffle = False` and `batch_ordering = \"descending\"`, and do multiple short runs, increasing `max_batch_length` until you get an OOM error. Choose a value slightly below the one that leads to OOM.\n", "\n", "Finding a good value for `num_buckets`:\n", "\n", "1. Without using `DynamicBatchSampler`, sort the dataset in descending order and find the maximum batch size that your GPU can handle. Note the estimated time and number of batches reported for this configuration in the very first iterations.\n", "2. Sort the dataset in descending order, set `shuffle = False` and `batch_ordering = \"descending\"`, and set `max_batch_length` to the value found before. Start with `num_buckets` between 10 and 20, then try a few values in short runs, looking at the estimated time and number of batches for each configuration. Choose the value that gives fewer batches than in step 1 (without dynamic batching) and a lower estimated time."
] }, { "cell_type": "markdown", "metadata": { "id": "gVJiv2igqSyb" }, "source": [ "### Dynamic Batching with WebDataset\n", "When working on an HPC cluster, it is crucial to copy the dataset to the SSD of the local computing node. This step significantly improves data-IO performance and avoids slowing down the shared filesystem. In some cases, however, the dataset may be too big to fit on the SSD. This scenario is becoming more common with the adoption of larger and larger datasets.\n", "\n", "SpeechBrain supports [WebDataset](https://github.com/webdataset/webdataset), which allows users to efficiently read datasets from the shared file system.\n", "The proposed WebDataset-based solution also supports dynamic batching. For more information, please take a look at [this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/advanced/data-loading-for-big-datasets-and-shared-filesystems.html)." ] }, { "cell_type": "markdown", "metadata": { "id": "B3jUpS4pxv3q" }, "source": [ "## Acknowledgements\n", "\n", "The SpeechBrain `DynamicBatchSampler` was developed by Ralf Leibold and Andreas Nautsch, with the help of Samuele Cornell." ] }, { "cell_type": "markdown", "metadata": { "id": "sb_auto_footer", "tags": [ "sb_auto_footer" ] }, "source": [ "## Citing SpeechBrain\n", "\n", "If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:\n", "\n", "```bibtex\n", "@misc{speechbrainV1,\n", " title={Open-Source Conversational AI with {SpeechBrain} 1.0},\n", " author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas 
Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},\n", " year={2024},\n", " eprint={2407.00463},\n", " archivePrefix={arXiv},\n", " primaryClass={cs.LG},\n", " url={https://arxiv.org/abs/2407.00463},\n", "}\n", "@misc{speechbrain,\n", " title={{SpeechBrain}: A General-Purpose Speech Toolkit},\n", " author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},\n", " year={2021},\n", " eprint={2106.04624},\n", " archivePrefix={arXiv},\n", " primaryClass={eess.AS},\n", " note={arXiv:2106.04624}\n", "}\n", "```" ] } ], "metadata": { "colab": { "provenance": [ { "file_id": "1Z5JmWionKAgTkWEbzLpb_kea6VEdRVAZ", "timestamp": 1639958144782 }, { "file_id": "19y3Z2moUYJA_ofvear6IG9LqpN1-uYYE", "timestamp": 1639958094720 }, { "file_id": "1SKvv_hO9R6vlBIb7_9fpQG6VBKQ8DJON", "timestamp": 1639432761997 } ] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 4 }