Models
======

This section gives a brief overview of the models used in this project, grouped by input modality: text, image (including video), and audio.

Text Models
------------

.. list-table:: Available Text Models
   :header-rows: 1

   * - Model
     - Description
   * - BERT
     - A bidirectional transformer encoder suited to language-understanding tasks such as text classification and question answering.
   * - RoBERTa
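     - A variant of BERT that keeps the architecture but revises the pretraining recipe (more data, longer training, dynamic masking, no next-sentence prediction) for better downstream performance.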
   * - GPT-2
     - An autoregressive transformer language model used for text generation; it predicts the next token given the preceding text.

Image Models
------------

.. list-table:: Available Image Models
   :header-rows: 1

   * - Model
     - Description
   * - ResNet18/34/50
     - A family of CNNs with 18, 34, and 50 layers respectively, utilizing residual connections to improve training in deep networks.
   * - DenseNet121/161
     - CNNs in which, within each dense block, every layer receives the feature maps of all preceding layers, encouraging feature reuse and reducing parameter count while maintaining high accuracy in image classification.
   * - MobileNetV2
     - A lightweight CNN optimized for mobile and resource-constrained environments, effective for image classification.
   * - InceptionV3
     - Known for its inception modules, this model efficiently handles multi-scale features for image classification.
   * - GoogleNet
     - The original Inception architecture (GoogLeNet, also called Inception v1) and the predecessor of InceptionV3; it uses inception modules and is designed for efficient computation.
   * - ShuffleNetV2_x1_0
     - A lightweight CNN designed for fast computation on mobile devices, balancing accuracy and efficiency.
   * - EfficientNet-B0
     - The baseline model of the EfficientNet family, whose larger variants are obtained by jointly scaling depth, width, and input resolution (compound scaling).
   * - AlexNet
     - One of the earliest deep CNNs that popularized deep learning, effective in image classification tasks.
   * - VGG11/16/19
     - A set of deep CNNs with 11, 16, or 19 layers, known for their simplicity and effectiveness in image classification.
   * - Vision Transformer (ViT-B_16)
     - A transformer model applied to image classification; it splits an image into fixed-size patches (16x16 pixels for ViT-B_16) and processes them as a sequence rather than applying convolutions.
   * - R3D (ResNet3D)
     - A 3D CNN for video classification tasks, extending 2D convolutions to three dimensions to handle spatial and temporal information.
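
The patch-sequence view used by ViT-B_16 can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a 224x224 RGB input and non-overlapping 16x16 patches, and ``patch_sequence_shape`` is a hypothetical helper, not a function from this project.

.. code-block:: python

   def patch_sequence_shape(height, width, patch_size=16, channels=3):
       """Return (num_patches, values_per_patch) for non-overlapping square patches.

       Hypothetical helper for illustration; input size and patch size are assumptions.
       """
       if height % patch_size or width % patch_size:
           raise ValueError("image sides must be divisible by patch_size")
       # Number of patches along each axis, multiplied together.
       num_patches = (height // patch_size) * (width // patch_size)
       # Each patch is flattened into one vector before the linear projection.
       values_per_patch = patch_size * patch_size * channels
       return num_patches, values_per_patch


   num_patches, values_per_patch = patch_sequence_shape(224, 224)
   print(num_patches, values_per_patch)  # 196 768

Under these assumptions the transformer sees a sequence of 196 patch vectors (plus any extra tokens the implementation adds), which is why patch size has a direct effect on sequence length and compute.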

Audio Models
------------

.. list-table:: Available Audio Models
   :header-rows: 1

   * - Model
     - Description
   * - Hubert
     - A transformer model pretrained with self-supervised learning on unlabeled audio and then fine-tuned for tasks such as speech recognition and audio classification.
   * - AudioCNN
     - A CNN specifically designed to process and classify raw audio signals.
   * - AudioLSTM
     - An LSTM-based model tailored for sequential audio data, used in tasks like speech recognition and audio classification.
   * - X-Vector
     - A model that maps a variable-length utterance to a fixed-size speaker embedding (an x-vector), used for speaker verification.
   * - VGGVox
     - A CNN-based model for speaker recognition tasks, adapted from the VGG architecture for audio inputs.
   * - SpeechEmbedder
     - A model that extracts speaker embeddings from audio, used in speaker verification tasks.
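
Speaker verification with embedding models such as X-Vector or SpeechEmbedder usually reduces to comparing two embeddings. The following is a minimal sketch of one common scoring choice, cosine similarity, and not this project's implementation: the function names, the embedding values, and the decision threshold of 0.7 are all assumptions made for illustration.

.. code-block:: python

   import math


   def cosine_similarity(a, b):
       """Cosine of the angle between two equal-length vectors."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       if norm_a == 0 or norm_b == 0:
           raise ValueError("embeddings must be non-zero")
       return dot / (norm_a * norm_b)


   def same_speaker(enrolled, test, threshold=0.7):
       """Accept the claim if the embeddings are similar enough.

       The threshold is a hypothetical value; in practice it is tuned on held-out data.
       """
       return cosine_similarity(enrolled, test) >= threshold


   # Made-up 3-dimensional embeddings; real speaker embeddings are much larger.
   enrolled = [0.2, 0.9, -0.1]
   candidate = [0.25, 0.85, -0.05]
   impostor = [-0.8, 0.1, 0.6]

   print(same_speaker(enrolled, candidate))  # True
   print(same_speaker(enrolled, impostor))   # False

Other scoring back ends exist, so treat the cosine score here as one option rather than the method these models require.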