Models

This section gives a brief overview of the models used in this project, grouped by input type: text, image, and audio.

Available Text Models
Model	Description
BERT	A bidirectional transformer encoder suited to tasks such as text classification and question answering (QA).
RoBERTa	A variant of BERT trained with a revised pretraining recipe (more data, longer training, dynamic masking, no next-sentence prediction) that improves NLP performance.
GPT-2	An autoregressive transformer language model used for text generation, capable of producing coherent text.
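
The practical difference between the encoder models (BERT, RoBERTa) and GPT-2 is which positions each token may attend to. The following is a minimal sketch in plain Python, not code from this project; the function names `bidirectional_mask` and `causal_mask` are hypothetical and chosen only for illustration.

```python
def bidirectional_mask(n):
    """Encoder-style mask (BERT, RoBERTa): every token may attend to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask (GPT-2): token i may attend only to positions 0..i."""
    return [[1 if key <= query else 0 for key in range(n)] for query in range(n)]


if __name__ == "__main__":
    # Rows are query positions, columns are key positions.
    for row in causal_mask(4):
        print(row)
```

The causal mask is lower triangular, which is what lets a GPT-2 style model generate text one token at a time without seeing future tokens.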

Available Image Models
Model	Description
ResNet18/34/50	A family of CNNs with 18, 34, and 50 layers respectively, utilizing residual connections to improve training in deep networks.
DenseNet121/161	CNNs in which each layer within a dense block receives the feature maps of all preceding layers, encouraging feature reuse and reducing parameters while maintaining high accuracy in image classification tasks.
MobileNetV2	A lightweight CNN optimized for mobile and resource-constrained environments, effective for image classification.
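InceptionV3	A CNN built from inception modules, which apply convolutions of several kernel sizes in parallel to capture multi-scale features for image classification.
GoogleNet	The original Inception architecture (GoogLeNet, also called Inception v1) and predecessor of InceptionV3; it uses inception modules and is designed for efficient computation.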
ShuffleNetV2_x1_0	A lightweight CNN designed for fast computation on mobile devices, balancing accuracy and efficiency.
EfficientNet-B0	The baseline model of the EfficientNet family, whose larger variants are obtained by jointly scaling depth, width, and input resolution to achieve high performance on image tasks.
AlexNet	One of the earliest deep CNNs that popularized deep learning, effective in image classification tasks.
VGG11/16/19	A set of deep CNNs with 11, 16, or 19 layers, known for their simplicity and effectiveness in image classification.
Vision Transformer (ViT-B_16)	A transformer model for image classification that splits each image into 16x16 patches and processes them as a sequence, rather than relying on convolutions.
R3D (ResNet3D)	A ResNet-style 3D CNN for video classification tasks, extending 2D convolutions to three dimensions so that filters cover both spatial and temporal information.
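
To make the "sequence of patches" idea behind ViT-B_16 concrete, here is a minimal sketch in plain Python. It is not the project's implementation: `image_to_patches` and `sequence_length` are hypothetical helpers, the image is a toy nested list, and the 224x224 input size is an assumption (a common default, not something stated here).

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens the transformer sees for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)


if __name__ == "__main__":
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
    patches = image_to_patches(toy, 2)
    print(len(patches), patches[0])   # 4 [0, 1, 4, 5]
    print(sequence_length(224, 16))   # 197 (assumed 224x224 input)
```

Under the assumed 224x224 input, a patch size of 16 gives 14 x 14 = 196 patches, plus one classification token.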

Available Audio Models
Model	Description
Hubert	HuBERT, a transformer model pretrained with self-supervised learning on audio data, used for speech recognition and audio classification tasks.
AudioCNN	A CNN specifically designed to process and classify raw audio signals.
AudioLSTM	An LSTM-based model tailored for sequential audio data, used in tasks like speech recognition and audio classification.
X-Vector	A model used for speaker verification that maps a variable-length utterance to a fixed-size speaker embedding (an x-vector), which is then used for classification.
VGGVox	A CNN-based model for speaker recognition tasks, adapted from the VGG architecture for audio inputs.
SpeechEmbedder	A model that extracts speaker embeddings from audio, used in speaker verification tasks.
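
Speaker embeddings such as those produced by X-Vector or SpeechEmbedder are typically compared with a similarity score to decide whether two recordings come from the same speaker. The sketch below shows one common approach, cosine similarity with a threshold. It is an illustration only: the function names, the toy three-dimensional embeddings, and the 0.7 threshold are all assumptions, not values used by this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled, probe, threshold=0.7):
    """Accept the probe if it is close enough to the enrolled embedding.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out verification trials.
    """
    return cosine_similarity(enrolled, probe) >= threshold


if __name__ == "__main__":
    enrolled = [0.9, 0.1, 0.4]       # toy embedding of the enrolled speaker
    probe_same = [0.8, 0.2, 0.5]     # toy embedding pointing in a similar direction
    probe_other = [-0.3, 0.9, -0.2]  # toy embedding pointing elsewhere
    print(same_speaker(enrolled, probe_same))
    print(same_speaker(enrolled, probe_other))
```

Real embeddings have far more dimensions, but the comparison step has the same shape.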