
- LoRA (Low-Rank Adaptation) is mostly used for fine-tuning LLMs (e.g. GPT-4, Claude 2, LLaMA 70B)
- Fine-tuning is the process of training a pre-trained model on a specific, smaller dataset to specialize its performance for a particular task or domain.
- Using LoRA, the number of trainable parameters for GPT-3 is reduced by roughly 10,000x and the GPU memory requirement by 3x.
- LoRA uses the concept of low-rank matrices (the rank of a matrix is a measure of the "information content" or "dimensionality" of the data it represents)
- Instead of updating a large weight matrix directly, the weight update is decomposed into two much smaller low-rank matrices
- Refer to the paper *LoRA: Low-Rank Adaptation of Large Language Models* and the Hugging Face post *PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware*
- A new LoRA training method was introduced: Quantized Low-Rank Adaptation (QLoRA). It leverages the bitsandbytes library for on-the-fly, near-lossless quantization of language models and applies it to the LoRA training procedure. This results in massive reductions in memory requirements, enabling the training of models as large as 70 billion parameters on 2x NVIDIA RTX 3090s. For comparison, fine-tuning a model of that size class would normally require over 16x A100-80GB GPUs, at tremendous cost.
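The parameter savings from the low-rank decomposition can be checked with simple arithmetic. A minimal sketch, assuming illustrative dimensions (a square 4096x4096 attention weight, rank r = 8 — these values are assumptions, not taken from the paper's tables):

```python
# Compare a full weight update (d x k) with its LoRA decomposition
# into B (d x r) and A (r x k), where r << min(d, k).
d, k, r = 4096, 4096, 8  # illustrative layer dims and LoRA rank

full_params = d * k          # parameters in a full weight update
lora_params = d * r + r * k  # parameters in the two low-rank factors

print(full_params)                # 16777216
print(lora_params)                # 65536
print(full_params / lora_params)  # 256.0
```

With rank 8 the trainable-parameter count for this single matrix drops by 256x; applying LoRA only to selected matrices across all layers is what compounds into the ~10,000x reduction reported for GPT-3.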
Modified forward pass using the low-rank decomposition: h = W0·x + (α/r)·B·A·x, where W0 is frozen and only A and B are trained.
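The modified forward pass can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation; the dimensions, rank, and scaling value below are assumptions chosen for readability:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 16, 16, 4, 8  # illustrative dims, LoRA rank, scaling

W0 = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01 # trainable, small random init
B = np.zeros((d_out, r))              # trainable, zero init so the update starts at 0

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x  -- base path plus scaled low-rank path
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B is initialized to zero, the LoRA path contributes nothing at the
# start of training, so the output matches the frozen base model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```

During training, gradients flow only into A and B; at inference the product B·A can be merged into W0, so LoRA adds no extra latency.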
Read more here: https://www.datacamp.com/tutorial/mastering-low-rank-adaptation-lora-enhancing-large-language-models-for-efficient-adaptation
LoRA training guide: https://rentry.org/lora_train