
- LoRA (Low-Rank Adaptation) is mostly used for fine-tuning LLMs (e.g. GPT-4, Claude 2, LLaMA 70B)
- Fine-tuning is the process of training a pre-trained model on a specific, smaller dataset to specialize its performance for a particular task or domain.
- Using LoRA, the number of trainable parameters for GPT-3 is reduced by roughly 10,000x and the GPU memory requirement by 3x.
- LoRA uses the concept of low-rank matrices (the rank of a matrix is a measure of the "information content" or "dimensionality" of the data it represents)
- Instead of updating a large weight matrix directly, the weight update is decomposed into two much smaller low-rank matrices
- Refer to the paper *LoRA: Low-Rank Adaptation of Large Language Models* and the Hugging Face post *PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware*
- A new LoRA training method was introduced: Quantized Low-Rank Adaptation (QLoRA). It leverages the bitsandbytes library for on-the-fly, near-lossless quantization of language models and applies it to the LoRA training procedure. This results in massive reductions in memory requirements, enabling the training of models as large as 70 billion parameters on 2x NVIDIA RTX 3090s. For comparison, fine-tuning a model of that size class would normally require over 16x A100-80GB GPUs, at tremendous cost.
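The parameter savings from the low-rank decomposition can be checked with simple arithmetic. A minimal sketch, assuming illustrative dimensions (a square 4096x4096 attention weight, rank r = 8 — these values are assumptions, not taken from the paper's tables):

```python
# Compare a full weight update (d x k) with its LoRA decomposition
# into B (d x r) and A (r x k), where r << min(d, k).
d, k, r = 4096, 4096, 8  # illustrative layer dims and LoRA rank

full_params = d * k          # parameters in a full weight update
lora_params = d * r + r * k  # parameters in the two low-rank factors

print(full_params)                # 16777216
print(lora_params)                # 65536
print(full_params / lora_params)  # 256.0
```

With rank 8 the trainable-parameter count for this single matrix drops by 256x; applying LoRA only to selected matrices across all layers is what compounds into the ~10,000x reduction reported for GPT-3.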
Modified forward pass using the low-rank decomposition: h = W0·x + (α/r)·B·A·x, where W0 is frozen and only A and B are trained.
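The modified forward pass can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation; the dimensions, rank, and scaling value below are assumptions chosen for readability:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 16, 16, 4, 8  # illustrative dims, LoRA rank, scaling

W0 = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01 # trainable, small random init
B = np.zeros((d_out, r))              # trainable, zero init so the update starts at 0

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x  -- base path plus scaled low-rank path
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B is initialized to zero, the LoRA path contributes nothing at the
# start of training, so the output matches the frozen base model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
```

During training, gradients flow only into A and B; at inference the product B·A can be merged into W0, so LoRA adds no extra latency.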
Read more here: https://www.datacamp.com/tutorial/mastering-low-rank-adaptation-lora-enhancing-large-language-models-for-efficient-adaptation
LoRA training guide: https://rentry.org/lora_train