Quantization vs Pruning vs Knowledge Distillation
In the LLM world, when it comes to inference, a well-run and well-built LLM system has two major requirements:
- High accuracy, which drives more usage
- Low latency, which keeps costs low
These two requirements sit on opposite sides of a tradeoff that affects almost all ML models and systems. The most accurate LLMs are highly complex multi-layer transformer models trained on trillions of tokens, so at inference time billions of computations are required to generate a single token, consuming substantial compute. Three common techniques address this problem, each sketched briefly after the list below:
- Quantization: representing the model's weights (and sometimes activations) in a lower-precision format such as int8 or 4-bit instead of float32/float16, shrinking memory footprint and speeding up computation at a small cost in accuracy.
- Pruning: removing weights (or entire neurons, attention heads, or layers) that contribute little to the model's output, leaving a sparser, cheaper model.
- Distillation: training a smaller "student" model to reproduce the outputs of a larger "teacher" model, so the student keeps much of the teacher's accuracy at a fraction of the inference cost.
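
To make the three ideas concrete, here is a minimal Python/NumPy sketch of each: symmetric int8 quantization, magnitude pruning, and a temperature-scaled distillation loss. This is an illustrative toy, not the production path; real systems use libraries such as bitsandbytes, torch.ao, or TensorRT, and the helper names below are made up for this example.

```python
import numpy as np

# --- Quantization: map float32 weights to int8 with a per-tensor scale ---
def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0              # symmetric range [-127, 127]
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                # approximate reconstruction

# --- Pruning: zero out the smallest-magnitude weights (magnitude pruning) ---
def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

# --- Distillation: student mimics the teacher's softened output distribution ---
def softmax(x: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

def distillation_loss(teacher_logits, student_logits, T: float = 2.0) -> float:
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL divergence between softened teacher and student distributions
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))

if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max quantization error:", np.abs(w - dequantize(q, s)).max())
    print("non-zeros after pruning:", np.count_nonzero(magnitude_prune(w)))
    print("distillation loss:", distillation_loss(np.array([2.0, 1.0, 0.1]),
                                                  np.array([1.5, 1.2, 0.3])))
```

Each helper captures the core trick of its technique: quantization trades precision for smaller, faster arithmetic; pruning trades parameter count for sparsity; distillation trades model size for a training objective that copies the teacher's behavior.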