A "quantized" guide to quantization in LLMs
TL;DR: Quantization helps you shrink and speed up LLMs without sacrificing too much performance. From 8-bit variants all the way down to 1-bit, the flavors are wild, and necessary if you want to run powerful models on weak hardware.

What even is quantization?

Fig: Example of quantization from FP32 to INT8

When I was reading a paper on Google’s TPU for a grad course, I came across their explanation of what ‘quantization’ is, and it has stuck with me to this day. ...
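To make the FP32-to-INT8 idea from the figure concrete, here is a minimal sketch of affine (scale-and-zero-point) quantization in NumPy. The function names and the specific min/max calibration are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization of an FP32 tensor to INT8 (illustrative sketch).

    Returns the INT8 values plus the scale and zero-point needed
    to map them back to (an approximation of) the original floats.
    """
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # The scale stretches the FP32 range across the 256 INT8 levels.
    scale = (x_max - x_min) / (qmax - qmin)
    # The zero-point shifts the grid so x_min lands on qmin.
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.5, -0.3, 0.0, 0.7, 2.1], dtype=np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize(q, scale, zp)
# Each weight now fits in 1 byte instead of 4; the rounding error
# introduced is at most about half a quantization step (scale / 2).
```

The payoff is the 4x size reduction per weight; the cost is the small, bounded rounding error that the rest of the article's "without sacrificing too much performance" claim hinges on.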