A "quantized" guide to quantization in LLMs

Tl;dr: Quantization helps you shrink and speed up LLMs without sacrificing too much performance. From 8-bit variants all the way down to 1-bit, the flavors are wild, and necessary if you want to run powerful models on weak hardware.

What even is quantization?

[Fig: Example of quantization from FP32 to INT8]

When I was reading a paper on Google’s TPU for a grad course, I came across their explanation of what ‘quantization’ is, and it has stuck with me to this day. ...
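The FP32-to-INT8 mapping in the figure can be sketched in a few lines. This is a minimal illustration of per-tensor symmetric quantization, not the scheme from any particular paper; the function names and the choice of a [-127, 127] range are my own assumptions.

```python
def quantize_int8(values):
    """Map floats onto signed 8-bit codes in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0  # one FP32 scale for the whole tensor
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; the error is bounded by about scale/2 per value."""
    return [qi * scale for qi in q]

weights = [0.91, -0.44, 0.02, -1.30]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

Each weight now occupies 1 byte instead of 4, at the cost of a small rounding error per value; real quantizers refine this with per-channel scales, zero points, or calibration data.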

August 7, 2025 · Abishek Padaki (Abiks)

ONNX Runtime: Unleash On-Device AI

Tl;dr: ONNX Runtime is your cross-platform, hardware-accelerated inference engine for deploying fast, private AI anywhere.

What even is ONNX?

ONNX Runtime (pronounced like “Onix”, the rock-snake Pokémon) is a performance-focused scoring and inference engine for models in the Open Neural Network Exchange (ONNX) format, designed to support heavy workloads and deliver fast, reliable predictions in production. Developed by Microsoft, it provides APIs in Python, C++, C#, Java, and JavaScript, and runs on Windows, Linux, and macOS. Its core is implemented in C++ for speed, and it interoperates seamlessly with ONNX-exported models from PyTorch, TensorFlow, Keras, scikit-learn, LightGBM, XGBoost, and more. ...

May 17, 2025 · Abishek Padaki (Abiks)