Lab For AI

The 2-bit Quantization is Insane! See How to Run Mixtral-8x7B on Free-tier Colab.

A Quick Tutorial for AQLM-2-bit Quantization and its Implementation

Yeyu Huang
Feb 19, 2024

There has been increasing demand to make open-source language models more accessible to end users for local inference and fine-tuning. Meeting that demand means significantly reducing the models' computational resource requirements so they can run on more affordable hardware.

A particularly effective method is quantization: lowering the bit-width used to represent a model's weights, which shrinks the model's overall size and reduces the memory traffic needed to run it.

The traditional approach to quantization selects a quantization grid and a normalizer (scale) for different parts of the model and then maps the model weights onto that grid. The mapping itself may be simple rounding or a more complex allocation. Either way, it creates a trade-off between size and accuracy, especially under heavy compression, and the loss shows up directly in perplexity, a metric of the model's predictive capability.

Quantization process between fp16 and int8
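To make the mapping concrete, here is a minimal NumPy sketch of the traditional approach: symmetric round-to-nearest int8 quantization with a per-group scale. It is for intuition only, not how AQLM works, and the reconstruction error it prints is exactly the kind of loss that surfaces as higher perplexity.

```python
# A toy illustration of grid + normalizer quantization (NOT AQLM):
# pick a scale per weight group, round the fp weights onto an int8 grid,
# then dequantize to measure the reconstruction error.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric round-to-nearest quantization of one weight group to int8."""
    scale = np.max(np.abs(weights)) / 127.0                      # the "normalizer"
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)            # a toy weight group

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("mean squared reconstruction error:", np.mean((w - w_hat) ** 2))
```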


AQLM

Against this backdrop, a team from Austria recently released AQLM (Additive Quantization of Language Models), a new state-of-the-art method for 2–3 bit LLM quantization. The paper's approach differs fundamentally from some previous methods, which required complicated formats to manage quantization outliers.

Two factors normally matter when evaluating a quantization method: perplexity (a proxy for the model's intelligence) and speed. Comparing QuIP#-2bit, AQLM-2bit, and the original FP16 across three scales of Llama-2 models, it is striking that AQLM-2bit consistently outperforms QuIP#-2bit, and that the AQLM-2bit 70B model scores around 4 in perplexity on WikiText2, much better than the original 13B model. In my experience, once the perplexity score is under 4, the model's inference is good enough for practical applications.

Inference speed on an Nvidia 3090 is also within an acceptable range compared to the original model.

The detailed algorithm of AQLM is explained in the paper. In essence, the team adapts Additive Quantization (AQ), a form of Multi-Codebook Quantization (MCQ) classically employed in information retrieval systems for efficient database compression and search, to compress LLM weights. The quantization is learned in an input-adaptive fashion, with codebooks jointly optimized across blocks of layers.
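To build intuition for the multi-codebook idea (the real algorithm learns its codebooks on calibration data and refines the assignments far more carefully than this), here is a toy sketch in which a group of weights is approximated by the sum of one codeword from each codebook, so only the chosen indices need to be stored:

```python
# Toy illustration of additive (multi-codebook) quantization:
# each group of d weights is approximated by the SUM of one codeword from
# each of M codebooks, so only M small indices are stored per group.
# Here: 2 indices x 8 bits over 8 weights is roughly 2 bits/weight (ignoring
# the shared codebooks themselves). Codebooks are random and encoding is
# greedy, purely for intuition; real AQLM learns and optimizes all of this.
import numpy as np

rng = np.random.default_rng(0)
d, M, K = 8, 2, 256                              # group size, codebooks, codewords each
codebooks = rng.normal(0, 0.01, size=(M, K, d))  # M codebooks of K d-dimensional codewords

def encode(group: np.ndarray) -> list[int]:
    """Greedily pick one codeword per codebook whose sum approximates the group."""
    residual, indices = group.copy(), []
    for m in range(M):
        errs = np.sum((codebooks[m] - residual) ** 2, axis=1)
        idx = int(np.argmin(errs))
        indices.append(idx)
        residual = residual - codebooks[m, idx]
    return indices

def decode(indices: list[int]) -> np.ndarray:
    return sum(codebooks[m, idx] for m, idx in enumerate(indices))

group = rng.normal(0, 0.02, size=d)
codes = encode(group)
print("stored indices:", codes)
print("reconstruction error:", float(np.linalg.norm(group - decode(codes))))
```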

The validation results are very interesting: not only does the 2.07-bit Llama-2-70B model score below 4 on the WikiText2 benchmark, but the 2.5-bit Llama-2-13B model is also slightly more accurate than the original 16-bit 7B, at a much smaller size that runs on a budget GPU.

AQLM 2.53-bit 13B is better than 16-bit 7B

You can find details in the paper on:

  1. The AQLM algorithm, including layer calibration of input/output activations and intra-layer quantization parameter optimization.

  2. Validation of AQLM's effectiveness.

Now, let’s go through the code to see how much cost we can save using AQLM models.

Code Walkthrough

In this demo, we will use AQLM with the Hugging Face Transformers library to run inference on a 2-bit compressed version of the Mixtral-8x7B model on the free tier of Google Colab, with only a ~13GB T4 GPU.

Mixtral-8x7B is a sparse mixture-of-experts model from Mistral AI. It has proven itself by outperforming Llama-2-70B on generation quality with 6x faster inference, and it matches or outperforms GPT-3.5 on most benchmarks. With a 32k-token context window, this open-source model is quite usable for practical generative applications.

Mixtral has 47B total parameters, but thanks to its architecture it effectively processes each token with only about 13B parameters, which accounts for much of its inference speed. However, even a Mistral-7B needs a GPU with 24GB of VRAM for inference, so a well-provisioned Mixtral-8x7B deployment requires almost 64GB, which costs around $4.5/h on rented GPUs.
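A quick back-of-the-envelope estimate (weights only, ignoring activations, the KV cache, and the small codebook overhead AQLM adds) shows why a ~2-bit version can squeeze into the free T4's memory:

```python
# Rough weight-memory estimate for a 2-bit Mixtral-8x7B.
total_params = 47e9          # Mixtral-8x7B total parameter count
bits_per_weight = 2          # AQLM's ~2-bit configuration

weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights at {bits_per_weight} bits/weight")
# ~11.8 GB, which leaves room for activations on a free-tier T4
```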

Now, with the 2-bit AQLM version, we can run this model entirely for free in a Colab notebook.
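As a sketch of what the setup looks like, the cell below loads an AQLM-quantized Mixtral through Transformers and generates a short completion. The package pins and the model repo id are my assumptions for illustration, not necessarily the exact ones used in the notebook.

```python
# In Colab, install the AQLM inference kernels plus a recent Transformers/Accelerate first:
#   !pip install aqlm[gpu] "transformers>=4.38.0" accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id of an AQLM 2-bit Mixtral checkpoint on the Hugging Face Hub
model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",    # use the dtype the quantized checkpoint expects
    device_map="auto",     # place the ~12 GB of weights on the T4 GPU
    trust_remote_code=True,
)

prompt = "Explain mixture-of-experts language models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```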
