# AWQ 4bit Inference
We integrated AWQ into FastChat to provide efficient and accurate 4bit LLM inference.
## Install AWQ
Set up the environment (please refer to this link for more details):
```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq

# cd /path/to/FastChat
pip install --upgrade pip  # enable PEP 660 support
pip install -e .           # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .           # install awq package

cd awq/kernels
python setup.py install    # install awq CUDA kernels
```
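
To verify the installation, you can try importing both packages. A minimal sanity check, assuming the CUDA extension built in the last step is exposed as `awq_inference_engine` (adjust the module name if your llm-awq version differs):

```bash
# Sanity check: the awq package and its CUDA kernel extension should import cleanly.
# The extension name awq_inference_engine is an assumption; check llm-awq's setup.py if this fails.
python -c "import awq, awq_inference_engine; print('AWQ install OK')"
```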
## Chat with the CLI
```bash
# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq

# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
```
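
Beyond the interactive CLI, the `--awq-*` options are part of FastChat's shared model arguments, so they should also apply when serving the checkpoint through the usual controller / model worker / OpenAI-compatible API stack. A sketch, assuming your FastChat version's model worker accepts the same AWQ flags as the CLI (run each command in its own terminal):

```bash
# Terminal 1: start the controller
python3 -m fastchat.serve.controller

# Terminal 2: start a model worker that loads the AWQ checkpoint
# (assumes the worker accepts the same --awq-* flags as the CLI)
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128

# Terminal 3: expose an OpenAI-compatible API on port 8000
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```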
## Benchmark
Through 4-bit weight quantization, AWQ lets larger language models fit within device memory constraints and significantly accelerates token generation. At 4 bits, the weights of a 7B model occupy roughly a quarter of their FP16 footprint (about 3.5 GiB instead of about 13 GiB), which is reflected in the Max Memory column below. All benchmarks are run with group_size 128.
Benchmark on NVIDIA RTX A6000:
| Model           | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
|-----------------|------|------------------|------------------|-------------|
| vicuna-7b       | 16   | 13543            | 26.06            | /           |
| vicuna-7b       | 4    | 5547             | 12.43            | 2.1x        |
| llama2-7b-chat  | 16   | 13543            | 27.14            | /           |
| llama2-7b-chat  | 4    | 5547             | 12.44            | 2.2x        |
| vicuna-13b      | 16   | 25647            | 44.91            | /           |
| vicuna-13b      | 4    | 9355             | 17.30            | 2.6x        |
| llama2-13b-chat | 16   | 25647            | 47.28            | /           |
| llama2-13b-chat | 4    | 9355             | 20.28            | 2.3x        |

NVIDIA RTX 4090:

| Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
|-----------------|---------------------------|-----------------------|-------------|
| vicuna-7b       | 8.61                      | 19.09                 | 2.2x        |
| llama2-7b-chat  | 8.66                      | 19.97                 | 2.3x        |
| vicuna-13b      | 12.17                     | OOM                   | /           |
| llama2-13b-chat | 13.54                     | OOM                   | /           |

NVIDIA Jetson Orin:

| Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
|-----------------|---------------------------|-----------------------|-------------|
| vicuna-7b       | 65.34                     | 93.12                 | 1.4x        |
| llama2-7b-chat  | 75.11                     | 104.71                | 1.4x        |
| vicuna-13b      | 115.40                    | OOM                   | /           |
| llama2-13b-chat | 136.81                    | OOM                   | /           |