AWQ 4bit Inference

We integrated AWQ (activation-aware weight quantization) into FastChat to provide efficient and accurate 4-bit LLM inference.

Install AWQ

Set up the environment (refer to the llm-awq repository, cloned below, for more details):

conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip    # enable PEP 660 support
pip install -e .             # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .             # install awq package

cd awq/kernels
python setup.py install      # install awq CUDA kernels
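
To verify the install, a quick Python check can confirm that the pieces import cleanly. This is a minimal sketch; it assumes the compiled kernel extension is exposed as awq_inference_engine (the module name used by the llm-awq kernel setup), so adjust the import if your build names it differently.

# verify_awq_install.py -- sanity check for the AWQ setup
import torch

# AWQ 4-bit kernels require a CUDA-capable GPU
assert torch.cuda.is_available(), "No CUDA device found"

import awq                     # the llm-awq Python package
import awq_inference_engine    # the compiled CUDA kernels (module name is an assumption)

print("CUDA device:", torch.cuda.get_device_name(0))
print("awq package and CUDA kernels imported successfully")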

Chat with the CLI

# Download the quantized model from Hugging Face into models/
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq

# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128 
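
The CLI is not the only entry point: the same quantized model can back FastChat's OpenAI-compatible API stack and be queried from Python. The sketch below is only an illustration; it assumes you have already launched FastChat's controller, a model worker serving the quantized model (with the same AWQ flags, if your FastChat version supports them on the worker), and fastchat.serve.openai_api_server on the default http://localhost:8000, and that the model is registered under the name shown.

# query_awq_model.py -- minimal sketch of querying the AWQ-backed model
# through FastChat's OpenAI-compatible API server (assumed at localhost:8000).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-7b-v1.3-4bit-g128-awq",   # model name as registered by the worker (assumption)
        "messages": [{"role": "user", "content": "Explain AWQ in one sentence."}],
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])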

Benchmark

  • Through 4-bit weight quantization, AWQ makes it possible to run larger language models within a device's memory limits and significantly accelerates token generation. All benchmarks below use a group size of 128; a sketch showing how per-token latency can be measured on your own hardware follows the tables.

  • Benchmark on NVIDIA RTX A6000:

    | Model           | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
    |-----------------|------|------------------|------------------|-------------|
    | vicuna-7b       | 16   | 13543            | 26.06            | /           |
    | vicuna-7b       | 4    | 5547             | 12.43            | 2.1x        |
    | llama2-7b-chat  | 16   | 13543            | 27.14            | /           |
    | llama2-7b-chat  | 4    | 5547             | 12.44            | 2.2x        |
    | vicuna-13b      | 16   | 25647            | 44.91            | /           |
    | vicuna-13b      | 4    | 9355             | 17.30            | 2.6x        |
    | llama2-13b-chat | 16   | 25647            | 47.28            | /           |
    | llama2-13b-chat | 4    | 9355             | 20.28            | 2.3x        |
  • NVIDIA RTX 4090:

    | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    |-----------------|---------------------------|-----------------------|-------------|
    | vicuna-7b       | 8.61                      | 19.09                 | 2.2x        |
    | llama2-7b-chat  | 8.66                      | 19.97                 | 2.3x        |
    | vicuna-13b      | 12.17                     | OOM                   | /           |
    | llama2-13b-chat | 13.54                     | OOM                   | /           |
  • NVIDIA Jetson Orin:

    | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    |-----------------|---------------------------|-----------------------|-------------|
    | vicuna-7b       | 65.34                     | 93.12                 | 1.4x        |
    | llama2-7b-chat  | 75.11                     | 104.71                | 1.4x        |
    | vicuna-13b      | 115.40                    | OOM                   | /           |
    | llama2-13b-chat | 136.81                    | OOM                   | /           |
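
The numbers above come from the upstream benchmark harness. If you want a rough ms/token measurement on your own hardware, the sketch below times greedy decoding with Hugging Face Transformers. It only covers the FP16 baseline (the 4-bit checkpoint is loaded through FastChat/llm-awq rather than plain Transformers), and the model name, prompt, and token count are illustrative assumptions.

# time_ms_per_token.py -- rough FP16 ms/token measurement (illustrative only;
# the official numbers above were produced with a different harness).
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.3"   # FP16 baseline model (assumption)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

inputs = tok("Tell me about weight quantization.", return_tensors="pt").to("cuda")
new_tokens = 128

# Warm up so CUDA kernels and caches are initialized before timing.
model.generate(**inputs, max_new_tokens=8, do_sample=False)

torch.cuda.synchronize()
start = time.time()
# min_new_tokens keeps the token count fixed even if EOS is reached early.
model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f"{elapsed / new_tokens * 1000:.2f} ms/token")
print(f"max GPU memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")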