GPTQ

I started out by simply running pip install auto-gptq, which caused a series of problems. My local machine has CUDA 11.6, and the first problem was that quantization would not run at all.

auto-gptq itself is a GitHub repository:

https://github.com/PanQiWei/AutoGPTQ

It was later integrated into the Transformers library; the official introduction reads:

🤗 Transformers has integrated optimum API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 4, 3 or even 2 bits without a big drop of performance and faster inference speed! This is supported by most GPU hardwares.

To learn more about model quantization, check out:

  • the GPTQ paper
  • the optimum guide on GPTQ quantization
  • the AutoGPTQ library used as the backend

1 Installing the auto-gptq package

First, check your local CUDA version, then install the matching auto-gptq package (a quick version check is sketched after the commands below).


You can install the latest stable release of AutoGPTQ from pip with pre-built wheels compatible with PyTorch 2.1 and PyTorch nightly:

For CUDA 12.1: pip install auto-gptq
For CUDA 11.8: pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
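
To pick the right index, it helps to confirm which CUDA version your environment actually targets. A minimal check in Python (this reports the CUDA version the installed PyTorch build was compiled against; the system toolkit reported by nvcc --version may differ):

# Report the CUDA version the installed PyTorch build targets, which determines
# whether the cu121, cu118, or an older wheel index is the right one.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())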




2 CUDA 11.6

I got this working on CUDA 11.6. The steps were as follows. First, install a matching build from the wheel index mentioned above; the lowest CUDA build I could find was cu117, so I installed it to see whether it would work:

https://huggingface.github.io/autogptq-index/whl/cu117/auto-gptq/

Directory Tree
[       4096 24-août-2023 19:19]    .
├── [    1630158 23-août-2023 10:20]    auto_gptq-0.4.1+cu117-cp310-cp310-linux_x86_64.whl
├── [    1268947 23-août-2023 10:32]    auto_gptq-0.4.1+cu117-cp310-cp310-win_amd64.whl
├── [    1630131 23-août-2023 10:27]    auto_gptq-0.4.1+cu117-cp311-cp311-linux_x86_64.whl
├── [    1269935 23-août-2023 10:21]    auto_gptq-0.4.1+cu117-cp311-cp311-win_amd64.whl
├── [    1629946 23-août-2023 10:19]    auto_gptq-0.4.1+cu117-cp38-cp38-linux_x86_64.whl
├── [    1268816 23-août-2023 10:13]    auto_gptq-0.4.1+cu117-cp38-cp38-win_amd64.whl
├── [    1630382 23-août-2023 10:19]    auto_gptq-0.4.1+cu117-cp39-cp39-linux_x86_64.whl
├── [    1269036 23-août-2023 10:34]    auto_gptq-0.4.1+cu117-cp39-cp39-win_amd64.whl
├── [    1631883 24-août-2023 16:32]    auto_gptq-0.4.2+cu117-cp310-cp310-linux_x86_64.whl
├── [    1507238 24-août-2023 16:39]    auto_gptq-0.4.2+cu117-cp310-cp310-win_amd64.whl
├── [    1631851 24-août-2023 16:30]    auto_gptq-0.4.2+cu117-cp311-cp311-linux_x86_64.whl
├── [    1507659 24-août-2023 16:25]    auto_gptq-0.4.2+cu117-cp311-cp311-win_amd64.whl
├── [    1631728 24-août-2023 16:32]    auto_gptq-0.4.2+cu117-cp38-cp38-linux_x86_64.whl
├── [    1506568 24-août-2023 17:05]    auto_gptq-0.4.2+cu117-cp38-cp38-win_amd64.whl
├── [    1632187 24-août-2023 16:24]    auto_gptq-0.4.2+cu117-cp39-cp39-linux_x86_64.whl
└── [    1506867 24-août-2023 17:11]    auto_gptq-0.4.2+cu117-cp39-cp39-win_amd64.whl


The version I installed was:

pip install auto_gptq-0.4.2+cu117-cp310-cp310-linux_x86_64.whl

Check the torch version in the local gptq virtual environment:

pip list|grep torch

Then uninstall the newer PyTorch build and install a regular PyTorch build that matches the local CUDA version:

PyTorch wheel index: https://download.pytorch.org/whl/torch_stable.html

Download the wheel that matches your local CUDA version and operating system (Linux/Windows). I downloaded torch-1.13.1+cu116-cp310-cp310-linux_x86_64.whl and installed it with:

pip install torch-1.13.1+cu116-cp310-cp310-linux_x86_64.whl

After that, quantization worked.
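
After the downgrade, it is worth verifying that the environment is consistent before quantizing. A minimal sanity check (the versions in the comments are just what I expect on my setup):

# Sanity check after the downgrade: confirm the torch build and that auto_gptq imports cleanly.
import torch
import auto_gptq  # raises ImportError if the wheel did not install correctly

print("torch:", torch.__version__)              # expected: 1.13.1+cu116
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)        # expected: 11.6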

3 Quantization: the auto-gptq GitHub code and the Hugging Face code

https://github.com/PanQiWei/AutoGPTQ

https://huggingface.co/docs/transformers/main_classes/quantization#autogptq-integration

Both of these work, but there is one issue: although quantization succeeded, it did not use the c4 calibration dataset recommended by the official repo and the paper:

auto-gptq GitHub code

import torch
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model_path = "/home/nlp/model/checkpoint-1200"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# bits=8 with the "c4" calibration dataset, as recommended by the official repo and the paper
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)
# Passing quantization_config makes from_pretrained quantize the model while loading
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", quantization_config=gptq_config)
# Save the quantized weights and the tokenizer
model.save_pretrained("/home/nlp/model/checkpoint-1200")
tokenizer.save_pretrained("/home/nlp/model/int_token")

Hugging Face code

import torch
import os
import logging
from transformers import AutoModelForCausalLM, LlamaTokenizer, GPTQConfig

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print('device: ', device)
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "/home/nlp/model/checkpoint-1200"
quantized_model_dir = "/home/nlp/model/int_token"

# Option 1: quantize with a public calibration dataset
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_dir)
gptq_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)
# Option 2: quantize with a custom calibration dataset
dataset = ["auto-gptq 是一个基于 GPTQ 算法的易于使用的模型量化库,具有用户友好的 api。"]
quantization = GPTQConfig(bits=8, dataset=dataset, tokenizer=tokenizer)

# Note: quantization_config selects which calibration dataset is used and yields the quantized model
quant_model = AutoModelForCausalLM.from_pretrained(pretrained_model_dir, device_map="auto", quantization_config=quantization)
# Print one quantized projection layer to verify that quantization actually took effect
# (for a Llama model the layers live under quant_model.model.layers, not quant_model.model.decoder.layers)
print(quant_model.model.layers[0].self_attn.q_proj.__dict__)
# Test the quantized model
text = "My name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = quant_model.generate(**inputs)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Save the quantized model and tokenizer
quant_model.save_pretrained(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)
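
Once saved this way, the quantized checkpoint can be reloaded directly through the Transformers GPTQ integration. A minimal sketch, reusing quantized_model_dir from above and assuming optimum and auto-gptq are installed in the environment:

# Reload the saved GPTQ checkpoint for inference; the quantization config stored
# alongside the weights tells Transformers to load it as a quantized model.
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_dir = "/home/nlp/model/int_token"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, device_map="auto")

inputs = tokenizer("My name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))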

4 Official auto-gptq quantization code

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

5 Inference: remaining issues

For inference, you only need to replace the earlier LlamaForCausalLM with AutoGPTQForCausalLM, as follows:

model = AutoGPTQForCausalLM.from_quantized(MODEL_PATH, device="cuda:0")

or

model = AutoGPTQForCausalLM.from_quantized(
    model_dir,  # path to the model directory, containing config.json, tokenizer.json and the other model files
    model_basename="/nfs/chattt_int8_notice/gptq/gptq_model-8bit-128g",
    use_safetensors=True,
    device="cuda:0")
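
Putting the pieces together, a minimal inference sketch looks like this (the path is a placeholder for wherever the quantized model was saved):

# End-to-end inference with a GPTQ-quantized Llama checkpoint loaded through AutoGPTQ.
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL_PATH = "/home/nlp/model/int_token"  # placeholder path to the quantized model directory

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = AutoGPTQForCausalLM.from_quantized(MODEL_PATH, device="cuda:0", use_safetensors=True)

inputs = tokenizer("My name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True))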

Inference did run on CUDA 11.6, but it was extremely slow and printed the messages below:

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
tokenizer loaded successfully
starting to load the model
Exllama kernel is not installed, reset disable_exllama to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
CUDA extension not installed.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.



So the CUDA version still needed to be upgraded; after upgrading to CUDA 12.2, inference ran without problems.

Other CSDN references:

AutoGPTQ quantization practice based on Hugging Face (CSDN blog)

LLM - Model Load_in_8bit For LLaMA (BIT_666's CSDN blog)

LLaMA quantization and deployment (CSDN blog)
