240831-Qwen2-VL-7B/2B部署测试

原创

GuokLiu 2024-09-10 09:59:52 ©著作权

文章标签 Qwen2 Qwen2-VL 部署测试 ide 文章分类 Html/CSS 前端开发

©著作权归作者所有：来自51CTO博客作者GuokLiu的原创作品，请联系作者获取转载授权，否则将追究法律责任

A. 运行效果

A.1 界面版本

python web_demo.py --flash-attn2

截图

240831-Qwen2-VL-7B/2B部署测试_Qwen2-VL

代码

# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # 使用GPU 0, 在import torch之前定义


import copy
import re
from argparse import ArgumentParser
from threading import Thread

import gradio as gr
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration, TextIteratorStreamer

# DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
DEFAULT_CKPT_PATH = '/home/lgk/Downloads/Qwen2-VL-2B-Instruct'




def _get_args():
    parser = ArgumentParser()

    parser.add_argument('-c',
                        '--checkpoint-path',
                        type=str,
                        default=DEFAULT_CKPT_PATH,
                        help='Checkpoint name or path, default to %(default)r')
    parser.add_argument('--cpu-only', action='store_true', help='Run demo with CPU only')

    parser.add_argument('--flash-attn2',
                        action='store_true',
                        default=False,
                        help='Enable flash_attention_2 when loading the model.')
    parser.add_argument('--share',
                        action='store_true',
                        default=False,
                        help='Create a publicly shareable link for the interface.')
    parser.add_argument('--inbrowser',
                        action='store_true',
                        default=False,
                        help='Automatically launch the interface in a new tab on the default browser.')
    parser.add_argument('--server-port', type=int, default=5000, help='Demo server port.')
    parser.add_argument('--server-name', type=str, default='0.0.0.0', help='Demo server name.')

    args = parser.parse_args()
    return args


def _load_model_processor(args):
    if args.cpu_only:
        device_map = 'cpu'
    else:
        # device_map = 'auto'
        device_map = 'balanced_low_0'

    # Check if flash-attn2 flag is enabled and load model accordingly
    if args.flash_attn2:
        model = Qwen2VLForConditionalGeneration.from_pretrained(args.checkpoint_path,
                                                                torch_dtype='auto',
                                                                attn_implementation='flash_attention_2',
                                                                device_map=device_map)
    else:
        model = Qwen2VLForConditionalGeneration.from_pretrained(args.checkpoint_path, device_map=device_map)

    processor = AutoProcessor.from_pretrained(args.checkpoint_path)
    return model, processor


def _parse_text(text):
    lines = text.split('\n')
    lines = [line for line in lines if line != '']
    count = 0
    for i, line in enumerate(lines):
        if '```' in line:
            count += 1
            items = line.split('`')
            if count % 2 == 1:
                lines[i] = f'<pre><code class="language-{items[-1]}">'
            else:
                lines[i] = '<br></code></pre>'
        else:
            if i > 0:
                if count % 2 == 1:
                    line = line.replace('`', r'\`')
                    line = line.replace('<', '<')
                    line = line.replace('>', '>')
                    line = line.replace(' ', ' ')
                    line = line.replace('*', '*')
                    line = line.replace('_', '_')
                    line = line.replace('-', '-')
                    line = line.replace('.', '.')
                    line = line.replace('!', '!')
                    line = line.replace('(', '(')
                    line = line.replace(')', ')')
                    line = line.replace('$', '$')
                lines[i] = '<br>' + line
    text = ''.join(lines)
    return text


def _remove_image_special(text):
    text = text.replace('<ref>', '').replace('</ref>', '')
    return re.sub(r'<box>.*?(</box>|$)', '', text)


def _is_video_file(filename):
    video_extensions = ['.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv', '.webm', '.mpeg']
    return any(filename.lower().endswith(ext) for ext in video_extensions)


def _gc():
    import gc
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def _transform_messages(original_messages):
    transformed_messages = []
    for message in original_messages:
        new_content = []
        for item in message['content']:
            if 'image' in item:
                new_item = {'type': 'image', 'image': item['image']}
            elif 'text' in item:
                new_item = {'type': 'text', 'text': item['text']}
            elif 'video' in item:
                new_item = {'type': 'video', 'video': item['video']}
            else:
                continue
            new_content.append(new_item)

        new_message = {'role': message['role'], 'content': new_content}
        transformed_messages.append(new_message)

    return transformed_messages


def _launch_demo(args, model, processor):

    def call_local_model(model, processor, messages):

        messages = _transform_messages(messages)

        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors='pt')
        inputs = inputs.to(model.device)

        tokenizer = processor.tokenizer
        streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)

        gen_kwargs = {'max_new_tokens': 512, 'streamer': streamer, **inputs}

        thread = Thread(target=model.generate, kwargs=gen_kwargs)
        thread.start()

        generated_text = ''
        for new_text in streamer:
            generated_text += new_text
            yield generated_text

    def create_predict_fn():

        def predict(_chatbot, task_history):
            nonlocal model, processor
            chat_query = _chatbot[-1][0]
            query = task_history[-1][0]
            if len(chat_query) == 0:
                _chatbot.pop()
                task_history.pop()
                return _chatbot
            print('User: ' + _parse_text(query))
            history_cp = copy.deepcopy(task_history)
            full_response = ''
            messages = []
            content = []
            for q, a in history_cp:
                if isinstance(q, (tuple, list)):
                    if _is_video_file(q[0]):
                        content.append({'video': f'file://{q[0]}'})
                    else:
                        content.append({'image': f'file://{q[0]}'})
                else:
                    content.append({'text': q})
                    messages.append({'role': 'user', 'content': content})
                    messages.append({'role': 'assistant', 'content': [{'text': a}]})
                    content = []
            messages.pop()

            for response in call_local_model(model, processor, messages):
                _chatbot[-1] = (_parse_text(chat_query), _remove_image_special(_parse_text(response)))

                yield _chatbot
                full_response = _parse_text(response)

            task_history[-1] = (query, full_response)
            print('Qwen-VL-Chat: ' + _parse_text(full_response))
            yield _chatbot

        return predict

    def create_regenerate_fn():

        def regenerate(_chatbot, task_history):
            nonlocal model, processor
            if not task_history:
                return _chatbot
            item = task_history[-1]
            if item[1] is None:
                return _chatbot
            task_history[-1] = (item[0], None)
            chatbot_item = _chatbot.pop(-1)
            if chatbot_item[0] is None:
                _chatbot[-1] = (_chatbot[-1][0], None)
            else:
                _chatbot.append((chatbot_item[0], None))
            _chatbot_gen = predict(_chatbot, task_history)
            for _chatbot in _chatbot_gen:
                yield _chatbot

        return regenerate

    predict = create_predict_fn()
    regenerate = create_regenerate_fn()

    def add_text(history, task_history, text):
        task_text = text
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [(_parse_text(text), None)]
        task_history = task_history + [(task_text, None)]
        return history, task_history, ''

    def add_file(history, task_history, file):
        history = history if history is not None else []
        task_history = task_history if task_history is not None else []
        history = history + [((file.name,), None)]
        task_history = task_history + [((file.name,), None)]
        return history, task_history

    def reset_user_input():
        return gr.update(value='')

    def reset_state(_chatbot, task_history):
        task_history.clear()
        _chatbot.clear()
        _gc()
        return []

    with gr.Blocks(fill_height=True) as demo:
        gr.Markdown("""\
<p align="center"><img src="https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png" style="height: 80px"/><p>"""
                   )
        gr.Markdown("""<center><font size=8>Qwen2-VL</center>""")
        gr.Markdown("""\
<center><font size=3>This WebUI is based on Qwen2-VL, developed by Alibaba Cloud.</center>""")
        gr.Markdown("""<center><font size=3>本WebUI基于Qwen2-VL。</center>""")

        chatbot = gr.Chatbot(label='Qwen2-VL', elem_classes='control-height')
        query = gr.Textbox(lines=2, label='Input')
        task_history = gr.State([])

        with gr.Row():
            addfile_btn = gr.UploadButton('📁 Upload (上传文件)', file_types=['image', 'video'])
            submit_btn = gr.Button('🚀 Submit (发送)')
            regen_btn = gr.Button('🤔️ Regenerate (重试)')
            empty_bin = gr.Button('🧹 Clear History (清除历史)')

        submit_btn.click(add_text, [chatbot, task_history, query],
                         [chatbot, task_history]).then(predict, [chatbot, task_history], [chatbot], show_progress=True)
        submit_btn.click(reset_user_input, [], [query])
        empty_bin.click(reset_state, [chatbot, task_history], [chatbot], show_progress=True)
        regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
        addfile_btn.upload(add_file, [chatbot, task_history, addfile_btn], [chatbot, task_history], show_progress=True)

        gr.Markdown("""\
<font size=2>Note: This demo is governed by the original license of Qwen2-VL. \
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, \
including hate speech, violence, pornography, deception, etc. \
(注：本演示受Qwen2-VL的许可协议限制。我们强烈建议，用户不应传播及不应允许他人传播以下内容，\
包括但不限于仇恨言论、暴力、色情、欺诈相关的有害信息。)""")

    demo.queue().launch(
        share=args.share,
        inbrowser=args.inbrowser,
        server_port=args.server_port,
        server_name=args.server_name,
    )


def main():
    args = _get_args()
    model, processor = _load_model_processor(args)
    _launch_demo(args, model, processor)


if __name__ == '__main__':
    main()

A. 2 命令行版本

240831-Qwen2-VL-7B/2B部署测试_测试_02

B. 配置部署

如果可以执行下面就执行下面：

pip install git+https://github.com/huggingface/transformers accelerate

否则分开执行

git clone https://github.com/huggingface/transformers
cd transformers
pip install . accelerate

随后，执行

pip install qwen-vl-utils
pip install torchvision

git clone https://github.com/QwenLM/Qwen2-VL.git
cd Qwen2-VL
pip install -r requirements_web_demo.txt
pip install av # 视频解析

C. 模型测试

C.1 测试代码与注意事项

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # 使用GPU 0 
# ⚠️ 注意事项1: 如果是混合显卡，且中有一块不支持Flash2-Attention，则需要在代码最开始的地方指定可用显卡

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    # "/home/lgk/Downloads/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map = "auto"
    "/home/lgk/Downloads/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map = "balanced_low_0"
)
# ⚠️ 注意事项2: 模型与输入需要选择与开头对应的设备，tokenizer没有要求，这里需要更改device_map = "balanced_low_0"

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "/home/lgk/Downloads/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processer
processor = AutoProcessor.from_pretrained("/home/lgk/Downloads/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("/home/lgk/Downloads/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

C.2 测试

mode=1

(qwen2-vl) (base) lgk@WIN-20240401VAM:~/Projects/transformers$ python -u "/home/lgk/Projects/transformers/test.py"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.58s/it]
['The image depicts a serene beach scene with a woman and her dog. The woman is sitting on the sand, wearing a plaid shirt and black pants, and appears to be smiling. She is holding up her hand in a high-five gesture towards the dog, which is also sitting on the sand. The dog has a harness on, and its front paws are raised in a playful manner. The background shows the ocean with gentle waves, and the sky is clear with a soft glow from the setting or rising sun, casting a warm light over the entire scene. The overall atmosphere is peaceful and joyful.']

mode=2

(qwen2-vl) (base) lgk@WIN-20240401VAM:~/Projects/transformers$ python test.py 
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.03s/it]
['The image depicts a serene beach scene with a woman and her dog. The woman is sitting on the sand, wearing a plaid shirt and black pants, and appears to be smiling. She is holding up her hand in a high-five gesture towards the dog, which is also sitting on the sand. The dog has a harness on, and its front paws are raised in a playful manner. The background shows the ocean with gentle waves, and the sky is clear with a soft glow from the setting or rising sun, casting a warm light over the entire scene. The overall atmosphere is peaceful and joyful, capturing a moment of connection between the']

D. 问题分析

D.1 Flash-Attention版本选择

flash_attn-2.3.5+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl与flash_attn-2.3.5+cu117torch2.0cxx11abiTrue-cp310-cp310-linux_x86_64.whl两个版本有什么区别？

这两个版本的 flash_attn 包的区别在于它们的编译选项，特别是关于 cxx11 ABI（Application Binary Interface）的设置，这影响到包的二进制兼容性。具体来说：

cxx11abiFALSE：

编译时未启用 C++11 ABI（应用程序二进制接口）的选项。
使用了旧的 ABI（即 cxx03 ABI），通常用于与旧版本的系统或编译器（例如 GCC 4.x 系列）兼容。
对于那些需要与旧版本的 C++ 库兼容的系统或环境，这是合适的选择。

cxx11abiTrue：

编译时启用了 C++11 ABI 选项。
使用了新的 C++11 ABI，这通常是在较新的系统上运行的默认选择，兼容性较好，性能也有一定的提升。
适用于较新的编译器（如 GCC 5 及以上）以及那些完全支持 C++11 标准的系统环境。

主要区别：

二进制兼容性：启用 C++11 ABI 的版本（cxx11abiTrue）通常更现代化并且与新编译器及标准库更兼容。而未启用 C++11 ABI 的版本（cxx11abiFALSE）则是为了兼容旧的二进制接口。
性能和特性：使用 cxx11abiTrue 可能会带来性能上的微小提升，因为它使用了较新的 ABI 优化。

选择哪个版本：

如果你的环境中所有相关的 C++ 库和编译器都是较新的版本（支持 C++11 及以上），那么使用 cxx11abiTrue 版本是更合适的选择。
如果你需要兼容老版本的系统或编译器，或者有其他旧的库依赖未启用 C++11 ABI，那么选择 cxx11abiFALSE。

总之，选择时要考虑你系统的编译器版本、与其他库的兼容性以及性能需求。

D.2 如何判断

要判断你的环境中的所有相关 C++ 库和编译器是否支持 C++11 及以上版本，可以按照以下步骤进行检查：

1. 检查编译器版本

对于大多数系统，常用的 C++ 编译器是 GCC 或 Clang。以下是检查编译器版本的方法：

GCC (GNU Compiler Collection):

gcc --version

如果版本是 5.1 或更高，则默认支持 C++11 ABI。

Clang:

clang --version

Clang 3.3 及以上版本支持 C++11，但 Clang 3.5 及以上版本默认启用 C++11 ABI。

2. 检查编译器的默认 ABI 设置

检查你的编译器是否默认启用了 C++11 ABI，可以通过编译一个简单的测试程序并查看 ABI 信息：

编译测试程序:
创建一个简单的 C++ 文件（如 abi_check.cpp）：

#include <iostream>

int main() {
    std::cout << "__GLIBCXX_USE_CXX11_ABI = " << __GLIBCXX_USE_CXX11_ABI << std::endl;
    return 0;
}

编译并运行:

g++ abi_check.cpp -o abi_check
./abi_check

输出结果 __GLIBCXX_USE_CXX11_ABI = 1 表示启用了 C++11 ABI，而 0 表示未启用。

3. 检查系统中已安装的 C++ 库

有些 C++ 库可能也需要支持 C++11 ABI。检查已安装的库是否与 C++11 ABI 兼容：

查看已安装库的版本：可以使用包管理器（如 apt, yum, dnf 等）查看安装的 C++ 库的版本。
例如，查看 libstdc++ 版本：

apt list --installed | grep libstdc++

查看符号信息：对于已安装的库，使用 nm 或 objdump 查看符号信息，确保符号表中的符号与 C++11 ABI 兼容。

4. 检查构建工具链配置

如果你的项目使用 CMake、Makefile 或其他构建系统：

CMake: 确保 CMAKE_CXX_STANDARD 设置为 11 或更高：

set(CMAKE_CXX_STANDARD 11)

Makefile: 在编译选项中加入 -std=c++11 或更高的标准：

CXXFLAGS = -std=c++11

总结

通过以上步骤，你可以确认你的编译器、库和构建工具链是否默认支持并启用了 C++11 及以上的 ABI。如果所有检查结果都表明支持 C++11，那么你可以安全地使用 cxx11abiTrue 版本的包。

E. 参考文献

pip install flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation

上一篇：240831-Gradio之RAG文档对话工具Kotaemon的安装与配置

下一篇：240901-通过端口转发在局域网内访问WLS2中Gradio的Web应用

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯