
Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning

Rongsheng Wang, Haoming Chen, Ruizhe Zhou, Yaofei Duan, Kunyan Cai, Han Ma, Jiaxi Cui, Jian Li, Patrick Cheong-Iao Pang, Yapeng Wang, Tao Tan☨

☨Corresponding author

Please follow our GitHub: https://github.com/WangRongsheng/Aurora

Overview

Existing research has demonstrated that refining large language models (LLMs) through the utilization of machine-generated instruction-following data empowers these models to exhibit impressive zero-shot capabilities for novel tasks, without requiring human-authored instructions. In this paper, we systematically investigate, preprocess, and integrate three Chinese instruction-following datasets with the aim of enhancing the Chinese conversational capabilities of the Mixtral-8x7B sparse Mixture-of-Experts model. Through instruction fine-tuning on this carefully processed dataset, we successfully construct the Mixtral-8x7B sparse Mixture-of-Experts model named "Aurora." To assess the performance of Aurora, we utilize three widely recognized benchmark tests: C-Eval, MMLU, and CMMLU. Empirical studies validate the effectiveness of instruction fine-tuning applied to the Mixtral-8x7B sparse Mixture-of-Experts model. This work is pioneering in the execution of instruction fine-tuning on a sparse expert-mixed model, marking a significant breakthrough in enhancing the capabilities of this model architecture.

Evaluation

It is known that LLM evaluation remains a significant challenge. We use three public benchmarks in our study.

The table below gives reference figures for GPU memory usage during the training and inference stages. Please note that we did all inference and training on a single GPU.

|Stage|GPU Memory Usage|
|:-|:-|
|Training|~43 GiB|
|Inference|~25 GiB|
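As a rough sanity check on the inference figure, the weight footprint of a quantized model can be estimated from its parameter count. The sketch below assumes about 46.7 billion total parameters for Mixtral-8x7B (a figure not stated in this README) and ignores activations, the KV cache, quantization metadata, and the LoRA weights, so it gives a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30

# Assumption: ~46.7e9 total parameters (not taken from this README).
MIXTRAL_PARAMS = 46.7e9

four_bit = weight_memory_gib(MIXTRAL_PARAMS, 4)
sixteen_bit = weight_memory_gib(MIXTRAL_PARAMS, 16)
print(f"4-bit weights:  {four_bit:.1f} GiB")
print(f"16-bit weights: {sixteen_bit:.1f} GiB")
```

Under these assumptions the 4-bit weights alone come to a little under 22 GiB, which is consistent with the ~25 GiB reported for inference once runtime overhead is added.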

Quick-Use

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread
from peft import PeftModel
import time

model_name_or_path = "mistralai/Mixtral-8x7B-Instruct-v0.1" # download weights from https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
lora_weights = "wangrongsheng/Aurora-Mixtral-8x7B" # download weights from https://modelscope.cn/models/wangrongsheng/Aurora-Mixtral-8x7B

# Load the base model in 4-bit, then attach the Aurora LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model0 = AutoModelForCausalLM.from_pretrained(model_name_or_path, load_in_4bit=True, device_map="auto", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(
    model0,
    lora_weights,
)

class StopOnTokens(StoppingCriteria):
    # Stop generation as soon as the last generated token is one of stop_ids.
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [0,]
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

def convert_history_to_text(history):
    # Earlier turns are wrapped in <s> ... </s>; the last turn is left open for the model.
    text = ""
    if len(history) > 1:
        text = "<s> " + "".join(
                [
                    "".join(
                        [
                            f"[INST]{item[0]}[/INST] {item[1]} ",
                        ]
                    )
                    for item in history[:-1]
                ]
            ) + "</s> "
    text += "".join(
        [
            "".join(
                [
                    f"[INST]{history[-1][0]}[/INST]",
                ]
            )
        ]
    )
    return text

def predict(message, history):

    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()

    messages = convert_history_to_text(history_transformer_format)

    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")
    streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=4096,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        num_beams=1,
        pad_token_id=tokenizer.eos_token_id,
        stopping_criteria=StoppingCriteriaList([stop])
        )
    # Run generation in a background thread so tokens can be streamed to the UI.
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    partial_message = ""
    t1 = time.time()
    count = 0
    for new_token in streamer:
        if new_token != '<':
            partial_message += new_token
            count += 1
            yield partial_message
    t2 = time.time()
    speed = count/(t2-t1)
    print("inference speed: %f tok/s" % speed)


gr.ChatInterface(predict,chatbot=gr.Chatbot(height=600,),title="MoE").queue().launch()
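To make the prompt layout produced by the history-to-text helper in the script above concrete, here is a minimal standalone sketch that reproduces the same string-building logic without loading any model. The name `build_prompt` and the sample conversation are illustrative and not part of the repository.

```python
def build_prompt(history):
    """Mirror the Quick-Use prompt layout: closed earlier turns, then the open turn."""
    text = ""
    if len(history) > 1:
        past = "".join(f"[INST]{user}[/INST] {bot} " for user, bot in history[:-1])
        text = "<s> " + past + "</s> "
    # The last turn has no answer yet; the model continues after [/INST].
    text += f"[INST]{history[-1][0]}[/INST]"
    return text

# Hypothetical two-turn conversation; the second answer is still empty.
history = [["Hi", "Hello"], ["How are you?", ""]]
prompt = build_prompt(history)
print(prompt)
```

With a single-turn history the `<s> ... </s>` wrapper is skipped entirely and only the open `[INST]...[/INST]` segment is produced.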

Easy-to-Use

1. Clone and Set up

git clone https://github.com/WangRongsheng/Aurora.git
cd Aurora
pip install -r requirements.txt

2. Download Model

The full model parameters are large and inconvenient to manage, so we provide LoRA weights instead. They are merged with the base model before inference, so you don't have to do anything extra.
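For intuition about what merging LoRA weights with a base model means, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to a tiny matrix in plain Python. The matrix values, rank, and alpha are made up for illustration; in this project the merge is handled by the existing tooling, not by code like this.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Hypothetical 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]        # shape (2, rank)
A = [[0.5, -0.5]]         # shape (rank, 2)
merged = merge_lora(W, B, A, alpha=2, rank=1)
print(merged)
```

Because only the small B and A matrices need to be distributed, the adapter download stays far smaller than the full model while the merged weight has the same shape as the original.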

3. Inference

Web:

CUDA_VISIBLE_DEVICES=0 python src/web_demo.py \
    --model_name_or_path ./Mixtral-8x7B-Instruct-v0.1 \
    --checkpoint_dir Aurora \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template mistral

Then you can visit: http://127.0.0.1:7860/

CLI:

CUDA_VISIBLE_DEVICES=0 python src/cli_demo.py \
    --model_name_or_path ./Mixtral-8x7B-Instruct-v0.1 \
    --checkpoint_dir Aurora \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template mistral

API:

CUDA_VISIBLE_DEVICES=0 python src/api_demo.py \
    --model_name_or_path ./Mixtral-8x7B-Instruct-v0.1 \
    --checkpoint_dir Aurora \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template mistral

If you need to load the weights of a specific checkpoint, you can set it like this: --checkpoint_dir Aurora/checkpoint-5000.

Train

If you have a single GPU with more than 48GB of memory, you can train your own models.

Train your MoE model

CUDA_VISIBLE_DEVICES=5 python src/train_bash.py \
    --stage sft \
    --model_name_or_path ./Mixtral-8x7B-Instruct-v0.1 \
    --do_train \
    --dataset alpaca_zh,alpaca_gpt4_zh,sharegpt \
    --finetuning_type lora \
    --quantization_bit 4 \
    --overwrite_cache \
    --output_dir output/ \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 100 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --template mistral \
    --lora_target q_proj,v_proj

--quantization_bit 4 means you will use QLoRA. If your GPU has more memory, you can remove this flag and use plain LoRA instead.
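To see how the training settings above interact, the following sketch computes the effective batch size implied by a per-device batch size of 2 with 4 gradient accumulation steps on one GPU, and evaluates a plain cosine decay from the 5e-5 peak learning rate. The dataset size is a hypothetical placeholder, and the schedule assumes no warmup, which may differ from what the training framework actually does.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

batch = effective_batch_size(per_device=2, grad_accum=4)
num_examples = 100_000   # hypothetical dataset size, not from the source
epochs = 3
total_steps = math.ceil(num_examples / batch) * epochs

print("effective batch size:", batch)
print("optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, 5e-5):.2e}")
```

Gradient accumulation trades wall-clock time for memory: each optimizer step sees 8 samples, but only 2 are resident on the GPU at a time.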

Evaluate your MoE model

# --task accepts cmmlu, mmlu, or ceval; --lang accepts zh or en
CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
    --model_name_or_path ./Mixtral-8x7B-Instruct-v0.1 \
    --checkpoint_dir Aurora/checkpoint-5000 \
    --finetuning_type lora \
    --quantization_bit 4 \
    --template mistral \
    --task cmmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 8
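As a rough illustration of what a few-shot multiple-choice evaluation like the 5-shot setting in the command above involves, the sketch below builds a prompt from solved examples and scores predictions by exact match on the answer letter. The prompt wording, the sample questions, and the function names are all hypothetical; src/evaluate.py may format prompts and compute scores differently.

```python
def format_example(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank for the test item."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(shots, test_item, n_shot):
    """Prepend the first n_shot solved examples to the unsolved test item."""
    parts = [format_example(q, c, a) for q, c, a in shots[:n_shot]]
    parts.append(format_example(test_item[0], test_item[1]))
    return "\n\n".join(parts)

def accuracy(predictions, references):
    """Fraction of predicted letters that match the reference letters."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical solved examples and test item.
shots = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    ("Capital of France?", ["Paris", "Rome", "Berlin", "Madrid"], "A"),
]
test_item = ("3 * 3 = ?", ["6", "8", "9", "12"])
prompt = build_few_shot_prompt(shots, test_item, n_shot=2)
print(prompt)
print(accuracy(["C", "A", "B"], ["C", "B", "B"]))
```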

Acknowledgments

This work was mainly done by the Faculty of Applied Sciences of the Macao Polytechnic University. The computational resources used in this work were obtained from AWS servers. The fine-tuning framework we used is LLaMA-Factory, which brought a lot of convenience to our work. We also thank the open source community for its public datasets, such as shareAI, stanford_alpaca and GPT-4-LLM. Most importantly, we are very grateful to Mistral AI, who are leading a new technology boom that will dramatically change the future of technology development.

Citation

If you find our work helpful, feel free to cite us.

@misc{wang2023auroraactivating,
      title={Aurora:Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning}, 
      author={Rongsheng Wang and Haoming Chen and Ruizhe Zhou and Yaofei Duan and Kunyan Cai and Han Ma and Jiaxi Cui and Jian Li and Patrick Cheong-Iao Pang and Yapeng Wang and Tao Tan},
      year={2023},
      eprint={2312.14557},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Please follow the Apache 2.0 License.
