BLSP: Large-Scale Speech-Language Model (7B)

Category: AI / PyTorch
Open-source repository: https://modelscope.cn/models/iic/blsp_lslm_7b
License: Apache License 2.0

Model Details

BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, Jiajun Zhang

Institute of Automation, Chinese Academy of Sciences

Machine Intelligence Technology Lab, Alibaba DAMO Academy

Model Introduction

BLSP is a large-scale speech-language model that understands both speech and text and supports cross-modal interaction between the two. It can be applied to spoken dialogue/question answering, speech recognition, speech translation, and speech emotion analysis, and can automatically generate high-quality multilingual text, facilitating cross-modal and cross-lingual communication.

Introduction

  • BLSP extends the language capabilities of LLMs to speech, enabling interaction with LLMs through spoken language.
  • We learn the BLSP model via behavior alignment of continuation writing: ideally, an LLM should exhibit the same behavior regardless of the input modality, whether a speech segment or its transcript.
    • The first step uses the LLM to generate text continuations with the transcript as prefix.
    • The second step uses these text continuations as supervised signals to train the modality adapter, by requiring the LLM to predict the same continuations given the speech segment (see the sketch below).
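Below is a minimal, illustrative PyTorch sketch of that second step (not the released BLSP training code). The ModalityAdapter module, the 768-dimensional speech features (e.g., a Whisper-style encoder), and the 4096-dimensional LLM embeddings (e.g., Llama2-7B) are assumptions for illustration; the actual modules and shapes live in the repository code.

# Illustrative sketch only -- not the released BLSP training code.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Hypothetical adapter: speech-encoder features -> LLM embedding space."""
    def __init__(self, speech_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):           # (batch, frames, speech_dim)
        return self.proj(speech_feats)         # (batch, frames, llm_dim)

# Toy tensors standing in for real data.
speech_feats = torch.randn(2, 50, 768)                # speech encoder output
continuation_ids = torch.randint(0, 32000, (2, 20))   # step-1 text continuation tokens (training target)

adapter = ModalityAdapter()
speech_embeds = adapter(speech_feats)                 # prepended to the LLM input
print(speech_embeds.shape)                            # torch.Size([2, 50, 4096])
# Per the description above, only the adapter is learned in this step: the LLM's
# cross-entropy loss is computed on the continuation tokens, so the adapter
# learns to make speech inputs elicit the same continuation behavior as
# their transcripts.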

(Figure: model architecture)

Examples

(English demo example)

(Chinese demo example)

More examples with video presentations can be found on the project page.

Usage

Environment Preparation

All experiments are carried out in the following environment.

  • Python==3.8
  • torch==1.13, torchaudio==0.13.0, cuda==11.6
  • transformers==4.31.0
  • soundfile==0.12.1, openai-whisper
  • datasets==2.12.0, accelerate==0.21.0, deepspeed==0.9.3
  • evaluate==0.4.0, sentencepiece==0.1.99
  • fire==0.5.0, gradio==3.41.2

Prepare the pretrained BLSP checkpoint

Download the pretrained BLSP model link

Inference & Evaluation

We release the inference code for evaluation.

The supported input file format is .jsonl. Here is an example for the ST (speech translation) task; each line of the input file looks like

{"audio": "/home/data/eval/1.wav"}

Then run the generation code

python3 blsp/generate.py \
    --input_file "test.jsonl" \
    --output_file "test_out.jsonl" \
    --blsp_model $blsp_path \
    --instruction "Please translate the following audio into German text."

Launching Demo Locally

You can try out our demo by

export CUDA_VISIBLE_DEVICES=0
python blsp/chat_demo.py \
    --blsp_model $blsp_path

Training from Scratch

The training of BLSP consists of two stages.

Stage 1: Fine-tune the LLM with text instruction data

  1. Download the text instruction data Alpaca-52k to ~/data/alpaca_data.json and run the processing script:
mkdir -p ~/data/stage1
python data_process/prepare_alpaca.py \
    --input_file ~/data/alpaca_data.json \
    --output_file ~/data/stage1/train_alpaca.jsonl
  2. Obtain the Llama2-7B model and place it in ~/pretrained_models/llama2-7b-hf, then run the training script to perform text instruction tuning:
export llama_path=~/pretrained_models/llama2-7b-hf
export DATA_ROOT=~/data/stage1
export SAVE_ROOT=~/checkpoints/stage1
bash blsp/scripts/train_stage1_ddp.sh

This step takes about 2 hours on 8 A100 GPUs.

Stage 2: Align speech and text via behavior alignment of continuation writing

  1. Download the ASR datasets GigaSpeech, LibriSpeech, and Common Voice 2.0 to ~/data/gigaspeech, ~/data/librispeech, and ~/data/common_voice respectively, then run the processing scripts:
mkdir -p ~/data/stage2
python data_process/prepare_gigaspeech.py \
    --input_dir ~/data/gigaspeech \
    --output_file ~/data/stage2/train_gigaspeech.jsonl
python data_process/prepare_librispeech.py \
    --input_dir ~/data/librispeech \
    --output_file ~/data/stage2/train_librispeech.jsonl
python data_process/prepare_common_voice.py \
    --input_dir ~/data/common_voice \
    --output_file ~/data/stage2/train_common_voice.jsonl
  2. Use the stage-1 model to generate continuations from the transcripts of the ASR data. Here we take the GigaSpeech dataset as an example, running one sharded process per GPU (a Python launcher sketch follows these commands).
mkdir -p ~/data/stage2/labels

export CUDA_VISIBLE_DEVICES=0
python3 -u data_process/asr_text_generation.py continue_writing \
    --llm_path ~/checkpoints/stage1 \
    --manifest ~/data/stage2/train_gigaspeech.jsonl \
    --lab_dir ~/data/stage2/labels \
    --nshard 8 \
    --rank 0 &
# ... repeat for ranks 1 through 6, each on its own GPU (CUDA_VISIBLE_DEVICES=1..6) ...
export CUDA_VISIBLE_DEVICES=7
python3 -u data_process/asr_text_generation.py continue_writing \
    --llm_path ~/checkpoints/stage1 \
    --manifest ~/data/stage2/train_gigaspeech.jsonl \
    --lab_dir ~/data/stage2/labels \
    --nshard 8 \
    --rank 7 &
  3. (Optional) We recommend processing the data offline and saving it to disk. Preprocessing will filter out speech-text pairs that cannot be loaded by soundfile.
python blsp/src/speech_text_paired_dataset.py offline \
    --dataroot ~/data/stage2/labels \
    --manifest_files *.jsonl \
    --lm_path ~/checkpoints/stage1 \
    --instruction "Continue the following text in a coherent and engaging style with less than 40 words." \
    --num_proc 32
  4. Download whisper-small to ~/pretrained_models/whisper-small, then run the training script:
export llama_path=~/checkpoints/stage1
export whisper_path=~/pretrained_models/whisper-small
export DATA_ROOT=~/data/stage2/labels
export SAVE_ROOT=~/checkpoints/stage2
bash blsp/scripts/train_stage2_ddp.sh

The training process takes about 2.5 days on 8 A100 GPUs.

License

  • The license of our project is Apache License 2.0.
  • Our models are based on Llama2 and Whisper. If you use our models, please comply with the MIT License of Whisper and the license of LLaMA-2.