Model Introduction

PolyLM is a polyglot large language model covering 18 languages, including Chinese, English, Spanish, French, German, Russian, Portuguese, Italian, Arabic, Japanese, Korean, Thai, Vietnamese, and Indonesian. It can be applied to conversational question answering, text generation, machine translation, sentiment analysis, and related tasks, automatically producing high-quality multilingual text and thereby facilitating cross-lingual and cross-cultural communication.

Abstract in English

Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the multilingual instruction data, are available on GitHub and Hugging Face.
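The curriculum strategy above amounts to a staged re-weighting of the sampling distribution over languages during pre-training. The following is a minimal sketch of such a schedule, not PolyLM's actual training code: only the 30% and 60% non-English proportions come from the abstract, while the stage boundary and the sampling helper are hypothetical.

import random

# Hypothetical two-stage curriculum over a 640B-token budget. Only the
# 0.30 and 0.60 non-English ratios are taken from the abstract; the
# 480B-token stage boundary is made up for illustration.
CURRICULUM = [
    {"until_tokens": 480e9, "non_english_ratio": 0.30},  # first stage
    {"until_tokens": 640e9, "non_english_ratio": 0.60},  # final stage
]

def non_english_ratio(tokens_seen: float) -> float:
    """Target share of non-English data at this point in training."""
    for stage in CURRICULUM:
        if tokens_seen < stage["until_tokens"]:
            return stage["non_english_ratio"]
    return CURRICULUM[-1]["non_english_ratio"]

def pick_pool(tokens_seen: float) -> str:
    """Decide whether the next document is drawn from the English
    or the non-English data pool."""
    ratio = non_english_ratio(tokens_seen)
    return "non_english" if random.random() < ratio else "english"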
Model Versions

This project provides a series of models of different sizes and purposes, with parameter scales of 1.7B and 13B (the current model is the 13B version), covering both the pre-trained base models and the instruction-tuned Chat versions (the MultiAlpaca series). All versions are listed in the table below:
| Model | Precision | Layers | Heads | Hidden | Max_length | LR | Batch | Type |
|---|---|---|---|---|---|---|---|---|
| PolyLM-1.7B | bfloat16 | 24 | 16 | 2048 | 2048 | 1.0e-4 | 4M | Pretrain Model |
| PolyLM-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Pretrain Model |
| PolyLM-MultiAlpaca-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Chat Model |
| PolyLM-Assistant-13B | bfloat16 | 40 | 40 | 5120 | 2048 | 6.0e-5 | 4M | Chat Model |
Training Data

This model is built on the PolyLM-13B pre-trained base model and instruction-fine-tuned on the following data:
| Name | Size | Construction | Remarks |
|---|---|---|---|
| code_alpaca | 28 | GPT-3.5 self-instruct | For correct rendering, the code was format-filtered: at least one side of the input/output must contain a matched pair of ``` markers |
| dolly | 15,011 | Human-written | |
| flan_v2 | 100,000 | Various NLP and CoT tasks | Sampled from flan_v2, as the full dataset is very large |
| gpt4_alpaca (English) | 52,002 | GPT-4 self-instruct | |
| gpt4_alpaca (Chinese) | 48,818 | GPT-4 self-instruct | |
| multilingual_alpaca | 132,701 | GPT-3.5 self-instruct | |
| open_assistant | 55,668 | Human-written | |
| share_gpt | 140,591 | ChatGPT conversation logs | |
| gpteacher_codegen | 4,535 | | |
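The multilingual_alpaca corpus above is the 132.7K-instruction dataset produced by the multilingual self-instruct method mentioned in the abstract. As a rough sketch of how such pipelines are commonly wired (seed instructions prompt an LLM to propose new ones; near-duplicates are filtered out), the snippet below assumes a hypothetical complete() wrapper around a GPT-3.5-style API; it is not the authors' actual generation code.

import random
from difflib import SequenceMatcher

def complete(prompt: str) -> str:
    """Hypothetical wrapper around a GPT-3.5-style completion API."""
    raise NotImplementedError("plug in your own API client here")

def propose_instruction(seed_tasks: list[str], language: str, n_demos: int = 4) -> str:
    """Ask the model for one new instruction in the target language,
    conditioned on a few in-context demonstrations."""
    demos = "\n".join(f"- {t}" for t in random.sample(seed_tasks, min(n_demos, len(seed_tasks))))
    prompt = (
        f"Here are example task instructions in {language}:\n{demos}\n"
        f"Write one new, different task instruction in {language}:\n- "
    )
    return complete(prompt).strip()

def is_near_duplicate(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Cheap lexical-similarity filter; real pipelines often use ROUGE-L."""
    return any(SequenceMatcher(None, candidate, t).ratio() > threshold for t in pool)

def grow_pool(seed_tasks: list[str], language: str, target_size: int) -> list[str]:
    """Iteratively expand the instruction pool, keeping only novel candidates."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        candidate = propose_instruction(pool, language)
        if candidate and not is_near_duplicate(candidate, pool):
            pool.append(candidate)
    return pool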
Model Download
git lfs install
git clone https://www.modelscope.cn/damo/nlp_polylm_assistant_13b_text_generation.git
Model Usage
# Install ModelScope from source if it is not already available:
# git clone https://github.com/modelscope/modelscope
# cd modelscope
# pip install .
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope import snapshot_download

# Download the model weights for the given release.
polylm_13b_model_id = 'damo/nlp_polylm_assistant_13b_text_generation'
revision = 'v1.0.0'
model_dir = snapshot_download(polylm_13b_model_id, revision)

# Wrap the query in the <|user|>/<|assistant|> chat template.
input_text = "Beijing is the capital of China.\nTranslate this sentence from English to Chinese."
input_text = "<|user|>\n" + f"{input_text}\n" + "<|assistant|>\n"

# Deterministic beam-search decoding.
kwargs = {"do_sample": False, "num_beams": 4, "max_new_tokens": 128, "early_stopping": True, "eos_token_id": 2}

pipeline_ins = pipeline(Tasks.text_generation, model=model_dir)
result = pipeline_ins(input_text, **kwargs)
print(result['text'])
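The example above hard-codes the single-turn <|user|>/<|assistant|> template. If you need multi-turn prompts, a small helper like the one below keeps the formatting in one place; note that the multi-turn layout is an assumption extrapolated from the single-turn template shown here, not taken from official documentation.

def build_prompt(turns: list[tuple[str, str]], next_user_input: str) -> str:
    """Format a conversation for PolyLM-Assistant-13B.

    `turns` holds (user, assistant) pairs from earlier in the dialogue.
    NOTE: the multi-turn layout is extrapolated from the single-turn
    template above and may differ from the model's training format.
    """
    prompt = ""
    for user, assistant in turns:
        prompt += f"<|user|>\n{user}\n<|assistant|>\n{assistant}\n"
    prompt += f"<|user|>\n{next_user_input}\n<|assistant|>\n"
    return prompt

# Single-turn usage, equivalent to the manual concatenation above:
input_text = build_prompt([], "Beijing is the capital of China.\nTranslate this sentence from English to Chinese.")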
Citation

If you find this model helpful, please consider citing the related paper:
@misc{wei2023polylm,
    title={PolyLM: An Open Source Polyglot Large Language Model},
    author={Xiangpeng Wei and Haoran Wei and Huan Lin and Tianhao Li and Pei Zhang and Xingzhang Ren and Mei Li and Yu Wan and Zhiwei Cao and Binbin Xie and Tianxiang Hu and Shangjie Li and Binyuan Hui and Bowen Yu and Dayiheng Liu and Baosong Yang and Fei Huang and Jun Xie},
    year={2023},
    eprint={2307.06018},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}