Introduction to the PoNet Fill-Mask (Cloze) Model, Chinese, Base
This model uses the PoNet architecture and was pre-trained on Chinese Wikipedia data with the Masked Language Modeling (MLM) and Sentence Structural Objective (SSO) pre-training tasks. It can be used for cloze (fill-mask) tasks, and can also serve as an initialization model for finetuning on downstream natural language understanding tasks.
Model description
PoNet is a model with linear computational complexity (O(N)). It replaces the self-attention in the Transformer with pooling networks to mix the tokens of a sentence, using pooling at the local, segment, and global granularities to capture contextual information.
Its structure is shown in the figure below.
(Figure: PoNet model architecture)
Experiments show that PoNet outperforms the Transformer by 2.28 points in accuracy on the Long Range Arena (LRA) benchmark for long texts, runs 9 times faster than the Transformer on GPU, and uses only 1/10 of its GPU memory. The experiments also demonstrate PoNet's transfer-learning ability: PoNet-Base reaches 95.7% of BERT-Base's accuracy on the GLUE benchmark.
For details, see the paper PoNet: Pooling Network for Efficient Token Mixing in Long Sequences.
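To make the idea concrete, below is a minimal, illustrative PyTorch sketch of token mixing by pooling at the local, segment, and global granularities. It is not the official PoNet implementation: the module name, the per-granularity projections, and the simple additive fusion are assumptions for illustration only.

```python
# Minimal sketch (NOT the official implementation) of multi-granularity pooling
# as a linear-complexity replacement for self-attention.
import torch
import torch.nn as nn


class MultiGranularityPooling(nn.Module):
    def __init__(self, hidden_size: int, local_window: int = 3):
        super().__init__()
        self.local_window = local_window
        # Separate projections per granularity are an assumption for illustration.
        self.proj_global = nn.Linear(hidden_size, hidden_size)
        self.proj_segment = nn.Linear(hidden_size, hidden_size)
        self.proj_local = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size); segment_ids: (batch, seq_len)
        # Global pooling: one max-pooled vector per sequence, broadcast to every token.
        global_ctx = self.proj_global(hidden).max(dim=1, keepdim=True).values

        # Segment pooling: max over tokens that share a segment id (e.g. a sentence).
        seg = self.proj_segment(hidden)
        seg_ctx = torch.zeros_like(seg)
        for s in segment_ids.unique():
            mask = (segment_ids == s).unsqueeze(-1)
            pooled = seg.masked_fill(~mask, float("-inf")).max(dim=1, keepdim=True).values
            seg_ctx = torch.where(mask, pooled.expand_as(seg), seg_ctx)

        # Local pooling: sliding-window max pooling over neighbouring tokens.
        loc = self.proj_local(hidden).transpose(1, 2)  # (batch, hidden, seq_len)
        loc_ctx = nn.functional.max_pool1d(
            loc, kernel_size=self.local_window, stride=1,
            padding=self.local_window // 2).transpose(1, 2)

        # Fuse the three context granularities back into each token representation
        # (simple addition here; the real model uses a more elaborate pooling fusion).
        return hidden + global_ctx + seg_ctx + loc_ctx


# Example: mix a batch of 2 sequences of length 8, each with two segments.
layer = MultiGranularityPooling(hidden_size=16)
x = torch.randn(2, 8, 16)
segs = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]] * 2)
out = layer(x, segs)  # (2, 8, 16)
```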
Expected model usage and scope of application
This model is mainly used to generate cloze (fill-mask) results. Users can try various input documents on their own. Please refer to the code example for the specific calling method.
How to use
After installing ModelScope-lib, you can use the nlp_ponet_fill-mask_chinese-base model.
Code example
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build a fill-mask pipeline with the PoNet Chinese base model and run it on a masked sentence.
pipeline_ins = pipeline(Tasks.fill_mask, model='damo/nlp_ponet_fill-mask_chinese-base')
input = "人民文学出版社[MASK]1952[MASK],出版《[MASK][MASK]演义》、《[MASK]游记》、《水浒传》、《[MASK]楼梦》,合为“[MASK]大名著”。"
print(pipeline_ins(input))
Model limitations and possible bias
- The training data is limited, so the results may show some bias.
- The current version has been tested with PyTorch 1.11 and PyTorch 1.12; usability in other environments has not yet been tested.
Training data
The data comes from https://dumps.wikimedia.org/
Model training
The model is trained with the MLM and SSO tasks on unsupervised Chinese Wikipedia data.
Preprocessing
The training data is preprocessed as follows. For the MLM task, the masking probability is set to 15%: 80% of the masked positions are replaced with [MASK], 10% are replaced with randomly sampled words, and the remaining 10% are left unchanged. For the SSO task, a sequence containing several paragraphs is truncated into two subsequences at a random position; with 1/3 probability one of the subsequences is replaced with another randomly selected subsequence, with 1/3 probability the two subsequences are swapped, and with 1/3 probability they are left unchanged. The MLM masking rule is sketched in the code below.
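As a concrete illustration of the masking rule above, the sketch below applies the 15% / 80% / 10% / 10% scheme to a list of token ids. The mask token id and vocabulary size are hypothetical placeholders, not the values used by this model's tokenizer.

```python
# Illustrative sketch of the 15% / 80% / 10% / 10% MLM masking rule.
# mask_token_id and vocab_size are hypothetical placeholders.
import random


def apply_mlm_masking(token_ids, mask_token_id=103, vocab_size=21128, mask_prob=0.15):
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)              # predict the original token at this position
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token_id   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
        else:
            labels.append(-100)             # position ignored by the MLM loss
    return inputs, labels
```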
Training details
Training on Chinese Wikipedia uses the Adam optimizer with an initial learning rate of 1e-4 and a batch size of 384.
Data evaluation and results
After finetuning on the downstream tasks, the development set results on CAIL and CLUE are as follows:
| Dataset  | CAIL  | AFQMC | CMNLI | CSL   | IFLYTEK | OCNLI | TNEWS | WSC   |
|----------|-------|-------|-------|-------|---------|-------|-------|-------|
| Accuracy | 61.93 | 70.25 | 72.9  | 72.97 | 58.21   | 68.14 | 55.04 | 64.47 |
The development set results on the downstream MUG tasks of Topic Segmentation and Topic-level and Session-level Extractive Summarization are as follows:
| Task             | Ave. R1 | Ave. R2 | Ave. RL | Max R1 | Max R2 | Max RL |
|------------------|---------|---------|---------|--------|--------|--------|
| Session-Level ES | 57.08   | 29.90   | 38.36   | 62.20  | 37.34  | 46.98  |
| Topic-Level ES   | 52.86   | 35.80   | 46.09   | 66.67  | 54.05  | 63.14  |
More details: https://github.com/alibaba-damo-academy/SpokenNLP
Related work and citation information
If our model is helpful to you, please cite our paper:
@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://openreview.net/forum?id=9jID9JjicF},
}
Introduction
This model uses the PoNet structure, which is pre-trained on Chinese Wikipedia data through the Masked Language Modeling (MLM) and Sentence Structural Objective (SSO) pre-training tasks. It can be used for cloze tasks, and can also be used as an initialization model for downstream natural language understanding tasks.
Model description
PoNet is a pooling network with linear complexity (O(N)), which uses pooling networks instead of self-attention in the Transformer model to mix tokens.
It uses multi-granularity pooling and pooling fusion to capture different levels of contextual information and combine their interactions with tokens.
The structure is shown in the figure below.
(Figure: PoNet model architecture)
Expected model usage and scope of application
This model is mainly used to generate cloze results. Users can try various input documents by themselves. Please refer to the code example for the specific calling method.
How to use
After installing ModelScope-lib, you can use the nlp_ponet_fill-mask_chinese-base model.
Code example
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build a fill-mask pipeline with the PoNet Chinese base model and run it on a masked sentence.
pipeline_ins = pipeline(Tasks.fill_mask, model='damo/nlp_ponet_fill-mask_chinese-base')
input = "人民文学出版社[MASK]1952[MASK],出版《[MASK][MASK]演义》、《[MASK]游记》、《水浒传》、《[MASK]楼梦》,合为“[MASK]大名著”。"
print(pipeline_ins(input))
Model limitations and possible bias
- The model training data is limited, and the results may show some bias.
- The current version has been tested with PyTorch 1.11 and PyTorch 1.12; usability in other environments has not yet been tested.
Training data
The data comes from https://dumps.wikimedia.org/
Model training
The model is trained with the MLM and SSO tasks on unsupervised Chinese Wikipedia data.
Preprocessing
For the training data, the following preprocessing is used. For the MLM task, the masking probability is set to 15%: 80% of the masked positions are replaced by [MASK], 10% are replaced by randomly sampled words, and the remaining 10% are unchanged.
For the SSO task, a long sequence containing several paragraphs is truncated into two subsequences at a random position, with 1/3 probability of replacing one of the subsequences with another randomly selected subsequence, 1/3 probability of swapping the two subsequences, and 1/3 probability of leaving them unchanged. These three cases are assigned three different labels for the ternary classification, as in the sketch below.
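A minimal sketch of the SSO example construction described above, under the assumption that the input is already split into paragraph strings; the helper name and the label ids are illustrative only.

```python
# Illustrative sketch of Sentence Structural Objective (SSO) example construction:
# a sequence is cut at a random position into two subsequences, and with equal
# probability one of them is replaced, the two are swapped, or they are kept as-is.
import random


def build_sso_example(paragraphs, random_paragraphs):
    cut = random.randint(1, len(paragraphs) - 1)
    seq_a, seq_b = paragraphs[:cut], paragraphs[cut:]
    r = random.random()
    if r < 1 / 3:
        # Replace one subsequence with a randomly selected one from another document.
        if random.random() < 0.5:
            seq_a = random_paragraphs
        else:
            seq_b = random_paragraphs
        label = 0   # "replaced"
    elif r < 2 / 3:
        seq_a, seq_b = seq_b, seq_a
        label = 1   # "swapped"
    else:
        label = 2   # "unchanged" (original order kept)
    return seq_a, seq_b, label
```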
Training details
Training on Chinese Wikipedia uses the Adam optimizer with an initial learning rate of 1e-4 and a batch size of 384.
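For reference, a short sketch of the stated optimizer settings; the stand-in module, and any warmup or weight-decay settings, are assumptions not stated here.

```python
# Sketch of the reported pre-training setup: Adam, initial lr 1e-4, batch size 384.
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder; the real model is the PoNet encoder with MLM and SSO heads
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 1e-4
batch_size = 384  # reported pre-training batch size
```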
Data evaluation and results
After finetuning on the downstream tasks, the development set results on CAIL and CLUE are as follows:
| Dataset  | CAIL  | AFQMC | CMNLI | CSL   | IFLYTEK | OCNLI | TNEWS | WSC   |
|----------|-------|-------|-------|-------|---------|-------|-------|-------|
| Accuracy | 61.93 | 70.25 | 72.9  | 72.97 | 58.21   | 68.14 | 55.04 | 64.47 |
The development set results on the downstream MUG tasks of Topic Segmentation and Topic-level and Session-level Extractive Summarization are as follows:
| Task             | Ave. R1 | Ave. R2 | Ave. RL | Max R1 | Max R2 | Max RL |
|------------------|---------|---------|---------|--------|--------|--------|
| Session-Level ES | 57.08   | 29.90   | 38.36   | 62.20  | 37.34  | 46.98  |
| Topic-Level ES   | 52.86   | 35.80   | 46.09   | 66.67  | 54.05  | 63.14  |
More details: https://github.com/alibaba-damo-academy/SpokenNLP
Related work and citation information
If our model is helpful to you, please cite our paper:
@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://openreview.net/forum?id=9jID9JjicF},
}