## Model description

Document topic segmentation is the task of splitting a document into a sequence of contiguous, topically coherent segments. In recent years, deep-learning-based segmentation methods have framed the task as sentence-level binary classification and fine-tuned pretrained language models such as BERT on in-domain data, achieving strong results. However, the time complexity of BERT-style pretrained models is O(N²): as the input sequence grows, inference on long documents becomes slow and memory-intensive. Hierarchical modeling (from characters to sentences to predictions) reduces the time and space cost to some extent, but the complexity remains O(N²) and some accuracy is lost.

We therefore apply PoNet, a model developed by Alibaba DAMO Academy, to English long-document topic segmentation, seeking a balance between accuracy and efficiency. PoNet replaces the self-attention in the Transformer with pooling for context modeling. It combines three pooling networks of different granularities: a global pooling module (GA), a segment max-pooling module (SMP), and a local max-pooling module (LMP), each capturing sequence information at the corresponding granularity. The result is a sequence model with linear complexity O(N).

## Usage

Use the `pipeline` interface shown in the code example below to segment an English long document into topically coherent parts.

## Model limitations and possible biases

The model is trained on Wikipedia-derived data (Wiki727K); performance may degrade on documents from markedly different domains.

## Training data

The model is trained on the Wiki727K dataset.

## Training method

The model is initialized from nlp_ponet_fill-mask_chinese-base and trained on Wiki727K with an initial learning rate of 5e-5, a batch size of 8, and max_seq_length=4096. Focal Loss (γ=2, α=0.75) is used to mitigate the class imbalance between boundary and non-boundary sentences.

## Model effectiveness and efficiency evaluation

On effectiveness, PoNet reaches a Positive F1 of 67.13 on the Wiki727K test set, 98.43% of BERT's score. On efficiency, it processes 30,256 tokens per second, 1.9 times as many as BERT, striking a good balance between the two. Measurement setup: Tesla V100 32G GPU, batch_size=1.

If our model is helpful to you, please cite our paper: PoNet for long-document topic segmentation.
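As a note on the training setup above, Focal Loss (γ=2, α=0.75) down-weights well-classified examples so the rare boundary class dominates the gradient. Below is a minimal illustrative sketch of the binary focal loss on per-sentence boundary probabilities; it is not the actual training code.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (boundary) class.
    y: gold label, 1 for a segment-boundary sentence, 0 otherwise.
    gamma down-weights easy examples; alpha up-weights the rare
    positive class (here alpha=0.75 favors boundaries).
    """
    p_t = p if y == 1 else 1.0 - p          # prob. of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes almost nothing,
# while a confident mistake is penalized heavily.
easy = focal_loss(0.95, 1)   # well-classified boundary
hard = focal_loss(0.05, 1)   # missed boundary
```

With γ=0 and α=0.5 this reduces (up to a constant factor) to ordinary binary cross-entropy; raising γ shifts the loss mass toward hard examples.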
| model | Positive F1 | Pk | WD |
|---|---|---|---|
| Two-Level LSTM | - | 22.13 | - |
| Cross-segment BERT-Base 128-128 | 64.0 | - | - |
| Cross-segment BERT-Large 128-128 | 66.0 | - | - |
| Cross-segment BERT-Large 256-256 | 67.1 | - | - |
| HierBERT | 66.5 | - | - |
| TLT-TS | - | 19.41 | - |
| SeqModel-BERT-Base | 68.2 | - | - |
| PoNet | 67.13 | 19.00 | 20.97 |
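Pk and WD in the table above are penalty metrics (lower is better): a probe window of width k slides over the document and counts positions where reference and hypothesis disagree on whether the two ends of the window fall in the same segment. Below is a minimal sketch of Pk assuming a binary encoding where 1 marks the last sentence of a segment; it is an illustrative implementation, not the official evaluation script.

```python
def pk(ref, hyp, k=None):
    """Pk segmentation error (Beeferman et al., 1999). Lower is
    better; 0.0 means the hypothesis matches the reference exactly.

    ref, hyp: lists of 0/1 where 1 marks the last sentence of a
    segment.
    """
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, round(len(ref) / (sum(ref) + 1) / 2))
    errors = 0
    total = len(ref) - k
    for i in range(total):
        # do positions i and i+k lie in the same segment?
        ref_same = sum(ref[i:i + k]) == 0
        hyp_same = sum(hyp[i:i + k]) == 0
        errors += ref_same != hyp_same
    return errors / total

ref = [0, 0, 1, 0, 0, 1, 0, 0]   # gold: boundaries after sentences 3 and 6
hyp = [0, 1, 0, 0, 0, 1, 0, 0]   # predicted: first boundary one sentence early
error = pk(ref, hyp)
```

WD (WindowDiff) refines Pk by comparing the *number* of boundaries inside the window rather than a same-segment yes/no, so it penalizes near-miss and extra boundaries more evenly.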
| model | max_seq_length | efficiency (tokens/s) |
|---|---|---|
| SeqModel-BERT-Base | 512 | 15885 |
| PoNet | 4096 | 30256 |
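One practical implication of max_seq_length: a model capped at 512 tokens must slide a window over a long document and run several forward passes, while a 4096-token model often covers the same document in one pass. The sketch below illustrates this chunk-count arithmetic; the function name and the non-overlapping stride convention are hypothetical, not part of either model's actual inference code.

```python
def num_chunks(n_tokens, max_seq_length, stride=None):
    """Number of forward passes needed to cover a document of
    n_tokens tokens with windows of max_seq_length, advancing by
    stride tokens (defaults to non-overlapping windows)."""
    stride = stride or max_seq_length
    if n_tokens <= max_seq_length:
        return 1
    # ceil((n_tokens - max_seq_length) / stride) extra windows
    return -(-(n_tokens - max_seq_length) // stride) + 1

doc_len = 3800                       # tokens in a fairly long document
single_pass = num_chunks(doc_len, 4096)   # PoNet-style context window
windowed = num_chunks(doc_len, 512)       # BERT-style context window
```

Fewer passes also means boundaries are scored with full-document context instead of being stitched together across window edges.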
## Code example
```python
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_ponet_document-segmentation_topic-level_english-base',
    model_revision="v1.1.1",
)

doc = ['Actresses (Catalan: Actrius) is a 1997 Catalan language Spanish drama film produced and directed by Ventura Pons and based on the award-winning stage play "E.R."', 'by Josep Maria Benet i Jornet.', 'The film has no male actors, with all roles played by females.', 'The film was produced in 1996.', '"Actrius" screened in 2001 at the Grauman\'s Egyptian Theatre in an American Cinematheque retrospective of the works of its director.', 'The film had first screened at the same location in 1998.', 'It was also shown at the 1997 Stockholm International Film Festival.', 'In "Movie - Film - Review", "Daily Mail" staffer Christopher Tookey wrote that though the actresses were "competent in roles that may have some reference to their own careers", the film "is visually unimaginative, never escapes its stage origins, and is almost totally lacking in revelation or surprising incident".', 'Noting that there were "occasional, refreshing moments of intergenerational bitchiness", they did not "justify comparisons to "All About Eve"", and were "insufficiently different to deserve critical parallels with "Rashomon"".', 'He also wrote that "The Guardian" called the film a "slow, stuffy chamber-piece", and that "The Evening Standard" stated the film\'s "best moments exhibit the bitchy tantrums seething beneath the threesome\'s composed veneers".', 'MRQE wrote "This cinematic adaptation of a theatrical work is true to the original, but does not stray far from a theatrical rendering of the story."']

result = p(documents=doc)
topics = result[OutputKeys.TEXT].split("\n")
print(topics)
```
## Related papers and citation
```bibtex
@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://opereview.et/forum?id=9jID9JjicF},
}
```