## Model description

Document topic segmentation is the task of splitting a document into a sequence of contiguous, topically coherent segments. In recent years, deep-learning-based segmentation methods have framed the task as sentence-level binary classification and fine-tuned pretrained language models such as BERT on in-domain data, achieving strong results. However, the time complexity of BERT-style pretrained models is O(N²): as the input sequence grows, inference on long documents becomes slow and memory-intensive. Hierarchical modeling (from characters to sentences to predictions) reduces the time and space cost to some extent, but the complexity remains O(N²) and some accuracy is lost.

We therefore apply PoNet, a model developed by Alibaba DAMO Academy, to English long-document topic segmentation, seeking a balance between accuracy and efficiency. PoNet replaces the self-attention in the Transformer with pooling for context modeling. It combines three pooling networks of different granularities: a global pooling module (GA), a segment max-pooling module (SMP), and a local max-pooling module (LMP), each capturing sequence information at the corresponding granularity. The result is a sequence model with linear complexity O(N).

## Usage

Use the `pipeline` interface shown in the code example below to segment an English long document into topically coherent parts.

## Model limitations and possible biases

The model is trained on Wikipedia-derived data (Wiki727K); performance may degrade on documents from markedly different domains.

## Training data

The model is trained on the Wiki727K dataset.

## Training method

The model is initialized from nlp_ponet_fill-mask_chinese-base and trained on Wiki727K with an initial learning rate of 5e-5, a batch size of 8, and max_seq_length=4096. Focal Loss (γ=2, α=0.75) is used to mitigate the class imbalance between boundary and non-boundary sentences.

## Model effectiveness and efficiency evaluation

On effectiveness, PoNet reaches a Positive F1 of 67.13 on the Wiki727K test set, 98.43% of BERT's score. On efficiency, it processes 30,256 tokens per second, 1.9 times as many as BERT, striking a good balance between the two. Measurement setup: Tesla V100 32G GPU, batch_size=1.

If our model is helpful to you, please cite our paper: PoNet for long-document topic segmentation.
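As a note on the training setup above, Focal Loss (γ=2, α=0.75) down-weights well-classified examples so the rare boundary class dominates the gradient. Below is a minimal illustrative sketch of the binary focal loss on per-sentence boundary probabilities; it is not the actual training code.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (boundary) class.
    y: gold label, 1 for a segment-boundary sentence, 0 otherwise.
    gamma down-weights easy examples; alpha up-weights the rare
    positive class (here alpha=0.75 favors boundaries).
    """
    p_t = p if y == 1 else 1.0 - p          # prob. of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes almost nothing,
# while a confident mistake is penalized heavily.
easy = focal_loss(0.95, 1)   # well-classified boundary
hard = focal_loss(0.05, 1)   # missed boundary
```

With γ=0 and α=0.5 this reduces (up to a constant factor) to ordinary binary cross-entropy; raising γ shifts the loss mass toward hard examples.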
| model | Positive F1 | Pk | WD |
|---|---|---|---|
| Two-Level LSTM | - | 22.13 | - |
| Cross-segment BERT-Base 128-128 | 64.0 | - | - |
| Cross-segment BERT-Large 128-128 | 66.0 | - | - |
| Cross-segment BERT-Large 256-256 | 67.1 | - | - |
| HierBERT | 66.5 | - | - |
| TLT-TS | - | 19.41 | - |
| SeqModel-BERT-Base | 68.2 | - | - |
| PoNet | 67.13 | 19.00 | 20.97 |
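Pk and WD in the table above are penalty metrics (lower is better): a probe window of width k slides over the document and counts positions where reference and hypothesis disagree on whether the two ends of the window fall in the same segment. Below is a minimal sketch of Pk assuming a binary encoding where 1 marks the last sentence of a segment; it is an illustrative implementation, not the official evaluation script.

```python
def pk(ref, hyp, k=None):
    """Pk segmentation error (Beeferman et al., 1999). Lower is
    better; 0.0 means the hypothesis matches the reference exactly.

    ref, hyp: lists of 0/1 where 1 marks the last sentence of a
    segment.
    """
    if k is None:
        # conventional choice: half the mean reference segment length
        k = max(1, round(len(ref) / (sum(ref) + 1) / 2))
    errors = 0
    total = len(ref) - k
    for i in range(total):
        # do positions i and i+k lie in the same segment?
        ref_same = sum(ref[i:i + k]) == 0
        hyp_same = sum(hyp[i:i + k]) == 0
        errors += ref_same != hyp_same
    return errors / total

ref = [0, 0, 1, 0, 0, 1, 0, 0]   # gold: boundaries after sentences 3 and 6
hyp = [0, 1, 0, 0, 0, 1, 0, 0]   # predicted: first boundary one sentence early
error = pk(ref, hyp)
```

WD (WindowDiff) refines Pk by comparing the *number* of boundaries inside the window rather than a same-segment yes/no, so it penalizes near-miss and extra boundaries more evenly.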
| model | max_seq_length | efficiency (tokens/s) |
|---|---|---|
| SeqModel-BERT-Base | 512 | 15885 |
| PoNet | 4096 | 30256 |
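One practical implication of max_seq_length: a model capped at 512 tokens must slide a window over a long document and run several forward passes, while a 4096-token model often covers the same document in one pass. The sketch below illustrates this chunk-count arithmetic; the function name and the non-overlapping stride convention are hypothetical, not part of either model's actual inference code.

```python
def num_chunks(n_tokens, max_seq_length, stride=None):
    """Number of forward passes needed to cover a document of
    n_tokens tokens with windows of max_seq_length, advancing by
    stride tokens (defaults to non-overlapping windows)."""
    stride = stride or max_seq_length
    if n_tokens <= max_seq_length:
        return 1
    # ceil((n_tokens - max_seq_length) / stride) extra windows
    return -(-(n_tokens - max_seq_length) // stride) + 1

doc_len = 3800                       # tokens in a fairly long document
single_pass = num_chunks(doc_len, 4096)   # PoNet-style context window
windowed = num_chunks(doc_len, 512)       # BERT-style context window
```

Fewer passes also means boundaries are scored with full-document context instead of being stitched together across window edges.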
## Code example
```python
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task=Tasks.document_segmentation,
    model='damo/nlp_ponet_document-segmentation_topic-level_english-base',
    model_revision="v1.1.1",
)

doc = ['Actresses (Catalan: Actrius) is a 1997 Catalan language Spanish drama film produced and directed by Ventura Pons and based on the award-winning stage play "E.R."', 'by Josep Maria Benet i Jornet.', 'The film has no male actors, with all roles played by females.', 'The film was produced in 1996.', '"Actrius" screened in 2001 at the Grauman\'s Egyptian Theatre in an American Cinematheque retrospective of the works of its director.', 'The film had first screened at the same location in 1998.', 'It was also shown at the 1997 Stockholm International Film Festival.', 'In "Movie - Film - Review", "Daily Mail" staffer Christopher Tookey wrote that though the actresses were "competent in roles that may have some reference to their own careers", the film "is visually unimaginative, never escapes its stage origins, and is almost totally lacking in revelation or surprising incident".', 'Noting that there were "occasional, refreshing moments of intergenerational bitchiness", they did not "justify comparisons to "All About Eve"", and were "insufficiently different to deserve critical parallels with "Rashomon"".', 'He also wrote that "The Guardian" called the film a "slow, stuffy chamber-piece", and that "The Evening Standard" stated the film\'s "best moments exhibit the bitchy tantrums seething beneath the threesome\'s composed veneers".', 'MRQE wrote "This cinematic adaptation of a theatrical work is true to the original, but does not stray far from a theatrical rendering of the story."']

result = p(documents=doc)
topics = result[OutputKeys.TEXT].split("\n")
print(topics)
```
## Related papers and citation
```bibtex
@inproceedings{DBLP:journals/corr/abs-2110-02442,
  author    = {Chao{-}Hong Tan and
               Qian Chen and
               Wen Wang and
               Qinglin Zhang and
               Siqi Zheng and
               Zhen{-}Hua Ling},
  title     = {{PoNet}: Pooling Network for Efficient Token Mixing in Long Sequences},
  booktitle = {10th International Conference on Learning Representations, {ICLR} 2022,
               Virtual Event, April 25-29, 2022},
  publisher = {OpenReview.net},
  year      = {2022},
  url       = {https://opereview.et/forum?id=9jID9JjicF},
}
```