BERT in Practice (6): Generation Tasks - Summarization

Introduction

This post shows how to use models from the 🤗 Transformers library to tackle summarization, a text generation task.

Task Overview

Summarization condenses the gist of an entire article into a few concise sentences (the summary), so that readers can grasp what the original text says just by reading the summary.

This post is organized into the following parts:

  1. data loading;
  2. data preprocessing;
  3. fine-tuning the pretrained model: we use the Seq2SeqTrainer API from transformers to fine-tune the pretrained model (note that this is the Seq2SeqTrainer API, whereas all previous tasks used the Trainer API).

Prerequisites

Install the following libraries:

pip install datasets transformers rouge-score nltk
#transformers==4.9.2
#datasets==1.11.0
#rouge-score==0.0.4
#nltk==3.6.2

Data Loading

The dataset

We use the XSum dataset, which contains BBC articles, each paired with a one-sentence summary.

Loading the data

Loading this dataset is wrapped by the datasets library, so we can load it with the following code:

from datasets import load_dataset
raw_datasets = load_dataset("xsum")

Given the key of a data split (train, validation, or test) and an index, we can inspect an example:

raw_datasets["train"][0]
# {'document': 'Recent reports have linked some France-based players with returns to Wales.\n"I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales," said Davies.\nThe WRU provides £2m to the fund and £1.3m comes from the regions.\nFormer Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.\nHe is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.\nDavies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in a £60m deal in August this year.\nIn the wake of that deal being done, Davies said the £3.3m should be spent on ensuring current Wales-based stars remain there.\nIn recent weeks, Racing Metro flanker Dan Lydiate was linked with returning to Wales.\nLikewise the Paris club\'s scrum-half Mike Phillips and centre Jamie Roberts were also touted for possible returns.\nWales coach Warren Gatland has said: "We haven\'t instigated contact with the players.\n"But we are aware that one or two of them are keen to return to Wales sooner rather than later."\nSpeaking to Scrum V on BBC Radio Wales, Davies re-iterated his stance, saying keeping players such as Scarlets full-back Liam Williams and Ospreys flanker Justin Tipuric in Wales should take precedence.\n"It\'s obviously a limited amount of money [available]. The union are contributing 60% of that contract and the regions are putting £1.3m in.\n"So it\'s a total pot of just over £3m and if you look at the sorts of salaries that the... guys... have been tempted to go overseas for [are] significant amounts of money.\n"So if we were to bring the players back, we\'d probably get five or six players.\n"And I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales.\n"There are players coming out of contract, perhaps in the next year or so… you\'re looking at your Liam Williams\' of the world; Justin Tipuric for example - we need to keep these guys in Wales.\n"We actually want them there. They are the ones who are going to impress the young kids, for example.\n"They are the sort of heroes that our young kids want to emulate.\n"So I would start off [by saying] with the limited pot of money, we have to retain players in Wales.\n"Now, if that can be done and there\'s some spare monies available at the end, yes, let\'s look to bring players back.\n"But it\'s a cruel world, isn\'t it?\n"It\'s fine to take the buck and go, but great if you can get them back as well, provided there\'s enough money."\nBritish and Irish Lions centre Roberts has insisted he will see out his Racing Metro contract.\nHe and Phillips also earlier dismissed the idea of leaving Paris.\nRoberts also admitted being hurt by comments in French Newspaper L\'Equipe attributed to Racing Coach Laurent Labit questioning their effectiveness.\nCentre Roberts and flanker Lydiate joined Racing ahead of the 2013-14 season while scrum-half Phillips moved there in December 2013 after being dismissed for disciplinary reasons by former club Bayonne.',
# 'id': '29750031',
# 'summary': 'New Welsh Rugby Union chairman Gareth Davies believes a joint £3.3m WRU-regions fund should be used to retain home-based talent such as Liam Williams, not bring back exiled stars.'}

The following function randomly picks a few examples from the dataset and displays them:

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=1):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])
# document: Media playback is unsupported on your device18 December 2014 Last updated at 10:28 GMThas successfully tackled poverty over the last four decades by drawing on its rich natural resources.to the World Bank, some 49% of Malaysians in 1970 were extremely poor, and that figure has been reduced to 1% today. However, the government's next challenge is to help the lower income group to move up to the middle class, the bank says.Zahau, the World Bank's Southeast Asia director, spoke to the BBC's Jennifer Pak.
# id: 30530533
# summary: In Malaysia the "aspirational" low-income part of the population is helping to drive economic growth through consumption, according to the World Bank.

Data Preprocessing

Before feeding the data into the model, we need to preprocess it.

The two basic preprocessing steps are the same as before:

  1. tokenization;
  2. converting the result into the input format that the model expects for this task.

The Tokenizer takes care of both steps: it tokenizes the input, maps the tokens to the token IDs used by the pretrained model, and converts everything into the input format the model requires.

Initializing the Tokenizer

A previous post already covered the Tokenizer and walked through tokenization examples, so we will not repeat that here. use_fast=True selects the fast version of the tokenizer.

We use the t5-small checkpoint for this task.

from transformers import AutoTokenizer
model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Converting to the model input format for this task

The model's input is the document to be summarized.

Note: to prepare the targets (the summaries) for the model, use as_target_tokenizer to switch the tokenizer into target mode:

with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))
#{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

If you are using one of the T5 pretrained checkpoints (such as the t5-small used here), you need to take care of a special prefix. T5 uses task-specific prefixes ("summarize: " in our case) to tell the model which task to perform. The prefix is set as follows:

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

Now we can put everything together into a preprocessing function, preprocess_function. When preprocessing the examples, truncation=True ensures that texts longer than the limit are truncated. Shorter sentences are not padded here; padding to the longest sequence in each batch is handled later by the data collator. max_input_length caps the length of the input text, and max_target_length caps the length of the summary.

max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The preprocessing function can handle a single example as well as a batch of examples. When several examples are passed in, it returns the preprocessed results for all of them, as the quick check below shows.
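
As a sanity check, we can call the function on the first two training examples; the returned dict contains input_ids, attention_mask, and labels (the exact token IDs depend on the tokenizer, so they are elided here):

preprocess_function(raw_datasets["train"][:2])
# {'input_ids': [[...], [...]], 'attention_mask': [[...], [...]], 'labels': [[...], [...]]}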

Next, use the map function to apply preprocess_function to every example in all three splits of the dataset. The batched=True argument encodes the texts in batches, which takes full advantage of the fast tokenizer loaded above: it processes the texts of a batch concurrently with multiple threads.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
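
After mapping, each example keeps its original columns and gains the new input_ids, attention_mask, and labels columns. A quick way to confirm this (the exact column order may differ):

print(tokenized_datasets["train"][0].keys())
# dict_keys(['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'])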

Fine-tuning the Pretrained Model

With the data ready, we download and load the pretrained model, then fine-tune it.

Loading the pretrained model

Since this is a seq2seq task, we need a model class that can handle it; we use AutoModelForSeq2SeqLM.

Loading works the same way as in the previous posts, so we will not go over it again.

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Setting the training arguments

To build a Seq2SeqTrainer we also need the training configuration Seq2SeqTrainingArguments. It contains all the attributes that define the training process.

Because the dataset is large and Seq2SeqTrainer keeps saving checkpoints during training, we set save_total_limit=3 so that at most 3 checkpoints are kept on disk.

from transformers import Seq2SeqTrainingArguments

batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most 3 checkpoints
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

Data collator

Next we need to tell the Trainer how to build batches from the preprocessed inputs. We use the data collator DataCollatorForSeq2Seq, which pads and collates the preprocessed inputs into batches before feeding them to the model.

from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
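
To see what the collator produces, we can feed it two preprocessed examples by hand. This is just an illustrative sketch that keeps only the tokenized fields (the Trainer drops the unused string columns automatically); note how the label padding positions are filled with -100, which the loss ignores:

features = [tokenized_datasets["train"][i] for i in range(2)]
features = [{k: f[k] for k in ["input_ids", "attention_mask", "labels"]} for f in features]
batch = data_collator(features)
print(batch["labels"])  # padded label positions are -100 and do not contribute to the loss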

Defining the evaluation metric

We use the 'rouge' metric, loaded with load_metric, and call metric.compute to evaluate the model.

metric.compute compares predictions against references (labels) to produce the scores. Both predictions and labels need to be lists of strings, formatted as in the example below:

from datasets import load_metric
metric = load_metric("rouge")

fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)
#{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
# 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
# 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
# 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

Before the model predictions can be scored, we still need to do some post-processing; this is handled inside the compute_metrics function below.

nltk is a Python toolkit for natural language processing; here we use its sentence splitter, nltk.sent_tokenize().
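
The sentence splitter relies on nltk's punkt models, which need to be downloaded once. A small illustration of the newline-joined format that ROUGE expects:

import nltk
nltk.download("punkt", quiet=True)  # one-time download of the sentence splitter models
print("\n".join(nltk.sent_tokenize("The cat sat on the mat. It was asleep.")))
# The cat sat on the mat.
# It was asleep.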

import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]  # split into sentences, join with newlines
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Training

Pass the data, model, and training arguments to the Trainer:

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Call the train method to start training:

trainer.train()
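
Once training finishes, the fine-tuned model can be used to generate a summary for an unseen document. The snippet below is an illustrative sketch; the generation parameters (num_beams, max_length) are assumptions rather than values from this post:

document = raw_datasets["test"][0]["document"]
inputs = tokenizer(prefix + document, max_length=max_input_length, truncation=True, return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_length=max_target_length, num_beams=4)  # beam search decoding
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))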

References

4.7-生成任务-摘要生成.md

Transformers official documentation

BERT相关——(7)将BERT应用到下游任务 (BERT notes (7): Applying BERT to downstream tasks)