BERT in Practice (3): Multiple-Choice Question Answering

Introduction

This post shows how to use models from the 🤗 Transformers library to solve the multiple-choice variant of question answering.

Task Overview

Although it is called multiple-choice question answering, the task really means: given a question together with several candidate answers (options), pick the single most plausible one, much like an ordinary single-answer exam question. In essence this is still a classification task: the model scores each (question, option) pair and selects the option with the highest score.

For example, given the first half of a sentence and several candidate second halves, pick the option that actually continues it:

Input: ("离离原上草", ["天安门一游", "一岁一枯荣", "春风吹又生"])
Output: 1
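
To make this framing concrete, here is a minimal, purely conceptual sketch (score_fn is a hypothetical scoring function, not a real API) of how multiple choice reduces to scoring each (context, option) pair and taking the argmax:

# Conceptual sketch only: score_fn is a hypothetical function that scores one
# (context, option) pair; the prediction is the option with the highest score.
def predict(context, options, score_fn):
    scores = [score_fn(context, option) for option in options]
    return scores.index(max(scores))

# predict("离离原上草", ["天安门一游", "一岁一枯荣", "春风吹又生"], score_fn)  # -> 1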

The post is organized into the following parts:

  1. Loading the data
  2. Preprocessing the data
  3. Fine-tuning the pretrained model: use the Trainer API from transformers to fine-tune the pretrained model.

Prerequisites

Install the following libraries:

pip install datasets transformers
#transformers==4.9.2
#datasets==1.11.0
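
If you want to confirm your environment matches the versions above, you can optionally print the installed versions:

import transformers
import datasets
print(transformers.__version__)  # e.g. 4.9.2
print(datasets.__version__)      # e.g. 1.11.0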

Data Loading

The SWAG dataset

We use the SWAG dataset. SWAG is a dataset for commonsense reasoning: each example describes a situation and then offers four possible continuations.

Loading the data

Loading this dataset is already wrapped by the datasets library, so we can load it with:

from datasets import load_dataset
datasets = load_dataset("swag", "regular")

If you are using your own data, refer to the first post in this series (see its section on loading data) for how to load it.

If the automatic download above runs into problems, you can download the data from this link and unzip it, copy the three resulting csv files into your code directory, and then load them from the local cache as follows:

import os

data_path = '.'  # directory containing the csv files
cache_dir = os.path.join(data_path, 'cache')
data_files = {'train': os.path.join(data_path, 'train.csv'), 'val': os.path.join(data_path, 'val.csv'), 'test': os.path.join(data_path, 'test.csv')}
datasets = load_dataset(data_path, 'regular', data_files=data_files, cache_dir=cache_dir)
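
Either way, load_dataset returns a DatasetDict with one Dataset per split; printing it is a quick sanity check (illustrative, the exact split names and sizes depend on which loading path you used):

print(datasets)
# A DatasetDict with keys 'train', 'validation' and 'test' (or 'train'/'val'/'test'
# if you loaded the local csv files), each holding a Dataset whose fields are
# shown in the next example.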

Given a split key (train, validation, or test) and an index, you can inspect an example:

datasets["train"][0]
#{'ending0': 'passes by walking down the street playing their instruments.',
# 'ending1': 'has heard approaching them.',
# 'ending2': "arrives and they're outside dancing and asleep.",
# 'ending3': 'turns the lead singer watches the performance.',
# 'fold-ind': '3416',
# 'gold-source': 'gold',
# 'label': 0,
# 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
# 'sent2': 'A drum line',
# 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
# 'video-id': 'anetv_jkn6uvmqwh4'}

The function below picks a few random examples from the dataset and displays them:

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=3):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])
ending0 ending1 ending2 ending3 fold-ind gold-source label sent1 sent2 startphrase video-id
0 are seated on a field. are skiing down the slope. are in a lift. are pouring out in a man. 16668 gold 1 A man is wiping the skiboard. Group of people A man is wiping the skiboard. Group of people anetv_JmL6BiuXr_g
1 performs stunts inside a gym. shows several shopping in the water. continues his skateboard while talking. is putting a black bike close. 11424 gold 0 The credits of the video are shown. A lady The credits of the video are shown. A lady anetv_dWyE0o2NetQ
2 is emerging into the hospital. are strewn under water at some wreckage. tosses the wand together and saunters into the marketplace. swats him upside down. 15023 gen 1 Through his binoculars, someone watches a handful of surfers being rolled up into the wave. Someone Through his binoculars, someone watches a handful of surfers being rolled up into the wave. Someone lsmdc3016_CHASING_MAVERICKS-6791
3 spies someone sitting below. opens the fridge and checks out the photo. puts a little sheepishly. staggers up to him. 5475 gold 3 He tips it upside down, and its little umbrella falls to the floor. Back inside, someone He tips it upside down, and its little umbrella falls to the floor. Back inside, someone lsmdc1008_Spider-Man2-75503

As the output shows, each example in the dataset has a context formed by a first sentence (field sent1) and the beginning of a second sentence (field sent2), together with four candidate endings (fields ending0, ending1, ending2, ending3); the model has to pick the correct one, indicated by the field label.

The function below shows a single example in a more readable way:

def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f" A - {example['sent2']} {example['ending0']}")
    print(f" B - {example['sent2']} {example['ending1']}")
    print(f" C - {example['sent2']} {example['ending2']}")
    print(f" D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

show_one(datasets["train"][0])
#Context: Members of the procession walk down the street holding small horn brass instruments.
# A - A drum line passes by walking down the street playing their instruments.
# B - A drum line has heard approaching them.
# C - A drum line arrives and they're outside dancing and asleep.
# D - A drum line turns the lead singer watches the performance.

#Ground truth: option A

Data Preprocessing

Before feeding the data to the model, we need to preprocess it.

As usual, there are two basic preprocessing steps:

  1. tokenization;
  2. conversion into the input format the model expects for this task.

The Tokenizer handles both steps: it first tokenizes the input, then maps the tokens to the token IDs of the pretrained model's vocabulary, and finally produces the input format the model needs.

Initializing the Tokenizer

Previous posts already covered the Tokenizer and gave tokenization examples, so that is not repeated here. use_fast=True selects the fast (Rust-backed) version of the tokenizer.

from transformers import AutoTokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
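
As a quick illustration (the sentence pair below is taken from the first training example), the tokenizer can encode a pair of sentences directly; the resulting input_ids contain both sentences joined with the special tokens the model expects:

encoding = tokenizer("Members of the procession walk down the street holding small horn brass instruments.",
                     "A drum line passes by walking down the street playing their instruments.")
print(encoding.keys())                          # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(tokenizer.decode(encoding["input_ids"]))  # '[CLS] members of the procession ... [SEP] a drum line ... [SEP]'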

Converting to the model input format for this task

What does the model input look like for this type of task?

In fact, we pair the question with each option separately, so a single example becomes a list of sentence pairs, one pair per option, as shown below:

[("Members of the procession walk down the street holding small horn brass instruments.","A drum line passes by walking down the street playing their instruments."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line has heard approaching them."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line arrives and they're outside dancing and asleep."),
("Members of the procession walk down the street holding small horn brass instruments.","A drum line turns the lead singer watches the performance.")]

As introduced before, the Tokenizer accepts either a single sentence or a pair of sentences as input.

So, before calling the tokenizer, we need to preprocess the dataset to build these sentence pairs.

In the preprocess_function:

  1. First, put each example's question and its options into two nested lists (one holding every example's question repeated once per option, the other holding every example's options);

    For instance, e1_sen1 denotes example 1's question (the first sentence fed to the tokenizer) and e1_sen2_1 denotes example 1's option 1 (the second sentence fed to the tokenizer), and so on:

    [[e1_sen1,e1_sen1,e1_sen1,e1_sen1],
    [e2_sen1,e2_sen1,e2_sen1,e2_sen1],
    [e3_sen1,e3_sen1,e3_sen1,e3_sen1]]

    [[e1_sen2_1,e1_sen2_2,e1_sen2_3,e1_sen2_4],
    [e2_sen2_1,e2_sen2_2,e2_sen2_3,e2_sen2_4],
    [e3_sen2_1,e3_sen2_2,e3_sen2_3,e3_sen2_4]]
  2. Then flatten both nested lists (remove one level of nesting) so the tokenizer can process all pairs in one batch. Taking the question list as an example:

    after flatten->
    [e1_sen1,e1_sen1,e1_sen1,e1_sen1,
    e2_sen1,e2_sen1,e2_sen1,e2_sen1,
    e3_sen1,e3_sen1,e3_sen1,e3_sen1]
    after tokenize->
    [e1_tokens,e1_tokens,e1_tokens,e1_tokens,
    e2_tokens,e2_tokens,e2_tokens,e2_tokens,
    e3_tokens,e3_tokens,e3_tokens,e3_tokens]
  3. After tokenization, un-flatten so that each example again has one set of input IDs, attention masks, etc. per option.

    after unflatten->
    [[e1_tokens,e1_tokens,e1_tokens,e1_tokens],
    [e2_tokens,e2_tokens,e2_tokens,e2_tokens],
    [e3_tokens,e3_tokens,e3_tokens,e3_tokens]]

The argument truncation=True truncates any input that is longer than the maximum length the model can accept.

The code:

ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
    # Build the two lists of sentences that will be fed to the tokenizer.
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]  # the tokenizer's first sentence, repeated once per option
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]  # the beginning of the tokenizer's second sentence
    # Concatenate the header with each ending to form the tokenizer's second sentence (one per option).
    second_sentences = [[f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)]

    # Flatten everything into one list so the tokenizer can process all pairs in a single call:
    # [[e1_sen1]*4, [e2_sen1]*4, ...] -> [e1_sen1, e1_sen1, e1_sen1, e1_sen1, e2_sen1, e2_sen1, ...]
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten: regroup every 4 consecutive encodings so each example again has one entry per option
    # [e1_tok, e1_tok, e1_tok, e1_tok, e2_tok, ...] -> [[e1_tok, e1_tok, e1_tok, e1_tok], [e2_tok, ...], ...]
    return {k: [v[i:i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}

The preprocessing function above can handle a single example as well as a batch of examples. Given a batch, it returns the list of preprocessed results for all of its examples.

Let's decode the tokenizer output for one example; each example yields four sentences, one per option, each combining the context with that option.

examples = datasets["train"][:5]
features = preprocess_function(examples)
idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]
#['[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
# '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession wait slowly towards the cadets. [SEP]',
# '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
# '[CLS] a drum line passes by walking down the street playing their instruments. [SEP] members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']

Next we use the map method to apply the preprocessing function preprocess_function to every example in all three splits of datasets. The argument batched=True encodes the texts in batches, which takes full advantage of the fast tokenizer loaded earlier: it uses multiple threads to process the texts of a batch concurrently.

encoded_datasets = datasets.map(preprocess_function, batched=True)
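
Each entry of encoded_datasets now holds, per example, four encoded sequences (one per option) that have not been padded yet. A quick, purely illustrative check:

example = encoded_datasets["train"][0]
print(len(example["input_ids"]))      # 4: one encoded sequence per option
print(type(example["input_ids"][0]))  # a plain list of token IDs (no padding yet)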

Fine-tuning the Pretrained Model

The data is ready; now we need to download and load the pretrained model and then fine-tune it.

Loading the pretrained model

Since this is a multiple-choice task, we need a model class that can solve it: we use AutoModelForMultipleChoice.

Loading works the same way as in the previous posts, so it is not repeated here.

from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)

Setting the training arguments

To build a Trainer we also need the training configuration, TrainingArguments. It contains all the attributes that define the training run:

batch_size = 16

from transformers import TrainingArguments

args = TrainingArguments(
    "test-swag",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
)

Data collator

Next we need to tell the Trainer how to build batches from the preprocessed inputs. We use a data collator: it takes the preprocessed examples, groups them into batches, finishes processing them, and feeds them to the model.

As the output of preprocess_function shows, the examples have not been padded yet. The data collator pads all sequences in a batch to the length of the longest sequence in that batch. Since the longest sentence in a given batch is usually shorter than the longest sentence in the whole dataset, not every batch needs to be padded to the global maximum length; padding per batch therefore noticeably improves training efficiency.

The transformers library does not ship a data collator for this particular setup, so we adapt DataCollatorWithPadding slightly. The comments in the code show how features and batch change shape step by step:

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # features: [{'attention_mask': [[], [], ...], 'input_ids': [[], [], ...], 'label': _}, ...]
        label_name = "label" if "label" in features[0].keys() else "labels"
        # Pop the labels out so only model inputs remain:
        # features: [{'attention_mask': [[], [], ...], 'input_ids': [[], [], ...]}, ...]
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])

        # feature: {'attention_mask': [[], [], ...], 'input_ids': [[], [], ...]}
        # flattened_features: [[{'attention_mask': [], 'input_ids': []}, {}, {}, {}], [...], ...]
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        # flattened_features: [{'attention_mask': [], 'input_ids': []}, {}, {}, {}, {}, ...]
        flattened_features = sum(flattened_features, [])

        # batch: {'attention_mask': [[], [], [], [], ...], 'input_ids': [[], [], [], [], ...]}
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        # batch: {'attention_mask': [[[], [], [], []], ...], 'input_ids': [[[], [], [], []], ...]}
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        # batch: {'attention_mask': ..., 'input_ids': ..., 'labels': [...]}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

Let's check that the data collator works correctly on a batch of 10 examples.

Here we make sure features only contains the input features accepted by the model (the Trainer performs this filtering automatically later):

accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)
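
As a quick, illustrative check, after collation every tensor in the batch has shape (batch_size, num_choices, sequence_length):

print(batch["input_ids"].shape)       # torch.Size([10, 4, L]), where L is the longest sequence in this batch
print(batch["attention_mask"].shape)  # same shape as input_ids
print(batch["labels"].shape)          # torch.Size([10])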

Then let's verify that a single example is still intact by comparing against the earlier show_one output. Looks right!

[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(4)]
#['[CLS] someone walks over to the radio. [SEP] someone hands her another phone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
# '[CLS] someone walks over to the radio. [SEP] someone takes the drink, then holds it. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
# '[CLS] someone walks over to the radio. [SEP] someone looks off then looks at someone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
# '[CLS] someone walks over to the radio. [SEP] someone stares blearily down at the floor. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

show_one(datasets["train"][8])
# Context: Someone walks over to the radio.
# A - Someone hands her another phone.
# B - Someone takes the drink, then holds it.
# C - Someone looks off then looks at someone.
# D - Someone stares blearily down at the floor.
#
# Ground truth: option D
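
To see what AutoModelForMultipleChoice produces, we can optionally run the collated batch through the model. This is only a sanity check under the setup above, not part of training: since the batch contains labels, the output includes a loss as well as one logit per option.

with torch.no_grad():
    outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([10, 4]): one score per option for each of the 10 examples
print(outputs.loss)          # cross-entropy loss over the 4 options (not meaningful before fine-tuning)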

Defining the evaluation metric

We evaluate the model with accuracy.

We need a function that computes and returns the accuracy: take the argmax of the predicted logits to get the predicted labels preds, compare them with the ground truth, and compute the fraction that match:

import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

Training

Pass the data, model, and arguments to the Trainer:

from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

Call the train method to start training:

trainer.train()
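
After training finishes, you can optionally run a final evaluation on the validation split and save the fine-tuned model; both are standard Trainer methods (the output directory name below is just an example):

metrics = trainer.evaluate()           # evaluates on eval_dataset, i.e. encoded_datasets["validation"]
print(metrics)                         # includes 'eval_accuracy' from compute_metrics
trainer.save_model("test-swag-final")  # hypothetical output directory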

References

4.4-问答任务-多选问答.md

BERT实战——(1)文本分类 (BERT in Practice (1): Text Classification)

Transformers official documentation