datasets["train"][0] #{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, # 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', # 'id': '5733be284776f41900661182', # 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', # 'title': 'University_of_Notre_Dame'}
The function below randomly picks a few examples from the dataset and displays them:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
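The helper can then be called directly on a split; a minimal usage sketch, assuming the `datasets` object loaded earlier in the tutorial, which renders rows like the context/id/question shown next:

# Display a few random training examples as an HTML table (sketch; `datasets` is assumed to be loaded).
show_random_elements(datasets["train"], num_examples=3)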
context: In Alberta, five bitumen upgraders produce synthetic crude oil and a variety of other products: The Suncor Energy upgrader near Fort McMurray, Alberta produces synthetic crude oil plus diesel fuel; the Syncrude Canada, Canadian Natural Resources, and Nexen upgraders near Fort McMurray produce synthetic crude oil; and the Shell Scotford Upgrader near Edmonton produces synthetic crude oil plus an intermediate feedstock for the nearby Shell Oil Refinery. A sixth upgrader, under construction in 2015 near Redwater, Alberta, will upgrade half of its crude bitumen directly to diesel fuel, with the remainder of the output being sold as feedstock to nearby oil refineries and petrochemical plants.
id: 571b074c9499d21900609be3
question: Besides crude oil, what does the Suncor Energy plant produce?
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]
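The next cell refers to `tokenized_example`, which is this long example re-tokenized into overlapping chunks. A minimal sketch, assuming a `max_length` of 384 and a `doc_stride` of 128 (illustrative values; the tutorial defines its own elsewhere):

# Tokenize the long example into overlapping chunks of the context.
max_length = 384  # maximum length of a feature (question + context); illustrative assumption
doc_stride = 128  # overlap between two consecutive chunks; illustrative assumption
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride,
)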
for i, x in enumerate(tokenized_example["input_ids"][:2]):
    print("Chunk: {}".format(i))
    print(tokenizer.decode(x))
# Chunk: 0
# [CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP]
# Chunk: 1
# [CLS] how many wins does the notre dame men's basketball team have? [SEP] championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]
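Each chunk also carries an `offset_mapping` that maps every token back to a character span in the original text. As a quick sanity check, a sketch along these lines (assuming the `tokenized_example` above) maps the first real token of chunk 0 back to the matching word of the question:

# Map the first non-special token of chunk 0 back to its characters in the question (sketch).
first_token_id = tokenized_example["input_ids"][0][1]   # token right after [CLS]
offsets = tokenized_example["offset_mapping"][0]         # one (start_char, end_char) pair per token
char_start, char_end = offsets[1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][char_start:char_end])
# should print the same word twice (up to casing), e.g. "how How"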
# Detect whether the answer is out of this span; if so, label this feature with the CLS index as well.
if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    tokenized_examples["start_positions"].append(cls_index)
    tokenized_examples["end_positions"].append(cls_index)
else:
    # Otherwise, move token_start_index and token_end_index to the start and end tokens of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    tokenized_examples["start_positions"].append(token_start_index - 1)
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    tokenized_examples["end_positions"].append(token_end_index + 1)
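A quick way to validate this labeling logic is to apply it to the first chunk of the long example from above and decode the resulting span; a sketch, assuming the `tokenized_example` built earlier and using the example's gold answer:

# Check the labeling logic on chunk 0 of the long example (sketch).
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])
offsets = tokenized_example["offset_mapping"][0]
sequence_ids = tokenized_example.sequence_ids(0)          # None for special tokens, 0 = question, 1 = context
token_start_index = sequence_ids.index(1)                 # first context token
token_end_index = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)  # last context token
if offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char:
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(tokenizer.decode(tokenized_example["input_ids"][0][start_position : end_position + 1]))
    print(answers["text"][0])  # the two printed strings should match (up to casing)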
for batch in trainer.get_eval_dataloader():  # iterator over evaluation batches
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()
# odict_keys(['loss', 'start_logits', 'end_logits'])
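The model outputs one start logit and one end logit per token. The simplest way to decode them would be a plain argmax per feature, as in the sketch below, but that can produce impossible spans (an end before the start, or positions inside the question), which is why the more careful candidate selection that follows is needed:

# Naive decoding sketch: most likely start/end token position per feature in the batch.
# Shown only to illustrate why plain argmax decoding is not enough.
print(output.start_logits.argmax(dim=-1))
print(output.end_logits.argmax(dim=-1))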
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    # overflow_to_sample_mapping maps each chunk back to the index of its original example. For instance, if
    # 2 examples are split into 4 chunks, the mapping is [0, 0, 1, 1]: the first two chunks come from the first example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []
    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0
        # One example can give several spans; this is the index of the example containing this span of text.
        # Recover the index of the original example.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])
        # Set to None the offset_mapping entries that are not part of the context so it's easy to determine if a
        # token position is part of the context or not.
        # In other words, mask the question part of offset_mapping with None and keep the context part unchanged.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
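The `validation_features` used in the cells below come from mapping this function over the validation split; a minimal sketch, assuming the same `datasets` object as above:

# Build the validation features referenced below (sketch).
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names,
)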
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match
# the example_id to an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to parts of the input_ids that are not in the context (an invalid answer).
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length (an invalid answer).
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index:  # We need to refine that test to check the answer is inside the context.
            # A start index that is not after the end index gives a plausible candidate answer.
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    # The answer text is recovered from the character offsets of its tokens.
                    "text": context[start_char:end_char],
                }
            )

# Finally, sort `valid_answers` by `score` and keep the best candidates.
valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers
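To get a feel for the result, the top candidate can be compared with the gold answer of the first validation example; a quick check along these lines:

# Compare the best candidate with the gold answer of the first validation example (sketch).
print(valid_answers[0]["text"] if valid_answers else "<no candidate>")
print(datasets["validation"][0]["answers"]["text"])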
examples = datasets["validation"] features = validation_features
example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)
    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()
    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")
    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]
        min_null_score = None  # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated with the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the
            # original context.
            offset_mapping = features[feature_index]["offset_mapping"]
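            # Added sketch: the squad_v2 branch at the end of this function compares best_answer["score"]
            # against min_null_score, so the score of the null answer (predicting the [CLS] token) has to be
            # tracked per feature. This follows the standard SQuAD post-processing; `tokenizer.cls_token_id`
            # is assumed to be available in this scope.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score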
            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to parts of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we don't have a single non-null prediction, we create a fake
            # prediction to avoid a failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2).
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
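The evaluation cell below assumes post-processing has been run over the whole validation set. A minimal sketch of how `final_predictions` would typically be produced, assuming `raw_predictions` comes from `trainer.predict` on the validation features:

# Sketch: predict over all validation features, restore the hidden columns, then post-process.
raw_predictions = trainer.predict(validation_features)
# The Trainer hides columns it does not use (example_id, offset_mapping), so re-expose them first.
validation_features.set_format(type=validation_features.format["type"],
                               columns=list(validation_features.features.keys()))
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features,
                                               raw_predictions.predictions)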
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
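Here `metric` is the SQuAD evaluation metric, whose `compute` call returns exact-match and F1 scores. One way it is commonly obtained (a sketch; newer code would use the separate evaluate library instead):

# Sketch: load the metric matching the dataset variant.
from datasets import load_metric
metric = load_metric("squad_v2" if squad_v2 else "squad")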