Evaluating a trained model requires evaluation metrics, such as accuracy, precision, recall, and F1 score. Different task types use different metrics, and HuggingFace provides a unified metrics tool.
1. Listing the available metrics
The list_metrics() function lists the available metrics:
def list_metric_test():
    # Chapter 4 / list the available metrics
    from datasets import list_metrics
    metrics_list = list_metrics()
    print(len(metrics_list), metrics_list[:5])
The output is as follows:
157 ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score']
As shown, there are currently 157 metrics available; the first five are printed.
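Since list_metrics() returns a plain Python list of metric names, the full list can be searched with ordinary list operations. A minimal sketch, using a hard-coded sample of names (taken from the output above) so the snippet runs without network access:

```python
# Sample of metric names, hard-coded from the output above so this
# snippet does not need to call list_metrics() over the network.
metrics_list = ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score']

# Filter the list for names containing a keyword.
bleu_like = [name for name in metrics_list if 'bleu' in name]
print(bleu_like)  # ['bleu', 'bleurt']
```

The same comprehension works on the full 157-entry list returned by list_metrics().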
2. Loading a metric
Use load_metric() to load a metric. Note that some metrics are paired with a specific dataset; here the mrpc subset of the glue dataset serves as an example:
def load_metric_test():
    # Chapter 4 / load a metric
    from datasets import load_metric
    metric = load_metric(path="accuracy")  # load the accuracy metric
    print(metric)

    # Chapter 4 / load a metric tied to a dataset
    metric = load_metric(path='glue', config_name='mrpc')  # load the mrpc subset of the glue dataset
    print(metric)
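What the accuracy metric computes is simply the fraction of predictions that match their references. A pure-Python sketch of that calculation (an illustration, not the library's implementation):

```python
def accuracy(predictions, references):
    # Fraction of positions where the prediction equals the reference.
    correct = sum(p == r for p, r in zip(predictions, references))
    return {'accuracy': correct / len(references)}

print(accuracy(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# {'accuracy': 0.75}
```

This mirrors the dictionary shape that metric.compute() returns.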
3. Getting a metric's usage description
The inputs_description attribute of a metric describes how to use it, as shown below:
def load_metric_description_test():
    # Chapter 4 / get a metric's usage description
    from datasets import load_metric
    glue_metric = load_metric('glue', 'mrpc')  # load the mrpc subset of the glue dataset
    print(glue_metric.inputs_description)

    references = [0, 1]
    predictions = [0, 1]
    results = glue_metric.compute(predictions=predictions, references=references)
    print(results)  # {'accuracy': 1.0, 'f1': 1.0}
The output is as follows:
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
predictions: list of predictions to score.
Each translation should be tokenized into a list of tokens.
references: list of lists of references for each translation.
Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
"accuracy": Accuracy
"f1": F1 score
"pearson": Pearson Correlation
"spearmanr": Spearman Correlation
"matthews_correlation": Matthew Correlation
Examples:
>>> glue_metric = datasets.load_metric('glue', 'sst2') # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'mrpc') # 'mrpc' or 'qqp'
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0, 'f1': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'stsb')
>>> references = [0., 1., 2., 3., 4., 5.]
>>> predictions = [0., 1., 2., 3., 4., 5.]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
{'pearson': 1.0, 'spearmanr': 1.0}
>>> glue_metric = datasets.load_metric('glue', 'cola')
>>> references = [0, 1]
>>> predictions = [0, 1]
>>> results = glue_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'matthews_correlation': 1.0}
{'accuracy': 1.0, 'f1': 1.0}
The usage description is printed first; the compute() call then returns the accuracy and f1 metrics.
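The {'accuracy': 1.0, 'f1': 1.0} result above can be verified by hand: for the mrpc subset, accuracy is the match rate and f1 is the binary F1 score with label 1 as the positive class. A minimal pure-Python sketch (an illustration, not the library's implementation):

```python
def mrpc_style_metrics(predictions, references):
    # accuracy: fraction of exact matches
    acc = sum(p == r for p, r in zip(predictions, references)) / len(references)
    # binary F1, treating label 1 as the positive class
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {'accuracy': acc, 'f1': f1}

print(mrpc_style_metrics(predictions=[0, 1], references=[0, 1]))
# {'accuracy': 1.0, 'f1': 1.0}
```

With predictions identical to the references, both metrics reach 1.0, matching the printed result.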