Evaluating a trained model requires evaluation metrics, such as accuracy, precision, recall, and F1 score. Different task types call for different metrics, and HuggingFace provides a unified metrics tool.
1. Listing the available metrics
The list_metrics() function lists the available metrics:
def list_metric_test():
    # Chapter 4 / listing the available metrics
    from datasets import list_metrics
    metrics_list = list_metrics()
    print(len(metrics_list), metrics_list[:5])
The output is as follows:
157 ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score']
As shown, there are currently 157 metrics, and the first 5 of them are printed.
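Since list_metrics() returns a plain list of strings, standard list operations can be used to narrow it down. A minimal sketch, using a hard-coded sample of the names printed above in place of a live call:

```python
# A hard-coded sample of the metric names printed above, standing in
# for a live list_metrics() call.
metrics_list = ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score']

# An ordinary list comprehension keeps only names starting with "bleu".
bleu_like = [m for m in metrics_list if m.startswith('bleu')]
print(bleu_like)  # ['bleu', 'bleurt']
```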
2. Loading a metric
A metric is loaded with load_metric(). Note that some metrics are designed to be used with a specific dataset; here the mrpc subset of the glue dataset serves as the example:
def load_metric_test():
    # Chapter 4 / loading a metric
    from datasets import load_metric
    metric = load_metric(path="accuracy")  # load the accuracy metric
    print(metric)

    # Chapter 4 / loading a dataset-specific metric
    metric = load_metric(path='glue', config_name='mrpc')  # load the metric for the mrpc subset of glue
    print(metric)
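To make concrete what the accuracy metric computes, here is a plain-Python equivalent of its result, written for illustration only and not the library's actual implementation:

```python
def accuracy(predictions, references):
    # Fraction of positions where the prediction equals the reference label.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# 3 of 4 predictions match the references.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```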
3. Getting a metric's usage description
A metric's inputs_description attribute documents how the metric is used, as shown below:
def load_metric_description_test():
    # Chapter 4 / getting a metric's usage description
    from datasets import load_metric
    glue_metric = load_metric('glue', 'mrpc')  # load the metric for the mrpc subset of glue
    print(glue_metric.inputs_description)
    references = [0, 1]
    predictions = [0, 1]
    results = glue_metric.compute(predictions=predictions, references=references)
    print(results)  # {'accuracy': 1.0, 'f1': 1.0}
The output is as follows:
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:
    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}

{'accuracy': 1.0, 'f1': 1.0}
The usage description of the metric is printed first, and then the accuracy and f1 metrics are computed.
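For a binary task like mrpc, the f1 value returned alongside accuracy can also be reproduced by hand. Below is a sketch of the standard binary F1 computation, assuming label 1 is the positive class; it is written for illustration and is not the library's implementation:

```python
def binary_f1(predictions, references):
    # Count true positives, false positives, and false negatives,
    # treating label 1 as the positive class.
    tp = sum(1 for p, r in zip(predictions, references) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predictions, references) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predictions, references) if p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Matches the f1 value in the {'accuracy': 1.0, 'f1': 1.0} result above.
print(binary_f1([0, 1], [0, 1]))  # 1.0
```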