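Hugging Face is, among other things, a model hub. It hosts pretrained models for many classic tasks, such as text classification, reading comprehension, fill-mask (cloze), text generation, named entity recognition, text summarization and translation, and these models give reasonably good predictions even without any further training. pipeline is a very practical tool provided by Hugging Face, but it is so highly encapsulated that you need to read the source code to understand what happens inside it.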
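To make the hidden processing more concrete, here is a minimal pure-Python sketch of the three stages a pipeline chains together: preprocess, forward and postprocess. It is only an illustration, not the library's implementation: the vocabulary, the weights and the function names are all hypothetical, and a real pipeline uses a trained tokenizer and neural network in their place.

```python
import math

# Hypothetical toy vocabulary and weights; a real model learns these.
VOCAB = {"i": 0, "love": 1, "hate": 2, "you": 3}
# One (negative, positive) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: turn raw text into token ids (unknown words are skipped)."""
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]


def forward(token_ids):
    """Stage 2: produce one raw score (logit) per label."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Stage 3: softmax the logits and report the best label with its probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does when it is called."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("I hate you"))
print(toy_pipeline("I love you"))
```

The returned dictionary has the same label and score keys as a real text classification result, which is why the score always lies between 0 and 1.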
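I. Using pipelines with classic models
1. Text classification
The pipeline can handle text classification, that is, assigning a text to one of a set of predefined classes, as shown below: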
def text_classification_test():
    # Chapter 5 / Text classification
    from transformers import pipeline
    # Local directory that holds the downloaded model files
    model_name_or_path = "L:/20230713_HuggingFaceModel/distilbert-base-uncased-finetuned-sst-2-english"
    classifier = pipeline(task="sentiment-analysis", model=model_name_or_path, framework="pt")
    result = classifier("I hate you")[0]
    print(result)
    result = classifier("I love you")[0]
    print(result)
The output is:
{'label': 'NEGATIVE', 'score': 0.9991129040718079}
{'label': 'POSITIVE', 'score': 0.9998656511306763}
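2. Reading comprehension
The pipeline can handle reading comprehension (extractive question answering): pass the context and the question to the question_answerer object and it returns the corresponding answer.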
def question_answering_test():
    # Chapter 5 / Reading comprehension
    from transformers import pipeline
    # Question answering needs a checkpoint fine-tuned on SQuAD
    model_name_or_path = "L:/20230713_HuggingFaceModel/distilbert-base-cased-distilled-squad"
    question_answerer = pipeline(task="question-answering", model=model_name_or_path, framework="pt")
    context = r"""
    Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the examples/PyTorch/question-answering/run_squad.py script.
    """
    result = question_answerer(
        question="What is extractive question answering?",
        context=context,
    )
    print(result)
    result = question_answerer(
        question="What is a good example of a question answering dataset?",
        context=context,
    )
    print(result)
The output is:
{'score': 0.6149137020111084, 'start': 38, 'end': 99, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.517293393611908, 'start': 156, 'end': 169, 'answer': 'SQuAD dataset'}
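The start and end values are character offsets into the context string, so the answer can be recovered by slicing. The sketch below checks this for the first result. It assumes the context begins with a newline followed by a four-space indent, which is what the recorded offsets imply; answer_span is a hypothetical helper name, not part of the library.

```python
# Opening sentence of the context, including the leading newline and the
# four-space indent that the triple-quoted string picks up inside the function.
context = (
    "\n    Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# First result dictionary, copied from the recorded output.
result = {
    "score": 0.6149137020111084,
    "start": 38,
    "end": 99,
    "answer": "the task of extracting an answer from a text given a question",
}


def answer_span(context, result):
    """Recover the answer by slicing the context with the returned offsets."""
    return context[result["start"]:result["end"]]


print(answer_span(context, result))
```

Because the offsets depend on every character of the context, changing the indentation or line breaks of the string shifts start and end even when the answer text stays the same.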
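3. Fill-mask (cloze)
In a fill-mask task, some words in a sentence are replaced by a mask token and the model predicts the missing words. An example with pipeline follows: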
def fill_mask_test():
    # Hugging Face link: https://huggingface.co/distilroberta-base
    # Chapter 5 / Fill-mask
    from transformers import pipeline
    model_name_or_path = "L:/20230713_HuggingFaceModel/distilroberta-base"
    unmasker = pipeline(task="fill-mask", model=model_name_or_path, framework="pt")  # load the local model
    # unmasker = pipeline("fill-mask")  # load the default online model
    sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'
    print(unmasker(sentence))
The output is:
[{'score': 0.17927540838718414, 'token': 3944, 'token_str': ' tool', 'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.'},
{'score': 0.11349398642778397, 'token': 7208, 'token_str': ' framework', 'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.'},
{'score': 0.05243553966283798, 'token': 5560, 'token_str': ' library', 'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.'},
{'score': 0.034935396164655685, 'token': 8503, 'token_str': ' database', 'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.'},
{'score': 0.028602469712495804, 'token': 17715, 'token_str': ' prototype', 'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.'}]
The model returns five candidates in descending order of score: tool, framework, library, database and prototype.
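The result is a plain list of dictionaries, so it can be post-processed with ordinary Python. The sketch below copies the recorded predictions (leaving out the sequence field) and keeps only candidates above a threshold. The helper top_words and the 0.05 threshold are illustrative assumptions. Note the leading space in each token_str: this tokenizer stores the preceding space as part of the token, so it is stripped here.

```python
# Predictions copied from the recorded output (the 'sequence' field is omitted).
predictions = [
    {"score": 0.17927540838718414, "token": 3944, "token_str": " tool"},
    {"score": 0.11349398642778397, "token": 7208, "token_str": " framework"},
    {"score": 0.05243553966283798, "token": 5560, "token_str": " library"},
    {"score": 0.034935396164655685, "token": 8503, "token_str": " database"},
    {"score": 0.028602469712495804, "token": 17715, "token_str": " prototype"},
]


def top_words(predictions, min_score=0.0):
    """Return (word, score) pairs, best first, dropping low-score candidates."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        (p["token_str"].strip(), round(p["score"], 3))
        for p in ranked
        if p["score"] >= min_score
    ]


print(top_words(predictions, min_score=0.05))
```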
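4. Text generation
Text generation means giving the model the beginning of a sentence and letting it continue writing. Text generation with pipeline looks like this: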
def text_generator_test():
    # Hugging Face link: https://huggingface.co/gpt2
    # Chapter 5 / Text generation
    from transformers import pipeline
    # text_generator = pipeline("text-generation")  # load the default online model
    model_name_or_path = "L:/20230713_HuggingFaceModel/gpt2"
    text_generator = pipeline(task="text-generation", model=model_name_or_path, framework="pt")
    # do_sample=False disables sampling, so the output is deterministic
    result = text_generator("As far as I am concerned, I will", max_length=50, do_sample=False)
    print(result)
The output is:
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
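With do_sample=False the continuation is chosen greedily: at every step the single most likely next token is appended until max_length is reached. The toy sketch below illustrates that idea on words rather than tokens. The probability table NEXT_WORD is entirely made up, and a real language model conditions on the whole prefix rather than only on the last word.

```python
# Hypothetical next-word probabilities; a real language model computes these
# from the whole prefix, not just from the last word.
NEXT_WORD = {
    "I": {"think": 0.6, "will": 0.4},
    "think": {"that": 0.9, "so": 0.1},
    "that": {"the": 0.7, "I": 0.3},
    "the": {"idea": 0.8, "market": 0.2},
    "idea": {"that": 0.6, "is": 0.4},
}


def greedy_generate(prompt, max_length):
    """Extend the prompt one word at a time, always taking the likeliest word."""
    words = prompt.split()
    while len(words) < max_length:
        candidates = NEXT_WORD.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


print(greedy_generate("I", max_length=10))
```

The toy output falls into a loop, and the recorded GPT-2 output shows the same tendency ("I think that the idea" appears twice). Repetition of this kind is a common side effect of always taking the most likely continuation.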
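5. Named entity recognition
Named entity recognition means finding the entities in a piece of text. Named entity recognition with pipeline looks like this: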
def ner_pipe_test():
    # Hugging Face link: https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english
    # Chapter 5 / Named entity recognition
    from transformers import pipeline
    # ner_pipeline = pipeline("ner")  # load the default online model
    model_name_or_path = "L:/20230713_HuggingFaceModel/bert-large-cased-finetuned-conll03-english"
    ner_pipeline = pipeline(task="ner", model=model_name_or_path, framework="pt")
    sequence = """Hugging Face Inc. is a company based in New York City. Its
    headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."""
    # One dictionary is printed per token that belongs to an entity
    for entity in ner_pipeline(sequence):
        print(entity)
The output is:
{'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 83, 'end': 84}
{'entity': 'I-LOC', 'score': 0.95142686, 'index': 20, 'word': '##UM', 'start': 84, 'end': 86}
{'entity': 'I-LOC', 'score': 0.9336589, 'index': 21, 'word': '##BO', 'start': 86, 'end': 88}
{'entity': 'I-LOC', 'score': 0.97616553, 'index': 28, 'word': 'Manhattan', 'start': 118, 'end': 127}
{'entity': 'I-LOC', 'score': 0.9914629, 'index': 29, 'word': 'Bridge', 'start': 128, 'end': 134}
The recognized organization name is Hugging Face Inc, and the recognized location names are New York City, DUMBO and Manhattan Bridge.
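The raw output is per token, so a word such as DUMBO arrives as the word pieces D, ##UM and ##BO. A minimal sketch of how those pieces can be merged back into whole entities is given below, using the recorded tokens with the scores left out. The helper group_entities is hypothetical and deliberately simple. The pipeline also has an aggregation_strategy parameter for grouping tokens, which is likely the better choice in real code.

```python
# Token-level predictions copied from the recorded output (scores omitted).
tokens = [
    {"entity": "I-ORG", "index": 1, "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "index": 3, "word": "Face", "start": 8, "end": 12},
    {"entity": "I-ORG", "index": 4, "word": "Inc", "start": 13, "end": 16},
    {"entity": "I-LOC", "index": 11, "word": "New", "start": 40, "end": 43},
    {"entity": "I-LOC", "index": 12, "word": "York", "start": 44, "end": 48},
    {"entity": "I-LOC", "index": 13, "word": "City", "start": 49, "end": 53},
    {"entity": "I-LOC", "index": 19, "word": "D", "start": 83, "end": 84},
    {"entity": "I-LOC", "index": 20, "word": "##UM", "start": 84, "end": 86},
    {"entity": "I-LOC", "index": 21, "word": "##BO", "start": 86, "end": 88},
    {"entity": "I-LOC", "index": 28, "word": "Manhattan", "start": 118, "end": 127},
    {"entity": "I-LOC", "index": 29, "word": "Bridge", "start": 128, "end": 134},
]


def group_entities(tokens):
    """Merge consecutive tokens with the same label into (type, text) pairs."""
    groups = []
    previous = None
    for tok in tokens:
        # Drop the "##" marker that flags a continuation word piece.
        piece = tok["word"][2:] if tok["word"].startswith("##") else tok["word"]
        same_entity = (
            previous is not None
            and tok["entity"] == previous["entity"]
            and tok["index"] == previous["index"] + 1
        )
        if same_entity:
            # Touching character offsets mean the pieces belong to one word.
            joiner = "" if tok["start"] == previous["end"] else " "
            groups[-1][1] += joiner + piece
        else:
            groups.append([tok["entity"].split("-")[-1], piece])
        previous = tok
    return [tuple(group) for group in groups]


for entity_type, text in group_entities(tokens):
    print(entity_type, text)
```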
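6. Text summarization
Text summarization means extracting the core content of a long text. The pipeline call below produces a summary of between 30 and 130 tokens: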
def summarization_test():
    # Hugging Face link: https://huggingface.co/sshleifer/distilbart-cnn-12-6
    # Chapter 5 / Text summarization
    from transformers import pipeline
    # summarizer = pipeline("summarization")  # load the default online model
    model_name_or_path = "L:/20230713_HuggingFaceModel/distilbart-cnn-12-6"
    summarizer = pipeline(task="summarization", model=model_name_or_path, framework="pt")
    ARTICLE = """New York (CNN) When Liana Barrientos was 23 years old, she got married in Westchester County,
New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times,
sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx.
In an application for a marriage license, she stated it was her "first and only" marriage. Barrientos, now 39,
is facing two criminal counts of "offering a false instrument for filing in the first degree,"
referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit,
said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time,
she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s Investigation Division.
Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18. """
    # max_length and min_length bound the length of the summary in tokens
    result = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
    print(result)
The output is:
[{'summary_text': ' Liana Barrientos pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" Prosecutors say the marriages were part of an immigration scam . She is believed to still be married to four men, and at one time, she was married to eight at once .'}]
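The summary text comes back with a leading space and with spaces before some full stops. If the text is going to be displayed, a small cleanup step may help. The sketch below is one possible way to do that with a regular expression; tidy_summary is a hypothetical helper, not something the library provides.

```python
import re

# Result copied from the recorded output.
result = [{
    "summary_text": (
        ' Liana Barrientos pleaded not guilty to two counts of "offering a false '
        'instrument for filing in the first degree" Prosecutors say the marriages '
        "were part of an immigration scam . She is believed to still be married to "
        "four men, and at one time, she was married to eight at once ."
    )
}]


def tidy_summary(result):
    """Strip outer whitespace and remove spaces in front of punctuation."""
    text = result[0]["summary_text"].strip()
    return re.sub(r"\s+([.,!?])", r"\1", text)


print(tidy_summary(result))
```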
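7. English-to-German translation
The t5-base model is used through pipeline to translate English into German. For translation, this model only supports English to German, French or Romanian. The code is as follows: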
def translator_test():
    # Hugging Face link: https://huggingface.co/t5-base
    # Chapter 5 / Translation
    from transformers import pipeline
    # translator = pipeline("translation_en_to_de")  # English to German with the default online model
    model_name_or_path = "L:/20230713_HuggingFaceModel/t5-base"
    # The task must be a translation task; "summarization" would return a summary instead
    translator = pipeline(task="translation_en_to_de", model=model_name_or_path, framework="pt")
    sentence = "Hugging Face is a technology company based in New York and Paris"
    result = translator(sentence, max_length=40)
    print(result)
The output is:
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
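The language restriction comes from how T5 was trained: each task is selected by a text prefix placed in front of the input, and the t5-base checkpoint knows translation prefixes only for German, French and Romanian. The sketch below shows roughly what the input looks like after the prefix is added. build_t5_input is a hypothetical helper for illustration; in practice the pipeline adds the prefix for you.

```python
# Translation task prefixes that the t5-base checkpoint was trained with.
T5_PREFIXES = {
    ("en", "de"): "translate English to German: ",
    ("en", "fr"): "translate English to French: ",
    ("en", "ro"): "translate English to Romanian: ",
}


def build_t5_input(sentence, src="en", tgt="de"):
    """Prepend the task prefix, or fail for an unsupported language pair."""
    try:
        prefix = T5_PREFIXES[(src, tgt)]
    except KeyError:
        raise ValueError(f"t5-base has no translation prefix for {src}->{tgt}") from None
    return prefix + sentence


print(build_t5_input("Hugging Face is a technology company based in New York and Paris"))
```

A pair such as Chinese to English is not in the table, which is why a different model is needed for it.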
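II. Swapping in a different model for a task
1. Chinese-to-English translation with a different model
Replace the model and the tokenizer to translate Chinese into English, as shown below: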
def translator1_test():
    # Chapter 5 / Swapping in a model for Chinese-to-English translation
    from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
    # This model requires the sentencepiece package to be installed
    # Loading from the Hub instead of a local directory:
    # tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
    # model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
    # translator = pipeline(task="translation_zh_to_en", model=model, tokenizer=tokenizer)
    model_name_or_path = "L:/20230713_HuggingFaceModel/opus-mt-zh-en"
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    translator = pipeline(task="translation_zh_to_en", model=model, tokenizer=tokenizer, framework="pt")
    sentence = "我叫萨拉,我住在伦敦。"
    result = translator(sentence, max_length=20)
    print(result)
The output is:
[{'translation_text': 'My name is Sarah, and I live in London.'}]
2. English-to-Chinese translation with a different model
Replace the model and the tokenizer to translate English into Chinese, as shown below:
# Chapter 5 / Swapping in a model for English-to-Chinese translation
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
# This model requires the sentencepiece package to be installed
# Loading from the Hub instead of a local directory:
# tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
# model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
# translator = pipeline(task="translation_en_to_zh", model=model, tokenizer=tokenizer)
model_name_or_path = "L:/20230713_HuggingFaceModel/opus-mt-en-zh"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# The task name states the direction: English to Chinese
translator = pipeline(task="translation_en_to_zh", model=model, tokenizer=tokenizer, framework="pt")
sentence = "My name is Sarah and I live in London"
result = translator(sentence, max_length=20)
print(result)
The output is:
[{'translation_text': '我叫莎拉,我住伦敦'}]
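Both translation directions follow the same naming pattern, so the model id and the task string can be derived from the language pair instead of being typed twice. The sketch below is a small convenience under that assumption; opus_mt_names is a hypothetical helper, and not every language pair necessarily has a published Helsinki-NLP model.

```python
def opus_mt_names(src, tgt):
    """Build the Hub model id and the pipeline task string for one language pair."""
    model_id = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    task = f"translation_{src}_to_{tgt}"
    return model_id, task


for pair in [("zh", "en"), ("en", "zh")]:
    model_id, task = opus_mt_names(*pair)
    print(model_id, task)

# Usage with transformers (needs the model files, so it is left commented out):
# from transformers import pipeline
# model_id, task = opus_mt_names("en", "zh")
# translator = pipeline(task=task, model=model_id)
```

Deriving both strings from one pair keeps the task name and the model direction from drifting apart when the code is copied for a new language.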
References:
[1]https://huggingface.co/models
[2]https://huggingface.co/docs/transformers/installation#offline-mode
[3]https://huggingface.co/distilbert-base-cased-distilled-squad
[4]https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
[5]https://huggingface.co/distilroberta-base
[6]https://github.com/ai408/nlp-daily-record/tree/main/20230625_HuggingFace自然语言处理详解