Official documentation: https://huggingface.co/docs/transformers/transformers_agents
pip install "transformers>=4.29.1"
pip install openai
pip install diffusers
The overall approach is almost identical to LangChain, with a couple of extra highlights:
- It builds a few-shot prompt and generates inference code directly; the user can choose between automatic execution and getting the code back to run themselves. Returning the code is very useful when you only want to understand the process, because the tool models are often too large to run locally.
- It supports a remote mode: some tool models are too large (or too awkward) to run locally, in which case remote mode calls hosted API endpoints instead (a short sketch follows below).
The author provides a diagram (see the official documentation).
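As a quick illustration of the remote mode, here is a minimal sketch (it assumes an OpenAI API key is available via the OPENAI_API_KEY environment variable; which hosted endpoint backs each tool is decided by the library):

```python
from transformers import OpenAiAgent

agent = OpenAiAgent(model="text-davinci-003")

long_text = "Transformers Agents expose a natural-language API on top of transformers ..."

# remote=True asks the agent to run heavy tools (here the summarizer and the
# text-to-speech reader) through hosted inference endpoints instead of
# downloading and running the checkpoints locally.
audio = agent.run("Summarize the `text` and read it out loud.", text=long_text, remote=True)
```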
Let's walk through a simple example; the code is very straightforward:
from transformers import OpenAiAgent
agent = OpenAiAgent(model="text-davinci-003")
agent.run("Draw me a picture of rivers and lakes.", return_code=True)
==Explanation from the agent==
I will use the following tool: `image_generator` to generate an image according to the prompt.
==Code generated by the agent==
image = image_generator(prompt="Draw me a picture of rivers and lakes.")
As you can see, the code for image generation was produced automatically.
Agent is the core class; it determines which LLM does the orchestration. Two implementations are currently provided, OpenAiAgent and HfAgent, and the official docs note that OpenAI currently gives the best results.
Reading the Agent base class code is enough to understand everything the subclasses can do.
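If you prefer not to use OpenAI, a minimal sketch of the HfAgent alternative looks like this (the StarCoder inference endpoint below is the one suggested in the official documentation; any compatible text-generation endpoint should work):

```python
from transformers import HfAgent

# Drive the agent with an open LLM served through the Hugging Face Inference API.
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
agent.run("Draw me a picture of rivers and lakes.", return_code=True)
```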
class Agent:
    # Two modes are supported, one-shot run and chat; they differ in the prompt
    # template used, and chat additionally keeps a history of previous turns.
    def __init__(self, chat_prompt_template=None, run_prompt_template=None, additional_tools=None):
        # Register the default tools
        _setup_default_tools()
        # Each mode uses its own prompt template
        self.chat_prompt_template = CHAT_MESSAGE_PROMPT if chat_prompt_template is None else chat_prompt_template
        self.run_prompt_template = RUN_PROMPT_TEMPLATE if run_prompt_template is None else run_prompt_template
        # Prepare the toolbox
        self._toolbox = HUGGINGFACE_DEFAULT_TOOLS.copy()
        # Users can register their own tools
        if additional_tools is not None:
            if isinstance(additional_tools, (list, tuple)):
                additional_tools = {t.name: t for t in additional_tools}
            elif not isinstance(additional_tools, dict):
                additional_tools = {additional_tools.name: additional_tools}

            replacements = {name: tool for name, tool in additional_tools.items() if name in HUGGINGFACE_DEFAULT_TOOLS}
            self._toolbox.update(additional_tools)
            if len(replacements) > 1:
                names = "\n".join([f"- {n}: {t}" for n, t in replacements.items()])
                logger.warn(
                    f"The following tools have been replaced by the ones provided in `additional_tools`:\n{names}."
                )
            elif len(replacements) == 1:
                name = list(replacements.keys())[0]
                logger.warn(f"{name} has been replaced by {replacements[name]} as provided in `additional_tools`.")

        self.prepare_for_new_chat()
    # Assemble the prompt
    def format_prompt(self, task, chat_mode=False):
        description = "\n".join([f"- {name}: {tool.description}" for name, tool in self.toolbox.items()])
        if chat_mode:
            if self.chat_history is None:
                prompt = CHAT_PROMPT_TEMPLATE.replace("<<all_tools>>", description)
            else:
                prompt = self.chat_history
            prompt += CHAT_MESSAGE_PROMPT.replace("<<task>>", task)
        else:
            prompt = self.run_prompt_template.replace("<<all_tools>>", description)
            prompt = prompt.replace("<<prompt>>", task)
        return prompt
    # Chat mode
    def chat(self, task, *, return_code=False, remote=False, **kwargs):
        prompt = self.format_prompt(task, chat_mode=True)
        result = self.generate_one(prompt, stop=["Human:", "====="])
        self.chat_history = prompt + result.strip() + "\n"
        explanation, code = clean_code_for_chat(result)

        print(f"==Explanation from the agent==\n{explanation}")

        if code is not None:
            print(f"\n\n==Code generated by the agent==\n{code}")
            if not return_code:
                print("\n\n==Result==")
                self.cached_tools = resolve_tools(code, self.toolbox, remote=remote, cached_tools=self.cached_tools)
                self.chat_state.update(kwargs)
                return evaluate(code, self.cached_tools, self.chat_state, chat_mode=True)
            else:
                tool_code = get_tool_creation_code(code, self.toolbox, remote=remote)
                return f"{tool_code}\n{code}"
    # One-shot run mode
    def run(self, task, *, return_code=False, remote=False, **kwargs):
        # Assemble the prompt
        prompt = self.format_prompt(task)
        # Ask the LLM to generate code
        result = self.generate_one(prompt, stop=["Task:"])
        # Split the result into explanation and code
        explanation, code = clean_code_for_run(result)

        print(f"==Explanation from the agent==\n{explanation}")
        print(f"\n\n==Code generated by the agent==\n{code}")

        if not return_code:
            # Execute the generated code automatically
            print("\n\n==Result==")
            self.cached_tools = resolve_tools(code, self.toolbox, remote=remote, cached_tools=self.cached_tools)
            return evaluate(code, self.cached_tools, state=kwargs.copy())
        else:
            # Do not execute; return the code instead
            tool_code = get_tool_creation_code(code, self.toolbox, remote=remote)
            return f"{tool_code}\n{code}"
As you can see, the code is quite simple.
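To make the `additional_tools` branch concrete, here is a minimal sketch of registering a custom tool (the Tool subclass pattern follows the custom-tools guide; the tool itself is invented purely for illustration):

```python
from transformers import OpenAiAgent, Tool

class TextReverserTool(Tool):
    # Hypothetical tool, used only to illustrate the registration mechanism
    name = "text_reverser"
    description = (
        "This is a tool that reverses an English text. It takes an input named `text` "
        "and returns the reversed text."
    )
    inputs = ["text"]
    outputs = ["text"]

    def __call__(self, text: str):
        return text[::-1]

# The custom tool is merged into the default toolbox; reusing an existing tool
# name would replace the default tool and trigger the warnings shown above.
agent = OpenAiAgent(model="text-davinci-003", additional_tools=[TextReverserTool()])
agent.run("Reverse the text stored in the variable `text`.", text="rivers and lakes", return_code=True)
```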
Below, we print the details for the example above to make it easier to follow.
from transformers import OpenAiAgent
agent = OpenAiAgent(model="text-davinci-003")
agent.run("Draw me a picture of rivers and lakes.", return_code=True)
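To inspect the exact prompt that run() sends to the LLM, you can also call the format_prompt method shown above directly, for example:

```python
# chat_mode defaults to False, so this uses the run template with the tool
# descriptions substituted for <<all_tools>> and the task for <<prompt>>.
print(agent.format_prompt("Draw me a picture of rivers and lakes."))
```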
The generated prompt:
I will ask you to perform a task, your job is to come up with a series of simple commands in Python that will perform the task.
To help you, I will give you access to a set of tools that you can use. Each tool is a Python function and has a description explaining the task it performs, the inputs it expects and the outputs it returns.
You should first explain which tool you will use to perform the task and for what reason, then write the code in Python.
Each instruction in Python should be a simple assignment. You can print intermediate results if it makes sense to do so.
Tools:
- document_qa: This is a tool that answers a question about an document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question.
- image_captioner: This is a tool that generates a description of an image. It takes an input named `image` which should be the image to caption, and returns a text that contains the description in English.
- image_qa: This is a tool that answers a question about an image. It takes an input named `image` which should be the image containing the information, as well as a `question` which should be the question in English. It returns a text that is the answer to the question.
- image_segmenter: This is a tool that creates a segmentation mask of an image according to a label. It cannot create an image.It takes two arguments named `image` which should be the original image, and `label` which should be a text describing the elements what should be identified in the segmentation mask. The tool returns the mask.
- transcriber: This is a tool that transcribes an audio into text. It takes an input named `audio` and returns the transcribed text.
- summarizer: This is a tool that summarizes an English text. It takes an input `text` containing the text to summarize, and returns a summary of the text.
- text_classifier: This is a tool that classifies an English text using provided labels. It takes two inputs: `text`, which should be the text to classify, and `labels`, which should be the list of labels to use for classification. It returns the most likely label in the list of provided `labels` for the input text.
- text_qa: This is a tool that answers questions related to a text. It takes two arguments named `text`, which is the text where to find the answer, and `question`, which is the question, and returns the answer to the question.
- text_reader: This is a tool that reads an English text out loud. It takes an input named `text` which should contain the text to read (in English) and returns a waveform object containing the sound.
- translator: This is a tool that translates text from a language to another. It takes three inputs: `text`, which should be the text to translate, `src_lang`, which should be the language of the text to translate and `tgt_lang`, which should be the language for the desired ouput language. Both `src_lang` and `tgt_lang` are written in plain English, such as 'Romanian', or 'Albanian'. It returns the text translated in `tgt_lang`.
- image_transformer: This is a tool that transforms an image according to a prompt. It takes two inputs: `image`, which should be the image to transform, and `prompt`, which should be the prompt to use to change it. The prompt should only contain descriptive adjectives, as if completing the prompt of the original image. It returns the modified image.
- text_downloader: This is a tool that downloads a file from a `url`. It takes the `url` as input, and returns the text contained in the file.
- image_generator: This is a tool that creates an image according to a prompt, which is a text description. It takes an input named `prompt` which contains the image description and outputs an image.
- video_generator: This is a tool that creates a video according to a text description. It takes an input named `prompt` which contains the image description, as well as an optional input `seconds` which will be the duration of the video. The default is of two seconds. The tool outputs a video object.
Task: "Answer the question in the variable `question` about the image stored in the variable `image`. The question is in French."
I will use the following tools: `translator` to translate the question into English and then `image_qa` to answer the question on the input image.
Answer:
```py
translated_question = translator(question=question, src_lang="French", tgt_lang="English")
print(f"The translated question is {translated_question}.")
answer = image_qa(image=image, question=translated_question)
print(f"The answer is {answer}")
```
Task: "Identify the oldest person in the `document` and create an image showcasing the result."
I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer.
Answer:
```py
answer = document_qa(document, question="What is the oldest person?")
print(f"The answer is {answer}.")
image = image_generator(answer)
```
Task: "Generate an image using the text given in the variable `caption`."
I will use the following tool: `image_generator` to generate an image.
Answer:
```py
image = image_generator(prompt=caption)
```
Task: "Summarize the text given in the variable `text` and read it out loud."
I will use the following tools: `summarizer` to create a summary of the input text, then `text_reader` to read it out loud.
Answer:
```py
summarized_text = summarizer(text)
print(f"Summary: {summarized_text}")
audio_summary = text_reader(summarized_text)
```
Task: "Answer the question in the variable `question` about the text in the variable `text`. Use the answer to generate an image."
I will use the following tools: `text_qa` to create the answer, then `image_generator` to generate an image according to the answer.
Answer:
```py
answer = text_qa(text=text, question=question)
print(f"The answer is {answer}.")
image = image_generator(answer)
```
Task: "Caption the following `image`."
I will use the following tool: `image_captioner` to generate a caption for the image.
Answer:
```py
caption = image_captioner(image)
```
Task: "Draw me a picture of rivers and lakes."
I will use the following
The run prompt deliberately ends mid-sentence here to prime the model. The LLM then returns, continuing that sentence:
tool: `image_generator` to generate an image according to the prompt.
Answer:
```py
image = image_generator(prompt="Draw me a picture of rivers and lakes.")
```
clean_code_for_run is then called to split the result into an explanation and the code; the explanation part is:
I will use the following tool: `image_generator` to generate an image according to the prompt.
get_tool_creation_code is then called to build the tool-loading code, which is concatenated with the generated code below and returned:
image = image_generator(prompt="Draw me a picture of rivers and lakes.")
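Because return_code=True only returns a code string, you can inspect it and execute it yourself when you are ready (a simple sketch):

```python
# The returned string contains the tool-loading code followed by the generated call.
code = agent.run("Draw me a picture of rivers and lakes.", return_code=True)
print(code)  # inspect the code first
exec(code)   # then execute it manually; this will download/run the tools
```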
Now for a more complex task:
agent.run("Draw me a picture of the sea then transform the picture to add an island", return_code=True)
The LLM returns:
Answer:
```py
image = image_generator(prompt="Draw me a picture of the sea")
transformed_image = image_transformer(image, prompt="Add an island")
```
==Explanation from the agent==
I will use the following tools: `image_generator` to generate an image of the sea, then `image_transformer` to add an island to the image.
==Code generated by the agent==
image = image_generator(prompt="Draw me a picture of the sea")
transformed_image = image_transformer(image, prompt="Add an island")
If the agent cannot complete a complex task in one go, you can run it several times yourself and pass the intermediate results along, for example:
picture = agent.run("Generate a picture of rivers and lakes.")
updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture)
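Alternatively, chat mode keeps the conversation history between turns, so a follow-up instruction can build on the previous result (a minimal sketch, mirroring the example from the official docs):

```python
# chat() keeps history and execution state across turns.
agent.chat("Generate a picture of rivers and lakes.")
agent.chat("Transform the picture so that there is a rock in it.")
```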
Title: InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
Official repository: https://github.com/OpenGVLab/InternGPT
Paper: https://arxiv.org/pdf/2305.05662.pdf
InternGPT (iGPT) / InternChat (iChat) is a pointing-language-driven visual interaction system that lets you interact with ChatGPT through a pointing device, by clicking, dragging and drawing. The name internGPT stands for interaction, nonverbal and ChatGPT. Unlike existing interactive systems that rely on pure language, incorporating pointing instructions lets iGPT significantly improve the efficiency of communication between user and chatbot, as well as the chatbot's accuracy on vision-centric tasks, especially in complex visual scenes. In addition, iGPT uses an auxiliary control mechanism to improve the controllability of the LLM, and fine-tunes a large vision-language model, Husky, for high-quality multimodal dialogue (reaching 93.89% of GPT-4 quality in a ChatGPT-3.5-turbo evaluation).
In short, it adds user interactions such as clicking and drawing boxes, which makes it much quicker to complete complex tasks that are hard to describe with language alone.
Looking at the code, it appears to be built on VisualGPT, with some extra models added that convert the user's interaction points into a mask; the image and the mask are then passed together to the downstream models, making the computation more efficient.
Gorilla: Large Language Model Connected with Massive APIs
finetune
https://arxiv.org/pdf/2305.18752.pdf
https://github.com/StevenGrove/GPT4Tools
Uses a low-cost language model in place of ChatGPT for tool scheduling. The key lies in how the instruction dataset is constructed; the approach the author adopts is …
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Controllable Text-to-Image Generation with GPT-4