feat: Add deep research use case (Python) #482

Merged on Jan 22, 2025 (27 commits)
5 changes: 5 additions & 0 deletions .changeset/yellow-oranges-play.md
@@ -0,0 +1,5 @@
---
"create-llama": patch
---

Add deep research over own documents use case (Python)
10 changes: 10 additions & 0 deletions helpers/datasources.ts
@@ -42,6 +42,16 @@ export const EXAMPLE_GDPR: TemplateDataSource = {
},
};

export const AI_REPORTS: TemplateDataSource = {
type: "file",
config: {
url: new URL(
"https://www.europarl.europa.eu/RegData/etudes/ATAG/2024/760392/EPRS_ATA(2024)760392_EN.pdf",
),
filename: "EPRS_ATA_2024_760392_EN.pdf",
},
};

export function getDataSources(
files?: string,
exampleFile?: boolean,
1 change: 1 addition & 0 deletions helpers/types.ts
@@ -52,6 +52,7 @@ export type TemplateObservability = "none" | "traceloop" | "llamatrace";
export type TemplateUseCase =
| "financial_report"
| "blog"
| "deep_research"
| "form_filling"
| "extractor"
| "contract_review";
2 changes: 1 addition & 1 deletion package.json
@@ -43,7 +43,7 @@
"@types/cross-spawn": "6.0.0",
"@types/fs-extra": "11.0.4",
"@types/node": "^20.11.7",
"@types/prompts": "2.0.1",
"@types/prompts": "2.4.2",
"@types/tar": "6.1.5",
"@types/validate-npm-package-name": "3.0.0",
"async-retry": "1.3.1",
13 changes: 8 additions & 5 deletions pnpm-lock.yaml


64 changes: 55 additions & 9 deletions questions/simple.ts
@@ -1,5 +1,6 @@
import prompts from "prompts";
import {
AI_REPORTS,
EXAMPLE_10K_SEC_FILES,
EXAMPLE_FILE,
EXAMPLE_GDPR,
@@ -17,7 +18,8 @@ type AppType =
| "form_filling"
| "extractor"
| "contract_review"
| "data_scientist";
| "data_scientist"
| "deep_research";

type SimpleAnswers = {
appType: AppType;
@@ -34,22 +36,55 @@ export const askSimpleQuestions = async (
type: "select",
name: "appType",
message: "What app do you want to build?",
hint: "🤖: Agent, 🔀: Workflow",
choices: [
{ title: "Agentic RAG", value: "rag" },
{ title: "Data Scientist", value: "data_scientist" },
{
title: "Financial Report Generator (using Workflows)",
title: "🤖 Agentic RAG",
value: "rag",
description:
"Chatbot that answers questions based on provided documents.",
},
{
title: "🤖 Data Scientist",
value: "data_scientist",
description:
"Agent that analyzes data and generates visualizations by using a code interpreter.",
},
{
title: "🤖 Code Artifact Agent",
value: "code_artifact",
description:
"Agent that writes code, runs it in a sandbox, and shows the output in the chat UI.",
},
{
title: "🤖 Information Extractor",
value: "extractor",
description:
"Extracts information from documents and returns it as a structured JSON object.",
},
{
title: "🔀 Financial Report Generator",
value: "financial_report_agent",
description:
"Generates a financial report by analyzing the provided 10-K SEC data. Uses a code interpreter to create charts or to conduct further analysis.",
},
{
title: "Form Filler (using Workflows)",
title: "🔀 Financial 10k SEC Form Filler",
value: "form_filling",
description:
"Extracts information from 10k SEC data and uses it to fill out a CSV form.",
},
{ title: "Code Artifact Agent", value: "code_artifact" },
{ title: "Information Extractor", value: "extractor" },
{
title: "Contract Review (using Workflows)",
title: "🔀 Contract Reviewer",
value: "contract_review",
description:
"Extracts and reviews contracts to ensure compliance with GDPR regulations",
},
{
title: "🔀 Deep Researcher",
value: "deep_research",
description:
"Researches and analyzes provided documents from multiple perspectives, generating a comprehensive report with citations to support key findings and insights.",
},
],
},
@@ -60,7 +95,11 @@
let llamaCloudKey = args.llamaCloudKey;
let useLlamaCloud = false;

if (appType !== "extractor" && appType !== "contract_review") {
if (
appType !== "extractor" &&
appType !== "contract_review" &&
appType !== "deep_research"
) {
const { language: newLanguage } = await prompts(
{
type: "select",
@@ -188,6 +227,13 @@
frontend: false,
dataSources: [EXAMPLE_GDPR],
},
deep_research: {
template: "multiagent",
useCase: "deep_research",
tools: [],
frontend: true,
dataSources: [AI_REPORTS],
},
};
const results = lookup[answers.appType];
return {
@@ -0,0 +1,47 @@
This is a [LlamaIndex](https://www.llamaindex.ai/) multi-agent project using [Workflows](https://docs.llamaindex.ai/en/stable/understanding/workflows/).

## Getting Started

First, set up the environment with Poetry:

> **_Note:_** This step is not needed if you are using the dev-container.

```shell
poetry install
```

Then check the parameters that have been pre-configured in the `.env` file in this directory (e.g., you might need to configure an `OPENAI_API_KEY` if you're using OpenAI as the model provider).
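
For reference, a minimal `.env` might look like the following sketch (the key value is a placeholder, not a real credential):

```shell
# .env: model provider settings (assuming OpenAI as the provider)
OPENAI_API_KEY=sk-your-key-here
```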
Second, generate the embeddings of the documents in the `./data` directory:

```shell
poetry run generate
```

Third, run the development server:

```shell
poetry run dev
```

## Use Case: Deep Research over Your Own Documents

The workflow performs deep research by retrieving documents from the [data](./data) directory and analyzing them from multiple perspectives. The project includes a sample PDF about AI investment in 2024 to help you get started. You can also add your own documents by placing them in the data directory and running the generate script again to index them, as shown below.
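
For example, to index an additional document of your own (the file name below is just an illustration):

```shell
# copy your document into the data directory, then re-run indexing
cp ~/my-report.pdf ./data/
poetry run generate
```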

After starting the server, go to [http://localhost:8000](http://localhost:8000) and send the agent a research request, e.g., "AI investment in 2024".

To update the workflow, you can edit the [deep_research.py](./app/workflows/deep_research.py) file.

By default, the workflow retrieves 10 results from your documents. To customize the amount of information covered in the answer, you can adjust the `TOP_K` environment variable in the `.env` file. A higher value will retrieve more results from your documents, potentially providing more comprehensive answers.
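
For example, to retrieve 20 results instead of the default 10, you could set (a sketch of the relevant `.env` line):

```shell
# .env: number of results retrieved from your documents (default is 10)
TOP_K=20
```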

## Deployments

For production deployments, check the [DEPLOY.md](DEPLOY.md) file.

## Learn More

To learn more about LlamaIndex, take a look at the following resources:

- [LlamaIndex Documentation](https://docs.llamaindex.ai) - learn about LlamaIndex.
- [Workflows Introduction](https://docs.llamaindex.ai/en/stable/understanding/workflows/) - learn about LlamaIndex workflows.
You can check out [the LlamaIndex GitHub repository](https://github.com/run-llama/llama_index) - your feedback and contributions are welcome!
@@ -0,0 +1,3 @@
from .deep_research import create_workflow

__all__ = ["create_workflow"]
@@ -0,0 +1,158 @@
from typing import List, Literal, Optional

from llama_index.core.base.llms.types import (
CompletionResponse,
CompletionResponseAsyncGen,
)
from llama_index.core.memory.simple_composable_memory import SimpleComposableMemory
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import MetadataMode, Node, NodeWithScore
from llama_index.core.settings import Settings
from pydantic import BaseModel, Field


class AnalysisDecision(BaseModel):
decision: Literal["research", "write", "cancel"] = Field(
description="Whether to continue research, write a report, or cancel the research after several retries"
)
research_questions: Optional[List[str]] = Field(
description="Questions to research if continuing research. Maximum 3 questions. Set to null or empty if writing a report.",
default_factory=list,
)
cancel_reason: Optional[str] = Field(
description="The reason for cancellation if the decision is to cancel research.",
default=None,
)


async def plan_research(
memory: SimpleComposableMemory,
context_nodes: List[Node],
user_request: str,
) -> AnalysisDecision:
analyze_prompt = PromptTemplate(
"""
You are a professor who is guiding a researcher to research a specific request/problem.
Your task is to decide on a research plan for the researcher.
The possible actions are:
+ Provide a list of questions for the researcher to investigate, with the purpose of clarifying the request.
+ Write a report if the researcher has already gathered enough research on the topic and can resolve the initial request.
+ Cancel the research if most of the answers from researchers indicate there is insufficient information to research the request. Do not attempt more than 3 research iterations or too many questions.
The workflow should be:
+ Always begin by providing some initial questions for the researcher to investigate.
+ Analyze the provided answers against the initial topic/request. If the answers are insufficient to resolve the initial request, provide additional questions for the researcher to investigate.
+ If the answers are sufficient to resolve the initial request, instruct the researcher to write a report.
<User request>
{user_request}
</User request>

<Collected information>
{context_str}
</Collected information>

<Conversation context>
{conversation_context}
</Conversation context>
"""
)
conversation_context = "\n".join(
[f"{message.role}: {message.content}" for message in memory.get_all()]
)
context_str = "\n".join(
[node.get_content(metadata_mode=MetadataMode.LLM) for node in context_nodes]
)
res = await Settings.llm.astructured_predict(
output_cls=AnalysisDecision,
prompt=analyze_prompt,
user_request=user_request,
context_str=context_str,
conversation_context=conversation_context,
)
return res


async def research(
question: str,
context_nodes: List[NodeWithScore],
) -> str:
prompt = """
You are a researcher who is in the process of answering the question.
The purpose is to answer the question based on the collected information, without using prior knowledge or making up any new information.
Always add citations to the sentence/point/paragraph using the id of the provided content.
The citation should follow this format: [citation:id]() where id is the id of the content.

E.g:
If we have a context like this:
<Citation id='abc-xyz'>
Baby llama is called cria
</Citation id='abc-xyz'>

And your answer uses the content, then the citation should be:
- Baby llama is called cria [citation:abc-xyz]()

Here is the provided context for the question:
<Collected information>
{context_str}
</Collected information>

No prior knowledge, just use the provided context to answer the question: {question}
"""
context_str = "\n".join(
[_get_text_node_content_for_citation(node) for node in context_nodes]
)
res = await Settings.llm.acomplete(
prompt=prompt.format(question=question, context_str=context_str),
)
return res.text


async def write_report(
memory: SimpleComposableMemory,
user_request: str,
stream: bool = False,
) -> CompletionResponse | CompletionResponseAsyncGen:
report_prompt = """
You are a researcher writing a report based on a user request and the research context.
You have researched various perspectives related to the user request.
The report should provide a comprehensive outline covering all important points from the researched perspectives.
Create a well-structured outline for the research report that covers all the answers.

# IMPORTANT when writing in markdown format:
+ Use tables or figures where appropriate to enhance presentation.
+ Preserve all citation syntax (the `[citation:id]()` parts in the provided context). Keep these citations in the final report - no separate reference section is needed.
+ Do not add links, a table of contents, or a references section to the report.

<User request>
{user_request}
</User request>

<Research context>
{research_context}
</Research context>

Now, write a report addressing the user request based on the research provided following the format and guidelines above.
"""
research_context = "\n".join(
[f"{message.role}: {message.content}" for message in memory.get_all()]
)

llm_complete_func = (
Settings.llm.astream_complete if stream else Settings.llm.acomplete
)

res = await llm_complete_func(
prompt=report_prompt.format(
user_request=user_request,
research_context=research_context,
),
)
return res


def _get_text_node_content_for_citation(node: NodeWithScore) -> str:
"""
Construct node content for LLM with citation flag.
"""
node_id = node.node.node_id
content = f"<Citation id='{node_id}'>\n{node.get_content(metadata_mode=MetadataMode.LLM)}</Citation id='{node_id}'>"
return content