
KeyError: 'token_count' when calling chat API with RAG toolgroup #887

Closed
1 of 2 tasks
abhishek-syno opened this issue Jan 28, 2025 · 13 comments

@abhishek-syno

abhishek-syno commented Jan 28, 2025

System Info

Description

Encountered a KeyError when using the chat API with the RAG toolgroup. The error is raised when the RAG tool runtime tries to read the 'token_count' key from a chunk's metadata.

Python version: 3.13.1
llama_stack: 0.1
llama_stack_client: 0.1

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Steps to Reproduce
1. Created an agent with the RAG toolgroup
2. Initialized a session
3. Attempted to make a chat request (sketched below, after the agent config)

from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="meta-llama/Llama-3.2-1B-Instruct",
    instructions="You are a helpful assistant...",
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {"vector_db_ids": ["hrms_test_bank_test_1"]},
        }
    ],
    tool_choice="auto",
    tool_prompt_format="json",
    enable_session_persistence=True,
    stream=True
)
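
For completeness, a rough sketch of the turn request that triggers the error, assuming a client pointed at the running distribution (the host, port, and prompt are illustrative; my full script is in a later comment):

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger

client = LlamaStackClient(base_url="http://localhost:6003")  # port is illustrative

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# The turn below is what produces the 500 / KeyError shown in the logs.
response = agent.create_turn(
    messages=[{"role": "user", "content": "what is office hours?"}],
    session_id=session_id,
)
for log in EventLogger().log(response):
    log.print()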

Error logs

Error Message - Docker Ollama Distribution

2025-01-28 11:25:55 INFO:     172.17.0.1:38390 - "POST /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn HTTP/1.1" 200 OK
2025-01-28 11:25:55 05:55:55.868 [START] /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn
2025-01-28 11:25:55 05:55:55.899 [START] create_and_execute_turn
2025-01-28 11:25:55 05:55:55.910 [START] query_from_memory
2025-01-28 11:25:55 
Batches:   0%|                                                                                             | 0/1 [00:00<?, ?it/s]
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.05it/s]
2025-01-28 11:25:55 Traceback (most recent call last):
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 157, in sse_generator
2025-01-28 11:25:55     async for item in event_gen:
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agents.py", line 168, in _create_agent_turn_streaming
2025-01-28 11:25:55     async for event in agent.create_and_execute_turn(request):
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 192, in create_and_execute_turn
2025-01-28 11:25:55     async for chunk in self.run(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 266, in run
2025-01-28 11:25:55     async for res in self._run(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 439, in _run
2025-01-28 11:25:55     result = await self.tool_runtime_api.rag_tool.query(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-28 11:25:55     result = await method(self, *args, **kwargs)
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 423, in query
2025-01-28 11:25:55     return await self.routing_table.get_provider_impl(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-28 11:25:55     result = await method(self, *args, **kwargs)
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/tool_runtime/rag/memory.py", line 131, in query
2025-01-28 11:25:55     tokens += metadata["token_count"]
2025-01-28 11:25:55 KeyError: 'token_count'
2025-01-28 11:25:56 05:55:56.054 [END] query_from_memory [StatusCode.OK] (144.18ms)
2025-01-28 11:25:56 05:55:56.067 [END] create_and_execute_turn [StatusCode.OK] (168.68ms)
2025-01-28 11:25:56 05:55:56.079 [END] /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn [StatusCode.OK] (211.33ms)

Expected behavior

Whether called through the client SDK or directly via Postman, the request fails with:
500: Internal server error: An unexpected error occurred.

The turn should complete successfully and return a RAG-augmented response.

@wukaixingxp
Contributor

Hi! This is caused by a persistent DB saved by an older llama-stack, which did not store any metadata. Can you delete the previous DB in ~/.llama/distributions/<your provider>/, e.g. rm -r ~/.llama/distributions/ollama, and retry the RAG test? Thanks!
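
For context, the failing line from your traceback is tokens += metadata["token_count"] in the RAG memory provider. A minimal sketch of that pattern (illustrative dicts, not the actual llama-stack objects) shows why chunks persisted without that metadata key break the query:

# Illustrative only: mimics the pattern at memory.py line 131 from the traceback.
chunk_metadatas = [
    {"token_count": 120},  # chunk written by a current llama-stack build
    {},                    # chunk persisted by an older build, no metadata
]

tokens = 0
for metadata in chunk_metadatas:
    tokens += metadata["token_count"]  # raises KeyError on the second entry

# A defensive variant would default missing counts instead:
# tokens += metadata.get("token_count", 0)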

@wukaixingxp wukaixingxp self-assigned this Jan 28, 2025
@abhishek-syno
Author

Hi @wukaixingxp, I have tried using a fresh volume mount, but it is still not working; I'm getting the same error.

@abhishek-syno
Author

abhishek-syno commented Jan 29, 2025

I removed the volume mount option when running the container to verify the issue. Here is the updated code, following your documentation.

# %% [markdown]
# **Setting up Vector DBs**

# %%
import os
import uuid
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types import UserMessage
from llama_stack_client.lib.agents.event_logger import EventLogger

# %%
host = "localhost"
port = 6003
model_name = "meta-llama/Llama-3.2-1B-Instruct"

# %%
client = LlamaStackClient(base_url=f"http://{host}:{port}")

# %%
# Register a vector db
vector_db_id = "my_documents"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)


# %%

# You can insert a pre-chunked document directly into the vector db
chunks = [
    {
        "document_id": "hrms.md",
        "content": "My Content...",
        "mime_type": "text/plain"
    },
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)
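# NOTE (added for clarity): the RAG tool later reads metadata["token_count"] for each
# retrieved chunk (memory.py line 131 in the traceback below), so chunks inserted
# directly via vector_io.insert without that metadata trigger the KeyError. A
# hypothetical chunk carrying it explicitly might look like the commented example
# below, but the resolution adopted later in this thread is to use
# client.tool_runtime.rag_tool.insert() with Document objects instead.
#
# chunks = [
#     {
#         "document_id": "hrms.md",
#         "content": "My Content...",
#         "mime_type": "text/plain",
#         "metadata": {"token_count": 3},  # assumed field/key names
#     },
# ]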

# You can then query for these chunks
chunks_response = client.vector_io.query(vector_db_id=vector_db_id, query="what is office hours?")
print(chunks_response)


# %%

# Configure agent with memory
agent_config = AgentConfig(
    model=model_name,
    enable_session_persistence=True,
    instructions="You are a helpful assistant",
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {
                "vector_db_ids": [vector_db_id],
            }
        }
    ]
)


# %%

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")


# %%

# Query with RAG
response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "What are the key topics in the documents?"
    }],
    session_id=session_id
)

for log in EventLogger().log(response):
    log.print()

Error Log

2025-01-29 12:10:07 INFO:     172.17.0.1:47034 - "POST /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn HTTP/1.1" 200 OK
2025-01-29 12:10:07 06:40:07.453 [START] /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn
2025-01-29 12:10:07 06:40:07.481 [START] create_and_execute_turn
2025-01-29 12:10:07 06:40:07.498 [START] query_from_memory
2025-01-29 12:10:07 
Batches:   0%|                                                                                                              | 0/1 [00:00<?, ?it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.23it/s]
2025-01-29 12:10:07 Traceback (most recent call last):
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 157, in sse_generator
2025-01-29 12:10:07     async for item in event_gen:
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agents.py", line 168, in _create_agent_turn_streaming
2025-01-29 12:10:07     async for event in agent.create_and_execute_turn(request):
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 192, in create_and_execute_turn
2025-01-29 12:10:07     async for chunk in self.run(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 266, in run
2025-01-29 12:10:07     async for res in self._run(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 439, in _run
2025-01-29 12:10:07     result = await self.tool_runtime_api.rag_tool.query(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-29 12:10:07     result = await method(self, *args, **kwargs)
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 423, in query
2025-01-29 12:10:07     return await self.routing_table.get_provider_impl(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-29 12:10:07     result = await method(self, *args, **kwargs)
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/tool_runtime/rag/memory.py", line 131, in query
2025-01-29 12:10:07     tokens += metadata["token_count"]
2025-01-29 12:10:07 KeyError: 'token_count'
2025-01-29 12:10:07 06:40:07.626 [END] query_from_memory [StatusCode.OK] (128.13ms)
2025-01-29 12:10:07 06:40:07.638 [END] create_and_execute_turn [StatusCode.OK] (156.16ms)
2025-01-29 12:10:07 06:40:07.648 [END] /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn [StatusCode.OK] (195.09ms)

@wukaixingxp
Contributor

Hi! I think this bug happens because you are manually creating the chunks (chunks = [{"document_id": "hrms.md", "content": "My Content...", "mime_type": "text/plain"}]), which do not have the metadata["token_count"] attribute, and inserting them into vector_io. I think it is better to create Document objects and then use client.tool_runtime.rag_tool.insert(). Here is example code you can follow:

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from termcolor import cprint
from llama_stack_client.types import Document

urls = ["chat.rst", "llama3.rst", "datasets.rst", "lora_finetune.rst"]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

vector_db_id = "test-vector-db"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)
rag_agent = Agent(client, agent_config)
session_id = rag_agent.create_session("test-session")
user_prompts = [
    "What are the top 5 topics that were explained? Only list succinct bullet points.",
]
for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()

@abhishek-syno
Author

abhishek-syno commented Jan 31, 2025

Thank you, @wukaixingxp 👍
Your solution worked like a charm.
I got confused because I had been experimenting with llama-stack for a long time. In version 0.1, they renamed many concepts and updated the documentation.

@abhishek-syno
Author

I had to ingest the documents using client.tool_runtime.rag_tool.insert() instead.
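
For anyone hitting the same error, a minimal sketch of the change that fixed it for me, adapted from the example above (the document_id and content are the placeholders from my earlier snippet):

from llama_stack_client.types import Document

doc = Document(
    document_id="hrms.md",
    content="My Content...",
    mime_type="text/plain",
    metadata={},
)

# Let the RAG tool chunk the document and populate token_count itself,
# instead of inserting pre-chunked data via vector_io.insert.
client.tool_runtime.rag_tool.insert(
    documents=[doc],
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)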

@zanetworker

zanetworker commented Jan 31, 2025

@abhishek-syno @wukaixingxp I almost always seem to get "None" and wrong responses when using the AgentConfig RAG toolgroup:

[{'role': 'user', 'content': 'What are the top 5 topics that were explained? Only list succinct bullet points.'}]
None
None
tool_execution> Tool:query_from_memory Args:{}
tool_execution> fetched 10709 bytes from memory
inference>     Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant

Any idea what I could be doing wrong?

@abhishek-syno
Author

@zanetworker ,
are you using Ollama?

@zanetworker

zanetworker commented Jan 31, 2025

yes!

Logs from ollama:

⠧ time=2025-01-31T09:28:09.032+01:00 level=INFO source=server.go:594 msg="llama runner started in 0.76 seconds"
[GIN] 2025/01/31 - 09:28:09 | 200 |  787.424041ms |       127.0.0.1 | POST     "/api/generate"
>>> [GIN] 2025/01/31 - 09:28:12 | 200 |     111.709µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/01/31 - 09:28:14 | 200 |      144.25µs |       127.0.0.1 | GET      "/api/ps"
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
time=2025-01-31T09:32:19.372+01:00 level=WARN source=runner.go:129 msg="truncating input prompt" limit=2048 prompt=2415 keep=5 new=2048
[GIN] 2025/01/31 - 09:32:30 | 200 | 11.655411833s |       127.0.0.1 | POST     "/api/generate"

@abhishek-syno
Author

@zanetworker, check your Ollama logs: you are hitting the "truncating input prompt" warning (limit=2048 prompt=2415 in the log above).

The fix is to increase the context window size.
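
If it helps, here is a rough sketch of one way to raise the context window when calling Ollama directly (its default of 2048 matches the limit=2048 in your log). How this option is surfaced through the llama-stack Ollama provider may differ, so treat this as an illustration rather than the llama-stack way:

import requests

# Ollama's generate endpoint accepts per-request options, including the
# context window (num_ctx); the 2048 default is what causes the truncation warning.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",        # illustrative model tag
        "prompt": "What are the top 5 topics that were explained?",
        "stream": False,
        "options": {"num_ctx": 4096},  # raise the context window beyond 2048
    },
)
print(resp.json()["response"])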

@zanetworker

Thanks @abhishek-syno, that's a good lead to go on. Will give it a shot :)

@abhishek-syno
Author

@wukaixingxp @ashwinb ,
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?

To restate: a Beginning of Sequence (BOS) token is added to the prompt as specified by the model, but the prompt already starts with a BOS token, so the final prompt begins with two BOS tokens.

This should be addressed in llama-stack, as the prompt it builds should not include an extra BOS token in this case.

@abhishek-syno
Author

Raised a new issue here: #913
