
KeyError: 'token_count' when calling chat API with RAG toolgroup #887

Closed
1 of 2 tasks
abhishek-syno opened this issue Jan 28, 2025 · 13 comments

@abhishek-syno

abhishek-syno commented Jan 28, 2025

System Info

Description

Encountered a KeyError when using the chat API with the RAG toolgroup. The error is raised when the RAG tool runtime tries to read the 'token_count' key from a chunk's metadata.

Python version: 3.13.1
llama_stack: 0.1
llama_stack_client: 0.1

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Steps to Reproduce
1. Created an agent with the RAG toolgroup
2. Initialized a session
3. Attempted to make a chat request (sketched below, after the agent config)

from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="meta-llama/Llama-3.2-1B-Instruct",
    instructions="You are a helpful assistant...",
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {"vector_db_ids": ["hrms_test_bank_test_1"]},
        }
    ],
    tool_choice="auto",
    tool_prompt_format="json",
    enable_session_persistence=True,
    stream=True
)
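
For completeness, a rough sketch of the turn request that triggers the error, assuming a client pointed at the running distribution (the host, port, and prompt are illustrative; my full script is in a later comment):

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger

client = LlamaStackClient(base_url="http://localhost:6003")  # port is illustrative

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# The turn below is what produces the 500 / KeyError shown in the logs.
response = agent.create_turn(
    messages=[{"role": "user", "content": "what is office hours?"}],
    session_id=session_id,
)
for log in EventLogger().log(response):
    log.print()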

Error logs

Error Message - Docker Ollama Distribution

2025-01-28 11:25:55 INFO:     172.17.0.1:38390 - "POST /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn HTTP/1.1" 200 OK
2025-01-28 11:25:55 05:55:55.868 [START] /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn
2025-01-28 11:25:55 05:55:55.899 [START] create_and_execute_turn
2025-01-28 11:25:55 05:55:55.910 [START] query_from_memory
2025-01-28 11:25:55 
Batches:   0%|                                                                                             | 0/1 [00:00<?, ?it/s]
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.05it/s]
2025-01-28 11:25:55 Traceback (most recent call last):
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 157, in sse_generator
2025-01-28 11:25:55     async for item in event_gen:
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agents.py", line 168, in _create_agent_turn_streaming
2025-01-28 11:25:55     async for event in agent.create_and_execute_turn(request):
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 192, in create_and_execute_turn
2025-01-28 11:25:55     async for chunk in self.run(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 266, in run
2025-01-28 11:25:55     async for res in self._run(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 439, in _run
2025-01-28 11:25:55     result = await self.tool_runtime_api.rag_tool.query(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-28 11:25:55     result = await method(self, *args, **kwargs)
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 423, in query
2025-01-28 11:25:55     return await self.routing_table.get_provider_impl(
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-28 11:25:55     result = await method(self, *args, **kwargs)
2025-01-28 11:25:55   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/tool_runtime/rag/memory.py", line 131, in query
2025-01-28 11:25:55     tokens += metadata["token_count"]
2025-01-28 11:25:55 KeyError: 'token_count'
2025-01-28 11:25:56 05:55:56.054 [END] query_from_memory [StatusCode.OK] (144.18ms)
2025-01-28 11:25:56 05:55:56.067 [END] create_and_execute_turn [StatusCode.OK] (168.68ms)
2025-01-28 11:25:56 05:55:56.079 [END] /v1/agents/887cf37c-329f-4cb2-ad1d-62e3a979ab13/session/af7abb2e-149f-4c9f-aeeb-e5d305b5923e/turn [StatusCode.OK] (211.33ms)

Expected behavior

Whether called through the client SDK or directly via Postman, the request fails with:
500: Internal server error: An unexpected error occurred.

The turn should complete successfully and return a RAG-augmented response.

@wukaixingxp
Contributor

Hi! This is caused by a persistent DB saved by an older llama-stack, which did not store any metadata. Can you delete the previous DB in ~/.llama/distributions/<your provider>/, e.g. rm -r ~/.llama/distributions/ollama, and retry the RAG test? Thanks!
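
For context, the failing line from your traceback is tokens += metadata["token_count"] in the RAG memory provider. A minimal sketch of that pattern (illustrative dicts, not the actual llama-stack objects) shows why chunks persisted without that metadata key break the query:

# Illustrative only: mimics the pattern at memory.py line 131 from the traceback.
chunk_metadatas = [
    {"token_count": 120},  # chunk written by a current llama-stack build
    {},                    # chunk persisted by an older build, no metadata
]

tokens = 0
for metadata in chunk_metadatas:
    tokens += metadata["token_count"]  # raises KeyError on the second entry

# A defensive variant would default missing counts instead:
# tokens += metadata.get("token_count", 0)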

@wukaixingxp wukaixingxp self-assigned this Jan 28, 2025
@abhishek-syno
Author

Hi @wukaixingxp, I have tried using a fresh volume mount, but it is still not working; I'm getting the same error.

@abhishek-syno
Author

abhishek-syno commented Jan 29, 2025

I removed the volume mount option when running the container to verify the issue. Here is the updated code, following your documentation.

# %% [markdown]
# **Setting up Vector DBs**

# %%
import os
import uuid
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types import UserMessage
from llama_stack_client.lib.agents.event_logger import EventLogger

# %%
host = "localhost"
port = 6003
model_name = "meta-llama/Llama-3.2-1B-Instruct"

# %%
client = LlamaStackClient(base_url=f"http://{host}:{port}")

# %%
# Register a vector db
vector_db_id = "my_documents"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)


# %%

# You can insert a pre-chunked document directly into the vector db
chunks = [
    {
        "document_id": "hrms.md",
        "content": "My Content...",
        "mime_type": "text/plain"
    },
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)
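# NOTE (added for clarity): the RAG tool later reads metadata["token_count"] for each
# retrieved chunk (memory.py line 131 in the traceback below), so chunks inserted
# directly via vector_io.insert without that metadata trigger the KeyError. A
# hypothetical chunk carrying it explicitly might look like the commented example
# below, but the resolution adopted later in this thread is to use
# client.tool_runtime.rag_tool.insert() with Document objects instead.
#
# chunks = [
#     {
#         "document_id": "hrms.md",
#         "content": "My Content...",
#         "mime_type": "text/plain",
#         "metadata": {"token_count": 3},  # assumed field/key names
#     },
# ]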

# You can then query for these chunks
chunks_response = client.vector_io.query(vector_db_id=vector_db_id, query="what is office hours?")
print(chunks_response)


# %%

# Configure agent with memory
agent_config = AgentConfig(
    model=model_name,
    enable_session_persistence=True,
    instructions="You are a helpful assistant",
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {
                "vector_db_ids": [vector_db_id],
            }
        }
    ]
)


# %%

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")


# %%

# Query with RAG
response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "What are the key topics in the documents?"
    }],
    session_id=session_id
)

for log in EventLogger().log(response):
    log.print()

Error Log

2025-01-29 12:10:07 INFO:     172.17.0.1:47034 - "POST /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn HTTP/1.1" 200 OK
2025-01-29 12:10:07 06:40:07.453 [START] /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn
2025-01-29 12:10:07 06:40:07.481 [START] create_and_execute_turn
2025-01-29 12:10:07 06:40:07.498 [START] query_from_memory
2025-01-29 12:10:07 
Batches:   0%|                                                                                                              | 0/1 [00:00<?, ?it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.23it/s]
2025-01-29 12:10:07 Traceback (most recent call last):
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 157, in sse_generator
2025-01-29 12:10:07     async for item in event_gen:
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agents.py", line 168, in _create_agent_turn_streaming
2025-01-29 12:10:07     async for event in agent.create_and_execute_turn(request):
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 192, in create_and_execute_turn
2025-01-29 12:10:07     async for chunk in self.run(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 266, in run
2025-01-29 12:10:07     async for res in self._run(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 439, in _run
2025-01-29 12:10:07     result = await self.tool_runtime_api.rag_tool.query(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-29 12:10:07     result = await method(self, *args, **kwargs)
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 423, in query
2025-01-29 12:10:07     return await self.routing_table.get_provider_impl(
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
2025-01-29 12:10:07     result = await method(self, *args, **kwargs)
2025-01-29 12:10:07   File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/inline/tool_runtime/rag/memory.py", line 131, in query
2025-01-29 12:10:07     tokens += metadata["token_count"]
2025-01-29 12:10:07 KeyError: 'token_count'
2025-01-29 12:10:07 06:40:07.626 [END] query_from_memory [StatusCode.OK] (128.13ms)
2025-01-29 12:10:07 06:40:07.638 [END] create_and_execute_turn [StatusCode.OK] (156.16ms)
2025-01-29 12:10:07 06:40:07.648 [END] /v1/agents/ff9135fc-fa75-4e77-8674-f7aeb97e2203/session/8fd6d687-fb34-44f1-b8fa-98e612cfbd30/turn [StatusCode.OK] (195.09ms)

@wukaixingxp
Contributor

Hi! I think this bug happens because you are manually creating the chunks (chunks = [{"document_id": "hrms.md", "content": "My Content...", "mime_type": "text/plain"}]), which do not have the metadata["token_count"] attribute, and inserting them into vector_io. I think it is better to create Document objects and then use client.tool_runtime.rag_tool.insert(). Here is example code you can follow:

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from termcolor import cprint
from llama_stack_client.types import Document

urls = ["chat.rst", "llama3.rst", "datasets.rst", "lora_finetune.rst"]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

vector_db_id = "test-vector-db"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    toolgroups=[
        {
            "name": "builtin::rag",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)
rag_agent = Agent(client, agent_config)
session_id = rag_agent.create_session("test-session")
user_prompts = [
    "What are the top 5 topics that were explained? Only list succinct bullet points.",
]
for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()

@abhishek-syno
Author

abhishek-syno commented Jan 31, 2025

Thank you, @wukaixingxp 👍
Your solution worked like a charm.
I got confused because I had been experimenting with llama-stack for a long time. In version 0.1, they renamed many concepts and updated the documentation.

@abhishek-syno
Author

I had to ingest the documents using client.tool_runtime.rag_tool.insert() instead.
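
For anyone hitting the same error, a minimal sketch of the change that fixed it for me, adapted from the example above (the document_id and content are the placeholders from my earlier snippet):

from llama_stack_client.types import Document

doc = Document(
    document_id="hrms.md",
    content="My Content...",
    mime_type="text/plain",
    metadata={},
)

# Let the RAG tool chunk the document and populate token_count itself,
# instead of inserting pre-chunked data via vector_io.insert.
client.tool_runtime.rag_tool.insert(
    documents=[doc],
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)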

@zanetworker

zanetworker commented Jan 31, 2025

@abhishek-syno @wukaixingxp I almost always seem to get "None" and wrong responses when using the AgentConfig RAG toolgroup:

[{'role': 'user', 'content': 'What are the top 5 topics that were explained? Only list succinct bullet points.'}]
None
None
tool_execution> Tool:query_from_memory Args:{}
tool_execution> fetched 10709 bytes from memory
inference>     Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant

Any idea what I could be doing wrong?

@abhishek-syno
Author

@zanetworker ,
are you using Ollama?

@zanetworker

zanetworker commented Jan 31, 2025

yes!

Logs from ollama:

⠧ time=2025-01-31T09:28:09.032+01:00 level=INFO source=server.go:594 msg="llama runner started in 0.76 seconds"
[GIN] 2025/01/31 - 09:28:09 | 200 |  787.424041ms |       127.0.0.1 | POST     "/api/generate"
>>> [GIN] 2025/01/31 - 09:28:12 | 200 |     111.709µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/01/31 - 09:28:14 | 200 |      144.25µs |       127.0.0.1 | GET      "/api/ps"
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
time=2025-01-31T09:32:19.372+01:00 level=WARN source=runner.go:129 msg="truncating input prompt" limit=2048 prompt=2415 keep=5 new=2048
[GIN] 2025/01/31 - 09:32:30 | 200 | 11.655411833s |       127.0.0.1 | POST     "/api/generate"

@abhishek-syno
Author

@zanetworker, check your Ollama logs: you are hitting the "truncating input prompt" warning (limit=2048 prompt=2415 in the log above).

The fix is to increase the context window size.
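
If it helps, here is a rough sketch of one way to raise the context window when calling Ollama directly (its default of 2048 matches the limit=2048 in your log). How this option is surfaced through the llama-stack Ollama provider may differ, so treat this as an illustration rather than the llama-stack way:

import requests

# Ollama's generate endpoint accepts per-request options, including the
# context window (num_ctx); the 2048 default is what causes the truncation warning.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",        # illustrative model tag
        "prompt": "What are the top 5 topics that were explained?",
        "stream": False,
        "options": {"num_ctx": 4096},  # raise the context window beyond 2048
    },
)
print(resp.json()["response"])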

@zanetworker

Thanks @abhishek-syno, that's a good lead to go on. Will give it a shot :)

@abhishek-syno
Author

@wukaixingxp @ashwinb ,
check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?

To restate: a Beginning of Sequence (BOS) token is added to the prompt as specified by the model, but the prompt already starts with a BOS token, so the final prompt begins with two BOS tokens.

This should be addressed in llama-stack, as the prompt it builds should not include an extra BOS token in this case.

@abhishek-syno
Author

Raised a new issue here: #913
