I have searched the existing issues and this bug is not already filed.
My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
In GraphRAG's finalize_entities method, duplicate entities are handled incorrectly. The method groups entities by title and type and merges their descriptions into a list, which is then passed to the model for summarization. However, the drop operation in finalize_entities does not account for nodes that have the same title but different types. As a result, when duplicates are removed, only the first node with a given title is kept and the others are discarded.
Steps to reproduce
1. Extract multiple entities with the same title but different types.
2. Call the finalize_entities method to process them.
3. After processing, notice that only the first node with a given title is kept and the others are discarded.
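The steps above can be approximated outside GraphRAG with a small pandas sketch. This is not the library's code: the sample rows, the column names, and the assumption that the drop is a title-only `drop_duplicates` are all illustrative.

```python
import pandas as pd

# Hypothetical extracted entities: "PARENT AREA" appears with two different types.
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "PARENT AREA", "CITY HALL"],
        "type": ["GEO", "ORGANIZATION", "GEO", "ORGANIZATION"],
        "description": [
            "A region.",
            "A governing body.",
            "Contains sub-areas.",
            "A building.",
        ],
    }
)

# Group by (title, type) and collect the descriptions of each group into a list.
merged = (
    entities.groupby(["title", "type"], sort=False)
    .agg(description=("description", list))
    .reset_index()
)

# Assumed form of the drop: deduplicate on title only, keeping the first row.
finalized = merged.drop_duplicates(subset="title")

print(merged[["title", "type"]])
print(finalized[["title", "type"]])
```

In this sketch `merged` has one row per (title, type) pair, while `finalized` keeps only the first PARENT AREA row, so the ORGANIZATION row and its description never reach later steps.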
Expected Behavior
The expected behavior is that nodes with the same title but different types should be handled correctly during deduplication, rather than keeping only the first node and discarding the rest.
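The issue does not say which handling is preferred, so the following sketch only illustrates two possibilities under assumed column names and made-up rows: deduplicating on title and type together, or keeping one row per title while folding in every description.

```python
import pandas as pd

# Hypothetical output of the merge step: one row per (title, type) pair.
merged = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "CITY HALL"],
        "type": ["GEO", "ORGANIZATION", "ORGANIZATION"],
        "description": [["A region."], ["A governing body."], ["A building."]],
    }
)

# Option 1: treat (title, type) as the entity key, so both PARENT AREA rows survive.
by_title_and_type = merged.drop_duplicates(subset=["title", "type"])

# Option 2: keep one row per title (first type wins) but retain all descriptions.
by_title = (
    merged.explode("description")
    .groupby("title", sort=False)
    .agg(type=("type", "first"), description=("description", list))
    .reset_index()
)

print(by_title_and_type)
print(by_title)
```

Option 1 keeps every typed variant but may require that any later step joining on title alone be adjusted. Option 2 keeps title unique at the cost of discarding all but one type.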
GraphRAG Config Used
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY} # set this in the generated .env file
    type: openai_chat # or azure_openai_chat
    auth_type: api_key # or azure_managed_identity
    model: gpt-4o-mini-2024-07-18
    model_supports_json: true # recommended if this is available for your model.
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # audience: "https://cognitiveservices.azure.com/.default"
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    auth_type: api_key # or azure_managed_identity
    model: text-embedding-3-large
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_output:
  # type: file # [file, blob, cosmosdb]
  # base_dir: "update_output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"
Logs and screenshots
I saved the entities before the merge.
You can see that two descriptions of PARENT AREA are missing after the merge.
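A check like this can also be done programmatically instead of by inspecting saved output. The sketch below assumes both snapshots are DataFrames with `title` and `description` columns, where `description` may hold a list. The helper name `missing_descriptions` and the placeholder rows are hypothetical, not taken from the actual run.

```python
import pandas as pd


def missing_descriptions(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Return (title, description) pairs present in `before` but absent from `after`."""
    # Flatten list-valued descriptions so each row holds a single description.
    b = before.explode("description")[["title", "description"]]
    a = after.explode("description")[["title", "description"]].drop_duplicates()
    # A left merge with an indicator marks rows that have no match in `after`.
    diff = b.merge(a, on=["title", "description"], how="left", indicator=True)
    lost = diff.loc[diff["_merge"] == "left_only", ["title", "description"]]
    return lost.reset_index(drop=True)


# Placeholder data: one title with two types and three descriptions in total.
before = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA"],
        "type": ["GEO", "ORGANIZATION"],
        "description": [["d1"], ["d2", "d3"]],
    }
)
# Simulate a title-only deduplication that keeps the first row.
after = before.drop_duplicates(subset="title")

print(missing_descriptions(before, after))
```

With the placeholder data, the helper reports the two descriptions that belonged to the dropped ORGANIZATION row.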
Additional Information
GraphRAG Version: 1.2.0
Operating System: Windows 10
Python Version: 3.12.9
Related Issues:
IT-Bill added the bug (Something isn't working) and triage (Default label assignment, indicates new issue needs reviewed by a maintainer) labels on Feb 18, 2025.
IT-Bill changed the title from "[Bug]: Issue with handling duplicate titles in finalize_entities method" to "[Fatal Bug]: Incorrect deduplication of entities with same title but different type" on Feb 18, 2025.
natoverse added the backlog label (We've confirmed some action is needed on this and will plan it) and removed the triage label on Feb 18, 2025.