[Fatal Bug]: Incorrect deduplication of entities with same title but different type #1718

Open
3 tasks done
IT-Bill opened this issue Feb 18, 2025 · 1 comment
Labels
backlog (We've confirmed some action is needed on this and will plan it), bug (Something isn't working)

Comments

IT-Bill commented Feb 18, 2025

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

GraphRAG mishandles duplicate entities in the finalize_entities method. Entities are grouped by title and type, and each group's descriptions are merged into a list that is then passed to the model for summarization. However, the drop operation in finalize_entities does not account for nodes that share a title but differ in type. As a result, when duplicates are removed, only the first node with a given title is kept and every other node with that title is discarded, along with its description.
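The effect described above can be illustrated with a minimal pandas sketch. It assumes that the drop operation behaves like `drop_duplicates` keyed on the title column alone; the issue does not quote the code, so this is an illustration rather than the actual GraphRAG source. The title PARENT AREA comes from this report, while the types and descriptions are made up for the example.

```python
import pandas as pd

# Hypothetical extracted entities. The title matches the one in this report,
# but the types and descriptions are invented for illustration.
entities = pd.DataFrame(
    [
        {"title": "PARENT AREA", "type": "GEO",
         "description": "A region that contains sub-areas."},
        {"title": "PARENT AREA", "type": "ORGANIZATION",
         "description": "A body that administers sub-areas."},
        {"title": "CHILD AREA", "type": "GEO",
         "description": "A sub-area."},
    ]
)

# Assumption: the drop step is keyed on title only, so the type is ignored.
deduped = entities.drop_duplicates(subset="title")

# Only the first PARENT AREA row survives; the ORGANIZATION row is gone.
kept_types = deduped.loc[deduped["title"] == "PARENT AREA", "type"].tolist()
print(kept_types)  # → ['GEO']
```

Because `drop_duplicates` keeps the first occurrence by default, which of the same-title rows survives depends only on row order.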

Steps to reproduce

  1. Extract multiple entities with the same title but different types.
  2. Call the finalize_entities method to process them.
  3. After processing, notice that only the first node with the same title is kept, and the others are discarded.
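To check whether a given set of extracted entities would hit this, one option is to list the titles that appear with more than one type before finalization. The sketch below assumes the entities are available as a pandas DataFrame with `title` and `type` columns; the helper name `titles_with_multiple_types` and the sample rows are hypothetical.

```python
import pandas as pd


def titles_with_multiple_types(entities: pd.DataFrame) -> pd.Series:
    """Return the distinct-type count for each title that has more than one type."""
    type_counts = entities.groupby("title")["type"].nunique()
    return type_counts[type_counts > 1]


# Hypothetical sample: one title extracted with two different types.
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "CHILD AREA"],
        "type": ["GEO", "ORGANIZATION", "GEO"],
    }
)

affected = titles_with_multiple_types(entities)
print(affected)
```

Any title returned by this helper is one for which a title-only drop would discard at least one row.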

Expected Behavior

Nodes with the same title but different types should be handled correctly during deduplication, rather than only the first node being kept and the rest discarded.
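The issue does not say which handling is preferred, so the following is only a sketch of two possibilities, again assuming a pandas DataFrame with `title`, `type`, and `description` columns. The function names `dedupe_keep_types` and `merge_by_title` are hypothetical and are not part of GraphRAG.

```python
import pandas as pd


def dedupe_keep_types(entities: pd.DataFrame) -> pd.DataFrame:
    """Option A: treat (title, type) as the identity, so no type is dropped."""
    return entities.drop_duplicates(subset=["title", "type"])


def merge_by_title(entities: pd.DataFrame) -> pd.DataFrame:
    """Option B: keep one row per title, collecting all types and descriptions."""
    return (
        entities.groupby("title", sort=False)
        .agg(types=("type", list), descriptions=("description", list))
        .reset_index()
    )


# Hypothetical sample rows for illustration.
entities = pd.DataFrame(
    [
        {"title": "PARENT AREA", "type": "GEO", "description": "A region."},
        {"title": "PARENT AREA", "type": "ORGANIZATION", "description": "A body."},
        {"title": "CHILD AREA", "type": "GEO", "description": "A sub-area."},
    ]
)

print(dedupe_keep_types(entities))
print(merge_by_title(entities))
```

Option A preserves every row but leaves two rows sharing a title, which may or may not be acceptable to later steps that identify nodes by title. Option B keeps titles unique and retains every description, at the cost of a single row carrying several types.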

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY} # set this in the generated .env file
    type: openai_chat # or azure_openai_chat
    auth_type: api_key # or azure_managed_identity
    model: gpt-4o-mini-2024-07-18
    model_supports_json: true # recommended if this is available for your model.
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # audience: "https://cognitiveservices.azure.com/.default"
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    auth_type: api_key # or azure_managed_identity
    model: text-embedding-3-large
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_output:
  # type: file # [file, blob, cosmosdb]
  # base_dir: "update_output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots

Image

Image

Image
I saved the entities before the merge.

Image
You can see that two descriptions of PARENT AREA are missing.

Additional Information

  • GraphRAG Version: 1.2.0
  • Operating System: Windows 10
  • Python Version: 3.12.9
  • Related Issues:
IT-Bill added the bug (Something isn't working) and triage (Default label assignment, indicates new issue needs reviewed by a maintainer) labels on Feb 18, 2025.
IT-Bill changed the title from "[Bug]: Issue with handling duplicate titles in finalize_entities method" to "[Fatal Bug]: Incorrect deduplication of entities with same title but different type" on Feb 18, 2025.
natoverse added the backlog (We've confirmed some action is needed on this and will plan it) label and removed the triage label on Feb 18, 2025.

IT-Bill commented Feb 26, 2025

Could you please provide an estimated timeline for fixing the issue? Thanks.
