Understanding Entity Deduplication

User Intent

"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"

Operation

Concept: Entity resolution and deduplication

SDK Methods: queryObservables() for finding potential duplicates

Entity: Observable deduplication behavior

Prerequisites

  • Knowledge graph with entities

  • Understanding of Observable model

  • Multiple content sources with entity mentions


How Deduplication Works

Automatic Deduplication

At Creation Time:

// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations

Deduplication Strategies:

  1. Exact Name Match: "Kirk Marple" = "Kirk Marple"

  2. Email Matching (for Person): the same email address always resolves to the same person

  3. URL Matching (for Organization): the same URL or domain (e.g., graphlit.com) resolves to the same organization

  4. Normalization: Case-insensitive, whitespace trimming
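The four strategies above can be pictured as building one comparison key per mention. The following is a minimal sketch of that idea, not Graphlit's actual implementation: the `EntityMention` shape, the helper names, and the precedence (identifier first, then normalized name) are all assumptions made for illustration.

```typescript
// Hypothetical mention shape for illustration; not a Graphlit SDK type.
interface EntityMention {
  type: "Person" | "Organization";
  name: string;
  email?: string; // assumed to apply to Person
  url?: string;   // assumed to apply to Organization
}

// Lowercase, trim, and collapse internal whitespace.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Reduce a URL or bare domain to its host, without scheme or "www.".
function hostOf(url: string): string {
  const stripped = url
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, "")
    .replace(/^www\./, "");
  return stripped.split(/[/?#]/)[0] ?? "";
}

// Prefer a unique identifier; fall back to the normalized name.
function dedupKey(m: EntityMention): string {
  if (m.type === "Person" && m.email) {
    return `person:email:${m.email.trim().toLowerCase()}`;
  }
  if (m.type === "Organization" && m.url) {
    return `organization:host:${hostOf(m.url)}`;
  }
  return `${m.type.toLowerCase()}:name:${normalizeName(m.name)}`;
}
```

Two mentions with the same key would be treated as one entity. Note that "Kirk" and "Kirk Marple" produce different name keys, which is why name variants can end up as separate entities when no shared identifier is present.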

Race Conditions

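Problem: Parallel processing can create duplicates. If two workers extract the same entity at the same time, each can check for an existing match before the other has written its result, so both create a new Observable.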

When This Happens:

  • Multiple feeds syncing in parallel

  • Batch ingestion of many documents

  • High-frequency entity creation

Future Improvement: More robust entity resolution is on the roadmap
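A check-then-create race of this kind is easy to reproduce in isolation. The sketch below is an in-memory simulation, not Graphlit's ingestion code: `getOrCreate` and `ingestInParallel` are hypothetical names, and the `await` stands in for whatever latency separates the lookup from the write.

```typescript
// Check-then-create without locking: the gap between the two steps is the race window.
async function getOrCreate(store: string[], name: string): Promise<void> {
  const exists = store.indexOf(name) !== -1; // step 1: look for an existing entity
  await Promise.resolve();                   // yield, so another worker can run here
  if (!exists) store.push(name);             // step 2: create if none was found
}

// Two workers handle the same mention concurrently and share one store.
async function ingestInParallel(): Promise<number> {
  const store: string[] = [];
  await Promise.all([
    getOrCreate(store, "Kirk Marple"),
    getOrCreate(store, "Kirk Marple"),
  ]);
  return store.length; // number of entities created for the same name
}
```

Both workers run their lookup before either one writes, so the store ends up with two entries for the same name. Running the two calls sequentially instead yields a single entry.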


Finding Duplicates

Query Similar Entities
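One way to look for similar entities is to fetch a list of entities (for example with queryObservables()) and compare names locally. The sketch below assumes the results have already been mapped to a simple `{ id, name }` shape; that shape and the `findSimilar` helper are hypothetical and not part of the SDK.

```typescript
// Hypothetical, simplified entity shape for local comparison.
interface Entity {
  id: string;
  name: string;
}

// Split a name into lowercase word tokens.
function tokens(name: string): string[] {
  return name.trim().toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// True when every token of the shorter name appears in the longer one,
// so "Kirk" is considered similar to "Kirk Marple".
function isSimilar(a: string, b: string): boolean {
  const ta = tokens(a);
  const tb = tokens(b);
  const fewer = ta.length <= tb.length ? ta : tb;
  const more = ta.length <= tb.length ? tb : ta;
  return fewer.length > 0 && fewer.every((t) => more.indexOf(t) !== -1);
}

// Return every entity whose name is similar to the query name.
function findSimilar(entities: Entity[], query: string): Entity[] {
  return entities.filter((e) => isSimilar(e.name, query));
}
```

Token overlap is deliberately loose: it will also match unrelated people who share a first name, so the results are candidates to review rather than confirmed duplicates.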

Identify Potential Duplicates
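To scan a whole set of entities rather than a single name, compare every pair of the same type. The following sketch uses an assumed rule (equal after normalization, or one name is a whole-word prefix of the other); the `Entity` shape and both function names are hypothetical.

```typescript
// Hypothetical, simplified entity shape; `type` would be e.g. "Person" or "Organization".
interface Entity {
  id: string;
  type: string;
  name: string;
}

const norm = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, " ");

// Candidate rule: same type, and names are equal after normalization
// or one is a whole-word prefix of the other ("kirk" vs "kirk marple").
function looksLikeDuplicate(a: Entity, b: Entity): boolean {
  if (a.type !== b.type) return false;
  const na = norm(a.name);
  const nb = norm(b.name);
  if (na.length === 0 || nb.length === 0) return false;
  return na === nb || nb.indexOf(na + " ") === 0 || na.indexOf(nb + " ") === 0;
}

// Compare each pair once and collect the IDs of likely duplicates.
function potentialDuplicates(entities: Entity[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  entities.forEach((a, i) => {
    entities.slice(i + 1).forEach((b) => {
      if (looksLikeDuplicate(a, b)) pairs.push([a.id, b.id]);
    });
  });
  return pairs;
}
```

The pairwise scan is quadratic in the number of entities, which is fine for a few thousand names but would need blocking (for example, grouping by first token) for larger graphs.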


Working with Duplicates

Query All Variants
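When the variants are already known, collect every entity that matches any of them and work with the combined set. This sketch filters a locally held list against an explicit alias list; the `Entity` shape and `queryVariants` are hypothetical, and in practice the list might come from one name query per variant.

```typescript
// Hypothetical, simplified entity shape.
interface Entity {
  id: string;
  name: string;
}

// Return all entities whose name equals one of the known variants,
// ignoring case and surrounding whitespace.
function queryVariants(entities: Entity[], variants: string[]): Entity[] {
  const wanted = variants.map((v) => v.trim().toLowerCase());
  return entities.filter((e) => wanted.indexOf(e.name.trim().toLowerCase()) !== -1);
}
```

For example, passing the variants "Kirk Marple" and "Kirk" returns both entities, whose IDs can then be used together in later queries.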

Disambiguate by Properties
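When several entities share a name, other properties can separate them. The sketch below keeps only the candidates whose properties match every expected value; the `Candidate` shape, the free-form `properties` map, and the `organization` key used in the example are assumptions for illustration.

```typescript
// Hypothetical candidate shape with a free-form property map.
interface Candidate {
  id: string;
  name: string;
  properties: { [key: string]: string };
}

// Keep candidates that have every expected property with a matching value
// (compared case-insensitively, ignoring surrounding whitespace).
function disambiguate(
  candidates: Candidate[],
  expected: { [key: string]: string }
): Candidate[] {
  const keys = Object.keys(expected);
  return candidates.filter((c) =>
    keys.every((k) => {
      const actual = c.properties[k];
      const wanted = expected[k];
      return (
        actual !== undefined &&
        wanted !== undefined &&
        actual.trim().toLowerCase() === wanted.trim().toLowerCase()
      );
    })
  );
}
```

Calling `disambiguate(candidates, { organization: "Graphlit" })` on two entities named "Kirk" would keep only the one whose `organization` property is "Graphlit". A candidate that lacks the property is excluded, which errs on the side of not merging.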


Best Practices

1. Use Unique Identifiers

When available, query by email (Person) or URL (Organization) rather than by name alone.
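A lookup that follows this practice tries the identifier first and falls back to the name only when the name is unambiguous. This is a sketch under assumptions: `Person` and `lookupPerson` are hypothetical, and `kirk@example.com` in the usage note is a placeholder address.

```typescript
// Hypothetical, simplified person shape.
interface Person {
  id: string;
  name: string;
  email?: string;
}

const lower = (s: string): string => s.trim().toLowerCase();

// Prefer an email match; use the name only if exactly one person has it.
function lookupPerson(people: Person[], name: string, email?: string): Person | undefined {
  if (email) {
    const byEmail = people.filter(
      (p) => p.email !== undefined && lower(p.email) === lower(email)
    );
    if (byEmail.length > 0) return byEmail[0];
  }
  const byName = people.filter((p) => lower(p.name) === lower(name));
  return byName.length === 1 ? byName[0] : undefined;
}
```

With two people named "Kirk" in the list, `lookupPerson(people, "Kirk")` returns `undefined` rather than guessing, while `lookupPerson(people, "Kirk", "kirk@example.com")` returns the person with that address regardless of how the name was written.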

2. Aggregate Across Variants

Combine mentions from all duplicate entities so that counts and content lists reflect the whole person or organization.
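Aggregation amounts to taking the union of the content each variant is mentioned in. The sketch below assumes each variant carries a list of content IDs; `EntityWithMentions` and `aggregateMentions` are hypothetical names, not SDK types.

```typescript
// Hypothetical shape: an entity plus the IDs of the content that mentions it.
interface EntityWithMentions {
  id: string;
  name: string;
  contentIds: string[];
}

// Union the content IDs of all variants, preserving first-seen order,
// so a document that mentions both "Kirk" and "Kirk Marple" is counted once.
function aggregateMentions(variants: EntityWithMentions[]): string[] {
  const merged: string[] = [];
  for (const v of variants) {
    for (const contentId of v.contentIds) {
      if (merged.indexOf(contentId) === -1) merged.push(contentId);
    }
  }
  return merged;
}
```

Summing per-variant counts instead would double-count any document that mentions more than one variant, which is why the union is taken over content IDs.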

3. Normalize in UI

Display one normalized name per entity in the UI, even when several variants exist.
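One simple display rule is to pick a single label for each group of variants. The sketch below chooses the most-mentioned variant and breaks ties in favor of the longer name; this rule, the `Variant` shape, and `displayName` are assumptions for illustration.

```typescript
// Hypothetical shape: a name variant and how often it was mentioned.
interface Variant {
  name: string;
  mentions: number;
}

// Pick one label for a group of variants: most-mentioned wins,
// the longer name breaks ties, and whitespace is tidied for display.
function displayName(variants: Variant[]): string {
  if (variants.length === 0) return "";
  const best = variants.reduce((a, b) =>
    b.mentions > a.mentions ||
    (b.mentions === a.mentions && b.name.length > a.name.length)
      ? b
      : a
  );
  return best.name.trim().replace(/\s+/g, " ");
}
```

Given "Kirk" with 3 mentions and "Kirk Marple" with 10, the UI would show "Kirk Marple" for both underlying entities.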


Future Improvements

Roadmap Items (not yet available):

  • Manual entity merging API

  • More sophisticated entity resolution

  • Cross-source entity linking

  • Entity resolution confidence scores

Current Limitations:

  • No API to manually merge duplicates

  • Some race conditions create duplicates

  • Name variants not always linked

Workarounds:

  • Query all variants and aggregate

  • Use unique identifiers (email, URL)

  • Filter by properties to disambiguate


Developer Hints

  • Duplicates are rare but possible

  • Most common after parallel batch ingestion

  • Email/URL properties help disambiguation

  • Query by identifier, not just name

  • Future releases will improve resolution

  • Not a critical issue for most use cases

