Understanding Entity Deduplication
User Intent
"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"
Operation
Concept: Entity resolution and deduplication
SDK Methods: queryObservables() for finding potential duplicates
Entity: Observable deduplication behavior
Prerequisites
Knowledge graph with entities
Understanding of Observable model
Multiple content sources with entity mentions
How Deduplication Works
Automatic Deduplication
At Creation Time:
// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations
Deduplication Strategies:
Exact Name Match: "Kirk Marple" = "Kirk Marple"
Email Matching (for Person): the same email address always resolves to the same person
URL Matching (for Organization): the same URL or domain (e.g., graphlit.com) resolves to the same organization
Normalization: Case-insensitive, whitespace trimming
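The exact-match and normalization strategies above can be pictured as comparing a canonical key instead of the raw string. The sketch below is illustrative only: `normalizeName` and `isSameName` are hypothetical helpers, not Graphlit SDK functions, and collapsing internal whitespace is an added assumption. The platform's actual rules may differ.

```typescript
// Hypothetical helper (not part of the Graphlit SDK): builds a comparison key
// by trimming, collapsing repeated whitespace (an assumption), and lowercasing.
function normalizeName(rawName: string): string {
  return rawName.trim().replace(/\s+/g, " ").toLowerCase();
}

// Two mentions count as the same entity under this rule when their keys match.
function isSameName(a: string, b: string): boolean {
  return normalizeName(a) === normalizeName(b);
}

console.log(isSameName("  Kirk Marple ", "kirk marple")); // true
console.log(isSameName("Kirk Marple", "Kirk"));           // false
```

The second call shows why "Kirk" and "Kirk Marple" can end up as separate entities: a short form is not an exact match, even after normalization.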
Race Conditions
Problem: Parallel processing can create duplicates
When This Happens:
Multiple feeds syncing in parallel
Batch ingestion of many documents
High-frequency entity creation
Future Improvement: More robust entity resolution is a roadmap item
Finding Duplicates
Query Similar Entities
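One possible approach is to fetch candidate entities (for example with queryObservables()) and compare names client-side. The sketch below assumes the results have already been mapped to a simple `EntityRecord` shape. That type, `nameTokens`, and `findSimilarByName` are hypothetical names used for illustration, not SDK types or methods.

```typescript
// Hypothetical, simplified shape for an entity returned from a query.
interface EntityRecord {
  id: string;
  name: string;
}

// Split a name into lowercase tokens, e.g. "Kirk Marple" -> ["kirk", "marple"].
function nameTokens(rawName: string): string[] {
  return rawName.trim().toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Return entities whose name shares at least one token with the search name.
function findSimilarByName(entities: EntityRecord[], search: string): EntityRecord[] {
  const wanted = new Set(nameTokens(search));
  return entities.filter((e) => nameTokens(e.name).some((t) => wanted.has(t)));
}

const entities: EntityRecord[] = [
  { id: "1", name: "Kirk Marple" },
  { id: "2", name: "Kirk" },
  { id: "3", name: "Graphlit" },
];

// Matches the entities with ids "1" and "2".
console.log(findSimilarByName(entities, "Kirk Marple").map((e) => e.id));
```

Token overlap is deliberately loose, so the result is a list of candidates to review, not confirmed duplicates.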
Identify Potential Duplicates
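A minimal sketch of duplicate detection, assuming entities have already been retrieved and mapped to a plain `PersonRecord` shape. The type, the helper names, and the example.com addresses are all placeholders for illustration. The grouping rule (email first, then normalized name) mirrors the strategies described earlier but is not necessarily what Graphlit does internally.

```typescript
// Hypothetical, simplified entity shape; not an SDK type.
interface PersonRecord {
  id: string;
  name: string;
  email?: string;
}

// Prefer the email as the grouping key; fall back to the normalized name.
function duplicateKey(p: PersonRecord): string {
  if (p.email) return "email:" + p.email.trim().toLowerCase();
  return "name:" + p.name.trim().toLowerCase();
}

// Group records by key and keep only groups with more than one member.
function findPotentialDuplicates(people: PersonRecord[]): PersonRecord[][] {
  const groups = new Map<string, PersonRecord[]>();
  for (const p of people) {
    const key = duplicateKey(p);
    const group = groups.get(key) ?? [];
    group.push(p);
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((g) => g.length > 1);
}

const people: PersonRecord[] = [
  { id: "1", name: "Kirk Marple", email: "kirk@example.com" },
  { id: "2", name: "Kirk", email: "KIRK@example.com" },
  { id: "3", name: "kirk marple " },
  { id: "4", name: "Kirk Marple" },
];

// Two groups: ids 1 and 2 (same email), ids 3 and 4 (same normalized name).
console.log(findPotentialDuplicates(people).map((g) => g.map((p) => p.id)));
```

Records 3 and 4 are not linked to record 1 here because they carry no email, which is the kind of gap that leaves name variants unmerged.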
Working with Duplicates
Query All Variants
Disambiguate by Properties
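When several entities share a name, a property such as email can separate them. The following is a sketch under assumptions: `PersonCandidate` and `disambiguateByEmail` are hypothetical, the addresses are placeholders, and the candidates are assumed to have been fetched already.

```typescript
// Hypothetical, simplified candidate shape; not an SDK type.
interface PersonCandidate {
  id: string;
  name: string;
  email?: string;
}

// Keep only candidates whose email matches the known address (case-insensitive).
function disambiguateByEmail(candidates: PersonCandidate[], email: string): PersonCandidate[] {
  const target = email.trim().toLowerCase();
  return candidates.filter(
    (c) => c.email !== undefined && c.email.trim().toLowerCase() === target
  );
}

const candidates: PersonCandidate[] = [
  { id: "1", name: "Kirk", email: "kirk@example.com" },
  { id: "2", name: "Kirk", email: "kirk@example.org" },
  { id: "3", name: "Kirk" },
];

// Only the candidate with id "1" matches.
console.log(disambiguateByEmail(candidates, "Kirk@Example.com").map((c) => c.id));
```

Candidates without the property are excluded, not guessed at, so an empty result means the property alone cannot settle the question.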
Best Practices
1. Use Unique Identifiers
When available, use email (Person) or URL (Organization):
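For organizations, matching on the URL's domain is usually more stable than matching on the name. The sketch below assumes a plain `OrgRecord` shape and uses a simple regular expression to extract the host. Both `domainOf` and `findByDomain` are hypothetical helpers, and the regex is a simplification that does not cover every URL form.

```typescript
// Hypothetical, simplified organization shape; not an SDK type.
interface OrgRecord {
  id: string;
  name: string;
  url?: string;
}

// Extract a lowercase host without scheme, "www." prefix, port, or path.
function domainOf(url: string): string | undefined {
  const match = url.trim().toLowerCase().match(/^(?:https?:\/\/)?(?:www\.)?([^\/:?#]+)/);
  return match ? match[1] : undefined;
}

// Return organizations whose URL resolves to the same domain as the given URL.
function findByDomain(orgs: OrgRecord[], url: string): OrgRecord[] {
  const target = domainOf(url);
  if (target === undefined) return [];
  return orgs.filter((o) => o.url !== undefined && domainOf(o.url) === target);
}

const orgs: OrgRecord[] = [
  { id: "1", name: "Graphlit", url: "https://www.graphlit.com" },
  { id: "2", name: "Graphlit Inc.", url: "graphlit.com/about" },
  { id: "3", name: "Example Org", url: "https://example.org" },
];

// Matches ids "1" and "2" despite the different names.
console.log(findByDomain(orgs, "graphlit.com").map((o) => o.id));
```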
2. Aggregate Across Variants
Combine mentions from all duplicate entities:
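A minimal sketch of client-side aggregation, assuming each variant has already been fetched along with a mention count. `VariantRecord`, its `mentionCount` field, and `aggregateVariants` are hypothetical and stand in for whatever per-entity counts your queries return.

```typescript
// Hypothetical shape: one record per duplicate entity, with its mention count.
interface VariantRecord {
  id: string;
  name: string;
  mentionCount: number;
}

// Combined view over all variants of what is really one entity.
interface AggregatedEntity {
  ids: string[];
  names: string[];
  totalMentions: number;
}

// Sum mentions and collect the ids and names of every variant.
function aggregateVariants(variants: VariantRecord[]): AggregatedEntity {
  return {
    ids: variants.map((v) => v.id),
    names: variants.map((v) => v.name),
    totalMentions: variants.reduce((sum, v) => sum + v.mentionCount, 0),
  };
}

const variants: VariantRecord[] = [
  { id: "1", name: "Kirk Marple", mentionCount: 10 },
  { id: "2", name: "Kirk", mentionCount: 3 },
];

console.log(aggregateVariants(variants).totalMentions); // 13
```

Keeping the original ids in the aggregate makes it possible to query each variant again later without re-running the duplicate search.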
3. Normalize in UI
Display normalized names in UI:
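One simple display rule, offered as an assumption and not as a Graphlit convention, is to show the longest cleaned-up variant, since it is usually the most complete form. `displayName` below is a hypothetical helper.

```typescript
// Hypothetical helper: pick a single label for a set of name variants.
// Rule (an assumption): trim and collapse whitespace, then take the longest
// variant; on ties the earlier variant wins.
function displayName(variantNames: string[]): string {
  const cleaned = variantNames
    .map((n) => n.trim().replace(/\s+/g, " "))
    .filter((n) => n.length > 0);
  return cleaned.reduce((best, n) => (n.length > best.length ? n : best), "");
}

console.log(displayName(["Kirk", "Kirk Marple", " kirk marple "])); // "Kirk Marple"
```

An empty input yields an empty string, so callers can fall back to a placeholder label.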
Future Improvements
Roadmap Items (not yet available):
Manual entity merging API
More sophisticated entity resolution
Cross-source entity linking
Entity resolution confidence scores
Current Limitations:
No API to manually merge duplicates
Some race conditions create duplicates
Name variants not always linked
Workarounds:
Query all variants and aggregate
Use unique identifiers (email, URL)
Filter by properties to disambiguate
Developer Hints
Duplicates are rare but possible
Most common after parallel batch ingestion
Email/URL properties help disambiguation
Query by identifier, not just name
Future releases will improve resolution
Not a critical issue for most use cases