Understanding Entity Deduplication
User Intent
"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"
Operation
Concept: Entity resolution and deduplication
SDK Methods: queryObservables() for finding potential duplicates
Entity: Observable deduplication behavior
Prerequisites
Knowledge graph with entities
Understanding of Observable model
Multiple content sources with entity mentions
How Deduplication Works
Automatic Deduplication
At Creation Time:
// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations
Deduplication Strategies:
Exact Name Match: "Kirk Marple" = "Kirk Marple"
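Email Matching (for Person): the same address ([email protected]) is always treated as the same person
URL Matching (for Organization): entities that share a domain (for example graphlit.com) are treated as the same organization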
Normalization: Case-insensitive, whitespace trimming
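The strategies above can be sketched as a single match key. This is only an illustration of the idea, not Graphlit's internal implementation: the `EntityMention` shape, the helper names, and the address kirk@example.com are all hypothetical.

```typescript
// Hypothetical shape for illustration; not the Graphlit SDK's Observable type.
interface EntityMention {
  type: 'Person' | 'Organization';
  name: string;
  email?: string;
  url?: string;
}

// Collapse case and whitespace so "  Kirk   MARPLE " and "Kirk Marple" compare equal.
function normalizeName(name: string): string {
  return name.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Reduce a URL to its bare host: drop scheme, leading "www.", port, path, and query.
function normalizeDomain(url: string): string {
  const host = url
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, '')
    .replace(/^www\./, '')
    .split(/[/?#:]/)[0];
  return host ?? '';
}

// Build a match key: strong identifiers first, normalized name as the fallback.
function dedupKey(m: EntityMention): string {
  if (m.type === 'Person' && m.email) {
    return `person:email:${m.email.trim().toLowerCase()}`;
  }
  if (m.type === 'Organization' && m.url) {
    const domain = normalizeDomain(m.url);
    if (domain) return `organization:domain:${domain}`;
  }
  return `${m.type.toLowerCase()}:name:${normalizeName(m.name)}`;
}

// Two spellings of the same name map to one key.
console.log(dedupKey({ type: 'Person', name: 'Kirk Marple' }) ===
  dedupKey({ type: 'Person', name: '  kirk   MARPLE ' }));
```

One consequence of a key like this: a mention that carries an email and one that carries only a name get different keys, which is one way "Kirk Marple" and "Kirk" can end up as separate entities.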
Race Conditions
Problem: Parallel processing can create duplicates
// Document 1 and Document 2 processed simultaneously
// Both mention "Kirk Marple" for first time
// May create 2 separate Observables before deduplication runs
When This Happens:
Multiple feeds syncing in parallel
Batch ingestion of many documents
High-frequency entity creation
Future Improvement: More robust entity resolution is a roadmap item.
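The race described above is the classic check-then-create problem. The sketch below reproduces it with a hypothetical in-memory `EntityStore` (not a Graphlit API) and shows one common client-side mitigation: sharing a single in-flight promise per normalized name. It says nothing about how Graphlit handles this server-side.

```typescript
// Hypothetical in-memory stand-in for an entity store; not a Graphlit API.
class EntityStore {
  private byName = new Map<string, string>(); // normalized name -> entity id
  created = 0;

  async find(name: string): Promise<string | undefined> {
    await Promise.resolve(); // simulate an I/O round trip
    return this.byName.get(name.toLowerCase());
  }

  async create(name: string): Promise<string> {
    await Promise.resolve(); // simulate an I/O round trip
    const id = `entity-${++this.created}`;
    this.byName.set(name.toLowerCase(), id);
    return id;
  }
}

// Naive check-then-create: two concurrent callers can both see "not found".
async function naiveGetOrCreate(store: EntityStore, name: string): Promise<string> {
  return (await store.find(name)) ?? (await store.create(name));
}

// Mitigation sketch: concurrent callers for the same name share one pending request.
const inFlight = new Map<string, Promise<string>>();

function safeGetOrCreate(store: EntityStore, name: string): Promise<string> {
  const key = name.toLowerCase();
  let pending = inFlight.get(key);
  if (!pending) {
    pending = naiveGetOrCreate(store, name);
    inFlight.set(key, pending);
    const clear = () => { inFlight.delete(key); };
    pending.then(clear, clear); // forget the entry once the request settles
  }
  return pending;
}
```

With the naive version, `Promise.all` over two calls for "Kirk Marple" creates two entities, because both lookups finish before either create. The in-flight map only helps inside a single process; parallel feeds running on separate workers would still need server-side resolution.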
Finding Duplicates
Query Similar Entities
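import { Graphlit } from 'graphlit-client';
// ObservableTypes is used in the filters below
import { ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();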
// Find all "Kirk" variants
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});
console.log(`Found ${kirkEntities.observables.results.length} entities matching "Kirk"`);
kirkEntities.observables.results.forEach(entity => {
  console.log(` - ${entity.observable.name} (ID: ${entity.observable.id})`);
  console.log(` Email: ${entity.observable.properties?.email || 'N/A'}`);
});
Identify Potential Duplicates
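// Minimal structural type for a query result, inferred from the fields used in this guide
type ObservableResult = { observable: { id: string; name: string } };

function findPotentialDuplicates(entities: ObservableResult[]): Map<string, ObservableResult[]> {
  const groups = new Map<string, ObservableResult[]>();
  entities.forEach(entity => {
    // Group by normalized name (case-insensitive, trimmed)
    const normalized = entity.observable.name.toLowerCase().trim();
    if (!groups.has(normalized)) {
      groups.set(normalized, []);
    }
    groups.get(normalized)!.push(entity);
  });
  // Return only groups with more than one entity. This catches exact-name
  // duplicates only; "Kirk" and "Kirk Marple" land in different groups.
  return new Map(
    Array.from(groups.entries()).filter(([_, group]) => group.length > 1)
  );
}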
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});
const duplicates = findPotentialDuplicates(allPeople.observables.results);
console.log(`Found ${duplicates.size} potential duplicate groups`);
duplicates.forEach((group, name) => {
  console.log(`\n${name}:`);
  group.forEach(entity => {
    console.log(` - ID: ${entity.observable.id}`);
  });
});
Working with Duplicates
Query All Variants
// Find content mentioning ANY variant of entity
const kirkVariants = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});
const allKirkContent = await Promise.all(
  kirkVariants.observables.results.map(variant =>
    graphlit.queryContents({
      filter: {
        observations: [{
          type: ObservableTypes.Person,
          observable: { id: variant.observable.id }
        }]
      }
    })
  )
);
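// Count each content item once, even if it mentions more than one variant
const uniqueContentIds = new Set(
  allKirkContent.flatMap(result => result.contents.results.map(content => content.id))
);
console.log(`Total content mentioning Kirk: ${uniqueContentIds.size}`);
Disambiguate by Properties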
// Use email or other properties to identify correct entity
const kirkWithEmail = kirkVariants.observables.results.find(
  entity => entity.observable.properties?.email === '[email protected]'
);
if (kirkWithEmail) {
  console.log(`Canonical Kirk Marple: ${kirkWithEmail.observable.id}`);
}
Best Practices
1. Use Unique Identifiers
When available, use email (Person) or URL (Organization):
// Search by email for Person entities
const person = await graphlit.queryObservables({
  search: "[email protected]",
  filter: { types: [ObservableTypes.Person] }
});
2. Aggregate Across Variants
Combine mentions from all duplicate entities:
const allVariants = await graphlit.queryObservables({
  search: "Kirk Marple OR Kirk",
  filter: { types: [ObservableTypes.Person] }
});
// Aggregate data from all variants
3. Normalize in UI
Display normalized names in UI:
function normalizeEntityName(name: string): string {
  // "kirk marple" and "Kirk Marple" → "Kirk Marple"
  // Trim and split on any whitespace run so extra spaces do not survive
  return name.trim().split(/\s+/)
    .map(word => word.charAt(0).toUpperCase() + word.slice(1).toLowerCase())
    .join(' ');
}
Future Improvements
Roadmap Items (not yet available):
Manual entity merging API
More sophisticated entity resolution
Cross-source entity linking
Entity resolution confidence scores
Current Limitations:
No API to manually merge duplicates
Some race conditions create duplicates
Name variants not always linked
Workarounds:
Query all variants and aggregate
Use unique identifiers (email, URL)
Filter by properties to disambiguate
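Taken together, the workarounds above amount to a client-side merge. The sketch below is one possible heuristic, not a Graphlit feature: `EntityRecord`, `MergedEntity`, and `mergeVariants` are hypothetical names, and query results would need to be mapped into the flattened `EntityRecord` shape first. It merges on email when available and falls back to name-token containment ("Kirk" inside "Kirk Marple") only when exactly one candidate matches.

```typescript
// Hypothetical flattened shape; map query results into it before calling.
interface EntityRecord {
  id: string;
  name: string;
  email?: string;
}

interface MergedEntity {
  canonicalName: string;
  ids: string[];
  email: string | undefined;
}

const tokens = (name: string): string[] =>
  name.trim().toLowerCase().split(/\s+/).filter(t => t.length > 0);

// True when every token of `short` appears in `long` ("kirk" vs "kirk marple").
function isNameVariant(short: string, long: string): boolean {
  const longTokens = tokens(long);
  const shortTokens = tokens(short);
  return shortTokens.length > 0 && shortTokens.every(t => longTokens.indexOf(t) !== -1);
}

function mergeVariants(records: EntityRecord[]): MergedEntity[] {
  const merged: MergedEntity[] = [];
  // Longest names first so the fullest form becomes the canonical name.
  const sorted = records.slice().sort((a, b) => tokens(b.name).length - tokens(a.name).length);
  for (const rec of sorted) {
    const email = rec.email ? rec.email.trim().toLowerCase() : undefined;
    // Prefer an email match; fall back to a name match only if it is unambiguous.
    let target = email ? merged.find(m => m.email === email) : undefined;
    if (!target) {
      const candidates = merged.filter(m =>
        isNameVariant(rec.name, m.canonicalName) &&
        (!email || !m.email) // never merge two records with different emails
      );
      if (candidates.length === 1) target = candidates[0];
    }
    if (target) {
      target.ids.push(rec.id);
      target.email = target.email ?? email;
    } else {
      merged.push({ canonicalName: rec.name.trim(), ids: [rec.id], email });
    }
  }
  return merged;
}
```

The ambiguity check matters: if both "Kirk Marple" and "Kirk Douglas" exist, a bare "Kirk" is left unmerged rather than guessed. Token containment is a weak signal, so results are best treated as suggestions for review.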
Developer Hints
Duplicates are rare but possible
Most common after parallel batch ingestion
Email/URL properties help disambiguation
Query by identifier, not just name
Future releases will improve resolution
Not a critical issue for most use cases