Understanding Entity Deduplication

User Intent

"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"

Operation

Concept: Entity resolution and deduplication
SDK Methods: queryObservables() for finding potential duplicates
Entity: Observable deduplication behavior

Prerequisites

Knowledge graph with entities
Understanding of Observable model
Multiple content sources with entity mentions

How Deduplication Works

Automatic Deduplication

At Creation Time:

// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations

Deduplication Strategies:

Exact Name Match: "Kirk Marple" = "Kirk Marple"
Email Matching (for Person): the same email address ([email protected]) always resolves to the same person
URL Matching (for Organization): the same domain (e.g., graphlit.com) resolves to the same organization
Normalization: Case-insensitive, whitespace trimming
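
The strategies above can be pictured as building a comparison key for each extracted mention. The following is a minimal sketch of that idea, not Graphlit's actual implementation: the `EntityMention` shape and the `normalizeName` and `dedupKey` helpers are hypothetical names introduced only for illustration.

```typescript
// Hypothetical shape for an extracted entity mention; not a Graphlit SDK type.
interface EntityMention {
  type: "Person" | "Organization";
  name: string;
  email?: string; // meaningful for Person
  url?: string;   // meaningful for Organization
}

// Collapse case and whitespace so "  kirk   MARPLE " and "Kirk Marple" compare equal.
function normalizeName(name: string): string {
  return name.trim().replace(/\s+/g, " ").toLowerCase();
}

// Two mentions with the same key would be treated as one entity in this sketch.
// Strong identifiers (email, URL host) take precedence over the name.
function dedupKey(mention: EntityMention): string {
  if (mention.type === "Person" && mention.email) {
    return `person:email:${mention.email.trim().toLowerCase()}`;
  }
  if (mention.type === "Organization" && mention.url) {
    try {
      const host = new URL(mention.url).hostname.toLowerCase().replace(/^www\./, "");
      return `organization:host:${host}`;
    } catch {
      // Not a parseable URL; fall through to name matching.
    }
  }
  return `${mention.type.toLowerCase()}:name:${normalizeName(mention.name)}`;
}
```

Under these assumptions, "Kirk" and "Kirk Marple" produce different keys when neither mention carries an email, which is one way the name variants in the question above can end up as separate entities.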

Race Conditions

Problem: Parallel processing can create duplicates

// Document 1 and Document 2 processed simultaneously
// Both mention "Kirk Marple" for first time
// May create 2 separate Observables before deduplication runs

When This Happens:

Multiple feeds syncing in parallel
Batch ingestion of many documents
High-frequency entity creation

Future Improvement: More robust entity resolution is a roadmap item
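
To make the race concrete, here is a small self-contained simulation of a check-then-create pattern. It is only a sketch of the general failure mode, not a description of Graphlit internals: `EntityStore`, `getOrCreate`, and `simulate` are hypothetical names, and the in-memory store stands in for any asynchronous backend.

```typescript
// In-memory stand-in for an entity store; purely illustrative.
class EntityStore {
  private rows: { id: number; name: string }[] = [];

  async findByName(name: string): Promise<number | undefined> {
    await Promise.resolve(); // simulate an asynchronous lookup
    return this.rows.find(r => r.name === name)?.id;
  }

  async insert(name: string): Promise<number> {
    await Promise.resolve(); // simulate an asynchronous write
    const id = this.rows.length + 1;
    this.rows.push({ id, name });
    return id;
  }

  count(name: string): number {
    return this.rows.filter(r => r.name === name).length;
  }
}

// Check-then-create: correct when calls run one after another,
// racy when two callers both finish the lookup before either one inserts.
async function getOrCreate(store: EntityStore, name: string): Promise<number> {
  const existing = await store.findByName(name);
  if (existing !== undefined) return existing;
  return store.insert(name);
}

// Returns how many "Kirk Marple" rows exist after two documents mention him.
async function simulate(parallel: boolean): Promise<number> {
  const store = new EntityStore();
  if (parallel) {
    await Promise.all([
      getOrCreate(store, "Kirk Marple"),
      getOrCreate(store, "Kirk Marple"),
    ]);
  } else {
    await getOrCreate(store, "Kirk Marple");
    await getOrCreate(store, "Kirk Marple");
  }
  return store.count("Kirk Marple");
}
```

In this simulation, the sequential run leaves one row, while the parallel run leaves two because both lookups complete before either insert happens.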

Finding Duplicates

Query Similar Entities

import { Graphlit } from 'graphlit-client';
import { ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Find all "Kirk" variants
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

console.log(`Found ${kirkEntities.observables.results.length} entities matching "Kirk"`);

kirkEntities.observables.results.forEach(entity => {
  console.log(`  - ${entity.observable.name} (ID: ${entity.observable.id})`);
  console.log(`    Email: ${entity.observable.properties?.email || 'N/A'}`);
});

Identify Potential Duplicates

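// Minimal structural type for this example (assumed shape of each query result):
// the entity itself is nested under `observable`
type ObservableResult = { observable: { id: string; name?: string | null } };

function findPotentialDuplicates(entities: ObservableResult[]): Map<string, ObservableResult[]> {
  const groups = new Map<string, ObservableResult[]>();
  
  entities.forEach(entity => {
    const normalized = (entity.observable.name ?? '').toLowerCase().trim();
    if (!normalized) return; // skip unnamed entities
    
    // Group by normalized name
    if (!groups.has(normalized)) {
      groups.set(normalized, []);
    }
    groups.get(normalized)!.push(entity);
  });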
  
  // Return only groups with duplicates
  return new Map(
    Array.from(groups.entries()).filter(([_, group]) => group.length > 1)
  );
}

const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

const duplicates = findPotentialDuplicates(allPeople.observables.results);

console.log(`Found ${duplicates.size} potential duplicate groups`);
duplicates.forEach((group, name) => {
  console.log(`\n${name}:`);
  group.forEach(entity => {
    console.log(`  - ID: ${entity.observable.id}`);
  });
});
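
Grouping by the exact normalized name will not pair "Kirk" with "Kirk Marple", which is the case raised in the question at the top of this page. One possible heuristic is to treat two names as variants when every word of the shorter name appears in the longer one. The sketch below illustrates that idea; `NamedEntity`, `isNameVariant`, and `findVariantPairs` are hypothetical helpers, not part of the SDK, and the heuristic will produce false positives.

```typescript
// Minimal entity shape for this sketch; hypothetical, not an SDK type.
interface NamedEntity {
  id: string;
  name: string;
}

// Lowercase a name and split it into words.
function tokens(name: string): string[] {
  return name.toLowerCase().trim().split(/\s+/).filter(t => t.length > 0);
}

// True when every word of the shorter name appears in the longer one,
// e.g. "Kirk" vs "Kirk Marple". Identical names also count as variants.
function isNameVariant(a: string, b: string): boolean {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return false;
  const [shorter, longer] = ta.length <= tb.length ? [ta, tb] : [tb, ta];
  const longerSet = new Set(longer);
  return shorter.every(t => longerSet.has(t));
}

// Compare every pair of entities and keep the pairs that look like variants.
function findVariantPairs(entities: NamedEntity[]): [NamedEntity, NamedEntity][] {
  const pairs: [NamedEntity, NamedEntity][] = [];
  for (let i = 0; i < entities.length; i++) {
    for (let j = i + 1; j < entities.length; j++) {
      if (isNameVariant(entities[i].name, entities[j].name)) {
        pairs.push([entities[i], entities[j]]);
      }
    }
  }
  return pairs;
}
```

Because "Kirk" would also match "Kirk Douglas" under this rule, the pairs are best treated as candidates for review and confirmed with a stronger signal such as email. The pairwise comparison is quadratic in the number of entities, so it suits a filtered result set better than the whole graph.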

Working with Duplicates

Query All Variants

// Find content mentioning ANY variant of an entity
const kirkVariants = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

const allKirkContent = await Promise.all(
  kirkVariants.observables.results.map(variant =>
    graphlit.queryContents({
      filter: {
        observations: [{
          type: ObservableTypes.Person,
          observable: { id: variant.observable.id }
        }]
      }
    })
  )
);

// Count each content item once, even if it mentions several variants
const uniqueContentIds = new Set(
  allKirkContent.flatMap(result => result.contents.results.map(content => content.id))
);

console.log(`Total content mentioning Kirk: ${uniqueContentIds.size}`);

Disambiguate by Properties

// Use email or other properties to identify correct entity
const kirkWithEmail = kirkVariants.observables.results.find(
  entity => entity.observable.properties?.email === '[email protected]'
);

if (kirkWithEmail) {
  console.log(`Canonical Kirk Marple: ${kirkWithEmail.observable.id}`);
}

Best Practices

1. Use Unique Identifiers

When available, use email (Person) or URL (Organization):

// Search by email for Person entities
const person = await graphlit.queryObservables({
  search: "[email protected]",
  filter: { types: [ObservableTypes.Person] }
});

2. Aggregate Across Variants

Combine mentions from all duplicate entities:

const allVariants = await graphlit.queryObservables({
  search: "Kirk Marple OR Kirk",
  filter: { types: [ObservableTypes.Person] }
});

// Aggregate data from all variants
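
The aggregation step is left open above. One way to approach it on the client side is to fold the variants into a single merged view that keeps every ID and picks one variant as canonical. This is a sketch under stated assumptions: `Variant`, `MergedEntity`, and `mergeVariants` are hypothetical names, and the rule for choosing the canonical variant (prefer one with an email, then the longest name) is an illustrative choice, not Graphlit behavior.

```typescript
// Hypothetical flattened view of a query result; not an SDK type.
interface Variant {
  id: string;
  name: string;
  email?: string;
}

interface MergedEntity {
  canonicalId: string;
  canonicalName: string;
  ids: string[];     // every variant ID, for querying content across all of them
  aliases: string[]; // distinct names other than the canonical one
}

function mergeVariants(variants: Variant[]): MergedEntity | undefined {
  if (variants.length === 0) return undefined;

  // Assumed rule: a variant with an email wins; ties go to the longest name.
  const canonical = [...variants].sort((a, b) => {
    const byEmail = Number(Boolean(b.email)) - Number(Boolean(a.email));
    return byEmail !== 0 ? byEmail : b.name.length - a.name.length;
  })[0];

  const canonicalName = canonical.name.trim();
  const aliases = Array.from(new Set(variants.map(v => v.name.trim())))
    .filter(n => n !== canonicalName);

  return {
    canonicalId: canonical.id,
    canonicalName,
    ids: variants.map(v => v.id),
    aliases,
  };
}
```

The `ids` array can then drive one content query per variant, while the UI shows only `canonicalName`.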

3. Normalize in UI

Display normalized names in the UI:

function normalizeEntityName(name: string): string {
  // "kirk marple" and "Kirk  Marple " → "Kirk Marple"
  // Trim and split on any whitespace run so stray spaces do not survive
  return name.trim().split(/\s+/)
    .map(word => word.charAt(0).toUpperCase() + word.slice(1).toLowerCase())
    .join(' ');
}

Future Improvements

Roadmap Items (not yet available):

Manual entity merging API
More sophisticated entity resolution
Cross-source entity linking
Entity resolution confidence scores

Current Limitations:

No API to manually merge duplicates
Some race conditions create duplicates
Name variants not always linked

Workarounds:

Query all variants and aggregate
Use unique identifiers (email, URL)
Filter by properties to disambiguate

Developer Hints

Duplicates are rare but possible
Most common after parallel batch ingestion
Email/URL properties help disambiguation
Query by identifier, not just name
Future releases will improve resolution
Not a critical issue for most use cases

