Understanding Entity Deduplication

User Intent

"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"

Operation

Concept: Entity resolution and deduplication

SDK Methods: queryObservables() for finding potential duplicates

Entity: Observable deduplication behavior

Prerequisites

  • Knowledge graph with entities

  • Understanding of Observable model

  • Multiple content sources with entity mentions


How Deduplication Works

Automatic Deduplication

At Creation Time:

// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations

Deduplication Strategies:

  1. Exact Name Match: "Kirk Marple" = "Kirk Marple"

  2. Email Matching (for Person): the same email address always resolves to the same person

  3. URL Matching (for Organization): organizations that share a URL or domain (for example, graphlit.com) are treated as one entity

  4. Normalization: Case-insensitive, whitespace trimming
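
The strategies above can be sketched as a function that derives a comparison key for each extracted entity. This is an illustration only, not Graphlit's internal implementation: `EntityCandidate`, `normalizeName`, and `dedupKey` are hypothetical names, and the choice to let an email or URL host take precedence over the name is an assumption.

```typescript
// Hypothetical shape for an extracted entity mention (not a Graphlit SDK type).
interface EntityCandidate {
  type: "Person" | "Organization";
  name: string;
  email?: string;
  url?: string;
}

// Collapse case and whitespace so "  Kirk   MARPLE " and "kirk marple" compare equal.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().split(/\s+/).join(" ");
}

// Derive a key; two candidates with the same key would be treated as one entity.
// Assumption: a unique identifier (email, URL host) takes precedence over the name.
function dedupKey(c: EntityCandidate): string {
  if (c.type === "Person" && c.email) {
    return `person:email:${c.email.trim().toLowerCase()}`;
  }
  if (c.type === "Organization" && c.url) {
    // Reduce "https://www.graphlit.com/about" to "graphlit.com".
    const host = c.url
      .trim()
      .toLowerCase()
      .replace(/^[a-z]+:\/\//, "")
      .replace(/^www\./, "")
      .split("/")[0];
    return `organization:host:${host}`;
  }
  return `${c.type.toLowerCase()}:name:${normalizeName(c.name)}`;
}
```

Under this sketch, "Kirk Marple" and "kirk  marple" share a key, while "Kirk" and "Kirk Marple" do not unless both carry the same email. That is consistent with the name-variant duplicates described in this guide.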

Race Conditions

Problem: Parallel processing can create duplicates

// Document 1 and Document 2 processed simultaneously
// Both mention "Kirk Marple" for first time
// May create 2 separate Observables before deduplication runs

When This Happens:

  • Multiple feeds syncing in parallel

  • Batch ingestion of many documents

  • High-frequency entity creation
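
The race can be reproduced in miniature with a check-then-create sequence. The sketch below is a simulation, not Graphlit code: `NaiveEntityStore` is a hypothetical in-memory store, and the two "workers" are interleaved by hand to show the ordering that causes the problem.

```typescript
// In-memory stand-in for an entity store (hypothetical; not a Graphlit API).
class NaiveEntityStore {
  private records: { id: number; name: string }[] = [];

  exists(name: string): boolean {
    return this.records.some(r => r.name === name);
  }

  create(name: string): number {
    const id = this.records.length + 1;
    this.records.push({ id, name });
    return id;
  }

  count(name: string): number {
    return this.records.filter(r => r.name === name).length;
  }
}

const raceStore = new NaiveEntityStore();

// Step 1: both workers check before either has written anything.
const workerASeesEntity = raceStore.exists("Kirk Marple"); // false
const workerBSeesEntity = raceStore.exists("Kirk Marple"); // false

// Step 2: each worker acts on its own stale check and creates a record.
if (!workerASeesEntity) raceStore.create("Kirk Marple");
if (!workerBSeesEntity) raceStore.create("Kirk Marple");

console.log(raceStore.count("Kirk Marple")); // 2: the same person now exists twice
```

Neither worker does anything wrong in isolation. The duplicate appears because the check and the create are not a single atomic step.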

Future Improvement: More robust entity resolution is a roadmap item.


Finding Duplicates

Query Similar Entities

import { Graphlit } from 'graphlit-client';
import { ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Find all "Kirk" variants
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

console.log(`Found ${kirkEntities.observables.results.length} entities matching "Kirk"`);

kirkEntities.observables.results.forEach(entity => {
  console.log(`  - ${entity.observable.name} (ID: ${entity.observable.id})`);
  console.log(`    Email: ${entity.observable.properties?.email || 'N/A'}`);
});

Identify Potential Duplicates

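// Structural type for the items in observables.results
// (assumed shape, based on how the results are accessed in this guide)
type ObservableResult = { observable: { id: string; name: string } };

function findPotentialDuplicates(entities: ObservableResult[]): Map<string, ObservableResult[]> {
  const groups = new Map<string, ObservableResult[]>();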
  
  entities.forEach(entity => {
    const normalized = entity.observable.name.toLowerCase().trim();
    
    // Group by normalized name
    if (!groups.has(normalized)) {
      groups.set(normalized, []);
    }
    groups.get(normalized)!.push(entity);
  });
  
  // Return only groups with duplicates
  return new Map(
    Array.from(groups.entries()).filter(([_, group]) => group.length > 1)
  );
}

const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

const duplicates = findPotentialDuplicates(allPeople.observables.results);

console.log(`Found ${duplicates.size} potential duplicate groups`);
duplicates.forEach((group, name) => {
  console.log(`\n${name}:`);
  group.forEach(entity => {
    console.log(`  - ID: ${entity.observable.id}`);
  });
});
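
The exact-match grouping above will not flag "Kirk" and "Kirk Marple" as related, because their normalized names differ. One possible heuristic, sketched below, treats two names as likely variants when every token of the shorter name also appears in the longer one. `NamedEntity`, `nameTokens`, `isLikelyVariant`, and `findVariantPairs` are hypothetical helpers, not part of the SDK, and the heuristic will produce false positives (two different people named Kirk), so its output is best treated as candidates for review.

```typescript
// Minimal structural type for this sketch (hypothetical; not the SDK's Observable type).
interface NamedEntity {
  id: string;
  name: string;
}

// Lowercase and split a name into tokens: "Kirk  Marple" -> ["kirk", "marple"].
function nameTokens(name: string): string[] {
  return name.trim().toLowerCase().split(/\s+/).filter(t => t.length > 0);
}

// True when every token of the shorter name also appears in the longer name.
function isLikelyVariant(a: string, b: string): boolean {
  const ta = nameTokens(a);
  const tb = nameTokens(b);
  if (ta.length === 0 || tb.length === 0) return false;
  const shorter = ta.length <= tb.length ? ta : tb;
  const longer = shorter === ta ? tb : ta;
  return shorter.every(token => longer.indexOf(token) !== -1);
}

// Return every pair of entities whose names look like variants of each other.
function findVariantPairs(entities: NamedEntity[]): [NamedEntity, NamedEntity][] {
  const pairs: [NamedEntity, NamedEntity][] = [];
  entities.forEach((a, i) => {
    entities.slice(i + 1).forEach(b => {
      if (isLikelyVariant(a.name, b.name)) pairs.push([a, b]);
    });
  });
  return pairs;
}
```

The pairwise comparison is quadratic in the number of entities, which is acceptable for a review pass over a few thousand names but not for very large graphs.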

Working with Duplicates

Query All Variants

// Find content mentioning ANY variant of entity
const kirkVariants = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

const allKirkContent = await Promise.all(
  kirkVariants.observables.results.map(variant =>
    graphlit.queryContents({
      filter: {
        observations: [{
          type: ObservableTypes.Person,
          observable: { id: variant.observable.id }
        }]
      }
    })
  )
);

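// Deduplicate by content ID: one document may mention several variants,
// so summing the per-variant result counts would count it more than once
const uniqueContentIds = new Set<string>();
allKirkContent.forEach(result => {
  result.contents?.results?.forEach(content => {
    if (content?.id) uniqueContentIds.add(content.id);
  });
});

console.log(`Total content mentioning Kirk: ${uniqueContentIds.size}`);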

Disambiguate by Properties

// Use email or other properties to identify correct entity
const kirkWithEmail = kirkVariants.observables.results.find(
  entity => entity.observable.properties?.email === '[email protected]'
);

if (kirkWithEmail) {
  console.log(`Canonical Kirk Marple: ${kirkWithEmail.observable.id}`);
}

Best Practices

1. Use Unique Identifiers

When available, use email (Person) or URL (Organization):

// Search by email for Person entities
const person = await graphlit.queryObservables({
  search: "[email protected]",
  filter: { types: [ObservableTypes.Person] }
});

2. Aggregate Across Variants

Combine mentions from all duplicate entities:

const allVariants = await graphlit.queryObservables({
  search: "Kirk Marple OR Kirk",
  filter: { types: [ObservableTypes.Person] }
});

// Aggregate data from all variants
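
A minimal sketch of the aggregation step follows. It assumes the per-variant data has already been fetched (for example, content IDs gathered with queryContents for each variant); `VariantSummary`, `AggregatedEntity`, `preferred`, and `aggregateVariants` are hypothetical names, and the rule for picking the canonical variant (one with an email first, then the longest name) is an assumption, not Graphlit behavior.

```typescript
// Hypothetical per-variant summary built from earlier query results.
interface VariantSummary {
  id: string;
  name: string;
  email?: string;
  contentIds: string[];
}

interface AggregatedEntity {
  canonicalId: string;
  displayName: string;
  variantIds: string[];
  contentIds: string[]; // unique across all variants
}

// Assumption: prefer a variant that has an email; otherwise prefer the longer name.
function preferred(a: VariantSummary, b: VariantSummary): VariantSummary {
  if (Boolean(a.email) !== Boolean(b.email)) return a.email ? a : b;
  return b.name.length > a.name.length ? b : a;
}

function aggregateVariants(variants: VariantSummary[]): AggregatedEntity | undefined {
  if (variants.length === 0) return undefined;

  const canonical = variants.reduce((best, v) => preferred(best, v));

  // Collect content IDs once each, even if several variants share a document.
  const seen = new Set<string>();
  variants.forEach(v => v.contentIds.forEach(id => seen.add(id)));
  const contentIds: string[] = [];
  seen.forEach(id => contentIds.push(id));

  return {
    canonicalId: canonical.id,
    displayName: canonical.name,
    variantIds: variants.map(v => v.id),
    contentIds
  };
}
```

The merged view exists only in application code. The underlying Observables remain separate in the knowledge graph.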

3. Normalize in UI

Display normalized names in UI:

function normalizeEntityName(name: string): string {
  // "kirk marple" and "Kirk Marple" → "Kirk Marple"
  // Trim and split on any run of whitespace so extra spaces do not survive
  return name.trim().split(/\s+/)
    .map(word => word.charAt(0).toUpperCase() + word.slice(1).toLowerCase())
    .join(' ');
}

Future Improvements

Roadmap Items (not yet available):

  • Manual entity merging API

  • More sophisticated entity resolution

  • Cross-source entity linking

  • Entity resolution confidence scores

Current Limitations:

  • No API to manually merge duplicates

  • Some race conditions create duplicates

  • Name variants not always linked

Workarounds:

  • Query all variants and aggregate

  • Use unique identifiers (email, URL)

  • Filter by properties to disambiguate


Developer Hints

  • Duplicates are rare but possible

  • Most common after parallel batch ingestion

  • Email/URL properties help disambiguation

  • Query by identifier, not just name

  • Future releases will improve resolution

  • Not a critical issue for most use cases

