# Understanding Entity Deduplication

## User Intent

"How does Graphlit handle duplicate entities? Why do I sometimes see 'Kirk Marple' and 'Kirk' as separate entities?"

## Operation

**Concept**: Entity resolution and deduplication\
**SDK Methods**: `queryObservables()` for finding potential duplicates\
**Entity**: Observable deduplication behavior

## Prerequisites

* Knowledge graph with entities
* Understanding of Observable model
* Multiple content sources with entity mentions

***

## How Deduplication Works

### Automatic Deduplication

**At Creation Time**:

```typescript
// When entities are extracted, Graphlit attempts to deduplicate
// "Kirk Marple" mentioned in 10 documents → 1 Observable with 10 Observations
```

**Deduplication Strategies**:

1. **Exact Name Match**: "Kirk Marple" = "Kirk Marple"
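2. **Email Matching** (for Person): entities sharing an email address such as <kirk@graphlit.com> are treated as the same person
3. **URL Matching** (for Organization): entities sharing a URL domain such as graphlit.com are treated as the same organization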
4. **Normalization**: Case-insensitive, whitespace trimming
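
One way to picture these strategies is as a match key computed per entity: two mentions with the same key collapse into one entity. The sketch below is purely illustrative. `EntityCandidate`, `normalizeName`, `hostOf`, `matchKey`, and the priority order (email, then URL host, then normalized name) are assumptions made for this example, not Graphlit's actual implementation.

```typescript
// Hypothetical shape of an extracted entity mention (not a Graphlit type).
interface EntityCandidate {
  type: "Person" | "Organization";
  name: string;
  email?: string;
  url?: string;
}

// Case-insensitive, whitespace-trimmed name.
// Collapsing internal whitespace is an extra assumption of this sketch.
function normalizeName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Extract a lowercase host from a URL or bare domain, dropping a leading "www.".
function hostOf(url: string): string {
  const m = url
    .trim()
    .toLowerCase()
    .match(/^(?:[a-z][a-z0-9+.-]*:\/\/)?(?:www\.)?([^\/:?#]+)/);
  return m?.[1] ?? "";
}

// Two candidates with the same key would be treated as the same entity.
function matchKey(c: EntityCandidate): string {
  if (c.type === "Person" && c.email) {
    return `Person|email|${c.email.trim().toLowerCase()}`;
  }
  if (c.type === "Organization" && c.url) {
    const host = hostOf(c.url);
    if (host) return `Organization|host|${host}`;
  }
  // Fall back to the normalized name when no unique identifier is present.
  return `${c.type}|name|${normalizeName(c.name)}`;
}
```

Under this model, "Kirk" and "Kirk Marple" share a key only when both carry the same email. Without one, their normalized names differ, which is why name variants can surface as separate entities.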

### Race Conditions

**Problem**: Parallel processing can create duplicates

```typescript
// Document 1 and Document 2 processed simultaneously
// Both mention "Kirk Marple" for first time
// May create 2 separate Observables before deduplication runs
```

**When This Happens**:

* Multiple feeds syncing in parallel
* Batch ingestion of many documents
* High-frequency entity creation

**Future Improvement**: More robust entity resolution is on the roadmap.
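
The race is a classic check-then-create problem. The sketch below replays it deterministically with a hypothetical in-memory store (`InMemoryEntityStore` and its methods are invented for illustration and say nothing about Graphlit's internals): two documents both look up "Kirk Marple" before either has written, so each creates its own entity.

```typescript
interface StoredEntity {
  id: string;
  name: string;
}

// Hypothetical in-memory stand-in for an entity store.
class InMemoryEntityStore {
  readonly entities: StoredEntity[] = [];

  lookup(name: string): StoredEntity | undefined {
    const key = name.trim().toLowerCase();
    return this.entities.find(e => e.name.trim().toLowerCase() === key);
  }

  create(name: string): StoredEntity {
    const entity = { id: `obs-${this.entities.length + 1}`, name };
    this.entities.push(entity);
    return entity;
  }

  // Check and create as one uninterrupted step.
  getOrCreate(name: string): StoredEntity {
    return this.lookup(name) ?? this.create(name);
  }
}

// Racy interleaving: both documents check before either one writes.
const racy = new InMemoryEntityStore();
const seenByDoc1 = racy.lookup("Kirk Marple"); // nothing stored yet
const seenByDoc2 = racy.lookup("Kirk Marple"); // still nothing: doc 1 has not written
const fromDoc1 = seenByDoc1 ?? racy.create("Kirk Marple");
const fromDoc2 = seenByDoc2 ?? racy.create("Kirk Marple");
// racy.entities now holds two entries for the same person

// Atomic variant: the second call finds the entity created by the first.
const atomic = new InMemoryEntityStore();
atomic.getOrCreate("Kirk Marple");
atomic.getOrCreate("kirk marple ");
// atomic.entities holds a single entry
```

In a distributed pipeline the atomic step usually requires a lock or a uniqueness constraint, which is harder to provide under heavy parallel ingestion.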

***

## Finding Duplicates

### Query Similar Entities

```typescript
import { Graphlit } from 'graphlit-client';
import { ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Find all "Kirk" variants
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

console.log(`Found ${kirkEntities.observables.results.length} entities matching "Kirk"`);

kirkEntities.observables.results.forEach(entity => {
  console.log(`  - ${entity.observable.name} (ID: ${entity.observable.id})`);
  console.log(`    Email: ${entity.observable.properties?.email || 'N/A'}`);
});
```

### Identify Potential Duplicates

```typescript
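// Minimal shape of a queryObservables() result item, as used in this guide
type ObservableResult = { observable: { id: string; name?: string | null } };

function findPotentialDuplicates<T extends ObservableResult>(entities: T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  
  entities.forEach(entity => {
    const normalized = (entity.observable.name ?? '').toLowerCase().trim();
    if (!normalized) return; // skip unnamed entities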
    
    // Group by normalized name
    if (!groups.has(normalized)) {
      groups.set(normalized, []);
    }
    groups.get(normalized)!.push(entity);
  });
  
  // Return only groups with duplicates
  return new Map(
    Array.from(groups.entries()).filter(([_, group]) => group.length > 1)
  );
}

const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

const duplicates = findPotentialDuplicates(allPeople.observables.results);

console.log(`Found ${duplicates.size} potential duplicate groups`);
duplicates.forEach((group, name) => {
  console.log(`\n${name}:`);
  group.forEach(entity => {
    console.log(`  - ID: ${entity.observable.id}`);
  });
});
```
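
Grouping by normalized name catches "kirk marple" and "Kirk Marple", but not "Kirk" and "Kirk Marple". A minimal sketch of a looser check follows, assuming a token-subset heuristic: two names are candidate variants when every word of the shorter name appears in the longer one. `NamedEntity`, `tokens`, `isNameVariant`, and `findVariantPairs` are hypothetical helpers, not part of the SDK.

```typescript
// Minimal structural type for this sketch; real results may carry more fields.
interface NamedEntity {
  id: string;
  name: string;
}

// Lowercased words of a name, split on whitespace, periods, and commas.
function tokens(name: string): string[] {
  return name
    .toLowerCase()
    .split(/[\s.,]+/)
    .filter(t => t.length > 0);
}

// True when every token of the shorter name appears in the longer name.
function isNameVariant(a: string, b: string): boolean {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return false;
  const [shorter, longer] = ta.length <= tb.length ? [ta, tb] : [tb, ta];
  return shorter.every(t => longer.indexOf(t) !== -1);
}

// Compare every pair of entities; O(n^2), so best run on a filtered result set.
function findVariantPairs(entities: NamedEntity[]): Array<[NamedEntity, NamedEntity]> {
  const pairs: Array<[NamedEntity, NamedEntity]> = [];
  for (let i = 0; i < entities.length; i++) {
    for (let j = i + 1; j < entities.length; j++) {
      if (isNameVariant(entities[i].name, entities[j].name)) {
        pairs.push([entities[i], entities[j]]);
      }
    }
  }
  return pairs;
}
```

The heuristic is deliberately permissive: "Kirk" also matches "Kirk Douglas", so treat the pairs as candidates to review or to confirm with a property such as email, not as confirmed duplicates.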

***

## Working with Duplicates

### Query All Variants

```typescript
// Find content mentioning ANY variant of entity
const kirkVariants = await graphlit.queryObservables({
  search: "Kirk",
  filter: { types: [ObservableTypes.Person] }
});

const allKirkContent = await Promise.all(
  kirkVariants.observables.results.map(variant =>
    graphlit.queryContents({
      observations: [{
        type: ObservableTypes.Person,
        observable: { id: variant.observable.id }
      }]
    })
  )
);

// Note: content that mentions more than one variant is counted once per variant
const totalMentions = allKirkContent.reduce(
  (sum, result) => sum + result.contents.results.length,
  0
);

console.log(`Total content mentioning Kirk: ${totalMentions}`);
```
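
Summing result lengths counts a document once for every variant it mentions. To count distinct content instead, the per-variant result sets can be merged by content ID. This is a minimal sketch: `ContentRef` and `uniqueContents` are hypothetical names, and the only assumption about each content item is that it has an `id`.

```typescript
// Minimal structural type: anything with a content ID.
interface ContentRef {
  id: string;
  name?: string;
}

// Merge several result sets, keeping the first occurrence of each content ID.
function uniqueContents(resultSets: ContentRef[][]): ContentRef[] {
  const byId = new Map<string, ContentRef>();
  for (const results of resultSets) {
    for (const content of results) {
      if (!byId.has(content.id)) {
        byId.set(content.id, content);
      }
    }
  }
  return Array.from(byId.values());
}
```

Assuming the result shape used above, the call would look like `uniqueContents(allKirkContent.map(r => r.contents.results))`, and the length of the returned array is the number of distinct content items.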

### Disambiguate by Properties

```typescript
// Use email or other properties to identify correct entity
const kirkWithEmail = kirkVariants.observables.results.find(
  entity => entity.observable.properties?.email === 'kirk@graphlit.com'
);

if (kirkWithEmail) {
  console.log(`Canonical Kirk Marple: ${kirkWithEmail.observable.id}`);
}
```
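
When no variant matches a known email, an application still needs to pick one entity to treat as canonical. The sketch below assumes a simple client-side heuristic (prefer a variant that has an email, then the longest name) and records the choice in an alias map. `PersonEntity`, `isBetter`, `pickCanonical`, and `buildAliasMap` are hypothetical, and nothing here merges entities on the Graphlit side.

```typescript
// Minimal structural type for this sketch.
interface PersonEntity {
  id: string;
  name: string;
  email?: string;
}

// Prefer entities with an email; break ties by longer (more complete) name.
function isBetter(a: PersonEntity, b: PersonEntity): boolean {
  const aHasEmail = Boolean(a.email);
  const bHasEmail = Boolean(b.email);
  if (aHasEmail !== bHasEmail) return aHasEmail;
  return a.name.trim().length > b.name.trim().length;
}

// Pick the preferred variant, or undefined for an empty list.
function pickCanonical(variants: PersonEntity[]): PersonEntity | undefined {
  if (variants.length === 0) return undefined;
  return variants.reduce((best, current) => (isBetter(current, best) ? current : best));
}

// Map every variant ID (including the canonical one) to the canonical ID.
function buildAliasMap(variants: PersonEntity[]): Map<string, string> {
  const aliases = new Map<string, string>();
  const canonical = pickCanonical(variants);
  if (!canonical) return aliases;
  for (const variant of variants) {
    aliases.set(variant.id, canonical.id);
  }
  return aliases;
}
```

The alias map lives only in the application, so it has to be rebuilt or persisted by the caller.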

***

## Best Practices

### 1. Use Unique Identifiers

When available, use email (Person) or URL (Organization):

```typescript
// Search by email for Person entities
const person = await graphlit.queryObservables({
  search: "kirk@graphlit.com",
  filter: { types: [ObservableTypes.Person] }
});
```

### 2. Aggregate Across Variants

Combine mentions from all duplicate entities:

```typescript
// Assumes the search syntax accepts OR; searching "Kirk" alone (as above) is an alternative
const allVariants = await graphlit.queryObservables({
  search: "Kirk Marple OR Kirk",
  filter: { types: [ObservableTypes.Person] }
});

// Aggregate data from all variants

### 3. Normalize in UI

Display consistently cased names in the UI:

```typescript
function normalizeEntityName(name: string): string {
  // "kirk marple" and "Kirk Marple" → "Kirk Marple"
  // Trim and split on runs of whitespace so extra spaces do not produce empty words
  return name.trim().split(/\s+/)
    .map(word => word.charAt(0).toUpperCase() + word.slice(1).toLowerCase())
    .join(' ');
}
```

***

## Future Improvements

**Roadmap Items** (not yet available):

* Manual entity merging API
* More sophisticated entity resolution
* Cross-source entity linking
* Entity resolution confidence scores

**Current Limitations**:

* No API to manually merge duplicates
* Some race conditions create duplicates
* Name variants not always linked

**Workarounds**:

* Query all variants and aggregate
* Use unique identifiers (email, URL)
* Filter by properties to disambiguate

***

## Developer Hints

* Duplicates are rare but possible
* Most common after parallel batch ingestion
* Email/URL properties help disambiguation
* Query by identifier, not just name
* Future releases will improve resolution
* Not a critical issue for most use cases

***
