Use Case: Build Knowledge Graph from Emails

User Intent

"How do I extract entities from my Gmail or Outlook emails to build a knowledge graph? Show me how to connect contacts, organizations, and build relationship networks from email data."

Operation

SDK Methods: createWorkflow(), createFeed(), isFeedDone(), queryContents(), queryObservables()

GraphQL: Feed creation + entity extraction + relationship queries

Entity: Email Feed → Email Content → Observations → Observables (Contact Graph)

Prerequisites

  • Graphlit project with API credentials

  • Gmail or Microsoft 365 account

  • OAuth tokens for email access (via Graphlit Developer Portal)

  • Understanding of feed and workflow concepts


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FeedTypes,
  FeedServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  ContentTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from Emails ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "Email Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,          // Senders, recipients, mentions
          ObservableTypes.Organization,    // Companies from domains/signatures
          ObservableTypes.Event,           // Meeting mentions, deadlines
          ObservableTypes.Product,         // Products/services discussed
          ObservableTypes.Place            // Locations mentioned
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create Gmail feed with OAuth
console.log('Step 2: Creating Gmail feed...');
const feed = await graphlit.createFeed({
  name: "My Gmail",
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,            // Enum member name assumed; check FeedServiceTypes in your SDK version
    token: process.env.GOOGLE_OAUTH_TOKEN!,  // From Developer Portal
    readLimit: 100,                          // Number of emails to sync
    includeAttachments: true                 // Sync attachments too
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Feed: ${feed.createFeed.id}\n`);

// Step 3: Wait for email sync
console.log('Step 3: Syncing emails...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isFeedDone(feed.createFeed.id);
  isDone = status.isFeedDone?.result ?? false;  // Treat a missing result as "not done yet"
  
  if (!isDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 4: Query synced emails
console.log('Step 4: Querying synced emails...');
const emails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email],
    feeds: [{ id: feed.createFeed.id }]
  }
});

console.log(`✓ Synced ${emails.contents.results.length} emails\n`);

// Step 5: Analyze email metadata
console.log('Step 5: Analyzing email senders...\n');

const senders = new Map<string, number>();
emails.contents.results.forEach(email => {
  if (email.email?.from) {
    email.email.from.forEach(sender => {
      const email_addr = sender.email || 'unknown';
      senders.set(email_addr, (senders.get(email_addr) || 0) + 1);
    });
  }
});

console.log('Top email senders:');
Array.from(senders.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([email, count]) => {
    console.log(`  ${email}: ${count} emails`);
  });
console.log();

// Step 6: Query extracted entities
console.log('Step 6: Querying knowledge graph...\n');

// Get all people from emails
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`People extracted: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Organizations extracted: ${orgs.observables.results.length}\n`);

// Step 7: Build contact network
console.log('Step 7: Building contact network...\n');

// Email threads create person-to-person relationships
const contactNetwork = new Map<string, Set<string>>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0]?.email;
  const toList = email.email?.to?.map(t => t.email) || [];
  const ccList = email.email?.cc?.map(c => c.email) || [];
  
  // Drop empty addresses and narrow the element type to string
  const recipients = [...toList, ...ccList].filter((e): e is string => !!e);
  
  if (from && recipients.length > 0) {
    if (!contactNetwork.has(from)) {
      contactNetwork.set(from, new Set());
    }
    recipients.forEach(recipient => {
      contactNetwork.get(from)!.add(recipient);
    });
  }
});

console.log('Top email relationships:');
Array.from(contactNetwork.entries())
  .map(([from, to]) => ({ from, count: to.size }))
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ from, count }) => {
    console.log(`  ${from} → ${count} contacts`);
  });

console.log('\n✓ Knowledge graph complete!');

Run

The example uses top-level await, so run it as an ES module with a TypeScript runner such as tsx.


Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

Email-Specific Entity Types:

  • Person: Senders, recipients, people mentioned in body, signatures

  • Organization: Companies from email domains, mentioned in text, signatures

  • Event: Meetings, deadlines, calendar invites mentioned

  • Product: Products/services discussed in emails

  • Place: Locations mentioned (meeting locations, offices)

Why Text Extraction:

  • Emails are primarily text-based

  • No visual analysis needed (unlike PDFs)

  • Fast and cost-effective

  • Handles HTML email bodies

Step 2: Configure Email Feed

Gmail Feed Configuration:

Microsoft Outlook Feed:

OAuth Token Setup:

  1. Go to Graphlit Developer Portal

  2. Navigate to Connectors → Email

  3. Authorize Gmail or Outlook

  4. Copy OAuth token

  5. Use in feed creation

Step 3: Sync and Wait for Processing

Sync Timeline:

  • 100 emails: 1-2 minutes

  • 1,000 emails: 10-15 minutes

  • 10,000 emails: 1-2 hours

Polling Strategy:
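The fixed 5-second loop in the complete example is fine for small mailboxes. For larger syncs, a capped exponential backoff with an overall time budget may be gentler on the API. The sketch below is one way to do that; `PollOptions`, `backoffDelay`, and `pollUntilDone` are hypothetical helper names, not part of the Graphlit SDK, and `check` stands in for whatever completion call you use.

```typescript
// Hypothetical helpers, not part of the Graphlit SDK.
interface PollOptions {
  initialDelayMs: number; // delay after the first unsuccessful check
  maxDelayMs: number;     // upper bound for any single delay
  factor: number;         // multiplier applied after each attempt
  timeoutMs: number;      // total time budget spent waiting between checks
}

// Delay to wait after the given zero-based attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, opts: PollOptions): number {
  const raw = opts.initialDelayMs * Math.pow(opts.factor, attempt);
  return Math.min(raw, opts.maxDelayMs);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Calls `check` until it resolves to true or the time budget is exhausted.
// Resolves to true when done, false on timeout.
async function pollUntilDone(
  check: () => Promise<boolean>,
  opts: PollOptions
): Promise<boolean> {
  let waited = 0;
  for (let attempt = 0; ; attempt++) {
    if (await check()) return true;
    const delay = backoffDelay(attempt, opts);
    if (waited + delay > opts.timeoutMs) return false;
    await sleep(delay);
    waited += delay;
  }
}
```

With the SDK call from the complete example, `check` could be `async () => (await graphlit.isFeedDone(feedId)).isFeedDone?.result === true`.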

Step 4: Query Email Content

Email Metadata Structure:
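The complete example reads `from`, `to`, and `cc` on `content.email`, each a list of entries carrying an `email` address. The interfaces below mirror only those fields; they are assumed shapes inferred from that usage, not the SDK's generated types, and `participants` is a hypothetical helper.

```typescript
// Assumed shapes, inferred from the fields the complete example reads.
interface EmailAddress {
  email?: string | null;
}

interface EmailMetadata {
  from?: EmailAddress[] | null;
  to?: EmailAddress[] | null;
  cc?: EmailAddress[] | null;
}

interface EmailContent {
  id: string;
  email?: EmailMetadata | null;
}

// Collects the non-empty addresses from one list, lowercased.
function addressesOf(list?: EmailAddress[] | null): string[] {
  const out: string[] = [];
  (list || []).forEach(entry => {
    if (entry.email) out.push(entry.email.toLowerCase());
  });
  return out;
}

// Unique participants of one email, in from → to → cc order.
function participants(content: EmailContent): string[] {
  const meta = content.email;
  if (!meta) return [];
  const all = addressesOf(meta.from).concat(addressesOf(meta.to), addressesOf(meta.cc));
  return all.filter((addr, i) => all.indexOf(addr) === i);
}
```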

Step 5: Extract Entity Observations

Email Body Extraction:

Signature Extraction: Email signatures are rich sources of Person/Organization data:

Extracts: Person("Kirk Marple"), Organization("Graphlit")

Step 6: Build Contact Network

Email Threads Create Relationships:

  • from → to/cc: Direct communication

  • Frequency indicates relationship strength

  • Thread IDs group related emails

Network Analysis:
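Since frequency indicates relationship strength, one simple analysis is to weight each sender → recipient edge by the number of messages. The sketch below works on plain data you have already pulled out of the email metadata; `Message`, `Edge`, and `buildWeightedEdges` are hypothetical names for illustration.

```typescript
// Hypothetical shapes: one record per email, already reduced to addresses.
interface Message {
  from: string;
  recipients: string[]; // to + cc
}

interface Edge {
  from: string;
  to: string;
  weight: number; // number of recipient entries for this sender → recipient pair
}

// Counts sender → recipient occurrences and returns edges, strongest first.
function buildWeightedEdges(messages: Message[]): Edge[] {
  const edges = new Map<string, Edge>();
  messages.forEach(msg => {
    msg.recipients.forEach(to => {
      if (to === msg.from) return; // ignore self-sends
      const key = msg.from + "\u0000" + to;
      const edge = edges.get(key);
      if (edge) {
        edge.weight += 1;
      } else {
        edges.set(key, { from: msg.from, to, weight: 1 });
      }
    });
  });
  return Array.from(edges.values()).sort((a, b) => b.weight - a.weight);
}
```

Edges are directed here; summing the two directions of a pair would give an undirected strength instead.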

Step 7: Query Knowledge Graph

Cross-Feed Entity Queries: Entities from emails become part of the global knowledge graph:


Configuration Options

Limiting Email Sync Scope

By Count:

By Date Range:

By Labels/Folders:

Handling Attachments

Include Attachments:

Attachments are processed through the workflow:

  • PDFs → extraction → entities

  • Images → vision analysis → entities

  • Documents → text extraction → entities

Exclude Attachments (faster):
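The two settings that control sync scope in the complete example are `readLimit` and `includeAttachments`. A small builder can keep the "fast" and "full" variants in one place. This is a sketch: `EmailSyncOptions` and `emailSyncOptions` are hypothetical names, and the feed `type` field from the complete example is left out here and must still be supplied.

```typescript
// Hypothetical helper; field names are taken from the complete example.
interface EmailSyncOptions {
  token: string;
  readLimit: number;
  includeAttachments: boolean;
}

// "fast" skips attachments, "full" includes them.
function emailSyncOptions(
  token: string,
  mode: "fast" | "full",
  readLimit: number = 100
): EmailSyncOptions {
  if (!Number.isInteger(readLimit) || readLimit <= 0) {
    throw new RangeError("readLimit must be a positive integer");
  }
  return { token, readLimit, includeAttachments: mode === "full" };
}
```

The result can be spread into the `email` block of the feed input alongside the service `type`.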


Variations

Variation 1: Organization Email Domain Mapping

Extract organizations from email domains:
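A minimal sketch of domain mapping, assuming you already have a list of addresses (for example, the keys of the `senders` map in the complete example). `PUBLIC_DOMAINS` is an illustrative, incomplete list of consumer mail providers, and `domainOf` and `groupByDomain` are hypothetical helpers.

```typescript
// Illustrative only: consumer domains that do not identify an organization.
const PUBLIC_DOMAINS = ["gmail.com", "outlook.com", "yahoo.com"];

// Returns the lowercased domain of an address, or null if it has none.
function domainOf(address: string): string | null {
  const at = address.lastIndexOf("@");
  if (at < 1 || at === address.length - 1) return null;
  return address.slice(at + 1).toLowerCase();
}

// Groups addresses by organization domain, skipping public providers.
function groupByDomain(addresses: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  addresses.forEach(address => {
    const domain = domainOf(address);
    if (!domain || PUBLIC_DOMAINS.indexOf(domain) !== -1) return;
    const members = groups.get(domain) || [];
    if (members.indexOf(address) === -1) members.push(address);
    groups.set(domain, members);
  });
  return groups;
}
```

A domain is only a hint: matching it to an extracted Organization entity still needs a name comparison or manual review.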

Variation 2: Email Thread Analysis

Analyze conversation threads:
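One way to analyze threads is to group messages by thread identifier and summarize each group. The field names below (`threadId`, `sentAt`) are assumptions for illustration and may not match the metadata your feed returns; `summarizeThreads` is a hypothetical helper.

```typescript
// Hypothetical shape; adapt the field names to your email metadata.
interface ThreadMessage {
  threadId: string;
  from: string;
  sentAt: string; // ISO 8601 UTC, so string comparison orders by time
}

interface ThreadSummary {
  threadId: string;
  messages: number;
  participants: string[];
  firstAt: string;
  lastAt: string;
}

// Groups messages by thread and returns summaries, longest thread first.
function summarizeThreads(messages: ThreadMessage[]): ThreadSummary[] {
  const byThread = new Map<string, ThreadSummary>();
  messages.forEach(m => {
    let summary = byThread.get(m.threadId);
    if (!summary) {
      summary = { threadId: m.threadId, messages: 0, participants: [], firstAt: m.sentAt, lastAt: m.sentAt };
      byThread.set(m.threadId, summary);
    }
    summary.messages += 1;
    if (summary.participants.indexOf(m.from) === -1) summary.participants.push(m.from);
    if (m.sentAt < summary.firstAt) summary.firstAt = m.sentAt;
    if (m.sentAt > summary.lastAt) summary.lastAt = m.sentAt;
  });
  return Array.from(byThread.values()).sort((a, b) => b.messages - a.messages);
}
```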

Variation 3: Contact Frequency Ranking

Rank contacts by interaction frequency:
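A sketch of per-contact ranking, counting how often each address sends and receives. It assumes the emails have already been reduced to sender and recipient addresses; `Exchange`, `ContactRank`, and `rankContacts` are hypothetical names.

```typescript
// Hypothetical shapes: one record per email, already reduced to addresses.
interface Exchange {
  from: string;
  recipients: string[]; // to + cc
}

interface ContactRank {
  contact: string;
  sent: number;
  received: number;
  total: number;
}

// Tallies sent and received counts per contact and returns the top `limit`.
function rankContacts(exchanges: Exchange[], limit: number): ContactRank[] {
  const stats = new Map<string, ContactRank>();
  const entryFor = (contact: string): ContactRank => {
    let rank = stats.get(contact);
    if (!rank) {
      rank = { contact, sent: 0, received: 0, total: 0 };
      stats.set(contact, rank);
    }
    return rank;
  };
  exchanges.forEach(exchange => {
    const sender = entryFor(exchange.from);
    sender.sent += 1;
    sender.total += 1;
    exchange.recipients.forEach(to => {
      const recipient = entryFor(to);
      recipient.received += 1;
      recipient.total += 1;
    });
  });
  return Array.from(stats.values())
    .sort((a, b) => b.total - a.total)
    .slice(0, limit);
}
```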

Variation 4: Search Emails by Entity

Search emails by entity:
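If you want to search client-side after fetching emails and their extracted entities, an inverted index from entity name to email IDs is a simple option. This is a sketch under assumed shapes: `TaggedEmail`, `buildEntityIndex`, and `searchByEntity` are hypothetical and do not reflect how the SDK filters content server-side.

```typescript
// Hypothetical shape: an email ID with the names of entities found in it.
interface TaggedEmail {
  id: string;
  entities: string[];
}

const normalizeName = (name: string): string => name.trim().toLowerCase();

// Builds a map from normalized entity name to the IDs of emails mentioning it.
function buildEntityIndex(emails: TaggedEmail[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  emails.forEach(email => {
    email.entities.forEach(name => {
      const key = normalizeName(name);
      const ids = index.get(key) || [];
      if (ids.indexOf(email.id) === -1) ids.push(email.id);
      index.set(key, ids);
    });
  });
  return index;
}

// Looks up email IDs for an entity name, ignoring case and outer whitespace.
function searchByEntity(index: Map<string, string[]>, name: string): string[] {
  return index.get(normalizeName(name)) || [];
}
```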

Variation 5: Cross-Source Entity Linking

Link email entities with other sources:
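A sketch of finding entities that appear both in emails and in at least one other kind of content. It assumes you have flattened your query results into (entity, content type) pairs; `Mention` and `linkAcrossSources` are hypothetical names, and the content type strings are placeholders rather than SDK enum values.

```typescript
// Hypothetical shape: one record per observation of an entity in some content.
interface Mention {
  entity: string;
  contentType: string; // placeholder labels such as "email" or "document"
}

// Returns entities seen in `primaryType` and at least one other content type,
// mapped to the sorted list of content types they appear in.
function linkAcrossSources(mentions: Mention[], primaryType: string): Map<string, string[]> {
  const typesByEntity = new Map<string, string[]>();
  mentions.forEach(m => {
    const types = typesByEntity.get(m.entity) || [];
    if (types.indexOf(m.contentType) === -1) types.push(m.contentType);
    typesByEntity.set(m.entity, types);
  });
  const linked = new Map<string, string[]>();
  typesByEntity.forEach((types, entity) => {
    if (types.indexOf(primaryType) !== -1 && types.length > 1) {
      linked.set(entity, types.slice().sort());
    }
  });
  return linked;
}
```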


Common Issues & Solutions

Issue: OAuth Token Expired

Problem: Feed sync fails with authorization error.

Solution: Refresh OAuth token in Developer Portal:

  1. Go to Developer Portal → Connectors

  2. Re-authorize Gmail/Outlook

  3. Copy new token

  4. Update feed or create new feed

Issue: Duplicate Entities from Sender/Recipient and Body

Problem: Same person appears as sender AND extracted from body.

Explanation: This is expected and valuable:

  • Email metadata (from/to/cc) captured automatically

  • Body extraction finds additional context

  • Multiple mentions increase confidence

Not a Problem: Graphlit deduplicates these into a single Observable.

Issue: Too Many Low-Confidence Entities

Problem: Email extraction finds many uncertain entities.

Solution: Filter by confidence threshold:
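A minimal client-side filter, assuming each extracted entity carries a numeric confidence score. The `ExtractedEntity` shape and `filterByConfidence` helper are hypothetical; the 0.7 default follows the threshold recommended for emails in Developer Hints.

```typescript
// Hypothetical shape; adapt to the observation fields you actually query.
interface ExtractedEntity {
  name: string;
  type: string;
  confidence?: number | null;
}

// Keeps entities at or above the threshold; entities without a score are dropped.
function filterByConfidence(
  entities: ExtractedEntity[],
  threshold: number = 0.7
): ExtractedEntity[] {
  return entities.filter(
    e => typeof e.confidence === "number" && e.confidence >= threshold
  );
}
```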

Emails can have ambiguous mentions ("John said...") with low confidence.

Issue: Missing Email Body Entities

Problem: Only sender/recipient captured, no body extraction.

Causes:

  1. Workflow not configured with extraction stage

  2. Email is HTML-only with no text

  3. Extraction failed for some emails

Solution: Verify workflow has extraction:
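A sketch of the check, assuming the workflow object you retrieve has the same `extraction.jobs[].connector` shape as the input passed to `createWorkflow()` in the complete example. `WorkflowLike`, `hasExtractionStage`, and `extractedTypesOf` are hypothetical names; how you fetch the workflow is outside this sketch.

```typescript
// Assumed shape, mirroring the createWorkflow() input in the complete example.
interface ExtractionJob {
  connector?: { type?: string | null; extractedTypes?: string[] | null } | null;
}

interface WorkflowLike {
  extraction?: { jobs?: (ExtractionJob | null)[] | null } | null;
}

// True when at least one extraction job has a connector configured.
function hasExtractionStage(workflow: WorkflowLike): boolean {
  const jobs = (workflow.extraction && workflow.extraction.jobs) || [];
  return jobs.some(job => !!(job && job.connector));
}

// Unique entity types requested across all extraction jobs.
function extractedTypesOf(workflow: WorkflowLike): string[] {
  const types: string[] = [];
  const jobs = (workflow.extraction && workflow.extraction.jobs) || [];
  jobs.forEach(job => {
    const listed = (job && job.connector && job.connector.extractedTypes) || [];
    listed.forEach(t => {
      if (types.indexOf(t) === -1) types.push(t);
    });
  });
  return types;
}
```

If `hasExtractionStage` returns false, the feed was created with a workflow that only ingests content, which would explain entities appearing for senders and recipients but not for the body.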


Developer Hints

OAuth Token Management

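  • Access tokens are short-lived (typically about 1 hour)

  • Refresh tokens last longer: Google can invalidate one that goes unused for 6 months, and Microsoft refresh tokens are replaced on each use (90-day default lifetime)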

  • Use Developer Portal for token management

  • Production apps should handle token refresh automatically
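For automatic refresh, a common pattern is to renew the access token shortly before it expires rather than waiting for an authorization error. The sketch below assumes you track the expiry time yourself; `TokenState`, `needsRefresh`, and `getUsableToken` are hypothetical names, and the `refresh` callback stands in for your OAuth provider's token endpoint call.

```typescript
// Hypothetical shape: an access token plus its expiry as epoch milliseconds.
interface TokenState {
  accessToken: string;
  expiresAt: number;
}

// Refresh this long before expiry to avoid using a token that dies mid-sync.
const REFRESH_MARGIN_MS = 5 * 60 * 1000;

// True when the token is expired or within the safety margin of expiring.
function needsRefresh(token: TokenState, now: number = Date.now()): boolean {
  return now >= token.expiresAt - REFRESH_MARGIN_MS;
}

// Returns the current token if still usable, otherwise the refreshed one.
async function getUsableToken(
  current: TokenState,
  refresh: () => Promise<TokenState>,
  now: number = Date.now()
): Promise<TokenState> {
  return needsRefresh(current, now) ? refresh() : current;
}
```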

Email Sync Best Practices

  1. Start small: Test with readLimit: 100 first

  2. Incremental sync: Graphlit tracks what's synced

  3. Monitor quota: Gmail API has rate limits

  4. Handle failures: Email sync can be interrupted

  5. Attachments optional: Skip for faster sync

Entity Quality from Emails

  • High confidence: Senders/recipients, signatures

  • Medium confidence: Explicit mentions in body

  • Low confidence: Implicit references, pronouns

  • Filter threshold: >=0.7 recommended for emails

Performance Considerations

  • Email sync is incremental (doesn't re-sync)

  • 100 emails = ~1-2 minutes processing

  • Attachments increase processing time significantly

  • Entity extraction adds 10-30% overhead

Privacy and Security

  • OAuth tokens have user-level permissions

  • Graphlit never stores raw OAuth refresh tokens

  • Email content encrypted at rest

  • Multi-tenant isolation ensures data privacy

