Build Knowledge Graph from Emails

Use Case: Build Knowledge Graph from Emails

User Intent

"How do I extract entities from my Gmail or Outlook emails to build a knowledge graph? Show me how to connect contacts, organizations, and build relationship networks from email data."

Operation

SDK Methods: createWorkflow(), createFeed(), isFeedDone(), queryContents(), queryObservables()
GraphQL: Feed creation + entity extraction + relationship queries
Entity: Email Feed → Email Content → Observations → Observables (Contact Graph)

Prerequisites

Graphlit project with API credentials
Gmail or Microsoft 365 account
OAuth tokens for email access (via Graphlit Developer Portal)
Understanding of feed and workflow concepts

Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FeedTypes,
  FeedServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  ContentTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from Emails ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "Email Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,          // Senders, recipients, mentions
          ObservableTypes.Organization,    // Companies from domains/signatures
          ObservableTypes.Event,           // Meeting mentions, deadlines
          ObservableTypes.Product,         // Products/services discussed
          ObservableTypes.Place            // Locations mentioned
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create Gmail feed with OAuth
console.log('Step 2: Creating Gmail feed...');
const feed = await graphlit.createFeed({
  name: "My Gmail",
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: process.env.GOOGLE_OAUTH_TOKEN!,  // From Developer Portal
    readLimit: 100,                          // Number of emails to sync
    includeAttachments: true                 // Sync attachments too
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Feed: ${feed.createFeed.id}\n`);

// Step 3: Wait for email sync
console.log('Step 3: Syncing emails...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isFeedDone(feed.createFeed.id);
  isDone = status.isFeedDone.result;
  
  if (!isDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 4: Query synced emails
console.log('Step 4: Querying synced emails...');
const emails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email],
    feeds: [{ id: feed.createFeed.id }]
  }
});

console.log(`✓ Synced ${emails.contents.results.length} emails\n`);

// Step 5: Analyze email metadata
console.log('Step 5: Analyzing email senders...\n');

const senders = new Map<string, number>();
emails.contents.results.forEach(email => {
  if (email.email?.from) {
    email.email.from.forEach(sender => {
      const email_addr = sender.email || 'unknown';
      senders.set(email_addr, (senders.get(email_addr) || 0) + 1);
    });
  }
});

console.log('Top email senders:');
Array.from(senders.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([email, count]) => {
    console.log(`  ${email}: ${count} emails`);
  });
console.log();

// Step 6: Query extracted entities
console.log('Step 6: Querying knowledge graph...\n');

// Get all people from emails
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`People extracted: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Organizations extracted: ${orgs.observables.results.length}\n`);

// Step 7: Build contact network
console.log('Step 7: Building contact network...\n');

// Email threads create person-to-person relationships
const contactNetwork = new Map<string, Set<string>>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0]?.email;
  const toList = email.email?.to?.map(t => t.email) || [];
  const ccList = email.email?.cc?.map(c => c.email) || [];
  
  // Drop missing addresses and narrow the element type to string
  const recipients = [...toList, ...ccList].filter((e): e is string => Boolean(e));
  
  if (from && recipients.length > 0) {
    if (!contactNetwork.has(from)) {
      contactNetwork.set(from, new Set());
    }
    recipients.forEach(recipient => {
      contactNetwork.get(from)!.add(recipient);
    });
  }
});

console.log('Top email relationships:');
Array.from(contactNetwork.entries())
  .map(([from, to]) => ({ from, count: to.size }))
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ from, count }) => {
    console.log(`  ${from} → ${count} contacts`);
  });

console.log('\n✓ Knowledge graph complete!');

### C#
```csharp
using Graphlit;
using Graphlit.Api.Input;

var graphlit = new Graphlit();

Console.WriteLine("=== Building Knowledge Graph from Emails ===\n");

// Step 1: Create workflow
Console.WriteLine("Step 1: Creating entity extraction workflow...");
var workflow = await graphlit.CreateWorkflow(
    name: "Email Entity Extraction",
    extraction: new WorkflowExtractionInput
    {
        Jobs = new[]
        {
            new WorkflowExtractionJobInput
            {
                Connector = new ExtractionConnectorInput
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization,
                        ObservableTypes.Event,
                        ObservableTypes.Product
                    }
                }
            }
        }
    }
);

Console.WriteLine($"✓ Workflow: {workflow.CreateWorkflow.Id}\n");

// Step 2: Create Gmail feed
Console.WriteLine("Step 2: Creating Gmail feed...");
var feed = await graphlit.CreateFeed(
    name: "My Gmail",
    type: FeedTypes.Email,
    email: new EmailFeedInput
    {
        Type = FeedServiceTypes.Gmail,
        Token = Environment.GetEnvironmentVariable("GOOGLE_OAUTH_TOKEN"),
        ReadLimit = 100,
        IncludeAttachments = true
    },
    workflow: new EntityReferenceInput { Id = workflow.CreateWorkflow.Id }
);

Console.WriteLine($"✓ Feed: {feed.CreateFeed.Id}\n");

// (Continue with remaining steps...)
```

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

Email-Specific Entity Types:

Person: Senders, recipients, people mentioned in body, signatures
Organization: Companies from email domains, mentioned in text, signatures
Event: Meetings, deadlines, calendar invites mentioned
Product: Products/services discussed in emails
Place: Locations mentioned (meeting locations, offices)

Why Text Extraction:

Emails are primarily text-based
No visual analysis needed (unlike PDFs)
Fast and cost-effective
Handles HTML email bodies

Step 2: Configure Email Feed

Gmail Feed Configuration:

feed: {
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: googleOAuthToken,           // From Developer Portal OAuth
    readLimit: 100,                     // How many emails to sync
    includeAttachments: true,           // Sync attachments as separate content
    labels: ['INBOX', 'SENT']           // Optional: specific labels
  }
}

Microsoft Outlook Feed:

feed: {
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Outlook,
    token: microsoftOAuthToken,         // Microsoft OAuth token
    readLimit: 100,
    includeAttachments: true,
    folderNames: ['Inbox', 'Sent Items']  // Optional: specific folders
  }
}

OAuth Token Setup:

1. Go to the Graphlit Developer Portal
2. Navigate to Connectors → Email
3. Authorize Gmail or Outlook
4. Copy the OAuth token
5. Use it in feed creation

Step 3: Sync and Wait for Processing

Sync Timeline:

100 emails: 1-2 minutes
1,000 emails: 10-15 minutes
10,000 emails: 1-2 hours

Polling Strategy:

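const pollInterval = 5000;  // 5 seconds
const maxWait = 600000;     // 10 minutes max

let isDone = false;
const startTime = Date.now();
while (!isDone && (Date.now() - startTime < maxWait)) {
  const status = await graphlit.isFeedDone(feedId);  // feedId: the ID returned by createFeed
  isDone = status.isFeedDone.result;
  
  if (!isDone) {
    await new Promise(resolve => setTimeout(resolve, pollInterval));
  }
}

// The loop also exits on timeout, so check which case occurred
if (!isDone) {
  console.warn('Feed still syncing after 10 minutes; continue polling later');
}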

Step 4: Query Email Content

Email Metadata Structure:

email: {
  from: [{ name: "Kirk Marple", email: "kirk@example.com" }],
  to: [{ name: "John Doe", email: "john@example.com" }],
  cc: [{ name: "Jane Smith", email: "jane@example.com" }],
  bcc: [],  // Usually empty (privacy)
  subject: "Q4 Planning Meeting",
  labels: ["INBOX", "IMPORTANT"],  // Gmail labels
  identifier: "<message-id@example.com>",
  threadIdentifier: "<thread-id@example.com>",
  sensitivity: "Normal",
  priority: "High",
  attachmentCount: 2
}
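
The metadata shape above can be modeled with a small local type when post-processing results. The sketch below is illustrative only: `EmailAddress`, `EmailMetadata`, and `participants` are hypothetical names defined here, not SDK types, and the SDK's generated types may differ in nullability.

```typescript
// Hypothetical local types mirroring the fields shown above (not SDK types).
interface EmailAddress {
  name?: string | null;
  email?: string | null;
}

interface EmailMetadata {
  from?: EmailAddress[] | null;
  to?: EmailAddress[] | null;
  cc?: EmailAddress[] | null;
  bcc?: EmailAddress[] | null;
  subject?: string | null;
  threadIdentifier?: string | null;
}

// Collect the distinct, lower-cased addresses that appear anywhere on a message.
function participants(meta: EmailMetadata): string[] {
  const all = [
    ...(meta.from ?? []),
    ...(meta.to ?? []),
    ...(meta.cc ?? []),
    ...(meta.bcc ?? [])
  ];
  const seen = new Set<string>();
  all.forEach(person => {
    const address = person.email?.trim().toLowerCase();
    if (address) seen.add(address);  // skip entries with no address
  });
  return Array.from(seen);
}
```

Normalizing case before de-duplicating keeps `Kirk@Example.com` and `kirk@example.com` from counting as two contacts.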

Step 5: Extract Entity Observations

Email Body Extraction:

const emailContent = await graphlit.getContent(emailId);

// Entities from email body
emailContent.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  // No page numbers (emails aren't paginated)
  // High confidence for explicit mentions
});

Signature Extraction: Email signatures are rich sources of Person/Organization data:

Kirk Marple
CEO, Graphlit
kirk@example.com
https://graphlit.com

Extracts: Person("Kirk Marple"), Organization("Graphlit")

Step 6: Build Contact Network

Email Threads Create Relationships:

from → to/cc: Direct communication
Frequency indicates relationship strength
Thread IDs group related emails

Network Analysis:

// Who communicates with whom
const relationships = new Map<string, Map<string, number>>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0]?.email;
  const recipients = [
    ...(email.email?.to?.map(t => t.email) || []),
    ...(email.email?.cc?.map(c => c.email) || [])
  ].filter((e): e is string => Boolean(e));  // skip missing addresses
  
  if (from && recipients.length > 0) {
    if (!relationships.has(from)) {
      relationships.set(from, new Map());
    }
    
    recipients.forEach(to => {
      const recipientMap = relationships.get(from)!;
      recipientMap.set(to, (recipientMap.get(to) || 0) + 1);
    });
  }
});
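
Since frequency is described above as a proxy for relationship strength, one way to rank relationships is to merge the two directions of each pair into a single weight. This is a minimal sketch under that assumption; `DirectedCounts`, `PairStrength`, and `strongestPairs` are hypothetical names, and the input has the same shape as the `relationships` map built above.

```typescript
// Same shape as the `relationships` map: sender -> (recipient -> message count).
type DirectedCounts = Map<string, Map<string, number>>;

interface PairStrength {
  a: string;
  b: string;
  messages: number;
}

// Merge A→B and B→A counts into one undirected weight per pair, then rank.
function strongestPairs(counts: DirectedCounts, limit = 5): PairStrength[] {
  const merged = new Map<string, PairStrength>();
  counts.forEach((recipients, from) => {
    recipients.forEach((n, to) => {
      if (from === to) return;  // ignore self-addressed mail
      // Order the pair so both directions share one key
      const a = from < to ? from : to;
      const b = from < to ? to : from;
      const key = a + ' ' + b;  // addresses contain no spaces
      const entry = merged.get(key) ?? { a, b, messages: 0 };
      entry.messages += n;
      merged.set(key, entry);
    });
  });
  return Array.from(merged.values())
    .sort((x, y) => y.messages - x.messages)
    .slice(0, limit);
}
```

Calling `strongestPairs(relationships)` would return the five most active pairs; raw message counts are only one possible weighting.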

Step 7: Query Knowledge Graph

Cross-Feed Entity Queries: Entities from emails become part of global knowledge graph:

// Find all content mentioning a person (emails + other sources)
const kirkContent = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: kirkPersonId }
    }]
  }
});

// Includes: emails, Slack messages, documents, etc.

Configuration Options

Limiting Email Sync Scope

By Count:

email: {
  readLimit: 500  // Most recent 500 emails
}

By Date Range:

email: {
  readLimit: 1000,
  // No date field is set here: the most recent emails are read first,
  // so readLimit bounds how far back the sync reaches
}

By Labels/Folders:

// Gmail
email: {
  type: FeedServiceTypes.Gmail,
  labels: ['INBOX', 'IMPORTANT', 'SENT']  // Specific labels only
}

// Outlook
email: {
  type: FeedServiceTypes.Outlook,
  folderNames: ['Inbox', 'Sent Items', 'Archive']
}

Handling Attachments

Include Attachments:

email: {
  includeAttachments: true  // PDFs, images, etc. become separate content
}

Attachments are processed through workflow:

PDFs → extraction → entities
Images → vision analysis → entities
Documents → text extraction → entities

Exclude Attachments (faster):

email: {
  includeAttachments: false  // Email body only
}

Variations

Variation 1: Organization Email Domain Mapping

Extract organizations from email domains:

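function extractOrgFromDomain(email: string): string | null {
  const domain = email.split('@')[1]?.toLowerCase();
  if (!domain) return null;
  
  // Map common domains; null marks personal mailboxes with no organization
  const orgMap: Record<string, string | null> = {
    'gmail.com': null,        // Personal email
    'outlook.com': null,      // Personal email
    'graphlit.com': 'Graphlit',
    'microsoft.com': 'Microsoft',
    // ... add more
  };
  
  // Use the explicit mapping when present, including null for personal domains
  if (Object.prototype.hasOwnProperty.call(orgMap, domain)) {
    return orgMap[domain];
  }
  
  // Otherwise fall back to the domain name without a common TLD
  return domain.replace(/\.(com|org|net|io)$/, '');
}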

// Build org roster from emails
const emailsByOrg = new Map<string, Set<string>>();

emails.contents.results.forEach(email => {
  email.email?.from?.forEach(sender => {
    const org = extractOrgFromDomain(sender.email || '');
    if (org) {
      if (!emailsByOrg.has(org)) {
        emailsByOrg.set(org, new Set());
      }
      emailsByOrg.get(org)!.add(sender.email || '');
    }
  });
});

console.log('Emails by organization:');
emailsByOrg.forEach((emails, org) => {
  console.log(`  ${org}: ${emails.size} contacts`);
});

Variation 2: Email Thread Analysis

Analyze conversation threads:

// Group emails by thread
const threads = new Map<string, Array<typeof emails.contents.results[0]>>();

emails.contents.results.forEach(email => {
  const threadId = email.email?.threadIdentifier || email.id;
  if (!threads.has(threadId)) {
    threads.set(threadId, []);
  }
  threads.get(threadId)!.push(email);
});

// Find longest threads
const longThreads = Array.from(threads.entries())
  .sort((a, b) => b[1].length - a[1].length)
  .slice(0, 5);

console.log('Longest email threads:');
longThreads.forEach(([threadId, emails]) => {
  const subject = emails[0].email?.subject;
  console.log(`  "${subject}": ${emails.length} emails`);
});
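
Beyond message counts, a thread can also be summarized by who took part in it. The following sketch assumes a simplified input shape; `ThreadedEmail`, `ThreadSummary`, and `summarizeThreads` are hypothetical names, and `sender` stands in for the first `from` address of each email.

```typescript
// Hypothetical minimal shape: only the fields this sketch needs.
interface ThreadedEmail {
  id: string;
  threadIdentifier?: string | null;
  subject?: string | null;
  sender?: string | null;
}

interface ThreadSummary {
  threadId: string;
  subject: string;
  messages: number;
  senders: string[];
}

// Group emails by thread and record message count plus distinct senders.
function summarizeThreads(emails: ThreadedEmail[]): ThreadSummary[] {
  const byThread = new Map<string, ThreadSummary>();
  emails.forEach(email => {
    // Unthreaded mail forms its own single-message thread
    const threadId = email.threadIdentifier || email.id;
    let summary = byThread.get(threadId);
    if (!summary) {
      summary = { threadId, subject: email.subject ?? '(no subject)', messages: 0, senders: [] };
      byThread.set(threadId, summary);
    }
    summary.messages++;
    const sender = email.sender?.toLowerCase();
    if (sender && summary.senders.indexOf(sender) === -1) {
      summary.senders.push(sender);
    }
  });
  // Longest threads first
  return Array.from(byThread.values()).sort((a, b) => b.messages - a.messages);
}
```

A long thread with a single sender is often automated mail, while one with several senders is more likely a real conversation, so the two numbers are worth reading together.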

Variation 3: Contact Frequency Ranking

Rank contacts by interaction frequency:

interface ContactStats {
  email: string;
  name?: string;
  emailsReceived: number;
  emailsSent: number;
  total: number;
}

const myEmail = 'you@example.com';  // Your email address
const contactStats = new Map<string, ContactStats>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0];
  const toList = email.email?.to || [];
  const ccList = email.email?.cc || [];
  
  if (from?.email === myEmail) {
    // Email I sent
    [...toList, ...ccList].forEach(recipient => {
      if (!contactStats.has(recipient.email!)) {
        contactStats.set(recipient.email!, {
          email: recipient.email!,
          name: recipient.name,
          emailsReceived: 0,
          emailsSent: 0,
          total: 0
        });
      }
      const stats = contactStats.get(recipient.email!)!;
      stats.emailsSent++;
      stats.total++;
    });
  } else if (from?.email) {
    // Email I received
    if (!contactStats.has(from.email)) {
      contactStats.set(from.email, {
        email: from.email,
        name: from.name,
        emailsReceived: 0,
        emailsSent: 0,
        total: 0
      });
    }
    const stats = contactStats.get(from.email)!;
    stats.emailsReceived++;
    stats.total++;
  }
});

// Top contacts
const topContacts = Array.from(contactStats.values())
  .sort((a, b) => b.total - a.total)
  .slice(0, 10);

console.log('Top contacts:');
topContacts.forEach((contact, i) => {
  console.log(`${i + 1}. ${contact.name || contact.email}`);
  console.log(`   Received: ${contact.emailsReceived}, Sent: ${contact.emailsSent}`);
});

Variation 4: Entity-Enhanced Email Search

Search emails by entity:

// Find all emails mentioning Graphlit
const graphlitOrg = await graphlit.queryObservables({
  search: "Graphlit",
  filter: { types: [ObservableTypes.Organization] }
});

const graphlitEmails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email],
    observations: [{
      type: ObservableTypes.Organization,
      observable: { id: graphlitOrg.observables.results[0].observable.id }
    }]
  }
});

console.log(`Emails mentioning Graphlit: ${graphlitEmails.contents.results.length}`);

// Who sent these emails?
const senders = new Set<string>();
graphlitEmails.contents.results.forEach(email => {
  email.email?.from?.forEach(sender => {
    if (sender.email) senders.add(sender.email);
  });
});

console.log('Senders:', Array.from(senders));

Variation 5: Cross-Source Entity Linking

Link email entities with other sources:

// Find person across email + Slack + documents
const person = await graphlit.queryObservables({
  search: "Kirk Marple",
  filter: { types: [ObservableTypes.Person] }
});

const allMentions = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: person.observables.results[0].observable.id }
    }]
  }
});

// Group by content type
const byType = allMentions.contents.results.reduce((groups, content) => {
  const type = content.type || 'UNKNOWN';
  if (!groups[type]) groups[type] = [];
  groups[type].push(content);
  return groups;
}, {} as Record<string, typeof allMentions.contents.results>);

console.log('Kirk Marple mentions:');
Object.entries(byType).forEach(([type, contents]) => {
  console.log(`  ${type}: ${contents.length} items`);
});

Common Issues & Solutions

Issue: OAuth Token Expired

Problem: Feed sync fails with authorization error.

Solution: Refresh OAuth token in Developer Portal:

1. Go to Developer Portal → Connectors
2. Re-authorize Gmail/Outlook
3. Copy new token
4. Update feed or create new feed

// Can't update token on existing feed - create new feed
const newFeed = await graphlit.createFeed({
  name: "Gmail (Updated)",
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: newOAuthToken  // Fresh token
  }
});

Issue: Duplicate Entities from Sender/Recipient and Body

Problem: Same person appears as sender AND extracted from body.

Explanation: This is expected and valuable:

Email metadata (from/to/cc) captured automatically
Body extraction finds additional context
Multiple mentions increase confidence

Not a Problem: Graphlit deduplicates to single Observable.

Issue: Too Many Low-Confidence Entities

Problem: Email extraction finds many uncertain entities.

Solution: Filter by confidence threshold:

const highConfidence = email.observations
  ?.filter(obs => obs.occurrences?.some(occ => (occ.confidence ?? 0) >= 0.75)) || [];

Emails can have ambiguous mentions ("John said...") with low confidence.
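
The same filter can be packaged as a reusable helper. This is a sketch, not part of the SDK: `Occurrence`, `Observation`, `maxConfidence`, and `filterByConfidence` are hypothetical local names, and a missing confidence is treated as 0, which is an assumption rather than documented behavior.

```typescript
// Hypothetical local types covering only the fields used here.
interface Occurrence {
  confidence?: number | null;
}

interface Observation {
  type: string;
  observable: { id: string; name?: string | null };
  occurrences?: Occurrence[] | null;
}

// Highest confidence across an observation's occurrences (0 when none is reported).
function maxConfidence(obs: Observation): number {
  return (obs.occurrences ?? []).reduce(
    (best, occ) => Math.max(best, occ.confidence ?? 0),
    0
  );
}

// Keep observations with at least one occurrence at or above the threshold.
function filterByConfidence(observations: Observation[], threshold = 0.75): Observation[] {
  return observations.filter(obs => maxConfidence(obs) >= threshold);
}
```

Using the maximum rather than the average keeps an entity that is mentioned once explicitly and several times ambiguously.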

Issue: Missing Email Body Entities

Problem: Only sender/recipient captured, no body extraction.

Causes:

Workflow not configured with extraction stage
Email is HTML-only with no text
Extraction failed for some emails

Solution: Verify workflow has extraction:

// Check workflow configuration
const workflowDetails = await graphlit.getWorkflow(workflowId);
console.log('Extraction jobs:', workflowDetails.workflow.extraction?.jobs);

Developer Hints

OAuth Token Management

Access tokens are short-lived (typically about 1 hour)
Refresh tokens last longer: Google invalidates one left unused for 6 months; Microsoft's default to a 90-day lifetime that renews with use
Use Developer Portal for token management
Production apps should handle token refresh automatically
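
For the automatic refresh mentioned above, one common pattern is a small cache that renews the access token shortly before it expires. The sketch below is provider-agnostic and makes no Graphlit calls; `AccessToken`, `TokenCache`, and the injected `refresh` callback are hypothetical, and the callback stands in for whatever OAuth refresh request your provider requires.

```typescript
interface AccessToken {
  value: string;
  expiresAt: number;  // epoch milliseconds
}

// Caches an access token and refreshes it shortly before expiry.
class TokenCache {
  private current: AccessToken | null = null;
  private refresh: () => Promise<AccessToken>;
  private skewMs: number;
  private now: () => number;

  constructor(
    refresh: () => Promise<AccessToken>,  // hypothetical: wraps your OAuth refresh call
    skewMs = 60000,                       // refresh one minute early
    now: () => number = () => Date.now()  // injectable clock for testing
  ) {
    this.refresh = refresh;
    this.skewMs = skewMs;
    this.now = now;
  }

  async get(): Promise<string> {
    if (!this.current || this.now() >= this.current.expiresAt - this.skewMs) {
      this.current = await this.refresh();
    }
    return this.current.value;
  }
}
```

Refreshing slightly early avoids handing out a token that expires mid-request.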

Email Sync Best Practices

Start small: Test with readLimit: 100 first
Incremental sync: Graphlit tracks what's synced
Monitor quota: Gmail API has rate limits
Handle failures: Email sync can be interrupted
Attachments optional: Skip for faster sync
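
Rate limits and interrupted syncs are usually handled by retrying with exponential backoff. The helper below is a generic sketch: `withRetry` is a hypothetical name, it retries on any thrown error, and a production version would likely retry only on rate-limit or transient failures.

```typescript
// Retry an async operation, doubling the delay after each failure.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;  // out of attempts
      const delay = baseDelayMs * Math.pow(2, attempt);  // 1x, 2x, 4x, ...
      await new Promise<void>(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A call such as `withRetry(() => graphlit.queryContents({ filter }))` would wrap a single query; `filter` here is whatever filter object you already pass.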

Entity Quality from Emails

High confidence: Senders/recipients, signatures
Medium confidence: Explicit mentions in body
Low confidence: Implicit references, pronouns
Filter threshold: >=0.7 recommended for emails

Performance Considerations

Email sync is incremental (doesn't re-sync)
100 emails = ~1-2 minutes processing
Attachments increase processing time significantly
Entity extraction adds 10-30% overhead

Privacy and Security

OAuth tokens have user-level permissions
Graphlit never stores raw OAuth refresh tokens
Email content encrypted at rest
Multi-tenant isolation ensures data privacy

Last updated 2 months ago
