Build Knowledge Graph from Emails

Use Case: Build Knowledge Graph from Emails

User Intent

"How do I extract entities from my Gmail or Outlook emails to build a knowledge graph? Show me how to connect contacts, organizations, and build relationship networks from email data."

Operation

SDK Methods: createWorkflow(), createFeed(), isFeedDone(), queryContents(), queryObservables()

GraphQL: Feed creation + entity extraction + relationship queries

Entity: Email Feed → Email Content → Observations → Observables (Contact Graph)

Prerequisites

  • Graphlit project with API credentials

  • Gmail or Microsoft 365 account

  • OAuth tokens for email access (via Graphlit Developer Portal)

  • Understanding of feed and workflow concepts


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FeedTypes,
  FeedServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  ContentTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from Emails ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating entity extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "Email Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,          // Senders, recipients, mentions
          ObservableTypes.Organization,    // Companies from domains/signatures
          ObservableTypes.Event,           // Meeting mentions, deadlines
          ObservableTypes.Product,         // Products/services discussed
          ObservableTypes.Place            // Locations mentioned
        ]
      }
    }]
  }
});

console.log(`✓ Workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Create Gmail feed with OAuth
console.log('Step 2: Creating Gmail feed...');
const feed = await graphlit.createFeed({
  name: "My Gmail",
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: process.env.GOOGLE_OAUTH_TOKEN!,  // From Developer Portal
    readLimit: 100,                          // Number of emails to sync
    includeAttachments: true                 // Sync attachments too
  },
  workflow: { id: workflow.createWorkflow.id }
});

console.log(`✓ Feed: ${feed.createFeed.id}\n`);

// Step 3: Wait for email sync
console.log('Step 3: Syncing emails...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isFeedDone(feed.createFeed.id);
  isDone = status.isFeedDone.result;
  
  if (!isDone) {
    console.log('  Syncing... (checking again in 5s)');
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}
console.log('✓ Sync complete\n');

// Step 4: Query synced emails
console.log('Step 4: Querying synced emails...');
const emails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email],
    feeds: [{ id: feed.createFeed.id }]
  }
});

console.log(`✓ Synced ${emails.contents.results.length} emails\n`);

// Step 5: Analyze email metadata
console.log('Step 5: Analyzing email senders...\n');

const senders = new Map<string, number>();
emails.contents.results.forEach(email => {
  if (email.email?.from) {
    email.email.from.forEach(sender => {
      const email_addr = sender.email || 'unknown';
      senders.set(email_addr, (senders.get(email_addr) || 0) + 1);
    });
  }
});

console.log('Top email senders:');
Array.from(senders.entries())
  .sort((a, b) => b[1] - a[1])
  .slice(0, 5)
  .forEach(([email, count]) => {
    console.log(`  ${email}: ${count} emails`);
  });
console.log();

// Step 6: Query extracted entities
console.log('Step 6: Querying knowledge graph...\n');

// Get all people from emails
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`People extracted: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Organizations extracted: ${orgs.observables.results.length}\n`);

// Step 7: Build contact network
console.log('Step 7: Building contact network...\n');

// Email threads create person-to-person relationships
const contactNetwork = new Map<string, Set<string>>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0]?.email;
  const toList = email.email?.to?.map(t => t.email) || [];
  const ccList = email.email?.cc?.map(c => c.email) || [];
  
  const recipients = [...toList, ...ccList].filter(e => e);
  
  if (from && recipients.length > 0) {
    if (!contactNetwork.has(from)) {
      contactNetwork.set(from, new Set());
    }
    recipients.forEach(recipient => {
      contactNetwork.get(from)!.add(recipient);
    });
  }
});

console.log('Top email relationships:');
Array.from(contactNetwork.entries())
  .map(([from, to]) => ({ from, count: to.size }))
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ from, count }) => {
    console.log(`  ${from} → ${count} contacts`);
  });

console.log('\n✓ Knowledge graph complete!');


### C#
```csharp
using Graphlit;
using Graphlit.Api.Input;

var graphlit = new Graphlit();

Console.WriteLine("=== Building Knowledge Graph from Emails ===\n");

// Step 1: Create workflow
Console.WriteLine("Step 1: Creating entity extraction workflow...");
var workflow = await graphlit.CreateWorkflow(
    name: "Email Entity Extraction",
    extraction: new WorkflowExtractionInput
    {
        Jobs = new[]
        {
            new WorkflowExtractionJobInput
            {
                Connector = new ExtractionConnectorInput
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization,
                        ObservableTypes.Event,
                        ObservableTypes.Product
                    }
                }
            }
        }
    }
);

Console.WriteLine($"✓ Workflow: {workflow.CreateWorkflow.Id}\n");

// Step 2: Create Gmail feed
Console.WriteLine("Step 2: Creating Gmail feed...");
var feed = await graphlit.CreateFeed(
    name: "My Gmail",
    type: FeedTypes.Email,
    email: new EmailFeedInput
    {
        Type = FeedServiceTypes.Gmail,
        Token = Environment.GetEnvironmentVariable("GOOGLE_OAUTH_TOKEN"),
        ReadLimit = 100,
        IncludeAttachments = true
    },
    workflow: new EntityReferenceInput { Id = workflow.CreateWorkflow.Id }
);

Console.WriteLine($"✓ Feed: {feed.CreateFeed.Id}\n");

// (Continue with remaining steps...)
```

Step-by-Step Explanation

Step 1: Create Entity Extraction Workflow

Email-Specific Entity Types:

  • Person: Senders, recipients, people mentioned in body, signatures

  • Organization: Companies from email domains, mentioned in text, signatures

  • Event: Meetings, deadlines, calendar invites mentioned

  • Product: Products/services discussed in emails

  • Place: Locations mentioned (meeting locations, offices)

Why Text Extraction:

  • Emails are primarily text-based

  • No visual analysis needed (unlike PDFs)

  • Fast and cost-effective

  • Handles HTML email bodies

Step 2: Configure Email Feed

Gmail Feed Configuration:

feed: {
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: googleOAuthToken,           // From Developer Portal OAuth
    readLimit: 100,                     // How many emails to sync
    includeAttachments: true,           // Sync attachments as separate content
    labels: ['INBOX', 'SENT']           // Optional: specific labels
  }
}

Microsoft Outlook Feed:

feed: {
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Outlook,
    token: microsoftOAuthToken,         // Microsoft OAuth token
    readLimit: 100,
    includeAttachments: true,
    folderNames: ['Inbox', 'Sent Items']  // Optional: specific folders
  }
}

OAuth Token Setup:

  1. Go to Graphlit Developer Portal

  2. Navigate to Connectors → Email

  3. Authorize Gmail or Outlook

  4. Copy OAuth token

  5. Use in feed creation
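
In practice, the last step usually means reading the token from configuration. The following is a minimal sketch that fails fast when the token is missing instead of passing an empty value to feed creation. The helper name `requireEnv` is hypothetical and not part of the Graphlit SDK; `GOOGLE_OAUTH_TOKEN` is the variable name used in the complete example.

```typescript
// Hypothetical helper (not part of the Graphlit SDK): read a required setting
// and fail with a clear message if it is absent or blank.
// `env` is passed in explicitly so the function is easy to test;
// in Node.js you would pass process.env.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>
): string {
  const value = env[name];

  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }

  // Trim stray whitespace that often sneaks in when copying tokens
  return value.trim();
}
```

A call such as `requireEnv('GOOGLE_OAUTH_TOKEN', process.env)` could then replace the non-null assertion `process.env.GOOGLE_OAUTH_TOKEN!`, which would otherwise let a missing token through silently.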

Step 3: Sync and Wait for Processing

Sync Timeline:

  • 100 emails: 1-2 minutes

  • 1,000 emails: 10-15 minutes

  • 10,000 emails: 1-2 hours

Polling Strategy:

const pollInterval = 5000;  // 5 seconds
const maxWait = 600000;     // 10 minutes max

const startTime = Date.now();
let isDone = false;
while (!isDone && (Date.now() - startTime < maxWait)) {
  const status = await graphlit.isFeedDone(feedId);
  isDone = status.isFeedDone.result;
  
  if (!isDone) {
    await new Promise(resolve => setTimeout(resolve, pollInterval));
  }
}
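
A bounded loop like the one above ends the same way whether the sync finished or the time budget ran out, so the caller cannot tell the two apart. Below is a minimal sketch of a reusable helper that returns which of the two happened. The name `pollUntilDone` and the `checkDone` callback are assumptions for illustration; the callback would typically wrap `graphlit.isFeedDone(feedId)` and return its `isFeedDone.result`.

```typescript
// Hypothetical helper: polls an async check until it reports done
// or the time budget is used up.
// Returns true if the check succeeded, false if the wait timed out.
async function pollUntilDone(
  checkDone: () => Promise<boolean>,
  pollIntervalMs: number = 5000,   // 5 seconds between checks
  maxWaitMs: number = 600000       // 10 minutes in total
): Promise<boolean> {
  const deadline = Date.now() + maxWaitMs;

  for (;;) {
    // Always check at least once, even with a zero time budget
    if (await checkDone()) {
      return true;
    }

    if (Date.now() >= deadline) {
      return false; // Timed out; the feed may still be syncing
    }

    // Wait before the next check
    await new Promise<void>(resolve => setTimeout(resolve, pollIntervalMs));
  }
}
```

Returning a boolean lets the caller decide what a timeout means, for example logging a warning and querying whatever has synced so far.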

Step 4: Query Email Content

Email Metadata Structure:

email: {
  from: [{ name: "Kirk Marple", email: "[email protected]" }],
  to: [{ name: "John Doe", email: "[email protected]" }],
  cc: [{ name: "Jane Smith", email: "[email protected]" }],
  bcc: [],  // Usually empty (privacy)
  subject: "Q4 Planning Meeting",
  labels: ["INBOX", "IMPORTANT"],  // Gmail labels
  identifier: "<[email protected]>",
  threadIdentifier: "<[email protected]>",
  sensitivity: "Normal",
  priority: "High",
  attachmentCount: 2
}
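
To work with this structure in a type-safe way, the fields can be described with small interfaces. The sketch below is illustrative only: `EmailPerson`, `EmailMetadata`, and `listParticipants` are hypothetical names that mirror the fields shown above, and the SDK's generated types may differ.

```typescript
// Hypothetical types mirroring only the fields shown above;
// the SDK's generated types may differ.
interface EmailPerson {
  name?: string | null;
  email?: string | null;
}

interface EmailMetadata {
  from?: EmailPerson[] | null;
  to?: EmailPerson[] | null;
  cc?: EmailPerson[] | null;
  bcc?: EmailPerson[] | null;
  subject?: string | null;
}

// Collect the distinct addresses that appear anywhere on one email.
// Addresses are lower-cased so "A@x.com" and "a@x.com" count once.
function listParticipants(meta: EmailMetadata): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  const groups = [meta.from, meta.to, meta.cc, meta.bcc];

  groups.forEach(group => {
    (group || []).forEach(person => {
      const address = (person.email || '').trim().toLowerCase();
      if (address && !seen.has(address)) {
        seen.add(address);
        result.push(address);
      }
    });
  });

  return result;
}
```

Normalizing addresses at this point keeps later steps, such as counting senders or building the contact network, from splitting one person across several keys.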

Step 5: Extract Entity Observations

Email Body Extraction:

const emailContent = await graphlit.getContent(emailId);

// Entities from email body
emailContent.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  // No page numbers (emails aren't paginated)
  // High confidence for explicit mentions
});

Signature Extraction: Email signatures are rich sources of Person/Organization data:

Kirk Marple
CEO, Graphlit
[email protected]
https://graphlit.com

Extracts: Person("Kirk Marple"), Organization("Graphlit")

Step 6: Build Contact Network

Email Threads Create Relationships:

  • from → to/cc: Direct communication

  • Frequency indicates relationship strength

  • Thread IDs group related emails

Network Analysis:

// Who communicates with whom
const relationships = new Map<string, Map<string, number>>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0]?.email;
  const recipients = [
    ...(email.email?.to?.map(t => t.email) || []),
    ...(email.email?.cc?.map(c => c.email) || [])
  ];
  
  if (from && recipients.length > 0) {
    if (!relationships.has(from)) {
      relationships.set(from, new Map());
    }
    
    recipients.forEach(to => {
      const recipientMap = relationships.get(from)!;
      recipientMap.set(to, (recipientMap.get(to) || 0) + 1);
    });
  }
});
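
A nested map is convenient to build but awkward to rank or hand to a graph tool. The sketch below flattens it into a list of weighted edges, sorted so the strongest relationships come first. It assumes the `Map<string, Map<string, number>>` shape built above; `Edge` and `toEdgeList` are hypothetical names.

```typescript
// One directed, weighted relationship: `from` emailed `to` `weight` times.
interface Edge {
  from: string;
  to: string;
  weight: number;
}

// Flatten sender → (recipient → count) into a list of edges,
// sorted by message count in descending order.
function toEdgeList(
  relationships: Map<string, Map<string, number>>
): Edge[] {
  const edges: Edge[] = [];

  relationships.forEach((recipients, from) => {
    recipients.forEach((weight, to) => {
      edges.push({ from, to, weight });
    });
  });

  return edges.sort((a, b) => b.weight - a.weight);
}
```

The resulting array serializes directly with `JSON.stringify`, which makes it easy to load into a visualization or graph analysis library.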

Step 7: Query Knowledge Graph

Cross-Feed Entity Queries: Entities extracted from emails become part of the global knowledge graph, so a query by entity can return content from any source:

// Find all content mentioning a person (emails + other sources)
const kirkContent = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: kirkPersonId }
    }]
  }
});

// Includes: emails, Slack messages, documents, etc.

Configuration Options

Limiting Email Sync Scope

By Count:

email: {
  readLimit: 500  // Most recent 500 emails
}

By Date Range:

email: {
  readLimit: 1000,
  // Only recent emails (Graphlit handles recency automatically)
}

By Labels/Folders:

// Gmail
email: {
  type: FeedServiceTypes.Gmail,
  labels: ['INBOX', 'IMPORTANT', 'SENT']  // Specific labels only
}

// Outlook
email: {
  type: FeedServiceTypes.Outlook,
  folderNames: ['Inbox', 'Sent Items', 'Archive']
}

Handling Attachments

Include Attachments:

email: {
  includeAttachments: true  // PDFs, images, etc. become separate content
}

Attachments are processed through workflow:

  • PDFs → extraction → entities

  • Images → vision analysis → entities

  • Documents → text extraction → entities

Exclude Attachments (faster):

email: {
  includeAttachments: false  // Email body only
}

Variations

Variation 1: Organization Email Domain Mapping

Extract organizations from email domains:

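function extractOrgFromDomain(email: string): string | null {
  const domain = email.split('@')[1]?.toLowerCase();
  if (!domain) return null;
  
  // Map common domains; null marks personal email providers
  const orgMap: Record<string, string | null> = {
    'gmail.com': null,        // Personal email
    'outlook.com': null,      // Personal email
    'graphlit.com': 'Graphlit',
    'microsoft.com': 'Microsoft',
    // ... add more
  };
  
  // Known domain: return its mapping (null for personal providers)
  if (Object.prototype.hasOwnProperty.call(orgMap, domain)) {
    return orgMap[domain];
  }
  
  // Unknown domain: fall back to the domain name without its TLD
  return domain.replace(/\.(com|org|net|io)$/, '');
}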

// Build org roster from emails
const emailsByOrg = new Map<string, Set<string>>();

emails.contents.results.forEach(email => {
  email.email?.from?.forEach(sender => {
    const org = extractOrgFromDomain(sender.email || '');
    if (org) {
      if (!emailsByOrg.has(org)) {
        emailsByOrg.set(org, new Set());
      }
      emailsByOrg.get(org)!.add(sender.email || '');
    }
  });
});

console.log('Contacts by organization:');
emailsByOrg.forEach((addresses, org) => {
  console.log(`  ${org}: ${addresses.size} contacts`);
});

Variation 2: Email Thread Analysis

Analyze conversation threads:

// Group emails by thread
const threads = new Map<string, Array<typeof emails.contents.results[0]>>();

emails.contents.results.forEach(email => {
  const threadId = email.email?.threadIdentifier || email.id;
  if (!threads.has(threadId)) {
    threads.set(threadId, []);
  }
  threads.get(threadId)!.push(email);
});

// Find longest threads
const longThreads = Array.from(threads.entries())
  .sort((a, b) => b[1].length - a[1].length)
  .slice(0, 5);

console.log('Longest email threads:');
longThreads.forEach(([threadId, emails]) => {
  const subject = emails[0].email?.subject;
  console.log(`  "${subject}": ${emails.length} emails`);
});
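
Thread length alone does not show how many people took part in a conversation. As a rough sketch, the grouping can also track distinct senders per thread. The `ThreadedEmail` shape here is a simplified, hypothetical stand-in for the query results (with sender addresses already pulled out of `email.from`), and `summarizeThreads` is an illustrative name.

```typescript
// Simplified, hypothetical input shape: one entry per email.
interface ThreadedEmail {
  id: string;
  threadIdentifier?: string | null;
  senders: string[];  // Sender addresses taken from email.from
}

interface ThreadSummary {
  threadId: string;
  messageCount: number;
  distinctSenders: number;
}

// Group emails by thread and count messages and distinct senders.
// Emails without a thread identifier form a single-message thread.
function summarizeThreads(emails: ThreadedEmail[]): ThreadSummary[] {
  const byThread = new Map<string, { count: number; senders: Set<string> }>();

  emails.forEach(email => {
    const threadId = email.threadIdentifier || email.id;
    const bucket = byThread.get(threadId) || { count: 0, senders: new Set<string>() };
    byThread.set(threadId, bucket);

    bucket.count++;
    email.senders.forEach(sender => bucket.senders.add(sender.toLowerCase()));
  });

  const summaries: ThreadSummary[] = [];
  byThread.forEach((bucket, threadId) => {
    summaries.push({
      threadId,
      messageCount: bucket.count,
      distinctSenders: bucket.senders.size
    });
  });

  // Longest threads first
  return summaries.sort((a, b) => b.messageCount - a.messageCount);
}
```

A long thread with a single distinct sender is often an automated notification, while several distinct senders suggest a real discussion.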

Variation 3: Contact Frequency Ranking

Rank contacts by interaction frequency:

interface ContactStats {
  email: string;
  name?: string;
  emailsReceived: number;
  emailsSent: number;
  total: number;
}

const myEmail = '[email protected]';  // Your email address
const contactStats = new Map<string, ContactStats>();

emails.contents.results.forEach(email => {
  const from = email.email?.from?.[0];
  const toList = email.email?.to || [];
  const ccList = email.email?.cc || [];
  
  if (from?.email === myEmail) {
    // Email I sent
    [...toList, ...ccList].forEach(recipient => {
      if (!contactStats.has(recipient.email!)) {
        contactStats.set(recipient.email!, {
          email: recipient.email!,
          name: recipient.name,
          emailsReceived: 0,
          emailsSent: 0,
          total: 0
        });
      }
      const stats = contactStats.get(recipient.email!)!;
      stats.emailsSent++;
      stats.total++;
    });
  } else if (from?.email) {
    // Email I received
    if (!contactStats.has(from.email)) {
      contactStats.set(from.email, {
        email: from.email,
        name: from.name,
        emailsReceived: 0,
        emailsSent: 0,
        total: 0
      });
    }
    const stats = contactStats.get(from.email)!;
    stats.emailsReceived++;
    stats.total++;
  }
});

// Top contacts
const topContacts = Array.from(contactStats.values())
  .sort((a, b) => b.total - a.total)
  .slice(0, 10);

console.log('Top contacts:');
topContacts.forEach((contact, i) => {
  console.log(`${i + 1}. ${contact.name || contact.email}`);
  console.log(`   Received: ${contact.emailsReceived}, Sent: ${contact.emailsSent}`);
});

Variation 4: Search Emails by Entity

Find emails that mention a specific entity:

// Find all emails mentioning Graphlit
const graphlitOrg = await graphlit.queryObservables({
  search: "Graphlit",
  filter: { types: [ObservableTypes.Organization] }
});

const graphlitEmails = await graphlit.queryContents({
  filter: {
    types: [ContentTypes.Email],
    observations: [{
      type: ObservableTypes.Organization,
      observable: { id: graphlitOrg.observables.results[0].observable.id }
    }]
  }
});

console.log(`Emails mentioning Graphlit: ${graphlitEmails.contents.results.length}`);

// Who sent these emails?
const senders = new Set<string>();
graphlitEmails.contents.results.forEach(email => {
  email.email?.from?.forEach(sender => {
    if (sender.email) senders.add(sender.email);
  });
});

console.log('Senders:', Array.from(senders));

Variation 5: Cross-Source Entity Linking

Link email entities with other sources:

// Find person across email + Slack + documents
const person = await graphlit.queryObservables({
  search: "Kirk Marple",
  filter: { types: [ObservableTypes.Person] }
});

const allMentions = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: person.observables.results[0].observable.id }
    }]
  }
});

// Group by content type
const byType = allMentions.contents.results.reduce((groups, content) => {
  const type = content.type || 'UNKNOWN';
  if (!groups[type]) groups[type] = [];
  groups[type].push(content);
  return groups;
}, {} as Record<string, typeof allMentions.contents.results>);

console.log('Kirk Marple mentions:');
Object.entries(byType).forEach(([type, contents]) => {
  console.log(`  ${type}: ${contents.length} items`);
});

Common Issues & Solutions

Issue: OAuth Token Expired

Problem: Feed sync fails with authorization error.

Solution: Refresh OAuth token in Developer Portal:

  1. Go to Developer Portal → Connectors

  2. Re-authorize Gmail/Outlook

  3. Copy new token

  4. Update feed or create new feed

// Can't update token on existing feed - create new feed
const newFeed = await graphlit.createFeed({
  name: "Gmail (Updated)",
  type: FeedTypes.Email,
  email: {
    type: FeedServiceTypes.Gmail,
    token: newOAuthToken  // Fresh token
  }
});

Issue: Duplicate Entities from Sender/Recipient and Body

Problem: Same person appears as sender AND extracted from body.

Explanation: This is expected and valuable:

  • Email metadata (from/to/cc) captured automatically

  • Body extraction finds additional context

  • Multiple mentions increase confidence

Not a Problem: Graphlit deduplicates these into a single Observable.

Issue: Too Many Low-Confidence Entities

Problem: Email extraction finds many uncertain entities.

Solution: Filter by confidence threshold:

const highConfidence = email.observations
  ?.filter(obs => obs.occurrences?.some(occ => (occ.confidence ?? 0) >= 0.75)) || [];

Emails can have ambiguous mentions ("John said...") with low confidence.
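
To see the effect of a threshold on concrete data, here is a self-contained sketch that scores each entity by its strongest occurrence and keeps only those at or above the cutoff. The `Occurrence` and `Observation` interfaces are hypothetical and cover only the fields used here; `rankByConfidence` is an illustrative name, not an SDK function.

```typescript
// Hypothetical shapes covering only the fields this sketch needs.
interface Occurrence {
  confidence?: number | null;
}

interface Observation {
  type: string;
  name: string;
  occurrences?: Occurrence[] | null;
}

interface RankedEntity {
  type: string;
  name: string;
  confidence: number;
}

// Keep entities whose best occurrence meets the threshold,
// sorted from most to least confident.
function rankByConfidence(
  observations: Observation[],
  threshold: number
): RankedEntity[] {
  const ranked: RankedEntity[] = [];

  observations.forEach(obs => {
    // Score = strongest occurrence; missing confidence counts as 0
    let best = 0;
    (obs.occurrences || []).forEach(occ => {
      const confidence = occ.confidence == null ? 0 : occ.confidence;
      if (confidence > best) best = confidence;
    });

    if (best >= threshold) {
      ranked.push({ type: obs.type, name: obs.name, confidence: best });
    }
  });

  return ranked.sort((a, b) => b.confidence - a.confidence);
}
```

Using the strongest occurrence matches the `some(...)` check above: one clear mention is enough to keep an entity even if other mentions of it are ambiguous.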

Issue: Missing Email Body Entities

Problem: Only sender/recipient captured, no body extraction.

Causes:

  1. Workflow not configured with extraction stage

  2. Email is HTML-only with no text

  3. Extraction failed for some emails

Solution: Verify workflow has extraction:

// Check workflow configuration
const workflowDetails = await graphlit.getWorkflow(workflowId);
console.log('Extraction jobs:', workflowDetails.workflow.extraction?.jobs);
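
If the workflow looks correct, the third cause (extraction failed for some emails) can be narrowed down by checking which items came back without any observations. The sketch below assumes each content item exposes an `id` and an optional `observations` array, as in the `getContent` example; `ContentLike` and `checkExtractionCoverage` are hypothetical names.

```typescript
// Hypothetical minimal shape of a content item for this check.
interface ContentLike {
  id: string;
  name?: string | null;
  observations?: unknown[] | null;
}

interface ExtractionCoverage<T> {
  withEntities: number;   // Items that have at least one observation
  withoutEntities: T[];   // Items to inspect or re-process
  coverage: number;       // Fraction between 0 and 1
}

// Split content into items with and without extracted entities.
function checkExtractionCoverage<T extends ContentLike>(
  contents: T[]
): ExtractionCoverage<T> {
  const withoutEntities = contents.filter(
    content => !content.observations || content.observations.length === 0
  );
  const withEntities = contents.length - withoutEntities.length;

  return {
    withEntities,
    withoutEntities,
    coverage: contents.length === 0 ? 0 : withEntities / contents.length
  };
}
```

A coverage near zero points to the workflow not being applied at all, while a few scattered gaps are more consistent with individual emails that have little extractable text.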

Developer Hints

OAuth Token Management

  • Access tokens are short-lived (typically about 1 hour)

  • Refresh tokens last longer: about 6 months without use (Gmail) or indefinitely while in regular use (Outlook)

  • Use Developer Portal for token management

  • Production apps should handle token refresh automatically

Email Sync Best Practices

  1. Start small: Test with readLimit: 100 first

  2. Incremental sync: Graphlit tracks what's synced

  3. Monitor quota: Gmail API has rate limits

  4. Handle failures: Email sync can be interrupted

  5. Attachments optional: Skip for faster sync
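
For rate limits and interrupted syncs, one common approach is to retry a failed call with exponentially growing delays. The following is a generic sketch rather than anything built into the Graphlit SDK: `withRetry` is a hypothetical helper, and the attempt count and delays are placeholder values to tune for your workload.

```typescript
// Hypothetical helper: run an async operation, retrying on failure
// with exponential backoff (baseDelayMs, 2x, 4x, ...).
// Rethrows the last error once all attempts are used up.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 1000
): Promise<T> {
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;

      // Only wait if another attempt will follow
      if (attempt < maxAttempts) {
        const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>(resolve => setTimeout(resolve, delayMs));
      }
    }
  }

  throw lastError;
}
```

Any SDK call can be wrapped this way, for example `withRetry(() => graphlit.queryContents({ ... }))`. Note that the sketch retries on every error; a production version would likely retry only on errors known to be transient.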

Entity Quality from Emails

  • High confidence: Senders/recipients, signatures

  • Medium confidence: Explicit mentions in body

  • Low confidence: Implicit references, pronouns

  • Filter threshold: >=0.7 recommended for emails

Performance Considerations

  • Email sync is incremental (doesn't re-sync)

  • 100 emails = ~1-2 minutes processing

  • Attachments increase processing time significantly

  • Entity extraction adds 10-30% overhead

Privacy and Security

  • OAuth tokens have user-level permissions

  • Graphlit never stores raw OAuth refresh tokens

  • Email content encrypted at rest

  • Multi-tenant isolation ensures data privacy

