Build Knowledge Graph from PDF Documents

Use Case: Build Knowledge Graph from PDF Documents

User Intent

"How do I extract entities from PDF documents to build a knowledge graph? Show me a complete workflow from PDF ingestion to querying entities."

Operation

SDK Methods: createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()

GraphQL: Complete workflow + ingestion + entity querying

Entity: PDF → Content → Observations → Observables (Knowledge Graph)

Prerequisites

  • Graphlit project with API credentials

  • PDF documents to process (local files or URLs)

  • Understanding of entity types

  • Basic knowledge of workflows


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from PDF ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "PDF Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // PDF, Word, Excel, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision model for PDFs
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Ingest PDF
console.log('Step 2: Ingesting PDF document...');
const content = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',  // uri
  "Research Paper",                        // name
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }       // workflow reference
);

console.log(`✓ Ingested: ${content.ingestUri.id}\n`);

// Step 3: Wait for processing
console.log('Step 3: Waiting for entity extraction...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(content.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    console.log('  Processing... (checking again in 2s)');
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}
console.log('✓ Extraction complete\n');

// Step 4: Retrieve content with entities
console.log('Step 4: Retrieving extracted entities...');
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];

console.log(`✓ Found ${observations.length} entity observations\n`);

// Step 5: Analyze entities by type
console.log('Step 5: Analyzing entities...\n');

const byType = new Map<string, Set<string>>();
observations.forEach(obs => {
  if (!byType.has(obs.type)) {
    byType.set(obs.type, new Set());
  }
  byType.get(obs.type)!.add(obs.observable.name);
});

byType.forEach((entities, type) => {
  console.log(`${type} (${entities.size} unique):`);
  Array.from(entities).slice(0, 5).forEach(name => {
    console.log(`  - ${name}`);
  });
  if (entities.size > 5) {
    console.log(`  ... and ${entities.size - 5} more`);
  }
  console.log();
});

// Step 6: Query knowledge graph
console.log('Step 6: Querying knowledge graph...\n');

// Get all unique people
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`Total people in knowledge graph: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Total organizations in knowledge graph: ${orgs.observables.results.length}`);

// Step 7: Find entity relationships
console.log('\nStep 7: Analyzing entity co-occurrences...\n');

const cooccurrences: Array<{ person: string; organization: string; count: number }> = [];

observations
  .filter(obs => obs.type === ObservableTypes.Person)
  .forEach(personObs => {
    observations
      .filter(obs => obs.type === ObservableTypes.Organization)
      .forEach(orgObs => {
        // Check if they appear on same pages
        const personPages = new Set(
          personObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        const orgPages = new Set(
          orgObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        
        const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
        
        if (sharedPages.length > 0) {
          cooccurrences.push({
            person: personObs.observable.name,
            organization: orgObs.observable.name,
            count: sharedPages.length
          });
        }
      });
  });

console.log('Top person-organization relationships:');
cooccurrences
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ person, organization, count }) => {
    console.log(`  ${person} ↔ ${organization} (${count} pages)`);
  });

console.log('\n✓ Knowledge graph analysis complete!');

Run

Save the example as a TypeScript file (e.g. build-knowledge-graph.ts) and run it with a TypeScript runner such as tsx, or compile it with tsc and execute it with Node.


Step-by-Step Explanation

Step 1: Create Extraction Workflow

Document Preparation:

  • FilePreparationServiceTypes.ModelDocument handles PDFs, Word, Excel, PowerPoint

  • Extracts text, tables, images

  • Preserves page structure and layout

  • Handles encrypted PDFs (if password provided)

Vision-Based Extraction:

  • EntityExtractionServiceTypes.ModelDocument uses vision models

  • Analyzes visual layout (charts, diagrams, tables)

  • Better for scanned PDFs

  • Extracts from images within PDFs

Entity Type Selection:

  • Choose types relevant to your domain

  • More types = longer processing time

  • Start with Person, Organization, Place, Event

Step 2: Ingest PDF Document

Ingestion Options:
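The main example passes the workflow by reference as a positional argument. Below is a minimal sketch of the two most common call shapes, reusing the client from the main example; the intermediate undefined parameters follow the same positions as Step 2, and workflowId is a placeholder for the ID returned by createWorkflow.

// Simplest form: URL only. If the project has a default workflow configured,
// Graphlit applies it automatically.
const basic = await graphlit.ingestUri('https://arxiv.org/pdf/2301.00001.pdf');

// With an explicit name and workflow reference, matching Step 2 above.
const withWorkflow = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',
  "Research Paper",
  undefined,
  undefined,
  undefined,
  { id: workflowId }  // placeholder: ID returned by createWorkflow
);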

Step 3: Poll for Completion

Processing Timeline:

  • Small PDF (<10 pages): 30-60 seconds

  • Medium PDF (10-50 pages): 1-3 minutes

  • Large PDF (50-200 pages): 3-10 minutes

  • Very large PDF (200+ pages): 10-30 minutes

Polling Strategy:
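A sketch of a reusable polling helper that wraps the loop from Step 3 and adds an overall timeout; the interval and timeout values are illustrative, not SDK defaults.

import { Graphlit } from 'graphlit-client';

async function waitForContent(
  client: Graphlit,
  contentId: string,
  intervalMs = 2000,           // poll every 2-5 seconds, not more frequently
  timeoutMs = 10 * 60 * 1000   // raise for very large PDFs
): Promise<void> {
  const start = Date.now();
  while (true) {
    const status = await client.isContentDone(contentId);
    if (status.isContentDone?.result) {
      return;
    }
    if (Date.now() - start > timeoutMs) {
      throw new Error(`Content ${contentId} did not finish within ${timeoutMs}ms`);
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}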

Step 4: Retrieve Extracted Entities

Full Content Details:
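Re-fetching the content after processing returns the observations attached during extraction. The sketch below only touches fields already used in the main example (type, observable.name, occurrences.pageIndex); contentId is the ID returned by ingestUri.

const details = await graphlit.getContent(contentId);

for (const obs of details.content?.observations ?? []) {
  console.log(`${obs.type}: ${obs.observable.name}`);
  for (const occ of obs.occurrences ?? []) {
    console.log(`  seen on page ${occ.pageIndex}`);
  }
}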

Step 5: Analyze Entities

Group by Type:

Deduplicate:
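Grouping by type is shown in Step 5. For deduplication, repeated mentions of the same entity normally resolve to the same observable, so keying on the observable ID (or, as a looser heuristic, a normalized name) removes duplicates client-side:

const uniqueEntities = new Map<string, { name: string; type: string }>();

for (const obs of observations) {
  // Keyed on observable ID; swap in a lower-cased name for a name-based heuristic.
  uniqueEntities.set(obs.observable.id, {
    name: obs.observable.name,
    type: obs.type
  });
}

console.log(`Unique entities: ${uniqueEntities.size}`);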

Step 6: Query Knowledge Graph

After entities are extracted, they're available globally:
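Beyond the type filters in Step 6, you can look up a specific entity by name across everything the project has ingested. A sketch, assuming ObservableFilter accepts a search term (verify the exact field in the generated types):

const matches = await graphlit.queryObservables({
  filter: {
    search: "OpenAI",                       // assumed filter field
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

matches.observables?.results?.forEach(o => console.log(o.name));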

Step 7: Analyze Relationships

Co-occurrence Analysis:

  • Entities on same page likely related

  • Frequency indicates relationship strength

  • Build relationship graph from co-occurrences

Cross-document Relationships:
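Because observables are shared across content, you can pivot from a single entity back to every document that mentions it. A hedged sketch: the observations filter on queryContents is an assumption about ContentFilter, and personObservableId is a placeholder captured from an earlier observation.

const related = await graphlit.queryContents({
  filter: {
    observations: [{
      type: ObservableTypes.Person,
      observable: { id: personObservableId }  // placeholder observable ID
    }]
  }
});

related.contents?.results?.forEach(c => console.log(c.name));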


Configuration Options

Choosing Text vs Vision Extraction

Use Text Extraction (ModelText) When:

  • PDFs are text-based (born-digital)

  • No important visual elements

  • Want faster/cheaper processing

  • Content is primarily textual

Use Vision Extraction (ModelDocument) When:

  • PDFs are scanned documents

  • Contains important charts/diagrams

  • Mixed text and visual content

  • OCR quality is poor with text extraction

Model Selection for Quality vs Speed

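You can steer quality vs. speed by creating an extraction specification for a specific model and referencing it from the extraction connector. A sketch under stated assumptions: the enum members (SpecificationTypes.Extraction, ModelServiceTypes.OpenAi, OpenAiModels.Gpt4O_128K) and the connector property used to attach the specification should be verified against your generated types.

import {
  SpecificationTypes,
  ModelServiceTypes,
  OpenAiModels
} from 'graphlit-client/dist/generated/graphql-types';

// Faster, cheaper extraction (GPT-4o); for maximum quality, swap in a larger model.
const fastSpec = await graphlit.createSpecification({
  name: "Extraction (GPT-4o)",
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAiModels.Gpt4O_128K,  // assumed enum member name
    temperature: 0.1
  }
});

// Reference fastSpec.createSpecification.id from the extraction connector when
// creating the workflow (the exact connector property, e.g. modelText vs
// modelDocument, depends on the extraction service type).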

Variations

Variation 1: Legal Contract Analysis

Extract parties, dates, obligations from legal documents:
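A sketch of a contract-oriented workflow, reusing the enums from the main example. Parties surface as Person/Organization observables and key dates as Events; obligations have no dedicated observable type (an assumption of this sketch), so capture them with a downstream completion step.

const contractWorkflow = await graphlit.createWorkflow({
  name: "Legal Contract Extraction",
  preparation: {
    jobs: [{ connector: { type: FilePreparationServiceTypes.ModelDocument } }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,        // individual parties, signatories
          ObservableTypes.Organization,  // corporate parties
          ObservableTypes.Place,         // jurisdictions, venues
          ObservableTypes.Event          // effective, renewal, termination dates
        ]
      }
    }]
  }
});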

Variation 2: Research Paper Citation Network

Build academic citation graphs:
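A sketch of building co-occurrence edges between authors (Person) and institutions (Organization) across several papers. It assumes the papers were ingested with the extraction workflow above and that paperIds holds their content IDs.

type Edge = { from: string; to: string; papers: number };

const paperIds: string[] = [/* content IDs from ingestUri */];
const edges = new Map<string, Edge>();

for (const paperId of paperIds) {
  const details = await graphlit.getContent(paperId);
  const obs = details.content?.observations ?? [];

  const authors = obs
    .filter(o => o.type === ObservableTypes.Person)
    .map(o => o.observable.name);
  const institutions = obs
    .filter(o => o.type === ObservableTypes.Organization)
    .map(o => o.observable.name);

  // Connect every author to every institution observed in the same paper.
  for (const a of authors) {
    for (const i of institutions) {
      const key = `${a}→${i}`;
      const edge = edges.get(key) ?? { from: a, to: i, papers: 0 };
      edge.papers += 1;
      edges.set(key, edge);
    }
  }
}

console.log(`Graph edges: ${edges.size}`);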

Variation 3: Invoice/Receipt Processing

Extract vendors, amounts, dates from financial documents:
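A sketch of an invoice-oriented workflow. Vendors surface as Organization observables, line items as Product (assumed to be available in your SDK version), and invoice/due dates as Events; monetary amounts are not an observable type, so read them from the prepared text instead.

const invoiceWorkflow = await graphlit.createWorkflow({
  name: "Invoice Entity Extraction",
  preparation: {
    jobs: [{ connector: { type: FilePreparationServiceTypes.ModelDocument } }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Organization,  // vendors
          ObservableTypes.Product,       // purchased items (assumed enum member)
          ObservableTypes.Event          // invoice and due dates
        ]
      }
    }]
  }
});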

Variation 4: Medical Records Analysis

Extract medical entities from clinical documents:
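A sketch of a clinical-document workflow. The medical observable types shown (MedicalCondition, MedicalDrug) are assumptions about recent SDK versions; confirm the enum members in ObservableTypes before relying on them.

const medicalWorkflow = await graphlit.createWorkflow({
  name: "Medical Record Extraction",
  preparation: {
    jobs: [{ connector: { type: FilePreparationServiceTypes.ModelDocument } }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,            // patients, providers
          ObservableTypes.Organization,      // clinics, payers
          ObservableTypes.MedicalCondition,  // assumed enum member
          ObservableTypes.MedicalDrug        // assumed enum member
        ]
      }
    }]
  }
});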

Variation 5: Batch PDF Processing

Process multiple PDFs efficiently:
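A sketch of batch ingestion in chunks of 10 (matching the parallel-ingestion hint under Performance Optimization), followed by polling each item; workflowId is a placeholder and waitForContent is the helper sketched under Polling Strategy.

const pdfUris: string[] = [/* your PDF URLs */];
const contentIds: string[] = [];

for (let i = 0; i < pdfUris.length; i += 10) {
  const chunk = pdfUris.slice(i, i + 10);
  const results = await Promise.all(
    chunk.map(uri =>
      graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, {
        id: workflowId  // placeholder: your extraction workflow ID
      })
    )
  );
  contentIds.push(...results.map(r => r.ingestUri.id));
}

// Poll until every item completes.
for (const id of contentIds) {
  await waitForContent(graphlit, id);
}

console.log(`Processed ${contentIds.length} PDFs`);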


Common Issues & Solutions

Issue: No Entities Extracted from Scanned PDF

Problem: PDF is scanned image, text extraction fails.

Solution: Use vision model + proper preparation:
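A sketch of a workflow that uses vision-capable preparation and extraction end to end, so scanned pages are read by a model rather than a plain text extractor (same connectors as the main example, narrowed to two entity types):

const scannedPdfWorkflow = await graphlit.createWorkflow({
  name: "Scanned PDF Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // vision-capable preparation
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
      }
    }]
  }
});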

Issue: Encrypted PDF Won't Process

Problem: PDF is password-protected.

Solution: Provide password in preparation:
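A sketch only: the document password property on the preparation connector is an assumption about the input type, so confirm the exact field name in the generated FilePreparationConnectorInput before using it.

const encryptedPdfWorkflow = await graphlit.createWorkflow({
  name: "Encrypted PDF Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        document: {
          password: process.env.PDF_PASSWORD  // assumed field name
        }
      }
    }]
  }
});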

Issue: Missing Entities from Images/Charts

Problem: Text-based extraction misses visual elements.

Solution: Use vision model extraction:
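The fix is to switch the extraction connector from ModelText to ModelDocument so charts, figures, and embedded images are analyzed by a vision model (a sketch, reusing the enums from the main example):

const visionExtractionWorkflow = await graphlit.createWorkflow({
  name: "Vision Entity Extraction",
  extraction: {
    jobs: [{
      connector: {
        // EntityExtractionServiceTypes.ModelText only sees extracted text;
        // ModelDocument runs a vision model over the rendered pages.
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place
        ]
      }
    }]
  }
});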

Issue: Processing Takes Too Long

Problem: Large PDF processing exceeds timeout.

Solutions:

  1. Split large PDFs into smaller chunks

  2. Use faster model (GPT-4o instead of GPT-4)

  3. Reduce number of entity types

  4. Process in background, poll asynchronously


Developer Hints

PDF Processing Best Practices

  1. Check file size first: >50MB PDFs may need special handling

  2. Test with sample page: Validate extraction quality before batch

  3. Use appropriate model: Vision for scanned, text for born-digital

  4. Monitor confidence scores: Filter entities with confidence <0.7 (see the sketch after this list)

  5. Handle failures gracefully: PDFs can be corrupt or malformed
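A sketch of confidence filtering over the observations retrieved in Step 4; the confidence field on occurrences is an assumption based on the ArXiv sample referenced under Production Patterns, so verify it in the generated types.

const confident = observations.filter(obs =>
  (obs.occurrences ?? []).some(occ => (occ.confidence ?? 0) >= 0.7)
);

console.log(`Kept ${confident.length} of ${observations.length} observations`);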

Vision Model Selection

  • GPT-4o Vision: Best balance (recommended)

  • Claude 3.5 Sonnet: Good for complex layouts

  • GPT-4 Vision: Highest quality, but slower and more expensive

Cost Optimization

  • Text extraction much cheaper than vision

  • GPT-4o significantly cheaper than GPT-4

  • Extract only needed entity types

  • Batch processing for volume discounts

Performance Optimization

  • Parallel ingestion up to 10 PDFs simultaneously

  • Poll every 2-5 seconds (not more frequently)

  • Cache entity results to avoid re-querying

  • Use collections to organize large sets of PDFs


Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_13_Extract_People_Organizations_from_ArXiv_Papers.ipynb:

  • Ingests ArXiv research papers (PDFs)

  • Extracts Person (authors), Organization (institutions)

  • Filters by confidence >=0.7

  • Builds citation network from entities

  • Exports to CSV for analysis

Pattern: Contract Processing Pipeline

  • Process contracts in batch (100s of PDFs)

  • Extract parties, dates, obligations

  • Build contract database with entity index

  • Enable search by party or jurisdiction

  • Alert on approaching deadlines

