Build Knowledge Graph from PDF Documents

User Intent

"How do I extract entities from PDF documents to build a knowledge graph? Show me a complete workflow from PDF ingestion to querying entities."

Operation

  • SDK Methods: createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()

  • GraphQL: Complete workflow + ingestion + entity querying

  • Entity Flow: PDF → Content → Observations → Observables (Knowledge Graph)

Prerequisites

  • Graphlit project with API credentials (see the check after this list)

  • PDF documents to process (local files or URLs)

  • Understanding of entity types

  • Basic knowledge of workflows
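
Before running the example, the client needs credentials; a minimal fail-fast check, assuming the standard graphlit-client environment variables:

// Validate the environment variables the Graphlit client reads at startup
// (variable names assumed per the standard graphlit-client setup).
for (const name of [
  'GRAPHLIT_ORGANIZATION_ID',
  'GRAPHLIT_ENVIRONMENT_ID',
  'GRAPHLIT_JWT_SECRET'
]) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}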


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from PDF ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "PDF Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // PDF, Word, Excel, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision model for PDFs
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Ingest PDF
console.log('Step 2: Ingesting PDF document...');
const content = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',  // uri
  "Research Paper",                         // name
  undefined, undefined, undefined,
  { id: workflow.createWorkflow.id }        // workflow reference
);

console.log(`✓ Ingested: ${content.ingestUri.id}\n`);

// Step 3: Wait for processing
console.log('Step 3: Waiting for entity extraction...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(content.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    console.log('  Processing... (checking again in 2s)');
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}
console.log('✓ Extraction complete\n');

// Step 4: Retrieve content with entities
console.log('Step 4: Retrieving extracted entities...');
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];

console.log(`✓ Found ${observations.length} entity observations\n`);

// Step 5: Analyze entities by type
console.log('Step 5: Analyzing entities...\n');

const byType = new Map<string, Set<string>>();
observations.forEach(obs => {
  if (!byType.has(obs.type)) {
    byType.set(obs.type, new Set());
  }
  byType.get(obs.type)!.add(obs.observable.name);
});

byType.forEach((entities, type) => {
  console.log(`${type} (${entities.size} unique):`);
  Array.from(entities).slice(0, 5).forEach(name => {
    console.log(`  - ${name}`);
  });
  if (entities.size > 5) {
    console.log(`  ... and ${entities.size - 5} more`);
  }
  console.log();
});

// Step 6: Query knowledge graph
console.log('Step 6: Querying knowledge graph...\n');

// Get all unique people
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`Total people in knowledge graph: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Total organizations in knowledge graph: ${orgs.observables.results.length}`);

// Step 7: Find entity relationships
console.log('\nStep 7: Analyzing entity co-occurrences...\n');

const cooccurrences: Array<{ person: string; organization: string; count: number }> = [];

observations
  .filter(obs => obs.type === ObservableTypes.Person)
  .forEach(personObs => {
    observations
      .filter(obs => obs.type === ObservableTypes.Organization)
      .forEach(orgObs => {
        // Check if they appear on same pages
        const personPages = new Set(
          personObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        const orgPages = new Set(
          orgObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        
        const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
        
        if (sharedPages.length > 0) {
          cooccurrences.push({
            person: personObs.observable.name,
            organization: orgObs.observable.name,
            count: sharedPages.length
          });
        }
      });
  });

console.log('Top person-organization relationships:');
cooccurrences
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ person, organization, count }) => {
    console.log(`  ${person} ↔ ${organization} (${count} pages)`);
  });

console.log('\n✓ Knowledge graph analysis complete!');


Complete Code Example (C#)
using Graphlit;
using Graphlit.Api.Input;

var graphlit = new Graphlit();

Console.WriteLine("=== Building Knowledge Graph from PDF ===\n");

// Step 1: Create workflow
Console.WriteLine("Step 1: Creating extraction workflow...");
var workflow = await graphlit.CreateWorkflow(
    name: "PDF Entity Extraction",
    preparation: new WorkflowPreparationInput
    {
        Jobs = new[]
        {
            new WorkflowPreparationJobInput
            {
                Connector = new FilePreparationConnectorInput
                {
                    Type = FilePreparationServiceTypes.ModelDocument
                }
            }
        }
    },
    extraction: new WorkflowExtractionInput
    {
        Jobs = new[]
        {
            new WorkflowExtractionJobInput
            {
                Connector = new ExtractionConnectorInput
                {
                    Type = EntityExtractionServiceTypes.ModelDocument,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization,
                        ObservableTypes.Place,
                        ObservableTypes.Event
                    }
                }
            }
        }
    }
);

Console.WriteLine($"✓ Created workflow: {workflow.CreateWorkflow.Id}\n");

// Step 2: Ingest PDF
Console.WriteLine("Step 2: Ingesting PDF document...");
var content = await graphlit.IngestUri(
    name: "Research Paper",
    uri: "https://arxiv.org/pdf/2301.00001.pdf",
    workflow: new EntityReferenceInput { Id = workflow.CreateWorkflow.Id }
);

Console.WriteLine($"✓ Ingested: {content.IngestUri.Id}\n");

// (Continue with remaining steps...)

Step-by-Step Explanation

Step 1: Create Extraction Workflow

Document Preparation:

  • FilePreparationServiceTypes.ModelDocument handles PDFs, Word, Excel, and PowerPoint

  • Extracts text, tables, images

  • Preserves page structure and layout

  • Handles encrypted PDFs (if password provided)

Vision-Based Extraction:

  • EntityExtractionServiceTypes.ModelDocument uses vision models

  • Analyzes visual layout (charts, diagrams, tables)

  • Better for scanned PDFs

  • Extracts from images within PDFs

Entity Type Selection:

  • Choose types relevant to your domain

  • More types = longer processing time

  • Start with Person, Organization, Place, Event

Step 2: Ingest PDF Document

Ingestion Options:

// From URL
await graphlit.ingestUri('https://example.com/document.pdf', undefined, undefined, undefined, undefined, { id: workflowId });

// From local file (base64 encoded)
const fileBuffer = fs.readFileSync('./document.pdf');
const base64 = fileBuffer.toString('base64');

await graphlit.ingestEncodedFile({
  name: 'document.pdf',
  data: base64,
  mimeType: 'application/pdf',
  workflow: { id: workflowId }
});

// From cloud storage (via feed)
const feed = await graphlit.createFeed({
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.AzureFile,
    // ... Azure File Share config
  },
  workflow: { id: workflowId }
});

Step 3: Poll for Completion

Processing Timeline:

  • Small PDF (<10 pages): 30-60 seconds

  • Medium PDF (10-50 pages): 1-3 minutes

  • Large PDF (50-200 pages): 3-10 minutes

  • Very large PDF (200+ pages): 10-30 minutes

Polling Strategy:

// Efficient polling with exponential backoff (2s initial, 30s cap)
let delay = 2000;
const maxDelay = 30000;
const deadline = Date.now() + 10 * 60 * 1000;  // overall 10-minute timeout

while (Date.now() < deadline) {
  const status = await graphlit.isContentDone(contentId);
  if (status.isContentDone.result) break;

  await new Promise(resolve => setTimeout(resolve, delay));
  delay = Math.min(delay * 2, maxDelay);  // back off between checks
}

Step 4: Retrieve Extracted Entities

Full Content Details:

const content = await graphlit.getContent(contentId);

// Access entity observations
const observations = content.content.observations || [];

// Access content metadata
console.log(`Pages: ${content.content.document?.pageCount}`);
console.log(`File size: ${content.content.fileSize}`);
console.log(`Created: ${content.content.creationDate}`);

Step 5: Analyze Entities

Group by Type:

const entityGroups = observations.reduce((groups, obs) => {
  if (!groups[obs.type]) {
    groups[obs.type] = [];
  }
  groups[obs.type].push(obs.observable);
  return groups;
}, {} as Record<string, Observable[]>);

Deduplicate:

const uniqueEntities = new Map<string, Observable>();
observations.forEach(obs => {
  uniqueEntities.set(obs.observable.id, obs.observable);
});

Step 6: Query Knowledge Graph

After entities are extracted, they're available globally:

// All people across all content
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

// Search for specific person
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk Marple",
  filter: { types: [ObservableTypes.Person] }
});

Step 7: Analyze Relationships

Co-occurrence Analysis:

  • Entities on same page likely related

  • Frequency indicates relationship strength

  • Build relationship graph from co-occurrences (see the sketch below)

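A minimal sketch that folds the cooccurrences array from Step 7 into a weighted adjacency map ('Alice Example' is a hypothetical placeholder name):

// Build an undirected, weighted relationship graph from the
// person-organization co-occurrence pairs computed in Step 7.
type Edge = { target: string; weight: number };
const graph = new Map<string, Edge[]>();

for (const { person, organization, count } of cooccurrences) {
  if (!graph.has(person)) graph.set(person, []);
  if (!graph.has(organization)) graph.set(organization, []);
  // Edge weight = number of shared pages
  graph.get(person)!.push({ target: organization, weight: count });
  graph.get(organization)!.push({ target: person, weight: count });
}

// Example: list one entity's strongest relationships
const strongest = (graph.get('Alice Example') || [])
  .sort((a, b) => b.weight - a.weight)
  .slice(0, 3);
console.log(strongest);
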
Cross-document Relationships:

// Find content mentioning both entities
const relatedContent = await graphlit.queryContents({
  filter: {
    observations: [
      { type: ObservableTypes.Person, observable: { id: personId } },
      { type: ObservableTypes.Organization, observable: { id: orgId } }
    ]
  }
});

Configuration Options

Choosing Text vs Vision Extraction

Use Text Extraction (ModelText) When:

  • PDFs are text-based (born-digital)

  • No important visual elements

  • Want faster/cheaper processing

  • Content is primarily textual

Use Vision Extraction (ModelDocument) When:

  • PDFs are scanned documents

  • Contains important charts/diagrams

  • Mixed text and visual content

  • OCR quality is poor with text extraction
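
For the born-digital case, swapping ModelText into the Step 1 workflow is a one-line change; a sketch:

// Text-based extraction for born-digital PDFs — typically faster and
// cheaper than the vision-based ModelDocument connector.
const textWorkflow = await graphlit.createWorkflow({
  name: "PDF Text Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});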

Model Selection for Quality vs Speed

// High quality (slower, more expensive)
const specGPT4 = await graphlit.createSpecification({
  name: "GPT-4 Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4, temperature: 0.1 }
});

// Balanced (recommended)
const specGPT4o = await graphlit.createSpecification({
  name: "GPT-4o Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4o, temperature: 0.1 }
});

// Fast and cost-effective
const specGemini = await graphlit.createSpecification({
  name: "Gemini Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.Google,
  gemini: { model: GeminiModels.Gemini15Pro, temperature: 0.1 }
});

Variations

Variation 1: Legal Contract Extraction

Extract parties, dates, and obligations from legal documents:

const legalWorkflow = await graphlit.createWorkflow({
  name: "Legal Contract Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,        // Parties
          ObservableTypes.Organization,  // Companies
          ObservableTypes.Place,         // Jurisdictions
          ObservableTypes.Event          // Effective dates, deadlines
        ]
      }
    }]
  }
});

// Analyze extracted obligations
const contractContent = await graphlit.getContent(contentId);
const events = contractContent.content.observations
  ?.filter(obs => obs.type === ObservableTypes.Event) || [];

console.log('Contract deadlines:');
events.forEach(event => {
  console.log(`  - ${event.observable.name}`);
  event.occurrences?.forEach(occ => {
    console.log(`    Page ${(occ.pageIndex || 0) + 1}, Confidence: ${occ.confidence}`);
  });
});

Variation 2: Research Paper Citation Network

Build academic citation graphs:

const researchWorkflow = await graphlit.createWorkflow({
  name: "Research Paper Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,        // Authors
          ObservableTypes.Organization,  // Institutions
          ObservableTypes.Category       // Topics, keywords
        ]
      }
    }]
  }
});

// Build author collaboration network
const papers = await graphlit.queryContents({
  filter: { workflows: [{ id: researchWorkflow.createWorkflow.id }] }
});

const collaborations = new Map<string, Set<string>>();

papers.contents.results.forEach(paper => {
  const authors = paper.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .map(obs => obs.observable.name) || [];
  
  // Record co-authorships
  for (let i = 0; i < authors.length; i++) {
    for (let j = i + 1; j < authors.length; j++) {
      const key = [authors[i], authors[j]].sort().join(' & ');
      if (!collaborations.has(key)) {
        collaborations.set(key, new Set());
      }
      collaborations.get(key)!.add(paper.name);
    }
  }
});

console.log('Top collaborations:');
Array.from(collaborations.entries())
  .sort((a, b) => b[1].size - a[1].size)
  .slice(0, 10)
  .forEach(([authors, papers]) => {
    console.log(`  ${authors}: ${papers.size} papers`);
  });

Variation 3: Invoice/Receipt Processing

Extract vendors, amounts, dates from financial documents:

const invoiceWorkflow = await graphlit.createWorkflow({
  name: "Invoice Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision for logos/stamps
        extractedTypes: [
          ObservableTypes.Organization,  // Vendor, customer
          ObservableTypes.Place,         // Billing/shipping address
          ObservableTypes.Event,         // Invoice date, due date
          ObservableTypes.Product        // Line items
        ]
      }
    }]
  }
});

// Extract invoice metadata
const invoice = await graphlit.getContent(invoiceId);
const vendor = invoice.content.observations
  ?.find(obs => obs.type === ObservableTypes.Organization);

console.log(`Vendor: ${vendor?.observable.name}`);

Variation 4: Medical Records Analysis

Extract medical entities from clinical documents:

const medicalWorkflow = await graphlit.createWorkflow({
  name: "Medical Records Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,              // Patients, doctors
          ObservableTypes.MedicalCondition,    // Diagnoses
          ObservableTypes.MedicalDrug,         // Medications
          ObservableTypes.MedicalProcedure,    // Treatments
          ObservableTypes.MedicalTest          // Lab tests
        ]
      }
    }]
  }
});

// Analyze patient record
const record = await graphlit.getContent(recordId);
const conditions = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalCondition) || [];
const drugs = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalDrug) || [];

console.log('Diagnoses:', conditions.map(c => c.observable.name));
console.log('Medications:', drugs.map(d => d.observable.name));

Variation 5: Batch PDF Processing

Process multiple PDFs efficiently:

const pdfUrls = [
  'https://example.com/doc1.pdf',
  'https://example.com/doc2.pdf',
  'https://example.com/doc3.pdf',
  // ... more PDFs
];

// Ingest all PDFs
const contentIds = await Promise.all(
  pdfUrls.map(uri =>
    graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflowId })
      .then(result => result.ingestUri.id)
  )
);

console.log(`Ingested ${contentIds.length} PDFs`);

// Wait for all to complete
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      contentIds.map(id => graphlit.isContentDone(id))
    );
    allDone = statuses.every(s => s.isContentDone.result);
    
    if (!allDone) {
      console.log('Processing...');
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();
console.log('All PDFs processed');

// Query all extracted entities
const allEntities = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person, ObservableTypes.Organization] }
});

console.log(`Total entities extracted: ${allEntities.observables.results.length}`);

Common Issues & Solutions

Issue: No Entities Extracted from Scanned PDF

Problem: PDF is scanned image, text extraction fails.

Solution: Use vision model + proper preparation:

preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        includeOCR: true  // Enable OCR for scanned docs
      }
    }
  }]
},
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument  // Vision model
    }
  }]
}

Issue: Encrypted PDF Won't Process

Problem: PDF is password-protected.

Solution: Provide password in preparation:

preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        password: 'document-password-here'
      }
    }
  }]
}

Issue: Missing Entities from Images/Charts

Problem: Text-based extraction misses visual elements.

Solution: Use vision model extraction:

extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument,  // Analyzes images
      extractedTypes: [/* ... */]
    }
  }]
}

Issue: Processing Takes Too Long

Problem: Large PDF processing exceeds timeout.

Solutions:

  1. Split large PDFs into smaller chunks

  2. Use faster model (GPT-4o instead of GPT-4)

  3. Reduce number of entity types

  4. Process in background, poll asynchronously

// Async processing pattern
const content = await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, undefined, { id: workflowId });

// Don't block - return ID immediately
console.log(`Processing started: ${content.ingestUri.id}`);
// Poll in background job or webhook

Developer Hints

PDF Processing Best Practices

  1. Check file size first: >50MB PDFs may need special handling

  2. Test with sample page: Validate extraction quality before batch

  3. Use appropriate model: Vision for scanned, text for born-digital

  4. Monitor confidence scores: Filter out entities with confidence below 0.7 (see the sketch after this list)

  5. Handle failures gracefully: PDFs can be corrupt or malformed
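
A minimal sketch of hint 4, reusing the observations array from Step 4:

// Keep only observations that have at least one occurrence at or
// above the confidence threshold.
const MIN_CONFIDENCE = 0.7;

const confident = observations.filter(obs =>
  obs.occurrences?.some(occ => (occ.confidence ?? 0) >= MIN_CONFIDENCE)
);

console.log(`Kept ${confident.length} of ${observations.length} observations`);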

Vision Model Selection

  • GPT-4o Vision: Best balance (recommended)

  • Claude 3.5 Sonnet: Good for complex layouts (see the sketch after this list)

  • GPT-4 Vision: Highest quality, but slower and more expensive

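A Claude 3.5 Sonnet specification would mirror the Model Selection examples above; a sketch, noting that the exact AnthropicModels member name is an assumption by analogy with the other enums:

// Claude 3.5 Sonnet for complex layouts — a sketch; verify the exact
// AnthropicModels enum member name in the generated types (assumed here).
const specClaude = await graphlit.createSpecification({
  name: "Claude Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: { model: AnthropicModels.Claude35Sonnet, temperature: 0.1 }
});
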
Cost Optimization

  • Text extraction much cheaper than vision

  • GPT-4o significantly cheaper than GPT-4

  • Extract only needed entity types

  • Batch processing for volume discounts

Performance Optimization

  • Parallel ingestion up to 10 PDFs simultaneously (see the batching sketch after this list)

  • Poll every 2-5 seconds (not more frequently)

  • Cache entity results to avoid re-querying

  • Use collections to organize large sets of PDFs
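
A sketch of concurrency-limited ingestion using plain batching (no extra dependencies; pdfUrls and workflowId as in Variation 5):

// Ingest in batches of 10 to stay within the parallelism guidance above.
const BATCH_SIZE = 10;
const ids: string[] = [];

for (let i = 0; i < pdfUrls.length; i += BATCH_SIZE) {
  const batch = pdfUrls.slice(i, i + BATCH_SIZE);
  const results = await Promise.all(
    batch.map(uri =>
      graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflowId })
    )
  );
  ids.push(...results.map(r => r.ingestUri.id));
}

console.log(`Ingested ${ids.length} PDFs in batches of ${BATCH_SIZE}`);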


Production Patterns

Pattern from Graphlit Samples

Graphlit_2024_09_13_Extract_People_Organizations_from_ArXiv_Papers.ipynb:

  • Ingests ArXiv research papers (PDFs)

  • Extracts Person (authors), Organization (institutions)

  • Filters by confidence >=0.7

  • Builds citation network from entities

  • Exports to CSV for analysis (see the sketch below)

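A minimal sketch of the confidence-filter-and-export step, reusing the observations array from the main example (column layout is illustrative, not taken from the notebook):

import { writeFileSync } from 'fs';

// Keep observations with at least one occurrence at confidence >= 0.7,
// then flatten to simple type,name CSV rows (quotes escaped naively).
const rows = observations
  .filter(obs => obs.occurrences?.some(occ => (occ.confidence ?? 0) >= 0.7))
  .map(obs => `${obs.type},"${obs.observable.name.replace(/"/g, '""')}"`);

writeFileSync('entities.csv', ['type,name', ...rows].join('\n'));
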
A related production pattern for contract analysis:

  • Process contracts in batch (100s of PDFs)

  • Extract parties, dates, obligations

  • Build contract database with entity index

  • Enable search by party or jurisdiction

  • Alert on approaching deadlines

