Build Knowledge Graph from PDF Documents
Use Case: Build Knowledge Graph from PDF Documents
User Intent
"How do I extract entities from PDF documents to build a knowledge graph? Show me a complete workflow from PDF ingestion to querying entities."
Operation
SDK Methods: createWorkflow(), ingestUri(), isContentDone(), getContent(), queryObservables()
GraphQL: Complete workflow + ingestion + entity querying
Entity: PDF → Content → Observations → Observables (Knowledge Graph)
Prerequisites
Graphlit project with API credentials
PDF documents to process (local files or URLs)
Understanding of entity types
Basic knowledge of workflows
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';
import {
FilePreparationServiceTypes,
EntityExtractionServiceTypes,
ObservableTypes,
EntityState
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
console.log('=== Building Knowledge Graph from PDF ===\n');
// Step 1: Create extraction workflow
console.log('Step 1: Creating extraction workflow...');
const workflow = await graphlit.createWorkflow({
name: "PDF Entity Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument // PDF, Word, Excel, etc.
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelDocument, // Vision model for PDFs
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event
]
}
}]
}
});
console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);
// Step 2: Ingest PDF
console.log('Step 2: Ingesting PDF document...');
const content = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',
  "Research Paper",
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id } // workflow reference
);
console.log(`✓ Ingested: ${content.ingestUri.id}\n`);
// Step 3: Wait for processing
console.log('Step 3: Waiting for entity extraction...');
let isDone = false;
while (!isDone) {
const status = await graphlit.isContentDone(content.ingestUri.id);
isDone = status.isContentDone.result;
if (!isDone) {
console.log(' Processing... (checking again in 2s)');
await new Promise(resolve => setTimeout(resolve, 2000));
}
}
console.log('✓ Extraction complete\n');
// Step 4: Retrieve content with entities
console.log('Step 4: Retrieving extracted entities...');
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];
console.log(`✓ Found ${observations.length} entity observations\n`);
// Step 5: Analyze entities by type
console.log('Step 5: Analyzing entities...\n');
const byType = new Map<string, Set<string>>();
observations.forEach(obs => {
if (!byType.has(obs.type)) {
byType.set(obs.type, new Set());
}
byType.get(obs.type)!.add(obs.observable.name);
});
byType.forEach((entities, type) => {
console.log(`${type} (${entities.size} unique):`);
Array.from(entities).slice(0, 5).forEach(name => {
console.log(` - ${name}`);
});
if (entities.size > 5) {
console.log(` ... and ${entities.size - 5} more`);
}
console.log();
});
// Step 6: Query knowledge graph
console.log('Step 6: Querying knowledge graph...\n');
// Get all unique people
const people = await graphlit.queryObservables({
filter: {
types: [ObservableTypes.Person],
states: [EntityState.Enabled]
}
});
console.log(`Total people in knowledge graph: ${people.observables.results.length}`);
// Get all organizations
const orgs = await graphlit.queryObservables({
filter: {
types: [ObservableTypes.Organization],
states: [EntityState.Enabled]
}
});
console.log(`Total organizations in knowledge graph: ${orgs.observables.results.length}`);
// Step 7: Find entity relationships
console.log('\nStep 7: Analyzing entity co-occurrences...\n');
const cooccurrences: Array<{ person: string; organization: string; count: number }> = [];
observations
.filter(obs => obs.type === ObservableTypes.Person)
.forEach(personObs => {
observations
.filter(obs => obs.type === ObservableTypes.Organization)
.forEach(orgObs => {
// Check if they appear on same pages
const personPages = new Set(
personObs.occurrences?.map(occ => occ.pageIndex) || []
);
const orgPages = new Set(
orgObs.occurrences?.map(occ => occ.pageIndex) || []
);
const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
if (sharedPages.length > 0) {
cooccurrences.push({
person: personObs.observable.name,
organization: orgObs.observable.name,
count: sharedPages.length
});
}
});
});
console.log('Top person-organization relationships:');
cooccurrences
.sort((a, b) => b.count - a.count)
.slice(0, 5)
.forEach(({ person, organization, count }) => {
console.log(` ${person} ↔ ${organization} (${count} pages)`);
});
console.log('\n✓ Knowledge graph analysis complete!');
Step-by-Step Explanation
Step 1: Create Extraction Workflow
Document Preparation:
FilePreparationServiceTypes.ModelDocument handles PDFs, Word, Excel, PowerPoint
Extracts text, tables, and images
Preserves page structure and layout
Handles encrypted PDFs (if a password is provided)
Vision-Based Extraction:
EntityExtractionServiceTypes.ModelDocument uses vision models
Analyzes visual layout (charts, diagrams, tables)
Better for scanned PDFs
Extracts from images within PDFs
Entity Type Selection:
Choose types relevant to your domain
More types = longer processing time
Start with Person, Organization, Place, Event
Step 2: Ingest PDF Document
Ingestion Options:
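The main example passes everything positionally to ingestUri. A sketch of the options you would typically vary, grouped as a plain object for readability — these groupings are illustrative, not a graphlit-client type; check the ingestUri signature in your installed SDK version for the exact positional order:

```typescript
// Illustrative grouping of the ingestUri options used in Step 2.
// The '<workflow-id>' placeholder stands in for workflow.createWorkflow.id.
const ingestionOptions = {
  uri: 'https://arxiv.org/pdf/2301.00001.pdf', // public URL (or signed URL for private storage)
  name: 'Research Paper',                      // display name; defaults to the file name if omitted
  workflow: { id: '<workflow-id>' },           // extraction workflow reference from Step 1
  isSynchronous: false,                        // false: return immediately, then poll (Step 3)
};
console.log(`Ingesting "${ingestionOptions.name}" from ${ingestionOptions.uri}`);
```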
Step 3: Poll for Completion
Processing Timeline:
Small PDF (<10 pages): 30-60 seconds
Medium PDF (10-50 pages): 1-3 minutes
Large PDF (50-200 pages): 3-10 minutes
Very large PDF (200+ pages): 10-30 minutes
Polling Strategy:
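The fixed 2-second loop in Step 3 works, but for larger PDFs a capped exponential backoff wastes fewer API calls. A minimal sketch, where the check callback stands in for graphlit.isContentDone:

```typescript
// Poll with exponential backoff: 2s, 4s, 8s, ... capped at maxMs,
// giving up after timeoutMs. `check` stands in for graphlit.isContentDone.
async function pollUntilDone(
  check: () => Promise<boolean>,
  { initialMs = 2000, maxMs = 30000, timeoutMs = 1800000 } = {}
): Promise<boolean> {
  const start = Date.now();
  let delay = initialMs;
  while (Date.now() - start < timeoutMs) {
    if (await check()) return true;
    await new Promise(resolve => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxMs); // back off, but never wait longer than maxMs
  }
  return false; // timed out; treat as a processing failure
}
```

In the main example you would call it as `pollUntilDone(async () => (await graphlit.isContentDone(id)).isContentDone.result)`.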
Step 4: Retrieve Extracted Entities
Full Content Details:
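The shape Step 4 relies on, sketched with sample data. Field names follow what the main example reads (type, observable.name, occurrences, pageIndex, confidence); the full Content type returned by getContent carries many more fields:

```typescript
// Minimal shape of what Step 4 reads from getContent: each observation
// carries the entity type, the observed entity, and per-page occurrences.
interface Occurrence { pageIndex: number; confidence?: number }
interface Observation {
  type: string;                 // e.g. "PERSON", "ORGANIZATION"
  observable: { name: string }; // the entity itself
  occurrences?: Occurrence[];   // where it was seen in the document
}

const sample: Observation[] = [
  { type: 'PERSON', observable: { name: 'Ada Lovelace' },
    occurrences: [{ pageIndex: 0, confidence: 0.92 }] },
  { type: 'ORGANIZATION', observable: { name: 'Analytical Engines Ltd' },
    occurrences: [{ pageIndex: 0, confidence: 0.81 }] },
];
console.log(`Found ${sample.length} entity observations`); // Found 2 entity observations
```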
Step 5: Analyze Entities
Group by Type:
Deduplicate:
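Step 5 already groups by type; deduplication deserves one extra step, since extraction can surface the same entity with casing or whitespace variants. A sketch that normalizes names before adding them to the per-type set:

```typescript
// Group observations by type, deduplicating on a normalized name so that
// "OpenAI" and "openai " collapse into one entry.
type Obs = { type: string; observable: { name: string } };

function groupUnique(observations: Obs[]): Map<string, Set<string>> {
  const byType = new Map<string, Set<string>>();
  for (const obs of observations) {
    const key = obs.observable.name.trim().toLowerCase(); // normalization heuristic
    if (!byType.has(obs.type)) byType.set(obs.type, new Set());
    byType.get(obs.type)!.add(key);
  }
  return byType;
}

const grouped = groupUnique([
  { type: 'ORGANIZATION', observable: { name: 'OpenAI' } },
  { type: 'ORGANIZATION', observable: { name: 'openai ' } },
  { type: 'PERSON', observable: { name: 'Grace Hopper' } },
]);
console.log(grouped.get('ORGANIZATION')!.size); // 1
```

In production you would also keep the original casing as the canonical display name alongside the normalized key.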
Step 6: Query Knowledge Graph
After entities are extracted, they're available globally:
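Beyond the type filters in Step 6, entity filters also support looking up specific entities. Shown as plain filter objects (in the SDK these would use the ObservableTypes and EntityState enums as in Step 6; the free-text search field is an assumption — verify it against your SDK's filter input types):

```typescript
// Filter sketches for queryObservables. Plain objects here; field names for
// the free-text lookup are assumptions to verify against the SDK types.
const peopleFilter = {
  types: ['PERSON'],
  states: ['ENABLED'],   // only active entities
};
const lookupFilter = {
  types: ['ORGANIZATION'],
  search: 'Stanford',    // assumed free-text search across entity names
};
console.log(`Looking up organizations matching "${lookupFilter.search}"`);
```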
Step 7: Analyze Relationships
Co-occurrence Analysis:
Entities appearing on the same page are likely related
Frequency indicates relationship strength
Build relationship graph from co-occurrences
Cross-document Relationships:
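Because observables are shared across contents, the same person observed in two papers links those papers. A sketch that inverts observations from several documents into an entity-to-documents index (pure logic over the observation shape used in Step 4):

```typescript
// Build an inverted index: entity name -> set of document ids it appears in.
// Two documents sharing an entity are related through the knowledge graph.
type DocObs = { contentId: string; observations: { observable: { name: string } }[] };

function entityToDocuments(docs: DocObs[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const obs of doc.observations) {
      const name = obs.observable.name;
      if (!index.has(name)) index.set(name, new Set());
      index.get(name)!.add(doc.contentId);
    }
  }
  return index;
}

const index = entityToDocuments([
  { contentId: 'paper-1', observations: [{ observable: { name: 'MIT' } }] },
  { contentId: 'paper-2', observations: [{ observable: { name: 'MIT' } }] },
]);
console.log(index.get('MIT')!.size); // 2 — MIT links the two papers
```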
Configuration Options
Choosing Text vs Vision Extraction
Use Text Extraction (ModelText) When:
PDFs are text-based (born-digital)
No important visual elements
Want faster/cheaper processing
Content is primarily textual
Use Vision Extraction (ModelDocument) When:
PDFs are scanned documents
Contains important charts/diagrams
Mixed text and visual content
OCR quality is poor with text extraction
Model Selection for Quality vs Speed
Variations
Variation 1: Legal Contract Analysis
Extract parties, dates, obligations from legal documents:
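A workflow sketch for contracts, shown as a plain configuration object (the object form of the createWorkflow input from Step 1). The entity-type coverage is an assumption: dates and obligations have no dedicated observable type here, so Event is used for deadlines and effective dates:

```typescript
// Workflow configuration sketch for legal contracts. Entity-type coverage
// is an assumption about what fits the legal domain.
const legalWorkflow = {
  name: 'Legal Contract Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: 'MODEL_TEXT',  // contracts are usually born-digital text
        extractedTypes: [
          'PERSON',          // signatories
          'ORGANIZATION',    // contracting parties
          'PLACE',           // jurisdictions
          'EVENT',           // effective dates, deadlines
        ],
      },
    }],
  },
};
console.log(legalWorkflow.name);
```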
Variation 2: Research Paper Citation Network
Build academic citation graphs:
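Citations themselves are not a built-in observable type; one approach is to extract Person (authors) per paper and link papers that share authors. A minimal sketch of the edge-building step on sample data:

```typescript
// Build co-authorship edges between papers: two papers that share an
// extracted author (PERSON observation) get an edge.
type Paper = { id: string; authors: string[] };

function paperEdges(papers: Paper[]): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  for (let i = 0; i < papers.length; i++) {
    for (let j = i + 1; j < papers.length; j++) {
      const shared = papers[i].authors.filter(a => papers[j].authors.includes(a));
      if (shared.length > 0) edges.push([papers[i].id, papers[j].id]);
    }
  }
  return edges;
}

const edges = paperEdges([
  { id: 'arxiv-1', authors: ['Y. LeCun', 'Y. Bengio'] },
  { id: 'arxiv-2', authors: ['Y. Bengio'] },
  { id: 'arxiv-3', authors: ['G. Hinton'] },
]);
console.log(edges.length); // 1 — arxiv-1 and arxiv-2 share an author
```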
Variation 3: Invoice/Receipt Processing
Extract vendors, amounts, dates from financial documents:
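A configuration sketch for receipts and invoices. Receipts are often scanned, so the vision path is used for both stages; note that monetary amounts are not a standard observable type in this list, so they are typically parsed from the prepared text instead:

```typescript
// Workflow configuration sketch for invoices/receipts. Amounts are parsed
// from the prepared text; the entity types below are assumptions.
const invoiceWorkflow = {
  name: 'Invoice Processing',
  preparation: { jobs: [{ connector: { type: 'MODEL_DOCUMENT' } }] }, // handles scans
  extraction: {
    jobs: [{
      connector: {
        type: 'MODEL_DOCUMENT', // vision model reads layout and stamps
        extractedTypes: ['ORGANIZATION', 'PLACE', 'EVENT'], // vendors, addresses, dates
      },
    }],
  },
};
console.log(invoiceWorkflow.name);
```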
Variation 4: Medical Records Analysis
Extract medical entities from clinical documents:
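A configuration sketch for clinical documents. The medical entity types below are assumptions — check which ObservableTypes your Graphlit version exposes, and fall back to the generic Person/Organization types if they are absent:

```typescript
// Workflow configuration sketch for medical records. The medical observable
// type names are assumptions to verify against your SDK version.
const medicalWorkflow = {
  name: 'Medical Records Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: 'MODEL_DOCUMENT',
        extractedTypes: [
          'MEDICAL_CONDITION', // diagnoses (assumed type name)
          'MEDICAL_DRUG',      // prescriptions (assumed type name)
          'PERSON',            // patients, clinicians
        ],
      },
    }],
  },
};
console.log(medicalWorkflow.extraction.jobs[0].connector.extractedTypes.length); // 3
```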
Variation 5: Batch PDF Processing
Process multiple PDFs efficiently:
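A concurrency-limited batch runner; the tasks would be closures over graphlit.ingestUri calls, injected here so the scheduling logic stands alone. The default limit of 10 follows the guidance under Performance Optimization below:

```typescript
// Run ingestion tasks with a concurrency cap: up to `limit` workers drain a
// shared queue, and results come back in task order.
async function runBatch<T>(tasks: Array<() => Promise<T>>, limit = 10): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;              // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Each task would wrap a graphlit.ingestUri call; strings stand in here.
const ids = await runBatch([1, 2, 3].map(n => async () => `content-${n}`), 2);
console.log(ids); // content-1, content-2, content-3, in order
```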
Common Issues & Solutions
Issue: No Entities Extracted from Scanned PDF
Problem: PDF is scanned image, text extraction fails.
Solution: Use vision model + proper preparation:
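The fix mirrors the Step 1 configuration: vision-capable connectors in both stages, so preparation OCRs the page images and extraction reads the result with a vision model. As a plain-object sketch:

```typescript
// For scanned PDFs, both stages must be vision-capable: preparation OCRs the
// page images, extraction analyzes the prepared output with a vision model.
const scannedPdfWorkflow = {
  name: 'Scanned PDF Extraction',
  preparation: { jobs: [{ connector: { type: 'MODEL_DOCUMENT' } }] },
  extraction: {
    jobs: [{
      connector: { type: 'MODEL_DOCUMENT', extractedTypes: ['PERSON', 'ORGANIZATION'] },
    }],
  },
};
```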
Issue: Encrypted PDF Won't Process
Problem: PDF is password-protected.
Solution: Provide password in preparation:
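A hypothetical sketch: the password field name below is a guess, not a confirmed Graphlit API field — verify the document preparation connector input in the current API reference before relying on it:

```typescript
// Hypothetical: supply the PDF password at preparation time. The `password`
// field placement is an assumption; check the Graphlit API reference.
const encryptedPdfWorkflow = {
  name: 'Encrypted PDF Extraction',
  preparation: {
    jobs: [{
      connector: { type: 'DOCUMENT', document: { password: 'file-password' } },
    }],
  },
};
```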
Issue: Missing Entities from Images/Charts
Problem: Text-based extraction misses visual elements.
Solution: Use vision model extraction:
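Only the extraction connector needs to change — switch its type from text to vision so charts and embedded images are analyzed; preparation can stay as-is. Sketched as a before/after diff of the connector:

```typescript
// Before: text extraction misses entities that only appear in figures.
const before = { connector: { type: 'MODEL_TEXT', extractedTypes: ['PERSON'] } };
// After: vision extraction also reads charts, diagrams, and embedded images.
const after = { connector: { type: 'MODEL_DOCUMENT', extractedTypes: ['PERSON'] } };
console.log(`${before.connector.type} -> ${after.connector.type}`);
```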
Issue: Processing Takes Too Long
Problem: Large PDF processing exceeds timeout.
Solutions:
Split large PDFs into smaller chunks
Use faster model (GPT-4o instead of GPT-4)
Reduce number of entity types
Process in background, poll asynchronously
Developer Hints
PDF Processing Best Practices
Check file size first: >50MB PDFs may need special handling
Test with sample page: Validate extraction quality before batch
Use appropriate model: Vision for scanned, text for born-digital
Monitor confidence scores: Filter entities with confidence <0.7
Handle failures gracefully: PDFs can be corrupt or malformed
Vision Model Selection
GPT-4o Vision: Best balance (recommended)
Claude 3.5 Sonnet: Good for complex layouts
GPT-4 Vision: Highest quality but slower/expensive
Cost Optimization
Text extraction much cheaper than vision
GPT-4o significantly cheaper than GPT-4
Extract only needed entity types
Batch processing for volume discounts
Performance Optimization
Ingest up to 10 PDFs in parallel
Poll every 2-5 seconds (not more frequently)
Cache entity results to avoid re-querying
Use collections to organize large sets of PDFs
Production Patterns
Pattern from Graphlit Samples
Graphlit_2024_09_13_Extract_People_Organizations_from_ArXiv_Papers.ipynb:
Ingests ArXiv research papers (PDFs)
Extracts Person (authors), Organization (institutions)
Filters by confidence >=0.7
Builds citation network from entities
Exports to CSV for analysis
Pattern from Legal Tech
Process contracts in batch (100s of PDFs)
Extract parties, dates, obligations
Build contract database with entity index
Enable search by party or jurisdiction
Alert on approaching deadlines