# Build Knowledge Graph from PDF Documents
## User Intent

"How do I extract entities from PDF documents to build a knowledge graph? Show me a complete workflow from PDF ingestion to querying entities."
## Operation

- SDK methods: `createWorkflow()`, `ingestUri()`, `isContentDone()`, `getContent()`, `queryObservables()`
- GraphQL: complete workflow + ingestion + entity querying
- Entity flow: PDF → Content → Observations → Observables (knowledge graph)
## Prerequisites

- Graphlit project with API credentials (see the initialization sketch below)
- PDF documents to process (local files or URLs)
- Understanding of entity types
- Basic knowledge of workflows
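The Graphlit client picks up project credentials at construction time. A minimal initialization sketch — the `GRAPHLIT_*` environment variable names and the explicit-credentials constructor overload are assumptions; check your project settings for the exact values:

```typescript
import { Graphlit } from 'graphlit-client';

// Assumed: the client reads GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID,
// and GRAPHLIT_JWT_SECRET from the environment when constructed with no arguments.
const graphlit = new Graphlit();

// Assumed overload: pass credentials explicitly instead of via environment.
const graphlitExplicit = new Graphlit(
  process.env.GRAPHLIT_ORGANIZATION_ID,
  process.env.GRAPHLIT_ENVIRONMENT_ID,
  process.env.GRAPHLIT_JWT_SECRET
);
```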
## Complete Code Example

### TypeScript

```typescript
import { Graphlit } from 'graphlit-client';
import {
FilePreparationServiceTypes,
EntityExtractionServiceTypes,
ObservableTypes,
EntityState
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
console.log('=== Building Knowledge Graph from PDF ===\n');
// Step 1: Create extraction workflow
console.log('Step 1: Creating extraction workflow...');
const workflow = await graphlit.createWorkflow({
name: "PDF Entity Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument // PDF, Word, Excel, etc.
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelDocument, // Vision model for PDFs
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event
]
}
}]
}
});
console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);
// Step 2: Ingest PDF
console.log('Step 2: Ingesting PDF document...');
const content = await graphlit.ingestUri('https://arxiv.org/pdf/2301.00001.pdf', "Research Paper", undefined, undefined, undefined, { id: workflow.createWorkflow.id });
console.log(`✓ Ingested: ${content.ingestUri.id}\n`);
// Step 3: Wait for processing
console.log('Step 3: Waiting for entity extraction...');
let isDone = false;
while (!isDone) {
const status = await graphlit.isContentDone(content.ingestUri.id);
isDone = status.isContentDone.result;
if (!isDone) {
console.log(' Processing... (checking again in 2s)');
await new Promise(resolve => setTimeout(resolve, 2000));
}
}
console.log('✓ Extraction complete\n');
// Step 4: Retrieve content with entities
console.log('Step 4: Retrieving extracted entities...');
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];
console.log(`✓ Found ${observations.length} entity observations\n`);
// Step 5: Analyze entities by type
console.log('Step 5: Analyzing entities...\n');
const byType = new Map<string, Set<string>>();
observations.forEach(obs => {
if (!byType.has(obs.type)) {
byType.set(obs.type, new Set());
}
byType.get(obs.type)!.add(obs.observable.name);
});
byType.forEach((entities, type) => {
console.log(`${type} (${entities.size} unique):`);
Array.from(entities).slice(0, 5).forEach(name => {
console.log(` - ${name}`);
});
if (entities.size > 5) {
console.log(` ... and ${entities.size - 5} more`);
}
console.log();
});
// Step 6: Query knowledge graph
console.log('Step 6: Querying knowledge graph...\n');
// Get all unique people
const people = await graphlit.queryObservables({
filter: {
types: [ObservableTypes.Person],
states: [EntityState.Enabled]
}
});
console.log(`Total people in knowledge graph: ${people.observables.results.length}`);
// Get all organizations
const orgs = await graphlit.queryObservables({
filter: {
types: [ObservableTypes.Organization],
states: [EntityState.Enabled]
}
});
console.log(`Total organizations in knowledge graph: ${orgs.observables.results.length}`);
// Step 7: Find entity relationships
console.log('\nStep 7: Analyzing entity co-occurrences...\n');
const cooccurrences: Array<{ person: string; organization: string; count: number }> = [];
observations
.filter(obs => obs.type === ObservableTypes.Person)
.forEach(personObs => {
observations
.filter(obs => obs.type === ObservableTypes.Organization)
.forEach(orgObs => {
// Check if they appear on same pages
const personPages = new Set(
personObs.occurrences?.map(occ => occ.pageIndex) || []
);
const orgPages = new Set(
orgObs.occurrences?.map(occ => occ.pageIndex) || []
);
const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
if (sharedPages.length > 0) {
cooccurrences.push({
person: personObs.observable.name,
organization: orgObs.observable.name,
count: sharedPages.length
});
}
});
});
console.log('Top person-organization relationships:');
cooccurrences
.sort((a, b) => b.count - a.count)
.slice(0, 5)
.forEach(({ person, organization, count }) => {
console.log(` ${person} ↔ ${organization} (${count} pages)`);
});
console.log('\n✓ Knowledge graph analysis complete!');
```
### C#
```csharp
using Graphlit;
using Graphlit.Api.Input;
var graphlit = new Graphlit();
Console.WriteLine("=== Building Knowledge Graph from PDF ===\n");
// Step 1: Create workflow
Console.WriteLine("Step 1: Creating extraction workflow...");
var workflow = await graphlit.CreateWorkflow(
name: "PDF Entity Extraction",
preparation: new WorkflowPreparationInput
{
Jobs = new[]
{
new WorkflowPreparationJobInput
{
Connector = new FilePreparationConnectorInput
{
Type = FilePreparationServiceDocument
}
}
}
},
extraction: new WorkflowExtractionInput
{
Jobs = new[]
{
new WorkflowExtractionJobInput
{
Connector = new ExtractionConnectorInput
{
Type = ExtractionServiceModelDocument,
ExtractedTypes = new[]
{
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event
}
}
}
}
}
);
Console.WriteLine($"✓ Created workflow: {workflow.CreateWorkflow.Id}\n");
// Step 2: Ingest PDF
Console.WriteLine("Step 2: Ingesting PDF document...");
var content = await graphlit.IngestUri(
name: "Research Paper",
uri: "https://arxiv.org/pdf/2301.00001.pdf",
workflow: new EntityReferenceInput { Id = workflow.CreateWorkflow.Id }
);
Console.WriteLine($"✓ Ingested: {content.IngestUri.Id}\n");
// (Continue with remaining steps...)
```

## Step-by-Step Explanation
### Step 1: Create Extraction Workflow

Document Preparation:

- `FilePreparationServiceDocument` handles PDFs, Word, Excel, and PowerPoint
- Extracts text, tables, and images
- Preserves page structure and layout
- Handles encrypted PDFs (if a password is provided)

Vision-Based Extraction:

- `ExtractionServiceModelDocument` uses vision models
- Analyzes visual layout (charts, diagrams, tables)
- Better for scanned PDFs
- Extracts from images within PDFs

Entity Type Selection:

- Choose types relevant to your domain
- More types mean longer processing time
- Start with Person, Organization, Place, and Event
### Step 2: Ingest PDF Document

Ingestion Options:

```typescript
// From URL
await graphlit.ingestUri('https://example.com/document.pdf', undefined, undefined, undefined, undefined, { id: workflowId });

// From local file (base64 encoded)
const fileBuffer = fs.readFileSync('./document.pdf');
const base64 = fileBuffer.toString('base64');
await graphlit.ingestEncodedFile({
  name: 'document.pdf',
  data: base64,
  mimeType: 'application/pdf',
  workflow: { id: workflowId }
});

// From cloud storage (via feed)
const feed = await graphlit.createFeed({
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.AzureFile,
    // ... Azure Blob Storage config
  },
  workflow: { id: workflowId }
});
```

### Step 3: Poll for Completion
Processing Timeline:

- Small PDF (<10 pages): 30-60 seconds
- Medium PDF (10-50 pages): 1-3 minutes
- Large PDF (50-200 pages): 3-10 minutes
- Very large PDF (200+ pages): 10-30 minutes

Polling Strategy:

```typescript
// Bounded polling with a fixed delay
let retries = 0;
const maxRetries = 60; // 60 attempts x 2s = 2 minutes max
const delay = 2000;    // 2 seconds

while (retries < maxRetries) {
  const status = await graphlit.isContentDone(contentId);
  if (status.isContentDone.result) break;
  await new Promise(resolve => setTimeout(resolve, delay));
  retries++;
}
```
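The fixed-delay loop above is the simplest approach. If you would rather let the interval grow for long-running documents, here is a minimal backoff sketch under the same assumptions (a `graphlit` client in scope and the `isContentDone()` call shown above; the delay cap and 30-minute deadline are illustrative, matching the upper bound of the timeline above):

```typescript
// Exponential backoff: start at 2s, double the delay up to 30s,
// and give up after 30 minutes.
async function waitForContent(contentId: string): Promise<boolean> {
  let delay = 2_000;
  const maxDelay = 30_000;
  const deadline = Date.now() + 30 * 60 * 1000;

  while (Date.now() < deadline) {
    const status = await graphlit.isContentDone(contentId);
    if (status.isContentDone?.result) return true;

    await new Promise(resolve => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelay); // back off, capped
  }
  return false; // timed out; the content may still finish later
}
```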
### Step 4: Retrieve Extracted Entities

Full Content Details:

```typescript
const content = await graphlit.getContent(contentId);

// Access entity observations
const observations = content.content.observations || [];

// Access content metadata
console.log(`Pages: ${content.content.document?.pageCount}`);
console.log(`File size: ${content.content.fileSize}`);
console.log(`Created: ${content.content.creationDate}`);
```

### Step 5: Analyze Entities
Group by Type:

```typescript
const entityGroups = observations.reduce((groups, obs) => {
  if (!groups[obs.type]) {
    groups[obs.type] = [];
  }
  groups[obs.type].push(obs.observable);
  return groups;
}, {} as Record<string, Observable[]>);
```

Deduplicate:

```typescript
const uniqueEntities = new Map<string, Observable>();
observations.forEach(obs => {
  uniqueEntities.set(obs.observable.id, obs.observable);
});
```

### Step 6: Query Knowledge Graph
After entities are extracted, they're available globally:

```typescript
// All people across all content
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

// Search for a specific person
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk Marple",
  filter: { types: [ObservableTypes.Person] }
});
```

### Step 7: Analyze Relationships
Co-occurrence Analysis:

- Entities on the same page are likely related
- Frequency indicates relationship strength
- Build a relationship graph from co-occurrences, as sketched below
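One illustrative way to fold those page-level co-occurrences into a weighted graph, reusing the `cooccurrences` array built in Step 7 of the main example:

```typescript
// Weighted adjacency map: person -> (organization -> shared-page count).
const graph = new Map<string, Map<string, number>>();

for (const { person, organization, count } of cooccurrences) {
  if (!graph.has(person)) graph.set(person, new Map());
  const edges = graph.get(person)!;
  edges.set(organization, (edges.get(organization) ?? 0) + count);
}

// Strongest relationships for a given person, sorted by edge weight.
const edgesFor = (person: string) =>
  Array.from(graph.get(person)?.entries() ?? [])
    .sort((a, b) => b[1] - a[1]);
```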
Cross-document Relationships:

```typescript
// Find content mentioning both entities
const relatedContent = await graphlit.queryContents({
  filter: {
    observations: [
      { type: ObservableTypes.Person, observable: { id: personId } },
      { type: ObservableTypes.Organization, observable: { id: orgId } }
    ]
  }
});
```

## Configuration Options
### Choosing Text vs. Vision Extraction

Use Text Extraction (ModelText) when (see the workflow sketch after these lists):

- PDFs are text-based (born-digital)
- There are no important visual elements
- You want faster, cheaper processing
- Content is primarily textual

Use Vision Extraction (ModelDocument) when:

- PDFs are scanned documents
- They contain important charts or diagrams
- Content mixes text and visual elements
- OCR quality is poor with text extraction
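For born-digital PDFs, the workflow looks the same as the vision-based one in the main example with the extraction connector swapped to `ModelText`. A minimal sketch, assuming the imports and entity types from the main example:

```typescript
// Text-based extraction: cheaper and faster for born-digital PDFs.
const textWorkflow = await graphlit.createWorkflow({
  name: "PDF Text Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText, // text, not vision
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});
```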
### Model Selection for Quality vs. Speed

```typescript
// High quality (slower, more expensive)
const specGPT4 = await graphlit.createSpecification({
  name: "GPT-4 Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4, temperature: 0.1 }
});

// Balanced (recommended)
const specGPT4o = await graphlit.createSpecification({
  name: "GPT-4o Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4o, temperature: 0.1 }
});

// Fast and cost-effective
const specGemini = await graphlit.createSpecification({
  name: "Gemini Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.Google,
  gemini: { model: GeminiModels.Gemini15Pro, temperature: 0.1 }
});
```

## Variations
### Variation 1: Legal Contract Analysis

Extract parties, dates, and obligations from legal documents:

```typescript
const legalWorkflow = await graphlit.createWorkflow({
  name: "Legal Contract Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,        // Parties
          ObservableTypes.Organization,  // Companies
          ObservableTypes.Place,         // Jurisdictions
          ObservableTypes.Event          // Effective dates, deadlines
        ]
      }
    }]
  }
});

// Analyze extracted obligations
const contractContent = await graphlit.getContent(contentId);
const events = contractContent.content.observations
  ?.filter(obs => obs.type === ObservableTypes.Event) || [];

console.log('Contract deadlines:');
events.forEach(event => {
  console.log(`  - ${event.observable.name}`);
  event.occurrences?.forEach(occ => {
    console.log(`    Page ${(occ.pageIndex || 0) + 1}, Confidence: ${occ.confidence}`);
  });
});
```

### Variation 2: Research Paper Citation Network
Build academic citation graphs:

```typescript
const researchWorkflow = await graphlit.createWorkflow({
  name: "Research Paper Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,        // Authors
          ObservableTypes.Organization,  // Institutions
          ObservableTypes.Category       // Topics, keywords
        ]
      }
    }]
  }
});

// Build author collaboration network
const papers = await graphlit.queryContents({
  filter: { workflows: [{ id: researchWorkflow.createWorkflow.id }] }
});

const collaborations = new Map<string, Set<string>>();

papers.contents.results.forEach(paper => {
  const authors = paper.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .map(obs => obs.observable.name) || [];

  // Record co-authorships
  for (let i = 0; i < authors.length; i++) {
    for (let j = i + 1; j < authors.length; j++) {
      const key = [authors[i], authors[j]].sort().join(' & ');
      if (!collaborations.has(key)) {
        collaborations.set(key, new Set());
      }
      collaborations.get(key)!.add(paper.name);
    }
  }
});

console.log('Top collaborations:');
Array.from(collaborations.entries())
  .sort((a, b) => b[1].size - a[1].size)
  .slice(0, 10)
  .forEach(([authors, papers]) => {
    console.log(`  ${authors}: ${papers.size} papers`);
  });
```

### Variation 3: Invoice/Receipt Processing
Extract vendors, amounts, and dates from financial documents:

```typescript
const invoiceWorkflow = await graphlit.createWorkflow({
  name: "Invoice Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument, // Vision for logos/stamps
        extractedTypes: [
          ObservableTypes.Organization,  // Vendor, customer
          ObservableTypes.Place,         // Billing/shipping address
          ObservableTypes.Event,         // Invoice date, due date
          ObservableTypes.Product        // Line items
        ]
      }
    }]
  }
});

// Extract invoice metadata
const invoice = await graphlit.getContent(invoiceId);
const vendor = invoice.content.observations
  ?.find(obs => obs.type === ObservableTypes.Organization);

console.log(`Vendor: ${vendor?.observable.name}`);
```

### Variation 4: Medical Records Analysis
Extract medical entities from clinical documents:

```typescript
const medicalWorkflow = await graphlit.createWorkflow({
  name: "Medical Records Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,            // Patients, doctors
          ObservableTypes.MedicalCondition,  // Diagnoses
          ObservableTypes.MedicalDrug,       // Medications
          ObservableTypes.MedicalProcedure,  // Treatments
          ObservableTypes.MedicalTest        // Lab tests
        ]
      }
    }]
  }
});

// Analyze patient record
const record = await graphlit.getContent(recordId);
const conditions = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalCondition) || [];
const drugs = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalDrug) || [];

console.log('Diagnoses:', conditions.map(c => c.observable.name));
console.log('Medications:', drugs.map(d => d.observable.name));
```

### Variation 5: Batch PDF Processing
Process multiple PDFs efficiently:

```typescript
const pdfUrls = [
  'https://example.com/doc1.pdf',
  'https://example.com/doc2.pdf',
  'https://example.com/doc3.pdf',
  // ... more PDFs
];

// Ingest all PDFs (positional arguments, matching ingestUri usage above)
const contentIds = await Promise.all(
  pdfUrls.map(uri =>
    graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflowId })
      .then(result => result.ingestUri.id)
  )
);

console.log(`Ingested ${contentIds.length} PDFs`);

// Wait for all to complete
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      contentIds.map(id => graphlit.isContentDone(id))
    );
    allDone = statuses.every(s => s.isContentDone.result);
    if (!allDone) {
      console.log('Processing...');
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();
console.log('All PDFs processed');

// Query all extracted entities
const allEntities = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person, ObservableTypes.Organization] }
});

console.log(`Total entities extracted: ${allEntities.observables.results.length}`);
```

## Common Issues & Solutions
### Issue: No Entities Extracted from Scanned PDF

Problem: The PDF is a scanned image, so text extraction fails.

Solution: Use a vision model plus proper preparation:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        includeOCR: true // Enable OCR for scanned docs
      }
    }
  }]
},
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument // Vision model
    }
  }]
}
```

### Issue: Encrypted PDF Won't Process
Problem: The PDF is password-protected.

Solution: Provide the password in preparation:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        password: 'document-password-here'
      }
    }
  }]
}
```

### Issue: Missing Entities from Images/Charts
Problem: Text-based extraction misses visual elements.

Solution: Use vision model extraction:

```typescript
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument, // Analyzes images
      extractedTypes: [/* ... */]
    }
  }]
}
```

### Issue: Processing Takes Too Long
Problem: Large PDF processing exceeds the timeout.

Solutions:

- Split large PDFs into smaller chunks
- Use a faster model (GPT-4o instead of GPT-4)
- Reduce the number of entity types
- Process in the background and poll asynchronously

```typescript
// Async processing pattern
const content = await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, undefined, { id: workflowId });

// Don't block - return the ID immediately
console.log(`Processing started: ${content.ingestUri.id}`);

// Poll in a background job or via webhook
```

## Developer Hints
### PDF Processing Best Practices

- Check file size first: PDFs over 50MB may need special handling
- Test with a sample page: validate extraction quality before batch runs
- Use the appropriate model: vision for scanned documents, text for born-digital
- Monitor confidence scores: filter entities with confidence below 0.7 (see the sketch after this list)
- Handle failures gracefully: PDFs can be corrupt or malformed
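As referenced above, a small sketch of confidence filtering — the `observations` array and occurrence shape are those from Step 4 of the main example, and the 0.7 threshold follows the guideline here:

```typescript
// Keep only observations with at least one occurrence at confidence >= 0.7.
const MIN_CONFIDENCE = 0.7;

const confident = observations.filter(obs =>
  (obs.occurrences ?? []).some(occ => (occ.confidence ?? 0) >= MIN_CONFIDENCE)
);

console.log(`Kept ${confident.length} of ${observations.length} observations`);
```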
### Vision Model Selection

- GPT-4o Vision: best balance (recommended)
- Claude 3.5 Sonnet: good for complex layouts
- GPT-4 Vision: highest quality, but slower and more expensive
### Cost Optimization

- Text extraction is much cheaper than vision
- GPT-4o is significantly cheaper than GPT-4
- Extract only the entity types you need
- Batch processing for volume discounts
### Performance Optimization

- Parallel ingestion: up to 10 PDFs simultaneously (see the batching sketch after this list)
- Poll every 2-5 seconds, not more frequently
- Cache entity results to avoid re-querying
- Use collections to organize large sets of PDFs
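A sketch of the batching guideline above, capping concurrent ingestion at 10. The `ingestUri` call shape matches the one used throughout this page; `pdfUrls` and `workflowId` are as in Variation 5:

```typescript
// Ingest in chunks of 10 to respect the suggested parallelism limit.
const BATCH_SIZE = 10;
const contentIds: string[] = [];

for (let i = 0; i < pdfUrls.length; i += BATCH_SIZE) {
  const batch = pdfUrls.slice(i, i + BATCH_SIZE);
  const results = await Promise.all(
    batch.map(uri =>
      graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflowId })
    )
  );
  contentIds.push(...results.map(r => r.ingestUri.id));
}
```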
## Production Patterns

### Pattern from Graphlit Samples

Graphlit_2024_09_13_Extract_People_Organizations_from_ArXiv_Papers.ipynb:

- Ingests arXiv research papers (PDFs)
- Extracts Person (authors) and Organization (institutions) entities
- Filters by confidence >= 0.7
- Builds a citation network from entities
- Exports to CSV for analysis (a sketch follows this list)
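A sketch of that CSV export step — the output path is illustrative, and `observations` is the array retrieved in Step 4:

```typescript
import { writeFileSync } from 'fs';

// Flatten observations into type,name rows, quoting names that may contain commas.
const rows = observations.map(obs =>
  `${obs.type},"${(obs.observable.name ?? '').replace(/"/g, '""')}"`
);

writeFileSync('./entities.csv', ['type,name', ...rows].join('\n'));
```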
### Pattern from Legal Tech

- Process contracts in batches (hundreds of PDFs)
- Extract parties, dates, and obligations
- Build a contract database with an entity index
- Enable search by party or jurisdiction
- Alert on approaching deadlines