# Build Knowledge Graph from PDF Documents

## Use Case: Build Knowledge Graph from PDF Documents

### User Intent

"How do I extract entities from PDF documents to build a knowledge graph? Show me a complete workflow from PDF ingestion to querying entities."

### Operation

**SDK Methods**: `createWorkflow()`, `ingestUri()`, `isContentDone()`, `getContent()`, `queryObservables()`\
**GraphQL**: Complete workflow + ingestion + entity querying\
**Entity**: PDF → Content → Observations → Observables (Knowledge Graph)

### Prerequisites

* Graphlit project with API credentials
* PDF documents to process (local files or URLs)
* Understanding of entity types
* Basic knowledge of workflows

***

### Complete Code Example (TypeScript)

```typescript
import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
  EntityState
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

console.log('=== Building Knowledge Graph from PDF ===\n');

// Step 1: Create extraction workflow
console.log('Step 1: Creating extraction workflow...');
const workflow = await graphlit.createWorkflow({
  name: "PDF Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument  // PDF, Word, Excel, etc.
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);

// Step 2: Ingest PDF
console.log('Step 2: Ingesting PDF document...');
const content = await graphlit.ingestUri(
  'https://arxiv.org/pdf/2301.00001.pdf',
  "Research Paper",
  undefined,
  undefined,
  undefined,
  { id: workflow.createWorkflow.id }
);

console.log(`✓ Ingested: ${content.ingestUri.id}\n`);

// Step 3: Wait for processing
console.log('Step 3: Waiting for entity extraction...');
let isDone = false;
while (!isDone) {
  const status = await graphlit.isContentDone(content.ingestUri.id);
  isDone = status.isContentDone.result;
  
  if (!isDone) {
    console.log('  Processing... (checking again in 2s)');
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}
console.log('✓ Extraction complete\n');

// Step 4: Retrieve content with entities
console.log('Step 4: Retrieving extracted entities...');
const contentDetails = await graphlit.getContent(content.ingestUri.id);
const observations = contentDetails.content.observations || [];

console.log(`✓ Found ${observations.length} entity observations\n`);

// Step 5: Analyze entities by type
console.log('Step 5: Analyzing entities...\n');

const byType = new Map<string, Set<string>>();
observations.forEach(obs => {
  if (!byType.has(obs.type)) {
    byType.set(obs.type, new Set());
  }
  byType.get(obs.type)!.add(obs.observable.name);
});

byType.forEach((entities, type) => {
  console.log(`${type} (${entities.size} unique):`);
  Array.from(entities).slice(0, 5).forEach(name => {
    console.log(`  - ${name}`);
  });
  if (entities.size > 5) {
    console.log(`  ... and ${entities.size - 5} more`);
  }
  console.log();
});

// Step 6: Query knowledge graph
console.log('Step 6: Querying knowledge graph...\n');

// Get all unique people
const people = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Person],
    states: [EntityState.Enabled]
  }
});

console.log(`Total people in knowledge graph: ${people.observables.results.length}`);

// Get all organizations
const orgs = await graphlit.queryObservables({
  filter: {
    types: [ObservableTypes.Organization],
    states: [EntityState.Enabled]
  }
});

console.log(`Total organizations in knowledge graph: ${orgs.observables.results.length}`);

// Step 7: Find entity relationships
console.log('\nStep 7: Analyzing entity co-occurrences...\n');

const cooccurrences: Array<{ person: string; organization: string; count: number }> = [];

observations
  .filter(obs => obs.type === ObservableTypes.Person)
  .forEach(personObs => {
    observations
      .filter(obs => obs.type === ObservableTypes.Organization)
      .forEach(orgObs => {
        // Check if they appear on same pages
        const personPages = new Set(
          personObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        const orgPages = new Set(
          orgObs.occurrences?.map(occ => occ.pageIndex) || []
        );
        
        const sharedPages = Array.from(personPages).filter(p => orgPages.has(p));
        
        if (sharedPages.length > 0) {
          cooccurrences.push({
            person: personObs.observable.name,
            organization: orgObs.observable.name,
            count: sharedPages.length
          });
        }
      });
  });

console.log('Top person-organization relationships:');
cooccurrences
  .sort((a, b) => b.count - a.count)
  .slice(0, 5)
  .forEach(({ person, organization, count }) => {
    console.log(`  ${person} ↔ ${organization} (${count} pages)`);
  });

console.log('\n✓ Knowledge graph analysis complete!');
```

***

### Complete Code Example (C#)
```csharp
using Graphlit;
using Graphlit.Api.Input;

var graphlit = new Graphlit();

Console.WriteLine("=== Building Knowledge Graph from PDF ===\n");

// Step 1: Create workflow
Console.WriteLine("Step 1: Creating extraction workflow...");
var workflow = await graphlit.CreateWorkflow(
    name: "PDF Entity Extraction",
    preparation: new WorkflowPreparationInput
    {
        Jobs = new[]
        {
            new WorkflowPreparationJobInput
            {
                Connector = new FilePreparationConnectorInput
                {
                    Type = FilePreparationServiceTypes.ModelDocument
                }
            }
        }
    },
    extraction: new WorkflowExtractionInput
    {
        Jobs = new[]
        {
            new WorkflowExtractionJobInput
            {
                Connector = new ExtractionConnectorInput
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization,
                        ObservableTypes.Place,
                        ObservableTypes.Event
                    }
                }
            }
        }
    }
);

Console.WriteLine($"✓ Created workflow: {workflow.CreateWorkflow.Id}\n");

// Step 2: Ingest PDF
Console.WriteLine("Step 2: Ingesting PDF document...");
var content = await graphlit.IngestUri(
    name: "Research Paper",
    uri: "https://arxiv.org/pdf/2301.00001.pdf",
    workflow: new EntityReferenceInput { Id = workflow.CreateWorkflow.Id }
);

Console.WriteLine($"✓ Ingested: {content.IngestUri.Id}\n");

// (Continue with remaining steps...)
````

***

### Step-by-Step Explanation

#### Step 1: Create Extraction Workflow

**Document Preparation**:

* `FilePreparationServiceTypes.ModelDocument` handles PDFs, Word, Excel, PowerPoint
* Extracts text, tables, images
* Preserves page structure and layout
* Handles encrypted PDFs (if password provided)

**Vision-Based Extraction**:

* `EntityExtractionServiceTypes.ModelDocument` uses vision models
* Analyzes visual layout (charts, diagrams, tables)
* Better for scanned PDFs
* Extracts from images within PDFs

**Entity Type Selection**:

* Choose types relevant to your domain
* More types = longer processing time
* Start with Person, Organization, Place, Event

#### Step 2: Ingest PDF Document

**Ingestion Options**:

```typescript
// From URL
await graphlit.ingestUri('https://example.com/document.pdf', undefined, undefined, undefined, undefined, { id: workflowId });

// From local file (base64 encoded)
const fileBuffer = fs.readFileSync('./document.pdf');
const base64 = fileBuffer.toString('base64');

await graphlit.ingestEncodedFile({
  name: 'document.pdf',
  data: base64,
  mimeType: 'application/pdf',
  workflow: { id: workflowId }
});

// From cloud storage (via feed)
const feed = await graphlit.createFeed({
  type: FeedTypes.Site,
  site: {
    type: FeedServiceTypes.AzureFile,
    // ... Azure File Share config
  },
  workflow: { id: workflowId }
});
```

#### Step 3: Poll for Completion

**Processing Timeline**:

* Small PDF (<10 pages): 30-60 seconds
* Medium PDF (10-50 pages): 1-3 minutes
* Large PDF (50-200 pages): 3-10 minutes
* Very large PDF (200+ pages): 10-30 minutes

**Polling Strategy**:

```typescript
// Polling with exponential backoff
let retries = 0;
const maxRetries = 60;
let delay = 2000;  // start at 2 seconds

while (retries < maxRetries) {
  const status = await graphlit.isContentDone(contentId);
  if (status.isContentDone.result) break;

  await new Promise(resolve => setTimeout(resolve, delay));
  delay = Math.min(delay * 1.5, 10000);  // back off, capped at 10 seconds
  retries++;
}
```

#### Step 4: Retrieve Extracted Entities

**Full Content Details**:

```typescript
const content = await graphlit.getContent(contentId);

// Access entity observations
const observations = content.content.observations || [];

// Access content metadata
console.log(`Pages: ${content.content.document?.pageCount}`);
console.log(`File size: ${content.content.fileSize}`);
console.log(`Created: ${content.content.creationDate}`);
```

#### Step 5: Analyze Entities

**Group by Type**:

```typescript
const entityGroups = observations.reduce((groups, obs) => {
  if (!groups[obs.type]) {
    groups[obs.type] = [];
  }
  groups[obs.type].push(obs.observable);
  return groups;
}, {} as Record<string, Observable[]>);
```

**Deduplicate**:

```typescript
const uniqueEntities = new Map<string, Observable>();
observations.forEach(obs => {
  uniqueEntities.set(obs.observable.id, obs.observable);
});
```

#### Step 6: Query Knowledge Graph

After entities are extracted, they're available globally:

```typescript
// All people across all content
const allPeople = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person] }
});

// Search for specific person
const kirkEntities = await graphlit.queryObservables({
  search: "Kirk Marple",
  filter: { types: [ObservableTypes.Person] }
});
```

#### Step 7: Analyze Relationships

**Co-occurrence Analysis**:

* Entities on same page likely related
* Frequency indicates relationship strength
* Build relationship graph from co-occurrences
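The third bullet can be sketched as a helper that folds co-occurrence pairs (as computed in Step 7 above) into a weighted adjacency map. The `Edge` shape below is a simplified stand-in for the `cooccurrences` array, not an SDK type:

```typescript
// Fold co-occurrence pairs into a weighted adjacency map:
// person -> (organization -> total shared-page count).
type Edge = { person: string; organization: string; count: number };

function buildRelationshipGraph(edges: Edge[]): Map<string, Map<string, number>> {
  const graph = new Map<string, Map<string, number>>();
  for (const { person, organization, count } of edges) {
    if (!graph.has(person)) graph.set(person, new Map());
    const neighbors = graph.get(person)!;
    // Accumulate weight if the same pair shows up more than once
    neighbors.set(organization, (neighbors.get(organization) ?? 0) + count);
  }
  return graph;
}
```

Feeding the `cooccurrences` array from Step 7 into `buildRelationshipGraph` yields a structure you can traverse directly or export to a visualization tool.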

**Cross-document Relationships**:

```typescript
// Find content mentioning both entities
const relatedContent = await graphlit.queryContents({
  observations: [
    { type: ObservableTypes.Person, observable: { id: personId } },
    { type: ObservableTypes.Organization, observable: { id: orgId } }
  ]
});
```

***

### Configuration Options

#### Choosing Text vs Vision Extraction

**Use Text Extraction (`ModelText`) When**:

* PDFs are text-based (born-digital)
* No important visual elements
* Want faster/cheaper processing
* Content is primarily textual

**Use Vision Extraction (`ModelDocument`) When**:

* PDFs are scanned documents
* Contains important charts/diagrams
* Mixed text and visual content
* OCR quality is poor with text extraction
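If you route documents programmatically, the two checklists above reduce to a small predicate. A sketch — the string return values mirror the `EntityExtractionServiceTypes` member names; in real code you would return the enum members from `graphlit-client` instead:

```typescript
// Pick vision extraction for scanned documents or documents with meaningful
// visual content; fall back to text extraction (cheaper, faster) otherwise.
interface DocTraits {
  isScanned: boolean;   // scanned/image-based rather than born-digital
  hasVisuals: boolean;  // charts, diagrams, or tables that matter
}

function chooseExtractionType(doc: DocTraits): 'ModelDocument' | 'ModelText' {
  return doc.isScanned || doc.hasVisuals ? 'ModelDocument' : 'ModelText';
}
```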

#### Model Selection for Quality vs Speed

```typescript
// High quality (slower, more expensive)
const specGPT4 = await graphlit.createSpecification({
  name: "GPT-4 Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4, temperature: 0.1 }
});

// Balanced (recommended)
const specGPT4o = await graphlit.createSpecification({
  name: "GPT-4o Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAIModels.Gpt4o, temperature: 0.1 }
});

// Fast and cost-effective
const specGemini = await graphlit.createSpecification({
  name: "Gemini Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.Google,
  gemini: { model: GeminiModels.Gemini15Pro, temperature: 0.1 }
});
```

***

### Variations

#### Variation 1: Legal Contract Analysis

Extract parties, dates, obligations from legal documents:

```typescript
const legalWorkflow = await graphlit.createWorkflow({
  name: "Legal Contract Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Person,        // Parties
          ObservableTypes.Organization,  // Companies
          ObservableTypes.Place,         // Jurisdictions
          ObservableTypes.Event          // Effective dates, deadlines
        ]
      }
    }]
  }
});

// Analyze extracted obligations
const contractContent = await graphlit.getContent(contentId);
const events = contractContent.content.observations
  ?.filter(obs => obs.type === ObservableTypes.Event) || [];

console.log('Contract deadlines:');
events.forEach(event => {
  console.log(`  - ${event.observable.name}`);
  event.occurrences?.forEach(occ => {
    console.log(`    Page ${(occ.pageIndex || 0) + 1}, Confidence: ${occ.confidence}`);
  });
});
```

#### Variation 2: Research Paper Citation Network

Build academic citation graphs:

```typescript
const researchWorkflow = await graphlit.createWorkflow({
  name: "Research Paper Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,        // Authors
          ObservableTypes.Organization,  // Institutions
          ObservableTypes.Category       // Topics, keywords
        ]
      }
    }]
  }
});

// Build author collaboration network
const papers = await graphlit.queryContents({
  workflows: [{ id: researchWorkflow.createWorkflow.id }]
});

const collaborations = new Map<string, Set<string>>();

papers.contents.results.forEach(paper => {
  const authors = paper.observations
    ?.filter(obs => obs.type === ObservableTypes.Person)
    .map(obs => obs.observable.name) || [];
  
  // Record co-authorships
  for (let i = 0; i < authors.length; i++) {
    for (let j = i + 1; j < authors.length; j++) {
      const key = [authors[i], authors[j]].sort().join(' & ');
      if (!collaborations.has(key)) {
        collaborations.set(key, new Set());
      }
      collaborations.get(key)!.add(paper.name);
    }
  }
});

console.log('Top collaborations:');
Array.from(collaborations.entries())
  .sort((a, b) => b[1].size - a[1].size)
  .slice(0, 10)
  .forEach(([authors, papers]) => {
    console.log(`  ${authors}: ${papers.size} papers`);
  });
```

#### Variation 3: Invoice/Receipt Processing

Extract vendors, amounts, dates from financial documents:

```typescript
const invoiceWorkflow = await graphlit.createWorkflow({
  name: "Invoice Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision for logos/stamps
        extractedTypes: [
          ObservableTypes.Organization,  // Vendor, customer
          ObservableTypes.Place,         // Billing/shipping address
          ObservableTypes.Event,         // Invoice date, due date
          ObservableTypes.Product        // Line items
        ]
      }
    }]
  }
});

// Extract invoice metadata
const invoice = await graphlit.getContent(invoiceId);
const vendor = invoice.content.observations
  ?.find(obs => obs.type === ObservableTypes.Organization);

console.log(`Vendor: ${vendor?.observable.name}`);
```

#### Variation 4: Medical Records Analysis

Extract medical entities from clinical documents:

```typescript
const medicalWorkflow = await graphlit.createWorkflow({
  name: "Medical Records Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,              // Patients, doctors
          ObservableTypes.MedicalCondition,    // Diagnoses
          ObservableTypes.MedicalDrug,         // Medications
          ObservableTypes.MedicalProcedure,    // Treatments
          ObservableTypes.MedicalTest          // Lab tests
        ]
      }
    }]
  }
});

// Analyze patient record
const record = await graphlit.getContent(recordId);
const conditions = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalCondition) || [];
const drugs = record.content.observations
  ?.filter(obs => obs.type === ObservableTypes.MedicalDrug) || [];

console.log('Diagnoses:', conditions.map(c => c.observable.name));
console.log('Medications:', drugs.map(d => d.observable.name));
```

#### Variation 5: Batch PDF Processing

Process multiple PDFs efficiently:

```typescript
const pdfUrls = [
  'https://example.com/doc1.pdf',
  'https://example.com/doc2.pdf',
  'https://example.com/doc3.pdf',
  // ... more PDFs
];

// Ingest all PDFs
const contentIds = await Promise.all(
  pdfUrls.map(uri =>
    graphlit.ingestUri(uri, undefined, undefined, undefined, undefined, { id: workflowId })
      .then(result => result.ingestUri.id)
  )
);

console.log(`Ingested ${contentIds.length} PDFs`);

// Wait for all to complete
const waitForAll = async () => {
  let allDone = false;
  while (!allDone) {
    const statuses = await Promise.all(
      contentIds.map(id => graphlit.isContentDone(id))
    );
    allDone = statuses.every(s => s.isContentDone.result);
    
    if (!allDone) {
      console.log('Processing...');
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }
};

await waitForAll();
console.log('All PDFs processed');

// Query all extracted entities
const allEntities = await graphlit.queryObservables({
  filter: { types: [ObservableTypes.Person, ObservableTypes.Organization] }
});

console.log(`Total entities extracted: ${allEntities.observables.results.length}`);
```

***

### Common Issues & Solutions

#### Issue: No Entities Extracted from Scanned PDF

**Problem**: PDF is scanned image, text extraction fails.

**Solution**: Use vision model + proper preparation:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        includeOCR: true  // Enable OCR for scanned docs
      }
    }
  }]
},
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument  // Vision model
    }
  }]
}
```

#### Issue: Encrypted PDF Won't Process

**Problem**: PDF is password-protected.

**Solution**: Provide password in preparation:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.ModelDocument,
      document: {
        password: 'document-password-here'
      }
    }
  }]
}
```

#### Issue: Missing Entities from Images/Charts

**Problem**: Text-based extraction misses visual elements.

**Solution**: Use vision model extraction:

```typescript
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument,  // Analyzes images
      extractedTypes: [/* ... */]
    }
  }]
}
```

#### Issue: Processing Takes Too Long

**Problem**: Large PDF processing exceeds timeout.

**Solutions**:

1. Split large PDFs into smaller chunks
2. Use faster model (GPT-4o instead of GPT-4)
3. Reduce number of entity types
4. Process in background, poll asynchronously

```typescript
// Async processing pattern
const content = await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, undefined, { id: workflowId });

// Don't block - return ID immediately
console.log(`Processing started: ${content.ingestUri.id}`);
// Poll in background job or webhook
```

***

### Developer Hints

#### PDF Processing Best Practices

1. **Check file size first**: >50MB PDFs may need special handling
2. **Test with sample page**: Validate extraction quality before batch
3. **Use appropriate model**: Vision for scanned, text for born-digital
4. **Monitor confidence scores**: Filter entities with confidence <0.7
5. **Handle failures gracefully**: PDFs can be corrupt or malformed
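Hint 4 can be sketched as a filter over observations. The shapes below are simplified from what `getContent()` returns, and the 0.7 threshold is the rule of thumb above, not an SDK default:

```typescript
type Occurrence = { confidence?: number };
type Observation = { observable: { name: string }; occurrences?: Occurrence[] };

// Keep an observation only if its best occurrence clears the threshold.
// Observations with no occurrences (or no confidence value) are dropped.
function filterByConfidence(observations: Observation[], threshold = 0.7): Observation[] {
  return observations.filter(obs => {
    const best = Math.max(0, ...(obs.occurrences ?? []).map(o => o.confidence ?? 0));
    return best >= threshold;
  });
}
```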

#### Vision Model Selection

* **GPT-4o Vision**: Best balance (recommended)
* **Claude 3.5 Sonnet**: Good for complex layouts
* **GPT-4 Vision**: Highest quality, but slower and more expensive

#### Cost Optimization

* Text extraction much cheaper than vision
* GPT-4o significantly cheaper than GPT-4
* Extract only needed entity types
* Batch processing for volume discounts

#### Performance Optimization

* Parallel ingestion up to 10 PDFs simultaneously
* Poll every 2-5 seconds (not more frequently)
* Cache entity results to avoid re-querying
* Use collections to organize large sets of PDFs
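The first bullet can be sketched as a chunked batch runner that keeps at most `limit` ingest calls in flight. The `ingest` callback here stands in for a call like `graphlit.ingestUri(...)`:

```typescript
// Process URIs in chunks of `limit`, awaiting each chunk before starting
// the next, so no more than `limit` requests run concurrently.
async function ingestInBatches<T>(
  uris: string[],
  ingest: (uri: string) => Promise<T>,
  limit = 10
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < uris.length; i += limit) {
    const chunk = uris.slice(i, i + limit);
    results.push(...await Promise.all(chunk.map(ingest)));
  }
  return results;
}
```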

***

### Production Patterns

#### Pattern from Graphlit Samples

`Graphlit_2024_09_13_Extract_People_Organizations_from_ArXiv_Papers.ipynb`:

* Ingests ArXiv research papers (PDFs)
* Extracts Person (authors), Organization (institutions)
* Filters by confidence >=0.7
* Builds citation network from entities
* Exports to CSV for analysis
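The notebook's final export step can be sketched as a tiny CSV serializer over (name, type, count) rows — the column set is illustrative, not taken from the notebook:

```typescript
type Row = { name: string; type: string; count: number };

// Quote a field per RFC 4180: wrap in quotes, double any embedded quotes.
function csvField(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize entity rows to CSV, quoting names so commas survive.
function toCsv(rows: Row[]): string {
  const header = 'name,type,count';
  const lines = rows.map(r => [csvField(r.name), r.type, r.count].join(','));
  return [header, ...lines].join('\n');
}
```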

#### Pattern from Legal Tech

* Process contracts in batch (100s of PDFs)
* Extract parties, dates, obligations
* Build contract database with entity index
* Enable search by party or jurisdiction
* Alert on approaching deadlines
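The deadline-alert idea can be sketched as a date-window filter over extracted Event entities. The `Deadline` shape and 30-day window are illustrative — in practice you would first parse dates out of the Event observables:

```typescript
type Deadline = { name: string; date: Date };

// Return events falling between `now` and `now + days`.
function upcomingDeadlines(events: Deadline[], days = 30, now = new Date()): Deadline[] {
  const horizon = new Date(now.getTime() + days * 24 * 60 * 60 * 1000);
  return events.filter(e => e.date >= now && e.date <= horizon);
}
```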

***


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.graphlit.dev/api-guides/use-cases/knowledge-graph/knowledge-graph-from-pdf-documents.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
