# How Entity Extraction Works

## Workflow: How Entity Extraction Works

### User Intent

"How does entity extraction actually work? What happens during the extraction stage?"

### Operation

* **SDK Method**: `createWorkflow()` with `extraction` stage
* **GraphQL**: Workflow with extraction configuration
* **Entity Type**: Workflow
* **Common Use Cases**: Understanding extraction pipeline, configuring extraction, choosing models

### Extraction Pipeline Overview

Entity extraction is an LLM-based process that analyzes text and identifies structured entities (people, organizations, places, etc.).

### TypeScript (Canonical)

```typescript
import { Graphlit } from 'graphlit-client';
import {
  EntityExtractionServiceTypes,
  FilePreparationServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  OpenAiModels,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with extraction
const workflow = await graphlit.createWorkflow({
  name: "Document Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Ingest with extraction workflow
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,
  undefined,
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

// Check extracted entities
const result = await graphlit.getContent(content.ingestUri.id);

console.log(`Extracted ${result.content.observations?.length || 0} entity observations`);

result.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  console.log(`  Confidence: ${obs.occurrences?.[0]?.confidence}`);
});
```

### The Extraction Pipeline

#### Step-by-Step Process

```
1. Content Ingestion
   ↓
2. Preparation Stage
   - Text extraction (PDF, Word, etc.)
   - OCR (scanned documents)
   - Audio transcription
   - Text chunking
   ↓
3. Extraction Stage (THIS IS WHERE IT HAPPENS)
   - Send text to LLM (GPT-4, Claude, etc.)
   - LLM analyzes text for entities
   - LLM returns structured JSON with entities
   - Each entity has: type, name, properties, confidence
   ↓
4. Observation Creation
   - Create Observation records
   - Link to content
   - Store occurrence details (page, location, confidence)
   ↓
5. Entity Resolution
   - Check if entity already exists (by name, email, url, etc.)
   - Create new Observable OR link to existing
   - Deduplicate entities
   ↓
6. Graph Storage
   - Store in graph database
   - Create entity nodes
   - Create observation edges
   - Link to content
   ↓
7. Content State → ENABLED
```
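Step 5 (entity resolution) can be pictured as a keyed lookup. The sketch below is a minimal in-memory illustration, not Graphlit's actual resolution logic: the `ExtractedEntity` shape, the `resolveEntities` helper, and the matching rule (email, then URL, then lowercased name) are all assumptions made for this example.

```typescript
// Hypothetical shapes for illustration only; these are not Graphlit SDK types.
interface ExtractedEntity {
  type: string;
  name: string;
  email?: string;
  url?: string;
}

interface ResolvedEntity extends ExtractedEntity {
  mentions: number; // number of extracted entities merged into this record
}

// Assumed matching rule: prefer a strong identifier (email, then URL),
// otherwise fall back to the name; compare case-insensitively within a type.
function resolutionKey(entity: ExtractedEntity): string {
  const identifier = entity.email ?? entity.url ?? entity.name;
  return `${entity.type}:${identifier.trim().toLowerCase()}`;
}

// Link each extracted entity to an existing record, or create a new one.
function resolveEntities(extracted: ExtractedEntity[]): ResolvedEntity[] {
  const byKey: Record<string, ResolvedEntity> = {};
  const resolved: ResolvedEntity[] = [];
  for (const entity of extracted) {
    const key = resolutionKey(entity);
    const existing = byKey[key];
    if (existing) {
      existing.mentions += 1; // duplicate: link to the existing record
    } else {
      const record: ResolvedEntity = { ...entity, mentions: 1 };
      byKey[key] = record; // first sighting: create a new record
      resolved.push(record);
    }
  }
  return resolved;
}

const deduplicated = resolveEntities([
  { type: "PERSON", name: "Kirk Marple" },
  { type: "PERSON", name: "kirk marple " },
  { type: "ORGANIZATION", name: "Graphlit" },
]);
console.log(deduplicated.map(e => `${e.type} ${e.name} x${e.mentions}`));
// [ 'PERSON Kirk Marple x2', 'ORGANIZATION Graphlit x1' ]
```

Keying on the entity type as well as the identifier keeps a person and an organization with the same name from being merged.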

### LLM-Based Extraction

#### What the LLM Does

```typescript
// Behind the scenes, the LLM receives a prompt like:

const prompt = `
Extract entities from the following text.
Return a JSON array of entities with:
- type (PERSON, ORGANIZATION, PLACE, EVENT, etc.)
- name
- properties (email, jobTitle, url, etc.)
- confidence (0.0 to 1.0)

Text:
"""
Kirk Marple is the CEO of Graphlit, a context layer for AI agents based in 
Seattle. The company was founded in 2023 and raised $2.5M in funding.
"""

Expected output:
[
  {
    "type": "PERSON",
    "name": "Kirk Marple",
    "properties": {
      "jobTitle": "CEO",
      "affiliation": "Graphlit"
    },
    "confidence": 0.95
  },
  {
    "type": "ORGANIZATION",
    "name": "Graphlit",
    "properties": {
      "description": "context layer for AI agents",
      "foundingDate": "2023"
    },
    "confidence": 0.98
  },
  {
    "type": "PLACE",
    "name": "Seattle",
    "confidence": 0.92
  }
]
`;

// LLM processes and returns structured JSON
// Graphlit parses and creates Observations
```
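Because the model's reply is free-form text, the parsing step has to tolerate malformed output. The following is a minimal sketch of how such a reply could be validated, assuming the JSON shape from the illustrative prompt; `RawEntity`, `isRawEntity`, and `parseEntities` are hypothetical names, and this is not how Graphlit itself parses responses.

```typescript
// Hypothetical entity shape matching the illustrative prompt; not an SDK type.
interface RawEntity {
  type: string;
  name: string;
  properties?: Record<string, string>;
  confidence: number;
}

// Narrow an unknown JSON value to RawEntity, rejecting malformed items.
function isRawEntity(value: unknown): value is RawEntity {
  if (typeof value !== "object" || value === null) return false;
  const { type, name, confidence } = value as Record<string, unknown>;
  return (
    typeof type === "string" &&
    typeof name === "string" &&
    name.trim().length > 0 &&
    typeof confidence === "number" &&
    confidence >= 0 &&
    confidence <= 1
  );
}

// Parse the model's reply and keep only well-formed entities.
function parseEntities(reply: string): RawEntity[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(reply);
  } catch {
    return []; // the reply was not valid JSON
  }
  return Array.isArray(parsed) ? parsed.filter(isRawEntity) : [];
}

const sampleReply = `[
  {"type": "PERSON", "name": "Kirk Marple", "confidence": 0.95},
  {"type": "PLACE", "name": "", "confidence": 0.9},
  {"name": "missing type", "confidence": 2}
]`;
console.log(parseEntities(sampleReply).map(e => e.name));
// [ 'Kirk Marple' ]
```

Dropping invalid items instead of failing the whole reply means one bad entity does not discard the rest of the extraction.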

#### Model Selection

```typescript
// Specify model via specification
const gpt4Spec = await graphlit.createSpecification({
  name: "GPT-4 Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAiModels.Gpt4Turbo_128K,  // High quality
    temperature: 0.0  // Deterministic
  }
});

const workflow = await graphlit.createWorkflow({
  name: "High Quality Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
      }
    }]
  },
  specification: { id: gpt4Spec.createSpecification.id }
});
```

### Vision-Based Extraction

For PDFs that contain images, charts, or diagrams:

```typescript
const visionWorkflow = await graphlit.createWorkflow({
  name: "PDF Vision Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document,
        extractImages: true,  // Extract images from PDF
        ocrImages: true       // OCR on images
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelImage,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});

// Vision models can extract from:
// - Charts and diagrams
// - Organizational charts
// - Scanned documents
// - Images with text
// - Infographics
```

### Extraction Models Comparison

#### GPT-4 (OpenAI)

* **Quality**: Highest
* **Speed**: Moderate
* **Cost**: High
* **Use**: Production, high-value content

#### GPT-4o (OpenAI)

* **Quality**: High
* **Speed**: Fast
* **Cost**: Moderate
* **Use**: Balanced production workloads

#### Claude 3.5 Sonnet (Anthropic)

* **Quality**: High
* **Speed**: Fast
* **Cost**: Moderate
* **Use**: Alternative to GPT-4o, good quality

#### Gemini Pro (Google)

* **Quality**: Good
* **Speed**: Fast
* **Cost**: Lower
* **Use**: Cost-sensitive applications

```typescript
// Configure model
const spec = await graphlit.createSpecification({
  name: "Model Spec",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,  // or Anthropic, Google
  openAI: {
    model: OpenAiModels.Gpt4Turbo_128K
  }
});
```
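The trade-offs in the comparison can be encoded as a small lookup when the choice has to be made in code. This is only a sketch: the numeric ranks are an assumption introduced so the options can be sorted, and `ModelOption` and `pickModel` are hypothetical helpers, not part of the SDK.

```typescript
// Ratings transcribed from the comparison in this section. The numeric ranks
// are assumptions used only for sorting.
interface ModelOption {
  name: string;
  quality: number; // 3 = highest, 2 = high, 1 = good
  cost: number;    // 3 = high, 2 = moderate, 1 = lower
}

const MODEL_OPTIONS: ModelOption[] = [
  { name: "GPT-4", quality: 3, cost: 3 },
  { name: "GPT-4o", quality: 2, cost: 2 },
  { name: "Claude 3.5 Sonnet", quality: 2, cost: 2 },
  { name: "Gemini Pro", quality: 1, cost: 1 },
];

// Pick the highest-quality option whose cost rank does not exceed the budget.
// Ties keep their order in MODEL_OPTIONS.
function pickModel(maxCost: number): string | undefined {
  const affordable = MODEL_OPTIONS.filter(m => m.cost <= maxCost);
  affordable.sort((a, b) => b.quality - a.quality);
  return affordable[0]?.name;
}

console.log(pickModel(3)); // GPT-4
console.log(pickModel(2)); // GPT-4o
console.log(pickModel(1)); // Gemini Pro
```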

### Prompt Engineering for Extraction

#### Default Prompts

Graphlit uses optimized prompts for each entity type:

```typescript
// Person extraction prompt (conceptual):
// "Extract people mentioned in the text. Include:
//  - Full name
//  - Email address (if mentioned)
//  - Job title (if mentioned)
//  - Affiliation/company (if mentioned)
//  - Provide confidence score"

// Organization extraction prompt:
// "Extract organizations mentioned. Include:
//  - Full organization name
//  - URL (if mentioned)
//  - Description (if available)
//  - Provide confidence score"
```

#### Custom Prompts (Advanced)

Future feature: custom extraction prompts for domain-specific needs.

### Confidence Scoring

#### How Confidence is Calculated

The LLM assigns each entity a confidence score based on:

* Context clarity
* Explicit mentions vs inferences
* Ambiguity in text
* Supporting evidence

```typescript
// High confidence (0.9-1.0):
// "Kirk Marple is the CEO..."
// Clear, explicit, unambiguous

// Medium confidence (0.7 to <0.9):
// "Kirk mentioned that..."
// First name only, identity implied by context

// Low confidence (0.5 to <0.7):
// "The CEO said..."
// Role reference with no name, ambiguous

// Very low confidence (<0.5):
// "He suggested..."
// Pronoun with multiple possible referents
```
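The bands above translate directly into a small classifier. The sketch below assigns each boundary value to the higher band, which is an assumption; `confidenceBand` and `bandCounts` are hypothetical helper names.

```typescript
type ConfidenceBand = "high" | "medium" | "low" | "very-low";

// Map a score in [0, 1] to one of the bands described in this section.
function confidenceBand(score: number): ConfidenceBand {
  if (score >= 0.9) return "high";
  if (score >= 0.7) return "medium";
  if (score >= 0.5) return "low";
  return "very-low";
}

// Count how many scores fall into each band.
function bandCounts(scores: number[]): Record<ConfidenceBand, number> {
  const counts: Record<ConfidenceBand, number> = {
    high: 0,
    medium: 0,
    low: 0,
    "very-low": 0,
  };
  for (const score of scores) {
    counts[confidenceBand(score)] += 1;
  }
  return counts;
}

console.log(bandCounts([0.95, 0.9, 0.72, 0.55, 0.3]));
// { high: 2, medium: 1, low: 1, 'very-low': 1 }
```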

#### Using Confidence Thresholds

```typescript
// Filter by confidence
const content = await graphlit.getContent('content-id');

const highConfidence = content.content.observations?.filter(obs =>
  obs.occurrences?.some(occ => (occ.confidence || 0) >= 0.8)
);

console.log(`High confidence entities: ${highConfidence?.length}`);
```

### When Extraction Runs

#### During Workflow Processing

```typescript
// Extraction runs AFTER preparation
const workflow = await graphlit.createWorkflow({
  preparation: { /* Extract text first */ },
  extraction: { /* Then extract entities */ }
});

// Timeline:
// 1. Ingest content → State: CREATED
// 2. Preparation runs → Extract text/OCR/transcribe
// 3. Extraction runs → LLM analyzes text
// 4. Observations created
// 5. State: ENABLED
```

#### Multiple Extraction Jobs

```typescript
// Run multiple extraction jobs in parallel
const workflow = await graphlit.createWorkflow({
  name: "Multi-Model Extraction",
  extraction: {
    jobs: [
      {
        // Text-based extraction
        connector: {
          type: EntityExtractionServiceTypes.ModelText,
          extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
        }
      },
      {
        // Vision-based extraction (runs in parallel)
        connector: {
          type: EntityExtractionServiceTypes.ModelImage,
          extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
        }
      }
    ]
  }
});
```

**Python**:
```python
# Create extraction workflow
workflow = await graphlit.createWorkflow(
    name="Entity Extraction",
    preparation=input_types.PreparationWorkflowStageInput(
        jobs=[
            input_types.PreparationWorkflowJobInput(
                connector=input_types.FilePreparationConnectorInput(
                    type=enums.FilePreparationServiceTypes.DOCUMENT
                )
            )
        ]
    ),
    extraction=input_types.ExtractionWorkflowStageInput(
        jobs=[
            input_types.ExtractionWorkflowJobInput(
                connector=input_types.EntityExtractionConnectorInput(
                    type=enums.EntityExtractionServiceTypes.MODEL_TEXT,
                    extracted_types=[
                        enums.ObservableTypes.PERSON,
                        enums.ObservableTypes.ORGANIZATION
                    ]
                )
            )
        ]
    )
)

# Ingest with extraction
content = await graphlit.ingestUri(
    uri='https://example.com/doc.pdf',
    workflow=input_types.EntityReferenceInput(id=workflow.create_workflow.id),
    is_synchronous=True
)

# Check entities
result = await graphlit.getContent(content.ingest_uri.id)
print(f"Extracted {len(result.content.observations or [])} entities")
```

**C#**:
```csharp
using Graphlit;

var graphlit = new Graphlit();

// Create extraction workflow
var workflow = await graphlit.CreateWorkflow(new WorkflowInput
{
    Name = "Entity Extraction",
    Preparation = new PreparationWorkflowStage
    {
        Jobs = new[]
        {
            new PreparationWorkflowJob
            {
                Connector = new FilePreparationConnector
                {
                    Type = FilePreparationServiceTypes.Document
                }
            }
        }
    },
    Extraction = new ExtractionWorkflowStage
    {
        Jobs = new[]
        {
            new ExtractionWorkflowJob
            {
                Connector = new EntityExtractionConnector
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization
                    }
                }
            }
        }
    }
});

// Ingest with extraction
var content = await graphlit.IngestUri(new IngestUriInput
{
    Uri = "https://example.com/doc.pdf",
    Workflow = new EntityReference { Id = workflow.CreateWorkflow.Id },
    IsSynchronous = true
});

// Check entities
var result = await graphlit.GetContent(content.IngestUri.Id);
Console.WriteLine($"Extracted {result.Content.Observations?.Length ?? 0} entities");
```

### Developer Hints

#### Extraction Requires Preparation

```typescript
// ✗ Won't work - extraction needs text
const extractionOnly = await graphlit.createWorkflow({
  extraction: { /* ... */ }
  // Missing preparation stage!
});

// ✓ Correct - prepare first
const preparedWorkflow = await graphlit.createWorkflow({
  preparation: { /* Extract text */ },
  extraction: { /* Then extract entities */ }
});
```
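A pre-flight check can catch this mistake before the workflow is created. The sketch below follows the hint above and treats a missing preparation stage as a problem; `WorkflowLike`, `StageLike`, and `checkStages` are hypothetical, simplified stand-ins and not the SDK's `WorkflowInput` type.

```typescript
// Hypothetical, simplified view of a workflow input; not the SDK's type.
interface StageLike {
  jobs?: unknown[];
}

interface WorkflowLike {
  preparation?: StageLike;
  extraction?: StageLike;
}

function hasJobs(stage?: StageLike): boolean {
  return (stage?.jobs?.length ?? 0) > 0;
}

// Return human-readable problems; an empty array means the stages look consistent.
function checkStages(workflow: WorkflowLike): string[] {
  const problems: string[] = [];
  if (workflow.extraction && !hasJobs(workflow.extraction)) {
    problems.push("extraction stage is present but has no jobs");
  }
  if (hasJobs(workflow.extraction) && !hasJobs(workflow.preparation)) {
    problems.push("extraction is configured but preparation has no jobs");
  }
  return problems;
}

console.log(checkStages({ extraction: { jobs: [{}] } }));
// [ 'extraction is configured but preparation has no jobs' ]
console.log(checkStages({ preparation: { jobs: [{}] }, extraction: { jobs: [{}] } }));
// []
```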

#### More Entity Types = Slower + More Expensive

```typescript
// Fast + cheap (2 types)
extractedTypes: [
  ObservableTypes.Person,
  ObservableTypes.Organization
]

// Slower + more expensive (10 types)
extractedTypes: [
  ObservableTypes.Person,
  ObservableTypes.Organization,
  ObservableTypes.Place,
  ObservableTypes.Event,
  ObservableTypes.Product,
  // ... more types
]

// Choose types relevant to your domain
```

#### Vision Models for Complex PDFs

```typescript
// Use ModelImage (vision) for:
// - Scanned documents
// - PDFs with charts/diagrams
// - Organizational charts
// - Infographics

// Use ModelText for:
// - Plain text documents
// - Word documents
// - Clean PDFs
// - Transcribed audio
```
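The guidance above can be expressed as a small decision helper. This is a sketch under stated assumptions: `DocumentTraits` and its fields are hypothetical inputs that your application would have to supply, and the returned strings only mirror the `ModelText` and `ModelImage` connector names used in this page's workflows.

```typescript
// Hypothetical description of a document; these fields are assumptions
// for the example and are not provided by the SDK.
interface DocumentTraits {
  mimeType: string;
  isScanned?: boolean; // no embedded text layer
  hasVisuals?: boolean; // charts, diagrams, org charts, infographics
}

type ExtractionKind = "ModelText" | "ModelImage";

// Choose vision extraction for images, scans, and visual-heavy documents;
// otherwise fall back to text extraction.
function chooseExtractionKind(doc: DocumentTraits): ExtractionKind {
  if (/^image\//.test(doc.mimeType)) return "ModelImage";
  if (doc.isScanned || doc.hasVisuals) return "ModelImage";
  return "ModelText";
}

console.log(chooseExtractionKind({ mimeType: "application/pdf" })); // ModelText
console.log(chooseExtractionKind({ mimeType: "application/pdf", isScanned: true })); // ModelImage
console.log(chooseExtractionKind({ mimeType: "image/png" })); // ModelImage
```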

### Common Issues & Solutions

**Issue**: No entities extracted

**Solution**: Check that the workflow has an extraction stage and that preparation completed.

```typescript
// Check content state
const content = await graphlit.getContent('content-id');
console.log(`State: ${content.content.state}`);

// Check if workflow has extraction
const workflow = await graphlit.getWorkflow('workflow-id');
console.log(`Has extraction: ${!!workflow.workflow.extraction}`);
```

**Issue**: Low confidence scores

**Solution**: The text may be ambiguous or lack context. Try a higher-quality model, and filter results by a confidence threshold.

```typescript
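// Use a higher-quality model
const betterSpec = await graphlit.createSpecification({
  name: "Higher Quality Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAiModels.Gpt4Turbo_128K  // Better than GPT-3.5
  }
});

// Apply a threshold to the observations loaded via getContent()
const content = await graphlit.getContent('content-id');
const observations = content.content.observations ?? [];
const highConfidence = observations.filter(
  obs => (obs.occurrences?.[0]?.confidence ?? 0) >= 0.8
);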
```

**Issue**: Too many false positives

**Solution**: Increase the confidence threshold or narrow the extracted entity types.

```typescript
// Only extract specific types
extractedTypes: [
  ObservableTypes.Person  // Just people, not everything
]

// Filter by confidence (observations loaded via getContent());
// a missing confidence is treated as 0
const reliable = observations.filter(
  obs => (obs.occurrences?.[0]?.confidence ?? 0) >= 0.85
);
```

### Production Example

```typescript
async function createProductionExtractionWorkflow() {
  console.log('Creating production extraction workflow...\n');
  
  // Create high-quality specification
  const spec = await graphlit.createSpecification({
    name: "Production Extraction",
    type: SpecificationTypes.Completion,
    serviceType: ModelServiceTypes.OpenAi,
    openAI: {
      model: OpenAiModels.Gpt4Turbo_128K,
      temperature: 0.0  // Deterministic
    }
  });
  
  console.log(`✓ Created specification: ${spec.createSpecification.id}`);
  
  // Create workflow with both text and vision extraction
  const workflow = await graphlit.createWorkflow({
    name: "Production Entity Extraction",
    preparation: {
      jobs: [{
        connector: {
          type: FilePreparationServiceTypes.Document,
          extractImages: true,
          ocrImages: true
        }
      }]
    },
    extraction: {
      jobs: [
        {
          // Text extraction
          connector: {
            type: EntityExtractionServiceTypes.ModelText,
            extractedTypes: [
              ObservableTypes.Person,
              ObservableTypes.Organization,
              ObservableTypes.Place,
              ObservableTypes.Event,
              ObservableTypes.Product
            ]
          }
        },
        {
          // Vision extraction (for images/charts)
          connector: {
            type: EntityExtractionServiceTypes.ModelImage,
            extractedTypes: [
              ObservableTypes.Person,
              ObservableTypes.Organization
            ]
          }
        }
      ]
    },
    specification: { id: spec.createSpecification.id }
  });
  
  console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);
  
  // Test with sample document
  console.log('Testing extraction...');
  const content = await graphlit.ingestUri(
    'https://example.com/sample-document.pdf',
    undefined,
    undefined,
    undefined,
    true,
    { id: workflow.createWorkflow.id }
  );
  
  // Get results
  const result = await graphlit.getContent(content.ingestUri.id);
  
  console.log(`\n=== EXTRACTION RESULTS ===`);
  console.log(`Document: ${result.content.name}`);
  console.log(`Total observations: ${result.content.observations?.length || 0}`);
  
  // Group by type
  const byType = new Map<string, number>();
  result.content.observations?.forEach(obs => {
    byType.set(obs.type, (byType.get(obs.type) || 0) + 1);
  });
  
  console.log('\nEntities by type:');
  byType.forEach((count, type) => {
    console.log(`  ${type}: ${count}`);
  });
  
  // Confidence analysis
  const confidences = result.content.observations
    ?.flatMap(obs => obs.occurrences || [])
    .map(occ => occ.confidence || 0) || [];
  
  if (confidences.length > 0) {
    const avg = confidences.reduce((a, b) => a + b, 0) / confidences.length;
    const high = confidences.filter(c => c >= 0.8).length;
    const med = confidences.filter(c => c >= 0.6 && c < 0.8).length;
    const low = confidences.filter(c => c < 0.6).length;
    
    console.log('\nConfidence distribution:');
    console.log(`  High (≥80%): ${high}`);
    console.log(`  Medium (60-80%): ${med}`);
    console.log(`  Low (<60%): ${low}`);
    console.log(`  Average: ${(avg * 100).toFixed(1)}%`);
  }
  
  return workflow.createWorkflow.id;
}

await createProductionExtractionWorkflow();
```


