How Entity Extraction Works

User Intent

"How does entity extraction actually work? What happens during the extraction stage?"

Operation

  • SDK Method: createWorkflow() with extraction stage

  • GraphQL: Workflow with extraction configuration

  • Entity Type: Workflow

  • Common Use Cases: Understanding extraction pipeline, configuring extraction, choosing models

Extraction Pipeline Overview

Entity extraction is an LLM-based process that analyzes text and identifies structured entities (people, organizations, places, etc.).

TypeScript (Canonical)

import { Graphlit } from 'graphlit-client';
import {
  EntityExtractionServiceTypes,
  FilePreparationServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  OpenAIModels,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with extraction
const workflow = await graphlit.createWorkflow({
  name: "Document Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Ingest with extraction workflow
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,
  undefined,
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

// Check extracted entities
const result = await graphlit.getContent(content.ingestUri.id);

console.log(`Extracted ${result.content.observations?.length || 0} entity observations`);

result.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  console.log(`  Confidence: ${obs.occurrences?.[0]?.confidence}`);
});

The Extraction Pipeline

Step-by-Step Process

1. Content Ingestion

2. Preparation Stage
   - Text extraction (PDF, Word, etc.)
   - OCR (scanned documents)
   - Audio transcription
   - Text chunking

3. Extraction Stage (this is where entities are identified)
   - Send text to LLM (GPT-4, Claude, etc.)
   - LLM analyzes text for entities
   - LLM returns structured JSON with entities
   - Each entity has: type, name, properties, confidence

4. Observation Creation
   - Create Observation records
   - Link to content
   - Store occurrence details (page, location, confidence)

5. Entity Resolution
   - Check if the entity already exists (by name, email, URL, etc.)
   - Create new Observable OR link to existing
   - Deduplicate entities

6. Graph Storage
   - Store in graph database
   - Create entity nodes
   - Create observation edges
   - Link to content

7. Content State → ENABLED
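
You can watch these stages from client code by polling the content state after an asynchronous ingest. A minimal sketch, assuming the generated EntityState enum exposes Enabled and Errored members and using a placeholder content ID:

import { Graphlit } from 'graphlit-client';
import { EntityState } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Placeholder ID returned from an asynchronous ingestUri call
const contentId = 'content-id';

// Poll until the pipeline reaches a terminal state
let state: EntityState | undefined;
do {
  // Wait 5 seconds between polls
  await new Promise(resolve => setTimeout(resolve, 5000));
  const result = await graphlit.getContent(contentId);
  state = result.content?.state;
  console.log(`Current state: ${state}`);
} while (state !== EntityState.Enabled && state !== EntityState.Errored);

// Once ENABLED, observations from the extraction stage are available
if (state === EntityState.Enabled) {
  const result = await graphlit.getContent(contentId);
  console.log(`Observations: ${result.content?.observations?.length || 0}`);
}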

LLM-Based Extraction

What the LLM Does

// Behind the scenes, the LLM receives a prompt like:

const prompt = `
Extract entities from the following text.
Return a JSON array of entities with:
- type (PERSON, ORGANIZATION, PLACE, EVENT, etc.)
- name
- properties (email, jobTitle, url, etc.)
- confidence (0.0 to 1.0)

Text:
"""
Kirk Marple is the CEO of Graphlit, a semantic memory platform based in 
Seattle. The company was founded in 2023 and raised $2.5M in funding.
"""

Expected output:
[
  {
    "type": "PERSON",
    "name": "Kirk Marple",
    "properties": {
      "jobTitle": "CEO",
      "affiliation": "Graphlit"
    },
    "confidence": 0.95
  },
  {
    "type": "ORGANIZATION",
    "name": "Graphlit",
    "properties": {
      "description": "semantic memory platform",
      "foundingDate": "2023"
    },
    "confidence": 0.98
  },
  {
    "type": "PLACE",
    "name": "Seattle",
    "confidence": 0.92
  }
]
`;

// LLM processes and returns structured JSON
// Graphlit parses and creates Observations
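
Graphlit performs this parsing internally. Purely for illustration, here is a hedged sketch of what the parsing step could look like; ExtractedEntity and parseEntities are hypothetical names, not SDK APIs:

// Hypothetical shape of the LLM's JSON output (illustration only)
interface ExtractedEntity {
  type: string;                          // e.g. "PERSON", "ORGANIZATION"
  name: string;
  properties?: Record<string, string>;   // e.g. { jobTitle: "CEO" }
  confidence: number;                    // 0.0 to 1.0
}

// Hypothetical helper: validate and parse the raw LLM response
function parseEntities(llmResponse: string): ExtractedEntity[] {
  const parsed = JSON.parse(llmResponse);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array of entities');
  }
  // Keep only well-formed entries; malformed ones are dropped
  return parsed.filter(
    (e: any): e is ExtractedEntity =>
      typeof e.type === 'string' &&
      typeof e.name === 'string' &&
      typeof e.confidence === 'number'
  );
}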

Model Selection

// Specify model via specification
const gpt4Spec = await graphlit.createSpecification({
  name: "GPT-4 Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAI,
  openAI: {
    model: OpenAIModels.Gpt4Turbo,  // High quality
    temperature: 0.0  // Deterministic
  }
});

const workflow = await graphlit.createWorkflow({
  name: "High Quality Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
      }
    }]
  },
  specification: { id: gpt4Spec.createSpecification.id }
});

Vision-Based Extraction

For PDFs containing images, charts, or diagrams:

const visionWorkflow = await graphlit.createWorkflow({
  name: "PDF Vision Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document,
        extractImages: true,  // Extract images from PDF
        ocrImages: true       // OCR on images
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision model
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});

// Vision models can extract from:
// - Charts and diagrams
// - Organizational charts
// - Scanned documents
// - Images with text
// - Infographics

Extraction Models Comparison

GPT-4 (OpenAI)

  • Quality: Highest

  • Speed: Moderate

  • Cost: High

  • Use: Production, high-value content

GPT-4o (OpenAI)

  • Quality: High

  • Speed: Fast

  • Cost: Moderate

  • Use: Balanced production workloads

Claude 3.5 Sonnet (Anthropic)

  • Quality: High

  • Speed: Fast

  • Cost: Moderate

  • Use: Alternative to GPT-4o, good quality

Gemini Pro (Google)

  • Quality: Good

  • Speed: Fast

  • Cost: Lower

  • Use: Cost-sensitive applications

// Configure model
const spec = await graphlit.createSpecification({
  name: "Model Spec",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAI,  // or Anthropic, Google
  openAI: {
    model: OpenAIModels.Gpt4Turbo
  }
});
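
Swapping vendors only changes the service type and model block. A sketch of an Anthropic specification, where the AnthropicModels member name is an assumption (verify the exact value against the generated graphql-types in your SDK version):

import {
  AnthropicModels,
  ModelServiceTypes,
  SpecificationTypes
} from 'graphlit-client/dist/generated/graphql-types';

// Anthropic-backed extraction spec
// (AnthropicModels.Claude_3_5Sonnet is an assumed member name)
const claudeSpec = await graphlit.createSpecification({
  name: "Claude Extraction",
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: {
    model: AnthropicModels.Claude_3_5Sonnet,
    temperature: 0.0  // deterministic, as with the OpenAI spec above
  }
});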

Prompt Engineering for Extraction

Default Prompts

Graphlit uses optimized prompts for each entity type:

// Person extraction prompt (conceptual):
// "Extract people mentioned in the text. Include:
//  - Full name
//  - Email address (if mentioned)
//  - Job title (if mentioned)
//  - Affiliation/company (if mentioned)
//  - Provide confidence score"

// Organization extraction prompt:
// "Extract organizations mentioned. Include:
//  - Full organization name
//  - URL (if mentioned)
//  - Description (if available)
//  - Provide confidence score"

Custom Prompts (Advanced)

Custom extraction prompts for domain-specific needs are planned as a future feature.

Confidence Scoring

How Confidence is Calculated

The LLM assigns a confidence score based on:

  • Context clarity

  • Explicit mentions vs. inferences

  • Ambiguity in text

  • Supporting evidence

// High confidence (0.9-1.0):
// "Kirk Marple is the CEO..."
// Clear, explicit, unambiguous

// Medium confidence (0.7-0.9):
// "Kirk mentioned that..."
// Implicit context, less clear

// Low confidence (0.5-0.7):
// "The CEO said..."
// Pronoun reference, ambiguous

// Very low confidence (<0.5):
// "He suggested..."
// Multiple possible referents

Using Confidence Thresholds

// Filter by confidence
const content = await graphlit.getContent('content-id');

const highConfidence = content.content.observations?.filter(obs =>
  obs.occurrences?.some(occ => (occ.confidence || 0) >= 0.8)
);

console.log(`High confidence entities: ${highConfidence?.length}`);

When Extraction Runs

During Workflow Processing

// Extraction runs AFTER preparation
const workflow = await graphlit.createWorkflow({
  preparation: { /* Extract text first */ },
  extraction: { /* Then extract entities */ }
});

// Timeline:
// 1. Ingest content → State: CREATED
// 2. Preparation runs → Extract text/OCR/transcribe
// 3. Extraction runs → LLM analyzes text
// 4. Observations created
// 5. State: ENABLED

Multiple Extraction Jobs

// Run multiple extraction jobs in parallel
const workflow = await graphlit.createWorkflow({
  name: "Multi-Model Extraction",
  extraction: {
    jobs: [
      {
        // Text-based extraction
        connector: {
          type: EntityExtractionServiceTypes.ModelText,
          extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
        }
      },
      {
        // Vision-based extraction (runs in parallel)
        connector: {
          type: EntityExtractionServiceTypes.ModelDocument,
          extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
        }
      }
    ]
  }
});

Python

from graphlit import Graphlit
from graphlit_api import input_types, enums

graphlit = Graphlit()

# Create extraction workflow
workflow = await graphlit.createWorkflow(
    name="Entity Extraction",
    preparation=input_types.PreparationWorkflowStageInput(
        jobs=[
            input_types.PreparationWorkflowJobInput(
                connector=input_types.FilePreparationConnectorInput(
                    type=enums.FilePreparationServiceTypes.DOCUMENT
                )
            )
        ]
    ),
    extraction=input_types.ExtractionWorkflowStageInput(
        jobs=[
            input_types.ExtractionWorkflowJobInput(
                connector=input_types.EntityExtractionConnectorInput(
                    type=enums.EntityExtractionServiceTypes.MODEL_TEXT,
                    extracted_types=[
                        enums.ObservableTypes.PERSON,
                        enums.ObservableTypes.ORGANIZATION
                    ]
                )
            )
        ]
    )
)

# Ingest with extraction
content = await graphlit.ingestUri(
    uri='https://example.com/doc.pdf',
    workflow=input_types.EntityReferenceInput(id=workflow.create_workflow.id),
    is_synchronous=True
)

# Check entities
result = await graphlit.getContent(content.ingest_uri.id)
print(f"Extracted {len(result.content.observations or [])} entities")


C#

using Graphlit;

var graphlit = new Graphlit();

// Create extraction workflow
var workflow = await graphlit.CreateWorkflow(new WorkflowInput
{
    Name = "Entity Extraction",
    Preparation = new PreparationWorkflowStage
    {
        Jobs = new[]
        {
            new PreparationWorkflowJob
            {
                Connector = new FilePreparationConnector
                {
                    Type = FilePreparationServiceTypes.Document
                }
            }
        }
    },
    Extraction = new ExtractionWorkflowStage
    {
        Jobs = new[]
        {
            new ExtractionWorkflowJob
            {
                Connector = new EntityExtractionConnector
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization
                    }
                }
            }
        }
    }
});

// Ingest with extraction
var content = await graphlit.IngestUri(new IngestUriInput
{
    Uri = "https://example.com/doc.pdf",
    Workflow = new EntityReference { Id = workflow.CreateWorkflow.Id },
    IsSynchronous = true
});

// Check entities
var result = await graphlit.GetContent(content.IngestUri.Id);
Console.WriteLine($"Extracted {result.Content.Observations?.Length ?? 0} entities");

Developer Hints

Extraction Requires Preparation

// ✗ Won't work - extraction needs text
const workflow = await graphlit.createWorkflow({
  extraction: { /* ... */ }
  // Missing preparation stage!
});

// ✓ Correct - prepare first
const workflow = await graphlit.createWorkflow({
  preparation: { /* Extract text */ },
  extraction: { /* Then extract entities */ }
});

More Entity Types = Slower + More Expensive

// Fast + cheap (2 types)
extractedTypes: [
  ObservableTypes.Person,
  ObservableTypes.Organization
]

// Slower + more expensive (10 types)
extractedTypes: [
  ObservableTypes.Person,
  ObservableTypes.Organization,
  ObservableTypes.Place,
  ObservableTypes.Event,
  ObservableTypes.Product,
  // ... more types
]

// Choose types relevant to your domain

Vision Models for Complex PDFs

// Use ModelDocument (vision) for:
// - Scanned documents
// - PDFs with charts/diagrams
// - Organizational charts
// - Infographics

// Use ModelText for:
// - Plain text documents
// - Word documents
// - Clean PDFs
// - Transcribed audio
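
If you route documents programmatically, a small helper can choose between the two connector types. A sketch with a hypothetical hasVisualContent flag:

// Hypothetical helper: pick a connector type from what you know
// about the source document (hasVisualContent is an assumed flag)
function chooseExtractionService(hasVisualContent: boolean): EntityExtractionServiceTypes {
  return hasVisualContent
    ? EntityExtractionServiceTypes.ModelDocument  // vision model
    : EntityExtractionServiceTypes.ModelText;     // text-only model
}

// Usage: build an extraction job for a scanned report
const extractionJob = {
  connector: {
    type: chooseExtractionService(true),
    extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
  }
};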

Common Issues & Solutions

Issue: No entities extracted
Solution: Check that the workflow has an extraction stage and that preparation completed

// Check content state
const content = await graphlit.getContent('content-id');
console.log(`State: ${content.content.state}`);

// Check if workflow has extraction
const workflow = await graphlit.getWorkflow('workflow-id');
console.log(`Has extraction: ${!!workflow.workflow.extraction}`);

Issue: Low confidence scores
Solution: The source text may be ambiguous or lack clear context

// Use higher quality model
const betterSpec = await graphlit.createSpecification({
  type: SpecificationTypes.Completion,
  openAI: {
    model: OpenAIModels.Gpt4Turbo  // Better than GPT-3.5
  }
});

// Apply threshold
const highConfidence = observations.filter(
  obs => (obs.occurrences?.[0]?.confidence || 0) >= 0.8
);

Issue: Too many false positives
Solution: Raise the confidence threshold or narrow the entity types

// Only extract specific types
extractedTypes: [
  ObservableTypes.Person  // Just people, not everything
]

// Filter by confidence
const reliable = observations.filter(
  obs => (obs.occurrences?.[0]?.confidence || 0) >= 0.85
);

Production Example

async function createProductionExtractionWorkflow() {
  console.log('Creating production extraction workflow...\n');
  
  // Create high-quality specification
  const spec = await graphlit.createSpecification({
    name: "Production Extraction",
    type: SpecificationTypes.Completion,
    serviceType: ModelServiceTypes.OpenAI,
    openAI: {
      model: OpenAIModels.Gpt4Turbo,
      temperature: 0.0  // Deterministic
    }
  });
  
  console.log(`✓ Created specification: ${spec.createSpecification.id}`);
  
  // Create workflow with both text and vision extraction
  const workflow = await graphlit.createWorkflow({
    name: "Production Entity Extraction",
    preparation: {
      jobs: [{
        connector: {
          type: FilePreparationServiceTypes.Document,
          extractImages: true,
          ocrImages: true
        }
      }]
    },
    extraction: {
      jobs: [
        {
          // Text extraction
          connector: {
            type: EntityExtractionServiceTypes.ModelText,
            extractedTypes: [
              ObservableTypes.Person,
              ObservableTypes.Organization,
              ObservableTypes.Place,
              ObservableTypes.Event,
              ObservableTypes.Product
            ]
          }
        },
        {
          // Vision extraction (for images/charts)
          connector: {
            type: EntityExtractionServiceTypes.ModelDocument,
            extractedTypes: [
              ObservableTypes.Person,
              ObservableTypes.Organization
            ]
          }
        }
      ]
    },
    specification: { id: spec.createSpecification.id }
  });
  
  console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);
  
  // Test with sample document
  console.log('Testing extraction...');
  const content = await graphlit.ingestUri(
    'https://example.com/sample-document.pdf',
    undefined,
    undefined,
    undefined,
    true,
    { id: workflow.createWorkflow.id }
  );
  
  // Get results
  const result = await graphlit.getContent(content.ingestUri.id);
  
  console.log(`\n=== EXTRACTION RESULTS ===`);
  console.log(`Document: ${result.content.name}`);
  console.log(`Total observations: ${result.content.observations?.length || 0}`);
  
  // Group by type
  const byType = new Map<string, number>();
  result.content.observations?.forEach(obs => {
    byType.set(obs.type, (byType.get(obs.type) || 0) + 1);
  });
  
  console.log('\nEntities by type:');
  byType.forEach((count, type) => {
    console.log(`  ${type}: ${count}`);
  });
  
  // Confidence analysis
  const confidences = result.content.observations
    ?.flatMap(obs => obs.occurrences || [])
    .map(occ => occ.confidence || 0) || [];
  
  if (confidences.length > 0) {
    const avg = confidences.reduce((a, b) => a + b, 0) / confidences.length;
    const high = confidences.filter(c => c >= 0.8).length;
    const med = confidences.filter(c => c >= 0.6 && c < 0.8).length;
    const low = confidences.filter(c => c < 0.6).length;
    
    console.log('\nConfidence distribution:');
    console.log(`  High (≥80%): ${high}`);
    console.log(`  Medium (60-80%): ${med}`);
    console.log(`  Low (<60%): ${low}`);
    console.log(`  Average: ${(avg * 100).toFixed(1)}%`);
  }
  
  return workflow.createWorkflow.id;
}

await createProductionExtractionWorkflow();
