How Entity Extraction Works

Workflow: How Entity Extraction Works

User Intent

"How does entity extraction actually work? What happens during the extraction stage?"

Operation

  • SDK Method: createWorkflow() with extraction stage

  • GraphQL: Workflow with extraction configuration

  • Entity Type: Workflow

  • Common Use Cases: Understanding extraction pipeline, configuring extraction, choosing models

Extraction Pipeline Overview

Entity extraction is an LLM-based process that analyzes text and identifies structured entities (people, organizations, places, etc.).

TypeScript (Canonical)

import { Graphlit } from 'graphlit-client';
import { EntityExtractionServiceTypes, FilePreparationServiceTypes, ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with extraction
const workflow = await graphlit.createWorkflow({
  name: "Document Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Ingest with extraction workflow
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,
  undefined,
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

// Check extracted entities
const result = await graphlit.getContent(content.ingestUri.id);

console.log(`Extracted ${result.content.observations?.length || 0} entity observations`);

result.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  console.log(`  Confidence: ${obs.occurrences?.[0]?.confidence}`);
});
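
Once observations come back, it can help to tally them by type before inspecting individual entities. The sketch below assumes only the `type` and `observable.name` fields read in the example above; `ObservationLike` and `countByType` are hypothetical names, not part of the SDK, and the sample data is made up.

```typescript
// Minimal stand-in for the observation fields used on this page (an assumption,
// not the SDK's generated type).
interface ObservationLike {
  type: string;
  observable: { name?: string | null };
}

// Count how many observations were extracted for each entity type.
function countByType(observations: ObservationLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const obs of observations) {
    counts[obs.type] = (counts[obs.type] ?? 0) + 1;
  }
  return counts;
}

// Illustrative data only.
const sampleObservations: ObservationLike[] = [
  { type: 'PERSON', observable: { name: 'Ada Lovelace' } },
  { type: 'PERSON', observable: { name: 'Charles Babbage' } },
  { type: 'ORGANIZATION', observable: { name: 'Analytical Engines Ltd' } },
];

const typeCounts = countByType(sampleObservations);
console.log(typeCounts);
```

In real code, `result.content.observations ?? []` from the example above would take the place of `sampleObservations`.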

The Extraction Pipeline

Step-by-Step Process

LLM-Based Extraction

What the LLM Does

Model Selection

Vision-Based Extraction

For PDFs with images, charts, diagrams:

Extraction Models Comparison

GPT-4 (OpenAI)

  • Quality: Highest

  • Speed: Moderate

  • Cost: High

  • Use: Production, high-value content

GPT-4o (OpenAI)

  • Quality: High

  • Speed: Fast

  • Cost: Moderate

  • Use: Balanced production workloads

Claude 3.5 Sonnet (Anthropic)

  • Quality: High

  • Speed: Fast

  • Cost: Moderate

  • Use: Alternative to GPT-4o, good quality

Gemini Pro (Google)

  • Quality: Good

  • Speed: Fast

  • Cost: Lower

  • Use: Cost-sensitive applications
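
The comparison above can be encoded as a small lookup to shortlist models against a minimum quality and a maximum cost. This is only a sketch: the ratings are copied from the lists above, while `ModelProfile`, `MODEL_PROFILES`, and `shortlistModels` are hypothetical names and the numeric ranks are an assumption made for ordering.

```typescript
type Quality = 'Good' | 'High' | 'Highest';
type Cost = 'Lower' | 'Moderate' | 'High';

interface ModelProfile {
  name: string;
  quality: Quality;
  speed: 'Moderate' | 'Fast';
  cost: Cost;
}

// Ratings transcribed from the comparison on this page.
const MODEL_PROFILES: ModelProfile[] = [
  { name: 'GPT-4', quality: 'Highest', speed: 'Moderate', cost: 'High' },
  { name: 'GPT-4o', quality: 'High', speed: 'Fast', cost: 'Moderate' },
  { name: 'Claude 3.5 Sonnet', quality: 'High', speed: 'Fast', cost: 'Moderate' },
  { name: 'Gemini Pro', quality: 'Good', speed: 'Fast', cost: 'Lower' },
];

// Assumed numeric ordering of the qualitative labels.
const QUALITY_RANK: Record<Quality, number> = { Good: 1, High: 2, Highest: 3 };
const COST_RANK: Record<Cost, number> = { Lower: 1, Moderate: 2, High: 3 };

// Return model names that meet the quality floor and cost ceiling,
// best quality first, then cheapest.
function shortlistModels(minQuality: Quality, maxCost: Cost): string[] {
  return MODEL_PROFILES
    .filter(m =>
      QUALITY_RANK[m.quality] >= QUALITY_RANK[minQuality] &&
      COST_RANK[m.cost] <= COST_RANK[maxCost])
    .sort((a, b) =>
      QUALITY_RANK[b.quality] - QUALITY_RANK[a.quality] ||
      COST_RANK[a.cost] - COST_RANK[b.cost])
    .map(m => m.name);
}

console.log(shortlistModels('High', 'Moderate')); // GPT-4o and Claude 3.5 Sonnet
console.log(shortlistModels('Good', 'Lower'));    // Gemini Pro
```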

Prompt Engineering for Extraction

Default Prompts

Graphlit uses optimized prompts for each entity type:

Custom Prompts (Advanced)

Custom extraction prompts for domain-specific needs are a planned feature and are not yet available.

Confidence Scoring

How Confidence is Calculated

The LLM reports a confidence score for each extracted entity based on:

  • Context clarity

  • Explicit mentions vs inferences

  • Ambiguity in text

  • Supporting evidence

Using Confidence Thresholds
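
A threshold can be applied client-side after retrieval. The sketch below is a minimal illustration, assuming each observation carries per-occurrence `confidence` values as in the TypeScript example earlier on this page, and assuming scores fall between 0 and 1. The `Observation` and `Occurrence` interfaces, `maxConfidence`, and `filterByConfidence` are hypothetical stand-ins, and the sample data is made up.

```typescript
// Local stand-in types mirroring the fields read earlier on this page;
// they are assumptions, not the SDK's generated types.
interface Occurrence {
  confidence?: number | null;
}

interface Observation {
  type: string;
  observable: { name?: string | null };
  occurrences?: Occurrence[] | null;
}

// Highest confidence across an observation's occurrences (0 if none reported).
function maxConfidence(obs: Observation): number {
  const scores = (obs.occurrences ?? [])
    .map(o => o.confidence)
    .filter((c): c is number => typeof c === 'number');
  return scores.length > 0 ? Math.max(...scores) : 0;
}

// Keep only observations whose best occurrence meets the threshold.
function filterByConfidence(observations: Observation[], threshold: number): Observation[] {
  return observations.filter(obs => maxConfidence(obs) >= threshold);
}

// Illustrative data only.
const sample: Observation[] = [
  { type: 'PERSON', observable: { name: 'Ada Lovelace' }, occurrences: [{ confidence: 0.95 }] },
  { type: 'ORGANIZATION', observable: { name: 'Acme' }, occurrences: [{ confidence: 0.4 }, { confidence: 0.55 }] },
  { type: 'PLACE', observable: { name: 'Paris' }, occurrences: [] },
];

const confident = filterByConfidence(sample, 0.7);
console.log(confident.map(o => o.observable.name));
```

Taking the maximum across occurrences is one design choice; averaging, or requiring every occurrence to pass, would be stricter.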

When Extraction Runs

During Workflow Processing

Multiple Extraction Jobs
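
The `extraction.jobs` array in the TypeScript example above holds a single job, but its shape allows more than one. The sketch below builds a stage with one job per group of entity types. It is an illustration only: the string literals stand in for the SDK enums (`EntityExtractionServiceTypes.ModelText`, `ObservableTypes.Person`, and so on), and `buildExtractionStage` is a hypothetical helper, not an SDK function.

```typescript
// String stand-ins for the SDK enum members used on this page (assumption).
type ObservableType = 'Person' | 'Organization' | 'Place' | 'Event';

interface ExtractionJob {
  connector: {
    type: 'ModelText';
    extractedTypes: ObservableType[];
  };
}

interface ExtractionStage {
  jobs: ExtractionJob[];
}

// Build one extraction job per non-empty group of entity types,
// removing duplicate types within each group.
function buildExtractionStage(groups: ObservableType[][]): ExtractionStage {
  return {
    jobs: groups
      .filter(group => group.length > 0)
      .map((group): ExtractionJob => ({
        connector: {
          type: 'ModelText',
          extractedTypes: Array.from(new Set(group)),
        },
      })),
  };
}

const stage = buildExtractionStage([
  ['Person', 'Organization', 'Person'],
  ['Place', 'Event'],
  [],
]);

console.log(JSON.stringify(stage, null, 2));
```

In a real workflow the resulting jobs would be expressed with the SDK enums and passed as the `extraction` field of `createWorkflow()`.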

Python

from graphlit import Graphlit
from graphlit_api import enums, input_types

graphlit = Graphlit()

# Create extraction workflow
workflow = await graphlit.client.create_workflow(
    input_types.WorkflowInput(
        name="Entity Extraction",
        preparation=input_types.PreparationWorkflowStageInput(
            jobs=[
                input_types.PreparationWorkflowJobInput(
                    connector=input_types.FilePreparationConnectorInput(
                        type=enums.FilePreparationServiceTypes.DOCUMENT
                    )
                )
            ]
        ),
        extraction=input_types.ExtractionWorkflowStageInput(
            jobs=[
                input_types.ExtractionWorkflowJobInput(
                    connector=input_types.EntityExtractionConnectorInput(
                        type=enums.EntityExtractionServiceTypes.MODEL_TEXT,
                        extracted_types=[
                            enums.ObservableTypes.PERSON,
                            enums.ObservableTypes.ORGANIZATION,
                        ],
                    )
                )
            ]
        ),
    )
)

# Ingest with extraction
content = await graphlit.client.ingest_uri(
    uri="https://example.com/doc.pdf",
    workflow=input_types.EntityReferenceInput(id=workflow.create_workflow.id),
    is_synchronous=True,
)

# Check entities
result = await graphlit.client.get_content(content.ingest_uri.id)
print(f"Extracted {len(result.content.observations or [])} entities")

Developer Hints

Extraction Requires Preparation
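
Following this page's guidance that extraction depends on preparation, a workflow definition can be sanity-checked before it is submitted. The sketch below is a hypothetical client-side check, not an SDK feature: `WorkflowLike`, `StageLike`, and `checkWorkflow` are illustrative names, and the only assumed shape is the `preparation.jobs` / `extraction.jobs` layout from the TypeScript example above.

```typescript
// Minimal stand-ins for the workflow input shape used on this page (assumption).
interface StageLike {
  jobs?: unknown[] | null;
}

interface WorkflowLike {
  name: string;
  preparation?: StageLike | null;
  extraction?: StageLike | null;
}

// Return a list of human-readable problems; an empty list means the
// workflow has both stages configured.
function checkWorkflow(workflow: WorkflowLike): string[] {
  const problems: string[] = [];
  const hasPreparation = (workflow.preparation?.jobs?.length ?? 0) > 0;
  const hasExtraction = (workflow.extraction?.jobs?.length ?? 0) > 0;

  if (!hasExtraction) {
    problems.push('No extraction jobs: no entities will be extracted.');
  }
  if (hasExtraction && !hasPreparation) {
    problems.push('Extraction is configured without a preparation stage.');
  }
  return problems;
}

const extractionOnly: WorkflowLike = {
  name: 'Extraction only',
  extraction: { jobs: [{}] },
};

const complete: WorkflowLike = {
  name: 'Prepared and extracted',
  preparation: { jobs: [{}] },
  extraction: { jobs: [{}] },
};

console.log(checkWorkflow(extractionOnly));
console.log(checkWorkflow(complete));
```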

More Entity Types = Slower + More Expensive

Vision Models for Complex PDFs

Common Issues & Solutions

Issue: No entities extracted
Solution: Check that the workflow includes an extraction stage and that preparation completed.

Issue: Low confidence scores
Solution: The source text may be ambiguous or lack clear context.

Issue: Too many false positives
Solution: Increase the confidence threshold or narrow the extracted entity types.

Production Example
