Configure Workflow for Entity Extraction

Use Case: Configure Workflow for Entity Extraction

User Intent

"How do I configure a workflow to extract entities from my content? What entity types should I choose and which models work best?"

Operation

SDK Method: createWorkflow() with extraction stage

GraphQL Mutation: createWorkflow

Entity: Workflow with extraction configuration

Prerequisites

  • Graphlit project with API credentials

  • Understanding of entity types (Person, Organization, etc.)

  • Content to process (documents, emails, messages, etc.)


Complete Code Example (TypeScript)

import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with entity extraction
const workflow = await graphlit.createWorkflow({
  name: "Entity Extraction Workflow",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Text
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`Created workflow: ${workflow.createWorkflow.id}`);
console.log(`Extracting: Person, Organization, Place, Event`);

// Use workflow with content ingestion
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  'Entity Extraction Doc',
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

console.log(`Ingested content: ${content.ingestUri.id}`);

Complete Code Example (Python)

Key differences: snake_case method names, UPPER_CASE enum values.

workflow_response = await graphlit.create_workflow(
    name="Entity Extraction Workflow",
    preparation={
        "jobs": [{
            "connector": {
                "type": FilePreparationServiceTypes.TEXT
            }
        }]
    },
    extraction={
        "jobs": [{
            "connector": {
                "type": EntityExtractionServiceTypes.MODEL_TEXT,
                "extractedTypes": [
                    ObservableTypes.PERSON,
                    ObservableTypes.ORGANIZATION,
                    ObservableTypes.PLACE,
                    ObservableTypes.EVENT
                ]
            }
        }]
    }
)

print(f"Created workflow: {workflow_response.create_workflow.id}")


Step-by-Step Explanation

Step 1: Choose Extraction Type

Graphlit supports two extraction connector types:

EntityExtractionServiceTypes.ModelText:

  • For text-based content (documents, emails, messages)

  • Uses LLM to analyze prepared text

  • Fast and cost-effective

  • Best for: PDFs with text, emails, Slack messages, web pages

EntityExtractionServiceTypes.ModelDocument:

  • For visual document analysis

  • Uses vision models (GPT-4o Vision, Claude 3.5 Sonnet)

  • Analyzes images, charts, diagrams, scanned documents

  • Best for: PDFs with images, scanned documents, presentation slides

Step 2: Select Entity Types

Choose entity types based on your domain:

Business Documents:

Medical/Clinical Content:

Technical Documentation:
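
As an illustrative sketch of how these domains map to entity types (the groupings below are assumptions; the `MedicalCondition`/`MedicalDrug`/`MedicalProcedure` names are patterned on the `Medical*`, `Software`, and `Repo` types mentioned under Developer Hints):

```typescript
// Illustrative domain-to-entity-type groupings. The strings mirror
// ObservableTypes member names; in SDK code you would reference e.g.
// ObservableTypes.Person from the generated GraphQL types.
const entityTypesByDomain: Record<string, string[]> = {
  // Business documents: people, companies, locations, events
  business: ["Person", "Organization", "Place", "Event"],
  // Medical/clinical content (type names assumed here)
  medical: ["Person", "MedicalCondition", "MedicalDrug", "MedicalProcedure"],
  // Technical documentation: software products and code repositories
  technical: ["Organization", "Software", "Repo"],
};

console.log(entityTypesByDomain.business.join(", "));
// → Person, Organization, Place, Event
```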

Step 3: Add Preparation Stage

Preparation extracts text before entity extraction:
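
A minimal sketch of the preparation stage on its own, as a plain object (the serialized enum string `"TEXT"` for `FilePreparationServiceTypes.Text` is an assumption here):

```typescript
// Preparation stage: extract plain text from the source file so the
// extraction stage has clean input to analyze.
const preparation = {
  jobs: [
    {
      connector: {
        type: "TEXT", // FilePreparationServiceTypes.Text (assumed string form)
      },
    },
  ],
};

// Passed as the `preparation` field of createWorkflow(), as in the
// complete example above.
console.log(preparation.jobs.length); // → 1
```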

Step 4: Configure Model (Optional)

Specify which LLM model to use via specification:
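
One way to sketch this, assuming the extraction connector accepts a specification reference under a `modelText` property (the nesting and field names here are assumptions, not confirmed SDK fields); the specification itself would be created beforehand with `createSpecification()`:

```typescript
// Hypothetical: pin extraction to a specific LLM by referencing a
// pre-created specification from the extraction connector.
const specificationId = "YOUR_SPECIFICATION_ID"; // from createSpecification()
const extraction = {
  jobs: [
    {
      connector: {
        type: "MODEL_TEXT", // EntityExtractionServiceTypes.ModelText (assumed string form)
        extractedTypes: ["PERSON", "ORGANIZATION"],
        modelText: { specification: { id: specificationId } }, // assumed shape
      },
    },
  ],
};

console.log(extraction.jobs[0].connector.modelText.specification.id);
```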


Configuration Options

Changing Extraction Models

GPT-4o (Default - Recommended):

  • Fast and accurate

  • Good balance of quality and cost

  • Handles 20+ entity types

  • Best for production

GPT-4:

  • Highest quality

  • More expensive

  • Slower processing

  • Best for critical accuracy

Claude 3.5 Sonnet:

  • Very good quality

  • Fast processing

  • Good for long documents

  • Alternative to GPT-4o

Gemini 1.5 Pro:

  • Cost-effective

  • Good quality

  • Fast processing

  • Budget-friendly option

Vision Model Extraction (for PDFs with Images)
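
A sketch of a vision-based extraction stage, swapping the text connector for `ModelDocument` (enum string forms assumed):

```typescript
// Vision extraction: analyze rendered pages (images, charts, scans)
// rather than prepared text.
const visionExtraction = {
  jobs: [
    {
      connector: {
        type: "MODEL_DOCUMENT", // EntityExtractionServiceTypes.ModelDocument (assumed string form)
        extractedTypes: ["PERSON", "ORGANIZATION", "PLACE"],
      },
    },
  ],
};

console.log(visionExtraction.jobs[0].connector.type); // → MODEL_DOCUMENT
```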

Multiple Extraction Jobs

Extract different types with different models:
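
For example (a sketch with assumed enum string forms), a text-model job for core entities can run alongside a vision-model job for entities that only appear in figures:

```typescript
// Two extraction jobs in one workflow: each job has its own connector,
// model type, and set of extracted entity types.
const extraction = {
  jobs: [
    {
      connector: {
        type: "MODEL_TEXT", // text-model job for prose content
        extractedTypes: ["PERSON", "ORGANIZATION"],
      },
    },
    {
      connector: {
        type: "MODEL_DOCUMENT", // vision-model job for charts and diagrams
        extractedTypes: ["PLACE", "EVENT"],
      },
    },
  ],
};

console.log(extraction.jobs.length); // → 2
```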


Variations

Variation 1: Minimal Extraction (Fast)

Extract only core entity types for speed:
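
A sketch of a minimal workflow input limited to the two core types (enum string forms assumed; the object would be passed to `graphlit.createWorkflow()`):

```typescript
// Minimal extraction: fewer entity types means faster, cheaper processing.
const minimalWorkflow = {
  name: "Minimal Extraction",
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT", // EntityExtractionServiceTypes.ModelText (assumed string form)
          extractedTypes: ["PERSON", "ORGANIZATION"], // core types only
        },
      },
    ],
  },
};

console.log(minimalWorkflow.extraction.jobs[0].connector.extractedTypes.length); // → 2
```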

Variation 2: Comprehensive Extraction

Extract all relevant entity types:
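
A sketch with a broader type list (illustrative; `SOFTWARE` and `REPO` follow the `Software`/`Repo` types named under Developer Hints, and the full catalog of 20+ types lives in the `ObservableTypes` enum):

```typescript
// Comprehensive extraction: cast a wide net across entity types.
// Expect longer processing time and higher cost than the minimal variant.
const comprehensiveTypes = [
  "PERSON",
  "ORGANIZATION",
  "PLACE",
  "EVENT",
  "SOFTWARE",
  "REPO",
];

const comprehensiveWorkflow = {
  name: "Comprehensive Extraction",
  extraction: {
    jobs: [{ connector: { type: "MODEL_TEXT", extractedTypes: comprehensiveTypes } }],
  },
};

console.log(comprehensiveTypes.length); // → 6
```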

Variation 3: Medical Content Extraction

Extract medical entities:
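
A sketch using medical entity types (the `MEDICAL_*` names are assumptions patterned on the `Medical*` types mentioned under Developer Hints):

```typescript
// Medical extraction: domain-specific types alongside Person, for the
// clinicians and patients mentioned in the text.
const medicalWorkflow = {
  name: "Medical Extraction",
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT",
          extractedTypes: [
            "PERSON",
            "MEDICAL_CONDITION", // assumed enum name
            "MEDICAL_DRUG", // assumed enum name
            "MEDICAL_PROCEDURE", // assumed enum name
          ],
        },
      },
    ],
  },
};

console.log(medicalWorkflow.extraction.jobs[0].connector.extractedTypes.length); // → 4
```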

Variation 4: Audio/Video Transcription + Extraction

Extract entities from meeting recordings:
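
A sketch pairing a transcription preparation stage with entity extraction (the `"DEEPGRAM"` connector value is an assumption; check the `FilePreparationServiceTypes` enum for the transcription connectors your project supports):

```typescript
// Audio/video: transcribe first, then extract entities from the transcript.
const meetingWorkflow = {
  name: "Meeting Extraction",
  preparation: {
    jobs: [
      {
        connector: {
          type: "DEEPGRAM", // assumed transcription connector
        },
      },
    ],
  },
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT",
          extractedTypes: ["PERSON", "ORGANIZATION", "EVENT"],
        },
      },
    ],
  },
};

console.log(Object.keys(meetingWorkflow).join(", ")); // → name, preparation, extraction
```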

Variation 5: GitHub Repository Analysis

Extract technical entities from code repositories:
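
A sketch targeting technical entities (type names follow the `Software`/`Repo` types from Developer Hints; that repository content would typically arrive via a GitHub feed rather than `ingestUri` is an assumption about the usual setup):

```typescript
// GitHub repository analysis: extract software products and repositories
// referenced in READMEs, issues, and code comments.
const repoWorkflow = {
  name: "GitHub Repo Extraction",
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT",
          extractedTypes: ["SOFTWARE", "REPO", "ORGANIZATION"],
        },
      },
    ],
  },
};

console.log(repoWorkflow.extraction.jobs[0].connector.extractedTypes.join(", "));
// → SOFTWARE, REPO, ORGANIZATION
```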


Common Issues & Solutions

Issue: No Entities Extracted

Problem: Workflow completes but no observations found.

Solutions:

  1. Check content has text (not just images without OCR)

  2. Verify extraction stage is configured

  3. Ensure entity types are appropriate for content

  4. Check confidence threshold isn't too high

  5. Try vision model for scanned documents

Issue: Too Many Low-Quality Entities

Problem: Many entities with low confidence scores.

Solutions:

  1. Use better model (GPT-4 instead of Gemini)

  2. Filter by confidence threshold (>0.7)

  3. Reduce number of entity types

  4. Improve content quality (OCR accuracy)

Issue: Extraction Too Slow

Problem: Processing takes too long.

Solutions:

  1. Reduce number of entity types

  2. Use faster model (GPT-4o instead of GPT-4)

  3. Split large documents

  4. Process in batches

Issue: Wrong Entity Types Extracted

Problem: Entities classified incorrectly.

Solutions:

  1. Use more specific entity types

  2. Provide better preparation (clean text)

  3. Use vision models for visual documents

  4. Combine multiple extraction jobs


Developer Hints

Model Selection Guidelines

  • Production default: GPT-4o (fast, accurate, cost-effective)

  • Highest quality: GPT-4 (use for critical applications)

  • Long documents: Claude 3.5 Sonnet (200K context)

  • Budget-friendly: Gemini 1.5 Pro

  • Visual content: GPT-4o Vision or Claude 3.5 Sonnet

Entity Type Selection Strategy

  1. Start with core types (Person, Organization)

  2. Add domain-specific types (Medical*, Software, Repo)

  3. Test extraction quality

  4. Add more types incrementally

  5. Monitor processing time and cost

Performance Considerations

  • More entity types = longer processing time

  • Vision models slower than text models

  • GPT-4 slower but more accurate than GPT-4o

  • Batch processing for large volumes

  • Consider cost per extraction job

Confidence Threshold Recommendations

  • High precision needed: confidence >= 0.8

  • Balanced: confidence >= 0.7 (recommended)

  • High recall needed: confidence >= 0.5

  • Research/exploration: confidence >= 0.3
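
Once extraction has run, these thresholds can be applied client-side when reading observations; a sketch assuming each observation carries a numeric `confidence` field (the field name is an assumption):

```typescript
// Filter extracted observations by confidence threshold.
interface Observation {
  type: string;
  name: string;
  confidence: number; // assumed field on extraction results
}

function filterByConfidence(observations: Observation[], threshold = 0.7): Observation[] {
  return observations.filter((o) => o.confidence >= threshold);
}

const sample: Observation[] = [
  { type: "PERSON", name: "Ada Lovelace", confidence: 0.92 },
  { type: "ORGANIZATION", name: "Acme", confidence: 0.41 },
];

console.log(filterByConfidence(sample).length); // → 1
```

The default of 0.7 matches the "Balanced" recommendation above; pass a lower threshold for high-recall or exploratory use.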

Cost Optimization

  1. Use text extraction when possible (cheaper than vision)

  2. Choose appropriate model (GPT-4o vs GPT-4)

  3. Extract only needed entity types

  4. Batch process for volume discounts

  5. Cache extracted entities

