Configure Workflow for Entity Extraction
User Intent
"How do I configure a workflow to extract entities from my content? What entity types should I choose and which models work best?"
Operation
SDK Method: createWorkflow() with extraction stage
GraphQL Mutation: createWorkflow
Entity: Workflow with extraction configuration
Prerequisites
Graphlit project with API credentials
Understanding of entity types (Person, Organization, etc.)
Content to process (documents, emails, messages, etc.)
Complete Code Example (TypeScript)
import { Graphlit } from 'graphlit-client';
import {
FilePreparationServiceTypes,
ExtractionServiceTypes,
ObservableTypes,
} from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
// Create workflow with entity extraction
const workflow = await graphlit.createWorkflow({
name: "Entity Extraction Workflow",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Text
}
}]
},
extraction: {
jobs: [{
connector: {
        type: ExtractionServiceTypes.ModelText,
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event
]
}
}]
}
});
console.log(`Created workflow: ${workflow.createWorkflow.id}`);
console.log(`Extracting: Person, Organization, Place, Event`);
// Use workflow with content ingestion
const content = await graphlit.ingestUri(
'https://example.com/document.pdf',
'Entity Extraction Doc',
undefined,
true,
{ id: workflow.createWorkflow.id }
);
console.log(`Ingesting content with entity extraction...`);
Complete Code Example (Python)
Key differences: snake_case methods, enum values
workflow_response = await graphlit.create_workflow(
    name="Entity Extraction Workflow",
    preparation={
        "jobs": [{
            "connector": {
                "type": FilePreparationServiceTypes.TEXT
            }
        }]
    },
    extraction={
        "jobs": [{
            "connector": {
                "type": ExtractionServiceTypes.MODEL_TEXT,
                "extracted_types": [
                    ObservableTypes.PERSON,
                    ObservableTypes.ORGANIZATION,
                    ObservableTypes.PLACE,
                    ObservableTypes.EVENT
                ]
            }
        }]
    }
)
print(f"Created workflow: {workflow_response.create_workflow.id}")
Step-by-Step Explanation
Step 1: Choose Extraction Type
Graphlit supports two extraction connector types:
ExtractionServiceTypes.ModelText:
For text-based content (documents, emails, messages)
Uses LLM to analyze prepared text
Fast and cost-effective
Best for: PDFs with text, emails, Slack messages, web pages
ExtractionServiceTypes.ModelDocument:
For visual document analysis
Uses vision models (GPT-4o Vision, Claude 3.5 Sonnet)
Analyzes images, charts, diagrams, scanned documents
Best for: PDFs with images, scanned documents, presentation slides
Step 2: Select Entity Types
Choose entity types based on your domain:
Business Documents:
Medical/Clinical Content:
Technical Documentation:
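As a rough sketch, the three domains above might map to `extractedTypes` selections like the following. The enum values are shown as GraphQL-style strings; the members beyond Person, Organization, Place, and Event (PRODUCT, MEDICAL_*, SOFTWARE, REPO) are assumptions based on the entity types mentioned elsewhere in this guide, so verify them against `ObservableTypes` in your generated schema:

```typescript
// Illustrative per-domain entity-type selections (GraphQL enum values as strings).
// In the TypeScript SDK these correspond to ObservableTypes members.
const businessTypes = ["PERSON", "ORGANIZATION", "PLACE", "EVENT", "PRODUCT"];
const medicalTypes = ["MEDICAL_CONDITION", "MEDICAL_DRUG", "MEDICAL_PROCEDURE", "MEDICAL_STUDY"];
const technicalTypes = ["SOFTWARE", "REPO", "PERSON", "ORGANIZATION"];
```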
Step 3: Add Preparation Stage
Preparation extracts text before entity extraction:
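A minimal preparation stage might look like the following, shown as a plain workflow-input fragment where the string "TEXT" stands in for `FilePreparationServiceTypes.Text` in the SDK:

```typescript
// Preparation stage: extract plain text from ingested files before extraction runs.
const preparation = {
  jobs: [{
    connector: {
      type: "TEXT" // FilePreparationServiceTypes.Text in the TypeScript SDK
    }
  }]
};
```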
Step 4: Configure Model (Optional)
Specify which LLM model to use via specification:
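A sketch of pointing the extraction connector at a previously created specification by ID. The `modelText.specification` field name is an assumption about the extraction connector input type, and the ID is a placeholder; check your generated schema before relying on this shape:

```typescript
// Hypothetical: reference an LLM specification from the extraction connector.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_TEXT", // ExtractionServiceTypes.ModelText in the SDK
      extractedTypes: ["PERSON", "ORGANIZATION"],
      modelText: {
        specification: { id: "YOUR_SPECIFICATION_ID" } // placeholder ID
      }
    }
  }]
};
```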
Configuration Options
Changing Extraction Models
GPT-4o (Default - Recommended):
Fast and accurate
Good balance of quality and cost
Handles 20+ entity types
Best for production
GPT-4:
Highest quality
More expensive
Slower processing
Best for critical accuracy
Claude 3.5 Sonnet:
Very good quality
Fast processing
Good for long documents
Alternative to GPT-4o
Gemini 1.5 Pro:
Cost-effective
Good quality
Fast processing
Budget-friendly option
Vision Model Extraction (for PDFs with Images)
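A sketch of an extraction stage that uses the document (vision) connector instead of the text connector, for PDFs with images, scans, and slides. "MODEL_DOCUMENT" stands in for `ExtractionServiceTypes.ModelDocument`:

```typescript
// Vision-based extraction: the model analyzes page images rather than prepared text.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_DOCUMENT", // ExtractionServiceTypes.ModelDocument in the SDK
      extractedTypes: ["PERSON", "ORGANIZATION", "PLACE"]
    }
  }]
};
```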
Multiple Extraction Jobs
Extract different types with different models:
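One way to sketch this: two jobs in the same extraction stage, with the text connector handling core entities and the document (vision) connector handling entities embedded in images. The split shown here is illustrative, not prescriptive:

```typescript
// Two extraction jobs in one workflow stage.
const extraction = {
  jobs: [
    {
      connector: {
        type: "MODEL_TEXT",          // text-based extraction for core entities
        extractedTypes: ["PERSON", "ORGANIZATION"]
      }
    },
    {
      connector: {
        type: "MODEL_DOCUMENT",      // vision-based extraction for visual content
        extractedTypes: ["PLACE", "EVENT"]
      }
    }
  ]
};
```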
Variations
Variation 1: Minimal Extraction (Fast)
Extract only core entity types for speed:
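A minimal sketch, limiting extraction to the two highest-value types for speed and cost:

```typescript
// Minimal extraction: fewer entity types means faster, cheaper processing.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_TEXT",
      extractedTypes: ["PERSON", "ORGANIZATION"]
    }
  }]
};
```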
Variation 2: Comprehensive Extraction
Extract all relevant entity types:
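A sketch of a broad selection. The members beyond the four used in the main example (PRODUCT, SOFTWARE, REPO, CATEGORY, LABEL) are assumed `ObservableTypes` names; confirm them against your generated schema:

```typescript
// Comprehensive extraction: broad coverage at the cost of processing time.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_TEXT",
      extractedTypes: [
        "PERSON", "ORGANIZATION", "PLACE", "EVENT",
        "PRODUCT", "SOFTWARE", "REPO", "CATEGORY", "LABEL"
      ]
    }
  }]
};
```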
Variation 3: Medical Content Extraction
Extract medical entities:
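A sketch using the Medical* entity types this guide mentions. The specific member names are assumptions; verify them in `ObservableTypes`:

```typescript
// Medical/clinical extraction: domain-specific entity types.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_TEXT",
      extractedTypes: [
        "MEDICAL_CONDITION", "MEDICAL_DRUG",
        "MEDICAL_PROCEDURE", "MEDICAL_STUDY"
      ]
    }
  }]
};
```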
Variation 4: Audio/Video Transcription + Extraction
Extract entities from meeting recordings:
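A sketch of a transcribe-then-extract workflow: audio is transcribed during preparation, then text extraction runs over the transcript. "DEEPGRAM" as a preparation connector type is an assumption about `FilePreparationServiceTypes`; verify it before use:

```typescript
// Transcribe meeting audio first, then extract entities from the transcript.
const workflowInput = {
  name: "Meeting Extraction",
  preparation: {
    jobs: [{
      connector: { type: "DEEPGRAM" } // assumed transcription connector type
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: "MODEL_TEXT",
        extractedTypes: ["PERSON", "ORGANIZATION", "EVENT"]
      }
    }]
  }
};
```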
Variation 5: GitHub Repository Analysis
Extract technical entities from code repositories:
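A sketch pairing the Software and Repo types this guide mentions with the people and organizations that appear in commits and issues:

```typescript
// Technical extraction for code repositories.
const extraction = {
  jobs: [{
    connector: {
      type: "MODEL_TEXT",
      extractedTypes: ["SOFTWARE", "REPO", "PERSON", "ORGANIZATION"]
    }
  }]
};
```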
Common Issues & Solutions
Issue: No Entities Extracted
Problem: Workflow completes but no observations found.
Solutions:
Check content has text (not just images without OCR)
Verify extraction stage is configured
Ensure entity types are appropriate for content
Check confidence threshold isn't too high
Try vision model for scanned documents
Issue: Too Many Low-Quality Entities
Problem: Many entities with low confidence scores.
Solutions:
Use better model (GPT-4 instead of Gemini)
Filter by confidence threshold (>0.7)
Reduce number of entity types
Improve content quality (OCR accuracy)
Issue: Extraction Too Slow
Problem: Processing takes too long.
Solutions:
Reduce number of entity types
Use faster model (GPT-4o instead of GPT-4)
Split large documents
Process in batches
Issue: Wrong Entity Types Extracted
Problem: Entities classified incorrectly.
Solutions:
Use more specific entity types
Provide better preparation (clean text)
Use vision models for visual documents
Combine multiple extraction jobs
Developer Hints
Model Selection Guidelines
Production default: GPT-4o (fast, accurate, cost-effective)
Highest quality: GPT-4 (use for critical applications)
Long documents: Claude 3.5 Sonnet (128K context)
Budget-friendly: Gemini 1.5 Pro
Visual content: GPT-4o Vision or Claude 3.5 Sonnet
Entity Type Selection Strategy
Start with core types (Person, Organization)
Add domain-specific types (Medical*, Software, Repo)
Test extraction quality
Add more types incrementally
Monitor processing time and cost
Performance Considerations
More entity types = longer processing time
Vision models slower than text models
GPT-4 slower but more accurate than GPT-4o
Batch processing for large volumes
Consider cost per extraction job
Confidence Threshold Recommendations
High precision needed: confidence >= 0.8
Balanced: confidence >= 0.7 (recommended)
High recall needed: confidence >= 0.5
Research/exploration: confidence >= 0.3
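The thresholds above can be applied client-side when reading back extracted observations. The observation shape below (`type`, `observable.name`, `confidence`) is an assumption about how confidence is surfaced; adjust it to match your schema:

```typescript
// Sketch: filter extracted observations by a confidence threshold.
interface Observation {
  type: string;
  observable: { name: string };
  confidence: number;
}

function filterByConfidence(observations: Observation[], threshold = 0.7): Observation[] {
  return observations.filter(o => o.confidence >= threshold);
}

const sample: Observation[] = [
  { type: "PERSON", observable: { name: "Ada Lovelace" }, confidence: 0.92 },
  { type: "ORGANIZATION", observable: { name: "Acme" }, confidence: 0.55 },
];

// With the recommended 0.7 threshold, only the 0.92 observation survives.
const kept = filterByConfidence(sample);
```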
Cost Optimization
Use text extraction when possible (cheaper than vision)
Choose appropriate model (GPT-4o vs GPT-4)
Extract only needed entity types
Batch process for volume discounts
Cache extracted entities