Workflow: How Entity Extraction Works
User Intent
"How does entity extraction actually work? What happens during the extraction stage?"
Operation
SDK Method: createWorkflow() with extraction stage
GraphQL: Workflow with extraction configuration
Entity Type: Workflow
Common Use Cases: Understanding extraction pipeline, configuring extraction, choosing models
Extraction Pipeline Overview
Entity extraction is an LLM-based process that analyzes text and identifies structured entities (people, organizations, places, etc.).
TypeScript (Canonical)
import { Graphlit } from 'graphlit-client';
import { EntityExtractionServiceTypes, FilePreparationServiceTypes, ObservableTypes } from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with extraction
const workflow = await graphlit.createWorkflow({
  name: "Document Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

// Ingest with extraction workflow
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,
  undefined,
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

// Check extracted entities
const result = await graphlit.getContent(content.ingestUri.id);
console.log(`Extracted ${result.content.observations?.length || 0} entity observations`);
result.content.observations?.forEach(obs => {
  console.log(`${obs.type}: ${obs.observable.name}`);
  console.log(`  Confidence: ${obs.occurrences?.[0]?.confidence}`);
});

The Extraction Pipeline
Step-by-Step Process
LLM-Based Extraction
What the LLM Does
Model Selection
Vision-Based Extraction
For PDFs with images, charts, diagrams:
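A minimal sketch of how an extraction stage might add a vision job alongside the text job. The local types and the string values `MODEL_TEXT` and `MODEL_IMAGE` are assumptions modeled on the TypeScript example above, not confirmed SDK definitions, and `buildExtractionStage` is a hypothetical helper. Check the generated GraphQL types in `graphlit-client` before relying on any of these names.

```typescript
// Local stand-ins for the SDK's generated types (assumed shapes, not imported).
type ExtractionServiceType = 'MODEL_TEXT' | 'MODEL_IMAGE';
type ObservableType = 'PERSON' | 'ORGANIZATION' | 'PLACE' | 'EVENT';

interface ExtractionJob {
  connector: {
    type: ExtractionServiceType;
    extractedTypes: ObservableType[];
  };
}

interface ExtractionStage {
  jobs: ExtractionJob[];
}

// Hypothetical helper: always include a text job, and add a vision job
// only when the content is expected to contain images, charts, or diagrams.
function buildExtractionStage(
  extractedTypes: ObservableType[],
  hasVisualContent: boolean
): ExtractionStage {
  const jobs: ExtractionJob[] = [
    { connector: { type: 'MODEL_TEXT', extractedTypes } },
  ];
  if (hasVisualContent) {
    jobs.push({ connector: { type: 'MODEL_IMAGE', extractedTypes } });
  }
  return { jobs };
}

// Stage for a PDF with charts: one text job plus one vision job.
const visionStage = buildExtractionStage(['PERSON', 'ORGANIZATION'], true);
```

The resulting object has the same `jobs` / `connector` layout as the `extraction` field passed to `createWorkflow()` above, so it could be adapted to the real enum values once they are confirmed.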
Extraction Models Comparison
GPT-4 (OpenAI)
Quality: Highest
Speed: Moderate
Cost: High
Use: Production, high-value content
GPT-4o (OpenAI)
Quality: High
Speed: Fast
Cost: Moderate
Use: Balanced production workloads
Claude 3.5 Sonnet (Anthropic)
Quality: High
Speed: Fast
Cost: Moderate
Use: Alternative to GPT-4o, good quality
Gemini Pro (Google)
Quality: Good
Speed: Fast
Cost: Lower
Use: Cost-sensitive applications
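The comparison above can be encoded as data to make the trade-offs explicit. The sketch below is illustrative only: the ratings are copied from this page, while `candidateModels` and the numeric rankings are hypothetical helpers, not part of the Graphlit SDK.

```typescript
interface ModelProfile {
  name: string;
  quality: 'Highest' | 'High' | 'Good';
  speed: 'Fast' | 'Moderate';
  cost: 'High' | 'Moderate' | 'Lower';
}

// Ratings transcribed from the comparison above.
const MODEL_PROFILES: ModelProfile[] = [
  { name: 'GPT-4', quality: 'Highest', speed: 'Moderate', cost: 'High' },
  { name: 'GPT-4o', quality: 'High', speed: 'Fast', cost: 'Moderate' },
  { name: 'Claude 3.5 Sonnet', quality: 'High', speed: 'Fast', cost: 'Moderate' },
  { name: 'Gemini Pro', quality: 'Good', speed: 'Fast', cost: 'Lower' },
];

// Assumed orderings used only to compare the qualitative labels.
const QUALITY_RANK: Record<ModelProfile['quality'], number> = { Good: 1, High: 2, Highest: 3 };
const COST_RANK: Record<ModelProfile['cost'], number> = { Lower: 1, Moderate: 2, High: 3 };

// Return the names of models that meet a minimum quality within a cost ceiling.
function candidateModels(
  minQuality: ModelProfile['quality'],
  maxCost: ModelProfile['cost']
): string[] {
  return MODEL_PROFILES
    .filter(m => QUALITY_RANK[m.quality] >= QUALITY_RANK[minQuality])
    .filter(m => COST_RANK[m.cost] <= COST_RANK[maxCost])
    .map(m => m.name);
}

const balanced = candidateModels('High', 'Moderate'); // ['GPT-4o', 'Claude 3.5 Sonnet']
```

An empty result (for example, asking for 'Highest' quality at 'Moderate' cost) signals that one of the two constraints has to be relaxed.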
Prompt Engineering for Extraction
Default Prompts
Graphlit uses optimized prompts for each entity type.
Custom Prompts (Advanced)
Future feature: Custom extraction prompts for domain-specific needs
Confidence Scoring
How Confidence is Calculated
The LLM assigns each extracted entity a confidence score based on:
Context clarity
Explicit mentions vs inferences
Ambiguity in text
Supporting evidence
Using Confidence Thresholds
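A minimal sketch of client-side threshold filtering, assuming observations have the shape used in the TypeScript example above (`type`, `observable.name`, `occurrences[].confidence`) and that confidence is a number between 0 and 1. The `Observation` interface, the helper names, and the 0.7 threshold are illustrative assumptions, not SDK definitions or recommended values.

```typescript
// Assumed shape, mirroring the fields read in the example above.
interface Occurrence {
  confidence?: number | null;
}

interface Observation {
  type: string;
  observable: { name: string };
  occurrences?: Occurrence[] | null;
}

// Highest confidence across an observation's occurrences, or 0 if none is reported.
function bestConfidence(obs: Observation): number {
  const scores = (obs.occurrences ?? [])
    .map(o => o.confidence)
    .filter((c): c is number => typeof c === 'number');
  return scores.length > 0 ? Math.max(...scores) : 0;
}

// Keep only observations whose best confidence meets the threshold.
function filterByConfidence(observations: Observation[], threshold: number): Observation[] {
  return observations.filter(obs => bestConfidence(obs) >= threshold);
}

// Made-up sample data for illustration.
const sample: Observation[] = [
  { type: 'PERSON', observable: { name: 'Ada Lovelace' }, occurrences: [{ confidence: 0.95 }] },
  { type: 'ORGANIZATION', observable: { name: 'Acme' }, occurrences: [{ confidence: 0.4 }, { confidence: 0.55 }] },
  { type: 'PLACE', observable: { name: 'Springfield' } },
];

const confident = filterByConfidence(sample, 0.7); // keeps only the PERSON observation
```

Treating a missing confidence as 0 is a conservative design choice: entities without a score are dropped rather than trusted. Raising the threshold trades recall for fewer false positives.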
When Extraction Runs
During Workflow Processing
Multiple Extraction Jobs
Python
from graphlit import Graphlit
from graphlit_api import enums, input_types

graphlit = Graphlit()

# Create extraction workflow
workflow = await graphlit.client.create_workflow(
    input_types.WorkflowInput(
        name="Entity Extraction",
        preparation=input_types.PreparationWorkflowStageInput(
            jobs=[
                input_types.PreparationWorkflowJobInput(
                    connector=input_types.FilePreparationConnectorInput(
                        type=enums.FilePreparationServiceTypes.DOCUMENT
                    )
                )
            ]
        ),
        extraction=input_types.ExtractionWorkflowStageInput(
            jobs=[
                input_types.ExtractionWorkflowJobInput(
                    connector=input_types.EntityExtractionConnectorInput(
                        type=enums.EntityExtractionServiceTypes.MODEL_TEXT,
                        extracted_types=[
                            enums.ObservableTypes.PERSON,
                            enums.ObservableTypes.ORGANIZATION
                        ]
                    )
                )
            ]
        )
    )
)

# Ingest with extraction
content = await graphlit.client.ingest_uri(
    uri='https://example.com/doc.pdf',
    workflow=input_types.EntityReferenceInput(id=workflow.create_workflow.id),
    is_synchronous=True
)

# Check entities
result = await graphlit.client.get_content(content.ingest_uri.id)
print(f"Extracted {len(result.content.observations or [])} entities")
Developer Hints
Extraction Requires Preparation
More Entity Types = Slower + More Expensive
Vision Models for Complex PDFs
Common Issues & Solutions
Issue: No entities extracted
Solution: Check that the workflow has an extraction stage and that preparation completed.
Issue: Low confidence scores
Solution: The text may be ambiguous or lack clear context.
Issue: Too many false positives
Solution: Increase the confidence threshold or narrow the entity types.
Production Example