How Entity Extraction Works
User Intent
"How does entity extraction actually work? What happens during the extraction stage?"
Operation
SDK Method: `createWorkflow()` with extraction stage
GraphQL: Workflow with extraction configuration
Entity Type: Workflow
Common Use Cases: Understanding extraction pipeline, configuring extraction, choosing models
Extraction Pipeline Overview
Entity extraction is an LLM-based process that analyzes text and identifies structured entities (people, organizations, places, etc.).
TypeScript (Canonical)
import { Graphlit } from 'graphlit-client';
import { EntityExtractionServiceTypes, FilePreparationServiceTypes, ModelServiceTypes, ObservableTypes, OpenAIModels, SpecificationTypes } from 'graphlit-client/dist/generated/graphql-types';
const graphlit = new Graphlit();
// Create workflow with extraction
const workflow = await graphlit.createWorkflow({
name: "Document Entity Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Document
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event
]
}
}]
}
});
// Ingest with extraction workflow
const content = await graphlit.ingestUri(
'https://example.com/document.pdf',
undefined,
undefined,
undefined,
true,
{ id: workflow.createWorkflow.id }
);
// Check extracted entities
const result = await graphlit.getContent(content.ingestUri.id);
console.log(`Extracted ${result.content.observations?.length || 0} entity observations`);
result.content.observations?.forEach(obs => {
console.log(`${obs.type}: ${obs.observable.name}`);
console.log(` Confidence: ${obs.occurrences?.[0]?.confidence}`);
});

The Extraction Pipeline
Step-by-Step Process
1. Content Ingestion
↓
2. Preparation Stage
- Text extraction (PDF, Word, etc.)
- OCR (scanned documents)
- Audio transcription
- Text chunking
↓
3. Extraction Stage (THIS IS WHERE IT HAPPENS)
- Send text to LLM (GPT-4, Claude, etc.)
- LLM analyzes text for entities
- LLM returns structured JSON with entities
- Each entity has: type, name, properties, confidence
↓
4. Observation Creation
- Create Observation records
- Link to content
- Store occurrence details (page, location, confidence)
↓
5. Entity Resolution
- Check if entity already exists (by name, email, url, etc.)
- Create new Observable OR link to existing
- Deduplicate entities
↓
6. Graph Storage
- Store in graph database
- Create entity nodes
- Create observation edges
- Link to content
↓
7. Content State → ENABLED

LLM-Based Extraction
What the LLM Does
// Behind the scenes, LLM receives prompt like:
const prompt = `
Extract entities from the following text.
Return a JSON array of entities with:
- type (PERSON, ORGANIZATION, PLACE, EVENT, etc.)
- name
- properties (email, jobTitle, url, etc.)
- confidence (0.0 to 1.0)
Text:
"""
Kirk Marple is the CEO of Graphlit, a semantic memory platform based in
Seattle. The company was founded in 2023 and raised $2.5M in funding.
"""
Expected output:
[
{
"type": "PERSON",
"name": "Kirk Marple",
"properties": {
"jobTitle": "CEO",
"affiliation": "Graphlit"
},
"confidence": 0.95
},
{
"type": "ORGANIZATION",
"name": "Graphlit",
"properties": {
"description": "semantic memory platform",
"foundingDate": "2023"
},
"confidence": 0.98
},
{
"type": "PLACE",
"name": "Seattle",
"confidence": 0.92
}
]
`;
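A reply in the shape above could be parsed and validated before observations are created. This is an illustrative sketch; the `ExtractedEntity` interface and `parseEntities` helper are hypothetical, not the SDK's internal types:

```typescript
// Illustrative shapes; the real SDK types differ.
interface ExtractedEntity {
  type: string;
  name: string;
  properties?: Record<string, string>;
  confidence: number;
}

// Parse the model's JSON reply, dropping malformed or out-of-range entries.
function parseEntities(reply: string): ExtractedEntity[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(reply);
  } catch {
    return []; // model returned non-JSON text
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((e): e is ExtractedEntity =>
    typeof e === 'object' && e !== null &&
    typeof (e as ExtractedEntity).type === 'string' &&
    typeof (e as ExtractedEntity).name === 'string' &&
    typeof (e as ExtractedEntity).confidence === 'number' &&
    (e as ExtractedEntity).confidence >= 0 &&
    (e as ExtractedEntity).confidence <= 1
  );
}
```

Defensive parsing like this matters because LLM output is not guaranteed to be well-formed JSON on every call.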
// LLM processes and returns structured JSON
// Graphlit parses and creates Observations

Model Selection
// Specify model via specification
const gpt4Spec = await graphlit.createSpecification({
name: "GPT-4 Extraction",
type: SpecificationTypes.Completion,
serviceType: ModelServiceTypes.OpenAI,
openAI: {
model: OpenAIModels.Gpt4Turbo, // High quality
temperature: 0.0 // Deterministic
}
});
const workflow = await graphlit.createWorkflow({
name: "High Quality Extraction",
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
}
}]
},
specification: { id: gpt4Spec.createSpecification.id }
});

Vision-Based Extraction
For PDFs with images, charts, diagrams:
const visionWorkflow = await graphlit.createWorkflow({
name: "PDF Vision Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Document,
extractImages: true, // Extract images from PDF
ocrImages: true // OCR on images
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelDocument, // Vision model
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization
]
}
}]
}
});
// Vision models can extract from:
// - Charts and diagrams
// - Organizational charts
// - Scanned documents
// - Images with text
// - Infographics

Extraction Models Comparison
GPT-4 (OpenAI)
Quality: Highest
Speed: Moderate
Cost: High
Use: Production, high-value content
GPT-4o (OpenAI)
Quality: High
Speed: Fast
Cost: Moderate
Use: Balanced production workloads
Claude 3.5 Sonnet (Anthropic)
Quality: High
Speed: Fast
Cost: Moderate
Use: Alternative to GPT-4o, good quality
Gemini Pro (Google)
Quality: Good
Speed: Fast
Cost: Lower
Use: Cost-sensitive applications
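The comparison above can be encoded as a small selection helper. The identifiers here are plain strings for illustration, not SDK enum values:

```typescript
// Map a workload priority to a provider/model pair, per the comparison above.
// Strings are illustrative labels, not SDK enum values.
type Priority = 'quality' | 'balanced' | 'cost';

function pickExtractionModel(priority: Priority): { service: string; model: string } {
  switch (priority) {
    case 'quality':
      return { service: 'OpenAI', model: 'GPT-4 Turbo' }; // highest quality, highest cost
    case 'balanced':
      return { service: 'OpenAI', model: 'GPT-4o' }; // fast, moderate cost
    case 'cost':
      return { service: 'Google', model: 'Gemini Pro' }; // good quality, lower cost
  }
}
```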
// Configure model
const spec = await graphlit.createSpecification({
name: "Model Spec",
type: SpecificationTypes.Completion,
serviceType: ModelServiceTypes.OpenAI, // or Anthropic, Google
openAI: {
model: OpenAIModels.Gpt4Turbo
}
});

Prompt Engineering for Extraction
Default Prompts
Graphlit uses optimized prompts for each entity type:
// Person extraction prompt (conceptual):
// "Extract people mentioned in the text. Include:
// - Full name
// - Email address (if mentioned)
// - Job title (if mentioned)
// - Affiliation/company (if mentioned)
// - Provide confidence score"
// Organization extraction prompt:
// "Extract organizations mentioned. Include:
// - Full organization name
// - URL (if mentioned)
// - Description (if available)
// - Provide confidence score"

Custom Prompts (Advanced)
Future feature: Custom extraction prompts for domain-specific needs
Confidence Scoring
How Confidence is Calculated
The LLM provides confidence based on:
- Context clarity
- Explicit mentions vs. inferences
- Ambiguity in the text
- Supporting evidence
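These factors surface as a numeric score that can be bucketed into qualitative bands; a small helper, with thresholds mirroring the examples in this section (the band names are illustrative, not SDK values):

```typescript
// Bucket a 0.0-1.0 confidence score into qualitative bands.
// Thresholds mirror the examples in this section and are illustrative.
type ConfidenceBand = 'high' | 'medium' | 'low' | 'very-low';

function confidenceBand(confidence: number): ConfidenceBand {
  if (confidence >= 0.9) return 'high';     // clear, explicit, unambiguous
  if (confidence >= 0.7) return 'medium';   // implicit context
  if (confidence >= 0.5) return 'low';      // ambiguous reference
  return 'very-low';                        // multiple possible referents
}
```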
// High confidence (0.9-1.0):
// "Kirk Marple is the CEO..."
// Clear, explicit, unambiguous
// Medium confidence (0.7-0.9):
// "Kirk mentioned that..."
// Implicit context, less clear
// Low confidence (0.5-0.7):
// "The CEO said..."
// Pronoun reference, ambiguous
// Very low confidence (<0.5):
// "He suggested..."
// Multiple possible referents

Using Confidence Thresholds
// Filter by confidence
const content = await graphlit.getContent('content-id');
const highConfidence = content.content.observations?.filter(obs =>
obs.occurrences?.some(occ => (occ.confidence || 0) >= 0.8)
);
console.log(`High confidence entities: ${highConfidence?.length}`);

When Extraction Runs
During Workflow Processing
// Extraction runs AFTER preparation
const workflow = await graphlit.createWorkflow({
preparation: { /* Extract text first */ },
extraction: { /* Then extract entities */ }
});
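When ingestion is not synchronous, you can poll until the content reaches its terminal state. A minimal polling sketch with an injected state getter; with the SDK you might wrap `getContent()` and read `content.state`, as shown earlier on this page:

```typescript
// Poll until the content reaches a target state (default ENABLED) or times out.
// The state getter is injected so the loop works with any client wrapper.
async function waitForState(
  getState: () => Promise<string>,
  target = 'ENABLED',
  intervalMs = 2000,
  maxAttempts = 30
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    if ((await getState()) === target) return true;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return false; // timed out before extraction finished
}
```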
// Timeline:
// 1. Ingest content → State: CREATED
// 2. Preparation runs → Extract text/OCR/transcribe
// 3. Extraction runs → LLM analyzes text
// 4. Observations created
// 5. State: ENABLED

Multiple Extraction Jobs
// Run multiple extraction jobs in parallel
const workflow = await graphlit.createWorkflow({
name: "Multi-Model Extraction",
extraction: {
jobs: [
{
// Text-based extraction
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
}
},
{
// Vision-based extraction (runs in parallel)
connector: {
type: EntityExtractionServiceTypes.ModelDocument,
extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
}
}
]
}
});

**Python**:

from graphlit import Graphlit
from graphlit_api import input_types, enums

graphlit = Graphlit()

# Create extraction workflow
workflow = await graphlit.createWorkflow(
    name="Entity Extraction",
    preparation=input_types.PreparationWorkflowStageInput(
        jobs=[
            input_types.PreparationWorkflowJobInput(
                connector=input_types.FilePreparationConnectorInput(
                    type=enums.FilePreparationServiceTypes.DOCUMENT
                )
            )
        ]
    ),
    extraction=input_types.ExtractionWorkflowStageInput(
        jobs=[
            input_types.ExtractionWorkflowJobInput(
                connector=input_types.EntityExtractionConnectorInput(
                    type=enums.EntityExtractionServiceTypes.MODEL_TEXT,
                    extracted_types=[
                        enums.ObservableTypes.PERSON,
                        enums.ObservableTypes.ORGANIZATION
                    ]
                )
            )
        ]
    )
)

# Ingest with extraction
content = await graphlit.ingestUri(
    uri='https://example.com/doc.pdf',
    workflow=input_types.EntityReferenceInput(id=workflow.create_workflow.id),
    is_synchronous=True
)

# Check entities
result = await graphlit.getContent(content.ingest_uri.id)
print(f"Extracted {len(result.content.observations or [])} entities")
**C#**:
```csharp
using Graphlit;
var graphlit = new Graphlit();
// Create extraction workflow
var workflow = await graphlit.CreateWorkflow(new WorkflowInput
{
Name = "Entity Extraction",
Preparation = new PreparationWorkflowStage
{
Jobs = new[]
{
new PreparationWorkflowJob
{
Connector = new FilePreparationConnector
{
Type = FilePreparationServiceTypes.Document
}
}
}
},
Extraction = new ExtractionWorkflowStage
{
Jobs = new[]
{
new ExtractionWorkflowJob
{
Connector = new EntityExtractionConnector
{
Type = EntityExtractionServiceTypes.ModelText,
ExtractedTypes = new[]
{
ObservableTypes.Person,
ObservableTypes.Organization
}
}
}
}
}
});
// Ingest with extraction
var content = await graphlit.IngestUri(new IngestUriInput
{
Uri = "https://example.com/doc.pdf",
Workflow = new EntityReference { Id = workflow.CreateWorkflow.Id },
IsSynchronous = true
});
// Check entities
var result = await graphlit.GetContent(content.IngestUri.Id);
Console.WriteLine($"Extracted {result.Content.Observations?.Length ?? 0} entities");
```

Developer Hints
Extraction Requires Preparation
// ✗ Won't work - extraction needs prepared text
const workflow = await graphlit.createWorkflow({
extraction: { /* ... */ }
// Missing preparation stage!
});
// ✓ Correct - prepare first
const workflow = await graphlit.createWorkflow({
preparation: { /* Extract text */ },
extraction: { /* Then extract entities */ }
});

More Entity Types = Slower + More Expensive
// Fast + cheap (2 types)
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization
]
// Slower + more expensive (10 types)
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event,
ObservableTypes.Product,
// ... more types
]
// Choose types relevant to your domain

Vision Models for Complex PDFs
// Use ModelDocument (vision) for:
// - Scanned documents
// - PDFs with charts/diagrams
// - Organizational charts
// - Infographics
// Use ModelText for:
// - Plain text documents
// - Word documents
// - Clean PDFs
// - Transcribed audio

Common Issues & Solutions
Issue: No entities extracted
Solution: Check if workflow has extraction stage and preparation completed
// Check content state
const content = await graphlit.getContent('content-id');
console.log(`State: ${content.content.state}`);
// Check if workflow has extraction
const workflow = await graphlit.getWorkflow('workflow-id');
console.log(`Has extraction: ${!!workflow.workflow.extraction}`);

Issue: Low confidence scores
Solution: Text may be ambiguous or context unclear
// Use higher quality model
const betterSpec = await graphlit.createSpecification({
name: "Better Extraction",
type: SpecificationTypes.Completion,
serviceType: ModelServiceTypes.OpenAI,
openAI: {
model: OpenAIModels.Gpt4Turbo // Better than GPT-3.5
}
});
// Apply threshold
const highConfidence = observations.filter(
obs => (obs.occurrences?.[0]?.confidence || 0) >= 0.8
);

Issue: Too many false positives
Solution: Increase confidence threshold or narrow entity types
// Only extract specific types
extractedTypes: [
ObservableTypes.Person // Just people, not everything
]
// Filter by confidence
const reliable = observations.filter(
obs => (obs.occurrences?.[0]?.confidence || 0) >= 0.85
);

Production Example
async function createProductionExtractionWorkflow() {
console.log('Creating production extraction workflow...\n');
// Create high-quality specification
const spec = await graphlit.createSpecification({
name: "Production Extraction",
type: SpecificationTypes.Completion,
serviceType: ModelServiceTypes.OpenAI,
openAI: {
model: OpenAIModels.Gpt4Turbo,
temperature: 0.0 // Deterministic
}
});
console.log(`✓ Created specification: ${spec.createSpecification.id}`);
// Create workflow with both text and vision extraction
const workflow = await graphlit.createWorkflow({
name: "Production Entity Extraction",
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Document,
extractImages: true,
ocrImages: true
}
}]
},
extraction: {
jobs: [
{
// Text extraction
connector: {
type: EntityExtractionServiceTypes.ModelText,
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Event,
ObservableTypes.Product
]
}
},
{
// Vision extraction (for images/charts)
connector: {
type: EntityExtractionServiceTypes.ModelDocument,
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization
]
}
}
]
},
specification: { id: spec.createSpecification.id }
});
console.log(`✓ Created workflow: ${workflow.createWorkflow.id}\n`);
// Test with sample document
console.log('Testing extraction...');
const content = await graphlit.ingestUri(
'https://example.com/sample-document.pdf',
undefined,
undefined,
undefined,
true,
{ id: workflow.createWorkflow.id }
);
// Get results
const result = await graphlit.getContent(content.ingestUri.id);
console.log(`\n=== EXTRACTION RESULTS ===`);
console.log(`Document: ${result.content.name}`);
console.log(`Total observations: ${result.content.observations?.length || 0}`);
// Group by type
const byType = new Map<string, number>();
result.content.observations?.forEach(obs => {
byType.set(obs.type, (byType.get(obs.type) || 0) + 1);
});
console.log('\nEntities by type:');
byType.forEach((count, type) => {
console.log(` ${type}: ${count}`);
});
// Confidence analysis
const confidences = result.content.observations
?.flatMap(obs => obs.occurrences || [])
.map(occ => occ.confidence || 0) || [];
if (confidences.length > 0) {
const avg = confidences.reduce((a, b) => a + b, 0) / confidences.length;
const high = confidences.filter(c => c >= 0.8).length;
const med = confidences.filter(c => c >= 0.6 && c < 0.8).length;
const low = confidences.filter(c => c < 0.6).length;
console.log('\nConfidence distribution:');
console.log(` High (≥80%): ${high}`);
console.log(` Medium (60-80%): ${med}`);
console.log(` Low (<60%): ${low}`);
console.log(` Average: ${(avg * 100).toFixed(1)}%`);
}
return workflow.createWorkflow.id;
}
await createProductionExtractionWorkflow();