# Configure Workflow for Entity Extraction

## Use Case: Configure Workflow for Entity Extraction

### User Intent

"How do I configure a workflow to extract entities from my content? What entity types should I choose and which models work best?"

### Operation

**SDK Method**: `createWorkflow()` with `extraction` stage\
**GraphQL Mutation**: `createWorkflow`\
**Entity**: Workflow with extraction configuration

### Prerequisites

* Graphlit project with API credentials
* Understanding of entity types (Person, Organization, etc.)
* Content to process (documents, emails, messages, etc.)

***

### Complete Code Example (TypeScript)

```typescript
import { Graphlit } from 'graphlit-client';
import {
  FilePreparationServiceTypes,
  EntityExtractionServiceTypes,
  ObservableTypes,
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Create workflow with entity extraction
const workflow = await graphlit.createWorkflow({
  name: "Entity Extraction Workflow",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event
        ]
      }
    }]
  }
});

console.log(`Created workflow: ${workflow.createWorkflow.id}`);
console.log(`Extracting: Person, Organization, Place, Event`);

// Use workflow with content ingestion
const content = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  'Entity Extraction Doc',
  undefined,
  undefined,
  true,
  { id: workflow.createWorkflow.id }
);

console.log(`Ingesting content with entity extraction...`);
```

***

### Python

```python
from graphlit import Graphlit

graphlit = Graphlit()

# Key differences: snake_case methods, enum values
workflow_response = await graphlit.create_workflow(
    name="Entity Extraction Workflow",
    preparation={
        "jobs": [{
            "connector": {
                "type": FilePreparationServiceTypes.TEXT
            }
        }]
    },
    extraction={
        "jobs": [{
            "connector": {
                "type": EntityExtractionServiceTypes.MODEL_TEXT,
                "extractedTypes": [
                    ObservableTypes.PERSON,
                    ObservableTypes.ORGANIZATION,
                    ObservableTypes.PLACE,
                    ObservableTypes.EVENT
                ]
            }
        }]
    }
)

print(f"Created workflow: {workflow_response.create_workflow.id}")
```

### C#

```csharp
using Graphlit;
using Graphlit.Api.Input;

var graphlit = new Graphlit();

// Key differences: PascalCase methods
var workflow = await graphlit.CreateWorkflow(
    name: "Entity Extraction Workflow",
    preparation: new WorkflowPreparationInput
    {
        Jobs = new[]
        {
            new WorkflowPreparationJobInput
            {
                Connector = new FilePreparationConnectorInput
                {
                    Type = FilePreparationServiceTypes.Text
                }
            }
        }
    },
    extraction: new WorkflowExtractionInput
    {
        Jobs = new[]
        {
            new WorkflowExtractionJobInput
            {
                Connector = new ExtractionConnectorInput
                {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ExtractedTypes = new[]
                    {
                        ObservableTypes.Person,
                        ObservableTypes.Organization,
                        ObservableTypes.Place,
                        ObservableTypes.Event
                    }
                }
            }
        }
    }
);

Console.WriteLine($"Created workflow: {workflow.CreateWorkflow.Id}");
```

***

### Step-by-Step Explanation

#### Step 1: Choose Extraction Type

Graphlit supports two extraction connector types:

**`EntityExtractionServiceTypes.ModelText`**:

* For text-based content (documents, emails, messages)
* Uses LLM to analyze prepared text
* Fast and cost-effective
* Best for: PDFs with text, emails, Slack messages, web pages

**`EntityExtractionServiceTypes.ModelDocument`**:

* For visual document analysis
* Uses vision models (GPT-4o Vision, Claude 3.5 Sonnet)
* Analyzes images, charts, diagrams, scanned documents
* Best for: PDFs with images, scanned documents, presentation slides
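
The choice between the two connector types can be sketched as a small decision helper. This is an illustration only: `ContentTraits` and `chooseExtractionConnector` are hypothetical names that are not part of the Graphlit SDK, and the string values stand in for the `EntityExtractionServiceTypes` members.

```typescript
// Hypothetical description of a piece of content; not a Graphlit type.
interface ContentTraits {
  hasTextLayer: boolean;     // e.g. a digital PDF, email, or Slack message
  hasVisualContent: boolean; // e.g. charts, diagrams, scanned pages
}

// Stand-ins for EntityExtractionServiceTypes.ModelText / ModelDocument.
type ExtractionConnector = 'ModelText' | 'ModelDocument';

function chooseExtractionConnector(traits: ContentTraits): ExtractionConnector {
  // Scanned or image-heavy content needs a vision model to be analyzed.
  if (!traits.hasTextLayer || traits.hasVisualContent) {
    return 'ModelDocument';
  }
  // Plain text is faster and cheaper to process with a text model.
  return 'ModelText';
}

const emailChoice = chooseExtractionConnector({ hasTextLayer: true, hasVisualContent: false });
const scanChoice = chooseExtractionConnector({ hasTextLayer: false, hasVisualContent: true });
console.log(emailChoice, scanChoice);
```

The rule deliberately prefers the vision connector whenever visual content is present; if cost matters more than coverage, the condition could be loosened to check only for a missing text layer.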

#### Step 2: Select Entity Types

Choose entity types based on your domain:

**Business Documents**:

```typescript
extractedTypes: [
  ObservableTypes.Person,           // People, contacts
  ObservableTypes.Organization,     // Companies, departments
  ObservableTypes.Place,            // Locations, offices
  ObservableTypes.Event,            // Meetings, deadlines
  ObservableTypes.Product           // Products, services
]
```

**Medical/Clinical Content**:

```typescript
extractedTypes: [
  ObservableTypes.MedicalCondition,
  ObservableTypes.MedicalDrug,
  ObservableTypes.MedicalProcedure,
  ObservableTypes.MedicalTest,
  ObservableTypes.MedicalStudy
]
```

**Technical Documentation**:

```typescript
extractedTypes: [
  ObservableTypes.Software,         // Software products
  ObservableTypes.Repo,             // Code repositories
  ObservableTypes.Organization,     // Tech companies
  ObservableTypes.Person,           // Developers, authors
  ObservableTypes.Category          // Topics, tags
]
```
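
When content spans more than one domain, the lists above can be merged. The sketch below assumes plain strings in place of the `ObservableTypes` enum so that it runs on its own; `DOMAIN_PRESETS` and `entityTypesFor` are hypothetical helpers, not SDK functions.

```typescript
// String stand-ins for the ObservableTypes members listed above.
type EntityType =
  | 'Person' | 'Organization' | 'Place' | 'Event' | 'Product'
  | 'MedicalCondition' | 'MedicalDrug' | 'MedicalProcedure' | 'MedicalTest' | 'MedicalStudy'
  | 'Software' | 'Repo' | 'Category';

type Domain = 'business' | 'medical' | 'technical';

const DOMAIN_PRESETS: Record<Domain, EntityType[]> = {
  business: ['Person', 'Organization', 'Place', 'Event', 'Product'],
  medical: ['MedicalCondition', 'MedicalDrug', 'MedicalProcedure', 'MedicalTest', 'MedicalStudy'],
  technical: ['Software', 'Repo', 'Organization', 'Person', 'Category'],
};

// Merge presets for several domains, keeping first-seen order and dropping duplicates.
function entityTypesFor(domains: Domain[]): EntityType[] {
  const merged: EntityType[] = [];
  for (const domain of domains) {
    for (const entityType of DOMAIN_PRESETS[domain]) {
      if (merged.indexOf(entityType) === -1) {
        merged.push(entityType);
      }
    }
  }
  return merged;
}

const mixedTypes = entityTypesFor(['business', 'technical']);
console.log(mixedTypes);
```

In a real workflow the resulting names would be mapped back to `ObservableTypes` members before being passed as `extractedTypes`.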

#### Step 3: Add Preparation Stage

Preparation extracts text before entity extraction:

```typescript
preparation: {
  jobs: [{
    connector: {
      type: FilePreparationServiceTypes.Document,  // PDFs, Word, etc.
      // FilePreparationServiceTypes.Text for plain text
      // FilePreparationServiceTypes.Audio for transcription
    }
  }]
}
```

#### Step 4: Configure Model (Optional)

Specify which LLM model to use via specification:

```typescript
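import {
  SpecificationTypes,
  ModelServiceTypes,
  OpenAiModels,
} from 'graphlit-client/dist/generated/graphql-types';

const spec = await graphlit.createSpecification({
  name: "GPT-4o Extraction",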
  type: SpecificationTypes.Completion,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAiModels.Gpt4O_128K,
    temperature: 0.1  // Low temperature for consistent extraction
  }
});

const workflow = await graphlit.createWorkflow({
  name: "High-Quality Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [/* ... */]
      }
    }]
  },
  specification: { id: spec.createSpecification.id }
});
```

***

### Configuration Options

#### Changing Extraction Models

**GPT-4o (Default - Recommended)**:

* Fast and accurate
* Good balance of quality and cost
* Handles 20+ entity types
* Best for production

**GPT-4**:

* Highest quality
* More expensive
* Slower processing
* Best for critical accuracy

**Claude 3.5 Sonnet**:

* Very good quality
* Fast processing
* Good for long documents
* Alternative to GPT-4o

**Gemini 1.5 Pro**:

* Cost-effective
* Good quality
* Fast processing
* Budget-friendly option

#### Vision Model Extraction (for PDFs with Images)

```typescript
const workflow = await graphlit.createWorkflow({
  name: "Vision-Based Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Document
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,  // Vision model
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Product
        ]
      }
    }]
  }
});
```

#### Multiple Extraction Jobs

Extract different types with different models:

```typescript
extraction: {
  jobs: [
    {
      // Fast extraction for basic types
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    },
    {
      // Vision extraction for complex types
      connector: {
        type: EntityExtractionServiceTypes.ModelDocument,
        extractedTypes: [
          ObservableTypes.Product,
          ObservableTypes.Software
        ]
      }
    }
  ]
}
```
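
When several jobs are configured, it can help to generate the `jobs` array from a plan instead of writing it by hand. The following is a minimal sketch under stated assumptions: `JobPlan` and `buildExtractionJobs` are hypothetical names, strings stand in for the SDK enums, and assigning each entity type to only one job is a design choice made here to avoid extracting the same type twice, not a documented Graphlit requirement.

```typescript
type ConnectorType = 'ModelText' | 'ModelDocument';

interface ExtractionJob {
  connector: { type: ConnectorType; extractedTypes: string[] };
}

interface JobPlan {
  type: ConnectorType;
  extractedTypes: string[];
}

// Build the jobs array, giving each entity type to the first job that requests it.
function buildExtractionJobs(plans: JobPlan[]): ExtractionJob[] {
  const claimed: string[] = [];
  const jobs: ExtractionJob[] = [];
  for (const plan of plans) {
    const fresh: string[] = [];
    for (const entityType of plan.extractedTypes) {
      if (claimed.indexOf(entityType) === -1) {
        claimed.push(entityType);
        fresh.push(entityType);
      }
    }
    // Skip jobs that would have nothing left to extract.
    if (fresh.length > 0) {
      jobs.push({ connector: { type: plan.type, extractedTypes: fresh } });
    }
  }
  return jobs;
}

const plannedJobs = buildExtractionJobs([
  { type: 'ModelText', extractedTypes: ['Person', 'Organization'] },
  { type: 'ModelDocument', extractedTypes: ['Product', 'Software', 'Person'] },
]);
console.log(JSON.stringify(plannedJobs, null, 2));
```

Here `Person` stays with the text job because it was requested there first, so the vision job only handles `Product` and `Software`.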

***

### Variations

#### Variation 1: Minimal Extraction (Fast)

Extract only core entity types for speed:

```typescript
const workflow = await graphlit.createWorkflow({
  name: "Fast Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});
```

#### Variation 2: Comprehensive Extraction

Extract all relevant entity types:

```typescript
const workflow = await graphlit.createWorkflow({
  name: "Comprehensive Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Event,
          ObservableTypes.Product,
          ObservableTypes.Software,
          ObservableTypes.Category,
          ObservableTypes.Label
        ]
      }
    }]
  }
});
```

#### Variation 3: Medical Content Extraction

Extract medical entities:

```typescript
const workflow = await graphlit.createWorkflow({
  name: "Medical Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.MedicalCondition,
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalDrugClass,
          ObservableTypes.MedicalProcedure,
          ObservableTypes.MedicalTest,
          ObservableTypes.MedicalStudy,
          ObservableTypes.MedicalDevice,
          ObservableTypes.MedicalTherapy
        ]
      }
    }]
  }
});
```

#### Variation 4: Audio/Video Transcription + Extraction

Extract entities from meeting recordings:

```typescript
const workflow = await graphlit.createWorkflow({
  name: "Meeting Entity Extraction",
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Audio,
        audioTranscription: {
          model: AudioTranscriptionServiceDeepgram
        }
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Event,
          ObservableTypes.Product
        ]
      }
    }]
  }
});
```

#### Variation 5: GitHub Repository Analysis

Extract technical entities from code repositories:

```typescript
const workflow = await graphlit.createWorkflow({
  name: "GitHub Extraction",
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [
          ObservableTypes.Repo,
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Software,
          ObservableTypes.Category
        ]
      }
    }]
  }
});
```

***

### Common Issues & Solutions

#### Issue: No Entities Extracted

**Problem**: Workflow completes but no observations found.

**Solutions**:

1. Check content has text (not just images without OCR)
2. Verify extraction stage is configured
3. Ensure entity types are appropriate for content
4. Check confidence threshold isn't too high
5. Try vision model for scanned documents

```typescript
// Debug: Check if extraction stage exists
const workflowDetails = await graphlit.getWorkflow(workflow.createWorkflow.id);
console.log('Extraction jobs:', workflowDetails.workflow.extraction?.jobs);

// Try vision model if text extraction fails
extraction: {
  jobs: [{
    connector: {
      type: EntityExtractionServiceTypes.ModelDocument  // Use vision
    }
  }]
}
```

#### Issue: Too Many Low-Quality Entities

**Problem**: Many entities with low confidence scores.

**Solutions**:

1. Use better model (GPT-4 instead of Gemini)
2. Filter by confidence threshold (>0.7)
3. Reduce number of entity types
4. Improve content quality (OCR accuracy)

```typescript
// Filter low-confidence entities when querying
content.observations
  .filter(obs => obs.occurrences.some(occ => occ.confidence >= 0.7))
  .forEach(obs => console.log(obs.observable.name));
```
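
To try the filter without ingesting anything, it can be run against sample data. The sketch below assumes an observation shape inferred from the snippet above (`observable.name` plus `occurrences` with a `confidence` field); the interfaces, the `filterByConfidence` helper, and the sample values are illustrative, not SDK types or real results.

```typescript
// Shapes inferred from the snippet above; not the SDK's generated types.
interface Occurrence {
  confidence: number;
}

interface Observation {
  observable: { name: string };
  occurrences: Occurrence[];
}

// Keep an observation if at least one of its occurrences meets the threshold.
function filterByConfidence(observations: Observation[], minConfidence: number): Observation[] {
  return observations.filter(obs =>
    obs.occurrences.some(occ => occ.confidence >= minConfidence)
  );
}

// Made-up sample data for illustration.
const sampleObservations: Observation[] = [
  { observable: { name: 'Ada Lovelace' }, occurrences: [{ confidence: 0.92 }, { confidence: 0.4 }] },
  { observable: { name: 'Acme' }, occurrences: [{ confidence: 0.55 }] },
  { observable: { name: 'London' }, occurrences: [] },
];

const confidentNames = filterByConfidence(sampleObservations, 0.7).map(obs => obs.observable.name);
console.log(confidentNames);
```

An observation with no occurrences is always dropped, because `some` returns `false` for an empty array.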

#### Issue: Extraction Too Slow

**Problem**: Processing takes too long.

**Solutions**:

1. Reduce number of entity types
2. Use faster model (GPT-4o instead of GPT-4)
3. Split large documents
4. Process in batches

```typescript
// Optimize: Extract only critical types
extractedTypes: [
  ObservableTypes.Person,
  ObservableTypes.Organization
  // Remove less critical types for speed
]
```
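
Batching can be sketched independently of the SDK. In the example below, `toBatches` and `processInBatches` are hypothetical helpers, and `worker` is a placeholder for whatever per-item call you make (for example, an ingestion request); the batch size is an assumption to tune for your workload.

```typescript
// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) {
    throw new Error('batchSize must be at least 1');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Run the worker over all items: concurrently within a batch, sequentially across batches.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    const batchResults = await Promise.all(batch.map(item => worker(item)));
    results.push(...batchResults);
  }
  return results;
}

const uriBatches = toBatches(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'], 2);
console.log(uriBatches);
```

Limiting concurrency to one batch at a time keeps the number of in-flight requests bounded, at the cost of waiting for the slowest item in each batch.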

#### Issue: Wrong Entity Types Extracted

**Problem**: Entities classified incorrectly.

**Solutions**:

1. Use more specific entity types
2. Provide better preparation (clean text)
3. Use vision models for visual documents
4. Combine multiple extraction jobs

```typescript
// Example: Separate medical and general extraction
extraction: {
  jobs: [
    {
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization]
      }
    },
    {
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        extractedTypes: [ObservableTypes.MedicalCondition, ObservableTypes.MedicalDrug]
      }
    }
  ]
}
```

***

### Developer Hints

#### Model Selection Guidelines

* **Production default**: GPT-4o (fast, accurate, cost-effective)
* **Highest quality**: GPT-4 (use for critical applications)
* **Long documents**: Claude 3.5 Sonnet (200K context)
* **Budget-friendly**: Gemini 1.5 Pro
* **Visual content**: GPT-4o Vision or Claude 3.5 Sonnet
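
These guidelines can be encoded as a lookup table when a project needs to pick a model programmatically. The sketch below is an assumption-laden illustration: `MODEL_GUIDE` and `modelsMatching` are hypothetical, the values are display names rather than SDK enum members, and "GPT-4o Vision" is treated as the same model as GPT-4o.

```typescript
type Priority = 'production' | 'highestQuality' | 'longDocuments' | 'budget' | 'visual';

// Display names taken from the guidelines above; not SDK enum values.
const MODEL_GUIDE: Record<Priority, string[]> = {
  production: ['GPT-4o'],
  highestQuality: ['GPT-4'],
  longDocuments: ['Claude 3.5 Sonnet'],
  budget: ['Gemini 1.5 Pro'],
  visual: ['GPT-4o', 'Claude 3.5 Sonnet'],
};

// Return the models recommended for every one of the given priorities.
function modelsMatching(priorities: Priority[]): string[] {
  if (priorities.length === 0) {
    return [];
  }
  return MODEL_GUIDE[priorities[0]].filter(model =>
    priorities.every(priority => MODEL_GUIDE[priority].indexOf(model) !== -1)
  );
}

const longVisual = modelsMatching(['longDocuments', 'visual']);
const budgetQuality = modelsMatching(['budget', 'highestQuality']);
console.log(longVisual, budgetQuality);
```

An empty result signals conflicting priorities, in which case one of them has to be relaxed.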

#### Entity Type Selection Strategy

1. Start with core types (Person, Organization)
2. Add domain-specific types (Medical\*, Software, Repo)
3. Test extraction quality
4. Add more types incrementally
5. Monitor processing time and cost

#### Performance Considerations

* More entity types = longer processing time
* Vision models slower than text models
* GPT-4 slower but more accurate than GPT-4o
* Batch processing for large volumes
* Consider cost per extraction job

#### Confidence Threshold Recommendations

* **High precision needed**: confidence >= 0.8
* **Balanced**: confidence >= 0.7 (recommended)
* **High recall needed**: confidence >= 0.5
* **Research/exploration**: confidence >= 0.3
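
The effect of each threshold is easiest to see on a fixed set of scores. This sketch uses made-up confidence values; `MIN_CONFIDENCE` and `keepConfident` are hypothetical names that simply encode the recommendations above.

```typescript
type ExtractionGoal = 'highPrecision' | 'balanced' | 'highRecall' | 'exploration';

// Thresholds copied from the recommendations above.
const MIN_CONFIDENCE: Record<ExtractionGoal, number> = {
  highPrecision: 0.8,
  balanced: 0.7,
  highRecall: 0.5,
  exploration: 0.3,
};

// Keep only the scores that meet the threshold for the chosen goal.
function keepConfident(confidences: number[], goal: ExtractionGoal): number[] {
  return confidences.filter(confidence => confidence >= MIN_CONFIDENCE[goal]);
}

// Made-up scores for illustration.
const sampleScores = [0.95, 0.75, 0.55, 0.35, 0.1];
const keptCounts = {
  highPrecision: keepConfident(sampleScores, 'highPrecision').length,
  balanced: keepConfident(sampleScores, 'balanced').length,
  highRecall: keepConfident(sampleScores, 'highRecall').length,
  exploration: keepConfident(sampleScores, 'exploration').length,
};
console.log(keptCounts);
```

Each step down in threshold admits one more of the sample scores, which illustrates the trade-off: lower thresholds recover more entities but admit less certain ones.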

#### Cost Optimization

1. Use text extraction when possible (cheaper than vision)
2. Choose appropriate model (GPT-4o vs GPT-4)
3. Extract only needed entity types
4. Batch process for volume discounts
5. Cache extracted entities
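
Caching extracted entities can be as simple as an in-memory map keyed by a content identifier. The sketch below is one possible approach, not a Graphlit feature: `ExtractionCache` is a hypothetical class, `fakeExtract` stands in for a real extraction call, and a production system would more likely use a persistent store.

```typescript
// Minimal in-memory cache keyed by a content identifier.
class ExtractionCache<T> {
  private readonly entries = new Map<string, T>();
  hits = 0;
  misses = 0;

  // Return the cached value, or compute and store it on the first request.
  getOrCompute(key: string, compute: () => T): T {
    if (this.entries.has(key)) {
      this.hits += 1;
      return this.entries.get(key) as T;
    }
    this.misses += 1;
    const value = compute();
    this.entries.set(key, value);
    return value;
  }
}

// Placeholder for an expensive extraction call; counts how often it runs.
let extractionCalls = 0;
const fakeExtract = (): string[] => {
  extractionCalls += 1;
  return ['Person: Ada Lovelace'];
};

const entityCache = new ExtractionCache<string[]>();
const firstResult = entityCache.getOrCompute('content-123', fakeExtract);
const secondResult = entityCache.getOrCompute('content-123', fakeExtract);
console.log(extractionCalls, entityCache.hits, entityCache.misses);
```

The second lookup is served from the cache, so the placeholder extraction runs only once for the same content identifier.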

***

