# Workflows

Workflows define **how content is processed** as it flows through Graphlit's memory formation pipeline. This is the authoritative reference for all workflow configuration options, defaults, and decision guidance.

**On this page:**

* [Overview & Core Concepts](#overview--core-concepts)
* [Default Behavior (No Workflow)](#default-behavior-no-workflow)
* [When Do You Need a Workflow?](#when-do-you-need-a-workflow)
* [Workflow Stages](#workflow-stages)
* [Complete API Reference](#complete-api-reference)
* [Production Patterns](#production-patterns)

***

## Overview & Core Concepts

### What Workflows Do

Workflows control the **memory formation pipeline** - how raw content transforms into structured, searchable semantic memory:

{% @mermaid/diagram content="graph LR
A\[Raw Content] --> B\[Ingestion Filter]
B --> C\[Indexing Config]
C --> D\[Preparation]
D --> E\[Extraction]
E --> F\[Enrichment]
F --> G\[Classification]
G --> H\[Semantic Memory]
style B fill:#E91E63,color:#fff
style C fill:#9C27B0,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#4CAF50,color:#fff
style F fill:#FF9800,color:#fff
style G fill:#00BCD4,color:#fff" %}

**The Six Pipeline Stages (in execution order):**

1. **Ingestion** - Filter which content to accept (file types, paths)
2. **Indexing** - Apply optional indexing connectors (embedding models are configured at the project level)
3. **Preparation** - Extract text/markdown from files (PDFs, audio, images)
4. **Extraction** - Identify entities (people, organizations, topics)
5. **Enrichment** - Add external data (links, FHIR, Diffbot)
6. **Classification** - Categorize content using LLMs

**Additional Configuration:**

* **Storage** - Where files are stored (defaults to managed storage)
* **Actions** - Post-processing webhooks and integrations

### The Workflow Object

```typescript
interface WorkflowInput {
  name: string;                                      // Required: Workflow name
  
  // Pipeline stages (in execution order):
  ingestion?: IngestionWorkflowStageInput;           // Optional: Content filtering
  indexing?: IndexingWorkflowStageInput;             // Optional: Custom indexing
  preparation?: PreparationWorkflowStageInput;       // Optional: Text extraction
  extraction?: ExtractionWorkflowStageInput;         // Optional: Entity extraction
  enrichment?: EnrichmentWorkflowStageInput;         // Optional: External enrichment
  classification?: ClassificationWorkflowStageInput; // Optional: Content classification
  
  // Additional configuration:
  storage?: StorageWorkflowStageInput;               // Optional: Storage settings
  actions?: WorkflowActionInput[];                   // Optional: Post-processing actions
}
```

**Key insight:** All stages are **optional**. Graphlit has intelligent defaults.
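
Because every stage is optional, it can help to check which stages a given workflow actually sets. The sketch below is a local helper, not part of the Graphlit SDK: `WorkflowShape`, `PipelineStageName`, and `configuredStages` are hypothetical names, and the stage payloads are typed as `unknown` so the snippet stays self-contained.

```typescript
// Hypothetical local shape mirroring the optional stage keys of WorkflowInput.
type PipelineStageName =
  | 'ingestion' | 'indexing' | 'preparation'
  | 'extraction' | 'enrichment' | 'classification';

type WorkflowShape = { name: string } & Partial<Record<PipelineStageName, unknown>>;

// Stage keys in the execution order listed above.
const STAGE_ORDER: readonly PipelineStageName[] = [
  'ingestion', 'indexing', 'preparation',
  'extraction', 'enrichment', 'classification',
];

// Returns the stages a workflow sets explicitly; every other stage uses defaults.
function configuredStages(workflow: WorkflowShape): PipelineStageName[] {
  return STAGE_ORDER.filter((stage) => workflow[stage] != null);
}

const sampleStages = configuredStages({
  name: 'Extract Entities',
  extraction: { jobs: [] },
  preparation: { jobs: [] },
});
console.log(sampleStages); // [ 'preparation', 'extraction' ]
```

The result follows pipeline order rather than the order in which the keys were written.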

***

## Default Behavior (No Workflow)

### What Happens Without a Workflow

```typescript
// Simple ingestion - NO workflow specified
await graphlit.ingestUri('https://example.com/document.pdf');
```

**Graphlit's Default Pipeline:**

| Stage           | Default Behavior                                         | Speed     |
| --------------- | -------------------------------------------------------- | --------- |
| **Preparation** | ✅ **Intelligent preparation** (see below)                | ⚡ Fast    |
| **Extraction**  | ❌ No entity extraction                                   | ⚡ Instant |
| **Enrichment**  | ❌ No external enrichment                                 | ⚡ Instant |
| **Indexing**    | ✅ **Project default embedding** (text-embedding-ada-002) | ⚡ Fast    |

### Default Preparation: Intelligent Per-Format Processing

Graphlit automatically selects the best preparation method based on content type:

**PDFs & Office Documents:**

* **Azure AI Document Intelligence (Layout model)**
* ✅ Extracts text from PDFs, Word docs, PowerPoint
* ✅ OCR for scanned documents
* ✅ Basic table recognition
* ✅ Layout analysis
* ❌ Advanced table parsing
* ❌ Image understanding (diagrams, charts)
* ❌ Complex multi-column layouts

**Audio & Video Files:**

* **Deepgram Nova 2 Transcription**
* ✅ Automatic transcription
* ✅ High accuracy
* ✅ Multiple language support
* ❌ No speaker diarization (unless you add a workflow)

**Web Pages:**

* **Built-in HTML Parser**
* ✅ Smart HTML extraction
* ✅ JavaScript rendering (by default)
* ✅ Markdown conversion

**Email, Text, Markdown:**

* **Built-in Parsers**
* ✅ Native format support
* ✅ Metadata extraction

**When default preparation is sufficient:**

* Simple text-heavy PDFs (80%+ of documents)
* Audio/video transcription without speaker identification
* Most web pages
* Office documents
* Standard email/text content

### Default Indexing: Project Embedding Model

**What it does:**

* ✅ Creates vector embeddings for semantic search
* ✅ Chunks content intelligently
* ✅ Stores in vector database

**Default model:** OpenAI `text-embedding-ada-002` (if not configured otherwise)

***

## When Do You Need a Workflow?

### Decision Matrix

| Goal                                | Need Workflow? | Stage       | Why                             |
| ----------------------------------- | -------------- | ----------- | ------------------------------- |
| **Extract text from simple PDF**    | ❌ No           | -           | Default Azure AI is fine        |
| **Extract text from complex PDF**   | ✅ Yes          | Preparation | Use vision models (GPT-4o)      |
| **Handle images/diagrams in PDF**   | ✅ Yes          | Preparation | Vision models understand images |
| **Transcribe audio/video**          | ❌ No           | -           | Default Deepgram handles it     |
| **Transcribe with speaker labels**  | ✅ Yes          | Preparation | Enable speaker diarization      |
| **Extract entities (people, orgs)** | ✅ Yes          | Extraction  | No extraction by default        |
| **Build knowledge graph**           | ✅ Yes          | Extraction  | Entity extraction required      |
| **Enrich with external data**       | ✅ Yes          | Enrichment  | Add Diffbot, FHIR, etc.         |
| **Use custom embedding model**      | ❌ No           | -           | Set at the project level        |
| **Filter content during ingestion** | ✅ Yes          | Ingestion   | Path/type filtering             |
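
The matrix can also be read as a lookup from goal to stage. The following is a minimal sketch of that idea covering the PDF, entity, enrichment, and ingestion rows. The goal keys and the `stagesForGoals` helper are hypothetical names chosen for illustration and are not part of the Graphlit SDK.

```typescript
// Hypothetical goal keys; each maps to the stage named in the decision matrix.
type WorkflowGoal =
  | 'simple-pdf-text'
  | 'complex-pdf-text'
  | 'pdf-images'
  | 'entities'
  | 'knowledge-graph'
  | 'external-enrichment'
  | 'ingestion-filter';

type MatrixStage = 'ingestion' | 'preparation' | 'extraction' | 'enrichment';

const STAGE_FOR_GOAL: Record<WorkflowGoal, MatrixStage | null> = {
  'simple-pdf-text': null,            // defaults are enough, no workflow needed
  'complex-pdf-text': 'preparation',
  'pdf-images': 'preparation',
  'entities': 'extraction',
  'knowledge-graph': 'extraction',
  'external-enrichment': 'enrichment',
  'ingestion-filter': 'ingestion',
};

// Collects the distinct stages a workflow must configure for the given goals.
function stagesForGoals(goals: WorkflowGoal[]): MatrixStage[] {
  const stages: MatrixStage[] = [];
  for (const goal of goals) {
    const stage = STAGE_FOR_GOAL[goal];
    if (stage !== null && stages.indexOf(stage) < 0) {
      stages.push(stage);
    }
  }
  return stages;
}

console.log(stagesForGoals(['complex-pdf-text', 'knowledge-graph', 'entities']));
// [ 'preparation', 'extraction' ]
```

An empty result means the default pipeline already covers the goals and no workflow is required.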

### Common Scenarios

**Scenario 1: Simple Document Q\&A**

```typescript
// NO WORKFLOW NEEDED ✅
await graphlit.ingestUri(pdfUrl);

// Default preparation + indexing works great
const answer = await graphlit.promptConversation('What are the key points?');
```

**Scenario 2: Complex PDFs with Tables**

```typescript
// WORKFLOW NEEDED ✅
const workflow = await graphlit.createWorkflow({
  name: 'Vision Model Prep',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: { specification: { id: gpt4oSpecId } }
      }
    }]
  }
});

await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, true, { id: workflow.createWorkflow.id });
```
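
Reading `workflow.createWorkflow.id` inline assumes the mutation always returns an id. A small guard makes that assumption explicit. This is a sketch only: `CreateWorkflowResultSketch` is a local stand-in modeled on the property access above (whether the field is nullable in the SDK's generated types is an assumption), and `toWorkflowRef` is a hypothetical helper.

```typescript
// Local stand-ins for SDK shapes, kept minimal so the sketch is self-contained.
interface EntityRefSketch {
  id: string;
}

interface CreateWorkflowResultSketch {
  createWorkflow?: { id: string } | null;
}

// Converts a createWorkflow result into an { id } reference, failing loudly if the id is missing.
function toWorkflowRef(result: CreateWorkflowResultSketch): EntityRefSketch {
  const id = result.createWorkflow?.id;
  if (!id) {
    throw new Error('createWorkflow returned no workflow id');
  }
  return { id };
}

// Example with a mocked response object.
const workflowRef = toWorkflowRef({ createWorkflow: { id: 'wf-123' } });
console.log(workflowRef); // { id: 'wf-123' }
```

The returned reference has the same `{ id }` shape that Scenario 2 passes as the workflow argument of `ingestUri`.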

**Scenario 3: Knowledge Graph from Documents**

```typescript
// WORKFLOW NEEDED ✅
const workflow = await graphlit.createWorkflow({
  name: 'Extract Entities',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: claudeSpecId }
        }
      }
    }]
  }
});
```

***

## Workflow Stages

### Preparation Stage

**Purpose:** Extract text, markdown, and metadata from raw files.

**When you don't need it:** Default Azure AI Document Intelligence handles most documents.

**When you need it:** Complex PDFs, audio transcription with speaker diarization or PII redaction, high-quality markdown extraction.

#### Complete Configuration

```typescript
interface PreparationWorkflowStageInput {
  jobs?: Array<PreparationWorkflowJobInput>;      // Preparation connectors
  summarizations?: Array<SummarizationStrategyInput>; // Auto-summarization
  disableSmartCapture?: boolean;                   // Disable JS rendering for web pages
  enableUnblockedCapture?: boolean;                // Use unblocked.com for Cloudflare bypass (10x cost)
}

interface PreparationWorkflowJobInput {
  connector: FilePreparationConnectorInput;        // Required: Preparation method
}

interface FilePreparationConnectorInput {
  type: FilePreparationServiceTypes;               // Required: Service type
  fileTypes?: Array<FileTypes>;                    // Optional: Which file types to prepare
  
  // Service-specific properties:
  modelDocument?: ModelDocumentPreparationPropertiesInput;   // Vision models
  deepgram?: DeepgramAudioPreparationPropertiesInput;        // Deepgram transcription
  assemblyAi?: AssemblyAiAudioPreparationPropertiesInput;    // Assembly.AI transcription
  document?: DocumentPreparationPropertiesInput;              // Azure AI (explicit)
  mistral?: MistralDocumentPreparationPropertiesInput;       // Mistral OCR
  reducto?: ReductoDocumentPreparationPropertiesInput;       // Reducto
  email?: EmailPreparationPropertiesInput;                    // Email parsing
  page?: PagePreparationPropertiesInput;                      // Web page extraction
}
```

#### FilePreparationServiceTypes (Recommended Order)

| Type                          | Use Case                                              | Speed        | Quality        | When to Use                               |
| ----------------------------- | ----------------------------------------------------- | ------------ | -------------- | ----------------------------------------- |
| `AZURE_DOCUMENT_INTELLIGENCE` | **Default for PDFs** - Most PDFs, Office docs         | ⚡ Fast       | ⭐⭐⭐ Good       | ✅ **Try this first** (automatic default)  |
| `REDUCTO`                     | **Specialized PDF extraction** - Better than default  | ⚡ Fast       | ⭐⭐⭐⭐ Excellent | Try if default isn't good enough          |
| `MISTRAL_DOCUMENT`            | Mistral OCR for documents                             | ⚡⚡ Very Fast | ⭐⭐⭐⭐ Excellent | Alternative to Reducto                    |
| `MODEL_DOCUMENT`              | **Vision LLMs** - General-purpose, not tuned for docs | ⚠️ Slower    | ⭐⭐⭐⭐ Very Good | Advanced: After trying defaults & Reducto |
| `DEEPGRAM`                    | **Default for audio/video** - Transcription           | ⚡ Fast       | ⭐⭐⭐⭐ Excellent | ✅ **Automatic default**                   |
| `ASSEMBLY_AI`                 | **Audio/video** with speaker diarization              | ⚡ Fast       | ⭐⭐⭐⭐ Excellent | Alternative to Deepgram                   |
| `DOCUMENT`                    | Explicit Azure AI configuration                       | ⚡ Fast       | ⭐⭐⭐ Good       | Rarely needed (use default)               |
| `EMAIL`                       | Email message parsing                                 | ⚡ Instant    | ⭐⭐⭐⭐⭐ Perfect  | ✅ **Automatic default**                   |
| `PAGE`                        | Web page extraction                                   | ⚡ Fast       | ⭐⭐⭐⭐ Excellent | ✅ **Automatic default**                   |

#### MODEL\_DOCUMENT: Vision LLMs for Documents

**Understanding the Options**

Document preparation offers a spectrum of tools, each optimized for different needs and cost profiles:

**Azure AI Document Intelligence (Default - $0)**

* Automatic OCR and layout analysis
* Fast, handles most PDFs and Office documents
* Included in your Graphlit subscription

**Reducto / Mistral Document ($)**

* Specialized PDF extraction engines
* Better at complex tables and multi-column layouts
* Higher quality than default, but adds per-page cost

**Vision LLMs (MODEL\_DOCUMENT - $$$)**

* General-purpose vision models (GPT-4o, Claude, Gemini)
* Understand content semantically, not just structurally
* Can interpret diagrams, charts, and visual relationships
* Highest cost (roughly 10x the per-page cost of specialized tools), best flexibility

**When to use vision LLMs:**

* Specialized tools don't capture the visual meaning you need
* Documents require semantic understanding of images/diagrams
* Need custom prompting or model-specific behavior
* Complex visual documents where structure alone isn't enough

**Properties:**

```typescript
interface ModelDocumentPreparationPropertiesInput {
  specification?: EntityReferenceInput;  // Optional: LLM specification (GPT-4o, Claude, etc.)
}
```

**Model Selection:**

```typescript
// GPT-4o - Best balance (recommended)
const gpt4oSpec = await graphlit.createSpecification({
  name: 'GPT-4o Preparation',
  type: SpecificationTypes.Preparation,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAiModels.Gpt4O_128K }
});

// Claude 3.7 Sonnet - Best for complex documents
const claudeSpec = await graphlit.createSpecification({
  name: 'Claude 3.7 Sonnet Preparation',
  type: SpecificationTypes.Preparation,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: { model: AnthropicModels.Claude_3_7Sonnet }
});

// Gemini 2.0 Flash - Fast and cheap
const geminiSpec = await graphlit.createSpecification({
  name: 'Gemini 2.0 Flash Preparation',
  type: SpecificationTypes.Preparation,
  serviceType: ModelServiceTypes.Google,
  google: { model: GoogleModels.Gemini_2_0Flash }
});
```

**Cost vs. Capability:**

* Vision LLMs cost \~10x more per page than specialized tools
* Trade higher cost for semantic understanding and flexibility
* Best for documents where visual meaning matters, not just text extraction
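
To make the trade-off concrete, the arithmetic can be sketched as below. Only the ratios come from the text (default included at no extra cost, vision roughly 10x specialized tools). The per-page price is a hypothetical input, and `estimatePrepCostCents` is an illustrative helper rather than a Graphlit API.

```typescript
type PrepTier = 'default' | 'specialized' | 'vision';

// Relative per-page multipliers. Only the ratios reflect the text above.
const TIER_MULTIPLIER: Record<PrepTier, number> = {
  default: 0,      // included in the subscription
  specialized: 1,  // baseline per-page cost
  vision: 10,      // roughly 10x the specialized tools
};

// Estimates preparation cost in cents for a set of pages.
function estimatePrepCostCents(
  pages: number,
  tier: PrepTier,
  specializedCentsPerPage: number,
): number {
  return pages * TIER_MULTIPLIER[tier] * specializedCentsPerPage;
}

// Hypothetical price: 1 cent per page for a specialized tool.
console.log(estimatePrepCostCents(500, 'default', 1));     // 0
console.log(estimatePrepCostCents(500, 'specialized', 1)); // 500
console.log(estimatePrepCostCents(500, 'vision', 1));      // 5000
```

Working in whole cents avoids floating-point rounding in the estimate.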

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'High-Quality PDF Extraction',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        fileTypes: [FileTypes.Document],  // Documents (PDF, Word, etc.)
        modelDocument: {
          specification: { id: gpt4oSpecId }
        }
      }
    }]
  }
});
```

#### DEEPGRAM: Audio Transcription (Enhanced)

**Default:** Deepgram Nova 2 is used automatically for audio/video files.

**When to add a workflow (enhance default):**

* Enable speaker diarization (identify who's speaking)
* Enable PII redaction
* Use a different Deepgram model
* Configure language settings

**Properties:**

```typescript
interface DeepgramAudioPreparationPropertiesInput {
  key?: string;                         // Optional: Deepgram API key (uses project default if not provided)
  model?: DeepgramModels;               // Optional: Transcription model (default: NOVA_2)
  language?: string;                    // Optional: BCP 47 language code (e.g., 'en', 'en-US')
  detectLanguage?: boolean;             // Optional: Auto-detect language (default: false)
  enableSpeakerDiarization?: boolean;   // Optional: Identify speakers (default: false)
  enableRedaction?: boolean;            // Optional: Redact PII (default: false)
}
```

**Models:**

* `NOVA_2` - Best quality (recommended)
* `NOVA_2_MEDICAL` - Medical terminology
* `NOVA_2_FINANCE` - Financial terminology
* `NOVA_2_CONVERSATIONAL_AI` - Real-time conversations
* `NOVA_2_VOICEMAIL` - Voicemail transcription
* `NOVA_2_VIDEO` - Video content
* `NOVA_2_PHONE_CALL` - Phone calls

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Audio Transcription',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.Deepgram,
        fileTypes: [FileTypes.Audio, FileTypes.Video],
        deepgram: {
          model: DeepgramModels.Nova2,
          enableSpeakerDiarization: true,  // Identify who's speaking
          language: 'en-US'
        }
      }
    }]
  }
});
```

#### ASSEMBLY\_AI: Alternative Audio Transcription

**When to use (alternative to default Deepgram):**

* Prefer Assembly.AI over Deepgram
* Need their specific features
* Already have Assembly.AI account/credits

**Properties:**

```typescript
interface AssemblyAiAudioPreparationPropertiesInput {
  key?: string;                         // Optional: Assembly.AI API key
  model?: AssemblyAiModels;             // Optional: Model (default: BEST)
  language?: string;                    // Optional: BCP 47 language code
  detectLanguage?: boolean;             // Optional: Auto-detect language
  enableSpeakerDiarization?: boolean;   // Optional: Identify speakers
  enableRedaction?: boolean;            // Optional: Redact PII
}
```

**Models:**

* `BEST` - Highest accuracy (default)
* `NANO` - Fastest, lower cost

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Meeting Transcription',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.AssemblyAi,
        fileTypes: [FileTypes.Audio],
        assemblyAi: {
          model: AssemblyAiModels.Best,
          enableSpeakerDiarization: true,
          enableRedaction: true  // Redact sensitive info
        }
      }
    }]
  }
});
```

#### Multi-Job Preparation

**Use case:** Different file types need different preparation methods.

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Multi-Format Processing',
  preparation: {
    jobs: [
      {
        // Job 1: PDFs with vision model
        connector: {
          type: FilePreparationServiceTypes.ModelDocument,
          fileTypes: [FileTypes.Document],
          modelDocument: { specification: { id: gpt4oSpecId } }
        }
      },
      {
        // Job 2: Audio with Deepgram
        connector: {
          type: FilePreparationServiceTypes.Deepgram,
          fileTypes: [FileTypes.Audio, FileTypes.Video],
          deepgram: {
            model: DeepgramModels.Nova2,
            enableSpeakerDiarization: true
          }
        }
      },
      {
        // Job 3: Images with Azure AI Document Intelligence
        connector: {
          type: FilePreparationServiceTypes.AzureDocumentIntelligence,
          fileTypes: [FileTypes.Image]
        }
      }
    ]
  }
});
```
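
This page does not say how overlapping `fileTypes` across jobs are resolved, so it may be worth checking a job list locally before creating the workflow. The sketch below uses a simplified stand-in for the job shape (`PrepJobSketch`, with file types as plain strings) and a hypothetical `overlappingFileTypes` helper.

```typescript
// Minimal local stand-in for a preparation job; file types are plain strings here.
interface PrepJobSketch {
  connector: { type: string; fileTypes?: string[] };
}

// Returns file types claimed by more than one job, which usually signals a config mistake.
function overlappingFileTypes(jobs: PrepJobSketch[]): string[] {
  const seen: string[] = [];
  const overlaps: string[] = [];
  for (const job of jobs) {
    for (const fileType of job.connector.fileTypes ?? []) {
      if (seen.indexOf(fileType) >= 0) {
        if (overlaps.indexOf(fileType) < 0) overlaps.push(fileType);
      } else {
        seen.push(fileType);
      }
    }
  }
  return overlaps;
}

const duplicateTypes = overlappingFileTypes([
  { connector: { type: 'MODEL_DOCUMENT', fileTypes: ['DOCUMENT'] } },
  { connector: { type: 'DEEPGRAM', fileTypes: ['AUDIO', 'VIDEO'] } },
  { connector: { type: 'ASSEMBLY_AI', fileTypes: ['AUDIO'] } },
]);
console.log(duplicateTypes); // [ 'AUDIO' ]
```

An empty array means each file type is routed to exactly one job.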

#### Web Page Capture Options

**Properties (on PreparationWorkflowStageInput):**

```typescript
disableSmartCapture?: boolean;      // Disable JS rendering (faster, cheaper, but may miss content)
enableUnblockedCapture?: boolean;   // Use unblocked.com to bypass Cloudflare (10x cost!)
```

**Default:** `disableSmartCapture: false`, `enableUnblockedCapture: false`

**When to adjust:**

```typescript
// Static HTML sites (faster, cheaper)
const staticWorkflow = await graphlit.createWorkflow({
  name: 'Static HTML Crawl',
  preparation: {
    disableSmartCapture: true  // Skip JS rendering
  }
});

// Sites with Cloudflare protection
const unblockedWorkflow = await graphlit.createWorkflow({
  name: 'Cloudflare Bypass',
  preparation: {
    enableUnblockedCapture: true  // 10x cost, but bypasses protection
  }
});
```
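
If the choice depends on what you know about each site, the two flags can be derived rather than hard-coded. A minimal sketch, assuming you track per site whether it needs JavaScript rendering and whether it sits behind Cloudflare. `SiteProfile` and `captureOptionsFor` are hypothetical names, not SDK types.

```typescript
// The two capture flags from PreparationWorkflowStageInput, redeclared locally.
interface CaptureOptionsSketch {
  disableSmartCapture?: boolean;
  enableUnblockedCapture?: boolean;
}

// Hypothetical description of a crawl target.
interface SiteProfile {
  rendersWithJavaScript: boolean;
  behindCloudflare: boolean;
}

// Sets only the flags that differ from the documented defaults (both false).
function captureOptionsFor(site: SiteProfile): CaptureOptionsSketch {
  const options: CaptureOptionsSketch = {};
  if (!site.rendersWithJavaScript) options.disableSmartCapture = true;
  if (site.behindCloudflare) options.enableUnblockedCapture = true;
  return options;
}

console.log(captureOptionsFor({ rendersWithJavaScript: false, behindCloudflare: false }));
// { disableSmartCapture: true }
```

The returned object can be spread into the `preparation` stage of a workflow input.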

#### Auto-Summarization During Preparation

**Properties:**

```typescript
summarizations?: Array<SummarizationStrategyInput>;

interface SummarizationStrategyInput {
  type: SummarizationTypes;          // Required: Summarization type
  specification?: EntityReferenceInput; // Optional: LLM for summarization
  tokens?: number;                    // Optional: Max summary length
  items?: number;                     // Optional: Number of items to summarize
}
```

**SummarizationTypes:**

* `Chapters` - Transcript chapters
* `Headlines` - Extract headlines
* `Questions` - Generate questions
* `Posts` - Social media posts

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Prep with Auto-Summary',
  preparation: {
    jobs: [{ /* preparation connector */ }],
    summarizations: [{
      type: SummarizationTypes.Chapters,
      specification: { id: gpt4oSpecId },
      tokens: 500  // Max 500 tokens per chapter summary
    }]
  }
});
```

***

### Extraction Stage

**Purpose:** Identify and extract entities (people, organizations, places, topics) to build a knowledge graph.

**Default:** ❌ **No extraction** - entities are NOT extracted unless you add this stage.

**When you need it:**

* Building knowledge graph
* Entity-based search/filtering
* Relationship discovery
* Semantic understanding beyond keywords

#### Complete Configuration

```typescript
interface ExtractionWorkflowStageInput {
  jobs?: Array<ExtractionWorkflowJobInput>;  // Extraction connectors
}

interface ExtractionWorkflowJobInput {
  connector: EntityExtractionConnectorInput;  // Required: Extraction method
}

interface EntityExtractionConnectorInput {
  type: EntityExtractionServiceTypes;         // Required: Service type
  extractedTypes?: Array<ObservableTypes>;    // Optional: Which entity types to extract
  customTypes?: Array<string>;                // Optional: Custom entity types
  extractedCount?: number;                    // Optional: Max entities per type (default: 100)
  contentTypes?: Array<ContentTypes>;         // Optional: Which content types to extract from
  fileTypes?: Array<FileTypes>;               // Optional: Which file types to extract from
  
  // Service-specific properties:
  modelText?: ModelTextExtractionPropertiesInput;  // LLM extraction (recommended)
  modelImage?: ModelImageExtractionPropertiesInput; // Image entity extraction
}
```

#### EntityExtractionServiceTypes

| Type                             | Use Case                               | Quality         |
| -------------------------------- | -------------------------------------- | --------------- |
| `MODEL_TEXT`                     | **Recommended** - LLM-based extraction | ⭐⭐⭐⭐⭐ Excellent |
| `AZURE_COGNITIVE_SERVICES_TEXT`  | Azure Text Analytics                   | ⭐⭐⭐ Good        |
| `MODEL_IMAGE`                    | Extract from images                    | ⭐⭐⭐⭐ Excellent  |
| `AZURE_COGNITIVE_SERVICES_IMAGE` | Azure Vision                           | ⭐⭐⭐ Good        |

#### MODEL\_TEXT: LLM Entity Extraction (Recommended)

**Properties:**

```typescript
interface ModelTextExtractionPropertiesInput {
  specification?: EntityReferenceInput;  // Optional: LLM specification
  tokenThreshold?: number;               // Optional: Min tokens to process (skip short sections)
}
```

**Model Selection:**

```typescript
// Claude 3.7 Sonnet - Best accuracy (recommended)
const claudeSpec = await graphlit.createSpecification({
  name: 'Claude 3.7 Sonnet Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: { model: AnthropicModels.Claude_3_7Sonnet }
});

// GPT-4o - Good balance
const gpt4oSpec = await graphlit.createSpecification({
  name: 'GPT-4o Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: { model: OpenAiModels.Gpt4O_128K }
});

// Claude 3.5 Haiku - Fast and cheap
const haikuSpec = await graphlit.createSpecification({
  name: 'Claude 3.5 Haiku Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: { model: AnthropicModels.Claude_3_5Haiku }
});
```

#### Observable Types (Standard Entities)

**Commonly used built-in entity types (see the SDK for the complete list):**

```typescript
enum ObservableTypes {
  // People & Organizations
  PERSON,                 // People, individuals, names
  ORGANIZATION,           // Companies, institutions, groups

  // Places
  PLACE,                  // Locations, addresses, landmarks

  // Products & Services
  PRODUCT,                // Products, services, brands

  // Events & Time
  EVENT,                  // Events, meetings, occurrences

  // Creative Works
  CREATIVE_WORK,          // Books, articles, movies, music

  // Concepts & Topics
  TOPIC,                  // Abstract concepts, themes, subjects

  // Legal & Financial
  REGULATION,             // Laws, regulations, policies

  // Medical (when using FHIR)
  MEDICAL_CONDITION,
  MEDICAL_PROCEDURE,
  MEDICAL_DRUG,
  MEDICAL_TEST,

  // And more... (check SDK for complete list)
}
```

#### Custom Entity Types

**Use case:** Domain-specific entities not covered by standard types.

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Legal Document Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: claudeSpecId }
        },
        // Extract standard entities
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ],
        // AND custom legal entities
        customTypes: [
          'Contract',
          'Clause',
          'Obligation',
          'Deadline',
          'Payment Term',
          'Jurisdiction',
          'Termination Condition',
          'Liability Limit'
        ]
      }
    }]
  }
});
```
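
Custom type names are free-form strings, so lists assembled from user input or config files can pick up stray whitespace and near-duplicates. The sketch below cleans a list before it is passed as `customTypes`. `normalizeCustomTypes` is a hypothetical helper, and treating names that differ only by case as duplicates is an assumption about what you want, not documented Graphlit behavior.

```typescript
// Trims names, drops empties, and removes case-insensitive duplicates (first spelling wins).
function normalizeCustomTypes(rawTypes: string[]): string[] {
  const result: string[] = [];
  const seenKeys: string[] = [];
  for (const raw of rawTypes) {
    const trimmed = raw.trim();
    const key = trimmed.toLowerCase();
    if (trimmed.length > 0 && seenKeys.indexOf(key) < 0) {
      seenKeys.push(key);
      result.push(trimmed);
    }
  }
  return result;
}

console.log(normalizeCustomTypes(['Contract', ' contract ', '', 'Payment Term', 'Clause']));
// [ 'Contract', 'Payment Term', 'Clause' ]
```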

#### Complete Example: Knowledge Graph Extraction

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Full Knowledge Graph',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: claudeSpecId },
          tokenThreshold: 100  // Skip sections < 100 tokens
        },
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Place,
          ObservableTypes.Product,
          ObservableTypes.Event,
          ObservableTypes.Label
        ],
        extractedCount: 200,  // Max 200 entities per type
        contentTypes: [ContentTypes.File, ContentTypes.Page],
        fileTypes: [FileTypes.Document]
      }
    }]
  }
});
```

***

### Enrichment Stage

**Purpose:** Add external data to entities (links, FHIR medical data, Diffbot enrichment).

**Default:** ❌ **No enrichment** - external data is NOT added unless you configure this.

**When you need it:**

* Medical applications (FHIR integration)
* Web entity enrichment (Diffbot)
* Link extraction from content

#### Complete Configuration

```typescript
interface EnrichmentWorkflowStageInput {
  jobs?: Array<EnrichmentWorkflowJobInput>;  // Enrichment connectors
  link?: LinkStrategyInput;                   // Link extraction configuration
}

interface EnrichmentWorkflowJobInput {
  connector: EntityEnrichmentConnectorInput;  // Required: Enrichment method
}

interface EntityEnrichmentConnectorInput {
  type: EntityEnrichmentServiceTypes;         // Required: Service type
  enrichedTypes?: Array<ObservableTypes>;     // Optional: Which entity types to enrich
  
  // Service-specific properties:
  diffbot?: DiffbotEnrichmentPropertiesInput;  // Diffbot Knowledge Graph
  fhir?: FhirEnrichmentPropertiesInput;        // FHIR medical data
}
```

#### EntityEnrichmentServiceTypes

| Type      | Use Case                  | External API           |
| --------- | ------------------------- | ---------------------- |
| `DIFFBOT` | Web entity enrichment     | Diffbot API required   |
| `FHIR`    | Medical entity enrichment | FHIR endpoint required |

#### Link Extraction

**Properties:**

```typescript
interface LinkStrategyInput {
  allowedDomains?: Array<string>;     // Optional: Only extract links from these domains
  excludedDomains?: Array<string>;    // Optional: Don't extract links from these domains
  allowContentDomain?: boolean;       // Optional: Allow links from same domain as content
  extractUri?: boolean;               // Optional: Extract HTTP URLs from text (default: true)
  extractEmail?: boolean;             // Optional: Extract email addresses (default: true)
  extractPhoneNumber?: boolean;       // Optional: Extract phone numbers (default: false)
}
```

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Link Extraction',
  enrichment: {
    link: {
      extractUri: true,
      extractEmail: true,
      extractPhoneNumber: true,
      allowedDomains: ['example.com', 'partner.com'],  // Only these domains
      excludedDomains: ['spam.com', 'ads.com']         // Exclude these
    }
  }
});
```

#### FHIR Medical Enrichment

**Use case:** Enrich medical entities with FHIR (Fast Healthcare Interoperability Resources) data.

**Properties:**

```typescript
interface FhirEnrichmentPropertiesInput {
  endpoint: URL;  // Required: FHIR API endpoint
}
```

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Medical Content with FHIR',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: { specification: { id: claudeSpecId } },
        extractedTypes: [
          ObservableTypes.MedicalCondition,
          ObservableTypes.MedicalDrug,
          ObservableTypes.MedicalProcedure
        ]
      }
    }]
  },
  enrichment: {
    jobs: [{
      connector: {
        type: EntityEnrichmentServiceTypes.Fhir,
        fhir: {
          endpoint: 'https://fhir.example.org/api'
        },
        enrichedTypes: [
          ObservableTypes.MedicalCondition,
          ObservableTypes.MedicalDrug
        ]
      }
    }]
  }
});
```

#### Diffbot Enrichment

**Use case:** Enrich entities with data from Diffbot Knowledge Graph.

**Properties:**

```typescript
interface DiffbotEnrichmentPropertiesInput {
  token: string;  // Required: Diffbot API token
}
```

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Entity Enrichment with Diffbot',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: { specification: { id: claudeSpecId } },
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Product
        ]
      }
    }]
  },
  enrichment: {
    jobs: [{
      connector: {
        type: EntityEnrichmentServiceTypes.Diffbot,
        diffbot: {
          token: process.env.DIFFBOT_TOKEN!
        },
        enrichedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization
        ]
      }
    }]
  }
});
```
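
In both enrichment examples, every entry in `enrichedTypes` also appears in `extractedTypes`. Assuming enrichment only acts on entities that extraction produced (the examples suggest this, but this page does not state it outright), a quick local check can flag types that would never be enriched. `enrichedButNotExtracted` is a hypothetical helper and the type names are plain strings for illustration.

```typescript
// Returns enriched types that are missing from the extraction job's type list.
function enrichedButNotExtracted(
  extractedTypes: string[],
  enrichedTypes: string[],
): string[] {
  return enrichedTypes.filter((entityType) => extractedTypes.indexOf(entityType) < 0);
}

// Mirrors the Diffbot example: every enriched type is also extracted.
console.log(enrichedButNotExtracted(
  ['PERSON', 'ORGANIZATION', 'PRODUCT'],
  ['PERSON', 'ORGANIZATION'],
)); // []

// A mismatch: PLACE is enriched but never extracted.
console.log(enrichedButNotExtracted(
  ['PERSON', 'ORGANIZATION'],
  ['PERSON', 'PLACE'],
)); // [ 'PLACE' ]
```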

***

### Indexing Stage

**Purpose:** Apply optional indexing connectors (e.g., additional language indexing).

**Default:** ✅ **Automatic indexing** is always applied.

**Embedding models:** Configure embeddings at the project level (see [Specifications](/platform/specifications.md)).

#### Complete Configuration

```typescript
interface IndexingWorkflowStageInput {
  jobs?: Array<IndexingWorkflowJobInput>;  // Indexing connectors
}

interface IndexingWorkflowJobInput {
  connector: ContentIndexingConnectorInput;  // Required: Indexing method
}

interface ContentIndexingConnectorInput {
  type?: ContentIndexingServiceTypes;  // Optional: Indexing connector
  contentType?: ContentTypes;
  fileType?: FileTypes;
}
```

**Most users don't need this stage** - the default indexing works well. Use it only when you need a specialized indexing connector, such as the Azure AI Language connector in the example that follows. Embedding models are not set here; they are configured at the project level.

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Azure AI Language Indexing',
  indexing: {
    jobs: [{
      connector: {
        type: ContentIndexingServiceTypes.AzureAiLanguage,
        contentType: ContentTypes.Page
      }
    }]
  }
});
```

***

### Ingestion Stage

**Purpose:** Filter which content gets ingested based on file type, path, or URL patterns.

**Default:** ✅ **All content ingested** - no filtering.

**When you need it:**

* Web crawling (filter by URL path)
* Selective file type ingestion
* Exclude certain paths

#### Complete Configuration

```typescript
interface IngestionWorkflowStageInput {
  filter?: IngestionContentFilterInput;  // Optional: Content filter
}

interface IngestionContentFilterInput {
  types?: Array<ContentTypes>;               // Optional: Only ingest these content types
  fileTypes?: Array<FileTypes>;              // Optional: Only ingest these file types
  fileExtensions?: Array<string>;            // Optional: Only ingest these extensions
  allowedPaths?: Array<string>;              // Optional: Regex patterns for allowed paths
  excludedPaths?: Array<string>;             // Optional: Regex patterns for excluded paths
}
```

**Example: Web Crawling with Path Filters**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Blog Posts Only',
  ingestion: {
    filter: {
      allowedPaths: ['^/blog/.*', '^/articles/.*'],  // Only these paths
      excludedPaths: ['^/admin/.*', '^/internal/.*'], // Exclude these
      types: [ContentTypes.Page]  // Only web pages
    }
  }
});
```

**Example: Selective File Types**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Documents Only',
  ingestion: {
    filter: {
      fileTypes: [FileTypes.Document],
      fileExtensions: ['pdf', 'docx', 'txt']
    }
  }
});
```

***

### Storage Stage

**Purpose:** Configure where and how content is stored.

**Default:** ✅ **Managed storage** - Graphlit handles all storage.

**When you need it:** Bring your own storage (Azure Blob, S3, Google Cloud Storage).

#### Complete Configuration

```typescript
interface StorageWorkflowStageInput {
  jobs?: Array<StorageWorkflowJobInput>;  // Storage connectors
}

interface StorageWorkflowJobInput {
  connector: FileStorageConnectorInput;  // Required: Storage method
}

interface FileStorageConnectorInput {
  type: FileStorageServiceTypes;       // Required: Service type
  
  // Service-specific properties:
  azureBlob?: AzureBlobStoragePropertiesInput;
  s3?: S3StoragePropertiesInput;
  google?: GoogleStoragePropertiesInput;
}
```

**Most users don't need this** - Graphlit's managed storage is recommended.

***

### Classification Stage

**Purpose:** Classify content into categories using LLMs.

**Default:** ❌ **No classification** - content is not categorized unless you add this.

**When you need it:**

* Content routing/filtering
* Auto-categorization
* Custom taxonomy

#### Complete Configuration

```typescript
interface ClassificationWorkflowStageInput {
  connector?: ContentClassificationConnectorInput;  // Optional: Classification method
}

interface ContentClassificationConnectorInput {
  type: ContentClassificationServiceTypes;  // Required: Service type
  
  // Service-specific properties:
  model?: ModelContentClassificationPropertiesInput;
}

interface ModelContentClassificationPropertiesInput {
  specification?: EntityReferenceInput;     // Optional: LLM for classification
  rules?: Array<PromptClassificationRuleInput>;  // Optional: Classification rules (at least one is needed for labels to be assigned)
}

interface PromptClassificationRuleInput {
  label: string;                            // Required: Category label
  prompt: string;                           // Required: Classification criteria
}
```

**Example:**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Content Classification',
  classification: {
    connector: {
      type: ContentClassificationServiceTypes.Model,
      model: {
        specification: { id: gpt4oSpecId },
        rules: [
          {
            label: 'Technical',
            prompt: 'Content contains technical documentation, API references, or code examples'
          },
          {
            label: 'Business',
            prompt: 'Content focuses on business strategy, marketing, or sales'
          },
          {
            label: 'Support',
            prompt: 'Content is customer support documentation or FAQs'
          }
        ]
      }
    }
  }
});
```
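
When the taxonomy lives in configuration rather than in code, the `rules` array can be generated from it. The following is a minimal sketch: `buildRules` is a hypothetical helper, and `ClassificationRule` is a local type that mirrors the `label`/`prompt` shape shown above rather than an SDK import.

```typescript
// Local type mirroring the label/prompt shape documented above (not an SDK import).
interface ClassificationRule {
  label: string;
  prompt: string;
}

// Hypothetical helper: turn a label -> criteria map into a rules array,
// rejecting blank entries and labels that differ only by case or whitespace.
function buildRules(taxonomy: Record<string, string>): ClassificationRule[] {
  const seen = new Set<string>();
  const rules: ClassificationRule[] = [];

  for (const [rawLabel, rawPrompt] of Object.entries(taxonomy)) {
    const label = rawLabel.trim();
    const prompt = rawPrompt.trim();
    if (label.length === 0 || prompt.length === 0) {
      throw new Error(`Rule "${rawLabel}" needs a non-empty label and prompt`);
    }
    const key = label.toLowerCase();
    if (seen.has(key)) {
      throw new Error(`Duplicate label: ${label}`);
    }
    seen.add(key);
    rules.push({ label, prompt });
  }
  return rules;
}

const rules = buildRules({
  Technical: 'Content contains technical documentation, API references, or code examples',
  Business: 'Content focuses on business strategy, marketing, or sales',
  Support: 'Content is customer support documentation or FAQs',
});

console.log(rules.map((rule) => rule.label)); // [ 'Technical', 'Business', 'Support' ]
```

The resulting array can be passed as `model.rules` in the workflow above. Failing fast on duplicate labels is a design choice of this sketch, not documented Graphlit behavior.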

***

### Workflow Actions

**Purpose:** Execute actions after content processing (webhooks, integrations).

**Default:** ❌ **No actions** - nothing happens after processing unless configured.

**When you need it:**

* Webhook notifications
* Integration triggers
* Post-processing automation

#### Complete Configuration

```typescript
interface WorkflowActionInput {
  connector: IntegrationConnectorInput;  // Required: Integration connector
}

interface IntegrationConnectorInput {
  type: IntegrationServiceTypes;        // Required: Service type
  uri?: string;                          // Optional: Webhook URL (passed as a string)
  
  // Service-specific properties:
  slack?: SlackIntegrationPropertiesInput;
  teams?: MicrosoftTeamsIntegrationPropertiesInput;
  email?: EmailIntegrationPropertiesInput;
}
```

**Example: Webhook on Completion**

```typescript
const workflow = await graphlit.createWorkflow({
  name: 'Workflow with Webhook',
  preparation: { /* ... */ },
  actions: [{
    connector: {
      type: IntegrationServiceTypes.WebHook,
      uri: 'https://api.example.com/webhook/content-processed'
    }
  }]
});
```
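
On the receiving side, the endpoint has to accept the request and parse its body. This page does not document the payload Graphlit sends, so the sketch below treats it as opaque JSON and only decides which HTTP status to return. `handleWebhookBody` and `WebhookResult` are hypothetical names introduced for illustration.

```typescript
// Hypothetical receiver-side helper. The payload schema is not documented on
// this page, so the body is treated as opaque JSON and only checked for validity.
interface WebhookResult {
  status: number;     // HTTP status the endpoint should return
  payload?: unknown;  // Parsed body, present only when the JSON was valid
}

function handleWebhookBody(rawBody: string): WebhookResult {
  if (rawBody.trim().length === 0) {
    return { status: 400 };
  }
  try {
    return { status: 200, payload: JSON.parse(rawBody) };
  } catch (_err) {
    // Malformed JSON: reject the request instead of throwing inside the server.
    return { status: 400 };
  }
}

console.log(handleWebhookBody('{"ok":true}').status); // 200
console.log(handleWebhookBody('not json').status);    // 400
```

Keeping the parsing in a pure function makes it easy to unit test and to plug into whichever HTTP framework hosts the endpoint.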

***

## Complete API Reference

### WorkflowInput (Top-Level)

```typescript
interface WorkflowInput {
  name: string;                                          // Required
  preparation?: PreparationWorkflowStageInput;           // Optional
  extraction?: ExtractionWorkflowStageInput;             // Optional
  enrichment?: EnrichmentWorkflowStageInput;             // Optional
  indexing?: IndexingWorkflowStageInput;                 // Optional
  ingestion?: IngestionWorkflowStageInput;               // Optional
  storage?: StorageWorkflowStageInput;                   // Optional
  classification?: ClassificationWorkflowStageInput;     // Optional
  actions?: Array<WorkflowActionInput>;                  // Optional
}
```

**All fields except `name` are optional.** Graphlit provides intelligent defaults for missing stages.

***

## Production Patterns

### Pattern 1: Multi-Tenant Workflows

```typescript
// Different workflows for different customer tiers
const workflows = {
  free: await graphlit.createWorkflow({
    name: 'Free Tier',
    // Default preparation (Azure AI)
    // No extraction
  }),
  
  pro: await graphlit.createWorkflow({
    name: 'Pro Tier',
    preparation: {
      jobs: [{
        connector: {
          type: FilePreparationServiceTypes.ModelDocument,
          modelDocument: { specification: { id: geminiFlashSpecId } }  // Cheaper vision model
        }
      }]
    }
  }),
  
  enterprise: await graphlit.createWorkflow({
    name: 'Enterprise Tier',
    preparation: {
      jobs: [{
        connector: {
          type: FilePreparationServiceTypes.ModelDocument,
          modelDocument: { specification: { id: gpt4oSpecId } }  // Best quality
        }
      }]
    },
    extraction: {
      jobs: [{
        connector: {
          type: EntityExtractionServiceTypes.ModelText,
          modelText: { specification: { id: claudeSpecId } }
        }
      }]
    }
  })
};

// Use workflow based on user tier
const workflowId = user.tier === 'enterprise' 
  ? workflows.enterprise.createWorkflow.id
  : user.tier === 'pro'
  ? workflows.pro.createWorkflow.id
  : workflows.free.createWorkflow.id;

await graphlit.ingestUri(uri, undefined, undefined, undefined, true, { id: workflowId });
```
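
As more tiers are added, the nested ternary above becomes harder to read. One alternative is a lookup table with an explicit fallback, sketched here under the assumption that workflow IDs have already been collected into a plain object. `Tier`, `selectWorkflowId`, and the placeholder IDs are hypothetical.

```typescript
type Tier = 'free' | 'pro' | 'enterprise';

// Type guard so arbitrary strings (e.g. from a database column) are checked
// before being used as a lookup key.
const isTier = (value: string): value is Tier =>
  value === 'free' || value === 'pro' || value === 'enterprise';

// Hypothetical helper: unknown tiers fall back to the free workflow,
// matching the behavior of the nested ternary above.
function selectWorkflowId(tier: string, idsByTier: Record<Tier, string>): string {
  return isTier(tier) ? idsByTier[tier] : idsByTier.free;
}

// Placeholder IDs; in practice these come from the createWorkflow responses.
const workflowIds: Record<Tier, string> = {
  free: 'wf-free',
  pro: 'wf-pro',
  enterprise: 'wf-enterprise',
};

console.log(selectWorkflowId('pro', workflowIds));   // wf-pro
console.log(selectWorkflowId('trial', workflowIds)); // wf-free
```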

### Pattern 2: Conditional Workflows (File Type Based)

```typescript
// Different workflows for different file types
async function ingestWithConditionalWorkflow(uri: string, fileType: FileTypes) {
  let workflowId: string | undefined;
  
  if (fileType === FileTypes.Document) {
    // Documents (PDF/Word) get vision model
    workflowId = pdfVisionWorkflowId;
  } else if (fileType === FileTypes.Audio || fileType === FileTypes.Video) {
    // Audio/video gets Deepgram
    workflowId = audioTranscriptionWorkflowId;
  } else {
    // Everything else uses default
    workflowId = undefined;
  }
  
  await graphlit.ingestUri(uri, undefined, undefined, undefined, true, workflowId ? { id: workflowId } : undefined);
}
```
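
The function above expects the caller to already know the file type. When only a URI is available, a coarse category can be guessed from the file extension before choosing a workflow. This is a sketch under stated assumptions: `FileCategory` is a local stand-in for `FileTypes`, `categorizeUri` is a hypothetical helper, and the extension list is illustrative and incomplete.

```typescript
// Local stand-in for the SDK's FileTypes enum (assumption for illustration).
type FileCategory = 'document' | 'audio' | 'video' | 'other';

// Illustrative, incomplete extension list.
const CATEGORY_BY_EXTENSION = new Map<string, FileCategory>([
  ['pdf', 'document'],
  ['docx', 'document'],
  ['mp3', 'audio'],
  ['wav', 'audio'],
  ['mp4', 'video'],
  ['mov', 'video'],
]);

// Hypothetical helper: guess a category from the URI's file extension.
function categorizeUri(uri: string): FileCategory {
  // Drop the query string and fragment so they don't pollute the extension.
  const path = uri.split(/[?#]/)[0];
  const fileName = path.slice(path.lastIndexOf('/') + 1);
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0) {
    return 'other'; // No extension, or a dotfile such as ".env"
  }
  const extension = fileName.slice(dot + 1).toLowerCase();
  return CATEGORY_BY_EXTENSION.get(extension) ?? 'other';
}

console.log(categorizeUri('https://example.com/files/report.PDF?download=1')); // document
console.log(categorizeUri('https://example.com/talks/keynote.mp3'));           // audio
console.log(categorizeUri('https://example.com/about'));                       // other
```

Extension-based guessing is only a heuristic; URIs without an extension fall through to `other`, which corresponds to the default (no workflow) branch above.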

### Pattern 3: Zine Production Pattern

**What Zine uses:**

```typescript
// Single workflow for all content
const workflow = await graphlit.createWorkflow({
  name: 'Zine Production Workflow',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        modelDocument: { specification: { id: gpt4oSpecId } }
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: { specification: { id: claudeSpecId } },
        extractedTypes: [
          ObservableTypes.Person,
          ObservableTypes.Organization,
          ObservableTypes.Label
        ]
      }
    }]
  },
  enrichment: {
    link: {
      extractUri: true,
      extractEmail: true
    }
  }
});

// Applied to all feeds
await graphlit.createFeed({
  name: 'Slack Feed',
  type: FeedTypes.Slack,
  workflow: { id: workflow.createWorkflow.id },
  // ...
});
```

**Key lessons from Zine:**

* Single workflow for simplicity
* Vision model preparation (complex Slack attachments)
* Entity extraction for knowledge graph
* Link extraction for context

### Pattern 4: Cost Optimization

```typescript
// Use cheapest options that meet quality requirements
const costOptimizedWorkflow = await graphlit.createWorkflow({
  name: 'Cost Optimized',
  preparation: {
    jobs: [{
      connector: {
        type: FilePreparationServiceTypes.ModelDocument,
        fileTypes: [FileTypes.Document],  // Documents (PDF, etc.)
        modelDocument: { specification: { id: geminiFlashSpecId } }  // Cheapest vision model
      }
    }],
    // Default Azure AI handles everything else (much cheaper)
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        fileTypes: [FileTypes.Document],  // Only extract from documents
        modelText: { 
          specification: { id: haikuSpecId },  // Cheapest LLM
          tokenThreshold: 500  // Skip short sections (saves money)
        },
        extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization],  // Only essential types
        extractedCount: 50  // Limit entities per type
      }
    }]
  }
});
```

***

## Summary

**Key Takeaways:**

1. **Workflows are usually optional** - Graphlit applies intelligent defaults when no workflow is specified
2. **Default preparation is Azure AI Document Intelligence** - works for 80%+ of documents
3. **No extraction by default** - add extraction stage to build knowledge graph
4. **Vision models (MODEL\_DOCUMENT) cost roughly 10x more than the default preparation** - use them only for complex PDFs
5. **Audio requires an explicit workflow** - use Deepgram or AssemblyAI
6. **All stages are composable** - mix and match as needed

**When in doubt:** Start without a workflow, add stages only when you hit limitations.

***

**Related Documentation:**

* [Specifications →](/platform/specifications.md) - Configure LLMs and embedding models
* [Key Concepts →](/platform/key-concepts.md) - High-level overview
* [API Guides: Workflows →](/api-guides/use-cases/workflows.md) - Code examples


