Workflows

Complete reference for Graphlit workflows - memory formation pipeline configuration

Workflows define how content is processed as it flows through Graphlit's memory formation pipeline. This is the authoritative reference for all workflow configuration options, defaults, and decision guidance.


Overview & Core Concepts

What Workflows Do

Workflows control the memory formation pipeline - how raw content transforms into structured, searchable semantic memory:

The Six Pipeline Stages (in execution order):

  1. Ingestion - Filter which content to accept (file types, paths)

  2. Indexing - Configure embedding model and vector storage

  3. Preparation - Extract text/markdown from files (PDFs, audio, images)

  4. Extraction - Identify entities (people, organizations, topics)

  5. Enrichment - Add external data (links, FHIR, Diffbot)

  6. Classification - Categorize content using LLMs

Additional Configuration:

  • Storage - Where files are stored (defaults to managed storage)

  • Actions - Post-processing webhooks and integrations

The Workflow Object

Key insight: All stages are optional. Graphlit has intelligent defaults.
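For orientation, here is a minimal sketch of creating a workflow with the graphlit-client TypeScript SDK. Only name is supplied, so every stage falls back to the defaults described below; the commented-out fields show where each optional stage would go. The SDK method and response shape are assumptions based on the SDK's standard create operations, so verify them against the current reference.

```typescript
import { Graphlit } from "graphlit-client";

// Credentials are taken from the GRAPHLIT_* environment variables (or passed to the constructor).
const client = new Graphlit();

// A name-only workflow behaves exactly like having no workflow at all.
const response = await client.createWorkflow({
  name: "Defaults Only",
  // preparation: { ... },    // override document/audio preparation
  // extraction: { ... },     // enable entity extraction
  // enrichment: { ... },     // add external enrichment
  // ingestion: { ... },      // filter which content is accepted
  // indexing: { ... },       // override the embedding model
  // classification: { ... }, // categorize content with LLMs
  // storage: { ... },        // bring your own storage
  // actions: [],             // webhooks and integrations
});

console.log(response.createWorkflow?.id);
```

Reference the returned workflow id when ingesting content or creating a feed to apply the workflow.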


Default Behavior (No Workflow)

What Happens Without a Workflow

Graphlit's Default Pipeline:

| Stage | Default Behavior | Speed |
| --- | --- | --- |
| Preparation | Intelligent preparation (see below) | ⚡ Fast |
| Extraction | ❌ No entity extraction | ⚡ Instant |
| Enrichment | ❌ No external enrichment | ⚡ Instant |
| Indexing | Project default embedding (text-embedding-ada-002) | ⚡ Fast |

Default Preparation: Intelligent Per-Format Processing

Graphlit automatically selects the best preparation method based on content type:

PDFs & Office Documents:

  • Azure AI Document Intelligence (Layout model)

  • ✅ Extracts text from PDFs, Word docs, PowerPoint

  • ✅ OCR for scanned documents

  • ✅ Basic table recognition

  • ✅ Layout analysis

  • ❌ Advanced table parsing

  • ❌ Image understanding (diagrams, charts)

  • ❌ Complex multi-column layouts

Audio & Video Files:

  • Deepgram Nova 2 Transcription

  • ✅ Automatic transcription

  • ✅ High accuracy

  • ✅ Multiple language support

  • ❌ No speaker diarization (unless you add a workflow)

Web Pages:

  • Built-in HTML Parser

  • ✅ Smart HTML extraction

  • ✅ JavaScript rendering (by default)

  • ✅ Markdown conversion

Email, Text, Markdown:

  • Built-in Parsers

  • ✅ Native format support

  • ✅ Metadata extraction

When default preparation is sufficient:

  • Simple text-heavy PDFs (80%+ of documents)

  • Audio/video transcription without speaker identification

  • Most web pages

  • Office documents

  • Standard email/text content

Default Indexing: Project Embedding Model

What it does:

  • ✅ Creates vector embeddings for semantic search

  • ✅ Chunks content intelligently

  • ✅ Stores in vector database

Default model: OpenAI text-embedding-ada-002 (if not configured otherwise)
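To rely entirely on these defaults, ingest content without referencing any workflow. A minimal sketch using the graphlit-client TypeScript SDK with a placeholder URL; the ingestUri parameter order is an assumption, so check the SDK reference.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

// No workflow referenced: Azure AI Document Intelligence prepares the PDF,
// the project default embedding model indexes it, and no entities are extracted.
const response = await client.ingestUri("https://example.com/whitepaper.pdf");

console.log(response.ingestUri?.id);
```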


When Do You Need a Workflow?

Decision Matrix

| Goal | Need Workflow? | Stage | Why |
| --- | --- | --- | --- |
| Extract text from simple PDF | ❌ No | - | Default Azure AI is fine |
| Extract text from complex PDF | ✅ Yes | Preparation | Use Reducto, Mistral OCR, or vision models |
| Handle images/diagrams in PDF | ✅ Yes | Preparation | Vision models understand images |
| Transcribe audio/video | ❌ No | - | Deepgram Nova 2 transcribes automatically |
| Speaker diarization or PII redaction | ✅ Yes | Preparation | Configure Deepgram options or use Assembly.AI |
| Extract entities (people, orgs) | ✅ Yes | Extraction | No extraction by default |
| Build knowledge graph | ✅ Yes | Extraction | Entity extraction required |
| Enrich with external data | ✅ Yes | Enrichment | Add Diffbot, FHIR, etc. |
| Use custom embedding model | ✅ Yes | Indexing | Override default embeddings |
| Filter content during ingestion | ✅ Yes | Ingestion | Path/type filtering |

Common Scenarios

Scenario 1: Simple Document Q&A

Scenario 2: Complex PDFs with Tables
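One way to handle this scenario, sketched with the graphlit-client TypeScript SDK: swap the default preparation for REDUCTO (or MISTRAL_DOCUMENT) and leave every other stage at its default. Enum values are written as their GraphQL names and the input is cast loosely; real code should use the SDK's generated enum types and verify field names against the schema.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Complex PDF Preparation",
  preparation: {
    jobs: [
      // Reducto handles complex tables and multi-column layouts better than the default.
      { connector: { type: "REDUCTO" } },
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```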

Scenario 3: Knowledge Graph from Documents


Workflow Stages

Preparation Stage

Purpose: Extract text, markdown, and metadata from raw files.

When you don't need it: Default Azure AI Document Intelligence handles most documents.

When you need it: Complex PDFs, audio transcription, high-quality markdown extraction.

Complete Configuration

| Type | Use Case | Speed | Quality | When to Use |
| --- | --- | --- | --- | --- |
| AZURE_DOCUMENT_INTELLIGENCE | Default for PDFs - most PDFs, Office docs | ⚡ Fast | ⭐⭐⭐ Good | Try this first (automatic default) |
| REDUCTO | Specialized PDF extraction - better than default | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Try if default isn't good enough |
| MISTRAL_DOCUMENT | Mistral OCR for documents | ⚡⚡ Very Fast | ⭐⭐⭐⭐ Excellent | Alternative to Reducto |
| MODEL_DOCUMENT | Vision LLMs - general-purpose, not tuned for docs | ⚠️ Slower | ⭐⭐⭐⭐ Very Good | Advanced: after trying defaults & Reducto |
| DEEPGRAM | Default for audio/video - transcription | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Automatic default |
| ASSEMBLY_AI | Audio/video with speaker diarization | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Alternative to Deepgram |
| DOCUMENT | Explicit Azure AI configuration | ⚡ Fast | ⭐⭐⭐ Good | Rarely needed (use default) |
| EMAIL | Email message parsing | ⚡ Instant | ⭐⭐⭐⭐⭐ Perfect | Automatic default |
| PAGE | Web page extraction | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Automatic default |

MODEL_DOCUMENT: Vision LLMs for Documents

Understanding the Options

Document preparation offers a spectrum of tools, each optimized for different needs and cost profiles:

Azure AI Document Intelligence (Default - $0)

  • Automatic OCR and layout analysis

  • Fast, handles most PDFs and Office documents

  • Included in your Graphlit subscription

Reducto / Mistral Document ($)

  • Specialized PDF extraction engines

  • Better at complex tables and multi-column layouts

  • Higher quality than default, but adds per-page cost

Vision LLMs (MODEL_DOCUMENT - $$$)

  • General-purpose vision models (GPT-4o, Claude, Gemini)

  • Understand content semantically, not just structurally

  • Can interpret diagrams, charts, and visual relationships

  • Highest cost (10x more than default), best flexibility

When to use vision LLMs:

  • Specialized tools don't capture the visual meaning you need

  • Documents require semantic understanding of images/diagrams

  • Need custom prompting or model-specific behavior

  • Complex visual documents where structure alone isn't enough

Properties:

Model Selection:

Cost vs. Capability:

  • Vision LLMs cost ~10x more per page than specialized tools

  • Trade higher cost for semantic understanding and flexibility

  • Best for documents where visual meaning matters, not just text extraction

Example:
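A sketch of a MODEL_DOCUMENT workflow using the graphlit-client TypeScript SDK. It assumes the vision model is referenced through a previously created specification; "VISION_SPEC_ID" is a placeholder, the modelDocument property name is an assumption, and enum values are written as GraphQL names with a loose cast.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Vision PDF Preparation",
  preparation: {
    jobs: [
      {
        connector: {
          type: "MODEL_DOCUMENT", // vision LLM preparation (~10x the cost of the default)
          modelDocument: {
            specification: { id: "VISION_SPEC_ID" }, // placeholder for a GPT-4o/Claude/Gemini specification
          },
        },
      },
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```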

DEEPGRAM: Audio Transcription (Enhanced)

Default: Deepgram Nova 2 is used automatically for audio/video files.

When to add a workflow (to enhance the default):

  • Enable speaker diarization (identify who's speaking)

  • Enable PII redaction

  • Use different Deepgram model

  • Configure language settings

Properties:

Models:

  • NOVA_2 - Best quality (recommended)

  • NOVA_2_MEDICAL - Medical terminology

  • NOVA_2_FINANCE - Financial terminology

  • NOVA_2_CONVERSATIONAL_AI - Real-time conversations

  • NOVA_2_VOICEMAIL - Voicemail transcription

  • NOVA_2_VIDEO - Video content

  • NOVA_2_PHONE_CALL - Phone calls

Example:
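A sketch that keeps Deepgram but turns on diarization and redaction, using the graphlit-client TypeScript SDK. The deepgram property names are assumptions based on the options listed above; verify them against the current schema.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Audio with Speakers",
  preparation: {
    jobs: [
      {
        connector: {
          type: "DEEPGRAM",
          deepgram: {
            model: "NOVA_2",
            enableSpeakerDiarization: true, // label who is speaking
            enableRedaction: true,          // redact PII from the transcript
          },
        },
      },
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```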

ASSEMBLY_AI: Alternative Audio Transcription

When to use (as an alternative to the default Deepgram):

  • Prefer Assembly.AI over Deepgram

  • Need their specific features

  • Already have Assembly.AI account/credits

Properties:

Models:

  • BEST - Highest accuracy (default)

  • NANO - Fastest, lower cost

Example:

Multi-Job Preparation

Use case: Different file types need different preparation methods.

Example:
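A sketch with two preparation jobs, one per content family, using the graphlit-client TypeScript SDK. It assumes each connector can be scoped with a fileTypes property; how Graphlit routes content types to jobs should be confirmed against the current schema.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Mixed Content Preparation",
  preparation: {
    jobs: [
      // Assumed: fileTypes scopes each job to matching content.
      { connector: { type: "REDUCTO", fileTypes: ["DOCUMENT"] } },
      { connector: { type: "ASSEMBLY_AI", fileTypes: ["AUDIO", "VIDEO"] } },
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```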

Web Page Capture Options

Properties (on PreparationWorkflowStageInput):

Default: disableSmartCapture: false, enableUnblockedCapture: false

When to adjust:
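For example, a sketch using the graphlit-client TypeScript SDK; the effect of each flag is inferred from its name (unblocked capture for pages that block automated crawlers, smart capture for readability-style extraction), so treat the comments as assumptions.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Web Capture Tuning",
  preparation: {
    disableSmartCapture: false,   // keep smart HTML extraction (the default)
    enableUnblockedCapture: true, // assumed: capture pages that block automated crawlers
  },
});

console.log(response.createWorkflow?.id);
```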

Auto-Summarization During Preparation

Properties:

SummarizationTypes:

  • CHAPTER - Summarize by chapter

  • PAGE - Summarize by page

  • SEGMENT - Summarize by segment

  • QUESTIONS - Generate questions

  • HEADLINES - Extract headlines

  • POSTS - Summarize posts (social media)

Example:
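A sketch assuming the preparation stage accepts a summarizations list using the types above; the summarizations and items property names are assumptions to verify against the schema. Uses the graphlit-client TypeScript SDK with enum values written as GraphQL names.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Prepare and Summarize",
  preparation: {
    summarizations: [
      { type: "QUESTIONS", items: 5 }, // generate questions during preparation
      { type: "HEADLINES", items: 3 }, // extract headlines during preparation
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```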


Extraction Stage

Purpose: Identify and extract entities (people, organizations, places, topics) to build knowledge graph.

Default: No extraction - entities are NOT extracted unless you add this stage.

When you need it:

  • Building knowledge graph

  • Entity-based search/filtering

  • Relationship discovery

  • Semantic understanding beyond keywords

Complete Configuration

EntityExtractionServiceTypes

| Type | Use Case | Quality |
| --- | --- | --- |
| MODEL_TEXT | Recommended - LLM-based extraction | ⭐⭐⭐⭐⭐ Excellent |
| AZURE_COGNITIVE_SERVICES_TEXT | Azure Text Analytics | ⭐⭐⭐ Good |
| MODEL_IMAGE | Extract from images | ⭐⭐⭐⭐ Excellent |
| AZURE_COGNITIVE_SERVICES_IMAGE | Azure Vision | ⭐⭐⭐ Good |

Properties:

Model Selection:

Observable Types (Standard Entities)

Complete list of built-in entity types:

Custom Entity Types

Use case: Domain-specific entities not covered by standard types.

Example:

Complete Example: Knowledge Graph Extraction
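A sketch of a knowledge-graph workflow using the graphlit-client TypeScript SDK. extractedTypes values follow the ObservableTypes enum, the optional specification id is a placeholder, and enum values are written as GraphQL names with a loose cast; verify property names against the schema.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Knowledge Graph",
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT", // LLM-based entity extraction
          extractedTypes: ["PERSON", "ORGANIZATION", "PLACE", "EVENT", "PRODUCT"],
          // modelText: { specification: { id: "EXTRACTION_SPEC_ID" } }, // optional: pin a specific LLM
        },
      },
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```

Content ingested with this workflow contributes entities and relationships to the knowledge graph in addition to being indexed for semantic search.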


Enrichment Stage

Purpose: Add external data to entities (links, FHIR medical data, Diffbot enrichment).

Default: No enrichment - external data is NOT added unless you configure this.

When you need it:

  • Medical applications (FHIR integration)

  • Web entity enrichment (Diffbot)

  • Link extraction from content

Complete Configuration

EntityEnrichmentServiceTypes

| Type | Use Case | External API |
| --- | --- | --- |
| DIFFBOT | Web entity enrichment | Diffbot API required |
| FHIR | Medical entity enrichment | FHIR endpoint required |

Properties:

Example:
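A sketch combining link handling with Diffbot entity enrichment, using the graphlit-client TypeScript SDK. The link strategy and connector property names are assumptions based on this section, the domain is a placeholder, and Diffbot API access must be configured separately.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Enriched Entities",
  enrichment: {
    link: {
      enableCrawling: true,            // assumed: follow links found in the content
      allowedDomains: ["example.com"], // placeholder domain filter
    },
    jobs: [
      { connector: { type: "DIFFBOT" } }, // requires Diffbot API access
    ],
  },
} as any);

console.log(response.createWorkflow?.id);
```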

FHIR Medical Enrichment

Use case: Enrich medical entities with FHIR (Fast Healthcare Interoperability Resources) data.

Properties:

Example:

Diffbot Enrichment

Use case: Enrich entities with data from Diffbot Knowledge Graph.

Properties:

Example:


Indexing Stage

Purpose: Create vector embeddings and store content in vector database for semantic search.

Default: Automatic indexing with the project's default embedding model (see Default Behavior above).

When you need this stage: Custom embedding model, specialized indexing requirements.

Complete Configuration

Most users don't need this stage - the default indexing works well. Use this only if you need:

  • Different embedding model than project default

  • Specialized indexing configuration

Example:


Ingestion Stage

Purpose: Filter which content gets ingested based on file type, path, or URL patterns.

Default: All content ingested - no filtering.

When you need it:

  • Web crawling (filter by URL path)

  • Selective file type ingestion

  • Exclude certain paths

Complete Configuration

Example: Web Crawling with Path Filters

Example: Selective File Types
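A sketch covering both examples, using the graphlit-client TypeScript SDK. It assumes the ingestion stage takes an if filter with file-type and path properties; the property names and path patterns are assumptions and placeholders to verify against the schema.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Filtered Ingestion",
  ingestion: {
    if: {
      fileTypes: ["DOCUMENT"],              // only ingest documents
      allowedPaths: ["^/docs/.*"],          // e.g. only crawl /docs/ pages
      excludedPaths: ["^/docs/archive/.*"], // skip archived pages
    },
  },
} as any);

console.log(response.createWorkflow?.id);
```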


Storage Stage

Purpose: Configure where and how content is stored.

Default: Managed storage - Graphlit handles all storage.

When you need it: Bring your own storage (Azure Blob, S3, Google Cloud Storage).

Complete Configuration

Most users don't need this - Graphlit's managed storage is recommended.


Classification Stage

Purpose: Classify content into categories using LLMs.

Default: No classification - content is not categorized unless you add this.

When you need it:

  • Content routing/filtering

  • Auto-categorization

  • Custom taxonomy

Complete Configuration

Example:


Workflow Actions

Purpose: Execute actions after content processing (webhooks, integrations).

Default: No actions - nothing happens after processing unless configured.

When you need it:

  • Webhook notifications

  • Integration triggers

  • Post-processing automation

Complete Configuration

Example: Webhook on Completion
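A sketch using the graphlit-client TypeScript SDK, assuming a webhook integration connector that takes a target uri; the enum spelling (WEB_HOOK vs. WEBHOOK) and the endpoint are assumptions and placeholders.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

const response = await client.createWorkflow({
  name: "Notify on Completion",
  actions: [
    {
      connector: {
        type: "WEB_HOOK",                            // webhook integration (confirm enum spelling)
        uri: "https://example.com/graphlit-webhook", // placeholder endpoint
      },
    },
  ],
} as any);

console.log(response.createWorkflow?.id);
```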


Complete API Reference

WorkflowInput (Top-Level)

All fields except name are optional. Graphlit provides intelligent defaults for missing stages.
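As orientation, here is the top-level shape in TypeScript notation. This mirrors the stages described in this document rather than the literal generated types; real code should use the generated WorkflowInput type from graphlit-client.

```typescript
// Illustrative shape only - not the generated SDK type.
interface WorkflowInputSketch {
  name: string;            // required
  ingestion?: object;      // filter which content is accepted
  indexing?: object;       // override the embedding model
  preparation?: object;    // text/markdown extraction, transcription
  extraction?: object;     // entity extraction for the knowledge graph
  enrichment?: object;     // links, Diffbot, FHIR
  classification?: object; // LLM-based categorization
  storage?: object;        // bring-your-own storage
  actions?: object[];      // webhooks and integrations
}
```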


Production Patterns

Pattern 1: Multi-Tenant Workflows

Pattern 2: Conditional Workflows (File Type Based)

Pattern 3: Zine Production Pattern

What Zine uses:

Key lessons from Zine (see the sketch after this list):

  • Single workflow for simplicity

  • Vision model preparation (complex Slack attachments)

  • Entity extraction for knowledge graph

  • Link extraction for context
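An illustrative combination of those four elements (not Zine's actual configuration), using the graphlit-client TypeScript SDK; the specification id is a placeholder, property names are the assumptions described earlier on this page, and enum values are written as GraphQL names with a loose cast.

```typescript
import { Graphlit } from "graphlit-client";

const client = new Graphlit();

// Single workflow: vision preparation + entity extraction + link extraction.
const response = await client.createWorkflow({
  name: "Production Pipeline",
  preparation: {
    jobs: [
      {
        connector: {
          type: "MODEL_DOCUMENT",
          modelDocument: { specification: { id: "VISION_SPEC_ID" } }, // placeholder vision spec
        },
      },
    ],
  },
  extraction: {
    jobs: [
      {
        connector: {
          type: "MODEL_TEXT",
          extractedTypes: ["PERSON", "ORGANIZATION", "PLACE", "LABEL"],
        },
      },
    ],
  },
  enrichment: {
    link: { enableCrawling: false }, // assumed: extract links without crawling them
  },
} as any);

console.log(response.createWorkflow?.id);
```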

Pattern 4: Cost Optimization


Summary

Key Takeaways:

  1. Most workflows are optional - Graphlit has intelligent defaults

  2. Default preparation is Azure AI Document Intelligence - works for 80%+ of documents

  3. No extraction by default - add extraction stage to build knowledge graph

  4. Vision models (MODEL_DOCUMENT) are 10x more expensive - only use for complex PDFs

  5. Audio is transcribed by default (Deepgram Nova 2) - add a workflow for speaker diarization or Assembly.AI

  6. All stages are composable - mix and match as needed

When in doubt: Start without a workflow and add stages only when you hit limitations.

