Workflows
Complete reference for Graphlit workflows - memory formation pipeline configuration
Workflows define how content is processed as it flows through Graphlit's memory formation pipeline. This is the authoritative reference for all workflow configuration options, defaults, and decision guidance.
Overview & Core Concepts
What Workflows Do
Workflows control the memory formation pipeline - how raw content transforms into structured, searchable semantic memory:
The Six Pipeline Stages (in execution order):
Ingestion - Filter which content to accept (file types, paths)
Indexing - Configure embedding model and vector storage
Preparation - Extract text/markdown from files (PDFs, audio, images)
Extraction - Identify entities (people, organizations, topics)
Enrichment - Add external data (links, FHIR, Diffbot)
Classification - Categorize content using LLMs
Additional Configuration:
Storage - Where files are stored (defaults to managed storage)
Actions - Post-processing webhooks and integrations
The Workflow Object
interface WorkflowInput {
name: string; // Required: Workflow name
// Pipeline stages (in execution order):
ingestion?: IngestionWorkflowStageInput; // Optional: Content filtering
indexing?: IndexingWorkflowStageInput; // Optional: Custom indexing
preparation?: PreparationWorkflowStageInput; // Optional: Text extraction
extraction?: ExtractionWorkflowStageInput; // Optional: Entity extraction
enrichment?: EnrichmentWorkflowStageInput; // Optional: External enrichment
classification?: ClassificationWorkflowStageInput; // Optional: Content classification
// Additional configuration:
storage?: StorageWorkflowStageInput; // Optional: Storage settings
actions?: WorkflowActionInput[]; // Optional: Post-processing actions
}
Key insight: All stages are optional. Graphlit has intelligent defaults.
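Because every stage is optional, the smallest valid workflow is just a name - it runs the default pipeline end to end. A minimal sketch (the createWorkflow and ingestUri calls match the examples later on this page):
// No stages configured - all defaults apply
const workflow = await graphlit.createWorkflow({
  name: 'Defaults Only'
});
// Attach it at ingestion time, as in the scenarios below
await graphlit.ingestUri(uri, undefined, undefined, undefined, true, {
  id: workflow.createWorkflow.id
});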
Default Behavior (No Workflow)
What Happens Without a Workflow
// Simple ingestion - NO workflow specified
await graphlit.ingestUri('https://example.com/document.pdf');
Graphlit's Default Pipeline:
| Stage | Default Behavior | Speed |
| --- | --- | --- |
| Preparation | ✅ Intelligent preparation (see below) | ⚡ Fast |
| Extraction | ❌ No entity extraction | ⚡ Instant |
| Enrichment | ❌ No external enrichment | ⚡ Instant |
| Indexing | ✅ Project default embedding (text-embedding-ada-002) | ⚡ Fast |
Default Preparation: Intelligent Per-Format Processing
Graphlit automatically selects the best preparation method based on content type:
PDFs & Office Documents:
Azure AI Document Intelligence (Layout model)
✅ Extracts text from PDFs, Word docs, PowerPoint
✅ OCR for scanned documents
✅ Basic table recognition
✅ Layout analysis
❌ Advanced table parsing
❌ Image understanding (diagrams, charts)
❌ Complex multi-column layouts
Audio & Video Files:
Deepgram Nova 2 Transcription
✅ Automatic transcription
✅ High accuracy
✅ Multiple language support
❌ No speaker diarization (unless you add a workflow)
Web Pages:
Built-in HTML Parser
✅ Smart HTML extraction
✅ JavaScript rendering (by default)
✅ Markdown conversion
Email, Text, Markdown:
Built-in Parsers
✅ Native format support
✅ Metadata extraction
When default preparation is sufficient:
Simple text-heavy PDFs (80%+ of documents)
Audio/video transcription without speaker identification
Most web pages
Office documents
Standard email/text content
Default Indexing: Project Embedding Model
What it does:
✅ Creates vector embeddings for semantic search
✅ Chunks content intelligently
✅ Stores in vector database
Default model: OpenAI text-embedding-ada-002 (if not configured otherwise)
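Because indexing runs automatically, freshly ingested content is immediately available to semantic search. A minimal sketch, assuming the queryContents method and the search/searchType filter fields from the Graphlit TypeScript SDK:
// Hybrid (vector + keyword) search over the default embeddings - no workflow required
const results = await graphlit.queryContents({
  search: 'quarterly revenue guidance',
  searchType: SearchTypes.Hybrid
});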
When Do You Need a Workflow?
Decision Matrix
| Task | Workflow Needed? | Stage | Why |
| --- | --- | --- | --- |
| Extract text from simple PDF | ❌ No | - | Default Azure AI is fine |
| Extract text from complex PDF | ✅ Yes | Preparation | Use vision models (GPT-4o) |
| Handle images/diagrams in PDF | ✅ Yes | Preparation | Vision models understand images |
| Transcribe audio/video with speaker diarization | ✅ Yes | Preparation | Default Deepgram transcription lacks diarization |
| Extract entities (people, orgs) | ✅ Yes | Extraction | No extraction by default |
| Build knowledge graph | ✅ Yes | Extraction | Entity extraction required |
| Enrich with external data | ✅ Yes | Enrichment | Add Diffbot, FHIR, etc. |
| Use custom embedding model | ✅ Yes | Indexing | Override default embeddings |
| Filter content during ingestion | ✅ Yes | Ingestion | Path/type filtering |
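However you configure the stages, a workflow is attached the same two ways - passed per ingest call, or set on a feed so everything the feed ingests uses it (both appear in the patterns later on this page):
// Per ingest call (full signature shown in Scenario 2 below)
await graphlit.ingestUri(uri, undefined, undefined, undefined, true, { id: workflowId });
// Or on a feed - feed-specific connector properties omitted here
await graphlit.createFeed({
  name: 'Docs Feed',
  type: FeedTypes.Slack, // any feed type works; Slack is shown in Pattern 3
  workflow: { id: workflowId },
  // ...
});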
Common Scenarios
Scenario 1: Simple Document Q&A
// NO WORKFLOW NEEDED ✅
await graphlit.ingestUri(pdfUrl);
// Default preparation + indexing works great
const answer = await graphlit.promptConversation({
prompt: 'What are the key points?'
});
Scenario 2: Complex PDFs with Tables
// WORKFLOW NEEDED ✅
const workflow = await graphlit.createWorkflow({
name: 'Vision Model Prep',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: { specification: { id: gpt4oSpecId } }
}
}]
}
});
await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, true, { id: workflow.createWorkflow.id });
Scenario 3: Knowledge Graph from Documents
// WORKFLOW NEEDED ✅
const workflow = await graphlit.createWorkflow({
name: 'Extract Entities',
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: {
specification: { id: claudeSpecId }
}
}
}]
}
});
Workflow Stages
Preparation Stage
Purpose: Extract text, markdown, and metadata from raw files.
When you don't need it: Default Azure AI Document Intelligence handles most documents.
When you need it: Complex PDFs, audio transcription, high-quality markdown extraction.
Complete Configuration
interface PreparationWorkflowStageInput {
jobs?: Array<PreparationWorkflowJobInput>; // Preparation connectors
summarizations?: Array<SummarizationStrategyInput>; // Auto-summarization
disableSmartCapture?: boolean; // Disable JS rendering for web pages
enableUnblockedCapture?: boolean; // Use unblocked.com for Cloudflare bypass (10x cost)
}
interface PreparationWorkflowJobInput {
connector: FilePreparationConnectorInput; // Required: Preparation method
}
interface FilePreparationConnectorInput {
type: FilePreparationServiceTypes; // Required: Service type
fileTypes?: Array<FileTypes>; // Optional: Which file types to prepare
// Service-specific properties:
modelDocument?: ModelDocumentPreparationPropertiesInput; // Vision models
deepgram?: DeepgramAudioPreparationPropertiesInput; // Deepgram transcription
assemblyAi?: AssemblyAiAudioPreparationPropertiesInput; // Assembly.AI transcription
document?: DocumentPreparationPropertiesInput; // Azure AI (explicit)
mistral?: MistralDocumentPreparationPropertiesInput; // Mistral OCR
reducto?: ReductoDocumentPreparationPropertiesInput; // Reducto
email?: EmailPreparationPropertiesInput; // Email parsing
page?: PagePreparationPropertiesInput; // Web page extraction
}
FilePreparationServiceTypes (Recommended Order)
| Service | Description | Speed | Quality | Recommendation |
| --- | --- | --- | --- | --- |
| AZURE_DOCUMENT_INTELLIGENCE | Default for PDFs - most PDFs, Office docs | ⚡ Fast | ⭐⭐⭐ Good | ✅ Try this first (automatic default) |
| REDUCTO | Specialized PDF extraction - better than default | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Try if default isn't good enough |
| MISTRAL_DOCUMENT | Mistral OCR for documents | ⚡⚡ Very fast | ⭐⭐⭐⭐ Excellent | Alternative to Reducto |
| MODEL_DOCUMENT | Vision LLMs - general-purpose, not tuned for docs | ⚠️ Slower | ⭐⭐⭐⭐ Very good | Advanced: after trying defaults & Reducto |
| DEEPGRAM | Default for audio/video - transcription | ⚡ Fast | ⭐⭐⭐⭐ Excellent | ✅ Automatic default |
| ASSEMBLY_AI | Audio/video with speaker diarization | ⚡ Fast | ⭐⭐⭐⭐ Excellent | Alternative to Deepgram |
| DOCUMENT | Explicit Azure AI configuration | ⚡ Fast | ⭐⭐⭐ Good | Rarely needed (use default) |
| EMAIL | Email message parsing | ⚡ Instant | ⭐⭐⭐⭐⭐ Perfect | ✅ Automatic default |
| PAGE | Web page extraction | ⚡ Fast | ⭐⭐⭐⭐ Excellent | ✅ Automatic default |
MODEL_DOCUMENT: Vision LLMs for Documents
Understanding the Options
Document preparation offers a spectrum of tools, each optimized for different needs and cost profiles:
Azure AI Document Intelligence (Default - $0)
Automatic OCR and layout analysis
Fast, handles most PDFs and Office documents
Included in your Graphlit subscription
Reducto / Mistral Document ($)
Specialized PDF extraction engines
Better at complex tables and multi-column layouts
Higher quality than default, but adds per-page cost
Vision LLMs (MODEL_DOCUMENT - $$$)
General-purpose vision models (GPT-4o, Claude, Gemini)
Understand content semantically, not just structurally
Can interpret diagrams, charts, and visual relationships
Highest cost (10x more than default), best flexibility
When to use vision LLMs:
Specialized tools don't capture the visual meaning you need
Documents require semantic understanding of images/diagrams
Need custom prompting or model-specific behavior
Complex visual documents where structure alone isn't enough
Properties:
interface ModelDocumentPreparationPropertiesInput {
specification?: EntityReferenceInput; // Optional: LLM specification (GPT-4o, Claude, etc.)
}
Model Selection:
// GPT-4o - Best balance (recommended)
const gpt4oSpec = await graphlit.createSpecification({
type: SpecificationTypes.Preparation,
serviceType: ModelServiceTypes.OpenAi,
openAI: { model: OpenAiModels.Gpt4O_128K }
});
// Claude Sonnet 3.7 - Best for complex documents
const claudeSpec = await graphlit.createSpecification({
type: SpecificationTypes.Preparation,
serviceType: ModelServiceTypes.Anthropic,
anthropic: { model: AnthropicModels.Claude_3_7Sonnet }
});
// Gemini 2.0 Flash - Fast and cheap
const geminiSpec = await graphlit.createSpecification({
type: SpecificationTypes.Preparation,
serviceType: ModelServiceTypes.Google,
google: { model: GoogleModels.Gemini_2_0_Flash }
});
Cost vs. Capability:
Vision LLMs cost ~10x more per page than specialized tools
Trade higher cost for semantic understanding and flexibility
Best for documents where visual meaning matters, not just text extraction
Example:
const workflow = await graphlit.createWorkflow({
name: 'High-Quality PDF Extraction',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
fileTypes: [FileTypes.Pdf, FileTypes.Document], // Only for PDFs/docs
modelDocument: {
specification: { id: gpt4oSpecId }
}
}
}]
}
});
DEEPGRAM: Audio Transcription (Enhanced)
Default: Deepgram Nova 2 is used automatically for audio/video files.
When to add workflow (enhance default):
Enable speaker diarization (identify who's speaking)
Enable PII redaction
Use different Deepgram model
Configure language settings
Properties:
interface DeepgramAudioPreparationPropertiesInput {
key?: string; // Optional: Deepgram API key (uses project default if not provided)
model?: DeepgramModels; // Optional: Transcription model (default: NOVA_2)
language?: string; // Optional: BCP 47 language code (e.g., 'en', 'en-US')
detectLanguage?: boolean; // Optional: Auto-detect language (default: false)
enableSpeakerDiarization?: boolean; // Optional: Identify speakers (default: false)
enableRedaction?: boolean; // Optional: Redact PII (default: false)
}
Models:
NOVA_2 - Best quality (recommended)
NOVA_2_MEDICAL - Medical terminology
NOVA_2_FINANCE - Financial terminology
NOVA_2_CONVERSATIONAL_AI - Real-time conversations
NOVA_2_VOICEMAIL - Voicemail transcription
NOVA_2_VIDEO - Video content
NOVA_2_PHONE_CALL - Phone calls
Example:
const workflow = await graphlit.createWorkflow({
name: 'Audio Transcription',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.Deepgram,
fileTypes: [FileTypes.Audio, FileTypes.Video],
deepgram: {
model: DeepgramModels.Nova2,
enableSpeakerDiarization: true, // Identify who's speaking
language: 'en-US'
}
}
}]
}
});
ASSEMBLY_AI: Alternative Audio Transcription
When to use (alternative to default Deepgram):
Prefer Assembly.AI over Deepgram
Need their specific features
Already have Assembly.AI account/credits
Properties:
interface AssemblyAiAudioPreparationPropertiesInput {
key?: string; // Optional: Assembly.AI API key
model?: AssemblyAiModels; // Optional: Model (default: BEST)
language?: string; // Optional: BCP 47 language code
detectLanguage?: boolean; // Optional: Auto-detect language
enableSpeakerDiarization?: boolean; // Optional: Identify speakers
enableRedaction?: boolean; // Optional: Redact PII
}
Models:
BEST - Highest accuracy (default)
NANO - Fastest, lower cost
Example:
const workflow = await graphlit.createWorkflow({
name: 'Meeting Transcription',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.AssemblyAi,
fileTypes: [FileTypes.Audio],
assemblyAi: {
model: AssemblyAiModels.Best,
enableSpeakerDiarization: true,
enableRedaction: true // Redact sensitive info
}
}
}]
}
});
Multi-Job Preparation
Use case: Different file types need different preparation methods.
Example:
const workflow = await graphlit.createWorkflow({
name: 'Multi-Format Processing',
preparation: {
jobs: [
{
// Job 1: PDFs with vision model
connector: {
type: FilePreparationServiceTypes.ModelDocument,
fileTypes: [FileTypes.Pdf, FileTypes.Document],
modelDocument: { specification: { id: gpt4oSpecId } }
}
},
{
// Job 2: Audio with Deepgram
connector: {
type: FilePreparationServiceTypes.Deepgram,
fileTypes: [FileTypes.Audio, FileTypes.Video],
deepgram: {
model: DeepgramModels.Nova2,
enableSpeakerDiarization: true
}
}
},
{
// Job 3: Images with Azure AI (default for everything else)
connector: {
type: FilePreparationServiceTypes.AzureDocumentIntelligence,
fileTypes: [FileTypes.Image]
}
}
]
}
});
Web Page Capture Options
Properties (on PreparationWorkflowStageInput):
disableSmartCapture?: boolean; // Disable JS rendering (faster, cheaper, but may miss content)
enableUnblockedCapture?: boolean; // Use unblocked.com to bypass Cloudflare (10x cost!)
Default: disableSmartCapture: false, enableUnblockedCapture: false
When to adjust:
// Static HTML sites (faster, cheaper)
const workflow = await graphlit.createWorkflow({
name: 'Static HTML Crawl',
preparation: {
disableSmartCapture: true // Skip JS rendering
}
});
// Sites with Cloudflare protection
const workflow = await graphlit.createWorkflow({
name: 'Cloudflare Bypass',
preparation: {
enableUnblockedCapture: true // 10x cost, but bypasses protection
}
});
Auto-Summarization During Preparation
Properties:
summarizations?: Array<SummarizationStrategyInput>;
interface SummarizationStrategyInput {
type: SummarizationTypes; // Required: Summarization type
specification?: EntityReferenceInput; // Optional: LLM for summarization
tokens?: number; // Optional: Max summary length
items?: number; // Optional: Number of items to summarize
}
SummarizationTypes:
CHAPTER - Summarize by chapter
PAGE - Summarize by page
SEGMENT - Summarize by segment
QUESTIONS - Generate questions
HEADLINES - Extract headlines
POSTS - Summarize posts (social media)
Example:
const workflow = await graphlit.createWorkflow({
name: 'Prep with Auto-Summary',
preparation: {
jobs: [{ /* preparation connector */ }],
summarizations: [{
type: SummarizationTypes.Chapter,
specification: { id: gpt4oSpecId },
tokens: 500 // Max 500 tokens per chapter summary
}]
}
});
Extraction Stage
Purpose: Identify and extract entities (people, organizations, places, topics) to build a knowledge graph.
Default: ❌ No extraction - entities are NOT extracted unless you add this stage.
When you need it:
Building knowledge graph
Entity-based search/filtering
Relationship discovery
Semantic understanding beyond keywords
Complete Configuration
interface ExtractionWorkflowStageInput {
jobs?: Array<ExtractionWorkflowJobInput>; // Extraction connectors
}
interface ExtractionWorkflowJobInput {
connector: EntityExtractionConnectorInput; // Required: Extraction method
}
interface EntityExtractionConnectorInput {
type: EntityExtractionServiceTypes; // Required: Service type
extractedTypes?: Array<ObservableTypes>; // Optional: Which entity types to extract
customTypes?: Array<string>; // Optional: Custom entity types
extractedCount?: number; // Optional: Max entities per type (default: 100)
fileTypes?: Array<FileTypes>; // Optional: Which file types to extract from
// Service-specific properties:
modelText?: ModelTextExtractionPropertiesInput; // LLM extraction (recommended)
modelImage?: ModelImageExtractionPropertiesInput; // Image entity extraction
}
EntityExtractionServiceTypes
| Service | Description | Quality |
| --- | --- | --- |
| MODEL_TEXT | Recommended - LLM-based extraction | ⭐⭐⭐⭐⭐ Excellent |
| AZURE_COGNITIVE_SERVICES_TEXT | Azure Text Analytics | ⭐⭐⭐ Good |
| MODEL_IMAGE | Extract from images | ⭐⭐⭐⭐ Excellent |
| AZURE_COGNITIVE_SERVICES_IMAGE | Azure Vision | ⭐⭐⭐ Good |
MODEL_TEXT: LLM Entity Extraction (Recommended)
Properties:
interface ModelTextExtractionPropertiesInput {
specification?: EntityReferenceInput; // Optional: LLM specification
tokenThreshold?: number; // Optional: Min tokens to process (skip short sections)
}
Model Selection:
// Claude Sonnet 3.7 - Best accuracy (recommended)
const claudeSpec = await graphlit.createSpecification({
type: SpecificationTypes.Extraction,
serviceType: ModelServiceTypes.Anthropic,
anthropic: { model: AnthropicModels.Claude_3_7Sonnet }
});
// GPT-4o - Good balance
const gpt4oSpec = await graphlit.createSpecification({
type: SpecificationTypes.Extraction,
serviceType: ModelServiceTypes.OpenAi,
openAI: { model: OpenAiModels.Gpt4O_128K }
});
// Claude Haiku - Fast and cheap
const haikuSpec = await graphlit.createSpecification({
type: SpecificationTypes.Extraction,
serviceType: ModelServiceTypes.Anthropic,
anthropic: { model: AnthropicModels.Claude_3_5Haiku }
});
Observable Types (Standard Entities)
Complete list of built-in entity types:
enum ObservableTypes {
// People & Organizations
PERSON // People, individuals, names
ORGANIZATION // Companies, institutions, groups
// Places
PLACE // Locations, addresses, landmarks
// Products & Services
PRODUCT // Products, services, brands
// Events & Time
EVENT // Events, meetings, occurrences
// Creative Works
CREATIVE_WORK // Books, articles, movies, music
// Concepts & Topics
TOPIC // Abstract concepts, themes, subjects
// Legal & Financial
REGULATION // Laws, regulations, policies
// Medical (when using FHIR)
MEDICAL_CONDITION
MEDICAL_PROCEDURE
MEDICATION
MEDICAL_TEST
// And more... (check SDK for complete list)
}
Custom Entity Types
Use case: Domain-specific entities not covered by standard types.
Example:
const workflow = await graphlit.createWorkflow({
name: 'Legal Document Extraction',
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: {
specification: { id: claudeSpecId }
},
// Extract standard entities
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization
],
// AND custom legal entities
customTypes: [
'Contract',
'Clause',
'Obligation',
'Deadline',
'Payment Term',
'Jurisdiction',
'Termination Condition',
'Liability Limit'
]
}
}]
}
});
Complete Example: Knowledge Graph Extraction
const workflow = await graphlit.createWorkflow({
name: 'Full Knowledge Graph',
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: {
specification: { id: claudeSpecId },
tokenThreshold: 100 // Skip sections < 100 tokens
},
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Place,
ObservableTypes.Product,
ObservableTypes.Event,
ObservableTypes.Label
],
extractedCount: 200, // Max 200 entities per type
fileTypes: [
FileTypes.Pdf,
FileTypes.Document,
FileTypes.Page
]
}
}]
}
});
Enrichment Stage
Purpose: Add external data to entities (links, FHIR medical data, Diffbot enrichment).
Default: ❌ No enrichment - external data is NOT added unless you configure this.
When you need it:
Medical applications (FHIR integration)
Web entity enrichment (Diffbot)
Link extraction from content
Complete Configuration
interface EnrichmentWorkflowStageInput {
jobs?: Array<EnrichmentWorkflowJobInput>; // Enrichment connectors
link?: LinkStrategyInput; // Link extraction configuration
}
interface EnrichmentWorkflowJobInput {
connector: EntityEnrichmentConnectorInput; // Required: Enrichment method
}
interface EntityEnrichmentConnectorInput {
type: EntityEnrichmentServiceTypes; // Required: Service type
enrichedTypes?: Array<ObservableTypes>; // Optional: Which entity types to enrich
// Service-specific properties:
diffbot?: DiffbotEnrichmentPropertiesInput; // Diffbot Knowledge Graph
fhir?: FhirEnrichmentPropertiesInput; // FHIR medical data
}
EntityEnrichmentServiceTypes
| Service | Description | Requirement |
| --- | --- | --- |
| DIFFBOT | Web entity enrichment | Diffbot API token required |
| FHIR | Medical entity enrichment | FHIR endpoint required |
Link Extraction
Properties:
interface LinkStrategyInput {
allowedDomains?: Array<string>; // Optional: Only extract links from these domains
excludedDomains?: Array<string>; // Optional: Don't extract links from these domains
allowContentDomain?: boolean; // Optional: Allow links from same domain as content
extractUri?: boolean; // Optional: Extract HTTP URLs from text (default: true)
extractEmail?: boolean; // Optional: Extract email addresses (default: true)
extractPhoneNumber?: boolean; // Optional: Extract phone numbers (default: false)
}
Example:
const workflow = await graphlit.createWorkflow({
name: 'Link Extraction',
enrichment: {
link: {
extractUri: true,
extractEmail: true,
extractPhoneNumber: true,
allowedDomains: ['example.com', 'partner.com'], // Only these domains
excludedDomains: ['spam.com', 'ads.com'] // Exclude these
}
}
});
FHIR Medical Enrichment
Use case: Enrich medical entities with FHIR (Fast Healthcare Interoperability Resources) data.
Properties:
interface FhirEnrichmentPropertiesInput {
endpoint: URL; // Required: FHIR API endpoint
}
Example:
const workflow = await graphlit.createWorkflow({
name: 'Medical Content with FHIR',
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: { specification: { id: claudeSpecId } },
extractedTypes: [
ObservableTypes.MedicalCondition,
ObservableTypes.Medication,
ObservableTypes.MedicalProcedure
]
}
}]
},
enrichment: {
jobs: [{
connector: {
type: EntityEnrichmentServiceTypes.Fhir,
fhir: {
endpoint: 'https://fhir.example.org/api'
},
enrichedTypes: [
ObservableTypes.MedicalCondition,
ObservableTypes.Medication
]
}
}]
}
});
Diffbot Enrichment
Use case: Enrich entities with data from Diffbot Knowledge Graph.
Properties:
interface DiffbotEnrichmentPropertiesInput {
token: string; // Required: Diffbot API token
}
Example:
const workflow = await graphlit.createWorkflow({
name: 'Entity Enrichment with Diffbot',
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: { specification: { id: claudeSpecId } },
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Product
]
}
}]
},
enrichment: {
jobs: [{
connector: {
type: EntityEnrichmentServiceTypes.Diffbot,
diffbot: {
token: process.env.DIFFBOT_TOKEN!
},
enrichedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization
]
}
}]
}
});
Indexing Stage
Purpose: Create vector embeddings and store content in vector database for semantic search.
Default: ✅ Automatic indexing with the project default embedding model (text-embedding-ada-002 unless configured otherwise).
When you need this stage: Custom embedding model, specialized indexing requirements.
Complete Configuration
interface IndexingWorkflowStageInput {
jobs?: Array<IndexingWorkflowJobInput>; // Indexing connectors
}
interface IndexingWorkflowJobInput {
connector: ContentIndexingConnectorInput; // Required: Indexing method
}
interface ContentIndexingConnectorInput {
type: ContentIndexingServiceTypes; // Required: Service type
// Service-specific properties:
modelEmbedding?: ModelEmbeddingIndexingPropertiesInput; // Custom embedding
}
Most users don't need this stage - the default indexing works well. Use this only if you need:
Different embedding model than project default
Specialized indexing configuration
Example:
const workflow = await graphlit.createWorkflow({
name: 'Custom Embedding Indexing',
indexing: {
jobs: [{
connector: {
type: ContentIndexingServiceTypes.ModelEmbedding,
modelEmbedding: {
specification: { id: voyageEmbeddingSpecId } // Use Voyage embeddings
}
}
}]
}
});
Ingestion Stage
Purpose: Filter which content gets ingested based on file type, path, or URL patterns.
Default: ✅ All content ingested - no filtering.
When you need it:
Web crawling (filter by URL path)
Selective file type ingestion
Exclude certain paths
Complete Configuration
interface IngestionWorkflowStageInput {
filter?: IngestionContentFilterInput; // Optional: Content filter
}
interface IngestionContentFilterInput {
fileTypes?: Array<FileTypes>; // Optional: Only ingest these file types
fileExtensions?: Array<string>; // Optional: Only ingest these extensions
allowedPaths?: Array<string>; // Optional: Regex patterns for allowed paths
excludedPaths?: Array<string>; // Optional: Regex patterns for excluded paths
}Example: Web Crawling with Path Filters
const workflow = await graphlit.createWorkflow({
name: 'Blog Posts Only',
ingestion: {
filter: {
allowedPaths: ['^/blog/.*', '^/articles/.*'], // Only these paths
excludedPaths: ['^/admin/.*', '^/internal/.*'], // Exclude these
fileTypes: [FileTypes.Page] // Only web pages
}
}
});
Example: Selective File Types
const workflow = await graphlit.createWorkflow({
name: 'Documents Only',
ingestion: {
filter: {
fileTypes: [FileTypes.Pdf, FileTypes.Document],
fileExtensions: ['.pdf', '.docx', '.txt']
}
}
});
Storage Stage
Purpose: Configure where and how content is stored.
Default: ✅ Managed storage - Graphlit handles all storage.
When you need it: Bring your own storage (Azure Blob, S3, Google Cloud Storage).
Complete Configuration
interface StorageWorkflowStageInput {
jobs?: Array<StorageWorkflowJobInput>; // Storage connectors
}
interface StorageWorkflowJobInput {
connector: FileStorageConnectorInput; // Required: Storage method
}
interface FileStorageConnectorInput {
type: FileStorageServiceTypes; // Required: Service type
// Service-specific properties:
azureBlob?: AzureBlobStoragePropertiesInput;
s3?: S3StoragePropertiesInput;
google?: GoogleStoragePropertiesInput;
}
Most users don't need this - Graphlit's managed storage is recommended.
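If you do bring your own bucket, the configuration follows the same connector pattern as the other stages. A sketch, assuming FileStorageServiceTypes.S3 and typical bucket/credential fields on S3StoragePropertiesInput (field names here are illustrative - verify against the SDK types):
const workflow = await graphlit.createWorkflow({
  name: 'BYO S3 Storage',
  storage: {
    jobs: [{
      connector: {
        type: FileStorageServiceTypes.S3,
        s3: {
          // Illustrative field names - check S3StoragePropertiesInput in the SDK
          bucketName: 'my-content-bucket',
          region: 'us-east-1',
          accessKey: process.env.AWS_ACCESS_KEY_ID!,
          secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!
        }
      }
    }]
  }
});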
Classification Stage
Purpose: Classify content into categories using LLMs.
Default: ❌ No classification - content is not categorized unless you add this.
When you need it:
Content routing/filtering
Auto-categorization
Custom taxonomy
Complete Configuration
interface ClassificationWorkflowStageInput {
connector?: ContentClassificationConnectorInput; // Optional: Classification method
}
interface ContentClassificationConnectorInput {
type: ContentClassificationServiceTypes; // Required: Service type
// Service-specific properties:
modelContent?: ModelContentClassificationPropertiesInput;
}
interface ModelContentClassificationPropertiesInput {
specification?: EntityReferenceInput; // Optional: LLM for classification
rules?: Array<PromptClassificationRuleInput>; // Required: Classification rules
}
interface PromptClassificationRuleInput {
label: string; // Required: Category label
prompt: string; // Required: Classification criteria
}
Example:
const workflow = await graphlit.createWorkflow({
name: 'Content Classification',
classification: {
connector: {
type: ContentClassificationServiceTypes.ModelContent,
modelContent: {
specification: { id: gpt4oSpecId },
rules: [
{
label: 'Technical',
prompt: 'Content contains technical documentation, API references, or code examples'
},
{
label: 'Business',
prompt: 'Content focuses on business strategy, marketing, or sales'
},
{
label: 'Support',
prompt: 'Content is customer support documentation or FAQs'
}
]
}
}
}
});
Workflow Actions
Purpose: Execute actions after content processing (webhooks, integrations).
Default: ❌ No actions - nothing happens after processing unless configured.
When you need it:
Webhook notifications
Integration triggers
Post-processing automation
Complete Configuration
interface WorkflowActionInput {
connector: IntegrationConnectorInput; // Required: Integration connector
}
interface IntegrationConnectorInput {
type: IntegrationServiceTypes; // Required: Service type
uri?: URL; // Optional: Webhook URL
// Service-specific properties:
slack?: SlackIntegrationPropertiesInput;
teams?: MicrosoftTeamsIntegrationPropertiesInput;
email?: EmailIntegrationPropertiesInput;
}
Example: Webhook on Completion
const workflow = await graphlit.createWorkflow({
name: 'Workflow with Webhook',
preparation: { /* ... */ },
actions: [{
connector: {
type: IntegrationServiceTypes.Webhook,
uri: 'https://api.example.com/webhook/content-processed'
}
}]
});
Complete API Reference
WorkflowInput (Top-Level)
interface WorkflowInput {
name: string; // Required
preparation?: PreparationWorkflowStageInput; // Optional
extraction?: ExtractionWorkflowStageInput; // Optional
enrichment?: EnrichmentWorkflowStageInput; // Optional
indexing?: IndexingWorkflowStageInput; // Optional
ingestion?: IngestionWorkflowStageInput; // Optional
storage?: StorageWorkflowStageInput; // Optional
classification?: ClassificationWorkflowStageInput; // Optional
actions?: Array<WorkflowActionInput>; // Optional
}
All fields except name are optional. Graphlit provides intelligent defaults for missing stages.
Production Patterns
Pattern 1: Multi-Tenant Workflows
// Different workflows for different customer tiers
const workflows = {
free: await graphlit.createWorkflow({
name: 'Free Tier',
// Default preparation (Azure AI)
// No extraction
}),
pro: await graphlit.createWorkflow({
name: 'Pro Tier',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: { specification: { id: geminiFlashSpecId } } // Cheaper vision model
}
}]
}
}),
enterprise: await graphlit.createWorkflow({
name: 'Enterprise Tier',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: { specification: { id: gpt4oSpecId } } // Best quality
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: { specification: { id: claudeSpecId } }
}
}]
}
})
};
// Use workflow based on user tier
const workflowId = user.tier === 'enterprise'
? workflows.enterprise.createWorkflow.id
: user.tier === 'pro'
? workflows.pro.createWorkflow.id
: workflows.free.createWorkflow.id;
await graphlit.ingestUri(uri, undefined, undefined, undefined, true, { id: workflowId });
Pattern 2: Conditional Workflows (File Type Based)
// Different workflows for different file types
async function ingestWithConditionalWorkflow(uri: string, fileType: FileTypes) {
let workflowId: string | undefined;
if (fileType === FileTypes.Pdf || fileType === FileTypes.Document) {
// Complex PDFs get vision model
workflowId = pdfVisionWorkflowId;
} else if (fileType === FileTypes.Audio || fileType === FileTypes.Video) {
// Audio/video gets Deepgram
workflowId = audioTranscriptionWorkflowId;
} else {
// Everything else uses default
workflowId = undefined;
}
await graphlit.ingestUri(uri, undefined, undefined, undefined, true, workflowId ? { id: workflowId } : undefined);
}
Pattern 3: Zine Production Pattern
What Zine uses:
// Single workflow for all content
const workflow = await graphlit.createWorkflow({
name: 'Zine Production Workflow',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
modelDocument: { specification: { id: gpt4oSpecId } }
}
}]
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
modelText: { specification: { id: claudeSpecId } },
extractedTypes: [
ObservableTypes.Person,
ObservableTypes.Organization,
ObservableTypes.Label
]
}
}]
},
enrichment: {
link: {
extractUri: true,
extractEmail: true
}
}
});
// Applied to all feeds
await graphlit.createFeed({
name: 'Slack Feed',
type: FeedTypes.Slack,
workflow: { id: workflow.createWorkflow.id },
// ...
});
Key lessons from Zine:
Single workflow for simplicity
Vision model preparation (complex Slack attachments)
Entity extraction for knowledge graph
Link extraction for context
Pattern 4: Cost Optimization
// Use cheapest options that meet quality requirements
const costOptimizedWorkflow = await graphlit.createWorkflow({
name: 'Cost Optimized',
preparation: {
jobs: [{
connector: {
type: FilePreparationServiceTypes.ModelDocument,
fileTypes: [FileTypes.Pdf], // Only complex PDFs
modelDocument: { specification: { id: geminiFlashSpecId } } // Cheapest vision model
}
}],
// Default Azure AI handles everything else (much cheaper)
},
extraction: {
jobs: [{
connector: {
type: EntityExtractionServiceTypes.ModelText,
fileTypes: [FileTypes.Pdf, FileTypes.Document], // Only extract from documents
modelText: {
specification: { id: haikuSpecId }, // Cheapest LLM
tokenThreshold: 500 // Skip short sections (saves money)
},
extractedTypes: [ObservableTypes.Person, ObservableTypes.Organization], // Only essential types
extractedCount: 50 // Limit entities per type
}
}]
}
});
Summary
Key Takeaways:
Most workflows are optional - Graphlit has intelligent defaults
Default preparation is Azure AI Document Intelligence - works for 80%+ of documents
No extraction by default - add extraction stage to build knowledge graph
Vision models (MODEL_DOCUMENT) are 10x more expensive - only use for complex PDFs
Audio transcription runs by default (Deepgram Nova 2) - add a workflow for speaker diarization or Assembly.AI
All stages are composable - mix and match as needed
When in doubt: Start without a workflow, add stages only when you hit limitations.
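In practice that incremental path is short. A sketch, assuming the Reducto connector works with default properties (per the preparation table above, it is the first upgrade to try):
// Step 1: no workflow - default Azure AI preparation and default indexing
await graphlit.ingestUri(pdfUrl);
// Step 2: only if extraction quality disappoints, add a specialized preparer
const workflow = await graphlit.createWorkflow({
  name: 'Better PDF Extraction',
  preparation: {
    jobs: [{
      connector: { type: FilePreparationServiceTypes.Reducto }
    }]
  }
});
await graphlit.ingestUri(pdfUrl, undefined, undefined, undefined, true, {
  id: workflow.createWorkflow.id
});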
Related Documentation:
Specifications → - Configure LLMs and embedding models
Key Concepts → - High-level overview
API Guides: Workflows → - Code examples