# Create Extraction Workflow

## Workflow: Create Extraction Workflow

### User Intent

"I want to extract entities (people, organizations, topics) from my documents"

### Operation

* **SDK Method**: `graphlit.createWorkflow()` with extraction stage
* **GraphQL**: `createWorkflow` mutation
* **Entity Type**: Workflow
* **Common Use Cases**: Entity extraction, knowledge graph building, document enrichment

### TypeScript (Canonical)

```typescript
import { Graphlit } from 'graphlit-client';
import {
  AnthropicModels,
  EntityExtractionServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  SpecificationTypes,
  WorkflowInput
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Step 1: Create specification for extraction model (optional but recommended)
const specificationResponse = await graphlit.createSpecification({
  name: 'Claude Sonnet 3.7 for Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: {
    model: AnthropicModels.Claude_3_7Sonnet
  }
});

const specId = specificationResponse.createSpecification.id;

// Step 2: Create extraction workflow
const workflowInput: WorkflowInput = {
  name: 'Entity Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId }
        }
      }
    }]
  }
};

const response = await graphlit.createWorkflow(workflowInput);
const workflowId = response.createWorkflow.id;

console.log(`Workflow created: ${workflowId}`);

// Step 3: Use workflow during content ingestion
const contentResponse = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,  // name
  undefined,  // id
  undefined,  // identifier
  true,       // isSynchronous
  { id: workflowId }  // workflow
);

const contentId = contentResponse.ingestUri.id;

// Step 4: Query extracted entities
const entitiesResponse = await graphlit.queryObservables({
  observableTypes: [
    ObservableTypes.Person,
    ObservableTypes.Organization,
    ObservableTypes.Label
  ]
});

console.log(`Extracted ${entitiesResponse.observables.results.length} entities`);
```

**Python**:
```python
# Create specification
spec_response = await graphlit.createSpecification(
    input_types.SpecificationInput(
        name="Claude Sonnet 3.7 for Extraction",
        type=SpecificationTypes.Extraction,
        service_type=ModelServiceTypes.Anthropic,
        anthropic=input_types.AnthropicModelPropertiesInput(
            model=AnthropicModels.Claude3_7Sonnet
        )
    )
)

# Create extraction workflow (snake_case)
workflow_input = input_types.WorkflowInput(
    name="Entity Extraction",
    extraction=input_types.ExtractionWorkflowStageInput(
        jobs=[
            input_types.ExtractionWorkflowJobInput(
                connector=input_types.EntityExtractionConnectorInput(
                    type=EntityExtractionServiceTypes.ModelText,
                    model_text=input_types.ModelTextExtractionPropertiesInput(
                        specification=input_types.EntityReferenceInput(
                            id=spec_response.create_specification.id
                        )
                    )
                )
            )
        ]
    )
)

response = await graphlit.createWorkflow(workflow_input)
workflow_id = response.create_workflow.id
```

**C#**:
```csharp
using Graphlit;

var graphlit = new Graphlit();

// Create specification
var specResponse = await graphlit.CreateSpecification(new SpecificationInput {
    Name = "Claude Sonnet 3.7 for Extraction",
    Type = SpecificationTypes.Extraction,
    ServiceType = ModelServiceTypes.Anthropic,
    Anthropic = new AnthropicModelPropertiesInput {
        Model = AnthropicModels.Claude_3_7Sonnet
    }
});

// Create extraction workflow (PascalCase)
var workflowInput = new WorkflowInput {
    Name = "Entity Extraction",
    Extraction = new ExtractionWorkflowStageInput {
        Jobs = new[] {
            new ExtractionWorkflowJobInput {
                Connector = new EntityExtractionConnectorInput {
                    Type = EntityExtractionServiceTypes.ModelText,
                    ModelText = new ModelTextExtractionPropertiesInput {
                        Specification = new EntityReferenceInput {
                            Id = specResponse.CreateSpecification.Id
                        }
                    }
                }
            }
        }
    }
};

var response = await graphlit.CreateWorkflow(workflowInput);
var workflowId = response.CreateWorkflow.Id;
```

### Parameters

#### WorkflowInput (Required)

* **`name`** (string): Workflow name
* **`extraction`** (ExtractionWorkflowStageInput): Extraction configuration

#### ExtractionWorkflowStageInput

* **`jobs`** (ExtractionWorkflowJobInput\[]): Array of extraction jobs
  * Multiple jobs can run in parallel

#### ExtractionWorkflowJobInput

* **`connector`** (EntityExtractionConnectorInput): Extraction connector configuration

#### EntityExtractionConnectorInput

* **`type`** (EntityExtractionServiceTypes): Extraction service type
  * `MODEL_TEXT` - LLM-based extraction (recommended)
  * `AZURE_DOCUMENT_INTELLIGENCE` - Azure OCR + extraction
* **`modelText`** (ModelTextExtractionPropertiesInput): LLM extraction config
  * **`specification`** (EntityReferenceInput): Reference to extraction specification
  * **`observables`** (ObservableTypes\[]): Types to extract (optional)
    * `PERSON`, `ORGANIZATION`, `PLACE`, `PRODUCT`, `EVENT`, `TOPIC`, etc.
  * **`customTypes`** (string\[]): Custom entity types (optional)

### Response

```typescript
{
  createWorkflow: {
    id: string;                           // Workflow ID
    name: string;                         // Workflow name
    state: EntityState;                   // ENABLED
    extraction: {
      jobs: ExtractionWorkflowJob[];
    }
  }
}
```
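
Because the workflow ID is needed for every later ingestion call, it can be worth failing early when the response does not carry one. A minimal sketch, assuming only the response shape shown above; `requireWorkflowId` is a hypothetical helper, not part of the SDK.

```typescript
// Loose view of the response shape shown above; every field is treated as optional.
interface CreateWorkflowResponse {
  createWorkflow?: { id?: string | null; name?: string | null; state?: string | null } | null;
}

// Hypothetical helper: returns the workflow ID or throws a descriptive error.
function requireWorkflowId(response: CreateWorkflowResponse): string {
  const id = response.createWorkflow?.id;
  if (!id) {
    throw new Error('createWorkflow returned no workflow ID');
  }
  return id;
}

const exampleId = requireWorkflowId({
  createWorkflow: { id: 'workflow-1', name: 'Entity Extraction', state: 'ENABLED' }
});
```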

### Developer Hints

#### Workflow vs Direct Extraction

**During Ingestion** (recommended):

```typescript
// Workflow applied automatically during content ingestion
await graphlit.ingestUri(uri, undefined, undefined, undefined, true, { id: workflowId });
```

**After Ingestion**:

```typescript
// Extract from already-ingested content (not yet available in SDK)
// Currently workflows must be applied during ingestion
```

**Important**: Workflows are applied during content ingestion, not retroactively to existing content.
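
Since workflows are not retroactive, one way to enrich existing content is to ingest the source URIs again with the workflow attached. The sketch below assumes the original URIs are still available; how the service treats a URI it has already ingested is not covered on this page. `UriIngestClient` is a structural stand-in for the `ingestUri` call shown above, and `reingestWithWorkflow` is a hypothetical helper.

```typescript
// Structural stand-in for the client: only the ingestUri call used on this page.
interface UriIngestClient {
  ingestUri(
    uri: string,
    name?: string,
    id?: string,
    identifier?: string,
    isSynchronous?: boolean,
    workflow?: { id: string }
  ): Promise<{ ingestUri?: { id: string } | null }>;
}

// Hypothetical helper: ingests each URI with the workflow attached, one at a time,
// and collects failures instead of stopping at the first one.
async function reingestWithWorkflow(
  client: UriIngestClient,
  uris: string[],
  workflowId: string
): Promise<{ contentIds: string[]; failed: string[] }> {
  const contentIds: string[] = [];
  const failed: string[] = [];
  for (const uri of uris) {
    try {
      const response = await client.ingestUri(
        uri, undefined, undefined, undefined, true, { id: workflowId }
      );
      const contentId = response.ingestUri?.id;
      if (contentId) contentIds.push(contentId);
      else failed.push(uri);
    } catch {
      failed.push(uri);
    }
  }
  return { contentIds, failed };
}
```

Processing the URIs sequentially keeps the sketch simple; batching or limited concurrency may be preferable for large sets.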

#### Default vs Custom Entity Types

**Default Observable Types** (automatically extracted):

* `PERSON` - People, names
* `ORGANIZATION` - Companies, institutions
* `PLACE` - Locations, addresses
* `PRODUCT` - Products, brands
* `EVENT` - Events, happenings
* `TOPIC` - Topics, concepts

**Custom Types**:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Custom Entity Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId },
          customTypes: ['Contract', 'Regulation', 'Obligation', 'Risk']
        }
      }
    }]
  }
};
```
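
Custom type names are free-form strings, so a list assembled from configuration or user input may contain blanks or duplicates. A minimal sketch of tidying such a list before placing it in `customTypes`; `normalizeCustomTypes` is a hypothetical helper, and treating names that differ only by case as duplicates is an assumption made here, not documented service behavior.

```typescript
// Hypothetical helper: trims names, collapses inner whitespace, drops blanks,
// and removes case-insensitive duplicates while keeping the first spelling.
function normalizeCustomTypes(types: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of types) {
    const name = raw.trim().replace(/\s+/g, ' ');
    if (name.length === 0) continue;
    const key = name.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(name);
  }
  return result;
}

const tidyTypes = normalizeCustomTypes(['Contract', ' contract ', '', 'Payment  Term', 'Risk']);
// tidyTypes is ['Contract', 'Payment Term', 'Risk']
```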

#### Choosing Extraction Model

**Best Models for Extraction**:

* **Claude 3.7 Sonnet** - Best accuracy, higher cost
* **GPT-4o** - Good balance of speed and accuracy
* **Claude 3.5 Haiku** - Fast, lower cost

```typescript
// Claude Sonnet (highest accuracy)
const claudeSpec = await graphlit.createSpecification({
  name: 'Claude Sonnet Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: {
    model: AnthropicModels.Claude_3_7Sonnet
  }
});

// GPT-4o (good balance)
const gpt4oSpec = await graphlit.createSpecification({
  name: 'GPT-4o Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.OpenAi,
  openAI: {
    model: OpenAiModels.Gpt4O_128K
  }
});
```

#### Multi-Job Extraction

```typescript
// Run multiple extraction jobs in parallel
const workflowInput: WorkflowInput = {
  name: 'Multi-Job Extraction',
  extraction: {
    jobs: [
      {
        // Job 1: Extract entities with Claude
        connector: {
          type: EntityExtractionServiceTypes.ModelText,
          modelText: {
            specification: { id: claudeSpecId },
            observables: [
              ObservableTypes.Person,
              ObservableTypes.Organization
            ]
          }
        }
      },
      {
        // Job 2: Extract custom domain entities with GPT-4
        connector: {
          type: EntityExtractionServiceTypes.ModelText,
          modelText: {
            specification: { id: gpt4SpecId },
            customTypes: ['Contract', 'Obligation', 'Risk']
          }
        }
      }
    ]
  }
};
```
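
When types are split across several jobs, it can help to check that no type is requested by more than one job. Whether the service deduplicates such overlaps is not stated here, so the sketch below only reports them. `JobTypes` is a simplified view of a job's `modelText` settings and `findOverlappingTypes` is a hypothetical helper.

```typescript
// Simplified view of one job's requested types.
interface JobTypes {
  observables?: string[];
  customTypes?: string[];
}

// Returns every type name requested by more than one job, in first-seen order.
function findOverlappingTypes(jobs: JobTypes[]): string[] {
  const counts = new Map<string, number>();
  for (const job of jobs) {
    // De-duplicate within a job so a name repeated inside one job counts once.
    const names = new Set([...(job.observables ?? []), ...(job.customTypes ?? [])]);
    for (const name of names) {
      counts.set(name, (counts.get(name) ?? 0) + 1);
    }
  }
  return [...counts.entries()].filter(([, count]) => count > 1).map(([name]) => name);
}

const overlaps = findOverlappingTypes([
  { observables: ['PERSON', 'ORGANIZATION'] },
  { customTypes: ['Contract', 'Risk'] },
  { observables: ['PERSON'], customTypes: ['Risk'] }
]);
// overlaps is ['PERSON', 'Risk']
```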

### Variations

#### 1. Basic Entity Extraction

Simplest extraction workflow:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Basic Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId }
        }
      }
    }]
  }
};

const response = await graphlit.createWorkflow(workflowInput);
```

#### 2. Extract Specific Entity Types

Target only specific entity types:

```typescript
const workflowInput: WorkflowInput = {
  name: 'People and Orgs Only',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId },
          observables: [
            ObservableTypes.Person,
            ObservableTypes.Organization
          ]
        }
      }
    }]
  }
};
```

#### 3. Custom Legal Entity Extraction

Domain-specific entity extraction:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Legal Document Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId },
          customTypes: [
            'Contract',
            'Party',
            'Obligation',
            'Deadline',
            'Payment Term',
            'Jurisdiction',
            'Liability Clause',
            'Termination Clause'
          ]
        }
      }
    }]
  }
};
```

#### 4. Medical/Scientific Entity Extraction

Healthcare-specific entities:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Medical Entity Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId },
          customTypes: [
            'Disease',
            'Medication',
            'Symptom',
            'Diagnosis',
            'Treatment',
            'Dosage',
            'Medical Procedure',
            'Body Part'
          ]
        }
      }
    }]
  }
};
```

#### 5. Combined Preparation + Extraction

Workflow with both preparation and extraction:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Prepare and Extract',
  preparation: {
    jobs: [{
      connector: {
        type: ContentPreparationServiceTypes.ModelDocument,
        modelDocument: {
          specification: { id: visionSpecId }  // Vision model for PDFs
        }
      }
    }]
  },
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: extractionSpecId }
        }
      }
    }]
  }
};

// Preparation runs first, then extraction on prepared content
```

#### 6. Azure Document Intelligence Extraction

Use Azure for OCR + extraction:

```typescript
const workflowInput: WorkflowInput = {
  name: 'Azure Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.AzureDocumentIntelligence,
        azureDocument: {
          model: AzureDocumentIntelligenceModels.Layout
        }
      }
    }]
  }
};
```

### Common Issues

**Issue**: No entities extracted from content\
**Solution**: Ensure the content contains meaningful text and that the specification's model suits the material. Image-heavy PDFs need a vision model.

**Issue**: `Specification not found` error\
**Solution**: Create specification before creating workflow. Verify specification ID is correct.

**Issue**: Wrong entity types extracted\
**Solution**: Use `observables` parameter to specify exact types. Add `customTypes` for domain-specific entities.

**Issue**: Extraction too slow\
**Solution**: Use a faster model (such as Claude 3.5 Haiku or GPT-4o mini) or reduce the content size.

**Issue**: Workflow created but not applied\
**Solution**: Ensure workflow is passed during `ingestUri()`. Workflows don't apply retroactively.
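
For the "wrong entity types" case, a quick tally of what was actually extracted often shows whether the `observables` filter is needed. The sketch below assumes each result can be reduced to a `{ type, name }` pair; that shape is a simplification chosen here, so the real query results would need to be mapped into it first. `countByType` is a hypothetical helper.

```typescript
// Simplified entity shape assumed for this sketch.
interface ExtractedEntity {
  type: string; // e.g. 'PERSON'
  name: string;
}

// Hypothetical helper: counts extracted entities per type.
function countByType(entities: ExtractedEntity[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entity of entities) {
    counts[entity.type] = (counts[entity.type] ?? 0) + 1;
  }
  return counts;
}

const tally = countByType([
  { type: 'PERSON', name: 'Ada Lovelace' },
  { type: 'ORGANIZATION', name: 'Analytical Engines Ltd' },
  { type: 'PERSON', name: 'Charles Babbage' }
]);
// tally is { PERSON: 2, ORGANIZATION: 1 }
```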

### Production Example

**Complete extraction pipeline**:

```typescript
// 1. Create extraction specification
const spec = await graphlit.createSpecification({
  name: 'Claude Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: {
    model: AnthropicModels.Claude_3_7Sonnet
  }
});

// 2. Create extraction workflow
const workflow = await graphlit.createWorkflow({
  name: 'Entity Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: spec.createSpecification.id }
        }
      }
    }]
  }
});

// 3. Ingest with workflow
await graphlit.ingestUri(
  documentUri,
  undefined, undefined, undefined,
  true,
  { id: workflow.createWorkflow.id }
);

// 4. Query entities
const entities = await graphlit.queryObservables({
  observableTypes: [ObservableTypes.Person, ObservableTypes.Organization]
});

console.log(`Extracted: ${entities.observables.results.length} entities`);
```
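
In production it is often useful to know which step of the pipeline failed. The sketch below wraps steps 1 to 3 in one function and checks each returned ID before moving on. `PipelineClient` is a structural stand-in for the three client calls used above rather than the SDK's own type, and `runExtractionPipeline` is a hypothetical helper.

```typescript
// Structural stand-in for the three client calls used in the pipeline above.
interface PipelineClient {
  createSpecification(input: object): Promise<{ createSpecification?: { id: string } | null }>;
  createWorkflow(input: object): Promise<{ createWorkflow?: { id: string } | null }>;
  ingestUri(
    uri: string,
    name?: string,
    id?: string,
    identifier?: string,
    isSynchronous?: boolean,
    workflow?: { id: string }
  ): Promise<{ ingestUri?: { id: string } | null }>;
}

interface PipelineResult {
  specificationId: string;
  workflowId: string;
  contentId: string;
}

// Hypothetical wrapper: runs steps 1-3 and names the failing step in any error.
async function runExtractionPipeline(
  client: PipelineClient,
  documentUri: string,
  specificationInput: object
): Promise<PipelineResult> {
  const spec = await client.createSpecification(specificationInput);
  const specificationId = spec.createSpecification?.id;
  if (!specificationId) throw new Error('Step 1 failed: no specification ID returned');

  const workflow = await client.createWorkflow({
    name: 'Entity Extraction',
    extraction: {
      jobs: [{
        connector: {
          type: 'MODEL_TEXT', // EntityExtractionServiceTypes.ModelText in the SDK
          modelText: { specification: { id: specificationId } }
        }
      }]
    }
  });
  const workflowId = workflow.createWorkflow?.id;
  if (!workflowId) throw new Error('Step 2 failed: no workflow ID returned');

  const content = await client.ingestUri(
    documentUri, undefined, undefined, undefined, true, { id: workflowId }
  );
  const contentId = content.ingestUri?.id;
  if (!contentId) throw new Error('Step 3 failed: no content ID returned');

  return { specificationId, workflowId, contentId };
}
```

Passing the client in as a parameter also makes the pipeline easy to exercise with a fake client in tests.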


