Create Extraction Workflow

User Intent

"I want to extract entities (people, organizations, topics) from my documents"

Operation

  • SDK Method: graphlit.createWorkflow() with extraction stage

  • GraphQL: createWorkflow mutation

  • Entity Type: Workflow

  • Common Use Cases: Entity extraction, knowledge graph building, document enrichment

TypeScript (Canonical)

import { Graphlit } from 'graphlit-client';
import {
  AnthropicModels,
  EntityExtractionServiceTypes,
  ModelServiceTypes,
  ObservableTypes,
  SpecificationTypes,
  WorkflowInput
} from 'graphlit-client/dist/generated/graphql-types';

const graphlit = new Graphlit();

// Step 1: Create specification for extraction model (optional but recommended)
const specificationResponse = await graphlit.createSpecification({
  name: 'Claude Sonnet 3.7 for Extraction',
  type: SpecificationTypes.Extraction,
  serviceType: ModelServiceTypes.Anthropic,
  anthropic: {
    model: AnthropicModels.Claude_3_7Sonnet
  }
});

const specId = specificationResponse.createSpecification.id;

// Step 2: Create extraction workflow
const workflowInput: WorkflowInput = {
  name: 'Entity Extraction',
  extraction: {
    jobs: [{
      connector: {
        type: EntityExtractionServiceTypes.ModelText,
        modelText: {
          specification: { id: specId }
        }
      }
    }]
  }
};

const response = await graphlit.createWorkflow(workflowInput);
const workflowId = response.createWorkflow.id;

console.log(`Workflow created: ${workflowId}`);

// Step 3: Use workflow during content ingestion
const contentResponse = await graphlit.ingestUri(
  'https://example.com/document.pdf',
  undefined,  // name
  undefined,  // id
  undefined,  // identifier
  true,       // isSynchronous
  { id: workflowId }  // workflow
);

const contentId = contentResponse.ingestUri.id;

// Step 4: Query extracted entities
const entitiesResponse = await graphlit.queryObservables({
  observableTypes: [
    ObservableTypes.Person,
    ObservableTypes.Organization,
    ObservableTypes.Label
  ]
});

console.log(`Extracted ${entitiesResponse.observables.results.length} entities`);

Python

from graphlit import Graphlit
from graphlit_api import input_types
from graphlit_api.enums import (
    AnthropicModels,
    EntityExtractionServiceTypes,
    ModelServiceTypes,
    SpecificationTypes,
)

graphlit = Graphlit()

# Create specification
spec_response = await graphlit.client.create_specification(
    input_types.SpecificationInput(
        name="Claude Sonnet 3.7 for Extraction",
        type=SpecificationTypes.EXTRACTION,
        service_type=ModelServiceTypes.ANTHROPIC,
        anthropic=input_types.AnthropicModelPropertiesInput(
            model=AnthropicModels.CLAUDE_3_7_SONNET
        ),
    )
)

# Create extraction workflow (snake_case)
workflow_input = input_types.WorkflowInput(
    name="Entity Extraction",
    extraction=input_types.ExtractionWorkflowStageInput(
        jobs=[
            input_types.ExtractionWorkflowJobInput(
                connector=input_types.EntityExtractionConnectorInput(
                    type=EntityExtractionServiceTypes.MODEL_TEXT,
                    model_text=input_types.ModelTextExtractionPropertiesInput(
                        specification=input_types.EntityReferenceInput(
                            id=spec_response.create_specification.id
                        )
                    ),
                )
            )
        ]
    ),
)

response = await graphlit.client.create_workflow(workflow_input)
workflow_id = response.create_workflow.id

Parameters

WorkflowInput (Required)

  • name (string): Workflow name

  • extraction (ExtractionWorkflowStageInput): Extraction configuration

ExtractionWorkflowStageInput

  • jobs (ExtractionWorkflowJobInput[]): Array of extraction jobs

    • Multiple jobs can run in parallel

ExtractionWorkflowJobInput

  • connector (EntityExtractionConnectorInput): Extraction connector configuration

EntityExtractionConnectorInput

  • type (EntityExtractionServiceTypes): Extraction service type

    • MODEL_TEXT - LLM-based extraction (recommended)

    • AZURE_DOCUMENT_INTELLIGENCE - Azure OCR + extraction

  • modelText (ModelTextExtractionPropertiesInput): LLM extraction config

    • specification (EntityReferenceInput): Reference to extraction specification

    • observables (ObservableTypes[]): Types to extract (optional)

      • PERSON, ORGANIZATION, PLACE, PRODUCT, EVENT, TOPIC, etc.

    • customTypes (string[]): Custom entity types (optional)

Response
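
The canonical example reads the new workflow's ID from response.createWorkflow.id. The following is a minimal sketch of handling that result defensively, assuming the mutation result can be missing. CreateWorkflowResult is a local stand-in type and requireWorkflowId is a hypothetical helper; neither is part of the SDK.

```typescript
// Local stand-in for the shape read in the canonical example
// (response.createWorkflow.id); not the SDK's generated type.
interface CreateWorkflowResult {
  createWorkflow?: { id: string } | null;
}

// Hypothetical helper: return the workflow ID or fail loudly.
function requireWorkflowId(response: CreateWorkflowResult): string {
  const id = response.createWorkflow?.id;
  if (!id) {
    throw new Error('createWorkflow returned no workflow ID');
  }
  return id;
}
```

Failing early here keeps an undefined ID from reaching ingestUri(), where the workflow would be silently skipped.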

Developer Hints

Workflow vs Direct Extraction

During Ingestion (recommended):

After Ingestion:

Important: Workflows are applied during content ingestion, not retroactively to existing content.
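
Because the workflow only takes effect at ingestion time, the workflow reference has to travel with the ingestUri() call. The sketch below wraps that call; IngestClient is a local stand-in whose positional parameters follow the canonical example, and ingestWithWorkflow is a hypothetical helper. Re-ingesting an existing URI with the workflow attached is an assumption about how to cover content that was ingested earlier, not documented behavior.

```typescript
// Local stand-in for the one SDK call used here; the positional
// parameters follow the ingestUri call in the canonical example.
interface IngestClient {
  ingestUri(
    uri: string,
    name?: string,
    id?: string,
    identifier?: string,
    isSynchronous?: boolean,
    workflow?: { id: string }
  ): Promise<{ ingestUri?: { id: string } | null }>;
}

// Hypothetical helper: ingest a URI with the extraction workflow attached.
async function ingestWithWorkflow(
  client: IngestClient,
  uri: string,
  workflowId: string
): Promise<string> {
  const response = await client.ingestUri(
    uri,
    undefined,          // name
    undefined,          // id
    undefined,          // identifier
    true,               // isSynchronous: wait for ingestion to complete
    { id: workflowId }  // workflow: without this, no extraction stage runs
  );
  const contentId = response.ingestUri?.id;
  if (!contentId) {
    throw new Error(`Ingestion of ${uri} returned no content ID`);
  }
  return contentId;
}
```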

Default vs Custom Entity Types

Default Observable Types (automatically extracted):

  • PERSON - People, names

  • ORGANIZATION - Companies, institutions

  • PLACE - Locations, addresses

  • PRODUCT - Products, brands

  • EVENT - Events, happenings

  • TOPIC - Topics, concepts

Custom Types:
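
A minimal sketch of a connector that adds custom types through the customTypes field listed under Parameters. The interfaces are local stand-ins for the SDK's generated types, withCustomTypes is a hypothetical helper, and the type names are illustrative only.

```typescript
// Local stand-ins mirroring the fields listed under Parameters;
// real code would use the SDK's generated types and enums instead.
interface ModelTextProps {
  specification?: { id: string };
  observables?: string[];
  customTypes?: string[];
}

interface ExtractionConnector {
  type: 'MODEL_TEXT' | 'AZURE_DOCUMENT_INTELLIGENCE';
  modelText?: ModelTextProps;
}

// Hypothetical helper: add domain-specific types to an LLM extraction connector.
function withCustomTypes(specId: string, customTypes: string[]): ExtractionConnector {
  // Trim, drop empty names, and de-duplicate so each type is listed once.
  const unique = customTypes
    .map((t) => t.trim())
    .filter((t, i, all) => t.length > 0 && all.indexOf(t) === i);
  return {
    type: 'MODEL_TEXT',
    modelText: { specification: { id: specId }, customTypes: unique },
  };
}

// Illustrative type names; 'spec-id-placeholder' stands in for a real specification ID.
const legalConnector = withCustomTypes('spec-id-placeholder', [
  'Contract Clause',
  'Court Case',
  ' Contract Clause ',
]);
```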

Choosing Extraction Model

Best Models for Extraction:

  • Claude Sonnet 3.7 - Best accuracy, higher cost

  • GPT-4o - Good balance of speed/accuracy

  • Claude Haiku 3.5 - Fast, lower cost

Multi-Job Extraction
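
The jobs array accepts more than one extraction job, and the Parameters section notes that multiple jobs can run in parallel. The sketch below pairs each job with its own specification and its own set of observable types. ExtractionJob is a local stand-in type, modelTextJob is a hypothetical helper, and the specification IDs are placeholders.

```typescript
// Local stand-in for ExtractionWorkflowJobInput, limited to the fields used here.
interface ExtractionJob {
  connector: {
    type: 'MODEL_TEXT';
    modelText: { specification: { id: string }; observables: string[] };
  };
}

// Hypothetical helper: one LLM-based job bound to a specification and a set of types.
function modelTextJob(specId: string, observables: string[]): ExtractionJob {
  return {
    connector: {
      type: 'MODEL_TEXT',
      modelText: { specification: { id: specId }, observables },
    },
  };
}

// Placeholder specification IDs; create real ones with createSpecification().
const multiJobWorkflow = {
  name: 'Multi-Job Extraction',
  extraction: {
    jobs: [
      modelTextJob('fast-model-spec-id', ['PERSON', 'ORGANIZATION']),
      modelTextJob('accurate-model-spec-id', ['PRODUCT', 'EVENT']),
    ],
  },
};
```

Splitting types across jobs lets a cheaper model handle common entities while a more accurate model handles the harder ones.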

Variations

1. Basic Entity Extraction

Simplest extraction workflow:
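
As a minimal sketch, the simplest workflow is a single MODEL_TEXT job. It assumes the specification can be omitted (the canonical example calls it optional but recommended), which leaves the choice of model to the platform. BasicWorkflow is a local stand-in for the SDK's WorkflowInput, and basicExtractionWorkflow is a hypothetical helper.

```typescript
// Local stand-in for the subset of WorkflowInput used here. With the SDK,
// use EntityExtractionServiceTypes.ModelText instead of the string literal.
interface BasicWorkflow {
  name: string;
  extraction: { jobs: { connector: { type: 'MODEL_TEXT' } }[] };
}

// Hypothetical helper: one LLM-based extraction job, no specification.
function basicExtractionWorkflow(name: string): BasicWorkflow {
  return {
    name,
    extraction: { jobs: [{ connector: { type: 'MODEL_TEXT' } }] },
  };
}

const basicWorkflow = basicExtractionWorkflow('Basic Entity Extraction');
// Pass basicWorkflow to graphlit.createWorkflow() as in the canonical example.
```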

2. Extract Specific Entity Types

Target only specific entity types:
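
A sketch of restricting extraction with the observables field listed under Parameters. The ObservableName union is a local stand-in for the SDK's ObservableTypes enum and covers only the types named on this page; extractOnly is a hypothetical helper.

```typescript
// Observable types named under Parameters; a local string union replaces
// the SDK's ObservableTypes enum so the sketch stands alone.
type ObservableName = 'PERSON' | 'ORGANIZATION' | 'PLACE' | 'PRODUCT' | 'EVENT' | 'TOPIC';

interface TargetedWorkflow {
  name: string;
  extraction: {
    jobs: {
      connector: {
        type: 'MODEL_TEXT';
        modelText: { specification: { id: string }; observables: ObservableName[] };
      };
    }[];
  };
}

// Hypothetical helper: extract only the listed observable types.
function extractOnly(specId: string, observables: ObservableName[]): TargetedWorkflow {
  if (observables.length === 0) {
    throw new Error('List at least one observable type to extract');
  }
  return {
    name: `Extract ${observables.join(', ')}`,
    extraction: {
      jobs: [{
        connector: {
          type: 'MODEL_TEXT',
          modelText: { specification: { id: specId }, observables },
        },
      }],
    },
  };
}

// 'spec-id-placeholder' stands in for a real specification ID.
const peopleAndOrgs = extractOnly('spec-id-placeholder', ['PERSON', 'ORGANIZATION']);
```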

3. Custom Entity Types

Domain-specific entity extraction:

4. Medical/Scientific Entity Extraction

Healthcare-specific entities:

5. Combined Preparation + Extraction

Workflow with both preparation and extraction:
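
A sketch of a workflow that carries both stages. This page does not describe the preparation stage, so its shape here (jobs, connector, a MODEL_DOCUMENT service type, and a modelDocument property) is an assumption that it mirrors the extraction stage; verify it against the SDK's WorkflowInput type before use. CombinedWorkflow is a local stand-in type and prepareThenExtract is a hypothetical helper.

```typescript
// ASSUMPTION: the preparation stage shape and the 'MODEL_DOCUMENT' service
// type are not documented on this page; check the SDK types before relying on them.
interface CombinedWorkflow {
  name: string;
  preparation: {
    jobs: {
      connector: { type: 'MODEL_DOCUMENT'; modelDocument: { specification: { id: string } } };
    }[];
  };
  extraction: {
    jobs: {
      connector: { type: 'MODEL_TEXT'; modelText: { specification: { id: string } } };
    }[];
  };
}

// Hypothetical helper: prepare documents with one model, then extract entities with another.
function prepareThenExtract(prepSpecId: string, extractSpecId: string): CombinedWorkflow {
  return {
    name: 'Preparation + Extraction',
    preparation: {
      jobs: [{
        connector: { type: 'MODEL_DOCUMENT', modelDocument: { specification: { id: prepSpecId } } },
      }],
    },
    extraction: {
      jobs: [{
        connector: { type: 'MODEL_TEXT', modelText: { specification: { id: extractSpecId } } },
      }],
    },
  };
}

// Placeholder specification IDs.
const combinedWorkflow = prepareThenExtract('vision-spec-id', 'extraction-spec-id');
```

Using a vision-capable model for preparation matches the advice under Common Issues for image-heavy PDFs.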

6. Azure Document Intelligence Extraction

Use Azure for OCR + extraction:

Common Issues

Issue: No entities extracted from content
Solution: Make sure the content contains meaningful text, and check that the specification's model suits the content. Image-heavy PDFs need a vision-capable model.

Issue: "Specification not found" error
Solution: Create the specification before creating the workflow, and verify that the specification ID is correct.

Issue: Wrong entity types extracted
Solution: Use the observables parameter to list the exact types to extract, and add customTypes for domain-specific entities.

Issue: Extraction is too slow
Solution: Use a faster model (Claude Haiku, GPT-4o-mini) or reduce the content size.

Issue: Workflow created but not applied
Solution: Pass the workflow to ingestUri() when ingesting content. Workflows do not apply retroactively to existing content.

Production Example

Complete extraction pipeline:
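
One way to assemble the steps from the canonical example into a batch pipeline is sketched below: create the workflow once, ingest each URI with it, and collect failures instead of aborting. PipelineClient is a local stand-in for the two SDK calls, with signatures taken from the canonical example; PipelineResult and runExtractionPipeline are hypothetical names.

```typescript
// Local stand-in for the two SDK calls used below.
interface PipelineClient {
  createWorkflow(input: { name: string; extraction: unknown }): Promise<{ createWorkflow?: { id: string } | null }>;
  ingestUri(
    uri: string, name?: string, id?: string, identifier?: string,
    isSynchronous?: boolean, workflow?: { id: string }
  ): Promise<{ ingestUri?: { id: string } | null }>;
}

interface PipelineResult {
  workflowId: string;
  ingested: { uri: string; contentId: string }[];
  failed: { uri: string; reason: string }[];
}

// Hypothetical helper: one workflow, many documents, failures reported per URI.
async function runExtractionPipeline(
  client: PipelineClient,
  specId: string,
  uris: string[]
): Promise<PipelineResult> {
  const created = await client.createWorkflow({
    name: 'Entity Extraction',
    extraction: {
      jobs: [{ connector: { type: 'MODEL_TEXT', modelText: { specification: { id: specId } } } }],
    },
  });
  const workflowId = created.createWorkflow?.id;
  if (!workflowId) throw new Error('Workflow creation returned no ID');

  const result: PipelineResult = { workflowId, ingested: [], failed: [] };
  for (const uri of uris) {
    try {
      const response = await client.ingestUri(uri, undefined, undefined, undefined, true, { id: workflowId });
      const contentId = response.ingestUri?.id;
      if (!contentId) throw new Error('no content ID returned');
      result.ingested.push({ uri, contentId });
    } catch (error) {
      // Record the failure and continue so one bad document does not stop the batch.
      result.failed.push({ uri, reason: error instanceof Error ? error.message : String(error) });
    }
  }
  return result;
}
```

Ingesting sequentially keeps the sketch simple; a production system would likely add concurrency limits and retries.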
