Graphlit Platform

Extraction

Configure entity and content extraction.

Last updated 11 months ago

One of the core features of Graphlit is the knowledge graph. As content is ingested, text is extracted from documents, web pages, etc., and audio is transcribed, but there is hidden value in that text which can be unlocked.

By using entity extraction (aka named entity recognition), Graphlit can identify entities, i.e. people, places and things, and add relationships called "observations" that link the content to these observed entities.

In addition, with the advent of Large Multimodal Models (LMMs) like OpenAI GPT-4 Vision, Graphlit can read text from images, and generate textual descriptions and labels.

Learn more about observations.

Entity & Content Extraction

LLM-based extraction (i.e. entity extraction) incurs Graphlit credit usage, based on the number of LLM tokens processed. API-based extraction (i.e. text analytics) incurs Graphlit credit usage, based on the number of document pages or transcript segments.

Named Entities: Azure Cognitive Services Text Analytics

By configuring the extraction stage of the workflow, you can use Azure Cognitive Services Text Analytics to observe any entities in text from documents, web pages, or even audio transcripts.

You will want to assign AZURE_COGNITIVE_SERVICES_TEXT to the type parameter in the extraction connector to use Azure Cognitive Services Text Analytics.

Also, you can assign confidenceThreshold to set a lower bound of confidence for observations. If the confidence of the observed entity is below this threshold, no observation will be created.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureText {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "a898708e-db00-45a6-b659-1bc5b7bb4ac3",
  "name": "Extraction Stage",
  "state": "ENABLED"
}
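
To make the effect of confidenceThreshold concrete, here is a minimal local sketch in Python. Graphlit applies the threshold on the server during extraction; the entity dictionaries and the keep_confident helper below are hypothetical and exist only to illustrate the rule that entities below the threshold produce no observation.

```python
from typing import Dict, List


def keep_confident(entities: List[Dict], confidence_threshold: float) -> List[Dict]:
    """Return only the entities whose confidence is at or above the threshold."""
    return [e for e in entities if e["confidence"] >= confidence_threshold]


# Hypothetical candidate entities, invented for illustration only.
candidates = [
    {"name": "Ada Lovelace", "type": "PERSON", "confidence": 0.97},
    {"name": "London", "type": "PLACE", "confidence": 0.81},
    {"name": "Engine", "type": "PRODUCT", "confidence": 0.42},
]

# With a threshold of 0.8, the low-confidence "Engine" candidate is dropped.
observed = keep_confident(candidates, confidence_threshold=0.8)
print([e["name"] for e in observed])
```

A higher threshold trades recall for precision: fewer observations are created, but each one is more likely to be correct.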

You can optionally specify a list of desired extracted entity types, via the extractedTypes property. If this array is not assigned, all observed entities will be extracted.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          azureText {
            confidenceThreshold
            enablePII
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "confidenceThreshold": 0.8
            },
            "extractedTypes": [
              "PERSON",
              "PLACE",
              "ORGANIZATION"
            ]
          }
        }
      ]
    },
    "name": "Observed Entities"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "confidenceThreshold": 0.8
          },
          "extractedTypes": [
            "PERSON",
            "PLACE",
            "ORGANIZATION"
          ]
        }
      }
    ]
  },
  "id": "348594f4-ec99-44bb-9caa-96b2c8bc25cd",
  "name": "Observed Entities",
  "state": "ENABLED"
}
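
If you build these variables in code, a small helper can make the optional nature of extractedTypes explicit. The following Python sketch only constructs the Variables JSON shown above; the function name azure_text_workflow is hypothetical and not part of any Graphlit SDK.

```python
import json
from typing import Dict, List, Optional


def azure_text_workflow(
    name: str,
    confidence_threshold: float,
    extracted_types: Optional[List[str]] = None,
) -> Dict:
    """Build CreateWorkflow variables for an Azure text extraction job."""
    connector: Dict = {
        "type": "AZURE_COGNITIVE_SERVICES_TEXT",
        "azureText": {"confidenceThreshold": confidence_threshold},
    }
    # Leave extractedTypes out entirely when no filter is wanted,
    # so that all observed entity types are extracted.
    if extracted_types:
        connector["extractedTypes"] = list(extracted_types)
    return {
        "workflow": {
            "name": name,
            "extraction": {"jobs": [{"connector": connector}]},
        }
    }


filtered = azure_text_workflow(
    "Observed Entities", 0.8, ["PERSON", "PLACE", "ORGANIZATION"]
)
unfiltered = azure_text_workflow("Extraction Stage", 0.8)
print(json.dumps(filtered, indent=2))
```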

Named Entities: LLMs

By configuring the extraction stage of the workflow, you can use LLMs, such as OpenAI GPT-4, to observe any entities in text from documents, web pages, or even audio transcripts.

LLM extraction accepts an optional specification that selects which LLM (and, optionally, which API key) to use.

If a specification is not assigned, Graphlit will use the OpenAI GPT-4o 128k model by default.

When you do assign a specification, it must be an EXTRACTION specification; set this via the type parameter when creating the specification object.

Optional: LLM Extraction Specification

Here is an example of creating an extraction specification, using OpenAI GPT-4 Turbo (the GPT4_TURBO_128K_1106 model). Note how type is set to EXTRACTION, which is different from the default COMPLETION type used for conversations.

Mutation:

mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}

Variables:

{
  "specification": {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "openAI": {
      "model": "GPT4_TURBO_128K_1106",
      "temperature": 0.1,
      "probability": 0.2
    },
    "name": "LLM Entity Extraction"
  }
}

Response:

{
  "type": "EXTRACTION",
  "serviceType": "OPEN_AI",
  "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07",
  "name": "LLM Entity Extraction",
  "state": "ENABLED"
}

You can assign MODEL_TEXT to the type parameter in the extraction connector to use an LLM for entity extraction. Here we are assigning the custom GPT-4 specification we created, but that can be skipped to use the default.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          modelText {
            specification {
              id
            }
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "MODEL_TEXT",
            "modelText": {
              "specification": {
                "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07"
              }
            }
          }
        }
      ]
    },
    "name": "LLM Entity Extraction"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "MODEL_TEXT",
          "modelText": {
            "specification": {
              "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07"
            }
          }
        }
      }
    ]
  },
  "id": "17c8820e-c100-4109-bba6-590b4dad9ce5",
  "name": "LLM Entity Extraction",
  "state": "ENABLED"
}
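
The two steps above (create the specification, then reference its id from the workflow) can be chained in client code. This Python sketch assumes the response object has the flat shape shown in the Response examples on this page, and assumes that omitting modelText is how the default model is selected; both helper names are hypothetical.

```python
from typing import Dict, Optional


def specification_id_for_extraction(spec_response: Dict) -> str:
    """Return the specification id, rejecting non-EXTRACTION specifications."""
    if spec_response.get("type") != "EXTRACTION":
        raise ValueError("LLM extraction needs a specification of type EXTRACTION")
    return spec_response["id"]


def llm_extraction_workflow(name: str, specification_id: Optional[str] = None) -> Dict:
    """Build CreateWorkflow variables for a MODEL_TEXT extraction job."""
    connector: Dict = {"type": "MODEL_TEXT"}
    # Assumption: leaving out modelText falls back to the default model.
    if specification_id is not None:
        connector["modelText"] = {"specification": {"id": specification_id}}
    return {
        "workflow": {
            "name": name,
            "extraction": {"jobs": [{"connector": connector}]},
        }
    }


# Response from the CreateSpecification example above.
spec_response = {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07",
    "name": "LLM Entity Extraction",
    "state": "ENABLED",
}

custom = llm_extraction_workflow(
    "LLM Entity Extraction", specification_id_for_extraction(spec_response)
)
default = llm_extraction_workflow("LLM Entity Extraction")
```

Checking the specification type on the client catches the common mistake of passing a COMPLETION specification before the mutation is sent.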

PII Categorization: Azure Cognitive Services Text Analytics

When using Azure Cognitive Services Text Analytics, you can optionally set the enablePII property to true to categorize content by the Personally Identifiable Information (PII) it contains. For example, if a credit card number is recognized in the text, Graphlit will assign the category "Credit Card Number" to the content.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureText {
            enablePII
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "enablePII": true
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "enablePII": true
          }
        }
      }
    ]
  },
  "id": "24452cd9-4fcc-42bb-9609-84e85519cbbd",
  "name": "Extraction Stage",
  "state": "ENABLED"
}
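
Because enablePII lives under azureText, it is only meaningful on jobs whose connector type is AZURE_COGNITIVE_SERVICES_TEXT. The Python sketch below is a hypothetical client-side check, not part of the Graphlit API; it uses the pairings of connector type and options object that appear in the examples on this page to flag mismatches before a mutation is sent.

```python
from typing import Dict, List

# Connector type -> options object, as paired in the examples on this page.
OPTIONS_KEY_BY_TYPE = {
    "AZURE_COGNITIVE_SERVICES_TEXT": "azureText",
    "AZURE_COGNITIVE_SERVICES_IMAGE": "azureImage",
    "MODEL_TEXT": "modelText",
    "OPEN_AI_IMAGE": "openAIImage",
}


def check_connectors(variables: Dict) -> List[str]:
    """Return a message for every options object that does not fit its connector type."""
    problems: List[str] = []
    jobs = variables["workflow"]["extraction"]["jobs"]
    for index, job in enumerate(jobs):
        connector = job["connector"]
        expected = OPTIONS_KEY_BY_TYPE.get(connector["type"])
        for key in OPTIONS_KEY_BY_TYPE.values():
            if key in connector and key != expected:
                problems.append(
                    f"job {index}: '{key}' does not match connector type {connector['type']}"
                )
    return problems


pii_ok = {
    "workflow": {
        "name": "Extraction Stage",
        "extraction": {
            "jobs": [
                {
                    "connector": {
                        "type": "AZURE_COGNITIVE_SERVICES_TEXT",
                        "azureText": {"enablePII": True},
                    }
                }
            ]
        },
    }
}

# enablePII attached to an LLM connector by mistake.
pii_misplaced = {
    "workflow": {
        "name": "Extraction Stage",
        "extraction": {
            "jobs": [
                {"connector": {"type": "MODEL_TEXT", "azureText": {"enablePII": True}}}
            ]
        },
    }
}

print(check_connectors(pii_ok))
print(check_connectors(pii_misplaced))
```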

Image Labeling, Text Extraction, and Descriptions

Image content can be analyzed using AI models to identify visual objects, as well as labels that apply to the entire image.

For these observations, Graphlit will assign a label to the content, which contains a bounding box (in pixel coordinates) of where the objects or labels were observed.

We can use Azure Cognitive Services Image Analytics to generate labels from images.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureImage {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
            "azureImage": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
          "azureImage": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "249644aa-ee2a-4905-9955-585a4d6540e6",
  "name": "Extraction Stage",
  "state": "ENABLED"
}

In addition to visual object labeling, Azure Cognitive Services Image Analytics can be used for text extraction from images. If any text is visible in the image, it will be extracted into the content text property, and made searchable via semantic search.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          azureImage {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
            "azureImage": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Image Text Extraction"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
          "azureImage": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "3f860a36-15a5-4bae-bd74-c1b579c0cd4d",
  "name": "Image Text Extraction",
  "state": "ENABLED"
}

When an image is analyzed with the OpenAI GPT-4 Vision model, any text visible in the image is extracted into the content text property and made searchable via semantic search. A detailed description of the image is extracted into the content description property, which is also made searchable via semantic search.

The GPT-4 Vision model will also attempt to generate labels, which are assigned as observations on the analyzed content.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          openAIImage {
            detailLevel
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "OPEN_AI_IMAGE",
            "openAIImage": {
              "detailLevel": "LOW"
            },
            "extractedTypes": [
              "LABEL"
            ]
          }
        }
      ]
    },
    "name": "GPT-4 Vision Workflow"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "OPEN_AI_IMAGE",
          "openAIImage": {
            "detailLevel": "LOW"
          },
          "extractedTypes": [
            "LABEL"
          ]
        }
      }
    ]
  },
  "id": "9058d763-a6be-4c21-8364-67a6ad527f4b",
  "name": "GPT-4 Vision Workflow",
  "state": "ENABLED"
}
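
Any of the mutations on this page can be submitted from code as a standard GraphQL-over-HTTP POST, with the mutation text and its variables in a single JSON body. The Python sketch below only builds the request and does not send it. The endpoint URL and token are placeholders, and the bearer-token Authorization header is an assumption; check the API Endpoints and API Authentication pages for the real values.

```python
import json
import urllib.request
from typing import Dict

# Placeholders: substitute the real endpoint and token for your project.
GRAPHQL_URL = "https://example.invalid/graphql"
TOKEN = "YOUR_TOKEN"

CREATE_WORKFLOW = """
mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
  }
}
"""


def build_request(
    query: str, variables: Dict, url: str = GRAPHQL_URL, token: str = TOKEN
) -> urllib.request.Request:
    """Package a GraphQL operation and its variables as an HTTP POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the API accepts a bearer token in this header.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


variables = {
    "workflow": {
        "name": "GPT-4 Vision Workflow",
        "extraction": {
            "jobs": [
                {
                    "connector": {
                        "type": "OPEN_AI_IMAGE",
                        "openAIImage": {"detailLevel": "LOW"},
                        "extractedTypes": ["LABEL"],
                    }
                }
            ]
        },
    }
}

request = build_request(CREATE_WORKFLOW, variables)
# To send it, call urllib.request.urlopen(request) once a real URL and token are set.
```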

Graphlit also supports the OpenAI GPT-4 Vision model for text extraction, as well as generating descriptions of the content of the image.
