Extraction

Configure entity and content extraction.

One of the core features of Graphlit is the knowledge graph. As content is ingested, text is extracted from documents and web pages, and audio is transcribed; there is hidden value in that text waiting to be unlocked.

By using entity extraction (also known as named entity recognition), Graphlit can identify entities such as people, places, and things, and add relationships called "observations" that link the content and these observed entities.

In addition, with the advent of Large Multimodal Models (LMMs) like OpenAI GPT-4 Vision, Graphlit can read text from images and generate textual descriptions and labels.

Learn more about observations here.

Entity & Content Extraction

LLM-based extraction (i.e. entity extraction) incurs Graphlit credit usage, based on the number of LLM tokens processed. API-based extraction (i.e. text analytics) incurs Graphlit credit usage, based on the number of document pages or transcript segments.

Named Entities: Azure Cognitive Services Text Analytics

By configuring the extraction stage of the workflow, you can use Azure Cognitive Services Text Analytics to observe any entities in text from documents, web pages, or even audio transcripts.

Assign AZURE_COGNITIVE_SERVICES_TEXT to the type parameter in the extraction connector to use Azure Cognitive Services Text Analytics.

You can also assign confidenceThreshold to set a lower bound on observation confidence. If the confidence of an observed entity falls below this threshold, no observation will be created.
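The threshold behavior can be illustrated with a short sketch. This is illustrative only; the candidate entities and field names below are hypothetical examples, not Graphlit API types:

```python
# Illustrative only: how a confidence threshold gates observations.
# The candidate entities below are hypothetical, not Graphlit API output.
candidates = [
    {"name": "Ada Lovelace", "type": "PERSON", "confidence": 0.93},
    {"name": "London", "type": "PLACE", "confidence": 0.62},
]

def filter_observations(entities, confidence_threshold):
    """Keep only entities whose confidence meets the lower bound."""
    return [e for e in entities if e["confidence"] >= confidence_threshold]

observed = filter_observations(candidates, confidence_threshold=0.8)
print([e["name"] for e in observed])  # the low-confidence entity is dropped
```

With a threshold of 0.8, only the first candidate would result in an observation.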

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureText {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "a898708e-db00-45a6-b659-1bc5b7bb4ac3",
  "name": "Extraction Stage",
  "state": "ENABLED"
}
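To run this mutation from code, you can POST it to the Graphlit GraphQL endpoint like any GraphQL API. A minimal sketch using only the Python standard library is shown below; the endpoint URL and bearer token are placeholders, not real values — substitute your project's actual endpoint and credentials:

```python
import json
import urllib.request

# The CreateWorkflow mutation and variables, as shown above.
mutation = """
mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) { id name state }
}
"""

variables = {
    "workflow": {
        "extraction": {
            "jobs": [
                {
                    "connector": {
                        "type": "AZURE_COGNITIVE_SERVICES_TEXT",
                        "azureText": {"confidenceThreshold": 0.8},
                    }
                }
            ]
        },
        "name": "Extraction Stage",
    }
}

# Standard GraphQL request body: the query plus its variables.
payload = json.dumps({"query": mutation, "variables": variables}).encode()

# Placeholder endpoint and token — use your project's real values.
request = urllib.request.Request(
    "https://YOUR-GRAPHLIT-ENDPOINT/graphql",  # placeholder URL
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_JWT",  # placeholder token
    },
)
# response = urllib.request.urlopen(request)  # uncomment with real credentials
```

The same request shape works for every mutation in this section; only the query text and variables change.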

You can optionally specify a list of desired extracted entity types, via the extractedTypes property. If this array is not assigned, all observed entities will be extracted.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          azureText {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "confidenceThreshold": 0.8
            },
            "extractedTypes": [
              "PERSON",
              "PLACE",
              "ORGANIZATION"
            ]
          }
        }
      ]
    },
    "name": "Observed Entities"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "confidenceThreshold": 0.8
          },
          "extractedTypes": [
            "PERSON",
            "PLACE",
            "ORGANIZATION"
          ]
        }
      }
    ]
  },
  "id": "348594f4-ec99-44bb-9caa-96b2c8bc25cd",
  "name": "Observed Entities",
  "state": "ENABLED"
}

Named Entities: LLMs

By configuring the extraction stage of the workflow, you can use LLMs, such as OpenAI GPT-4, to observe any entities in text from documents, web pages, or even audio transcripts.

LLM extraction accepts an optional specification, which selects the LLM model (and an optional API key) to use.

If a specification is not assigned, Graphlit will use the OpenAI GPT-4o 128k model by default.

If you do assign one, it must be an EXTRACTION specification, with the type parameter set accordingly when creating the specification object.

Optional: LLM Extraction Specification

Here is an example of creating an extraction specification using OpenAI GPT-4. Note how type is assigned to EXTRACTION, which differs from the default COMPLETION type used for conversations.

Mutation:

mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}

Variables:

{
  "specification": {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "openAI": {
      "model": "GPT4_TURBO_128K_1106",
      "temperature": 0.1,
      "probability": 0.2
    },
    "name": "LLM Entity Extraction"
  }
}

Response:

{
  "type": "EXTRACTION",
  "serviceType": "OPEN_AI",
  "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07",
  "name": "LLM Entity Extraction",
  "state": "ENABLED"
}

You can assign MODEL_TEXT to the type parameter in the extraction connector to use an LLM for entity extraction. Here we assign the custom GPT-4 specification created above; omit it to use the default.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          modelText {
            specification {
              id
            }
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "MODEL_TEXT",
            "modelText": {
              "specification": {
                "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07"
              }
            }
          }
        }
      ]
    },
    "name": "LLM Entity Extraction"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "MODEL_TEXT",
          "modelText": {
            "specification": {
              "id": "2dd3e23b-5146-40f2-b63e-63fcb2ab5a07"
            }
          }
        }
      }
    ]
  },
  "id": "17c8820e-c100-4109-bba6-590b4dad9ce5",
  "name": "LLM Entity Extraction",
  "state": "ENABLED"
}

PII Categorization: Azure Cognitive Services Text Analytics

When using Azure Cognitive Services Text Analytics, you can optionally set the enablePII property to true to categorize the content with any Personally Identifiable Information (PII) it contains. For example, if a credit card number is recognized in the text, Graphlit will assign the category "Credit Card Number" to the content.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureText {
            enablePII
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_TEXT",
            "azureText": {
              "enablePII": true
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_TEXT",
          "azureText": {
            "enablePII": true
          }
        }
      }
    ]
  },
  "id": "24452cd9-4fcc-42bb-9609-84e85519cbbd",
  "name": "Extraction Stage",
  "state": "ENABLED"
}

Image Labeling, Text Extraction, and Descriptions

Image content can be analyzed using AI models to identify visual objects, as well as labels that apply to the entire image.

For these observations, Graphlit assigns labels to the content; each label contains a bounding box (in pixel coordinates) of where the object was observed.
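As a sketch of working with pixel-coordinate bounding boxes: the (left, top, width, height) field layout below is an assumption for illustration, not the exact shape of Graphlit's bounding-box type:

```python
# Illustrative sketch: working with a pixel-coordinate bounding box.
# The (left, top, width, height) layout is an assumption for illustration.
def box_area(box):
    return box["width"] * box["height"]

def box_fraction_of_image(box, image_width, image_height):
    """Fraction of the image covered by an observed object's bounding box."""
    return box_area(box) / (image_width * image_height)

box = {"left": 100, "top": 50, "width": 200, "height": 100}
fraction = box_fraction_of_image(box, image_width=1000, image_height=500)
print(fraction)  # 0.04 of the image area
```

A fraction like this can be useful for filtering out tiny, incidental detections before building downstream features.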

We can use Azure Cognitive Services Image Analytics to generate labels from images.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          contentTypes
          fileTypes
          extractedTypes
          azureImage {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
            "azureImage": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Extraction Stage"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
          "azureImage": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "249644aa-ee2a-4905-9955-585a4d6540e6",
  "name": "Extraction Stage",
  "state": "ENABLED"
}

In addition to visual object labeling, Azure Cognitive Services Image Analytics can be used for text extraction from images. If any text is visible in the image, it will be extracted into the content text property, and made searchable via semantic search.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          azureImage {
            confidenceThreshold
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
            "azureImage": {
              "confidenceThreshold": 0.8
            }
          }
        }
      ]
    },
    "name": "Image Text Extraction"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "AZURE_COGNITIVE_SERVICES_IMAGE",
          "azureImage": {
            "confidenceThreshold": 0.8
          }
        }
      }
    ]
  },
  "id": "3f860a36-15a5-4bae-bd74-c1b579c0cd4d",
  "name": "Image Text Extraction",
  "state": "ENABLED"
}

Graphlit also supports the OpenAI GPT-4 Vision model for text extraction, as well as for generating descriptions of the image's content.

If any text is visible in the image, it will be extracted into the content text property and made searchable via semantic search. A detailed description of the image will be extracted into the content description property, which is also made searchable via semantic search.

The GPT-4 Vision model will also attempt to generate labels, which are assigned as observations on the analyzed content.

Mutation:

mutation CreateWorkflow($workflow: WorkflowInput!) {
  createWorkflow(workflow: $workflow) {
    id
    name
    state
    extraction {
      jobs {
        connector {
          type
          extractedTypes
          openAIImage {
            detailLevel
          }
        }
      }
    }
  }
}

Variables:

{
  "workflow": {
    "extraction": {
      "jobs": [
        {
          "connector": {
            "type": "OPEN_AI_IMAGE",
            "openAIImage": {
              "detailLevel": "LOW"
            },
            "extractedTypes": [
              "LABEL"
            ]
          }
        }
      ]
    },
    "name": "GPT-4 Vision Workflow"
  }
}

Response:

{
  "extraction": {
    "jobs": [
      {
        "connector": {
          "type": "OPEN_AI_IMAGE",
          "openAIImage": {
            "detailLevel": "LOW"
          },
          "extractedTypes": [
            "LABEL"
          ]
        }
      }
    ]
  },
  "id": "9058d763-a6be-4c21-8364-67a6ad527f4b",
  "name": "GPT-4 Vision Workflow",
  "state": "ENABLED"
}
